Wednesday 8 April 2015

"mpls ip ttl-expiration pop 1"

I encountered an odd problem when tracerouting from outside the University network to a host in the data centre network in my GNS3 simulation:

HOST-JANETW#traceroute 131.111.8.31

Type escape sequence to abort.
Tracing the route to RAVEN.CAM (131.111.8.31)

  1 GW-JANETW.JANETW.JA (128.4.0.254) 76 msec 76 msec 72 msec
  2 JANETW.ENETW.NET (146.97.41.246) 196 msec 96 msec 112 msec
  3  *  *  * 
  4  *  *  * 
  5 DCW-CUDN.DCW-SRV.NET (193.60.88.2) 176 msec 180 msec 144 msec
  6 RAVEN.CAM (131.111.8.31) 188 msec 180 msec 184 msec

[For reference, these hostnames are not the same as the real ones, but the naming system works by using OTHER-ROUTER.THIS-ROUTER.DOMAIN - so JANETW.ENETW.NET is the IP address of ENETW on the link to/from JANETW.]

Hops 3 and 4 went missing, which are those to MILL from ENETW and then to DCW[-CUDN] from MILL; once we reach DCW-SRV from DCW[-CUDN] we get responses again.  (The two DCW names are because they're different VDCs on the same Nexus 7010.)  This is shown below, with the missing hops at the tips of the red arrows:


This puzzled me for about an hour and a half - I was checking routes and trying pings from the affected hosts back to the source, which were working OK.  I then started poking about in more detail to check what exactly happened to the packets...

Plain-IPv4-over-MPLS when you don't expect it

The first thing was to see what happened to the packet on ENETW by consulting CEF to see what it would do with a packet destined for 131.111.8.31:

BDR-ENETW#show ip cef 131.111.8.31 detail 
131.111.8.0/23, epoch 0
  recursive via 193.60.88.2
    nexthop 192.84.5.133 Ethernet1/1 label 33

This shows the packet was matching the 131.111.8.0/23 route, which should be sent via 193.60.88.2 (the interface on DCW-SRV coming from DCW[-CUDN]), which is in turn reached recursively by using a next hop of 192.84.5.133 (the interface on MILL at hop 3, coming in from ENETW).  The important bit to note is that the packet will be forwarded out of interface Ethernet1/1 with MPLS label 33.
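The recursive lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how CEF is implemented: the two-entry table is copied from the outputs in this post, and the names ROUTES, longest_match and resolve are made up for the example.

```python
import ipaddress

# Toy routing table based on the router output in this post. A route with no
# interface is recursive: its next hop must itself be looked up.
ROUTES = {
    "131.111.8.0/23": {"next_hop": "193.60.88.2", "interface": None, "label": None},
    "193.60.88.2/32": {"next_hop": "192.84.5.133", "interface": "Ethernet1/1", "label": 33},
}


def longest_match(address):
    """Return (network, route) for the most specific matching prefix, or None."""
    ip = ipaddress.ip_address(address)
    best = None
    for prefix, route in ROUTES.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, route)
    return best


def resolve(address, max_depth=8):
    """Follow recursive next hops until a route with an outgoing interface is found."""
    chain = []
    for _ in range(max_depth):
        match = longest_match(address)
        if match is None:
            return None
        net, route = match
        chain.append(str(net))
        if route["interface"] is not None:
            return {"chain": chain, **route}
        address = route["next_hop"]
    return None


result = resolve("131.111.8.31")
print(result)
```

The packet for 131.111.8.31 ends up with the interface, next hop and label that belong to the 193.60.88.2/32 route, which is why a label appears on a destination we do not advertise labels for.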

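Dissecting this (as an aside - just to check there wasn't anything completely screwy) - 131.111.8.0/23 is a BGP route originated by AS65110 (the server network) with a next hop of 193.60.88.2, the eBGP peer's address; ENETW itself learns it over iBGP, hence the distance of 200 and "type internal":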

BDR-ENETW#show ip route 131.111.8.31
Routing entry for 131.111.8.0/23
  Known via "bgp 64602", distance 200, metric 0
  Tag 65110, type internal
  Last update from 193.60.88.2 01:41:18 ago
  Routing Descriptor Blocks:
  * 193.60.88.2, from 192.84.5.240, 01:41:18 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65110

193.60.88.2 is, in turn, reached via OSPF with next hop 192.84.5.133, as CEF showed:

BDR-ENETW#show ip route 193.60.88.2
Routing entry for 193.60.88.2/32
  Known via "ospf 1", distance 110, metric 55, type extern 1
  Last update from 192.84.5.133 on Ethernet1/1, 01:45:33 ago
  Routing Descriptor Blocks:
  * 192.84.5.133, from 192.84.5.248, 01:45:33 ago, via Ethernet1/1
      Route metric is 55, traffic share count is 1

So, back to the label - let's see what that's for:

BDR-ENETW#show mpls ip binding remote-label 33 neighbor 192.84.5.133
  193.60.88.2/32
        out label:    33        lsr: 192.84.5.250:0   inuse

Here, I didn't expect MPLS to be in use as the destination address of the packet isn't one that we advertise labels for via LDP: we limit those to the backbone router loopback addresses and the outside addresses of eBGP peers to which we have inter-AS MPLS VPNs.

However, as our server network operates as a separate AS from the backbone and we use MPLS forwarding between it and the backbone, we do advertise labels for its outside addresses - including 193.60.88.2/32: ENETW has labelled the packet for forwarding across the backbone to DCW-SRV because it has a BGP next hop with a label available.

Next step is to work out what happens to this MPLS frame...

Traceroute over an MPLS network

So, assuming that ENETW does the correct thing and sends the packet on via MPLS to MILL, let's see what that does by debugging ICMP and trying the traceroute again:

CORE-MILL#debug ip icmp
ICMP packet debugging is on
CORE-MILL#
Apr  8 19:53:01.591: Adding 4 bytes of label stack
Apr  8 19:53:01.595: MPLS: ICMP: time exceeded (time to live) sent to 128.4.0.1 (dest was 131.111.8.31)

This shows the MPLS frame is received and the TTL expired (by default, the PE router which turns the packet from a plain IPv4 packet into a labelled MPLS frame will copy the IP TTL [which traceroute increases with each probe] into the label).  The P router (MILL) unwraps the packet inside and generates the 'TTL exceeded' ICMP message.

The critical bit, however, is that the P router will then label the ICMP packet up with the original MPLS label and forward it on to the destination of the original frame (with a TTL reset to 255); it does NOT return it directly to the sender.  It has to do this as the IP addresses inside the MPLS frame may not be ones which exist in the same routing space as the P router.

I verified all this by running Wireshark on the virtual interfaces inside GNS3 (incredibly useful that!) and checking it all looked the way I expected.

This I knew and understood and had investigated in the past (although not in an inter-AS situation) - what I hadn't thought of was what happens next...

The egress PE router

When the egress PE router receives the labelled ICMP 'TTL exceeded' packet, it unwraps it and tries to forward it on to the destination.  In this case, the PE router is DCW-SRV (in the server network) trying to return a packet with a source address of 192.84.5.133 to 128.4.0.1 (HOST-JANETW; an imaginary host, the one my traceroute was initiated from).  This is where things fail...

Because of the recent IP unicast RPF verification testing I've been doing, that source address is blocked from entering the backbone from the server network by the distribution routers to prevent address spoofing: the server network is not allowed to originate addresses it is not responsible for routing.

This can be confirmed by looking on the distribution router, DCW[-CUDN], part of the backbone network:

DIST-DCW#show ip int e1/2 | begin verify
  IP verify source reachable-via RX, ACL 10
   108 verification drops
   0 suppressed verification drops
   0 verification drop-rate

Some debugging shows the naughty source address:

DIST-DCW#debug ip cef drops rpf         
IP CEF drops for RPF debugging is on
DIST-DCW#
*Apr  8 20:08:38.171: CEF-Drop: Packet from 192.84.5.133 via Ethernet1/2 -- via-rx
DIST-DCW#
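The check that dropped the packet can be sketched as follows. This is a minimal model of strict ("reachable-via rx") unicast RPF only: the FIB contents, the 192.84.5.0/24 prefix length and the backbone-facing interface name Ethernet1/1 are assumptions made for the example, and the exception list in ACL 10 is not modelled.

```python
import ipaddress

# Hypothetical FIB for DIST-DCW: prefix -> interface the route points out of.
FIB = {
    "131.111.8.0/23": "Ethernet1/2",  # server network, reached via DCW-SRV
    "193.60.88.2/32": "Ethernet1/2",  # outside address of DCW-SRV
    "192.84.5.0/24": "Ethernet1/1",   # backbone addresses (prefix and interface assumed)
}


def strict_urpf_permits(source, arrival_interface):
    """Permit only if the best route back to the source uses the arrival interface."""
    ip = ipaddress.ip_address(source)
    matches = []
    for prefix, interface in FIB.items():
        net = ipaddress.ip_network(prefix)
        if ip in net:
            matches.append((net.prefixlen, interface))
    if not matches:
        return False  # no route back to the source at all
    _, best_interface = max(matches)
    return best_interface == arrival_interface


# The ICMP message from MILL arrives from the server network side:
print(strict_urpf_permits("192.84.5.133", "Ethernet1/2"))
# Ordinary traffic from a server arrives on the same interface:
print(strict_urpf_permits("131.111.8.31", "Ethernet1/2"))
```

The backbone source address fails because the route back to it points into the backbone, not out of the interface the packet arrived on, whereas traffic genuinely sourced from the server network passes.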

In the words of Colonel Hans Landa: that's a bingo!  The whole thing is illustrated below, with the rejected ICMP message from the DCW-SRV PE router being marked with an X in red:

[Figure: MPLS Traceroute showing ICMP message from PE router being rejected due to RPF]

It also explains why I'd never seen this problem with the Computer Laboratory in my simulation before, as I'd never bothered blocking traffic from other ASs to prevent address spoofing: I'd only ever tested filtering prefixes over eBGP; traffic filtering is something we've always done on the real network through access lists and not of interest in the simulation.

Solving the [non] problem

Once I'd worked out what the actual problem was, solving it turned out to be trivial...

In reality, this problem turns out to be academic: on our real network, we run with "no mpls ip propagate-ttl forwarded" configured on our PE routers, which prevents the TTL of the incoming packet from being copied into the MPLS frame.  This means that the TTL should never expire when crossing the backbone and those hops simply won't show up in a traceroute (so successful hops 5 and 6 will become hops 3 and 4).
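To see why the hop numbers shift, here is a rough Python model of where a traceroute probe's TTL expires with and without TTL propagation. It is a simplification under stated assumptions: the roles assigned to each router (in particular that DCW is where the label is removed) are my reading of this path, the function and role names are invented for the example, and it only models where the TTL expires, not whether the ICMP reply makes it back.

```python
# (router, role) pairs along the path used in this post. Roles are assumptions:
# "ingress" imposes the label, "lsr" swaps it, "pop" removes it.
PATH = [
    ("GW-JANETW", "ip"),
    ("ENETW", "ingress"),
    ("MILL", "lsr"),
    ("DCW", "pop"),
    ("DCW-SRV", "ip"),
    ("RAVEN", "destination"),
]


def traceroute(path, propagate_ttl=True, max_ttl=30):
    """Return the node at which each probe (TTL 1, 2, 3, ...) stops."""
    answers = []
    for probe_ttl in range(1, max_ttl + 1):
        ip_ttl = probe_ttl
        label_ttl = None  # None means the packet is currently unlabelled
        responder = None
        for name, role in path:
            if role == "destination":
                responder = name
                break
            if label_ttl is None:
                ip_ttl -= 1  # ordinary IP forwarding
                if ip_ttl == 0:
                    responder = name
                    break
                if role == "ingress":
                    # Copy the IP TTL into the label, or hide the LSP with 255.
                    label_ttl = ip_ttl if propagate_ttl else 255
            else:
                label_ttl -= 1  # label switching only touches the label TTL
                if label_ttl == 0:
                    responder = name
                    break
                if role == "pop":
                    if propagate_ttl:
                        ip_ttl = label_ttl  # copy the TTL back on label removal
                    label_ttl = None
        answers.append(responder)
        if responder == path[-1][0]:
            break
    return answers


print(traceroute(PATH, propagate_ttl=True))
print(traceroute(PATH, propagate_ttl=False))
```

With propagation all six hops are candidates to answer; without it, MILL and DCW never see an expired TTL, so DCW-SRV and RAVEN move up to hops 3 and 4.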

However, in my GNS3 simulation, I like to expose the intermediate hops to verify complete paths from end station hosts for diagnostic and investigatory purposes.  Indeed, the reason I spotted the problem was whilst testing the paths for traffic when various failures occur, by shutting down parts of the simulation.

The solution to this is to use the "mpls ip ttl-expiration pop 1" command - when the TTL expires on an MPLS frame with a label stack only one deep (which will be a plain IPv4-over-MPLS frame), this causes the P router to return the ICMP message directly to the sender using its own routing table, rather than forwarding it on to the PE router to be bounced back.  As such, there is no problem with address spoofing as the ICMP packet starts its life inside the backbone itself and never has to enter it from the server network.
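The decision the P router makes can be summarised in a small Python sketch. It is my own paraphrase of the behaviour described here, not vendor code: the function name and the two return strings are invented, and pop_threshold stands for the number given to "mpls ip ttl-expiration pop".

```python
def icmp_return_path(label_depth, pop_threshold=None):
    """Say how a P router sends the ICMP 'TTL exceeded' for an expired MPLS frame.

    label_depth   -- number of labels on the frame whose TTL expired
    pop_threshold -- the N from "mpls ip ttl-expiration pop N", or None if
                     the command is not configured (the default behaviour)
    """
    if pop_threshold is not None and label_depth <= pop_threshold:
        # Few enough labels: treat it as plain IP and route the ICMP
        # message straight back using the router's own routing table.
        return "route-direct"
    # Otherwise re-impose the original label stack and send the ICMP message
    # on to the end of the LSP, leaving the egress router to return it.
    return "forward-along-lsp"


print(icmp_return_path(1))                   # default: single label, via the egress
print(icmp_return_path(1, pop_threshold=1))  # with "pop 1": returned directly
print(icmp_return_path(2, pop_threshold=1))  # deeper stack (e.g. VPN): via the egress
```

Frames with two or more labels keep the old behaviour, which is what you want for MPLS VPN traffic whose addresses may not exist in the P router's routing table.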

After doing the latter, my traceroute works:

HOST-JANETW#traceroute 131.111.8.31

Type escape sequence to abort.
Tracing the route to RAVEN.CAM (131.111.8.31)

  1 GW-JANETW.JANETW.JA (128.4.0.254) 16 msec 16 msec 8 msec
  2 JANETW.ENETW.NET (146.97.41.246) 20 msec 52 msec 20 msec
  3 ENETW.MILL.NET (192.84.5.133) [MPLS: Label 33 Exp 0] 68 msec 52 msec 24 msec
  4 MILL.DCW.NET (192.84.5.154) [MPLS: Label 16 Exp 0] 48 msec 68 msec 48 msec
  5 DCW-CUDN.DCW-SRV.NET (193.60.88.2) 80 msec 88 msec 68 msec
  6 RAVEN.CAM (131.111.8.31) 104 msec 56 msec 112 msec

You can even see the MPLS labels used in the previously-missing hops being returned in the ICMP extension block.

One thing that this has shown up is that, once the new data centre network is installed, traceroutes into our server network will not expose the backbone network router hops.  That wasn't something I was expecting and could be confusing for others.