Monday, 20 April 2015

DHCP, RPF verify, FHRP and ECMP - when protocols collide

I've been gradually enabling ECMP (Equal-Cost MultiPath) routing on parts of the University network as part of a prelude to the new data centre network and to improve performance generally (by making use of the dual downstream interfaces into institutions).

For the most part, this is fairly straightforward and just worked - I'll explain what was needed in each case in a future post (as MPLS VPN works slightly differently and needs some special consideration), but I did get a nasty problem which broke DHCP relaying in some places that I'll cover here.

How DHCP relaying works


First, let's review how DHCP relaying works (what you get when you do ip helper-address ... on a Cisco router interface towards clients).  Consider the following network:
When a client on the subnet 192.168.1.0/24 wants to DHCP, the following happens initially:
  1. The client sends out a DHCP DISCOVER message from 0.0.0.0 (as it doesn't know its IP address or even subnet, yet) on UDP port 68 (the DHCP client port) to all-hosts broadcast (255.255.255.255) port 67 (DHCP server).
  2. The DHCP relay agent (which is normally the router) will receive this broadcast and forward it as a unicast packet to the DHCP server listed in the ip helper-address ... interface command.  This will be from the router's interface address (this is the critical bit) - 192.168.1.253, in the above example, port 67 to the DHCP server, port 67.
  3. The DHCP server will receive the DISCOVER and, assuming it has an address (and other information) to give to the client, it will send a DHCP OFFER message back to the router to be relayed on to the client.  This will be the reverse of the packet just received: going from the DHCP server's address, port 67 to the router's interface address, port 67.
  4. The router will receive the reply and unicast it back to the client, sending it from its interface IP address, port 67 to the client's prospective IP address, port 68.
This all works fine and, assuming the client wants to take the address, it will send a DHCP REQUEST (using the same process) and receive a DHCP ACK back from the server so it can begin using the address.
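
As a rough illustration of the four steps above, here is a small Python sketch that just lists the addresses and ports of each packet in a relayed DISCOVER/OFFER exchange.  The server address (10.0.0.1) and the offered client address (192.168.1.10) are made up for the example; only 192.168.1.253 comes from the network described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    message: str


RELAY = "192.168.1.253"    # router interface address on the client subnet
SERVER = "10.0.0.1"        # hypothetical DHCP server address
OFFERED = "192.168.1.10"   # hypothetical address offered to the client


def relay_exchange():
    """Return the packets of a relayed DISCOVER/OFFER exchange, in order."""
    return [
        # 1. client broadcast: it has no address yet
        Packet("0.0.0.0", 68, "255.255.255.255", 67, "DISCOVER"),
        # 2. relay agent unicasts it on, sourced from its interface address
        Packet(RELAY, 67, SERVER, 67, "DISCOVER"),
        # 3. server replies to the relay's interface address
        Packet(SERVER, 67, RELAY, 67, "OFFER"),
        # 4. relay passes the reply on to the client
        Packet(RELAY, 67, OFFERED, 68, "OFFER"),
    ]


for number, pkt in enumerate(relay_exchange(), start=1):
    print(f"{number}. {pkt.message}: {pkt.src_ip}:{pkt.src_port} -> "
          f"{pkt.dst_ip}:{pkt.dst_port}")
```

The packet to watch is number 3: its destination is the relaying router's own interface address, which is what causes the trouble described next.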

How DHCP relaying doesn't work


Now consider what happens when the same DHCP DISCOVER is relayed by the other client subnet router, without ECMP:
Here, something goes wrong at step 3, trying to return the DHCP OFFER to the relaying router:
  1. Because the backbone network (depicted with the cloud symbol) only knows about routing to the client subnet as a whole /24 (not the individual routers' addresses), it routes the packet via 192.168.1.253.
  2. 192.168.1.253 treats 192.168.1.252 as just another host on the client subnet and forwards it out of its interface onto that subnet.
  3. Because 192.168.1.252 has anti-spoofing blocks in place on the client subnet interface, it rejects this packet as the source address is that of the DHCP server: an invalid address from the client subnet.
This configuration is typical for a network where a FHRP (First Hop Redundancy Protocol) such as HSRP or VRRP is in use.  With it, half of the DHCP replies (the ones relayed via one of the two routers) will be lost when they're returned to the relaying routers.

However, this in itself isn't particularly a problem, in that both routers will relay the same packet to the DHCP server, resulting in it receiving two copies of each DISCOVER [but with different relay agent / forwarder addresses], causing two OFFERs to be returned; the clients will not miss out as they'll get one of the two copies.  This is why I've never noticed this problem, even though it's been going on for years: the clients have still got their address and worked.

(Actually, I was sort-of aware of this problem, as it prevented pinging one of the routers' own addresses on a particular interface, if the source of the ping was elsewhere on the network.  However, that's just been a minor inconvenience and not service-affecting; I never realised that would also be affecting DHCP.)

Combining with ECMP


When this situation is combined with ECMP this can get messy: the returned DHCP OFFERs (and ACKs) might be returned to either of the two client subnet routers.  The routers' addresses are often 1 number offset (e.g. 192.168.1.252 vs .253) which will likely mean they each take a different path.

If the path for packets to the .253 relay address happens to go directly to the .253 router, all is fine.  Same with .252.

However, if you're really unlucky (and, of course, we were, in some situations), ECMP will return the .253 packet via the .252 router and the .252 packet via .253.  This results in both replies being rejected and the client getting neither of the responses.
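
To make the combinations concrete, here is a small Python sketch that enumerates which router each reply could be delivered via and which replies survive.  It is a toy model built on one assumption taken from the description above: a reply survives only if it arrives directly at the router that relayed it, because otherwise it has to cross the client subnet and is rejected by the anti-spoofing block.

```python
from itertools import product

ROUTERS = (".252", ".253")


def delivered_replies(path_for):
    """Return the relay addresses whose replies survive.

    path_for maps each relay address to the router the backbone delivers
    that reply via.  A reply survives only when the two are the same.
    """
    return [relay for relay, via in path_for.items() if relay == via]


def all_outcomes():
    """Enumerate every combination of delivery router for the two replies."""
    outcomes = {}
    for via_252, via_253 in product(ROUTERS, repeat=2):
        path_for = {".252": via_252, ".253": via_253}
        outcomes[(via_252, via_253)] = delivered_replies(path_for)
    return outcomes


for (via_252, via_253), survivors in all_outcomes().items():
    print(f"reply to .252 via {via_252}, reply to .253 via {via_253}: "
          f"surviving replies = {survivors or 'none'}")
```

Three of the four combinations leave the client with at least one reply; only the crossed-over case leaves it with none.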

Fixing the problem and creating another


I couldn't find any way to direct the replies back to the correct router (e.g. by advertising the router's interface IP address into OSPF as a /32), so dealing with them being rejected by the anti-spoofing protection seemed the only solution.

As I've written, I've been looking at the ip verify unicast source ... command recently, and it seemed a good opportunity to employ that, rather than modify lots of access control lists.  According to Cisco's documentation, that command has a special feature to handle DHCP:
"Unicast RPF will allow packets with 0.0.0.0 source and 255.255.255.255 destination to pass so that Bootstrap Protocol (BOOTP) and Dynamic Host Configuration Protocol (DHCP) functions work properly."
— from the IOS Security Configuration Guide for IOS 12.2SX
Sounds good, except that doesn't handle the source addresses of relayed DHCP replies.  It would be nice if this included "packets with a source address of the interface ip helper-address and port 67, destined for the router's interface address port 67", but it doesn't.

However, the command has a feature to allow packets matching an access list to be accepted, even if they fail the RPF check.  It's configured as follows:

ip access-list extended 1301
 permit udp host DHCP-SERVER eq 67 192.168.1.0 0.0.0.255 eq 67
!
interface ...
 ip address 192.168.1.253 255.255.255.0
 standby ip 192.168.1.254
 ip verify unicast source reachable-via rx 1301

This will allow the initial DHCP DISCOVER in (as described in the Cisco documentation), regular 192.168.1.0/24 traffic (due to the RPF check) AND traffic from the DHCP server on port 67 to an address on the same subnet port 67 (which is less tedious than putting the interface IP address itself as it can be copied to the other router without modification).  This change can be combined with a simplification of the interface access lists (if used).  So, I implemented a few of these and all looked hunky dory.

The IP Input process - my old nemesis


However, a little while later, we started getting alarms for CPU usage on the Catalyst 6500-E routers.  A show process cpu sorted command showed high load caused by the IP Input process.

This is usually caused by excessive traffic being forwarded ("punted" in Cisco parlance) to the Route Processor (RP) for handling.  We can capture and display these with the following commands:

router# debug netdr capture rx
router# show netdr captured-packets

(Use debug netdr clear-capture to clear the buffer and our old friend undebug all to switch it off.)

The packets being punted all appeared to be regular data - nothing complicated like DHCP which needs special processing, so I started doing some more reading and found a document on Cisco's website explaining how this configuration is handled on a 6500:
"For unicast RPF check without ACL filtering, the PFC3 provides hardware support for the RPF check of traffic from multiple interfaces. 
For unicast RPF check with ACL filtering, the PFC determines whether or not traffic matches the ACL. The PFC sends the traffic denied by the RPF ACL to the route processor (RP) for the unicast RPF check. Packets permitted by the ACL are forwarded in hardware without a unicast RPF check." 
— from the IOS Network Security guide for IOS 12.2SX on Catalyst 6500 with PFC3
So it appears that, when you use an ACL, all the traffic not matching the ACL will get punted to the RP.  Excellent.
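
The following Python sketch is only my paraphrase of the quoted documentation, not a model of the real hardware: it matches packets against the single entry in ACL 1301 and reports how the PFC would treat them.  The DHCP server address (10.0.0.1) is invented for the example.

```python
import ipaddress

SUBNET = ipaddress.ip_network("192.168.1.0/24")
DHCP_SERVER = ipaddress.ip_address("10.0.0.1")  # hypothetical server address


def matches_acl_1301(proto, src, sport, dst, dport):
    """permit udp host DHCP-SERVER eq 67 192.168.1.0 0.0.0.255 eq 67"""
    return (
        proto == "udp"
        and ipaddress.ip_address(src) == DHCP_SERVER
        and sport == 67
        and ipaddress.ip_address(dst) in SUBNET
        and dport == 67
    )


def pfc3_rpf_handling(acl_permits):
    """Paraphrase of the quoted behaviour for uRPF with an ACL configured."""
    if acl_permits:
        return "forwarded in hardware, no RPF check"
    return "punted to RP for RPF check"


# A relayed DHCP reply matches the ACL; ordinary client traffic does not.
dhcp_reply = ("udp", "10.0.0.1", 67, "192.168.1.252", 67)
web_traffic = ("tcp", "192.168.1.10", 50000, "131.111.8.31", 443)

for name, pkt in (("DHCP reply", dhcp_reply), ("web traffic", web_traffic)):
    print(f"{name}: {pfc3_rpf_handling(matches_acl_1301(*pkt))}")
```

Since the ACL only permits the relayed DHCP replies, practically everything else arriving on the interface falls into the punted case.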

Fixing the problem for good


So, I backtracked on using the ip verify unicast ... command and reverted to using our old inbound access lists to protect against address spoofing.  These now have an extra entry and look as follows:

ip access-list extended in-subnet
 permit ip 192.168.1.0 0.0.0.255 any
 permit udp any eq bootpc host 255.255.255.255 eq bootps
 permit udp host DHCP-SERVER eq bootps 192.168.1.0 0.0.0.255 eq bootps
 deny ip any any

This appears to do the trick and doesn't involve the RP on the router going bananas.  Given this problem, I think I'll abandon using the ip verify unicast ... command!
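
For anyone wanting to check the logic of the access list before applying it, here is a minimal Python sketch that evaluates packets against the same four entries in order.  The DHCP server address (10.0.0.1) is a placeholder and the function only looks at the fields this list uses.

```python
import ipaddress

SUBNET = ipaddress.ip_network("192.168.1.0/24")
DHCP_SERVER = ipaddress.ip_address("10.0.0.1")  # hypothetical server address
BROADCAST = ipaddress.ip_address("255.255.255.255")
BOOTPS, BOOTPC = 67, 68


def in_subnet_permits(proto, src, dst, sport=None, dport=None):
    """Evaluate a packet against the in-subnet access list, first match wins."""
    src = ipaddress.ip_address(src)
    dst = ipaddress.ip_address(dst)
    # permit ip 192.168.1.0 0.0.0.255 any
    if src in SUBNET:
        return True
    # permit udp any eq bootpc host 255.255.255.255 eq bootps
    if proto == "udp" and sport == BOOTPC and dst == BROADCAST and dport == BOOTPS:
        return True
    # permit udp host DHCP-SERVER eq bootps 192.168.1.0 0.0.0.255 eq bootps
    if (proto == "udp" and src == DHCP_SERVER and sport == BOOTPS
            and dst in SUBNET and dport == BOOTPS):
        return True
    # deny ip any any
    return False


print(in_subnet_permits("udp", "0.0.0.0", "255.255.255.255", 68, 67))  # DISCOVER
print(in_subnet_permits("udp", "10.0.0.1", "192.168.1.252", 67, 67))   # relayed reply
print(in_subnet_permits("tcp", "203.0.113.9", "131.111.8.31", 40000, 80))  # spoofed
```

The second call is the case that used to fail: a reply from the DHCP server crossing the client subnet to reach the other router.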

(Update 2018-02-01 — we have since installed Catalyst 6807-XLs with Supervisor 6Ts and came up with a final solution.  I've described that on a separate blog post.)

Wednesday, 8 April 2015

"mpls ip ttl-expiration pop 1"

I encountered an odd problem when tracerouting from outside the University network to a host in the data centre network in my GNS3 simulation:

HOST-JANETW#traceroute 131.111.8.31

Type escape sequence to abort.
Tracing the route to RAVEN.CAM (131.111.8.31)

  1 GW-JANETW.JANETW.JA (128.4.0.254) 76 msec 76 msec 72 msec
  2 JANETW.ENETW.NET (146.97.41.246) 196 msec 96 msec 112 msec
  3  *  *  * 
  4  *  *  * 
  5 DCW-CUDN.DCW-SRV.NET (193.60.88.2) 176 msec 180 msec 144 msec

  6 RAVEN.CAM (131.111.8.31) 188 msec 180 msec 184 msec

[For reference, these hostnames are not the same as the real ones, but the naming system works by using OTHER-ROUTER.THIS-ROUTER.DOMAIN - so JANETW.ENETW.NET is the IP address of ENETW on the link to/from JANETW.]

Hops 3 and 4 went missing, which are those to MILL from ENETW and then to DCW[-CUDN] from MILL; once we reach DCW-SRV from DCW[-CUDN] we get responses again.  (The two DCW names are because they're different VDCs on the same Nexus 7010.)  This is shown below, with the missing hops at the tips of the red arrows:


This puzzled me for about an hour and a half - I was checking routes and trying pings from the affected hosts back to the source, which were working OK.  I then started poking about in more detail to check what exactly happened to the packets...

Plain-IPv4-over-MPLS when you don't expect it

The first thing was to see what happened to the packet on ENETW by consulting CEF to see what it would do with a packet destined for 131.111.8.31:

BDR-ENETW#show ip cef 131.111.8.31 detail 
131.111.8.0/23, epoch 0
  recursive via 193.60.88.2
    nexthop 192.84.5.133 Ethernet1/1 label 33

This shows the packet was matching the 131.111.8.0/23 route which should be sent via 193.60.88.2 (the interface on DCW-SRV coming from DCW[-CUDN]), which is reached recursively by using a next hop of 192.84.5.133 (the interface on MILL at hop 3, coming in from ENETW).  The important bit is to note that the packet will be forwarded out of interface Ethernet1/1 with MPLS label 33.

Dissecting this (as an aside - just to check there wasn't anything completely screwy) - 131.111.8.0/23 is an eBGP route to AS65110 (the server network) learnt from 193.60.88.2:

BDR-ENETW#show ip route 131.111.8.31
Routing entry for 131.111.8.0/23
  Known via "bgp 64602", distance 200, metric 0
  Tag 65110, type internal
  Last update from 193.60.88.2 01:41:18 ago
  Routing Descriptor Blocks:
  * 193.60.88.2, from 192.84.5.240, 01:41:18 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65110

193.60.88.2 is, in turn, reached via OSPF with next hop 192.84.5.133, as CEF showed:

BDR-ENETW#show ip route 193.60.88.2
Routing entry for 193.60.88.2/32
  Known via "ospf 1", distance 110, metric 55, type extern 1
  Last update from 192.84.5.133 on Ethernet1/1, 01:45:33 ago
  Routing Descriptor Blocks:
  * 192.84.5.133, from 192.84.5.248, 01:45:33 ago, via Ethernet1/1
      Route metric is 55, traffic share count is 1

So, back to the label - let's see what that's for:

BDR-ENETW#show mpls ip binding remote-label 33 neighbor 192.84.5.133
  193.60.88.2/32
        out label:    33        lsr: 192.84.5.250:0   inuse

Here, I didn't expect MPLS to be in use as the destination address of the packet isn't one that we advertise labels for via LDP: we limit those to the backbone router loopback addresses and the outside addresses of eBGP peers to which we have inter-AS MPLS VPNs.

However, as our server network operates as a separate AS from the backbone and we use MPLS forwarding between it and the backbone, we do advertise labels for its outside addresses - including 193.60.88.2/32: ENETW has labelled the packet for forwarding across the backbone to DCW-SRV because it has a BGP next hop with a label available.

Next step is to work out what happens to this MPLS frame...

Traceroute over an MPLS network

So, assuming that ENETW does the correct thing and sends the packet on via MPLS to MILL, let's see what that does by debugging ICMP and trying the traceroute again:

CORE-MILL#debug ip icmp
ICMP packet debugging is on
CORE-MILL#
Apr  8 19:53:01.591: Adding 4 bytes of label stack
Apr  8 19:53:01.595: MPLS: ICMP: time exceeded (time to live) sent to 128.4.0.1 (dest was 131.111.8.31)

This shows the MPLS frame is received and the TTL expired (by default, the PE router which turns the packet from a plain IPv4 packet into a labelled MPLS frame will copy the increasing TTL [as generated by traceroute]).  The P router (MILL) unwraps the packet inside and generates the 'TTL exceeded' ICMP message.

The critical bit, however, is that the P router will then label the packet up with its original MPLS label and forward it on to the destination (with a TTL reset to 255); it does NOT return it directly to the sender.  It has to do this as the IP addresses in the packet inside the MPLS frame may not be ones which exist in the same routing space as the P router.

I verified all this by running Wireshark on the virtual interfaces inside GNS3 (incredibly useful that!) and checking it all looked the way I expected.

This I knew and understood and had investigated in the past (although not in an inter-AS situation) - what I hadn't thought of was what happens next...

The egress PE router

When the egress PE router receives the labelled ICMP 'TTL exceeded' packet, it unwraps it and tries to forward it on to the destination.  In this case, the PE router is DCW-SRV (in the server network) trying to return a packet with a source address of 192.84.5.133 to 128.4.0.1 (HOST-JANETW; an imaginary host, the one my traceroute was initiated from).  This is where things fail...

Because of the recent IP unicast RPF verification testing I've been doing, that source address is blocked from entering the backbone from the server network by the distribution routers to prevent address spoofing: the server network is not allowed to originate addresses it is not responsible for routing.

This can be confirmed by looking on the distribution router, DCW[-CUDN], part of the backbone network:

DIST-DCW#show ip int e1/2 | begin verify
  IP verify source reachable-via RX, ACL 10
   108 verification drops
   0 suppressed verification drops
   0 verification drop-rate

Some debugging shows the naughty source address:

DIST-DCW#debug ip cef drops rpf         
IP CEF drops for RPF debugging is on
DIST-DCW#
*Apr  8 20:08:38.171: CEF-Drop: Packet from 192.84.5.133 via Ethernet1/2 -- via-rx
DIST-DCW#

In the words of Colonel Hans Landa: that's a bingo!  The whole thing is illustrated below, with the rejected ICMP message from the DCW-SRV PE router being marked with an X in red:
MPLS Traceroute showing ICMP message from PE router being rejected due to RPF
It also explains why I'd never seen this problem with the Computer Laboratory in my simulation before, as I'd never bothered blocking traffic from other ASs to prevent address spoofing: I'd only ever tested filtering prefixes over eBGP; traffic filtering is something we've always done on the real network through access lists and not of interest in the simulation.

Solving the [non] problem

Once I'd worked out what the actual problem was, solving it turned out to be trivial...

In reality, this problem turns out to be academic: on our real network, we run with "no mpls ip propagate-ttl forwarded" configured on our PE routers, which prevents the TTL of the incoming packet from being copied into the MPLS frame.  This means that the TTL should never expire when crossing the backbone and those hops simply won't show up in a traceroute (so successful hops 5 and 6 will become hops 3 and 4).
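
As a quick illustration of the renumbering, here is a toy Python sketch of which hops a traceroute would list with and without TTL propagation.  It assumes the path from the traceroute in this post and simply hides any router that receives the probe as a labelled MPLS frame; it does not model real TTL handling.

```python
# (router, receives the probe as a labelled MPLS frame?)
PATH = [
    ("GW-JANETW", False),
    ("ENETW", False),    # ingress PE: receives plain IPv4, applies the label
    ("MILL", True),
    ("DCW", True),
    ("DCW-SRV", False),
    ("RAVEN", False),
]


def visible_hops(path, propagate_ttl):
    """Return the routers that would appear as hops in a traceroute."""
    if propagate_ttl:
        return [name for name, _ in path]
    # Without propagation the label TTL starts high, so it never expires
    # at routers which only see the labelled frame.
    return [name for name, labelled in path if not labelled]


for propagate in (True, False):
    hops = visible_hops(PATH, propagate)
    print(f"propagate-ttl {'on' if propagate else 'off'}:")
    for number, name in enumerate(hops, start=1):
        print(f"  {number} {name}")
```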

However, in my GNS3 simulation, I like to expose the intermediate hops to verify complete paths from end station hosts for diagnostic and investigatory purposes.  Indeed, the reason I spotted the problem was whilst testing the paths for traffic when various failures occur, by shutting down parts of the simulation.

The solution to this is to use the "mpls ip ttl-expiration pop 1" command - this causes MPLS frames with only a single label depth (which will be plain IPv4-over-MPLS frames) to be returned directly to the sender by the P router, rather than forwarding it on to the PE router to be bounced back.  As such, there is no problem with address spoofing as the IP packet starts its life inside the backbone itself.
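
The decision the P router makes can be sketched in a few lines of Python.  This is my understanding of the behaviour as described here, not Cisco's implementation: the function name and the return strings are made up, and pop_labels stands for the number given to the "pop" keyword (None meaning the command isn't configured).

```python
def ttl_expired_reply_path(label_depth, pop_labels=None):
    """Decide how a P router sends the ICMP 'TTL exceeded' it has generated.

    label_depth -- number of MPLS labels on the frame whose TTL expired
    pop_labels  -- value configured with 'mpls ip ttl-expiration pop N',
                   or None if the command is not configured
    """
    if pop_labels is not None and label_depth <= pop_labels:
        # Shallow label stack: treat as plain IPv4-over-MPLS and reply
        # using the router's own IP routing table.
        return "return directly to sender"
    # Otherwise re-use the original label stack and let the egress PE
    # bounce the reply back.
    return "forward along original LSP to egress PE"


print(ttl_expired_reply_path(1))                # command not configured
print(ttl_expired_reply_path(1, pop_labels=1))  # plain IPv4-over-MPLS
print(ttl_expired_reply_path(2, pop_labels=1))  # e.g. an MPLS VPN frame
```

Frames with a deeper label stack (such as MPLS VPN traffic) still go via the egress PE, which is what you want, as the P router may not be able to route their addresses.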

After adding this command, my traceroute works:

HOST-JANETW#traceroute 131.111.8.31

Type escape sequence to abort.
Tracing the route to RAVEN.CAM (131.111.8.31)

  1 GW-JANETW.JANETW.JA (128.4.0.254) 16 msec 16 msec 8 msec
  2 JANETW.ENETW.NET (146.97.41.246) 20 msec 52 msec 20 msec
  3 ENETW.MILL.NET (192.84.5.133) [MPLS: Label 33 Exp 0] 68 msec 52 msec 24 msec
  4 MILL.DCW.NET (192.84.5.154) [MPLS: Label 16 Exp 0] 48 msec 68 msec 48 msec
  5 DCW-CUDN.DCW-SRV.NET (193.60.88.2) 80 msec 88 msec 68 msec
  6 RAVEN.CAM (131.111.8.31) 104 msec 56 msec 112 msec

You can even see the MPLS labels used in the previously-missing hops being returned in the ICMP extension block.

One thing that this has shown up is that, once the new data centre network is installed, traceroutes into our server network will not expose the backbone network router hops.  That wasn't something I was expecting and could be confusing for others.

Monday, 6 April 2015

Current GNS3 environment

I've been updating my GNS3 environment (I last posted about this back in August 2013) with some changes recently.  The new model (as of 6th April 2015), updated to work in GNS3 1.3.0, is below:


I added MPLS just under a couple of years ago (I started planning it in mid-2013 and implemented it late in the year), including an EoMPLS xconnect, as well as IPv6 multicast.  The recent changes however are:

  • Added the data centre network (being implemented at the moment) — interfacing to the backbone in the same way as a regular institutional network.
  • Added a second Janet connection (to be implemented soon, when the routers arrive!), with separate Janet router (janetw) and CUDN border router (enetw).  This involved adjusting the internal OSPF link costs and BGP advertisements.
  • Added the NAT service using a pair of IOS routers (instead of the real ASAs).
  • Expanded the Medical School network with a local, static router and added a directly-routed WBIC [Wolfson Brain Imaging Centre] host instead (which doesn't use the Medical School PoP connection, but this doesn't matter for the simulation).
  • Removed MRC CBSU [Cognition and Brain Sciences Unit] as they aren't doing anything particularly interesting any more (they were advertising a /23 we had to relay to Janet and expose and trim the AS path from, but we now do this for them; Engineering now do this, anyway).

Saturday, 4 April 2015

IP unicast RPF source verification

We currently block against spoofed addresses using extended router access lists on the inbound interfaces from institutions (departments and colleges).  Whilst this works, it is very tedious to maintain these access lists (including keeping them synchronised across routers, which we do via a complex external script mechanism).

However, Cisco IOS has a feature called "IP Unicast Reverse Path Forwarding Source Verification" which can be used to block spoofed addresses from entering via a routing interface which I've been meaning to look at for ages.

The feature is activated using the interface command "ip verify unicast source reachable-via rx" - this blocks any traffic entering via that interface if it wouldn't be used to send outbound traffic to the source address: an RPF (Reverse Path Forwarding) check, exactly as multicast does.
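
The check itself is simple enough to sketch in Python.  This is a simplified model, assuming a single best route per prefix (no ECMP) and a made-up routing table; the real feature has further options (such as how default routes are treated) which aren't modelled here.

```python
import ipaddress

# Hypothetical routing table: prefix -> outbound interface
ROUTES = {
    "0.0.0.0/0": "Ethernet1/0",
    "131.111.0.0/16": "Ethernet1/3",
}


def best_route_interface(routes, address):
    """Longest-prefix match: return the interface used to reach address."""
    addr = ipaddress.ip_address(address)
    matches = [
        (ipaddress.ip_network(prefix), interface)
        for prefix, interface in routes.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def rpf_rx_check(routes, source, in_interface):
    """Pass only if in_interface is the one we'd use to reach the source."""
    return best_route_interface(routes, source) == in_interface


print(rpf_rx_check(ROUTES, "131.111.8.31", "Ethernet1/3"))  # True: passes
print(rpf_rx_check(ROUTES, "198.51.100.7", "Ethernet1/3"))  # False: dropped
```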

Once the feature is applied, the "show ip interface" command will display the number of packets which have been blocked due to it:

JANETC#show ip int e1/3 | begin verify
  IP verify source reachable-via RX
   9 verification drops
   0 suppressed verification drops
   0 verification drop-rate

Asymmetric paths

For the feature to work, it is essential that traffic paths are symmetrical across any interfaces on which it is used.  This is normally the case for simple edge interfaces as there can only be a single path to the address ranges which are connected and statically routed across them.

However, on links where dynamic routing takes place, traffic can arrive via the non-reverse path, resulting in asymmetric paths.  On the internal links of a network, this is usually not a problem: the traffic across those can be trusted, as long as the edge of the network is protected.

The only place where this presents a problem is where dynamic routing takes place across an administrative boundary (such as those between ASs where eBGP peerings are used).  Here, the obvious solution is to fix the traffic paths such that they are symmetric, which is required if multicast is to function correctly anyway.

However, the verification feature supports an access list to allow packets to pass, even in the event of the RPF check failing.  This is specified on the end of the "ip verify ..." command, e.g.:

ip access-list standard 3
 permit 128.232.0.0 0.0.255.255
 permit 129.169.0.0 0.0.255.255
 permit 131.111.0.0 0.0.255.255
!
interface Ethernet1/3
 ip verify unicast source reachable-via rx 3

The number of packets which have been permitted through this mechanism will also show up in the IP interface information under "suppressed verification drops":

JANETC#show ip int e1/3 | begin verify
  IP verify source reachable-via RX, ACL 3
   31 verification drops
   39 suppressed verification drops
   0 verification drop-rate
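
How the two counters relate can be sketched in Python as follows.  This is an illustrative model only, assuming the three standard ACL entries above (each wildcard 0.0.255.255 being equivalent to a /16) and taking the result of the RPF check as an input rather than computing it.

```python
import ipaddress
from dataclasses import dataclass

# ip access-list standard 3, written as prefixes
ACL_3 = [
    ipaddress.ip_network(prefix)
    for prefix in ("128.232.0.0/16", "129.169.0.0/16", "131.111.0.0/16")
]


@dataclass
class VerifyCounters:
    drops: int = 0        # "verification drops"
    suppressed: int = 0   # "suppressed verification drops"


def verify_source(source, rpf_ok, counters):
    """Return True if the packet is accepted, updating the counters."""
    if rpf_ok:
        return True
    if any(ipaddress.ip_address(source) in net for net in ACL_3):
        # Failed the RPF check, but the ACL lets it through anyway.
        counters.suppressed += 1
        return True
    counters.drops += 1
    return False


counters = VerifyCounters()
verify_source("131.111.8.31", rpf_ok=False, counters=counters)   # suppressed
verify_source("198.51.100.7", rpf_ok=False, counters=counters)   # dropped
verify_source("198.51.100.7", rpf_ok=True, counters=counters)    # passes RPF
print(counters)
```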

Interaction with access lists

When combining this feature with inbound access lists on an interface, the access lists are applied BEFORE the source address verification.  As such, packets which the access list blocks will not show up in the address verification counts, shown above.

The two features can be usefully combined, as access lists do not need to be written to handle the validation of source addresses - they can just permit or deny the traffic required, based on ports and protocols, leaving source address validation to be done by the verification feature.  This is especially useful when multinetting is used.  I expect this will allow us to simplify many of our access lists.