Thursday, 17 December 2015

Frama franking machine ethernet issues

The University has been refurbishing the old Arup Building on the New Museums Site for the Cambridge Conservation Initiative and has renamed it the David Attenborough Building.

In the new building, we had problems with a Frama franking machine which wouldn't connect to the network:

  • When the machine boots up, the ethernet link goes up and the device DHCPs to get an address but then disconnects.
  • When trying a connection test or trying to frank something, the link never goes live (the interface doesn't go up from a physical/line level - not getting as far as IP) and it reports an error.
Oddly, if we disconnect the franking machine and connect something else to the wallport, that device works fine, even on 1Gbit/s connections.  If we connect the franking machine directly to a laptop (in my case, a Thunderbolt adapter on a MacBook Air), it also works fine and can be pinged.  We've tried other wallports, cables and switchports without resolving the issue.

Some notes:

  • The machine does, by design, disconnect from the network when it's not using it as a security measure (to minimise the chance of being hacked remotely).
  • The connection is via a wall socket and probably 30-40m cable.
  • The franking machine only has a 10Mbit/s half-duplex ethernet interface.
  • The switch the machine was connecting to is a Cisco 2960X-48LPS-L (48-port PoE+ with 10G uplink).
Fixing the port speed to 100Mbit/s half-duplex did not resolve the issue.

I ended up solving it by putting an unmanaged Netgear 8-port 10/100 switch between it and the wallport and it works fine.

However, I also wondered if PoE was upsetting the controller in the franking machine, so I disabled that and removed the Netgear.  That was working when I left, so that may have fixed it, but I'll have to check in after a couple of days and see how they're getting on.

Friday, 26 June 2015

The Double Wireshark

I've been trying to diagnose a problem with multicast forwarding over MPLS with MLDP: packets, including the PIM Hellos, are going missing somewhere between the PE routers, resulting in PIM neighbours occasionally timing out (even when no traffic is flowing across the MDT).

This morning, I did some packet captures over a quiet MDT (i.e. one where there was no traffic being forwarded, other than the PIM Hellos).  The backbone links are 10G and busy most of the time, making looking for a single missing packet tricky, so I changed OSPF costs and HSRP tracking object statuses to try to move as much traffic as possible away from them and make sniffing the links possible with a reasonable degree of confidence that there was no packet drop.

I started at the PE router to check it was actually sending the PIM Hellos on its uplink, which it was (at the time another PE router didn't receive them).  I then moved on to one of our core (P) routers, which is also the MDT root: I needed to see if the packets were ingressing and egressing correctly and see if they were going missing.

To do this, I attached two Ethernet interfaces to my MacBook Air (one Thunderbolt and one USB).  Although traffic levels were only a few megabits, I used the USB one for ingress and the Thunderbolt one for egress (as the latter is arguably more critical) and ran two Wiresharks:


The core router was then set to mirror the ingress and egress ports to different monitoring destinations - Gi5/2 and Gi6/2 (the copper ports on the Supervisor cards) and attached to the MacBook Air:


The missing packets were evident by comparing the captures:


These corresponded to the destination PE router showing it had missed a PIM Hello and the timeout not resetting (and a debug ip pim vrf ... hello / terminal monitor would show it had gone missing).

I've emailed the captures and report to the support partner / Cisco.

Sunday, 21 June 2015

ECMP with OSPF, BGP and MPLS

As reported earlier, we've been gradually enabling ECMP across the University backbone to make better use of links and increase effective bandwidth.  For the most part, this is pretty straightforward, but there were a few gotchas — I thought I'd document both these: the enabling and the gotchas.

Note that ECMP typically enables load-sharing and NOT load-balancing:

  • Load-sharing is about distributing the traffic across active paths, probably by hashing the source and/or destination addresses of the packet, potentially resulting in an uneven distribution (especially if the traffic is between a small number of addresses).
  • Load-balancing attempts to distribute the traffic such that it is evenly split across the available paths.
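
The difference can be illustrated with a minimal Python sketch of hash-based load-sharing.  This is not how any particular router computes its hash: the SHA-256 hash, the interface names and the traffic figures are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

# Hypothetical pair of equal-cost uplinks
PATHS = ["Ethernet1/2", "Ethernet1/4"]


def pick_path(src: str, dst: str, paths=PATHS) -> str:
    """Load-sharing: hash the address pair and pick a path by modulo.

    Every packet of the same src/dst pair always takes the same path.
    """
    digest = hashlib.sha256(f"{src}->{dst}".encode()).digest()
    return paths[int.from_bytes(digest[:4], "big") % len(paths)]


def share(flows):
    """Sum the traffic (in Mbit/s) each path would carry for (src, dst, mbit) flows."""
    load = Counter({path: 0 for path in PATHS})
    for src, dst, mbit in flows:
        load[pick_path(src, dst)] += mbit
    return dict(load)


# One heavy flow and three light ones (made-up addresses and rates)
flows = [
    ("192.0.2.10", "198.51.100.1", 900),
    ("192.0.2.11", "198.51.100.1", 10),
    ("192.0.2.12", "198.51.100.1", 10),
    ("192.0.2.13", "198.51.100.1", 10),
]
print(share(flows))
```

However the hash falls, the 900Mbit/s flow lands entirely on one path, so that path carries at least 900 of the 930Mbit/s total: the traffic is shared, but not balanced.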

Note that ECMP using routing protocols will usually only distribute traffic where routers are both the ingress and egress points.  It does nothing for inbound traffic from a simple host with a single subnet gateway or static default route: in the absence of a dynamic routing protocol on the host, solutions such as Cisco's GLBP (Gateway Load Balancing Protocol) can help (which I'm not going to cover here).

OSPFv2 and OSPFv3

OSPF is easy to do - you simply increase the number of paths used:

router ospf 1
 maximum-paths 2
!
ipv6 router ospf 1
 maximum-paths 2

... the parameter to maximum-paths specifies how many of the available [lowest and best] equal-cost paths calculated in OSPF are loaded in the router's active forwarding table.  IOS supports a value up to 6.

This command must be entered on all routers in the network and applies at each hop.  For example, in a traditional two-layer core and distribution model:
  • On the ingress distribution router, it will distribute traffic across uplinks to the core routers.
  • On the core routers, it will distribute traffic across the downlinks to the egress distribution routers which serve the destination address.
Once entered, the show ip route ... command can be used to confirm multipath is in operation — here on a core router:

CORE-CENT#show ip route 131.111.10.10
Routing entry for 131.111.10.0/24
  Known via "ospf 1", distance 110, metric 27, type extern 1
  Last update from 192.84.5.18 on Ethernet1/4, 00:01:07 ago
  Routing Descriptor Blocks:
  * 192.84.5.34, from 192.84.5.238, 00:01:07 ago, via Ethernet1/2
      Route metric is 27, traffic share count is 1
    192.84.5.18, from 192.84.5.234, 00:01:07 ago, via Ethernet1/4
      Route metric is 27, traffic share count is 1

Personally, I find the show ip cef ... detail command a little clearer (and explains MPLS better, when we get round to that):

CORE-CENT#show ip cef 131.111.10.10 detail 
131.111.10.0/24, epoch 0, per-destination sharing
  local label info: global/33
  nexthop 192.84.5.18 Ethernet1/4
  nexthop 192.84.5.34 Ethernet1/2

That's all there is to it, although if you use DHCP relaying, first hop redundancy (VRRP, HSRP or GLBP) and address spoofing protection, then note the messy problem with those I covered in an earlier article!

BGP

Multipath using BGP is similar to OSPF:

router bgp 64602
 address-family ipv4 unicast
  maximum-paths 2
  maximum-paths ibgp 2
 exit-address-family
 !
 address-family ipv6 unicast
  maximum-paths 2
  maximum-paths ibgp 2
 exit-address-family

The obvious difference is the two separate commands:
  • maximum-paths ... applies to routes learnt over eBGP peerings
  • maximum-paths ibgp ... applies to routes learnt over iBGP peerings, even if they are external in origin (i.e. learnt from an eBGP ASBR in the same AS)
Because of this, usually only the latter (the iBGP version) is required on core routers and BGP route reflectors (as they don't usually have eBGP peers).

Route Reflectors (RRs) also present an additional wrinkle — an RR will only reflect a single route to its clients: that which it itself considers the best, based on the normal BGP selection methods (which can include the IGP cost).  This does not usually cause a problem because, as long as the RR client sends the traffic to the RR, the RR can then multipath traffic from itself according to the available paths, when it forwards it on.  However, this did cause a problem when trying to share traffic across the core to our internet gateways (more later).

Note that you do not (and cannot) enable multipath for the multicast address families.  The multicast routes are for RPF checking; multipath for multicast traffic forwarding is done separately (and I've not yet looked into it, so there's nothing in this article about it).

eBGP example

Let's look at an example - 129.169.0.0/16 is a block of addresses used by the Department of Engineering — their network is AS65106,  separate from the University backbone (which is AS64602).  They connect to the backbone via a series of /30 link subnets in 193.60.93.16/28 across which operate the eBGP peerings.

(Note that we're using a contrived simulation here, so the details are not necessarily accurate to reality.)

Looking in the routing table for 129.169.80.10 on one of the core routers (which, in my example is advertised equally across two of these links):

CORE-CENT#show ip route 129.169.80.10
Routing entry for 129.169.0.0/16
  Known via "bgp 64602", distance 200, metric 0
  Tag 65106, type internal
  Last update from 193.60.93.26 00:02:11 ago
  Routing Descriptor Blocks:
    193.60.93.26, from 192.84.5.234, 00:02:11 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65106
      MPLS label: none
  * 193.60.93.18, from 192.84.5.236, 00:02:11 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65106
      MPLS label: none

This shows the two links via eBGP:
  • one via 193.60.93.26 which was learnt from iBGP peer 192.84.5.234
  • the other via 193.60.93.18 which was learnt from iBGP peer 192.84.5.236
Checking the actual forwarding on a core router with show ip cef:

CORE-CENT#show ip cef 129.169.80.10 detail   
129.169.0.0/16, epoch 0, flags rib only nolabel, rib defined all labels, per-destination sharing
  recursive via 193.60.93.18
    recursive via 193.60.93.16/30
      nexthop 192.84.5.26 Ethernet1/6
  recursive via 193.60.93.26
    recursive via 193.60.93.24/30
      nexthop 192.84.5.18 Ethernet1/4

Note the two levels of recursion:
  • 193.60.93.18 and 193.60.93.26 are the addresses of the eBGP border routers from which the routes were learnt
  • these match the link subnet routes 193.60.93.16/30 and 193.60.93.24/30, respectively (from the IGP)
  • each of these was learnt from the IGP neighbours 192.84.5.26 (on Eth1/6) and 192.84.5.18 (on Eth1/4)
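
The two levels of recursion can be sketched in a few lines of Python.  The tables below are copied by hand from the CEF output above; the function names are hypothetical and the lookup is a simplified longest-prefix match, not a model of how CEF is actually implemented.

```python
import ipaddress

# BGP prefix -> BGP next hops (the eBGP border router addresses)
BGP = {"129.169.0.0/16": ["193.60.93.18", "193.60.93.26"]}

# IGP link subnet -> (IGP neighbour, outgoing interface)
IGP = {
    "193.60.93.16/30": ("192.84.5.26", "Ethernet1/6"),
    "193.60.93.24/30": ("192.84.5.18", "Ethernet1/4"),
}


def longest_match(table, address):
    """Return the most specific prefix in the table containing the address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [ipaddress.ip_network(p) for p in table if addr in ipaddress.ip_network(p)]
    if not matches:
        return None
    return str(max(matches, key=lambda net: net.prefixlen))


def resolve(destination):
    """Resolve a destination to (BGP next hop, IGP prefix, IGP neighbour, interface) tuples."""
    bgp_prefix = longest_match(BGP, destination)
    if bgp_prefix is None:
        return []
    paths = []
    for bgp_next_hop in BGP[bgp_prefix]:
        igp_prefix = longest_match(IGP, bgp_next_hop)
        if igp_prefix is not None:
            paths.append((bgp_next_hop, igp_prefix) + IGP[igp_prefix])
    return paths


for path in resolve("129.169.80.10"):
    print(path)
```

Each BGP next hop resolves through a different link subnet and therefore through a different IGP neighbour and interface, which is what gives the two forwarding paths.
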
Just for completeness, let's delve into the BGP database:

CORE-CENT#show bgp ipv4 unicast 129.169.80.10
BGP routing table entry for 129.169.0.0/16, version 31
Paths: (2 available, best #2, table default)
Multipath: iBGP
  Advertised to update-groups:
     1         
  Refresh Epoch 1
  65106, (aggregated by 65106 129.169.252.1), (Received from a RR-client), (received & used)
    193.60.93.18 (metric 27) from 192.84.5.236 (192.84.5.236)
      Origin IGP, metric 0, localpref 100, valid, internal, atomic-aggregate, multipath(oldest)
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  65106, (aggregated by 65106 129.169.252.2), (Received from a RR-client), (received & used)
    193.60.93.26 (metric 27) from 192.84.5.234 (192.84.5.234)
      Origin IGP, metric 0, localpref 100, valid, internal, atomic-aggregate, multipath, best
      rx pathid: 0, tx pathid: 0x0

MPLS L3 VPNs

On the surface, MPLS L3 VPNs look straightforward and similar to BGP — you just need to use maximum-paths in the corresponding BGP VRF stanzas:

router bgp 64602
 address-family ipv4 unicast vrf ucs-staff_vrf
  maximum-paths 2
  maximum-paths ibgp 2
 exit-address-family
 !
 address-family ipv6 unicast vrf ucs-staff_vrf
  maximum-paths 2
  maximum-paths ibgp 2
 exit-address-family

(In the above case, the VRFs have eBGP peerings to non-MPLS peers — ones using a regular ipv4 unicast address family with the peering inside a VRF, as opposed to a vpnv4 peering with non-VRF addresses; providing the so-called "carrier's carrier" service.  Hence the need for the maximum-paths 2 line.  If the VPN was only using connected and static routes, inside the AS, only the maximum-paths ibgp 2 line would be needed as the routes would all be internal.)

Close, but no cigar: only a single path was being used — looking on the ingress PE router:

DIST-HOSP#show ip cef vrf eng_vrf 129.169.10.10 detail 
129.169.10.0/24, epoch 0
  recursive via 192.84.5.234 label 52
    nexthop 192.84.5.29 Ethernet1/0 label 18

And BGP confirms only a single route is available:

DIST-HOSP#show bgp vpnv4 unicast vrf eng_vrf 129.169.10.10
BGP routing table entry for 64602:129:129.169.10.0/24, version 6
Paths: (1 available, best #1, table eng_vrf)
Flag: 0x820
  Not advertised to any peer
  Local
    192.84.5.234 (metric 15) from 192.84.5.240 (192.84.5.240)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:64602:129
      Originator: 192.84.5.234, Cluster list: 192.84.5.240
      Connector Attribute: count=1
       type 1 len 12 value 64602:129:192.84.5.234
      mpls labels in/out nolabel/52

Odd - let's have a look at a core RR P router (note that we have to look up the route using the RD itself (64602:129 - a combination of our AS plus a local ID) as there is no VRF configured on a P router):

CORE-CENT#show bgp vpnv4 uni rd 64602:129 129.169.10.10
BGP routing table entry for 64602:129:129.169.10.0/24, version 5
Paths: (2 available, best #1, no table)
Flag: 0x820
  Advertised to update-groups:
        2
  Local, (Received from a RR-client)
    192.84.5.234 (metric 8) from 192.84.5.234 (192.84.5.234)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:64602:129
      Connector Attribute: count=1
       type 1 len 12 value 64602:129:192.84.5.234
      mpls labels in/out nolabel/52
  Local, (Received from a RR-client)
    192.84.5.238 (metric 8) from 192.84.5.238 (192.84.5.238)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:64602:129
      Connector Attribute: count=1
       type 1 len 12 value 64602:129:192.84.5.238
      mpls labels in/out nolabel/17

Both routes are present here, but only one is being selected: the one from 192.84.5.234 (marked as "best"), based on it having a lower IP address (given no other method of preference); the one from 192.84.5.238 is being discarded.

Distinguishing Route Distinguishers

This was a bit of a mystery until I found this post by Ivan Pepelnjak which described how RRs behave with multiple VPN routes and finally made the distinction between Route Distinguishers and Route Targets clear to me (and he admits it's not clearly explained in the Cisco textbooks):
  • The Route Distinguisher (RD) is used, along with the IPv4 or IPv6 prefix of the route, to build a complete route ID, which you can imagine being in the form "RD:route", e.g. "64602:129:129.169.10.0/24" (you can see this in the first line of the output from show bgp vpnv4 ..., above).  The purpose of the RD is to distinguish a route belonging to one VPN from another in the provider network.
  • The Route Target (RT) specifies which VPNs a particular route should be imported from or exported into, when a VRF is configured on a particular router.
The important point is that a RR will only reflect a single route with a particular ID: if the PE routers are all using the same RD, the RR will only use a single one of these routes.  Changes to this involve extensions to the BGP protocol which only appeared in IOS 15.2, with the router bgp ... / bgp additional-paths ... command.

Without this new capability, the solution to this is to use a different RD on each PE router, resulting in distinct routes in the provider network.  This made us revisit how we assign RDs and RTs:
  • We now set the RD to be the public IPv4 loopback address of the PE router, instead of using the local [private] ASN; RDs support this as a standard format and leave 16 bits for the administratively-assigned ID.  This changes our RDs from 64602:id to 192.84.5.x:id.
  • On the other hand, the import and export RTs typically need to be the same for all VRFs across the VPN (unless partial imports are to be used).  For consistency with the RD, we've changed those from the same ASN:id format (64602:id, same as the old RD) to 192.84.5.0:id (192.84.5.0/24 being the block we use for our router backbone and loopback addresses).
(Note that the address-family vpnv4 stanza does have a maximum-paths statement, but that doesn't appear to do anything useful: I'm not sure what the point of it is!)
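
The effect of the RD on a route reflector can be modelled with a short Python sketch.  It is a deliberate simplification: the reflector is assumed to keep exactly one path per "RD:route" ID, and the only tie-break modelled is the lower next hop address (as seen in the output above).  The function names are hypothetical.

```python
import ipaddress
from collections import defaultdict


def route_id(rd: str, prefix: str) -> str:
    """Build the 'RD:route' ID that identifies a VPN route in the provider network."""
    return f"{rd}:{prefix}"


def reflect(advertisements):
    """Keep one best path per route ID, as a reflector without additional-paths would.

    advertisements is a list of (rd, prefix, next_hop) tuples; the simplified
    tie-break is the lowest next hop address.
    """
    candidates = defaultdict(list)
    for rd, prefix, next_hop in advertisements:
        candidates[route_id(rd, prefix)].append(next_hop)
    return {rid: min(hops, key=ipaddress.ip_address) for rid, hops in candidates.items()}


PREFIX = "129.169.10.0/24"

# Both PE routers using the same RD: one route ID, so only one path is reflected
shared_rd = [
    ("64602:129", PREFIX, "192.84.5.234"),
    ("64602:129", PREFIX, "192.84.5.238"),
]

# A different RD on each PE router: two route IDs, so both paths are reflected
per_pe_rd = [
    ("192.84.5.234:129", PREFIX, "192.84.5.234"),
    ("192.84.5.238:129", PREFIX, "192.84.5.238"),
]

print(reflect(shared_rd))
print(reflect(per_pe_rd))
```

With the shared RD, the reflector's clients only ever hear about the path via 192.84.5.234; with per-PE RDs, both routes are reflected and the importing PE router can multipath across them.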

Once changed, both routes show up as best on the core router when searching by the new RDs: as far as it is concerned, they're completely separate routes, albeit with the same RT.  It's only on the PE router that the two routes are brought together, when they're imported into the VRF by the common RT:

CORE-CENT#show bgp vpnv4 unicast rd 192.84.5.234:129 129.169.10.10
BGP routing table entry for 192.84.5.234:129:129.169.10.0/24, version 3
Paths: (1 available, best #1, no table)
  Advertised to update-groups:
     1         
  Refresh Epoch 1
  Local, (Received from a RR-client)
    192.84.5.234 (metric 8) from 192.84.5.234 (192.84.5.234)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:192.84.5.0:129
      mpls labels in/out nolabel/54
      rx pathid: 0, tx pathid: 0x0

CORE-CENT#show bgp vpnv4 unicast rd 192.84.5.238:129 129.169.10.10
BGP routing table entry for 192.84.5.238:129:129.169.10.0/24, version 2
Paths: (1 available, best #1, no table)
  Advertised to update-groups:
     1         
  Refresh Epoch 1
  Local, (Received from a RR-client)
    192.84.5.238 (metric 8) from 192.84.5.238 (192.84.5.238)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:192.84.5.0:129
      mpls labels in/out nolabel/56
      rx pathid: 0, tx pathid: 0x0

Looking on the PE router, both routes now appear and are marked as "multipath".  Note that the unique route ID contains the local RD for this VRF (192.84.5.237:129):

DIST-HOSP#show bgp vpnv4 unicast vrf eng_vrf 129.169.10.10
BGP routing table entry for 192.84.5.237:129:129.169.10.0/24, version 12
Paths: (2 available, best #2, table eng_vrf)
Multipath: iBGP
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from 192.84.5.238:129:129.169.10.0/24 (global)
    192.84.5.238 (metric 15) from 192.84.5.240 (192.84.5.240)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath(oldest)
      Extended Community: RT:192.84.5.0:129
      Originator: 192.84.5.238, Cluster list: 192.84.5.240
      mpls labels in/out nolabel/56
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local, imported path from 192.84.5.234:129:129.169.10.0/24 (global)
    192.84.5.234 (metric 15) from 192.84.5.240 (192.84.5.240)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath, best
      Extended Community: RT:192.84.5.0:129
      Originator: 192.84.5.234, Cluster list: 192.84.5.240
      mpls labels in/out nolabel/54
      rx pathid: 0, tx pathid: 0x0

Finally, show ip cef ... will confirm that multipath is in use:

DIST-HOSP#show ip cef vrf eng_vrf 129.169.10.10 detail    
129.169.10.0/24, epoch 0, flags rib defined all labels, per-destination sharing
  recursive via 192.84.5.234 label 54
    nexthop 192.84.5.29 Ethernet1/0 label 33
  recursive via 192.84.5.238 label 56
    nexthop 192.84.5.29 Ethernet1/0 label 17

Final note about ECMP with MPLS

The other thing to note about MPLS is that it is the ingress PE router which determines the egress PE router, rather than the P routers.  The reason for this is that the outer label the ingress PE router places on the traffic is that of the egress PE router and thus it selects it (rather than the P routers, on a per-hop basis): the P routers simply forward traffic according to that label.

As such, multipathing using MPLS does not require any configuration on the P routers.
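
As a rough illustration of that division of labour, here is a Python sketch using the labels from the final show ip cef output above.  The CRC32 hash and the function names are assumptions for illustration only; real routers use their own hashing.

```python
import zlib

# (egress PE, VPN label, next hop, transport label), copied from the CEF output above
PATHS = [
    ("192.84.5.234", 54, "192.84.5.29", 33),
    ("192.84.5.238", 56, "192.84.5.29", 17),
]


def ingress_pe_forward(src: str, dst: str) -> dict:
    """The ingress PE picks the egress PE and pushes both labels (outer label first)."""
    index = zlib.crc32(f"{src}-{dst}".encode()) % len(PATHS)
    egress_pe, vpn_label, next_hop, transport_label = PATHS[index]
    return {
        "egress_pe": egress_pe,
        "next_hop": next_hop,
        "label_stack": [transport_label, vpn_label],
    }


def p_router_lookup(label_stack) -> int:
    """A P router forwards on the outer label alone; it makes no egress PE choice."""
    return label_stack[0]


# Hypothetical source address; the destination is the one used in the examples above
packet = ingress_pe_forward("192.0.2.10", "129.169.10.10")
print(packet, p_router_lookup(packet["label_stack"]))
```

By the time the packet reaches a P router, the choice of egress PE has already been encoded in the outer label.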

Tuesday, 19 May 2015

IOS to NX-OS reference

We're in the process of setting up our Nexus 7010s to replace our Catalyst 6509Es as our data centre routers.  This configuration requires that we port our IOS distribution router configuration to NX-OS.

It's quite easy to find these using Google, but I've found these Cisco docwiki references helpful:

Basic IPv4 BGP comparison

As a taster, here is part of our BGP configuration for a distribution router in IOS:

router bgp 64602
 bgp router-id 192.84.5.248
 no bgp default ipv4-unicast
 bgp log-neighbor-changes
 !
 neighbor core_peers peer-group
 neighbor core_peers remote-as 64602
 neighbor core_peers update-source Loopback0
 neighbor core_peers timers 10 35
 !
 neighbor 192.84.5.240 peer-group core_peers
 neighbor 192.84.5.250 peer-group core_peers
 !
 address-family ipv4
  neighbor core_peers send-community
  neighbor core_peers soft-reconfiguration inbound
  neighbor 192.84.5.240 activate
  neighbor 192.84.5.250 activate
  maximum-paths 2
  maximum-paths ibgp 2
  no auto-summary
  no synchronization
 exit-address-family
 !
 address-family ipv4 multicast
  neighbor core_peers send-community
  neighbor core_peers soft-reconfiguration inbound
  neighbor 192.84.5.240 activate
  neighbor 192.84.5.250 activate
 exit-address-family

... and here's the equivalent in NX-OS:

router bgp 64602
  router-id 192.84.5.248
  log-neighbor-changes
  !
  address-family ipv4 unicast
    maximum-paths 2
    maximum-paths ibgp 2
  !
  address-family ipv4 multicast
    maximum-paths 2
    maximum-paths ibgp 2
  !
  template peer core_peers
    remote-as 64602
    update-source loopback0
    timers 10 35
    address-family ipv4 multicast
      send-community
      soft-reconfiguration inbound
    address-family ipv4 unicast
      send-community
      soft-reconfiguration inbound
  !
  neighbor 192.84.5.240
    inherit peer core_peers
  neighbor 192.84.5.250
    inherit peer core_peers

You can see that peer groups become templates and the parameters for each address family with a peer are configured under the peer (as part of the template) rather than the address family block directly under the BGP process.

Monday, 18 May 2015

Compiling GNS3 under Linux

I recently had cause to run GNS3 on Linux as I needed to support a cloud interface using VLANs: I needed to link connections to some virtual routers out onto a physical interface, with each link as a separate VLAN.  OS X doesn't support this (it seems - you can configure it, but it doesn't appear to work), whereas Linux does.

More on that later, but I found this guide very useful for setting up GNS3 on Ubuntu 14.04 Desktop and it worked fine in a VMware VM on my laptop:

http://www.firstdigest.com/2014/12/gns3-1-2-1-installation-on-ubuntu-14-04/

Monday, 20 April 2015

DHCP, RPF verify, FHRP and ECMP - when protocols collide

I've been gradually enabling ECMP (Equal-Cost MultiPath) routing on parts of the University network as part of a prelude to the new data centre network and to improve performance generally (by making use of the dual downstream interfaces into institutions).

For the most part, this is fairly straightforward and just worked - I'll explain what was needed in each case in a future post (as MPLS VPN works slightly differently and needs some special consideration), but I did get a nasty problem which broke DHCP relaying in some places that I'll cover here.

How DHCP relaying works


First, let's review how DHCP relaying works (what you get when you do ip helper-address ... on a Cisco router interface towards clients).  Consider the following network:
When a client on the subnet 192.168.1.0/24 wants to DHCP, the following happens initially:
  1. The client sends out a DHCP DISCOVER message from 0.0.0.0 (as it doesn't know its IP address or even subnet, yet) on UDP port 68 (the DHCP client port) to all-hosts broadcast (255.255.255.255) port 67 (DHCP server).
  2. The DHCP relay agent (which is normally the router) will receive this broadcast and forward it as a unicast packet to the DHCP server listed in the ip helper-address ... interface command.  This will be from the router's interface address (this is the critical bit) - 192.168.1.253, in the above example, port 67 to the DHCP server, port 67.
  3. The DHCP server will receive the DISCOVER and, assuming it has an address (and other information) to give to the client, it will send a DHCP OFFER message back to the router to be relayed on to the client.  This will be the reverse of the packet just received: going from the DHCP server's address, port 67 to the router's interface address, port 67.
  4. The router will receive the reply and unicast it back to the client, sending it from its interface IP address, port 67 to the client's prospective IP address, port 68.
This all works fine and, assuming the client wants to take the address, it will send a DHCP REQUEST (using the same process) and receive a DHCP ACK back from the server so it can begin using the address.
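
The addressing at each of the four steps can be written out as a small Python sketch.  The DHCP server address and the offered client address used in the usage example are made up; only the relay address and the port numbers come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    message: str
    src: str
    sport: int
    dst: str
    dport: int


def client_discover() -> Packet:
    """Step 1: the client broadcasts from 0.0.0.0 port 68 to 255.255.255.255 port 67."""
    return Packet("DISCOVER", "0.0.0.0", 68, "255.255.255.255", 67)


def relay_to_server(broadcast: Packet, relay_ip: str, server_ip: str) -> Packet:
    """Step 2: the relay unicasts it from its own interface address, port 67 to port 67."""
    return Packet(broadcast.message, relay_ip, 67, server_ip, 67)


def server_offer(relayed: Packet) -> Packet:
    """Step 3: the server replies with the addresses and ports of step 2 reversed."""
    return Packet("OFFER", relayed.dst, relayed.dport, relayed.src, relayed.sport)


def relay_to_client(offer: Packet, offered_ip: str) -> Packet:
    """Step 4: the relay unicasts the reply to the client's prospective address, port 68."""
    return Packet(offer.message, offer.dst, 67, offered_ip, 68)


RELAY = "192.168.1.253"
SERVER = "10.0.0.67"       # hypothetical DHCP server address
OFFERED = "192.168.1.50"   # hypothetical address offered to the client

step1 = client_discover()
step2 = relay_to_server(step1, RELAY, SERVER)
step3 = server_offer(step2)
step4 = relay_to_client(step3, OFFERED)
for step in (step1, step2, step3, step4):
    print(step)
```

The important detail is in step 3: the reply is addressed to the relaying router's own interface address on the client subnet, not to a loopback or to the client.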

How DHCP relaying doesn't work


Now consider what happens when the same DHCP DISCOVER is relayed by the other client subnet router, without ECMP:
Here, something goes wrong at step 3, trying to return the DHCP OFFER to the relaying router:
  1. Because the backbone network (depicted with the cloud symbol) only knows about routing to the client subnet as a whole /24 (not the individual routers' addresses), it routes the packet via 192.168.1.253.
  2. 192.168.1.253 treats 192.168.1.252 as just another host on the client subnet and forwards it out of its interface onto that subnet.
  3. Because 192.168.1.252 has anti-spoofing blocks in place on the client subnet interface, it rejects this packet as the source address is that of the DHCP server, which is not a valid source address for traffic arriving from the client subnet.
This configuration is typical for a network where a FHRP (First Hop Redundancy Protocol) such as HSRP or VRRP is in use; with it, half of the DHCP replies (the ones relayed via one of the two routers) will be lost when they're returned to the relaying routers.

However, this in itself isn't particularly a problem in that both routers will relay the same packet to the DHCP server, resulting in it receiving two copies of each DISCOVER [but with different relay agent / forwarder addresses], causing two OFFERs to be returned; the clients will not miss out as they'll get one of the two copies.  This is why I've never noticed this problem, even though it's been going on for years: the clients have still got their address and worked.

(Actually, I was sort-of aware of this problem, as it prevented pinging one of the routers' own addresses on a particular interface, if the source of the ping was elsewhere on the network.  However, that's just been a minor inconvenience and not service-affecting; I never realised that would also be affecting DHCP.)

Combining with ECMP


When this situation is combined with ECMP this can get messy: the returned DHCP OFFERs (and ACKs) might be returned to either of the two client subnet routers.  The routers' addresses are often 1 number offset (e.g. 192.168.1.252 vs .253) which will likely mean they each take a different path.

If the path for packets to the .253 relay address happens to go directly to the .253 router, all is fine.  Same with .252.

However, if you're really unlucky (and, of course, we were, in some situations), ECMP will return the .253 packet via the .252 router and the .252 packet via .253.  This results in both replies being rejected and the client getting neither of the responses.
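
The three cases (no ECMP, lucky ECMP and unlucky ECMP) can be checked with a toy Python model.  It assumes only what is described above: a reply survives if the backbone delivers it straight to the router that relayed it, and is dropped by that router's anti-spoofing filter if it arrives via the other router and the client subnet.  The names are hypothetical.

```python
ROUTERS = ("192.168.1.252", "192.168.1.253")


def reply_survives(relay: str, delivered_to: str) -> bool:
    """A reply survives only if the backbone hands it directly to the relaying router.

    Otherwise the other router forwards it onto the client subnet, where the
    relaying router's anti-spoofing filter rejects it.
    """
    return relay == delivered_to


def client_gets_reply(backbone_choice: dict) -> bool:
    """backbone_choice maps each relay address to the router the backbone delivers to."""
    return any(reply_survives(relay, backbone_choice[relay]) for relay in ROUTERS)


# Without ECMP: everything for the /24 goes via .253, so one of the two replies survives
no_ecmp = {"192.168.1.252": "192.168.1.253", "192.168.1.253": "192.168.1.253"}

# ECMP, lucky: each reply happens to be delivered to its own relay
lucky = {"192.168.1.252": "192.168.1.252", "192.168.1.253": "192.168.1.253"}

# ECMP, unlucky: the replies are crossed over and both are rejected
unlucky = {"192.168.1.252": "192.168.1.253", "192.168.1.253": "192.168.1.252"}

for name, choice in (("no ECMP", no_ecmp), ("lucky", lucky), ("unlucky", unlucky)):
    print(name, client_gets_reply(choice))
```

Only the crossed-over case leaves the client with no reply at all, which matches why the problem went unnoticed until ECMP was enabled.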

Fixing the problem and creating another


I couldn't find any way to direct the replies back to the correct router (e.g. by advertising the router's interface IP address into OSPF as a /32), so dealing with them being rejected by the anti-spoofing protection seemed the only solution.

As I've written, I've been looking at the ip verify unicast source ... command recently, and it seemed a good opportunity to employ that, rather than modify lots of access control lists.  According to Cisco's documentation, that command has a special feature to handle DHCP:
"Unicast RPF will allow packets with 0.0.0.0 source and 255.255.255.255 destination to pass so that Bootstrap Protocol (BOOTP) and Dynamic Host Configuration Protocol (DHCP) functions work properly."
— from the IOS Security Configuration Guide for IOS 12.2SX
Sounds good, except that doesn't handle the source addresses of relayed DHCP replies.  It would be nice if this included "packets with a source address of the interface ip helper-address and port 67, destined for the router's interface address port 67", but it doesn't.

However, the command has a feature to allow packets matching an access list to be accepted, even if they fail the RPF check.  It's configured as follows:

ip access-list extended 1301
 permit udp host DHCP-SERVER eq 67 192.168.1.0 0.0.0.255 eq 67
!
interface ...
 ip address 192.168.1.253 255.255.255.0
 standby ip 192.168.1.254
 ip verify unicast source reachable-via rx 1301

This will allow the initial DHCP DISCOVER in (as described in the Cisco documentation), regular 192.168.1.0/24 traffic (due to the RPF check) AND traffic from the DHCP server on port 67 to an address on the same subnet port 67 (which is less tedious than putting the interface IP address itself as it can be copied to the other router without modification).  This change can be combined with a simplification of the interface access lists (if used).  So, I implemented a few of these and all looked hunky dory.

The IP Input process - my old nemesis


However, a little while later, we started getting alarms for CPU usage on the Catalyst 6500-E routers.  A show process cpu sorted command showed high load caused by the IP Input process.

This is usually caused by excessive traffic being forwarded ("punted" in Cisco parlance) to the Route Processor (RP) for handling.  We can capture and display these with the following commands:

router# debug netdr capture rx
router# show netdr captured-packets

(Use debug netdr clear-capture to clear the buffer and our old friend undebug all to switch it off.)

The packets being punted all appeared to be regular data - nothing complicated like DHCP which needs special processing, so I started doing some more reading and found a document on Cisco's website explaining how this configuration is handled on a 6500:
"For unicast RPF check without ACL filtering, the PFC3 provides hardware support for the RPF check of traffic from multiple interfaces. 
For unicast RPF check with ACL filtering, the PFC determines whether or not traffic matches the ACL. The PFC sends the traffic denied by the RPF ACL to the route processor (RP) for the unicast RPF check. Packets permitted by the ACL are forwarded in hardware without a unicast RPF check." 
— from the IOS Network Security guide for IOS 12.2SX on Catalyst 6500 with PFC3
So it appears that, when you use an ACL, all the traffic not matching the ACL will get punted to the RP.  Excellent.

Fixing the problem for good


So, I backtracked on using the ip verify unicast ... command and reverted to using our old inbound access lists to protect against address spoofing.  These now have an extra entry and look as follows:

ip access-list extended in-subnet
 permit ip 192.168.1.0 0.0.0.255 any
 permit udp any eq bootpc host 255.255.255.255 eq bootps
 permit udp host DHCP-SERVER eq bootps 192.168.1.0 0.0.0.255 eq bootps
 deny ip any any

This appears to do the trick and doesn't involve the RP on the router going bananas.  Given this problem, I think I'll abandon using the ip verify unicast ... command!
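
For anyone wanting to reason about which packets the access list lets in, here is a Python sketch of its first-match logic.  DHCP-SERVER is a placeholder in the configuration above, so the server address used here is made up, and this models only the four entries shown, not IOS ACL processing in general.

```python
import ipaddress

SUBNET = ipaddress.ip_network("192.168.1.0/24")
DHCP_SERVER = "10.0.0.67"  # hypothetical stand-in for DHCP-SERVER
BOOTPS, BOOTPC = 67, 68


def in_subnet_acl(proto: str, src: str, sport: int, dst: str, dport: int) -> bool:
    """Return True if the inbound packet is permitted, evaluating entries in order."""
    source = ipaddress.ip_address(src)
    dest = ipaddress.ip_address(dst)
    # permit ip 192.168.1.0 0.0.0.255 any
    if source in SUBNET:
        return True
    # permit udp any eq bootpc host 255.255.255.255 eq bootps
    if proto == "udp" and sport == BOOTPC and dst == "255.255.255.255" and dport == BOOTPS:
        return True
    # permit udp host DHCP-SERVER eq bootps 192.168.1.0 0.0.0.255 eq bootps
    if (proto == "udp" and src == DHCP_SERVER and sport == BOOTPS
            and dest in SUBNET and dport == BOOTPS):
        return True
    # deny ip any any
    return False


# Initial DISCOVER from a client with no address yet
print(in_subnet_acl("udp", "0.0.0.0", 68, "255.255.255.255", 67))
# Relayed reply arriving via the other router, destined for .252
print(in_subnet_acl("udp", DHCP_SERVER, 67, "192.168.1.252", 67))
# Spoofed source address from outside the subnet
print(in_subnet_acl("tcp", "203.0.113.9", 40000, "192.168.1.10", 80))
```

The third entry is the new one: it admits only the relayed replies from the DHCP server, so other off-subnet source addresses are still rejected.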

(Update 2018-02-01 — we have since installed Catalyst 6807-XLs with Supervisor 6Ts and came up with a final solution.  I've described that on a separate blog post.)

Wednesday, 8 April 2015

"mpls ip ttl-expiration pop 1"

I encountered an odd problem when tracerouting from outside the University network to a host in the data centre network in my GNS3 simulation:

HOST-JANETW#traceroute 131.111.8.31

Type escape sequence to abort.
Tracing the route to RAVEN.CAM (131.111.8.31)

  1 GW-JANETW.JANETW.JA (128.4.0.254) 76 msec 76 msec 72 msec
  2 JANETW.ENETW.NET (146.97.41.246) 196 msec 96 msec 112 msec
  3  *  *  * 
  4  *  *  * 
  5 DCW-CUDN.DCW-SRV.NET (193.60.88.2) 176 msec 180 msec 144 msec

  6 RAVEN.CAM (131.111.8.31) 188 msec 180 msec 184 msec

[For reference, these hostnames are not the same as the real ones, but the naming system works by using OTHER-ROUTER.THIS-ROUTER.DOMAIN - so JANETW.ENETW.NET is the IP address of ENETW on the link to/from JANETW.]

Hops 3 and 4 went missing, which are those to MILL from ENETW and then to DCW[-CUDN] from MILL; once we reach DCW-SRV from DCW[-CUDN] we get responses again.  (The two DCW names are because they're different VDCs on the same Nexus 7010.)  This is shown below, with the missing hops at the tips of the red arrows:


This puzzled me for about an hour and a half - I was checking routes and trying pings from the affected hosts back to the source, which were working OK.  I then started poking about in more detail to check what exactly happened to the packets...

Plain-IPv4-over-MPLS when you don't expect it

The first thing was to see what happened to the packet on ENETW by consulting CEF to see what it would do with a packet destined for 131.111.8.31:

BDR-ENETW#show ip cef 131.111.8.31 detail 
131.111.8.0/23, epoch 0
  recursive via 193.60.88.2
    nexthop 192.84.5.133 Ethernet1/1 label 33

This shows the packet was matching the 131.111.8.0/23 route which should be sent via 193.60.88.2 (the interface on DCW-SRV coming from DCW[-CUDN]), which is reached recursively by using a next hop of 192.84.5.133 (the interface on MILL at hop 3, coming in from ENETW).  The important bit is to note that the packet will be forwarded out of interface Ethernet1/1 with MPLS label 33.

Dissecting this (as an aside - just to check there wasn't anything completely screwy) - 131.111.8.0/23 is a BGP route from AS65110 (the server network) with a next hop of 193.60.88.2; ENETW itself learns it over iBGP, hence the "internal" type and the distance of 200:

BDR-ENETW#show ip route 131.111.8.31
Routing entry for 131.111.8.0/23
  Known via "bgp 64602", distance 200, metric 0
  Tag 65110, type internal
  Last update from 193.60.88.2 01:41:18 ago
  Routing Descriptor Blocks:
  * 193.60.88.2, from 192.84.5.240, 01:41:18 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65110

193.60.88.2 is, in turn, reached via OSPF with next hop 192.84.5.133, as CEF showed:

BDR-ENETW#show ip route 193.60.88.2
Routing entry for 193.60.88.2/32
  Known via "ospf 1", distance 110, metric 55, type extern 1
  Last update from 192.84.5.133 on Ethernet1/1, 01:45:33 ago
  Routing Descriptor Blocks:
  * 192.84.5.133, from 192.84.5.248, 01:45:33 ago, via Ethernet1/1
      Route metric is 55, traffic share count is 1
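As an aside, the recursive lookup that CEF has already done for us can be sketched in a few lines of Python. This is only a toy model: the table holds just the two routes shown above, and the structure and function names are my own invention rather than anything IOS exposes.

```python
import ipaddress

# Cut-down copy of the two routes shown above (hypothetical structure).
ROUTES = {
    "131.111.8.0/23": {"via": "193.60.88.2", "interface": None},
    "193.60.88.2/32": {"via": "192.84.5.133", "interface": "Ethernet1/1"},
}


def longest_match(address):
    """Return the most specific prefix in ROUTES covering address, or None."""
    addr = ipaddress.ip_address(address)
    candidates = [
        ipaddress.ip_network(prefix)
        for prefix in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    return str(max(candidates, key=lambda net: net.prefixlen))


def resolve(address):
    """Follow recursive next hops until a route names an outgoing interface."""
    hop = address
    for _ in range(len(ROUTES) + 1):      # bound the recursion
        prefix = longest_match(hop)
        if prefix is None:
            return None
        route = ROUTES[prefix]
        if route["interface"] is not None:
            return route["via"], route["interface"]
        hop = route["via"]                # recurse on the BGP next hop
    return None


print(resolve("131.111.8.31"))
```

The BGP route has no interface of its own, so the lookup is repeated for its next hop until the OSPF route supplies one.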

So, back to the label - let's see what that's for:

BDR-ENETW#show mpls ip binding remote-label 33 neighbor 192.84.5.133
  193.60.88.2/32
        out label:    33        lsr: 192.84.5.250:0   inuse

Here, I didn't expect MPLS to be in use as the destination address of the packet isn't one that we advertise labels for via LDP: we limit those to the backbone router loopback addresses and the outside addresses of eBGP peers to which we have inter-AS MPLS VPNs.

However, as our server network operates as a separate AS from the backbone and we use MPLS forwarding between it and the backbone, we do advertise labels for its outside addresses - including 193.60.88.2/32: ENETW has labelled the packet for forwarding across the backbone to DCW-SRV because it has a BGP next hop with a label available.

Next step is to work out what happens to this MPLS frame...

Traceroute over an MPLS network

So, assuming that ENETW does the correct thing and sends the packet on via MPLS to MILL, let's see what that does by debugging ICMP and trying the traceroute again:

CORE-MILL#debug ip icmp
ICMP packet debugging is on
CORE-MILL#
Apr  8 19:53:01.591: Adding 4 bytes of label stack
Apr  8 19:53:01.595: MPLS: ICMP: time exceeded (time to live) sent to 128.4.0.1 (dest was 131.111.8.31)

This shows the MPLS frame is received and its TTL has expired (by default, the PE router which turns the packet from a plain IPv4 packet into a labelled MPLS frame will copy the increasing TTL [as generated by traceroute] into the label).  The P router (MILL) unwraps the packet inside and generates the 'TTL exceeded' ICMP message.

The critical bit, however, is that the P router will then label the ICMP message up with the original MPLS label and forward it on towards the destination (with a TTL reset to 255); it does NOT return it directly to the sender.  It has to do this as the IP addresses inside the MPLS frame may not be ones which exist in the same routing space as the P router.

I verified all this by running Wireshark on the virtual interfaces inside GNS3 (incredibly useful that!) and checking it all looked the way I expected.

This I knew and understood and had investigated in the past (although not in an inter-AS situation) - what I hadn't thought of was what happens next...

The egress PE router

When the egress PE router receives the labelled ICMP 'TTL exceeded' packet, it unwraps it and tries to forward it on to the destination.  In this case, the PE router is DCW-SRV (in the server network) trying to return a packet with a source address of 192.84.5.133 to 128.4.0.1 (HOST-JANETW; an imaginary host, the one my traceroute was initiated from).  This is where things fail...

Because of the recent IP unicast RPF verification testing I've been doing, that source address is blocked from entering the backbone from the server network by the distribution routers to prevent address spoofing: the server network is not allowed to originate addresses it is not responsible for routing.

This can be confirmed by looking on the distribution router, DCW[-CUDN], part of the backbone network:

DIST-DCW#show ip int e1/2 | begin verify
  IP verify source reachable-via RX, ACL 10
   108 verification drops
   0 suppressed verification drops
   0 verification drop-rate

Some debugging shows the naughty source address:

DIST-DCW#debug ip cef drops rpf         
IP CEF drops for RPF debugging is on
DIST-DCW#
*Apr  8 20:08:38.171: CEF-Drop: Packet from 192.84.5.133 via Ethernet1/2 -- via-rx
DIST-DCW#

In the words of Colonel Hans Landa: that's a bingo!  The whole thing is illustrated below, with the rejected ICMP message from the DCW-SRV PE router being marked with an X in red:
[Figure: MPLS traceroute showing ICMP message from PE router being rejected due to RPF]
It also explains why I'd never seen this problem with the Computer Laboratory in my simulation before, as I'd never bothered blocking traffic from other ASs to prevent address spoofing: I'd only ever tested filtering prefixes over eBGP; traffic filtering is something we've always done on the real network through access lists and not of interest in the simulation.

Solving the [non] problem

Once I'd worked out what the actual problem was, solving it turned out to be trivial...

In reality, this problem turns out to be academic: on our real network, we run with "no mpls ip propagate-ttl forwarded" configured on our PE routers, which prevents the TTL of the incoming packet from being copied into the MPLS frame.  This means that the TTL should never expire when crossing the backbone and those hops simply won't show up in a traceroute (so successful hops 5 and 6 will become hops 3 and 4).
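To convince myself of the hop renumbering, here is a small Python model of TTL handling along the path in my traceroute. It is a simplification under my own assumptions: ENETW imposes the label, MILL and DCW are label-switched hops, DCW pops the label before DCW-SRV (penultimate hop popping) and the label TTL is simply copied back on the pop when propagation is on. The function and variable names are made up for the sketch.

```python
def traceroute(path, label_switched, propagate_ttl):
    """List the node answering each probe (TTL 1, 2, ...) up to the destination."""
    answers = []
    last = len(path) - 1
    for probe_ttl in range(1, len(path) + 1):
        ip_ttl = probe_ttl
        label_ttl = None                      # None means the packet is unlabelled
        for position, node in enumerate(path):
            if position == last:              # the destination host replies itself
                answers.append(node)
                return answers
            if node in label_switched:
                if label_ttl is None:         # label was imposed by the previous router
                    label_ttl = ip_ttl if propagate_ttl else 255
                label_ttl -= 1
                expired = label_ttl == 0
                if path[position + 1] not in label_switched:
                    if propagate_ttl:         # pop: copy the label TTL back to IP
                        ip_ttl = label_ttl
                    label_ttl = None
            else:
                ip_ttl -= 1
                expired = ip_ttl == 0
            if expired:                       # this node sends 'TTL exceeded'
                answers.append(node)
                break
    return answers


PATH = ["GW-JANETW", "ENETW", "MILL", "DCW", "DCW-SRV", "RAVEN"]
LSRS = {"MILL", "DCW"}

print(traceroute(PATH, LSRS, propagate_ttl=True))
print(traceroute(PATH, LSRS, propagate_ttl=False))
```

With propagation on, all six hops answer; with it off, MILL and DCW never see an expiring TTL and DCW-SRV and RAVEN move up to hops 3 and 4.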

However, in my GNS3 simulation, I like to expose the intermediate hops to verify complete paths from end station hosts for diagnostic and investigatory purposes.  Indeed, the reason I spotted the problem was whilst testing the paths for traffic when various failures occur, by shutting down parts of the simulation.

The solution to this is to use the "mpls ip ttl-expiration pop 1" command - this causes the ICMP messages for MPLS frames with only a single label depth (which will be plain IPv4-over-MPLS frames) to be returned directly to the sender by the P router, rather than being forwarded on to the PE router to be bounced back.  As such, there is no problem with address spoofing as the IP packet starts its life inside the backbone itself.
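The decision the P router makes can be written down as a one-line rule. The Python below is just my reading of the command (the function name and return strings are invented for the sketch): if the expiring frame carries no more labels than the configured count, the ICMP message is routed normally; otherwise it follows the original label-switched path.

```python
def icmp_time_exceeded_route(label_depth, pop_labels=None):
    """Return how a P router sends the ICMP 'TTL exceeded' message.

    label_depth: number of labels on the frame whose TTL expired.
    pop_labels:  the value given to 'mpls ip ttl-expiration pop', or
                 None if the command is not configured.
    """
    if pop_labels is not None and label_depth <= pop_labels:
        return "routed directly from the P router"
    return "forwarded along the original LSP to the egress PE"


print(icmp_time_exceeded_route(1))                # default behaviour
print(icmp_time_exceeded_route(1, pop_labels=1))  # plain IPv4-over-MPLS
print(icmp_time_exceeded_route(2, pop_labels=1))  # e.g. an MPLS VPN frame
```

Frames with two or more labels still take the long way round, which is what you want when the inner addresses belong to a VPN the P router knows nothing about.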

After doing the latter, my traceroute works:

HOST-JANETW#traceroute 131.111.8.31

Type escape sequence to abort.
Tracing the route to RAVEN.CAM (131.111.8.31)

  1 GW-JANETW.JANETW.JA (128.4.0.254) 16 msec 16 msec 8 msec
  2 JANETW.ENETW.NET (146.97.41.246) 20 msec 52 msec 20 msec
  3 ENETW.MILL.NET (192.84.5.133) [MPLS: Label 33 Exp 0] 68 msec 52 msec 24 msec
  4 MILL.DCW.NET (192.84.5.154) [MPLS: Label 16 Exp 0] 48 msec 68 msec 48 msec
  5 DCW-CUDN.DCW-SRV.NET (193.60.88.2) 80 msec 88 msec 68 msec
  6 RAVEN.CAM (131.111.8.31) 104 msec 56 msec 112 msec

You can even see the MPLS labels used in the previously-missing hops being returned in the ICMP extension block.

One thing that this has shown up is that, once the new data centre network is installed, traceroutes into our server network will not expose the backbone network router hops.  That wasn't something I was expecting and could be confusing for others.

Monday, 6 April 2015

Current GNS3 environment

I've been updating my GNS3 environment (I last posted about this back in August 2013) with some changes recently.  The new model (as of 6th April 2015), updated to work in GNS3 1.3.0, is below:


I added MPLS just under a couple of years ago (I started planning it in mid-2013 and implemented it late in the year), including an EoMPLS xconnect, as well as IPv6 multicast.  The recent changes however are:

  • Added the data centre network (being implemented at the moment) — interfacing to the backbone in the same way as a regular institutional network.
  • Added a second Janet connection (to be implemented soon, when the routers arrive!), with separate Janet router (janetw) and CUDN border router (enetw).  This involved adjusting the internal OSPF link costs and BGP advertisements.
  • Added the NAT service using a pair of IOS routers (instead of the real ASAs).
  • Expanded the Medical School network with a local, static router and added a directly-routed WBIC [Wolfson Brain Imaging Centre] host instead (which doesn't use the Medical School PoP connection, but this doesn't matter for the simulation).
  • Removed MRC CBSU [Cognition and Brain Sciences Unit] as they aren't doing anything particularly interesting any more (they were advertising a /23 we had to relay to Janet and expose and trim the AS path from, but we now do this for them; Engineering now do this, anyway).

Saturday, 4 April 2015

IP unicast RPF source verification

We currently block against spoofed addresses using extended router access lists on the inbound interfaces from institutions (departments and colleges).  Whilst this works, it is very tedious to maintain these access lists (including keeping them synchronised across routers, which we do via a complex external script mechanism).

However, Cisco IOS has a feature called "IP Unicast Reverse Path Forwarding Source Verification", which I've been meaning to look at for ages, that can be used to block spoofed addresses from entering via a routing interface.

The feature is activated using the interface command "ip verify unicast source reachable-via rx" - this blocks any traffic entering via that interface if the interface wouldn't be used to send outbound traffic to the source address: an RPF (Reverse Path Forwarding) check, exactly as multicast does.
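The check itself is easy to sketch in Python: look up the source address in the routing table and compare the outgoing interface with the one the packet arrived on. The routing table, prefixes and interface names below are made up for illustration, and the sketch ignores default routes and equal-cost paths.

```python
import ipaddress

# Hypothetical routing table: (prefix, outgoing interface).
ROUTES = [
    ("172.16.0.0/16", "Ethernet1/0"),
    ("192.168.1.0/24", "Ethernet1/3"),
]


def outgoing_interface(address):
    """Longest-prefix match; returns the interface used to reach address."""
    addr = ipaddress.ip_address(address)
    matches = [
        (ipaddress.ip_network(prefix), interface)
        for prefix, interface in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def rpf_check(source, arrived_on):
    """Pass only if traffic back to the source would leave via arrived_on."""
    return outgoing_interface(source) == arrived_on


print(rpf_check("192.168.1.10", "Ethernet1/3"))  # legitimate source
print(rpf_check("172.16.5.5", "Ethernet1/3"))    # reachable, but via another interface
print(rpf_check("10.1.2.3", "Ethernet1/3"))      # no route back at all
```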

Once the feature is applied, the "show ip interface" command will display the number of packets which have been blocked due to it:

JANETC#show ip int e1/3 | begin verify
  IP verify source reachable-via RX
   9 verification drops
   0 suppressed verification drops
   0 verification drop-rate

Asymmetric paths

For the feature to work, it is essential that traffic paths are symmetrical across any interfaces on which it is used.  This is normally the case for simple edge interfaces as there can only be a single path to the address ranges which are connected and statically routed across them.

However, on links where dynamic routing takes place, traffic can arrive via the non-reverse path, resulting in asymmetric paths.  On the internal links of a network, this is usually not a problem as the traffic across those can be trusted, as long as the edge of the network is protected.

The only place where this presents a problem is where dynamic routing takes place across an administrative boundary (such as those between ASs where eBGP peerings are used).  Here, the obvious solution is to fix the traffic paths such that they are symmetric, which is required if multicast is to function correctly anyway.

However, the verification feature supports an access list to allow packets to pass, even in the event of the RPF check failing.  This is specified on the end of the "ip verify ..." command, e.g.:

ip access-list standard 3
 permit 128.232.0.0 0.0.255.255
 permit 129.169.0.0 0.0.255.255
 permit 131.111.0.0 0.0.255.255
!
interface Ethernet1/3
 ip verify unicast source reachable-via rx 3

The number of packets which have been permitted through this mechanism will also show up in the IP interface information under "suppressed verification drops":

JANETC#show ip int e1/3 | begin verify
  IP verify source reachable-via RX, ACL 3
   31 verification drops
   39 suppressed verification drops
   0 verification drop-rate
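How the two counters relate can be modelled in Python as below. This is a sketch of my understanding rather than of the IOS internals: the RPF result is passed in as a flag, the ACL entries are copied from access list 3 above, and the function names and test addresses are my own.

```python
import ipaddress

# Source ranges exempted by access list 3 above: (address, wildcard mask).
ACL_3 = [
    ("128.232.0.0", "0.0.255.255"),
    ("129.169.0.0", "0.0.255.255"),
    ("131.111.0.0", "0.0.255.255"),
]


def acl_permits(source):
    """True if the source matches any permit entry (wildcard bits ignored)."""
    src = int(ipaddress.IPv4Address(source))
    for base, wildcard in ACL_3:
        care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
        if (src & care) == (int(ipaddress.IPv4Address(base)) & care):
            return True
    return False


def verify(source, rpf_ok, counters):
    """Return True if the packet passes; update the counters otherwise."""
    if rpf_ok:
        return True
    if acl_permits(source):                # RPF failed but the ACL lets it through
        counters["suppressed verification drops"] += 1
        return True
    counters["verification drops"] += 1    # RPF failed and no exemption
    return False


counters = {"verification drops": 0, "suppressed verification drops": 0}
print(verify("131.111.8.31", False, counters))  # exempted by the ACL
print(verify("10.0.0.1", False, counters))      # dropped
print(counters)
```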

Interaction with access lists

When combining this feature with inbound access lists on an interface, the access lists are applied BEFORE the source address verification.  As such, packets which the access list blocks will not show up in the address verification counts, shown above.
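The ordering can be pictured as a two-stage pipeline. In the Python sketch below, the packet format, the example ACL (blocking telnet) and the source check are all hypothetical stand-ins; the point is only that a packet dropped by the access list never reaches the verification stage, so it never touches the counter.

```python
def process_inbound(packet, inbound_acl, rpf_ok, counters):
    """Apply the inbound access list first, then source verification."""
    if not inbound_acl(packet):
        return "dropped by access list"         # never counted by verification
    if not rpf_ok(packet):
        counters["verification drops"] += 1
        return "dropped by source verification"
    return "forwarded"


def block_telnet(packet):
    """Hypothetical inbound access list: deny TCP port 23, permit the rest."""
    return packet["dport"] != 23


def source_on_subnet(packet):
    """Hypothetical stand-in for the RPF check on a 192.168.1.0/24 interface."""
    return packet["src"].startswith("192.168.1.")


counters = {"verification drops": 0}
print(process_inbound({"src": "10.9.9.9", "dport": 23}, block_telnet, source_on_subnet, counters))
print(process_inbound({"src": "10.9.9.9", "dport": 80}, block_telnet, source_on_subnet, counters))
print(process_inbound({"src": "192.168.1.5", "dport": 80}, block_telnet, source_on_subnet, counters))
print(counters)
```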

The two features can be usefully combined, as access lists do not need to be written to handle the validation of source addresses - they can just permit or deny the traffic required, based on ports and protocols, leaving source address validation to be done by the verification feature.  This is especially useful when multinetting is used.  I expect this will allow us to simplify many of our access lists.