Sunday, 21 June 2015

ECMP with OSPF, BGP and MPLS

As reported earlier, we've been gradually enabling ECMP (Equal-Cost Multi-Path routing) across the University backbone to make better use of links and increase effective bandwidth.  For the most part, this is pretty straightforward, but there were a few gotchas, so I thought I'd document both: the enabling and the gotchas.

Note that ECMP typically enables load-sharing and NOT load-balancing:

  • Load-sharing is about distributing traffic across the active paths, typically by hashing the source and/or destination addresses of each packet; this can result in an uneven distribution, especially if the traffic is concentrated between a small number of addresses (see the example after this list).
  • Load-balancing attempts to distribute traffic such that it is evenly split across the available paths.
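
On IOS, you can see the result of this per-flow hashing with show ip cef exact-route, which reports the path a given source/destination pair would use.  The addresses and output below are illustrative rather than copied from a real router:

CORE-CENT#show ip cef exact-route 131.111.8.1 129.169.80.10
131.111.8.1 -> 129.169.80.10 => IP adj out of Ethernet1/4, addr 192.84.5.18

Running it again with a different source address may (or may not) hash onto the other path, which is exactly why a small number of talkers can load one link much more than the other.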

Note that ECMP using routing protocols will usually only distribute traffic where routers are both the ingress and egress point: it does nothing for inbound traffic from a simple host with a single subnet gateway or static default route.  In the absence of a dynamic routing protocol on the host, solutions such as Cisco's GLBP (Gateway Load Balancing Protocol) can help, although I'm not going to cover that here.
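For reference, a GLBP configuration looks something like the following sketch; the interface, group number and addresses here are made up for illustration:

interface Vlan10
 ip address 10.10.0.2 255.255.255.0
 ! hosts use the GLBP virtual IP as their gateway; GLBP answers ARP
 ! with different virtual MACs, spreading hosts across the routers
 glbp 10 ip 10.10.0.1
 glbp 10 load-balancing round-robin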

OSPFv2 and OSPFv3

OSPF is easy to do - you simply increase the number of paths used:

router ospf 1
 maximum-paths 2
!
ipv6 router ospf 1
 maximum-paths 2

... the parameter to maximum-paths specifies how many of the available [lowest and best] equal-cost paths calculated by OSPF are loaded into the router's active forwarding table.  IOS supports a value of up to 6.

This command must be entered on all routers in the network and applies at each hop.  For example, in a traditional two-layer core and distribution model:
  • On the ingress distribution router, it will distribute traffic across uplinks to the core routers.
  • On the core routers, it will distribute traffic across the downlinks to the egress distribution routers which serve the destination address.
Once entered, the show ip route ... command can be used to confirm multipath is in operation — here on a core router:

CORE-CENT#show ip route 131.111.10.10
Routing entry for 131.111.10.0/24
  Known via "ospf 1", distance 110, metric 27, type extern 1
  Last update from 192.84.5.18 on Ethernet1/4, 00:01:07 ago
  Routing Descriptor Blocks:
  * 192.84.5.34, from 192.84.5.238, 00:01:07 ago, via Ethernet1/2
      Route metric is 27, traffic share count is 1
    192.84.5.18, from 192.84.5.234, 00:01:07 ago, via Ethernet1/4
      Route metric is 27, traffic share count is 1

Personally, I find the show ip cef ... detail command a little clearer (and explains MPLS better, when we get round to that):

CORE-CENT#show ip cef 131.111.10.10 detail 
131.111.10.0/24, epoch 0, per-destination sharing
  local label info: global/33
  nexthop 192.84.5.18 Ethernet1/4
  nexthop 192.84.5.34 Ethernet1/2

That's all there is to it, although if you use DHCP relaying, first hop redundancy (VRRP, HSRP or GLBP) and address spoofing protection, then note the messy problem with those I covered in an earlier article!

BGP

Enabling multipath for BGP routes is similar to OSPF:

router bgp 64602
 address-family ipv4 unicast
  maximum-paths 2
  maximum-paths ibgp 2
 exit-address-family
 !
 address-family ipv6 unicast
  maximum-paths 2
  maximum-paths ibgp 2
 exit-address-family

The obvious difference is the two separate commands:
  • maximum-paths ... applies to routes learnt over eBGP peerings
  • maximum-paths ibgp ... applies to routes learnt over iBGP peerings, even if they are external in origin (i.e. learnt from an eBGP ASBR in the same AS)
Because of this, usually only the latter (the iBGP version) is required on core routers and BGP route reflectors (as they don't usually have eBGP peers).
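
On such a router, a pared-down version of the earlier configuration is all that's needed; a sketch, reusing the AS number from above:

router bgp 64602
 address-family ipv4 unicast
  ! no eBGP peers here, so only the iBGP variant is required
  maximum-paths ibgp 2
 exit-address-family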

Route Reflectors (RRs) also present an additional wrinkle: an RR will only reflect a single route to its clients, namely the one it itself considers best, based on the normal BGP selection process (which can include the IGP cost).  This does not usually cause a problem: as long as the RR client forwards the traffic via the RR (as it does in our network, where the RRs are the core routers), the RR can multipath the traffic onwards itself across the available paths.  However, it did cause a problem when trying to share traffic across the core to our internet gateways (more on that later).

Note that you do not (and cannot) enable multipath for the multicast address families.  The multicast routes are for RPF checking; multipath for multicast traffic forwarding is done separately (and I've not yet looked into it, so there's nothing in this article about it).

eBGP example

Let's look at an example: 129.169.0.0/16 is a block of addresses used by the Department of Engineering.  Their network is AS65106, separate from the University backbone (which is AS64602).  They connect to the backbone via a series of /30 link subnets in 193.60.93.16/28, across which the eBGP peerings operate.

(Note that we're using a contrived simulation here, so the details are not necessarily accurate to reality.)
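
As a rough sketch, the backbone side of one of these peerings might look something like this (the neighbour address is taken from the outputs below; the rest is illustrative):

router bgp 64602
 ! eBGP peering to the Department of Engineering over one /30 link
 neighbor 193.60.93.18 remote-as 65106
 address-family ipv4 unicast
  neighbor 193.60.93.18 activate
 exit-address-family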

Looking in the routing table for 129.169.80.10 on one of the core routers (the covering prefix is, in my example, advertised equally across two of these links):

CORE-CENT#show ip route 129.169.80.10
Routing entry for 129.169.0.0/16
  Known via "bgp 64602", distance 200, metric 0
  Tag 65106, type internal
  Last update from 193.60.93.26 00:02:11 ago
  Routing Descriptor Blocks:
    193.60.93.26, from 192.84.5.234, 00:02:11 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65106
      MPLS label: none
  * 193.60.93.18, from 192.84.5.236, 00:02:11 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65106
      MPLS label: none

This shows the two links via eBGP:
  • one via 193.60.93.26 which was learnt from iBGP peer 192.84.5.234
  • the other via 193.60.93.18 which was learnt from iBGP peer 192.84.5.236
Checking the actual forwarding on a core router with show ip cef:

CORE-CENT#show ip cef 129.169.80.10 detail   
129.169.0.0/16, epoch 0, flags rib only nolabel, rib defined all labels, per-destination sharing
  recursive via 193.60.93.18
    recursive via 193.60.93.16/30
      nexthop 192.84.5.26 Ethernet1/6
  recursive via 193.60.93.26
    recursive via 193.60.93.24/30
      nexthop 192.84.5.18 Ethernet1/4

Note the two levels of recursion:
  • 193.60.93.18 and 193.60.93.26 are the addresses of the eBGP border routers from which the routes were learnt
  • these match the link subnet routes 193.60.93.16/30 and 193.60.93.24/30, respectively (from the IGP)
  • these, in turn, resolve via the IGP next hops 192.84.5.26 (on Eth1/6) and 192.84.5.18 (on Eth1/4)
Just for completeness, let's delve into the BGP database:

CORE-CENT#show bgp ipv4 unicast 129.169.80.10
BGP routing table entry for 129.169.0.0/16, version 31
Paths: (2 available, best #2, table default)
Multipath: iBGP
  Advertised to update-groups:
     1         
  Refresh Epoch 1
  65106, (aggregated by 65106 129.169.252.1), (Received from a RR-client), (received & used)
    193.60.93.18 (metric 27) from 192.84.5.236 (192.84.5.236)
      Origin IGP, metric 0, localpref 100, valid, internal, atomic-aggregate, multipath(oldest)
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  65106, (aggregated by 65106 129.169.252.2), (Received from a RR-client), (received & used)
    193.60.93.26 (metric 27) from 192.84.5.234 (192.84.5.234)
      Origin IGP, metric 0, localpref 100, valid, internal, atomic-aggregate, multipath, best

      rx pathid: 0, tx pathid: 0x0

MPLS L3 VPNs

On the surface, MPLS L3 VPNs look straightforward and similar to BGP — you just need to use maximum-paths in the corresponding BGP VRF stanzas:

router bgp 64602
 address-family ipv4 unicast vrf ucs-staff_vrf
  maximum-paths 2
  maximum-paths ibgp 2
 exit-address-family
 !
 address-family ipv6 unicast vrf ucs-staff_vrf
  maximum-paths 2
  maximum-paths ibgp 2
 exit-address-family

(In the above case, the VRFs have eBGP peerings to non-MPLS peers: ones using a regular ipv4 unicast address family with the peering inside a VRF, as opposed to a vpnv4 peering with non-VRF addresses (the so-called "carrier's carrier" service).  Hence the need for the maximum-paths 2 line.  If the VPN were only carrying connected and static routes inside the AS, only the maximum-paths ibgp 2 line would be needed, as the routes would all be internal.)

Close, but no cigar: only a single path was being used — looking on the ingress PE router:

DIST-HOSP#show ip cef vrf eng_vrf 129.169.10.10 detail 
129.169.10.0/24, epoch 0
  recursive via 192.84.5.234 label 52
    nexthop 192.84.5.29 Ethernet1/0 label 18

And BGP confirms only a single route is available:

DIST-HOSP#show bgp vpnv4 unicast vrf eng_vrf 129.169.10.10
BGP routing table entry for 64602:129:129.169.10.0/24, version 6
Paths: (1 available, best #1, table eng_vrf)
Flag: 0x820
  Not advertised to any peer
  Local
    192.84.5.234 (metric 15) from 192.84.5.240 (192.84.5.240)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:64602:129
      Originator: 192.84.5.234, Cluster list: 192.84.5.240
      Connector Attribute: count=1
       type 1 len 12 value 64602:129:192.84.5.234
      mpls labels in/out nolabel/52

Odd.  Let's have a look on a core RR P router; note that we have to look up the route using the RD itself (64602:129, a combination of our AS number plus a local ID), as there is no VRF configured on a P router:

CORE-CENT#show bgp vpnv4 uni rd 64602:129 129.169.10.10
BGP routing table entry for 64602:129:129.169.10.0/24, version 5
Paths: (2 available, best #1, no table)
Flag: 0x820
  Advertised to update-groups:
        2
  Local, (Received from a RR-client)
    192.84.5.234 (metric 8) from 192.84.5.234 (192.84.5.234)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:64602:129
      Connector Attribute: count=1
       type 1 len 12 value 64602:129:192.84.5.234
      mpls labels in/out nolabel/52
  Local, (Received from a RR-client)
    192.84.5.238 (metric 8) from 192.84.5.238 (192.84.5.238)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:64602:129
      Connector Attribute: count=1
       type 1 len 12 value 64602:129:192.84.5.238
      mpls labels in/out nolabel/17

Both routes are present here, but only one is being selected: the one from 192.84.5.234 (marked as "best"), chosen because it has the lower router ID (there being no other distinguishing attribute); the one from 192.84.5.238 is not reflected on to the clients.

Distinguishing Route Distinguishers

This was a bit of a mystery until I found this post by Ivan Pepelnjak which described how RRs behave with multiple VPN routes and finally made the distinction between Route Distinguishers and Route Targets clear to me (and he admits it's not clearly explained in the Cisco textbooks):
  • The Route Distinguisher (RD) is combined with the IPv4 or IPv6 prefix of each route to build a complete route ID you can imagine being in the form "RD:prefix", e.g. "64602:129:129.169.10.0/24" (you can see this in the first line of the output from show bgp vpnv4 ..., above).  The purpose of the RD is to distinguish the routes belonging to one VPN from those of another in the provider network.
  • The Route Target (RT) specifies which VRFs a particular route should be exported from and imported into, when a VRF is configured on a particular router.
The important point is that an RR will only reflect a single route with a particular route ID: if the PE routers all use the same RD, the RR will reflect only one of these routes.  Changing this behaviour requires extensions to the BGP protocol which only appeared in IOS 15.2, with the router bgp ... / bgp additional-paths ... commands.
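
For completeness, a rough sketch of what that additional-paths configuration might look like on IOS 15.2 or later; treat this as illustrative rather than tested, since (as described next) we took the different-RD approach instead:

router bgp 64602
 address-family vpnv4 unicast
  ! compute and advertise up to two paths per prefix, not just the best
  bgp additional-paths select best 2
  bgp additional-paths send
  neighbor 192.84.5.234 advertise additional-paths best 2
 exit-address-family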

Without this new capability, the solution is to use a different RD on each PE router, resulting in distinct routes in the provider network.  This made us revisit how we assign RDs and RTs:
  • We now set the RD to be the public IPv4 loopback address of the PE router, instead of using the local [private] ASN; RDs support this as a standard format, leaving 16 bits for the administratively-assigned ID.  This changes our RDs from 64602:id to 192.84.5.x:id.
  • On the other hand, the import and export RTs typically need to be the same for all VRFs across the VPN (unless partial imports are to be used).  For consistency with the new RD format, we've changed those from ASN:id (64602:id, the same as the old RD) to 192.84.5.0:id (192.84.5.0/24 being the block we use for our router backbone and loopback addresses).  A configuration sketch follows the note below.
(Note that the address-family vpnv4 stanza does have a maximum-paths statement, but that doesn't appear to do anything useful: I'm not sure what the point of it is!)
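
Here's a sketch of the resulting per-VRF configuration on a PE router, using the eng_vrf values from the outputs below (exact syntax varies by IOS version; this uses the newer vrf definition form):

vrf definition eng_vrf
 ! the RD is now unique per PE (this router's loopback), so the RRs
 ! treat each PE's route as distinct
 rd 192.84.5.234:129
 address-family ipv4
  ! the RTs stay common across the VPN, so every PE still imports
  ! all the routes
  route-target export 192.84.5.0:129
  route-target import 192.84.5.0:129
 exit-address-family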

Once changed, both routes now show up as best on the core router when searched for by their new RDs: as far as it is concerned, they are completely separate routes, albeit carrying the same RT.  It's only on the PE router that the two routes are brought back together, when the common RT imports them both into the VRF:

CORE-CENT#show bgp vpnv4 unicast rd 192.84.5.234:129 129.169.10.10
BGP routing table entry for 192.84.5.234:129:129.169.10.0/24, version 3
Paths: (1 available, best #1, no table)
  Advertised to update-groups:
     1         
  Refresh Epoch 1
  Local, (Received from a RR-client)
    192.84.5.234 (metric 8) from 192.84.5.234 (192.84.5.234)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:192.84.5.0:129
      mpls labels in/out nolabel/54
      rx pathid: 0, tx pathid: 0x0

CORE-CENT#show bgp vpnv4 unicast rd 192.84.5.238:129 129.169.10.10
BGP routing table entry for 192.84.5.238:129:129.169.10.0/24, version 2
Paths: (1 available, best #1, no table)
  Advertised to update-groups:
     1         
  Refresh Epoch 1
  Local, (Received from a RR-client)
    192.84.5.238 (metric 8) from 192.84.5.238 (192.84.5.238)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:192.84.5.0:129
      mpls labels in/out nolabel/56
      rx pathid: 0, tx pathid: 0x0

Looking on the PE router, both routes now appear and are marked as "multipath".  Note that the route ID shown contains the local RD for this VRF (192.84.5.237:129):

DIST-HOSP#show bgp vpnv4 unicast vrf eng_vrf 129.169.10.10
BGP routing table entry for 192.84.5.237:129:129.169.10.0/24, version 12
Paths: (2 available, best #2, table eng_vrf)
Multipath: iBGP
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from 192.84.5.238:129:129.169.10.0/24 (global)
    192.84.5.238 (metric 15) from 192.84.5.240 (192.84.5.240)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath(oldest)
      Extended Community: RT:192.84.5.0:129
      Originator: 192.84.5.238, Cluster list: 192.84.5.240
      mpls labels in/out nolabel/56
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local, imported path from 192.84.5.234:129:129.169.10.0/24 (global)
    192.84.5.234 (metric 15) from 192.84.5.240 (192.84.5.240)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath, best
      Extended Community: RT:192.84.5.0:129
      Originator: 192.84.5.234, Cluster list: 192.84.5.240
      mpls labels in/out nolabel/54
      rx pathid: 0, tx pathid: 0x0

Finally, show ip cef ... will confirm that multipath is in use:

DIST-HOSP#show ip cef vrf eng_vrf 129.169.10.10 detail    
129.169.10.0/24, epoch 0, flags rib defined all labels, per-destination sharing
  recursive via 192.84.5.234 label 54
    nexthop 192.84.5.29 Ethernet1/0 label 33
  recursive via 192.84.5.238 label 56
    nexthop 192.84.5.29 Ethernet1/0 label 17

Final note about ECMP with MPLS

The other thing to note about MPLS is that it is the ingress PE router, rather than the P routers, which determines the egress PE router.  This is because the outer label the ingress PE places on the traffic is the label for the egress PE: the P routers simply forward the traffic according to that label, hop by hop, without making any routing decision of their own.

As such, multipathing using MPLS does not require any configuration on the P routers.
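
You can see this on the ingress PE by checking the label it uses to reach the egress PE's loopback; the output below illustrates the format rather than being copied from a real router (but note the outgoing label 33 matching the first path in the show ip cef output above):

DIST-HOSP#show mpls forwarding-table 192.84.5.234 32
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
31         33         192.84.5.234/32  0             Et1/0      192.84.5.29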
