Thursday, 7 August 2014

IPv4 multicast over an MPLS VPN using mGRE

I've wanted to sort out multicast forwarding over an MPLS VPN for some time.  While most institutions at the University of Cambridge don't have multicast-enabled networks, it seemed like a loose end that might prevent take-up of the service.

After some poking about, trying to find free stuff online and failing, it seemed like MPLS and VPN Architectures vol. II from CiscoPress would explain it: in particular Chapter 7 ("Multicast VPN").  The good news is that it does.

However, the book dates from 2003.  I also found an update document on Cisco's website, from about 2008, describing developments in MBGP to handle multicast over MPLS VPNs ("Multicast VPN - IP Multicast Support for MPLS VPNs").  The book was reprinted in paperback a month or so ago (late June 2014) and I thought it might have been updated to cover that material.  Unfortunately it hasn't, although the changes are minimal, so it doesn't hurt to start with the book and then read the update.

Note: the solution described here uses mGRE (multipoint GRE); much of this is being replaced by MLDP (Multicast Label Distribution Protocol).  The latter looks like the right solution in the longer term, but it is fairly recent (dating from 2010) and needs IOS 15 on the Cisco platform, I believe.  I'll look at that later.

Anyway — onto the technical bits...

The key facts

There are several key things to know in advance, which I think help in understanding multicast over an MPLS VPN:
  • Firstly, it doesn't actually involve MPLS: multicast traffic is forwarded completely differently from unicast traffic and doesn't involve any MPLS labelled frames.
  • Instead, multicast traffic inside the VPN is tunnelled, by encapsulating it using GRE, inside multicast traffic in the global space.  This is commonly called 'mGRE' (multipoint GRE).
  • In the global space, one or more MDTs (Multicast Distribution Trees) are constructed to carry this tunnelled traffic between the PE routers in the VPN.  When traffic is received on one of these trees, it is de-encapsulated and forwarded to the VRF (Virtual Routing and Forwarding) instance for the VPN.  Inside the VRF, traffic is forwarded into the MDT when it needs to reach other routers in the VPN.
  • MDTs essentially put all the participating routers on a single layer 2 segment: when querying the PIM neighbours on the tunnel interface created by an MDT, all of the other routers will be seen.
  • The first MDT is known as the Default MDT — all PE routers in the VPN will join this tree and it will, by default, be used to forward all multicast traffic on the VPN.
  • If there is a lot of multicast traffic and it is not typically required on all the PE routers (because only some have members) then this is inefficient.  To handle this situation, Data MDTs can be configured to set up separate groups in the global space — Cisco let you do this for groups where the bandwidth exceeds a certain threshold and/or with an address matched by an access list.
  • The Data MDTs use a pool of group addresses which is automatically recycled as groups come and go, much like a DHCP address pool.  If the pool is exhausted, group addresses will be reused on a least-used basis (the addresses with the fewest customer groups mapped to them are picked first).
  • The switch of a particular group from the Default MDT to a Data MDT is signalled by a PIM message forwarded across the Default MDT.
  • Note that the decision to switch traffic to a Data MDT is made by the ingress PE router (nearest the source), rather than by the last hop router (as would be the case with SPT switchover under regular PIM).
  • If the groups for the MDTs are in a PIM-SSM range (232.0.0.0/8, by default), the BGP address-family ipv4 mdt can be used to locate the other PE routers in the MVPN (there's a sketch of this below).  This is not necessary if the groups use PIM-SM with an RP.
  • The VPN must be configured to provide a BSR and RP in the normal manner: these work as normal across the MDTs (so a single BSR and RP can be set up on a PE for the VPN, or even inside the customer network).
  • The interfaces to CE routers work just as normal and also support BSR and RP discovery.
This arrangement is designed to be scalable and to limit control plane resources in the provider network: the Data MDT group address pool lets the provider stop a large number of customer groups in the VPN from creating a similar amount of multicast state on the P routers, limiting the effect to just the PE routers.

All this requires a good understanding of how multicast works, plus careful tuning of the PIM modes (SM, SSM, Bidir) and the MDT group addresses, to get the best performance.
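
If you do use an SSM range for the MDT groups, the BGP side is small: each PE carries the ipv4 mdt address family on its iBGP sessions so it can learn the (MDT group, source) pairs of the other PEs.  A rough sketch of what that might look like on a PE is below; the AS number (64512) is a made-up private AS and the neighbour address is just one of the lab loopbacks - my simulation actually uses PIM-SM with an RP for the MDT groups, so it doesn't need this:

router bgp 64512
 neighbor 192.84.5.236 remote-as 64512
 neighbor 192.84.5.236 update-source Loopback0
 !
 ! advertise the (MDT group, source) pairs so the PEs can find each other
 ! when the MDT groups are in an SSM range
 address-family ipv4 mdt
  neighbor 192.84.5.236 activate
 exit-address-family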

Keeping traffic separate

The main alarm bell for me in all this was keeping traffic inside the VPN from escaping onto the global network, where it could be intercepted or have false traffic inserted.

This problem is actually straightforward: Cisco provide the ip multicast boundary command, which filters multicast traffic on an interface.  If the group addresses used by the MDTs are in a range not permitted by the access list given to that command, they will not pass across the interface, keeping the traffic safe.

In our case, we filter 239.255.0.0/16 as being 'institution private', so it never passes across an interface between an edge subnet or institution and our backbone.  If the MDT groups are created in this range, they should be safe.
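
As a rough sketch of how that looks, the boundary is just a standard access list applied to an interface - the ACL number and interface here are illustrative rather than our actual configuration:

! standard ACL: deny the institution-private range, permit other multicast
access-list 10 deny   239.255.0.0 0.0.255.255
access-list 10 permit 224.0.0.0 15.255.255.255
!
interface Ethernet1/1
 description link-to-institution
 ip pim sparse-mode
 ! groups not permitted by ACL 10 will not cross this interface
 ip multicast boundary 10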

Testing multicast boundaries

To check this all worked, I ran some tests in GNS3, with Wireshark sniffing the virtual link to a host in the global space which had joined the Default MDT group.  By switching the filtering on and off, I could watch the effect...
  • Installing a boundary had an immediate effect, stopping the MDT traffic from reaching the host.  The interface connecting the host disappeared from the outgoing interface list in show ip mroute.
  • However, the group still showed up as joined in the output of show ip igmp group, even though traffic did not pass.  It would remain there until the IGMP reporting timer expired; other groups would refresh, but not this one.
  • When the boundary was removed, traffic did not immediately start flowing again, even while the group remained in the IGMP list.  However, when the host retransmitted its IGMP membership report, traffic started flowing to it once more.

Example

(Update 2015-05-24 - I'm about to convert my GNS3 simulation over to MLDP, now that I've upgraded to IOS 15, so I thought I'd do some tests using mGRE before dismantling it.)

A quick demo of this all in action and the output of some show commands.

Configuration

The configuration below builds on a working [non-VRF] PIM-SM on the backbone, as well as an MPLS IPv4 VPN.  I'm doing this in IOS 15.
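
For reference, this assumes the usual global multicast and MPLS configuration on each backbone router; roughly the following (interface names are from my lab, and the rest of the MPLS VPN setup - LDP, MP-BGP vpnv4 and so on - is outside the scope of this post):

ip multicast-routing
!
interface Loopback0
 ! the MDT tunnels are sourced from Loopback0, so it needs PIM enabled
 ip pim sparse-mode
!
interface Ethernet1/0
 description backbone-link
 ip pim sparse-mode
 mpls ip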

The MDTs are configured in the VRF, identically on each router, in the address-family block; the extra commands are the three mdt lines:

vrf definition eng_vrf
 rd 192.84.5.238:129
 route-target export 192.84.5.0:129
 route-target import 192.84.5.0:129
 !
 address-family ipv4
  mdt default 239.255.32.1
  mdt data 239.255.36.0 0.0.0.31
  mdt data threshold 10
 exit-address-family
!
ip multicast-routing vrf eng_vrf

I've deliberately set the Data MDT switchover threshold high to avoid it being triggered initially.

Note that IOS 15 replaces the mdt data ... threshold ... option with a separate mdt data threshold command that operates per-VRF.  If you use the old format, you will receive a warning that it's deprecated.
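
For comparison, the pre-IOS 15 combined form was a single line in place of the two mdt data commands above, something like this (shown purely for reference):

  mdt data 239.255.36.0 0.0.0.31 threshold 10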

Once multicast routing is enabled for the VRF, a VRF interface is configured in the same way as a non-VRF one:

interface Ethernet1/2.210
 description eng-vpn-nms
 encapsulation dot1Q 210
 vrf forwarding eng_vrf
 ip address 129.169.10.253 255.255.255.0
 !
 ip pim bsr-border
 ip pim sparse-mode
 ip igmp version 3

Checking the Default MDT

Once this has been configured on all routers, the Default MDT should be formed across the backbone (using 239.255.32.1, in the above example) and they should all discover each other as if on a single network segment using PIM:

DIST-NMS#show ip pim vrf eng_vrf neighbor 
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
192.84.5.236      Tunnel3                  00:00:15/00:01:35 v2    1 / S P G
192.84.5.237      Tunnel3                  00:00:39/00:01:29 v2    1 / S P G

The above router, DIST-NMS, has found the other routers across the Default MDT on the backbone via the Tunnel3 interface; their loopback addresses in 192.84.5.x are shown.  Neither neighbour is flagged as DR because that role is ours (our loopback, 192.84.5.238, is the highest address):

DIST-NMS#show ip pim vrf eng_vrf int tu3        

Address          Interface                Ver/   Nbr    Query  DR     DR
                                          Mode   Count  Intvl  Prior
192.84.5.238     Tunnel3                  v2/S   2      30     1      192.84.5.238

You can check the MDTs and their tunnel interfaces as follows:

DIST-NMS#show ip pim vrf eng_vrf mdt
  * implies mdt is the default MDT
  MDT Group/Num   Interface   Source                   VRF
* 239.255.32.1    Tunnel3     Loopback0                eng_vrf

The Default MDT group is visible in the global IP multicast forwarding table - note the Z flag, showing the group is being used as an mGRE multicast tunnel:

DIST-NMS#show ip mroute 239.255.32.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.255.32.1), 00:06:39/stopped, RP 192.84.5.240, flags: SJCFZ
  Incoming interface: Ethernet1/0, RPF nbr 192.84.5.33
  Outgoing interface list:
    MVRF eng_vrf, Forward/Sparse, 00:06:39/00:02:20

(192.84.5.236, 239.255.32.1), 00:05:40/00:01:10, flags: JTZ
  Incoming interface: Ethernet1/0, RPF nbr 192.84.5.33
  Outgoing interface list:
    MVRF eng_vrf, Forward/Sparse, 00:05:40/00:00:19

(192.84.5.237, 239.255.32.1), 00:06:04/00:00:59, flags: JTZ
  Incoming interface: Ethernet1/0, RPF nbr 192.84.5.33
  Outgoing interface list:
    MVRF eng_vrf, Forward/Sparse, 00:06:04/00:02:56

(192.84.5.238, 239.255.32.1), 00:06:39/00:02:54, flags: FT
  Incoming interface: Loopback0, RPF nbr 0.0.0.0
  Outgoing interface list:
    Ethernet1/0, Forward/Sparse, 00:06:38/00:02:50

The PIM domain inside the VRF

The multicast forwarding table for the VRF is initially empty:

DIST-NMS# show ip mroute vrf eng_vrf
IP Multicast Routing Table
...

(*, 224.0.1.40), 00:59:21/00:02:14, RP 0.0.0.0, flags: DCL
  Incoming interface: Null, RPF nbr 0.0.0.0
  Outgoing interface list:
    Ethernet1/2.210, Forward/Sparse, 00:59:20/00:02:14

In my simulation, the PIM-SM RP is on a router in an institutional (departmental) network, not part of the MPLS backbone.  This is advertised as normal using PIM BSR and detected over the MDT:

DIST-NMS#show ip pim vrf eng_vrf rp mapping 
PIM Group-to-RP Mappings

Group(s) 224.0.0.0/4
  RP 129.169.252.1 (TRUMP.ENG), v2
    Info source: 129.169.252.1 (TRUMP.ENG), via bootstrap, priority 64, holdtime 150
         Uptime: 00:17:01, expires: 00:02:12
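
For completeness, the RP/BSR setup on that institutional router is just the ordinary candidate configuration.  Roughly, and assuming the RP address sits on a loopback (this is a sketch rather than that router's exact config - the priority in the output above may come from an explicit setting):

interface Loopback0
 ip address 129.169.252.1 255.255.255.255
 ip pim sparse-mode
!
! advertise this router as candidate BSR and candidate RP for all groups
ip pim bsr-candidate Loopback0
ip pim rp-candidate Loopback0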

Adding group members

I'm using 234.131.111.10 as a test group.  First, I set up two group members on institutional networks using ip igmp join-group ....  Here's one:

interface FastEthernet0/0
 description eng-vpn-medschl
 ip address 129.169.86.86 255.255.255.0
 ip igmp join-group 234.131.111.10

These show up in the multicast forwarding table on the RP, ENG-TRUMP:

ENG-TRUMP#show ip mroute 234.131.111.10
IP Multicast Routing Table
...

(*, 234.131.111.10), 00:24:05/00:03:14, RP 129.169.252.1, flags: SJC
  Incoming interface: Null, RPF nbr 0.0.0.0
  Outgoing interface list:
    Ethernet1/0.252, Forward/Sparse, 00:21:03/00:03:14
    Ethernet1/2, Forward/Sparse, 00:24:05/00:01:58

On that router, Ethernet1/2 is the link to the backbone (PE) router, whose interface is inside the VRF; that path leads to the member configured above.  Ethernet1/0.252 has another, local member.

Adding a source

Now I can start a source by setting a ping going on an end host (with IP 129.169.10.10) to the group address and see the responses from the two members:

HOST-ENG-NMS#ping 234.131.111.10 repeat 1000

Type escape sequence to abort.
Sending 1000, 100-byte ICMP Echos to 234.131.111.10, timeout is 2 seconds:

Reply to request 0 from HOST-TRUMP.ENG (129.169.254.1), 444 ms
Reply to request 0 from HOST-MEDSCHL.ENG (129.169.86.86), 572 ms
Reply to request 1 from HOST-MEDSCHL.ENG (129.169.86.86), 208 ms
Reply to request 1 from HOST-TRUMP.ENG (129.169.254.1), 404 ms
...

The multicast forwarding table on the first hop router will show traffic being forwarded over the tunnel:

DIST-NMS#show ip mroute vrf eng_vrf 234.131.111.10
IP Multicast Routing Table
...

(*, 234.131.111.10), 00:00:10/stopped, RP 129.169.252.1, flags: SPF
  Incoming interface: Tunnel3, RPF nbr 192.84.5.236
  Outgoing interface list: Null

(129.169.10.10, 234.131.111.10), 00:00:10/00:03:22, flags: FT
  Incoming interface: Ethernet1/2.210, RPF nbr 0.0.0.0
  Outgoing interface list:
    Tunnel3, Forward/Sparse, 00:00:09/00:03:20

Because the traffic is below the configured threshold (10 kbit/s), no Data MDT has been created - we can check on the first hop router:

DIST-NMS#show ip mroute vrf eng_vrf 234.131.111.10 active  
Use "show ip mfib active" to get better response time for a large number of mroutes.

Active IP Multicast Sources - sending >= 4 kbps

DIST-NMS#show ip pim vrf eng_vrf mdt send                  
DIST-NMS#

And on a receiving router:

DIST-HOSP#show ip pim vrf eng_vrf mdt receive detail 
DIST-HOSP#

Increasing the data rate

Now everything's working, let's increase the rate of traffic by making the ping send packets back to back, without waiting for each reply:

HOST-ENG-NMS#ping 234.131.111.10 rep 100000 timeout 0

Type escape sequence to abort.
Sending 100000, 100-byte ICMP Echos to 234.131.111.10, timeout is 0 seconds:
......................................................................
......................................................................
...

And check the traffic data rate:

DIST-NMS#show ip mroute vrf eng_vrf 234.131.111.10 active 
Use "show ip mfib active" to get better response time for a large number of mroutes.

Active IP Multicast Sources - sending >= 4 kbps

Group: 234.131.111.10, (MCAST-131-111-10.CAM)
   Source: 129.169.10.10 (HOST-NMS.ENG)
     Rate: 141 pps/113 kbps(1sec), 113 kbps(last 40 secs), 3 kbps(life avg)

Now the data rate has gone above the configured 10 kbit/s threshold, a Data MDT should have been created and the traffic switched across to it.  This can be confirmed on the first hop router, along with the address of the new group:

DIST-NMS#show ip pim vrf eng_vrf mdt send                 

MDT-data send list for VRF: eng_vrf
  (source, group)                     MDT-data group/num   ref_count
  (129.169.10.10, 234.131.111.10)     239.255.36.0         1

On the egress (from the mGRE cloud) router, this can be confirmed:

DIST-HOSP#show ip pim vrf eng_vrf mdt receive detail 
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector

Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: eng_vrf
 [239.255.36.0 : 192.84.5.238]  00:04:56/00:00:36
  (129.169.10.10, 234.131.111.10), 00:10:39/00:01:44/00:00:36, OIF count: 1, flags: JTY

Once the source stops and the Data MDT times out (after about 3 minutes), it will be torn down.