Multicast Lessons Learned from Decades of Deployment Experience
lispers.net
farinacci@gmail.com
Juniper
lenny@juniper.net
Futurewei
michael.mcbride@futurewei.com
This document gives a historical perspective on the design and deployment of multicast routing protocols. It describes the technical
challenges discovered while building these protocols. Although multicast has been deployed successfully in special use cases, we discuss what
were, and are, the obstacles to mass deployment across the Internet.
There are many multicast-related drafts and RFCs covering IPv4, IPv6, tunnel-based and label-based solutions. These protocols include
DVMRP, PIM-DM, PIM-SM, PIM-BIDIR, PIM-SSM, MSDP, MBGP, MVPN, P2MP RSVP-TE, MLDP, BIER, LISP, MOSPF, IGMP, MLD and several others. Perhaps due to these
many multicast protocols, and their perceived complexity compared with unicast, there has been much angst over deploying IP Multicast over the last 30 years.
It is not uncommon for technical discussions of multicast routing to drift into questions such as what makes up a multicast address,
whether that address identifies the source content or the set of receivers, whether multicast creates too much state in the network, why it hasn't captured
the heart of the Internet, why it is so complicated, and what the best multicast protocol is. Despite the existence of multicast-related
BCPs, the authors felt it important to have a draft which helps answer some of these questions by identifying the lessons learned from multicast
development and deployment over the last 30 years. We attempt to better understand the current, and future, state of multicast affairs by reviewing the
distractions, hype and innovation over the years and what we've learned from the evolution of IP Multicast.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.
PIM: Protocol Independent Multicast
PIM-DM: PIM Dense Mode
PIM-SM: PIM Sparse Mode
PIM-BIDIR: PIM Bi-Directional
PIM-SSM: PIM Source Specific Multicast
DVMRP: Distance Vector Multicast Routing Protocol
MVPN: Multicast Virtual Private Network
MSDP: Multicast Source Discovery Protocol
MBGP: Multiprotocol Border Gateway Protocol
BIER: Bit Index Explicit Replication
IGMP: Internet Group Management Protocol
MLD: Multicast Listener Discovery
P2MP RSVP-TE: Point-to-Multipoint Resource Reservation Protocol - Traffic Engineering
MLDP: Multicast Label Distribution Protocol
MOSPF: Multicast OSPF
In this section we address various topics that are relevant enough to warrant a discussion of what we've learned since their
development. We start with one of the original multicast routing protocols, the Distance Vector Multicast Routing Protocol (DVMRP).
DVMRP computes its own routing table to determine the best path back to the source. It uses a distance-vector routing algorithm, similar to RIP, that
relies on hop count and requires each router to periodically inform its neighbors of its routing table. DVMRP was built around a unicast-style routing algorithm, but it also had
tree-building messages which formed distribution trees that could be pruned. There are no join messages in DVMRP because the RPF tree
is the default distribution tree. The flooding and pruning of DVMRP was a good initial solution, but we quickly realized that it wouldn't scale when
using increasingly higher bit rates for multicast content. Using the network to discover sources was also originally thought to be a
good idea but was later discovered to be resource and state intensive. DVMRP depended upon itself as a routing protocol to build the RPF table rather than
using the existing unicast routing tables, as the later-developed PIM-SM does. DVMRP worked well for small-scale deployments but began to suffer when deployed
in larger multicast environments, so we needed better solutions.
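The flood-and-prune behavior can be illustrated with a small sketch. This is a toy model, not DVMRP itself: the topology, the router names and the `flood_and_prune` function are assumptions made for illustration, and real prunes are timed and per (source, group).

```python
def flood_and_prune(children, receivers, root):
    """Toy flood-and-prune on a tree rooted at the source's first-hop router.

    children:  dict mapping a router to its downstream routers (hypothetical topology)
    receivers: set of routers that have local group members
    Returns (routers reached by the initial flood, routers left after pruning).
    """
    # Flood: with no join messages, every router on the RPF tree gets the traffic.
    flooded = []
    stack = [root]
    while stack:
        node = stack.pop()
        flooded.append(node)
        stack.extend(children.get(node, []))

    # Prune: a router stays on the tree only if it has local receivers
    # or at least one downstream router that stays on the tree.
    kept = {root}

    def wanted(node):
        downstream = [wanted(child) for child in children.get(node, [])]
        stays = node in receivers or any(downstream)
        if stays:
            kept.add(node)
        return stays

    wanted(root)
    return sorted(flooded), sorted(kept)


# Hypothetical topology: S feeds A and B; A feeds C and D; B feeds E.
TOPOLOGY = {"S": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
flooded, kept = flood_and_prune(TOPOLOGY, receivers={"D"}, root="S")
print(flooded)  # every router carried the traffic at first
print(kept)     # only the branch towards D remains
```

In this example all six routers carry the stream until the prunes take effect, even though only one of them has a receiver, which is the cost that grows with bit rate and network size.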
With PIM shared trees, all sources send to the root of a shared distribution tree called the Rendezvous Point (RP). When multicast group members
join a group, they cause branches to be appended to the existing shared tree. New sources that send to the multicast group
send their traffic to the RP so existing receivers can receive packets. Multicast packets travel encapsulated from the source to the RP
and are then sent natively down the shared-tree branches. When a better/shorter path is desired, a source tree can be built. A source tree is a multicast
distribution tree rooted at the source. As receivers on the shared tree discover new sources, they join those sources on the source tree. The path
on the source tree is determined by the unicast routing table and is also known as the "RPF path". With source trees,
multicast traffic bypasses the RP and instead flows from the multicast source down the tree towards the receivers using the multicast forwarding
table and the shortest available path. There is machinery to allow the multicast data to switch from the shared tree to a source tree once the source is
discovered. Shared trees were designed to reduce state at a time when memory was scarce and expensive, while shortest-path trees were simpler and more optimal
but consumed more state.
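The path difference between the two tree types can be made concrete with a small sketch. The topology, the router names and the helper functions below are assumptions for illustration only, and hop count stands in for whatever metric the unicast routing protocol actually uses.

```python
from collections import deque


def build_graph(links):
    """Build an undirected adjacency map from a list of (router, router) links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hops(graph, start, goal):
    """Hop count of the shortest path between two routers (breadth-first search)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node]
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    raise ValueError(f"no path from {start} to {goal}")


def shared_tree_hops(graph, source, rp, receiver):
    """Source to RP (register path), then RP down the shared tree to the receiver."""
    return hops(graph, source, rp) + hops(graph, rp, receiver)


def source_tree_hops(graph, source, receiver):
    """Shortest path from the source straight to the receiver, bypassing the RP."""
    return hops(graph, source, receiver)


# Hypothetical topology: the RP sits off to the side of the S-A-B-R path.
GRAPH = build_graph([("S", "A"), ("A", "B"), ("B", "R"), ("A", "X"), ("X", "RP")])
print(shared_tree_hops(GRAPH, "S", "RP", "R"))  # via the RP
print(source_tree_hops(GRAPH, "S", "R"))        # shortest path
```

With this layout the shared tree costs seven hops against three on the source tree. The trade-off on the other side is state: roughly one (*,G) entry per group on the shared tree versus one (S,G) entry per source per group on source trees.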
Utilizing the network to provide the discovery of sources and receivers, and the machinery necessary to provide it, was an important
development at the time. But there was no way to discover sources when adhering to the Deering model. The Deering model was like an Ethernet: sources
could just send and receivers would just receive the packets.
When Deering augmented multicast routing, the receivers then needed to be discovered, so he added IGMP. He decided not to have source discovery,
and as he continued developing the model he added DVMRP, where the sources still didn't need to be discovered because their packets would flow
down a default distribution tree, and the per-group tree was later pruned so packets wouldn't flow where there were no receivers. When PIM was built, we
wanted to change the default behavior so that multicast packets would go nowhere until explicit joins built a tree. We had to fix the flood-and-prune
problem that DVMRP had. We fixed that problem but didn't provide any explicit signaling from the sources to discover them. So the multicast routing protocol
discovered the sources (via the PIM shared tree).
Having two types of trees was the hard part. Switching from one tree (shared) to the other (source) was a difficult routing distribution problem, because
as you joined the source tree, you had to prune that source from the shared tree so duplicates wouldn't continue for a long time. As protocol designers and
implementors, we found that a challenge to get right. What we later realized was that we needed source trees where
the multicast source is discovered outside of the network, thus removing the source discovery burden from the network. Source discovery originally had to be
performed in the network because the multicast service model did not have a signaling mechanism like we now have with SSM and IGMPv3.
During this process we also learned that PIM-SM (or more generally ASM (Any Source Multicast)) is more susceptible to DoS attacks by unwanted sources than
PIM-SSM is. Address allocation with ASM is also much more restrictive than it is with PIM-SSM.
When a router, with a directly connected source (First Hop Router), receives the first multicast packet of a stream, it selects an optimal route from the unicast routing
table based on the source address of the packet.
The outbound interface of the unicast route towards the source is the RPF interface, and the next hop of the route is the RPF neighbor. The router compares the
inbound interface of the packet with the RPF interface of the selected RPF route. If the inbound interface is the same as the RPF interface, the router considers that
the packet has arrived on the correct path from the source and forwards the packet downstream. If a router performed a lookup in the unicast routing table for an RPF
check on every multicast data packet received, system resources would be overwhelmed. To save system resources, a router first performs a lookup for the matching (S, G)
entry after receiving a data packet sent from a source to a group. If no matching (S, G) entry is found, the router performs an RPF check to find the RPF interface for
the packet. The router then creates a multicast route with the RPF interface as the upstream interface towards the source and delivers the route to the multicast
forwarding information base (MFIB). If the RPF check succeeds, that is, the inbound interface of the packet is the RPF interface, the router forwards the packet to all
the downstream interfaces in the forwarding entry. If the RPF check fails, the packet has arrived along an incorrect path, so the router drops the packet.
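The data-driven procedure above can be sketched as follows. This is a minimal model under stated assumptions, not any vendor's implementation: the `MulticastRouter` class, its field names, the interface names and the way downstream interfaces are supplied per group are all hypothetical.

```python
import ipaddress


class MulticastRouter:
    """Toy model of data-driven (S,G) state creation with a cached RPF interface."""

    def __init__(self, unicast_routes, group_oifs):
        # unicast_routes: list of (prefix, outbound interface) pairs
        self.routes = [(ipaddress.ip_network(prefix), ifname)
                       for prefix, ifname in unicast_routes]
        # group_oifs: group -> interfaces with downstream joins (assumed to be
        # learned elsewhere, for example from IGMP reports or PIM joins)
        self.group_oifs = group_oifs
        self.mfib = {}        # (source, group) -> {"rpf_if": str, "oifs": set}
        self.rpf_lookups = 0  # counts unicast table lookups, for illustration

    def rpf_interface(self, source):
        """Longest-prefix match on the source address in the unicast table."""
        self.rpf_lookups += 1
        address = ipaddress.ip_address(source)
        matches = [(net, ifname) for net, ifname in self.routes if address in net]
        if not matches:
            return None
        return max(matches, key=lambda match: match[0].prefixlen)[1]

    def receive(self, source, group, in_if):
        """Return the interfaces the packet is forwarded on ([] means dropped)."""
        key = (source, group)
        entry = self.mfib.get(key)
        if entry is None:
            # No (S,G) entry yet: consult the unicast table once and cache the result.
            rpf_if = self.rpf_interface(source)
            if rpf_if is None:
                return []  # no route back to the source
            entry = {"rpf_if": rpf_if, "oifs": set(self.group_oifs.get(group, ()))}
            self.mfib[key] = entry
        if in_if != entry["rpf_if"]:
            return []  # RPF check failed: packet arrived on the wrong interface
        return sorted(entry["oifs"] - {in_if})


router = MulticastRouter(
    unicast_routes=[("10.0.0.0/8", "eth0"), ("10.1.0.0/16", "eth1")],
    group_oifs={"239.1.1.1": {"eth2", "eth3"}},
)
print(router.receive("10.1.2.3", "239.1.1.1", "eth1"))  # passes RPF, forwarded
print(router.receive("10.1.2.3", "239.1.1.1", "eth0"))  # fails RPF, dropped
print(router.rpf_lookups)                               # unicast table consulted once
```

The point of the cache is visible in the counter: only the first packet of the stream triggers a unicast lookup, and later packets are checked against the stored RPF interface. The sketch does not model what happens when the unicast route changes, which is where the cached entry becomes stale.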
The RPF check is a security feature, but it has caused some problems. RPF changes create inconsistencies in the MFIB, which can cause
forwarding failures. Problems may also occur when hosts (not IP forwarders) are configured with an RPF check. It is important to note that SSM doesn't have
the data-driven state creation described above. It's also important to note the subtle difference between a "state problem" and a "state problem on a particular
platform from a particular vendor".
PIM runs on a control-plane processor where the multicast routing table is maintained, and (S,G) state is downloaded to data-plane hardware forwarders. Whenever
there is an RPF change, every route that changed in the multicast routing table has to be updated in the hardware forwarders.
Multicast was not originally supported with MPLS. That is a lesson learned in and of itself. The workaround was point-to-point GRE tunnels from CE to CE, which did not scale
to many CE routers. MVPN solutions were complicated at times in the IETF. The MVPN complexity was organic because PE-based unicast VPNs were already deployed,
so simpler multicast designs were not an option. The architecture was already built and multicast functionality was an incremental add-on, which made it easier to deploy, but the cost of
running the service was the same as, or worse than, running unicast VPNs. We had years of debate about PIM-based draft-rosen MVPN vs BGP-based MVPN
using P2MP RSVP-TE. Cisco wound up progressing an independent submission because it defined procedures which predated the publication of IETF
MVPN standards, and these procedures differ in some respects from a fully standards-compliant implementation. Eventually the PIM-based and BGP-based MVPN solutions were progressed
together in "Multicast in MPLS/BGP IP VPNs". Perhaps one lesson learned here is that there will often be a conflict between providing timely implementations
for customer needs and waiting for slow-moving standards to work themselves out. A combined draft from the beginning, providing multiple multicast VPN solutions, would
have helped prevent years of conflict and non-standards-compliant solutions. Another lesson is that it was good to decouple the control plane from the data plane so that the
control plane could scale better and the data plane could have more options. Tunnels may now be built by PIM (any flavor), Multicast LDP (P2MP or MP2MP) or P2MP RSVP-TE, and we can
map multiple provider multicast service interfaces (PMSIs) onto one aggregated tunnel.
SD and SDR were good initial applications, but we didn't go far enough with them to help source discovery, since the application layer is indeed a better place to handle source discovery
than the network. SDR is a session directory tool designed to allow the advertisement and joining of multicast streams, particularly targeted at the Mbone.
The Mbone (multicast backbone) was an experimental backbone and virtual network built on top of the Internet for carrying IP multicast traffic. The
Session Directory Revised tool (SDR) was developed to help discover the group and port used for a multicast
multimedia session. The original Session Directory (SD) tool was written at Lawrence Berkeley Laboratory and was
replaced by SDR. SDR is a multicast application that listens for Session Announcement Protocol (SAP) packets on a well-known multicast group.
These SAP packets contain a session description, the time the session is active, its IP multicast group
addresses, media format, contact person and other information about the advertised multimedia session. In
hindsight we should have continued developing SDR to help more fully with source discovery, perhaps by utilizing HTTP. That would have been better than focusing on the network
to provide multicast source discovery.
For multicast to function, every layer 3 hop between the sourcing and
receiving end hosts must support a multicast routing protocol. This may
not be a difficult challenge for enterprises and walled-garden networks
where the benefits of multicast are perceived to be much greater than the
costs to deploy (e.g., financial, video distribution, MVPN SPs, etc.).
However, on the global Internet, where the cost/benefits of multicast (or
any service, for that matter) are not likely to ever be universally agreed
upon, this "all or nothing" requirement tends to create an insurmountable
barrier. It should be noted that IPv6 suffers the same challenge, which
explains why IPv6 has not been ubiquitously deployed across the Internet
to the same degree as IPv4, despite decades of trying. Simply put, any
technology that requires new protocols to be enabled on every interface on
every router and firewall on the Internet is not likely to succeed.
One approach to address this challenge is to develop solutions that
facilitate incremental deployment and minimize/eliminate the need for
coordination of multiple parties. Overlay networking is one such approach
and allows the service to work for end users without requiring every
underlay hop to support multicast; only the layer 3 hops in the overlay
topology require multicast support. For example, Automatic Multicast Tunneling (AMT) allows end
users on unicast-only networks to receive multicast content by dynamically
tunneling to devices (AMT Relays) on multicast-enabled networks. This
empowers interested end users to enjoy the service while also enabling
content providers and operators who have deployed multicast to realize the
benefits of more efficient delivery while tunneling over the parts of the
network (last/middle/first mile) that haven't deployed multicast.
Further, this incremental approach can provide the necessary incentive for
operators who haven't deployed multicast natively to do so in order to
avoid carrying duplicate tunneled traffic. Another example is the Locator/ID Separation Protocol (LISP),
where multicast sources and receivers can be on the overlay and work with any combination of unicast and/or
native multicast delivery from the underlay. Endpoint identifiers (EIDs) are assigned to end hosts. Routing locators (RLOCs) are
assigned to devices (primarily routers) that make up the global routing system. The LISP overlay nodes can roam while keeping the same EID address,
can be multi-homed to load-split packets across multiple interfaces, and can encrypt packets at the overlay layer
(freeing applications from dealing with security).
In ASM, the network is responsible for discovering all multicast sources. This responsibility leads to massive protocol complexity, which
imposes a huge operational cost for designing, operating and troubleshooting multicast. In SSM, source discovery is moved out of the network
and is handled by some sort of out-of-band mechanism, typically in the application layer. By eliminating network-based source discovery in SSM,
we eliminate the need for shared trees, PIM register message encap/decap, RPs, SPT-switchover, data-driven state creation and MSDP, and
the resulting protocol, PIM-SSM, is dramatically simpler than previous ASM routing protocols. Indeed, PIM-SSM is merely a small subset of
PIM-SM functionality. The key insight is that source discovery is not a function the network should provide. One would never expect ISIS/OSPF
and BGP to discover and maintain a globally synchronized database of all active websites on the Internet, yet that is precisely what is required
of PIM-SM and MSDP for ASM. This insight can apply more generally to other functions, like accounting, access control, transport reliability, etc.
One simple heuristic for whether a function should exist in the multicast routing protocol is to simply ask what would unicast do (WWUD)? If
unicast routing protocols like OSPF, ISIS or BGP do not provide such a function, then multicast routing protocols like PIM should not be expected
to provide that function either. Further, moving functionality to the application layer, rather than the network layer, allows faster innovation
and greater levels of creativity, as these two layers tend to have vastly different requirements and expectations (and, therefore, upgrade cycles) for
stability, scale, functionality and innovation.
Premature optimization can saddle protocols with complexity burdens long after the optimizations are no longer relevant, or even
before the optimizations can be used. Typically those optimizations are implemented for scale even though you don't need or see a need for them
in early deployments. But they must be thought of ahead of time and planned for (which means designed and implemented up front). Shared trees
were born in the 1990s out of a (well-founded at the time) concern for state exhaustion when memory was a scarce resource. As memory got
cheaper and more abundant, these concerns were reduced, but the complexity remained. It was once ironically noted that we eliminated the state problem
by making the protocols so complex that no one deployed them. To be fair, though, other protocols have also had state problems,
and private enterprises have successfully used multicast in their walled gardens without state problems.
In hindsight, what we should have done with multicast is the same thing QUIC did: implement it as a library rather than in the kernel. If we had done
that, then when an app that needs a network function is deployed, the function arrives at the same time (inside the app). This is similar to what we have
done with AMT in VLC, which was a practical decision to get apps access to a native multicast cloud.
Packaging the protocol stack in the application allows a developer to add features and fix bugs quickly, and to get the updates deployed quickly by
having users download and update the app. This rather modern way of distributing new code has proved successful in many mobile and cloud-based
environments. With respect to multicast, we could have deployed changes faster to IGMP as well as to any tunneling technology we felt useful.
IGMPv1 was the first protocol to allow end hosts to indicate their interest in receiving a multicast stream. There was no message to indicate that a receiver
had stopped receiving the multicast stream, so the router had to eventually figure it out. This caused bandwidth problems, especially when quickly changing channels.
IGMPv2 provided a leave message to prevent wasted bandwidth. IGMPv3 added support for source-specific multicast: IGMPv1 and IGMPv2 do not have the
capability to specify a particular sender of multicast traffic, whereas IGMPv3 does.
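The bandwidth cost of having no leave message can be estimated with a back-of-the-envelope model. The function name and the timer values in the example (a 180-second membership timeout without a leave message, a 2-second leave latency with one) are illustrative assumptions, not values taken from the IGMP specifications.

```python
def channels_still_flowing(now, zap_interval, leave_latency):
    """Count streams still delivered to a LAN while one host changes channels.

    The host joins channel 0 at t=0 and switches to the next channel every
    `zap_interval` seconds. The router keeps forwarding a channel the host has
    left for `leave_latency` seconds (a membership timeout when there is no
    leave message, or a short last-member query period when there is one).
    """
    current = int(now // zap_interval)
    flowing = 0
    for channel in range(current + 1):
        left_at = (channel + 1) * zap_interval
        # The channel being watched always flows; older ones flow until they expire.
        if channel == current or now < left_at + leave_latency:
            flowing += 1
    return flowing


# One minute of channel surfing, switching every 5 seconds (assumed values).
print(channels_still_flowing(now=61, zap_interval=5, leave_latency=180))  # no leave message
print(channels_still_flowing(now=61, zap_interval=5, leave_latency=2))    # explicit leave
```

Under these assumed timers, thirteen streams are still being delivered after a minute without a leave message, against two with one. Multiplying by the per-channel bit rate gives the wasted bandwidth on the access link.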
In hindsight we could have easily developed SSM with IGMPv2 from the start. All an (S,G) is, is a longer group address. So if we changed IGMPv2 to have a more
general encoding, we would have created IPv6 groups, IPv6 (S,G), and IPv4 (S,G) encoding all at the same time. And, if we had made it a library, it would have
likely been deployed faster. Additionally, because we were working on "Integrated IS-IS" and "IPv6" all at the same time, we could have developed one protocol - similar
to how we do it for BGP today. PIM was integrated but it was developed as "ships in the night" with other protocols.
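The idea that an (S,G) is just a longer group address can be sketched as a single length-prefixed encoding. The format below is purely hypothetical and is not an IGMP or MLD wire format; the function names and the two-byte header (address family, key length) are assumptions chosen to show one encoding covering IPv4 groups, IPv4 (S,G), IPv6 groups and IPv6 (S,G).

```python
import ipaddress


def encode_channel(group, source=None):
    """Encode a group, or a (source, group) pair, as: family, key length, key bytes."""
    group_addr = ipaddress.ip_address(group)
    key = group_addr.packed
    if source is not None:
        source_addr = ipaddress.ip_address(source)
        if source_addr.version != group_addr.version:
            raise ValueError("source and group must be in the same address family")
        key = source_addr.packed + key  # (S,G) is simply a longer key
    return bytes([group_addr.version, len(key)]) + key


def decode_channel(data):
    """Return (source or None, group) from the hypothetical encoding above."""
    version, length = data[0], data[1]
    key = data[2:2 + length]
    address_len = 4 if version == 4 else 16
    if length == address_len:
        return None, str(ipaddress.ip_address(key))
    if length == 2 * address_len:
        source = ipaddress.ip_address(key[:address_len])
        group = ipaddress.ip_address(key[address_len:])
        return str(source), str(group)
    raise ValueError("unexpected key length")


for args in [("239.1.1.1",), ("232.1.1.1", "192.0.2.1"),
             ("ff3e::1234",), ("ff3e::1234", "2001:db8::1")]:
    encoded = encode_channel(*args)
    print(len(encoded), decode_channel(encoded))
```

A receiver of such a message would not need separate code paths for the four cases: the address family and key length together say which one it is.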
We've learned many things over the years about the problems (such as high packet error rates, no acknowledgements and low data rates)
with deploying multicast in 802.11 (Wi-Fi) networks. We even created a document specifically
to address the many ways multicast is problematic over Wi-Fi. Performance issues, for instance, have been observed over the years when multicast packets are transmitted over
IEEE 802 wireless media, so much so that multicast is often disallowed over Wi-Fi networks. Various workarounds have been developed, including converting multicast to unicast at
layer 2 (aka ingress replication) in order to more successfully transit the wireless medium. There are various optimizations that can be implemented to mitigate some of the
many issues involving multicast over Wi-Fi. The lesson we've learned is that we (vendors, IETF) should have worked closely with the IEEE many years ago on detailing the
problems in order to improve the performance of multicast transmissions at layer 2. The IEEE is now designing features to improve multicast performance over Wi-Fi, but it's
expensive to do so and will take time.
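The multicast-to-unicast workaround can be sketched in a few lines. This is an illustrative model only: the frame dictionaries, the `multicast_to_unicast` function and the membership table are assumptions, and a real access point would learn membership by snooping IGMP or MLD.

```python
def multicast_to_unicast(frame, members_by_group):
    """Replace one multicast frame with one unicast frame per subscribed station.

    frame:            {"dst_group": group address, "payload": bytes}
    members_by_group: group address -> set of station MAC addresses
    """
    stations = members_by_group.get(frame["dst_group"], set())
    return [{"dst_mac": mac, "payload": frame["payload"]} for mac in sorted(stations)]


# Hypothetical membership table on an access point.
MEMBERS = {"239.1.1.1": {"02:00:00:00:00:02", "02:00:00:00:00:01"}}
copies = multicast_to_unicast({"dst_group": "239.1.1.1", "payload": b"video"}, MEMBERS)
print([copy["dst_mac"] for copy in copies])
```

Each copy is an ordinary unicast frame, so it can be acknowledged and sent at the station's negotiated rate. The cost is that the number of transmissions now grows with the number of receivers, which is exactly what multicast was meant to avoid.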
Beau Williamson's publications helped with some of the history of the protocols discussed.