< draft-marques-l3vpn-mcast-edge-00.txt   draft-marques-l3vpn-mcast-edge-01.txt >
Network Working Group P. Marques Network Working Group P. Marques
Internet-Draft Contrail Systems Internet-Draft Contrail Systems
Intended status: Standards Track L. Fang Intended status: Standards Track L. Fang
Expires: October 31, 2012 Cisco Systems Expires: December 01, 2012 Cisco Systems
D. Winkworth D. Winkworth
FIS FIS
Y. Cai Y. Cai
P. Lapukhov
Microsoft Corporation Microsoft Corporation
May 2012 June 2012
Edge multicast replication for BGP IP VPNs. Edge multicast replication for BGP IP VPNs.
draft-marques-l3vpn-mcast-edge-00 draft-marques-l3vpn-mcast-edge-01
Abstract Abstract
In data-center networks it is common to use Clos network topologies In data-center networks it is common to use Clos network topologies
[clos] in order to provide a non-blocking switched network. In these [clos] in order to provide a non-blocking switched network. In these
topologies it is often not desirable to provide native IP multicast topologies it is often not desirable to provide native IP multicast
service. service.
This document defines a multicast replication algorithm along with This document defines a multicast replication algorithm along with
its control and data forwarding procedures that provides a multicast its control and data forwarding procedures that provides a multicast
skipping to change at page 1, line 43 skipping to change at page 1, line 44
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on October 31, 2012. This Internet-Draft will expire on December 01, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (http://trustee.ietf.org/ Provisions Relating to IETF Documents (http://trustee.ietf.org/
license-info) in effect on the date of publication of this document. license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components and restrictions with respect to this document. Code Components
extracted from this document must include Simplified BSD License text extracted from this document must include Simplified BSD License text
as described in Section 4.e of the Trust Legal Provisions and are as described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Simplified BSD License. provided without warranty as described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. VPN Forwarder behavior . . . . . . . . . . . . . . . . . . . . 5 3. VPN Forwarder behavior . . . . . . . . . . . . . . . . . . . . 6
4. Multicast tree management . . . . . . . . . . . . . . . . . . 7 4. Multicast tree management . . . . . . . . . . . . . . . . . . 8
5. BGP Protocol Extensions . . . . . . . . . . . . . . . . . . . 11 5. BGP Protocol Extensions . . . . . . . . . . . . . . . . . . . 11
5.1. Multicast Tree Route Type . . . . . . . . . . . . . . . . 11 5.1. Multicast Tree Route Type . . . . . . . . . . . . . . . . 12
5.2. Multicast Edge Discovery Attribute . . . . . . . . . . . . 11 5.2. Multicast Edge Discovery Attribute . . . . . . . . . . . . 12
5.3. Multicast Edge Forwarding Attribute . . . . . . . . . . . 12 5.3. Multicast Edge Forwarding Attribute . . . . . . . . . . . 13
6. Security Considerations . . . . . . . . . . . . . . . . . . . 12 6. Security Considerations . . . . . . . . . . . . . . . . . . . 14
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7.1. Normative References . . . . . . . . . . . . . . . . . . . 13 7.1. Normative References . . . . . . . . . . . . . . . . . . . 14
7.2. Informational References . . . . . . . . . . . . . . . . . 13 7.2. Informational References . . . . . . . . . . . . . . . . . 15
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 15
1. Introduction 1. Introduction
In Wide-Area Networks having native multicast service on hop-by-hop In Wide-Area Networks having native multicast service on hop-by-hop
basis allows for more efficient use of scarce link bandwidth. In basis allows for more efficient use of scarce link bandwidth. In
Clos network topologies [clos] the trade-offs are different. Clos network topologies [clos] the trade-offs are different.
A Clos network is often used to provide full cross-sectional A Clos network is often used to provide full cross-sectional
bandwidth between all the ports on the network. When used in a bandwidth between all the ports on the network. When used in a
switching infrastructure it achieves this goal by spreading flows switching infrastructure it achieves this goal by spreading flows
skipping to change at page 3, line 29 skipping to change at page 3, line 30
The solution itself does not assume a specific topology on the The solution itself does not assume a specific topology on the
underlying infrastructure network. We simply assume that it is underlying infrastructure network. We simply assume that it is
undesirable to use native multicast service. This can be a result of undesirable to use native multicast service. This can be a result of
topology as per the CLOS example above or some other constraint that topology as per the CLOS example above or some other constraint that
makes it undesirable to create multicast groups based on the overlay makes it undesirable to create multicast groups based on the overlay
topology. topology.
2. Overview 2. Overview
This document defines a mechanism to construct and manage multicast
distribution trees for overlay networks that does not rely on the
underlying physical network to provide multicast capabilities. The
solution places an upper bound on the number of copies that a
particular network node has to generate in contrast with ingress
replication in which the ingress node must generate one packet
replica for each receiver in the group.
Using this approach, ingress node and link load are traded off for
additional packet replication steps in other nodes in the network.
This is achieved by building a K-ary tree where each node is
responsible for generating up-to K replicas. For a multicast group
with m receivers the height of the tree is approximately log_K(m),
where the height of the tree determines the maximum number of
forwarding hops required to deliver a packet to a receiver.
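As a rough illustration of the bound above, the following Python sketch computes the height of a K-ary tree that is filled level by level. The function name tree_height and the level-by-level fill are assumptions made for this example only; the draft lets a Signaling Gateway use any tree management algorithm, so a real tree may be less balanced.

```python
def tree_height(m: int, k: int) -> int:
    """Height of the shallowest K-ary tree that can hold m nodes.

    Assumption (not from the draft): the tree is filled level by level,
    so level i holds up to k**i nodes.
    """
    if m < 1 or k < 2:
        raise ValueError("need m >= 1 and k >= 2")
    height = 0
    capacity = 1      # a tree of height 0 holds only the root
    level_width = 1   # number of nodes on the deepest level so far
    while capacity < m:
        level_width *= k          # each node fans out to K children
        capacity += level_width   # total nodes up to this level
        height += 1
    return height


# With K = 4, 1000 nodes fit in a tree of height 5, close to log_4(1000).
print(tree_height(1000, 4))
```

Under these assumptions no node generates more than K copies, whereas ingress replication for the same group would require the ingress node to generate one copy per receiver.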
A separate overlay distribution tree is constructed for each
multicast group, using an MPLS label to identify the tree at each
hop. The nodes in the tree are VPN forwarders with local receivers
for the specific group. The tree uses a bi-directional forwarding
algorithm. A shared tree is used for all the sources in the group in
the case of an ASM group.
The distribution tree is constructed hierarchically:
1. Signaling Gateways build a tree that contains all locally
registered VPN forwarders with local multicast receivers,
observing the out-degree constraint K.
2. Each Signaling Gateway announces a collection of available edges
that can be used to join its local distribution tree with other
trees built by other Signaling Gateways. The number of such
edges also respects the out-degree constraint.
3. One of the Signaling Gateways, previously elected to assume the
role of "tree manager" for the specific group, assigns the edges
that connect the lowest level trees together and advertises this
information to the other Signaling Gateways.
IP hosts use IGMP [RFC3376]/MLD [RFC3810] to request the delivery of IP hosts use IGMP [RFC3376]/MLD [RFC3810] to request the delivery of
multicast packets for a particular (*, g) or (s, g). Discovery multicast packets for a particular (*, g) or (s, g). Discovery
applications where the intent is to allow applications to discover applications where the intent is to allow applications to discover
the group membership use (*, g) JOINs. Content delivery applications the group membership use (*, g) JOINs. Content delivery applications
may use an (s, g) JOIN after initially performing discovery either may use an (s, g) JOIN after initially performing discovery either
via multicast or by other means. via multicast or by other means.
In the context of end-system VPNs, the VPN Forwarder acts as an IGMP In the context of end-system VPNs, the VPN Forwarder acts as an IGMP
querier on the virtual interfaces and receives IGMP/MLD Membership querier on the virtual interfaces and receives IGMP/MLD Membership
Report packets. It uses this information to generate VPN-specific Report packets. It uses this information to generate VPN-specific
skipping to change at page 4, line 28 skipping to change at page 5, line 17
connected and there are no cycles. The resulting graph is a spanning connected and there are no cycles. The resulting graph is a spanning
tree. tree.
The Signaling Gateway can use any algorithm to manage the graph. In The Signaling Gateway can use any algorithm to manage the graph. In
practice, we expect that the Signaling Gateway would attempt to practice, we expect that the Signaling Gateway would attempt to
minimize the cost of the tree subject to the out-degree constraint minimize the cost of the tree subject to the out-degree constraint
(at most K edges) while also minimizing the disruption caused by each (at most K edges) while also minimizing the disruption caused by each
individual node JOIN or LEAVE. individual node JOIN or LEAVE.
The Signaling Gateway constructs an OLIST for each VPN Forwarder, The Signaling Gateway constructs an OLIST for each VPN Forwarder,
where its OLIST is constituted by an incoming edge (for all nodes where its OLIST is constituted by an upstream edge (for all nodes
except for the root) plus up-to K outgoing edges. except for the root) plus up-to K downstream edges. Each VPN
Forwarder delivers traffic locally to the virtual interfaces that
have JOINed the specific group and replicates the packet up-to
K times according to the OLIST.
Whenever the OLIST for a given node changes, the Signaling Gateway Whenever the OLIST for a given node changes, the Signaling Gateway
MUST allocate a different label that corresponds to that version of MUST allocate a different label that corresponds to that version of
the OLIST. This is used to avoid forwarding loops. The assumption is the OLIST. This is used to avoid forwarding loops. The assumption is
that at each run of its tree management algorithm the Gateway is that at each run of its tree management algorithm the Gateway is
capable of building an acyclic graph. However signaling updates from capable of building an acyclic graph. However signaling updates from
the Gateway to the VPN Forwarders are not synchronous. Each modified the Gateway to the VPN Forwarders are not synchronous. Each modified
OLIST will have a different label assigned, which means that in OLIST will have a different label assigned, which means that in
transient state traffic may be discarded if a VPN forwarder with transient state traffic may be discarded if a VPN forwarder with
information regarding an old edge sends traffic to a VPN forwarder information regarding an old edge sends traffic to a VPN forwarder
which has already received information of the new topology. However which has already received information of the new topology. However
this eliminates the possibility of forwarding loops. this eliminates the possibility of forwarding loops.
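The label-per-OLIST-version rule can be sketched as follows. This is a minimal Python illustration, not the draft's procedure: the class name OlistLabelAllocator, the single monotonically increasing counter, and the starting label value are all assumptions. In particular the sketch never reuses a label and does not enforce the 20-bit label limit.

```python
import itertools


class OlistLabelAllocator:
    """Hands out a fresh label whenever a forwarder's OLIST changes.

    Hypothetical helper: the draft only requires that a changed OLIST
    gets a different label, not how labels are chosen.
    """

    def __init__(self, first_label: int = 10000):
        self._labels = itertools.count(first_label)
        self._current = {}  # forwarder id -> (olist as tuple, label)

    def update(self, forwarder: str, olist: list) -> int:
        key = tuple(olist)
        known = self._current.get(forwarder)
        if known is not None and known[0] == key:
            return known[1]            # OLIST unchanged: keep the label
        label = next(self._labels)     # OLIST changed: new label
        self._current[forwarder] = (key, label)
        return label


alloc = OlistLabelAllocator()
first = alloc.update("a1", ["10.1.1.1"])
same = alloc.update("a1", ["10.1.1.1"])
changed = alloc.update("a1", ["10.1.1.1", "10.1.10.10"])
print(first, same, changed)
```

Because a forwarder that still holds the old OLIST sends packets with the old label, a peer that has already moved to the new label discards them, which is the transient loss the text accepts in exchange for loop freedom.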
Traffic forwarding is done according to a bi-directional forwarding Traffic forwarding is done according to a bi-directional forwarding
algorithm. Packets flowing from the root are distributed to all the algorithm. Packets flowing from the root are distributed to all the
outgoing edges. Traffic received from one of the leaves is sent to outgoing edges. Traffic received from one of the leaves is sent to
the root facing interface plus remaining descendants. This assumes the root facing interface plus remaining descendants. This assumes
that the VPN forwarder has the ability to determine the source of the that the VPN forwarder has the ability to determine the source of the
traffic, by examining the outer IP header of the packet. traffic, by examining the outer IP header of the packet. The MPLS
label contained in the packet identifies the multicast distribution
tree but it is not sufficient to determine the OLIST element from
which the packet has been received.
Signaling Gateways communicate multicast membership information to Signaling Gateways communicate multicast membership information to
each other using BGP L3VPN C-Multicast routes [RFC6514]. Associated each other using BGP L3VPN C-Multicast routes [RFC6514]. Associated
with each C-Multicast route, the Signaling Gateway also advertises with each C-Multicast route, the Signaling Gateway also advertises
up-to K edges that can be used to interconnect the multicast up-to K edges that can be used to interconnect the multicast
distribution tree that it manages with other trees managed by its distribution tree that it manages with other trees managed by its
peers. The C-Multicast routes are known to all signaling gateways peers. The C-Multicast routes are known to all signaling gateways
which have local membership in the corresponding VPNs. which have local membership in the corresponding VPNs.
A predefined hash function is used to determine a 32-bit value X A predefined hash function is used to determine a 32-bit value X
skipping to change at page 6, line 39 skipping to change at page 7, line 30
specific group. When the distribution tree is built, the signaling specific group. When the distribution tree is built, the signaling
gateway will include as members all the (*, *) receivers of ASM gateway will include as members all the (*, *) receivers of ASM
groups and all (*, *) and (*, g) receivers of SSM groups. groups and all (*, *) and (*, g) receivers of SSM groups.
Once the subscription is received, the gateway sends XMPP event Once the subscription is received, the gateway sends XMPP event
notifications that contain forwarding information for the specific notifications that contain forwarding information for the specific
group. These messages contain an incoming label, assigned by the group. These messages contain an incoming label, assigned by the
gateway, and a list of up-to K+1 next-hops, where each next-hop gateway, and a list of up-to K+1 next-hops, where each next-hop
consists of an IP destination address and an outgoing label. consists of an IP destination address and an outgoing label.
When the last local member of a multicast group leaves the group,
either explicitly or as a result of an expiration timer, the VPN
forwarder generates an XMPP pubsub 'delete' message to the Signaling
Gateway.
Multicast forwarding state update from gateway to VPN forwarder: Multicast forwarding state update from gateway to VPN forwarder:
<message to='system-id@domain.org' from='network-control.domain.org'> <message to='system-id@domain.org' from='network-control.domain.org'>
<event xmlns='http://jabber.org/protocol/pubsub#event'> <event xmlns='http://jabber.org/protocol/pubsub#event'>
<items node='vpn-customer-name/224.1.1.1'> <items node='vpn-customer-name/224.1.1.1'>
<item id='ae890ac52d0df67ed7cfdf51b644e901'> <item id='ae890ac52d0df67ed7cfdf51b644e901'>
<entry xmlns='http://ietf.org/protocol/bgpvpn'> <entry xmlns='http://ietf.org/protocol/bgpvpn'>
<label>10000</label> <!-- incoming label number --> <label>10000</label> <!-- incoming label number -->
<olist> <olist>
<next-hop address='10.1.1.1' label='10101'/> <next-hop address='10.1.1.1' label='10101'/>
skipping to change at page 7, line 24 skipping to change at page 8, line 4
<next-hop address='10.1.10.10' label='10222'/> <next-hop address='10.1.10.10' label='10222'/>
</olist> </olist>
</entry> </entry>
</item> </item>
<item > <item >
... ...
</item> </item>
</items> </items>
</event> </event>
</message> </message>
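To show how a VPN forwarder might read such a notification, here is a small Python sketch using the standard xml.etree.ElementTree module. The SAMPLE string is a trimmed, well-formed version of the message above that keeps only the next-hops visible in this text; the function name parse_forwarding_update and the tuple layout it returns are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used only inside this sketch.
NS = {
    "ps": "http://jabber.org/protocol/pubsub#event",
    "vpn": "http://ietf.org/protocol/bgpvpn",
}

SAMPLE = """<message to='system-id@domain.org' from='network-control.domain.org'>
  <event xmlns='http://jabber.org/protocol/pubsub#event'>
    <items node='vpn-customer-name/224.1.1.1'>
      <item id='ae890ac52d0df67ed7cfdf51b644e901'>
        <entry xmlns='http://ietf.org/protocol/bgpvpn'>
          <label>10000</label>
          <olist>
            <next-hop address='10.1.1.1' label='10101'/>
            <next-hop address='10.1.10.10' label='10222'/>
          </olist>
        </entry>
      </item>
    </items>
  </event>
</message>"""


def parse_forwarding_update(xml_text: str) -> list:
    """Return (node, incoming label, [(address, outgoing label), ...]) tuples."""
    root = ET.fromstring(xml_text)
    updates = []
    for items in root.iterfind("ps:event/ps:items", NS):
        node = items.get("node")
        for entry in items.iterfind("ps:item/vpn:entry", NS):
            incoming = int(entry.findtext("vpn:label", namespaces=NS))
            olist = [
                (nh.get("address"), int(nh.get("label")))
                for nh in entry.iterfind("vpn:olist/vpn:next-hop", NS)
            ]
            updates.append((node, incoming, olist))
    return updates


for update in parse_forwarding_update(SAMPLE):
    print(update)
```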
The VPN forwarder updates its multicast forwarding table with the The VPN forwarder updates its multicast forwarding table with the
information received in this event notification. Any label that was information received in this event notification. Any label that was
previously assigned to the (vrf, *, g) or (vrf, s, g) forwarding previously assigned to the (vrf, *, g) or (vrf, s, g) forwarding
entry is implicitly withdrawn. entry is implicitly withdrawn.
Multicast packets are encapsulated in an IP tunnel that contains a Multicast packets are encapsulated in an IP tunnel that contains a
20-bit as well as the original multicast datagram. This 20-bit label 20-bit label as well as the original multicast datagram. This 20-bit
uniquely identifies the multicast replication state as specified by label uniquely identifies the multicast replication state as
the OLIST. specified by the OLIST.
The VPN Forwarder MUST drop an incoming multicast packet unless it is The VPN Forwarder MUST drop an incoming multicast packet unless it is
either received from a local virtual interface or the source is either received from a local virtual interface or the source is
present in the OLIST. present in the OLIST.
The VPN Forwarder MUST generate a copy of the incoming packet to all The VPN Forwarder MUST generate a copy of the incoming packet to all
next-hops in the OLIST except the next-hop with the same IP address next-hops in the OLIST except the next-hop with the same IP address
as the outer header source of the incoming packet. as the outer header source of the incoming packet.
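The drop and replication rules in the two paragraphs above can be sketched in Python as below. The function name replicate, the from_local_interface flag, and the representation of the OLIST as (address, label) pairs are assumptions for this example; the sketch covers only copies sent to OLIST next-hops.

```python
def replicate(olist, outer_source, from_local_interface=False):
    """Return the OLIST next-hops that should receive a copy.

    olist: list of (ip_address, outgoing_label) pairs.
    outer_source: source address of the outer IP header, or None for a
        packet received on a local virtual interface.
    Returns None when the packet has to be dropped.
    """
    addresses = [address for address, _ in olist]
    if not from_local_interface and outer_source not in addresses:
        return None  # source is neither local nor present in the OLIST
    # Copy to every next-hop except the one the packet came from.
    return [(address, label) for address, label in olist
            if address != outer_source]


olist = [("10.1.1.1", 10101), ("10.1.10.10", 10222)]
print(replicate(olist, "10.1.1.1"))
print(replicate(olist, "192.0.2.9"))
print(replicate(olist, None, from_local_interface=True))
```

Excluding the sender is what makes the tree bi-directional: the same OLIST serves traffic arriving from the root side and from any descendant.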
Additionally, the VPN Forwarder MUST generate additional copies to Additionally, the VPN Forwarder MUST generate additional copies to
skipping to change at page 10, line 39 skipping to change at page 11, line 28
+-----------+------------------------------------+ +-----------+------------------------------------+
| Router-Id | Edges | | Router-Id | Edges |
+-----------+------------------------------------+ +-----------+------------------------------------+
| A | (a1, 0), (a1, 0), (a2, 0), (a3, 0) | | A | (a1, 0), (a1, 0), (a2, 0), (a3, 0) |
| B | (b1, 0) | | B | (b1, 0) |
| C | (c1, 0), (c2, 0), (c3, 0) | | C | (c1, 0), (c2, 0), (c3, 0) |
| D | (d1, 0), (d2, 0), (d3, 0) | | D | (d1, 0), (d2, 0), (d3, 0) |
+-----------+------------------------------------+ +-----------+------------------------------------+
In the table above, each pair represents the IP address and assigned
incoming label of a VPN forwarder.
In this example all the signaling gateways decided to advertise less In this example all the signaling gateways decided to advertise less
than K+1 edges. than K+1 edges.
One possible assignment is to make the node A the root of the top- One possible assignment is to make node A's tree the root of the top-
level distribution tree.This can be accomplished by creating the level distribution tree. This can be accomplished by creating the
edges (a1, b1), (a1, c1), (a2, d1). The tree manager must allocate a edges (a1, b1), (a1, c1), (a2, d1). The tree manager must allocate a
label for each of the next-hops from their respective label space. label for each of the next-hops from their respective label space.
As a result of this tree assignment, the multicast tree manager (B) As a result of this tree assignment, the multicast tree manager (B)
generates the following Multicast Tree Route Type updates: generates the following Multicast Tree Route Type updates:
+-----------------+-------------------------------------------------+ +-----------------+-------------------------------------------------+
| Router-Id | Edges | | Router-Id | Edges |
+-----------------+-------------------------------------------------+ +-----------------+-------------------------------------------------+
| A | (a1, b1, 10000, 20000), (a1, c1, 10000, 21000), | | A | (a1, b1, 10000, 20000), (a1, c1, 10000, 21000), |
skipping to change at page 12, line 6 skipping to change at page 13, line 6
which the multicast forwarding state is being advertised. which the multicast forwarding state is being advertised.
5.2. Multicast Edge Discovery Attribute 5.2. Multicast Edge Discovery Attribute
The Multicast Edge Discovery Path Attribute is associated with The Multicast Edge Discovery Path Attribute is associated with
C-Multicast routes and contains one or more next-hop information C-Multicast routes and contains one or more next-hop information
elements where each information element follows the model described elements where each information element follows the model described
below: below:
+------------------------------+ +------------------------------+
| Next-hop Length (1 octet) | | IP addr Length (1 octet) |
+------------------------------+ +------------------------------+
| Next-hop (variable) | | IP address (variable) |
+------------------------------+ +------------------------------+
|Label Range Length (1 octet) | |Label Range Length (1 octet) |
+------------------------------+ +------------------------------+
| Start Label (4 octets) | | Start Label (4 octets) |
+------------------------------+ +------------------------------+
| End Label (4 octets) | | End Label (4 octets) |
+------------------------------+ +------------------------------+
| ... | | ... |
+------------------------------+ +------------------------------+
| Start Label (4 octets) | | Start Label (4 octets) |
+------------------------------+ +------------------------------+
| End Label (4 octets) | | End Label (4 octets) |
+------------------------------+ +------------------------------+
Each 'Next-hop' information element identifies an incoming edge that
can be used to connect the Signaling Gateway's locally managed
replication tree with other replication trees for the same group.
The 'IP address' value corresponds to the IP address of a VPN
Forwarder that is a member of the local tree.
The same VPN Forwarder address can appear multiple times in the
Discovery Path Attribute. Signaling Gateways advertise up-to K + 1
Next-hop elements.
Each attribute specifies one or more contiguous label ranges
available for assignment at the specified VPN Forwarder. If the VPN
forwarder appears multiple times in the list, the label range
advertisements SHOULD be the same.
The Signaling Gateway SHALL ensure that the number of locally
assigned edges on a VPN forwarder plus the number of Next-hop
information elements that refer to that VPN forwarder does not
exceed K + 1.
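One way to serialize a single information element of the Multicast Edge Discovery Attribute is sketched below in Python. Several details are assumptions, since the text here does not fix them: both length fields are taken to count octets, the Label Range Length is taken to cover all Start/End pairs that follow, and all integers are encoded in network byte order. The function name encode_edge_element is hypothetical.

```python
import ipaddress
import struct


def encode_edge_element(address: str, label_ranges: list) -> bytes:
    """Encode one information element of the attribute.

    address: IPv4 or IPv6 address of the VPN Forwarder.
    label_ranges: list of (start_label, end_label) pairs.
    Assumed layout: 1-octet address length, address, 1-octet length of
    the label range block in octets, then 4-octet start/end labels.
    """
    packed_address = ipaddress.ip_address(address).packed
    ranges = b"".join(struct.pack("!II", start, end)
                      for start, end in label_ranges)
    if len(ranges) > 255:
        raise ValueError("label ranges do not fit a 1-octet length field")
    return (struct.pack("!B", len(packed_address)) + packed_address
            + struct.pack("!B", len(ranges)) + ranges)


element = encode_edge_element("10.1.1.1", [(10000, 10999)])
print(element.hex())
```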
5.3. Multicast Edge Forwarding Attribute 5.3. Multicast Edge Forwarding Attribute
The Multicast Edge Forwarding Path Attribute is associated with The Multicast Edge Forwarding Path Attribute is associated with
Multicast Tree Route Type NLRI routes and contains one or more edge Multicast Tree Route Type NLRI routes and contains one or more edge
information elements where each information element follows the model information elements where each information element follows the model
described below: described below:
+------------------------------+ +------------------------------+
| Next-hop Length (1 octet) | | Next-hop Length (1 octet) |
+------------------------------+ +------------------------------+
skipping to change at page 13, line 5 skipping to change at page 14, line 32
6. Security Considerations 6. Security Considerations
It is helpful to differentiate between the control plane and data It is helpful to differentiate between the control plane and data
plane security aspects of the solution. plane security aspects of the solution.
The control plane assumes that XMPP sessions between VPN forwarders The control plane assumes that XMPP sessions between VPN forwarders
and Signaling Gateway are authenticated such that the Signaling and Signaling Gateway are authenticated such that the Signaling
Gateway is able to verify the identity of the VPN Forwarder. Gateway is able to verify the identity of the VPN Forwarder.
BGP sessions bewteen Signaling Gateways should also be subject to BGP sessions between Signaling Gateways should also be subject to
authentication. authentication.
At the data-plane, it is important to note that a comprimised VPN At the data-plane, it is important to note that a compromised VPN
forwarder is able to modify messages that traverse it. forwarder is able to modify messages that traverse it.
7. References 7. References
7.1. Normative References 7.1. Normative References
[RFC3376] Cain, B., Deering, S., Kouvelas, I., Fenner, B. and A. [RFC3376] Cain, B., Deering, S., Kouvelas, I., Fenner, B. and A.
Thyagarajan, "Internet Group Management Protocol, Version Thyagarajan, "Internet Group Management Protocol, Version
3", RFC 3376, October 2002. 3", RFC 3376, October 2002.
skipping to change at page 14, line 4 skipping to change at page 15, line 31
generalizations", Bell System Technical Journal 36, 1957. generalizations", Bell System Technical Journal 36, 1957.
Authors' Addresses Authors' Addresses
Pedro Marques Pedro Marques
Contrail Systems Contrail Systems
440 N. Wolfe Rd. 440 N. Wolfe Rd.
Sunnyvale, CA 94085 Sunnyvale, CA 94085
Email: roque@contrailsystems.com Email: roque@contrailsystems.com
Luyuan Fang Luyuan Fang
Cisco Systems Cisco Systems
111 Wood Avenue South 111 Wood Avenue South
Iselin, NJ 08830 Iselin, NJ 08830
Email: lufang@cisco.com Email: lufang@cisco.com
Derick Winkworth Derick Winkworth
FIS FIS
Email: derick.winkworth@fisglobal.com Email: derick.winkworth@fisglobal.com
Yiqun Cai Yiqun Cai
Microsoft Corporation Microsoft Corporation
1065 La Avenida 1065 La Avenida
Mountain View, CA 94043 Mountain View, CA 94043
Email: yiqunc@microsoft.com Email: yiqunc@microsoft.com
Petr Lapukhov
Microsoft Corporation
Email: petrlapu@microsoft.com
 End of changes. 22 change blocks. 
27 lines changed or deleted 101 lines changed or added
