Multi-path Label Switched Paths Signaled Using RSVP-TE
Juniper Networks1194 N. Mathilda Ave.SunnyvaleCA94089USkireeti.kompella@gmail.comLINXmikeh@linx.net
Routing
ECMP multipath TE
This document describes extensions to Resource ReSerVation
Protocol - Traffic Engineering for the set up of multi-path
Traffic Engineered Label Switched Paths (LSPs) in Multi Protocol
Label Switching (MPLS) and Generalized MPLS networks, i.e., LSPs
that conform to traffic engineering constraints, but follow
multiple independent paths from source to destination.
In selecting a protocol for setting up and signaling "tunnel"
Labeled Switched Paths (LSPs) in Multi Protocol Label Switching
(MPLS) and Generalized MPLS (GMPLS) networks, one first chooses
whether one wants Equal Cost Multi-Path (ECMP) load balancing or
Traffic Engineering (TE). For the former, one uses the Label
Distribution Protocol (LDP) (); for the
latter, the Resource ReSerVation Protocol - Traffic Engineering
(RSVP-TE) (). [Two other criteria, the
need for fast protection and the desire for less configuration,
are no longer the deciding factors they used to be, thanks to "IP
fast reroute" () and "RSVP-TE automesh"
()].
This document describes how one can set up a tunnel LSP that has
both ECMP and TE characteristics using RSVP-TE. The techniques
described in this document can be used to create a "Multipath LSP"
(MLSP) to a destination, that consists of several "sub-LSPs", each
potentially taking a different path through the network to the
destination. The techniques can also be used to create a single
MLSP to multiple equivalent destinations (such as equidistant BGP
nexthops announcing a common set of reachable addresses), such
that each destination is served by one or more sub-LSPs.
There are several alternatives to choose from when considering
MLSPs. One is whether the ingress Label Switching Router (LSR)
computes (or otherwise obtains) the full path for each sub-LSP, or
whether LSRs along the various paths can compute paths further
downstream (using techniques such as "loose hop expansion", as in
). Another is whether the various paths
that make up the MLSP have equal cost (or distance) from ingress
to egress (i.e., ECMP), whether they may have differing costs.
Finally, one can choose whether to terminate a multi-path LSP on a
single egress or on several equivalent egresses. For now, the
first of each of these alternatives is assumed; future work can
explore other choices.
The term Multipath LSP, or MLSP, will be used to denote the
(logical) container LSP from an ingress LSR to one or more
egress LSR(s). An MLSP is the unit of configuration and
management.
An MLSP consists of one or more "sub-LSPs". A sub-LSP consists
of a single path from the ingress of the MLS to one of its
egresses. A sub-LSP is the unit of signaling of an MLSP. An
Explicit Route Object (ERO) will be used to define the path of a
sub-LSP.
The "downstream links" of an MLSP Z at LSR X is the union of the
downstream links of all sub-LSPs of Z traversing X. Similarly,
the "upstream links" of an MLSP Z at LSR X is the union of
upstream links of all sub-LSPs of Z traversing X.
The agent that takes the configuration parameters of a tunnel
and computes the corresponding paths is called the Path
Computation Agent (PCA). The PCA is responsible for acquiring
the tunnel configuration, computing the paths of the sub-LSPs,
and, if the PCA is not co-located with the ingress, informing
the ingress about the tunnel and the EROs for the sub-LSPs.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described
in .
An MLSP is configured with various constraints associated with
TE LSPs, such as destination LSR(s), bandwidth (on a per-class
basis, if desired), link colors, Shared Risk Link Groups, etc.
[Auto-mesh techniques () can be used to
reduce configuration; this is not described further here.] In
addition, parameters specifically related to MLSPs, such as how
many (or the maximum number of) sub-LSPs to create, whether
traffic should be split equally across sub-LSPs or not, etc. may
also be specified. This configuration lives on the PCA, which
is responsible for computing the paths (i.e., the EROs) for the
various sub-LSPs. The PCA informs the ingress LSR about the
MLSP and the constituent sub-LSPs, including EROs and
bandwidths.
The PCA uses the configuration parameters to decide how many
sub-LSPs to compute for this MLSP, what paths they should take,
and how much bandwidth each sub-LSP is responsible for. Each
sub-LSP MUST meet all the constraints of the MLSP (except
bandwidth). The bandwidths (per-class, if applicable) of all
the sub-LSPs MUST add up to the bandwidth of the MLSP. A Path
Computation Element () that is
multi-path LSP-aware may be used as the PCA.
Having computed (or otherwise obtained) the paths of all the
sub-LSPs, the ingress A then signals the MLSP by signaling all
the individual sub-LSPs across the MPLS/GMPLS network. To do
this, the ingress first picks an MLSP ID, a 16-bit number that
is unique in the context of the ingress. This ID is used in an
ASSOCIATION object that is placed in each sub-LSP to let all
transit LSRs know that the sub-LSPs belong to the same MLSP.
If multiple sub-LSPs of the same MLSP pass through LSR Y, and Y
has downstream links YP, YQ and YR for the various sub-LSPs,
then Y has to load balance incoming traffic for the MLSP across
the three downstream links in proportion to the sum of the
bandwidths of the sub-LSPs going to each downstream (see ).
One must distinguish carefully between the signaled bandwidth of
a sub-LSP, a static value capturing the expected or maximum
traffic on the sub-LSP, and the instantaneous traffic received
on a sub-LSP, a constantly varying quantity. Suppose there are
three sub-LSPs traversing Y, with bandwidths 10Gbps, 20Gbps and
30Gbps, going to P, Q and R respectively. Suppose further Y
receives some traffic over each of these sub-LSPs. Y must
balance this received traffic over the three downstream links
YP, YQ and YR in the ratio 1:2:3.
In an IP or LDP network, incoming traffic arriving at A headed
for B will be split equally between M and X at A. Similarly,
traffic for B arriving at Y will be split equally among P, Q and
R. If the traffic arriving at A for B is 120Gbps, then the AMB
path will carry 60Gbps, the paths AXYPTB, AXYQTB and AXYRB will
each carry 10Gbps, and the AXSB path will carry 30Gbps. We'll
call this "IP-style" load balancing.
Note: all load balancing is subject to the overriding
requirement of mapping the same "flow" to the same downstream.
(What constitutes a "flow" is beyond the scope of this
document.) This requirement takes precedence over all attempts
to balance traffic among downstreams. Thus, the statements
above (e.g., "the AMB path will carry 60Gbps") are to be
interpreted as ideal targets, not hard requirements, of load
balancing.
One can simulate the IP or LDP ECMP behavior with TE-based ECMP
by creating an MLSP with five sub-LSPs S1 through S5 taking
paths AMB, AXYPTB, AXYQTB, AXYRB and AXSB, with bandwidths
60Gbps, 10Gbps, 10Gbps, 10Gbps and 30Gbps, respectively.
With such an arrangement, the MB link carries 60Gbps while the
RB link carries just 10Gbps. If one wishes instead to carry
equal amounts of traffic on the links incoming to B, then one
could arrange the sub-LSPs S1 to S5 to have bandwidths 30Gbps,
15Gbps, 15Gbps, 30Gbps and 30Gbps, respectively. In this case,
the bandwidth on each of the four links going to B is 30Gbps,
illustrating some of the capabilities of TE-based ECMP.
Staying with this example, A has one sub-LSP of bandwidth 30Gbps
to M and four sub-LSPs of total bandwidth 90Gbps to X. Thus, A
should load balance traffic in the ratio 1:3 between the AM and
the AX links. Similarly, X has three sub-LSPs of total
bandwidth 60Gbps to Y and one sub-LSP of bandwidth 30Gbps to S,
so X should load balance traffic 2:1 between Y and S. Y has a
sub-LSP of bandwidth 15Gbps to each of P and Q and one sub-LSP
of bandwidth 30Gbps to R, so Y should load balance traffic 1:1:2
among P, Q and R, respectively. Thus, in general, TE-based ECMP
does not assume equal distribution of traffic among downstream
LSRs, unlike IP- or LDP-style ECMP.
In , there are potentially 2x3x5=30 ECMP
paths between A and B. With IP or LDP, exploiting all these
paths is straightforward, and doesn't need a lot of state. With
an MLSP as seen so far, this would require 30 sub-LSPs to
achieve equivalent load balancing. This suggests that a
different approach is needed to efficiently achieve IP-style
load balancing with TE LSPs. To this end, we introduce the
notion of "equi-bandwidth" (EB) sub-LSPs and EB MLSPs. A
sub-LSP is equi-bandwidth if its "E" bit is set (see
). An MLSP is equi-bandwidth if all
of its sub-LSPs are equi-bandwidth.
If a set of EB sub-LSPs of the same MLSP traverse an LSR S, say
to downstream links SP, SQ and SR, then S MUST attempt to load
balance traffic received on these EB sub-LSPs equally among the
links SP, SQ and SR, independent of how many sub-LSPs go over
each of these links. Furthermore, S MUST redistribute traffic
received from each of its upstream LSRs, and SHOULD redistribute
all traffic received from upstream as a whole. One can do the
former by signaling the same label to each of its upstream LSRs;
one can do the latter by signaling the same label to all
upstream LSRs (see ). For example, in
, if L sends 12Gbps of traffic to S and M
sends 18Gbps to S, S can redistribute L's traffic by sending
4Gbps to each of P, Q and R; and can similarly send 6Gbps of M's
traffic to each of P, Q and R. Alternatively, S can load
balance the aggregate 30Gbps of traffic received from L and M to
each of P, Q and R, thus sending 10Gbps to each. EB sub-LSPs
have an added benefit of not requiring unequal load balancing
across links, which may pose problems for some hardware.
Given the notion of EB sub-LSPs and EB MLSPs, A can signal an EB
MLSP Z comprised of five EB sub-LSPs E1 through E5 with the
following paths: ALSPTUB, AMSQTVB, ALSRTWB, AMSPTXB and ALSQTYB
(respectively). Then, A has two downstream links for the five
sub-LSPs, AL and AM, between which A will load balance equally.
Similarly, S has three downstream links, SP, SQ and SR; and T
has five downstreams, TU, TV, TW, TX and TY. Thus the load
balancing behavior of the MLSP will replicate IP load balancing.
The state required for an EB MLSP to achieve IP-style load
balancing is somewhat greater than for LDP LSPs, but
significantly less than that for multiple "regular" TE LSPs, or
for a non-EB MLSP.
Some of the power of TE-based ECMP was illustrated in the above
examples. Another is ability to request that all sub-LSPs avoid
links colored red. If in the example network in
, the QT link is colored red but all other
links are not, then there are four ECMP paths that satisfy these
constraints, and the traffic distribution among them will
naturally be different than it would without the link color
constraint.
One can also ask whether an MLSP with sub-LSPs is any better
than N "regular" LSPs from the same ingress to the same egress.
Here are some benefits of an MLSP:
With an MLSP, there is a single entity to provision, manage
and monitor, versus N separate entities in the case of LSPs.
A consequence of this is that with an MLSP, changes in
topology can be dealt with easily and autonomously by the
ingress LSR, by adding, changing or removing sub-LSPs to
rebalance traffic, while maintaining the same TE
constraints. With individual LSPs, such changes would
require changes in configuration, and thus are harder to
automate.
An ingress LSR, knowing that an MLSP is for load balancing,
can decide on an optimum number of sub-LSPs, and place them
appropriately across the network to optimize load balancing.
On the other hand, an ingress LSR asked to create N
independent LSPs will do so without regard to whether N is a
good number of equal cost paths, and, more importantly, may
place several of the N LSPs on the same path, defeating the
purpose of load balancing.
The EB sub-LSP mechanism will, in many cases, result in far
fewer sub-LSPs than independent LSPs and thus less control
plane state.
Finally, an MLSP will usually have less data plane state
than N independent LSPs: whenever multiple sub-LSPs traverse
a link, a single label will be used for all of them, whereas
if multiple LSPs traverse a link, each will need a separate
label.
Definition: Let G=(V, E) be a directed graph (or network), and
let A and B in V be two nodes in G. Let T be the traffic
arriving at A destined for B. T is said to be "IP-style" load
balanced if for every node X on a shortest path from A to B, the
portion of T arriving at X is split equally among all nodes Yi
that are adjacent to X and are on a shortest path from X to B.
Theorem: An MLSP can accurately mimic IP-style load balancing
between any two nodes in any network.
Proof: left to the reader.
Corollary: MLSPs provide a strictly more powerful load balancing
mechanism than IP-style load balancing.
Sub-LSPs of an MLSP are tied together using ASSOCIATION objects.
ASSOCIATION objects have a new Association Type for MLSPs (TBD).
The Association ID is chosen by the ingress of the MLSP; the
Association Source is the loopback address of the ingress of the
MLSP. All sub-LSPs containing an ASSOCIATION object with a
given Association Source and Type belong to the same MLSP.
A LSR S that receives Path messages for several sub-LSPs of the
same MLSP from the same upstream LSR SHOULD allocate the same
label for all the sub-LSPs. This simplifies load balancing for
the aggregate traffic on those sub-LSPs. If the sub-LSPs are EB
sub-LSPs, then S SHOULD allocate the same label for all EB
sub-LSPs of the same MLSP that pass through S, regardless of
which upstream LSR they come from. This allows S to load
balance the aggregate traffic received on the MLSP, as all the
MLSP traffic arrives at S with the same label. However, an LSR
that can achieve the load balancing requirements independent of
label allocation strategies is free to do so.
Since MLSPs are traffic engineered, there needs to be strict
bandwidth accounting, or admission control, on every link that
an MLSP traverses. For non-EB sub-LSPs, this is
straightforward, and analogous to regular TE LSPs. However, for
EB sub-LSPs, two new procedures are needed, one for signaling
bandwidth, and the other for admission control. First, for a
given MLSP Z, an LSR X MUST ensure (via signaling) that the
total incoming bandwidth of EB sub-LSPs of MLSP Z is divided
equally among all the downstream links of X which at least one
of the EB sub-LSPs traverses. Second, LSR X MUST ensure that,
for each upstream link of X, there is sufficient bandwidth to
accommodate all EB sub-LSPs of MLSP Z that traverse that link.
Let's take the example of , with MLSP Z
having five EB sub-LSPs E1 to E5, and say that MLSP Z is
configured with a bandwidth of 30Gbps. Here are some of the
steps involved.
LSR A, being the ingress, has no upstream links. A has two
downstream links, AL and AM. Three EB sub-LSPs of MLSP Z
traverse AL, and two traverse AM. A MUST signal a total of
15Gbps for the sub-LSPs to L, and a total of 15Gbps for the
sub-LSPs to M. The required bandwidth may be divided up
among the sub-LSPs to L (similarly, to M) in any manner so
long as the total is 15Gbps. For example, A can signal
sub-LSP E1 with 15Gbps, and sub-LSPs E3 and E5 with 0
bandwidth.
LSR L has one upstream link AL with three EB sub-LSPs with a
total bandwidth of 15Gbps. L MUST ensure that 15Gbps is
available for the AL link. If this bandwidth is not
available, L MUST send a PathErr on ALL of the EB sub-LSPs
on the AL link. Let's assume that the AL link has
sufficient bandwidth.
Next, it is up to L to decide how to divide the incoming
15Gbps among the three downstream EB sub-LSPs to S. Say L
signals sub-LSP E1 with 15Gbps, and the others with 0
bandwidth.
LSR S has two upstream links: LS with three EB sub-LSPs with
a total bandwidth of 15Gbps, and MS with two EB sub-LSPs
with a total bandwidth of 15Gbps. S MUST ensure that 15Gbps
is available for each of the LS and MS links. S has thus a
total incoming bandwidth of 30Gbps on MLSP Z. S has to
divide this equally among its downstream links SP, SQ and
SR, yielding 10Gbps each. S MUST ensure that the total
bandwidth requested on the SP link for sub-LSPs E1 and E4 is
10Gbps. S may choose to signal these sub-LSPs with 5Gbps
each. Similarly for the SQ and SR links.
There are two important points to note here. One is that the
bandwidth reservation (TSpec) for a given EB sub-LSP can (and
usually will) change hop-by-hop. The second is that as new EB
sub-LSPs are signaled for an MLSP, the bandwidth reservations
for existing EB sub-LSPs belonging to the same MLSP may have to
be updated. To minimize these updates, it is RECOMMENDED that
the first EB sub-LSP on a link be signaled with the total
required bandwidth (as far as is known), and later sub-LSPs on
the same link be signaled with 0 bandwidth.
Traffic intended to be sent over an MLSP is determined at the
ingress LSR by means outside the scope of this document, and at
transit LSRs by the label(s) assigned by the transit LSR to its
upstream LSRs. In the case of non-EB sub-LSPs, this traffic is
load balanced across downstream links in the ratio of the
bandwidths of the sub-LSPs that comprise the MLSP. In the case
of EB sub-LSPs, the traffic belonging to an MLSP from an
upstream LSR (or better still, the aggregate traffic for the
MLSP from all upstream LSRs) is load balanced equally among all
downstream links.
As noted above, the overriding concern is that flows are mapped
to the same downstream link (except when the MLSP or some
constituent sub-LSPs are changing); this is typically done by
hashing fields that define a flow, and mapping hash results to
different downstream LSRs. Hash-based load balancing typically
assumes that the numbers of flows is sufficiently large and the
bandwidth per flow is reasonably well-balanced so that the
results of hashing yields reasonable traffic distribution.
Entropy labels ( and ) can be used to improve load balancing at
intermediate nodes.
This document introduces no new security concerns in the setup and
signaling of LSPs using RSVP-TE, or in the use of the RSVP
protocol. specifies the message
integrity mechanisms for RSVP signaling. These mechanisms apply
to RSVP-TE signaling of MLSPs described in this document, and are
highly recommended pending newer integrity mechanisms for RSVP.
The author would like to thank the Routing Protocol group at
Juniper Networks for their questions, comments and encouragement
for this proposal. While many participated, special thanks go to
Yakov Rekhter, John Drake and Rahul Aggarwal. Many thanks too to
John for suggesting the use of ASSOCIATION objects.
IANA is requested to assign a new Association Type for MLSP. This
Association Type is to be used for ASSOCIATION objects with C-Type
1 (IPv4 Source) and 2 (IPv6 Source).