idnits 2.17.1 

draft-ietf-pwe3-fat-pw-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.i or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (September 23, 2009) is 5328 days in the past.  Is
     this intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 4379 (Obsoleted by RFC 8029)

  ** Obsolete normative reference: RFC 4447 (Obsoleted by RFC 8077)

  == Outdated reference: A later version (-11) exists of
     draft-ietf-bfd-base-09

  == Outdated reference: A later version (-02) exists of
     draft-kompella-mpls-entropy-label-00


     Summary: 3 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


PWE3                                                      S. Bryant, Ed.
Internet-Draft                                               C. Filsfils
Intended status: Standards Track                           Cisco Systems
Expires: March 27, 2010                                         U. Drafz
                                                        Deutsche Telekom
                                                             V. Kompella
                                                                J. Regan
                                                          Alcatel-Lucent
                                                               S. Amante
                                                  Level 3 Communications
                                                      September 23, 2009

          Flow Aware Transport of Pseudowires over an MPLS PSN
                       draft-ietf-pwe3-fat-pw-01

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on March 27, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Abstract

   Where the payload carried over a pseudowire carries a number of
   identifiable flows, it can in some circumstances be desirable to
   carry those flows over the equal cost multiple paths (ECMPs) that
   exist in the packet switched network.  Most forwarding engines are
   able to hash based on label stacks and use this to balance flows over
   ECMPs.  This draft describes a method of identifying the flows, or
   flow groups, to the label switched routers by including an additional
   label in the label stack.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC2119 [RFC2119].

Table of Contents

   1.  Introduction
     1.1.  ECMP in Label Switched Routers
     1.2.  Flow Label
   2.  Native Service Processing Function
   3.  Pseudowire Forwarder
     3.1.  Encapsulation
   4.  Signaling the Presence of the Flow Label
     4.1.  Structure of Flow Label TLV
   5.  OAM
   6.  Applicability
     6.1.  Equal Cost Multiple Paths
     6.2.  Link Aggregation Groups
     6.3.  The Single Large Flow Case
     6.4.  MPLS-TP
   7.  Applicability to MPLS
   8.  Security Considerations
   9.  IANA Considerations
   10. Congestion Considerations
   11. Acknowledgements
   12. References
     12.1. Normative References
     12.2. Informative References
   Authors' Addresses

1.  Introduction

   A pseudowire (PW) [RFC3985] is normally transported over one single
   network path, even if multiple Equal Cost Multiple Paths (ECMP) exist
   between the ingress and egress PW provider edge (PE) equipment
   [RFC4385] [RFC4928].  This is required to preserve the
   characteristics of the emulated service (e.g. to avoid misordering
   SAToP pseudowire packets [RFC4553] or subjecting the packets to
   unusable inter-arrival times).  The use of a single path to preserve
   order remains the default mode of operation of a pseudowire (PW).
   The new capability proposed in this document is an OPTIONAL mode
   which may be used when the use of ECMP paths is known to be
   beneficial (and not harmful) to the operation of the PW.

   Some pseudowires are used to transport large volumes of IP traffic
   between routers at two locations.  One example of this is the use of
   an Ethernet pseudowire to create a virtual direct link between a pair
   of routers.  Such pseudowires may carry from hundreds of Mbps to Gbps
   of traffic.  Such pseudowires do not require strict ordering to be
   preserved between packets of the pseudowire.  They only require
   ordering to be preserved within the context of each individual
   transported IP flow.  Some operators have requested the ability to
   explicitly configure such a pseudowire to leverage the availability
   of multiple ECMP paths.  This allows for better capacity planning, as
   the statistical multiplexing of a larger number of smaller flows is
   more efficient than with a smaller set of larger flows.  Although
   Ethernet is used as an example above, the mechanisms described in
   this draft are general mechanisms that may be applied to any
   pseudowire type in which there are identifiable flows, and in which
   there is no requirement to preserve the order between those flows.

   Typically, forwarding hardware can deduce that an IP payload is being
   directly carried by an MPLS label stack, and is capable of looking at
   some fields in packets to construct hash buckets for conversations or
   flows.  However, an intermediate node has no information on the type
   of pseudowire being carried in the packet.  This limits the forwarder
   at the intermediate node to making an ECMP choice based only on a
   hash of the label stack.  In the case of a pseudowire emulating a
   high bandwidth trunk, the granularity obtained by hashing the default
   label stack is inadequate for satisfactory load-balancing.  The
   ingress node, however, is in the special position of being able to
   look at the un-encapsulated packet and spread flows amongst any
   available ECMP paths, or even any Loop-Free Alternates [RFC5286].
   This draft proposes a method to introduce granularity on the hashing
   of traffic running over pseudowires by introducing an additional
   label, chosen by the ingress node, and placed at the bottom of the
   label stack.

   In addition to providing an indication of the flow structure for use
   in ECMP forwarding decisions, the mechanism described in this
   document may also be used to select flows for distribution over an
   IEEE 802.1AX (formerly 802.3ad) link aggregation group that is used
   in an MPLS network.

1.1.  ECMP in Label Switched Routers

   Label switched routers commonly hash the label stack or some elements
   of the label stack as a method of discriminating between flows, in
   order to distribute those flows over the available equal cost
   multiple paths that exist in the network.  Since the label at the
   bottom of stack is usually the label most closely associated with the
   flow, this normally provides the greatest entropy, and hence is
   usually included in the hash.  This draft describes a method of
   adding an additional label at the bottom of stack in order to
   facilitate the load balancing of the flows within a pseudowire over
   the available ECMPs.  A similar design for general MPLS use has also
   been proposed [I-D.kompella-mpls-entropy-label]; however, that is
   outside the scope of this draft.
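
   To make the hashing argument concrete, the following Python sketch
   models an LSR that hashes every label in the stack to choose one of
   four ECMP members.  The hash (SHA-256), the label values and the
   function name ecmp_path are hypothetical; real LSRs use
   vendor-specific hashes and may include only part of the stack.

```python
import hashlib
import struct


def ecmp_path(labels, num_paths):
    """Pick an ECMP member by hashing every label value in the stack.

    Hypothetical LSR model: SHA-256 stands in for whatever
    vendor-specific hash the forwarding hardware really uses.
    """
    key = b"".join(struct.pack("!I", label) for label in labels)
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


TUNNEL, PW = 1000, 2000  # hypothetical label values

# Without a flow label, every packet of the PW presents the same stack,
# so every packet hashes to the same member.
no_fl = {ecmp_path([TUNNEL, PW], 4) for _ in range(100)}

# With a per-flow label at the bottom of stack, flows can spread out.
with_fl = {ecmp_path([TUNNEL, PW, fl], 4) for fl in range(16, 116)}

print(len(no_fl), len(with_fl))
```

   Without a flow label all packets of the PW select one member; adding
   a per-flow bottom label lets different flows select different
   members.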

   An alternative method of load balancing by creating a number of
   pseudowires and distributing the flows amongst them was considered,
   but was rejected because:

   o  It did not introduce as much entropy as the load balance label
      method.

   o  It required additional pseudowires to be set up and maintained.

1.2.  Flow Label

   An additional label is interposed between the pseudowire label and
   the control word, or, if the control word is not present, between the
   pseudowire label and the pseudowire payload.  This additional label
   is called the Flow label.  Indivisible flows within the pseudowire
   MUST be mapped to the same Flow label by the ingress PE.  The flow
   label stimulates the correct ECMP load balancing behaviour in the
   packet switched network (PSN).  On receipt of the pseudowire packet
   at the egress PE (which knows this additional label is present), the
   flow label is discarded without processing.

   Note that the flow label MUST NOT be an MPLS reserved label (values
   in the range 0..15) [RFC3032], but is otherwise unconstrained by the
   protocol.

   Considerations of the TTL value are described in the Security section
   of this document.  The flow label can never become the top label in
   normal operation, and hence the TTL in the flow label is never used
   to determine whether the packet should be discarded due to TTL
   expiry.  Therefore, no lower bound is imposed on the TTL value.
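
   A label stack entry carrying the flow label can be sketched as below,
   using the RFC 3032 layout (20-bit label, 3 EXP/TC bits, bottom-of-
   stack bit, 8-bit TTL).  Because the flow label is always at the
   bottom of the stack, the S bit is set.  The TTL is left to the
   caller; the value 1 in the example is purely illustrative and is not
   taken from this document.

```python
import struct

MAX_LABEL = (1 << 20) - 1   # labels are 20-bit values [RFC3032]
MAX_RESERVED = 15           # values 0..15 are reserved [RFC3032]


def encode_flow_label(value, ttl, tc=0):
    """Encode a flow label as a 4-octet label stack entry.

    Layout per RFC 3032: Label (20 bits) | EXP/TC (3) | S (1) | TTL (8).
    The flow label sits at the bottom of the stack, so S is always 1.
    """
    if not MAX_RESERVED < value <= MAX_LABEL:
        raise ValueError("flow label must be in the range 16..2^20-1")
    if not (0 <= tc <= 7 and 0 <= ttl <= 255):
        raise ValueError("tc must fit in 3 bits and ttl in 8 bits")
    word = (value << 12) | (tc << 9) | (1 << 8) | ttl
    return struct.pack("!I", word)


print(encode_flow_label(16, ttl=1).hex())  # 00010101
```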

2.  Native Service Processing Function

   The Native Service Processing (NSP) function [RFC3985] is a component
   of a PE that has knowledge of the structure of the emulated service
   and is able to take action on the service outside the scope of the
   pseudowire.  In this case, the NSP in the ingress PE is required to
   identify flows, or groups of flows, within the service, and to
   indicate the flow (group) identity of each packet as it is passed to
   the pseudowire forwarder.  As an example, where the PW type is
   Ethernet, the NSP might parse the ingress Ethernet traffic and
   consider all of the IP traffic.  This traffic could then be
   categorised into flows by considering all traffic with the same
   source and destination address pair to be a single indivisible flow.
   Since this is an NSP function, by definition, the method used to
   identify a flow is outside the scope of the pseudowire design.
   Similarly, since the NSP is internal to the PE, the method of flow
   indication to the pseudowire forwarder is outside the scope of this
   document.
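
   The flow identification in the example above is an NSP matter and is
   not specified here.  As one hypothetical illustration, the sketch
   below extracts the IPv4 source and destination address pair from an
   Ethernet frame, skipping at most one 802.1Q tag.  The function name
   and the choice to report all other frames as non-IP (None) are
   assumptions.

```python
import ipaddress
import struct

ETH_P_IP = 0x0800     # IPv4 EtherType
ETH_P_8021Q = 0x8100  # 802.1Q VLAN tag TPID


def ipv4_flow_key(frame: bytes):
    """Return the (source, destination) IPv4 address pair of a frame.

    Illustrative NSP classifier following the example in the text: all
    IPv4 traffic between one address pair is one indivisible flow. At
    most one 802.1Q tag is skipped; anything else returns None.
    """
    if len(frame) < 14:
        return None
    offset = 12                      # EtherType follows the two MACs
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    if ethertype == ETH_P_8021Q:
        offset += 4                  # skip TPID + TCI
        if len(frame) < offset + 2:
            return None
        (ethertype,) = struct.unpack_from("!H", frame, offset)
    ip = offset + 2                  # start of the IPv4 header
    if ethertype != ETH_P_IP or len(frame) < ip + 20:
        return None
    src = ipaddress.IPv4Address(frame[ip + 12:ip + 16])
    dst = ipaddress.IPv4Address(frame[ip + 16:ip + 20])
    return src, dst
```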

3.  Pseudowire Forwarder

   The pseudowire forwarder must be provided with a method of mapping
   flows to load balanced paths.

   The forwarder must generate a label for the flow or group of flows.
   How the load balance label values are determined is outside the scope
   of this document; however, the load balance label allocated to a flow
   MUST NOT be an MPLS reserved label and SHOULD remain constant for the
   life of the flow.  It is recommended that the method chosen to
   generate the load balancing labels introduces a high degree of
   entropy in their values, to maximise the entropy presented to the
   ECMP path selection mechanism in the LSRs in the PSN, and hence
   distribute the flows as evenly as possible over the available PSN
   ECMP paths.  The forwarder at the ingress PE prepends the pseudowire
   control word (if applicable), and then pushes the flow label,
   followed by the pseudowire label.
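
   The following sketch shows one possible way, not mandated by this
   document, to derive a flow label value: hash a flow identifier (for
   example the address pair produced by the NSP) and fold the result
   into the unreserved range 16..2^20-1.  Determinism keeps the label
   constant for the life of the flow.  The salt parameter and the
   function name are hypothetical.

```python
import hashlib

LABEL_SPACE = 1 << 20     # 20-bit label field
FIRST_UNRESERVED = 16     # 0..15 are reserved and MUST NOT be used


def flow_label_for(flow_key: bytes, salt: bytes = b"") -> int:
    """Map a flow identifier to a flow label value.

    Hash the flow key for high entropy, then fold the digest into
    16..2^20-1. The mapping is deterministic, so a flow keeps the same
    label for its lifetime. 'salt' is a hypothetical per-PE secret.
    """
    digest = hashlib.sha256(salt + flow_key).digest()
    h = int.from_bytes(digest[:4], "big")
    return FIRST_UNRESERVED + h % (LABEL_SPACE - FIRST_UNRESERVED)


label = flow_label_for(b"10.0.0.1|10.0.0.2")
assert label == flow_label_for(b"10.0.0.1|10.0.0.2")  # stable per flow
assert FIRST_UNRESERVED <= label < LABEL_SPACE         # never reserved
```

   Folding a 32-bit hash with a modulo introduces only a negligible bias
   toward low values, which is harmless for load balancing.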

   The forwarder at the egress PE uses the pseudowire label to identify
   the pseudowire.  From the context associated with the pseudowire
   label, the egress PE can determine whether a flow label is present.
   If a flow label is present, the label is discarded.
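
   A minimal sketch of this egress behaviour, assuming the tunnel labels
   have already been popped and that the per-PW context is a simple
   dictionary keyed by PW label (a hypothetical structure).  The flow
   label, when present, is skipped without inspecting its value.

```python
import struct


def egress_strip(packet: bytes, flow_label_present: dict):
    """Remove the PW label and, if negotiated, the flow label.

    'packet' is assumed to start at the PW label (tunnel labels already
    popped). 'flow_label_present' is a hypothetical per-PW context table
    keyed by PW label. Returns (pw_label, control word + payload).
    """
    (word,) = struct.unpack_from("!I", packet, 0)
    pw_label = word >> 12          # top 20 bits of the label stack entry
    offset = 4
    if flow_label_present[pw_label]:
        offset += 4                # discard the flow label unexamined
    return pw_label, packet[offset:]
```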

   All other pseudowire forwarding operations are unmodified by the
   inclusion of the flow label.

3.1.  Encapsulation

   The PWE3 Protocol Stack Reference Model modified to include the flow
   label is shown in Figure 1.

      +-------------+                                +-------------+
      |  Emulated   |                                |  Emulated   |
      |  Ethernet   |                                |  Ethernet   |
      | (including  |         Emulated Service       | (including  |
      |  VLAN)      |<==============================>|  VLAN)      |
      |  Services   |                                |  Services   |
      +-------------+                                +-------------+
      |    Flow     |                                |    Flow     |
      +-------------+            Pseudowire          +-------------+
      |Demultiplexer|<==============================>|Demultiplexer|
      +-------------+                                +-------------+
      |    PSN      |            PSN Tunnel          |    PSN      |
      |   MPLS      |<==============================>|   MPLS      |
      +-------------+                                +-------------+
      |  Physical   |                                |  Physical   |
      +-----+-------+                                +-----+-------+

               Figure 1: PWE3 Protocol Stack Reference Model

   The encapsulation of a pseudowire with a flow label is shown in
   Figure 2.

    +-------------------------------+
    |      MPLS Tunnel label(s)     | n*4 octets (four octets per label)
    +-------------------------------+
    |      PW label                 |  4 octets
    +-------------------------------+
    |      Flow label               |  4 octets
    +-------------------------------+
    |   Optional Control Word       |  4 octets
    +-------------------------------+
    |            Payload            |
    |                               |
    |                               |  n octets
    |                               |
    +-------------------------------+

      Figure 2: Encapsulation of a pseudowire with a pseudowire load
                              balancing label
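
   The push order of Section 3 and the layout of Figure 2 can be
   sketched as follows.  The TTL of 255 on every entry and the zeroed
   EXP/TC bits are simplifying assumptions of this sketch; the document
   defers TTL considerations to the Security section.

```python
import struct


def lse(label: int, bottom: bool, ttl: int) -> bytes:
    """Pack one RFC 3032 label stack entry (EXP/TC bits left at zero)."""
    return struct.pack("!I", (label << 12) | (int(bottom) << 8) | ttl)


def encapsulate(payload, tunnel_labels, pw_label, flow_label,
                control_word=None, ttl=255):
    """Build the Figure 2 layout: tunnel label(s), PW label, flow label,
    optional control word, payload. Only the flow label has S=1."""
    if control_word is not None and len(control_word) != 4:
        raise ValueError("the control word is 4 octets")
    stack = [lse(t, False, ttl) for t in tunnel_labels]
    stack.append(lse(pw_label, False, ttl))
    stack.append(lse(flow_label, True, ttl))   # bottom of stack
    return b"".join(stack) + (control_word or b"") + payload
```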

4.  Signaling the Presence of the Flow Label

   When using the signalling procedures in [RFC4447], a Pseudowire
   Interface Parameter Sub-TLV type is used to synchronise the flow
   label states between the ingress and egress PEs.

   The absence of a flow label (FL) TLV from either party indicates that
   the PE concerned is unable to recognise this TLV, and the sender of
   the FL TLV MUST then send a new label mapping without the FL TLV.
   This preserves backwards compatibility with existing PEs that do not
   understand the FL TLV or that cannot, or do not wish to, process the
   flow label.

   A PE that wishes to use a flow label sends an FL TLV with the F bit
   set (see Section 4.1).  A PE that can correctly process a flow label
   and is willing to receive one, but does not wish to send a flow
   label, sends an FL TLV with the F bit clear.  A PE that sends an FL
   TLV with the F bit set and receives an FL TLV with or without the F
   bit set MUST include the flow label between the pseudowire label and
   the control word (or, if the control word is not present, between the
   pseudowire label and the pseudowire payload).
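
   The rules above can be condensed into a small decision function.
   This is a sketch under the assumption that the PE already knows which
   FL TLV (if any) it sent and received for the PW; the names negotiate
   and FlowLabelUse are hypothetical.

```python
from typing import NamedTuple, Optional


class FlowLabelUse(NamedTuple):
    transmit: bool   # push a flow label on packets we send
    receive: bool    # expect a flow label on packets we receive
    resend: bool     # re-advertise our label mapping without the FL TLV


def negotiate(local_f: Optional[bool], peer_f: Optional[bool]) -> FlowLabelUse:
    """Apply the FL TLV rules for one PW.

    local_f / peer_f: the F bit we sent / received, or None when no FL
    TLV was sent / received.
    """
    # We push if we asked to (F=1) and the peer sent an FL TLV at all.
    transmit = local_f is True and peer_f is not None
    # The peer pushes if it asked to and we sent an FL TLV at all.
    receive = peer_f is True and local_f is not None
    # Peer did not understand the TLV: withdraw it from our mapping.
    resend = local_f is not None and peer_f is None
    return FlowLabelUse(transmit, receive, resend)
```

   For example, negotiate(True, False) yields transmit=True and
   receive=False: this PE pushes flow labels, while its peer, which only
   agreed to receive them, does not.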

   If PWE3 signalling [RFC4447] is not in use for a pseudowire, then
   whether the flow label is used MUST be identically provisioned in
   both PEs at the pseudowire endpoints.  If there is no provisioning
   support for this option, the default behaviour is not to include the
   flow label.

   Note that what is signalled is the desire to include the flow label
   in the label stack.  The value of the label is a local matter for the
   ingress PE, and the label value itself is not signalled.

4.1.  Structure of Flow Label TLV

   The structure of the flow label TLV is shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | FL            |    Length     |F|       must be zero          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 3: Flow Label TLV

   Where:

   o  FL is the flow label TLV identifier assigned by IANA.

   o  Length is the length of the TLV in octets and is 4.

   o  When F=1, a flow label will be pushed.  When F=0, a flow label
      will not be pushed.
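
   Encoding and decoding the TLV of Figure 3 might look like the sketch
   below.  The type code is not yet assigned, so FL_TYPE is a
   hypothetical placeholder; ignoring the must-be-zero bits on receipt
   is also an assumption of this sketch.

```python
import struct

FL_TYPE = 0xEE   # hypothetical placeholder; the real value is assigned by IANA
FL_LENGTH = 4    # length of the TLV in octets
F_BIT = 0x8000   # F is the first bit after the Length field


def encode_fl_tlv(push_flow_label: bool) -> bytes:
    """Encode Figure 3: FL (8 bits), Length (8), F (1), zero (15)."""
    return struct.pack("!BBH", FL_TYPE, FL_LENGTH,
                       F_BIT if push_flow_label else 0)


def decode_fl_tlv(data: bytes) -> bool:
    """Return the F bit; the must-be-zero bits are ignored on receipt."""
    tlv_type, length, flags = struct.unpack_from("!BBH", data, 0)
    if tlv_type != FL_TYPE or length != FL_LENGTH:
        raise ValueError("not a flow label TLV")
    return bool(flags & F_BIT)
```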

5.  OAM

   The following OAM considerations apply to this method of load
   balancing.

   Where the OAM is only to be used to perform a basic test that the
   pseudowires have been configured at the PEs, VCCV [RFC5085] messages
   may be sent using any load balance pseudowire path, i.e. using any
   value for the flow label.

   Where it is required to verify that a pseudowire is fully functional
   for all flows, VCCV [RFC5085] connection verification messages MUST
   be sent over each ECMP path to the pseudowire egress PE.  This
   problem is difficult to solve and scales poorly.  We believe that
   this problem is addressed by the following two methods:

   1.  If a failure occurs within the PSN, this failure will normally be
       detected by the PSN's Interior Gateway Protocol (IGP) link/node
       failure detection mechanism (loss of light, bidirectional
       forwarding detection [I-D.ietf-bfd-base] or IGP hello detection),
       and the IGP convergence will naturally modify the ECMP set of
       network paths between the ingress and egress PEs.  Hence the PW
       is only impacted during the normal IGP convergence time.

   2.  If the failure is related to the individual corruption of a
       Label Forwarding Information Base (LFIB) entry in a router, then
       only the network path using that specific entry is impacted.  If
       the PW is load balanced over multiple network paths, then this
       failure can only be detected if, by chance, the transported OAM
       flow is mapped onto the impacted network path, or if all paths
       are tested.  This type of error may be better solved by other
       means such as LSR self-test [I-D.ietf-mpls-lsr-self-test].

   To troubleshoot the MPLS PSN, including multiple paths, the
   techniques described in [RFC4378] and [RFC4379] can be used.

   Where the pseudowire OAM is carried out of band (VCCV Type 2)
   [RFC5085], it is necessary to insert an "MPLS Router Alert Label" in
   the label stack.  The resultant label stack is as follows:

    +-------------------------------+
    |      MPLS Tunnel label(s)     | n*4 octets (four octets per label)
    +-------------------------------+
    |      Router Alert label       |  4 octets
    +-------------------------------+
    |      PW label                 |  4 octets
    +-------------------------------+
    |      Flow label               |  4 octets
    +-------------------------------+
    |   Optional Control Word       |  4 octets
    +-------------------------------+
    |            Payload            |
    |                               |
    |                               |  n octets
    |                               |
    +-------------------------------+

                    Figure 4: Use of Router Alert Label
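
   The stack of Figure 4 can be built as sketched below.  Label value 1
   is the Router Alert Label reserved by RFC 3032; the TTL of 255 on
   every entry and the function name vccv_type2_stack are illustrative
   assumptions.

```python
import struct

ROUTER_ALERT = 1  # reserved label value 1 is the Router Alert Label [RFC3032]


def vccv_type2_stack(tunnel_labels, pw_label, flow_label, ttl=255):
    """Return the Figure 4 label stack as bytes: tunnel label(s), Router
    Alert label, PW label, flow label. Only the last entry has S=1."""
    labels = list(tunnel_labels) + [ROUTER_ALERT, pw_label, flow_label]
    entries = []
    for i, label in enumerate(labels):
        bottom = int(i == len(labels) - 1)   # S bit only on the flow label
        entries.append(struct.pack("!I", (label << 12) | (bottom << 8) | ttl))
    return b"".join(entries)
```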

6.  Applicability

   A node within the PSN is not able to perform deep-packet-inspection
   (DPI) of the PW, as the PW technology is not self-describing: the
   structure of the PW payload is only known to the ingress and egress
   PE devices.  The method proposed in this document provides a
   statistical mitigation of the load-balancing problem in those cases
   where a PE is able to discern flows embedded in the traffic received
   on the attachment circuit.

   The methods described in this document are transparent to the PSN
   and as such do not require any new capability from the PSN.

   The requirement to load-balance over multiple PSN paths occurs when
   the ratio between the PW access speed and the PSN's core link
   bandwidth is large (e.g. >= 10%).  ATM and FR are unlikely to meet
   this property.  Ethernet may have this property, and for that reason
   this document focuses on Ethernet.  Applications for other
   high-access-bandwidth PWs (e.g. Fibre Channel) may be defined in the
   future.

   This design applies to MPLS pseudowires where it is meaningful to
   deconstruct the packets presented to the ingress PE into flows.  The
   mechanism described in this document promotes the distribution of
   flows within the pseudowire over different network paths.  This in
   turn means that whilst packets within a flow are delivered in order
   (subject to normal IP delivery perturbations due to topology
   variation), order is not maintained amongst packets of different
   flows.  It is not proposed to associate a different sequence number
   with each flow.  If sequence number support is required, this
   mechanism is not applicable.

   Where it is known that the traffic carried by the Ethernet pseudowire
   is IP, the methods of identifying the flows are well known and can be
   applied.  Such methods typically include hashing on the source and
   destination addresses, the protocol ID and higher-layer
   flow-dependent fields such as TCP/UDP ports, L2TPv3 Session IDs etc.
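
   A sketch of such a classifier for IPv4 follows, returning the source
   and destination addresses, protocol number and TCP/UDP ports; the
   result could then be hashed into a flow label value.  Reporting
   fragments without ports, so that all fragments of a datagram keep
   one flow key, is a choice made in this sketch and is not specified by
   this document.

```python
import struct

TCP, UDP = 6, 17  # IANA protocol numbers


def ipv4_five_tuple(ip: bytes):
    """Return (src, dst, protocol, src_port, dst_port) for an IPv4 packet.

    Ports are read only from unfragmented TCP/UDP packets; otherwise they
    are reported as 0. Assumes a well-formed header.
    """
    ihl = (ip[0] & 0x0F) * 4             # header length in octets
    proto = ip[9]
    (frag,) = struct.unpack_from("!H", ip, 6)
    src, dst = ip[12:16], ip[16:20]
    sport = dport = 0
    if proto in (TCP, UDP) and (frag & 0x3FFF) == 0:  # MF clear, offset 0
        sport, dport = struct.unpack_from("!HH", ip, ihl)
    return src, dst, proto, sport, dport
```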

   Where it is known that the traffic carried by the Ethernet pseudowire
   is non-IP, techniques used for link bundling between Ethernet
   switches may be reused.  In this case, however, the latency
   distribution would be larger than is found in the link bundle case.
   The acceptability of the increased latency is for further study.  Of
   particular importance, Ethernet control frames SHOULD always be
   mapped to the same PSN path to ensure in-order delivery.
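
   For the non-IP case, the sketch below keys frames on their MAC
   address pair, as link-bundling switches commonly do, and pins control
   frames to one fixed key so that they share a single PSN path.  It
   assumes, as a simplification, that the control frames of interest
   are those sent to the IEEE 802.1 reserved group addresses
   01-80-C2-00-00-00 through 01-80-C2-00-00-0F; other control protocols
   would need their own rules.  CONTROL_KEY is a hypothetical value.

```python
# IEEE 802.1 reserved group addresses 01-80-C2-00-00-00 .. 01-80-C2-00-00-0F
CONTROL_PREFIX = bytes.fromhex("0180c20000")
CONTROL_KEY = b"control"  # hypothetical fixed key for all control frames


def non_ip_flow_key(frame: bytes) -> bytes:
    """Return a flow key for a non-IP Ethernet frame."""
    dst, src = frame[0:6], frame[6:12]
    if dst[:5] == CONTROL_PREFIX and dst[5] <= 0x0F:
        return CONTROL_KEY  # every control frame -> same path, in order
    return dst + src        # otherwise key on the MAC address pair
```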

6.1.  Equal Cost Multiple Paths

   ECMP in packet switched networks is statistical in nature.  The
   mapping of flows to a particular path does not take into account the
   bandwidth of the flow being mapped or the current bandwidth usage of
   the members of the ECMP set.  This simplification works well when the
   distribution of flows is evenly spread over the ECMP set and there
   are a large number of flows that have low bandwidth relative to the
   paths.  The random allocation of a flow to a path provides a good
   approximation to an even spread of flows, provided that polarisation
   effects are avoided.  The method proposed in this document has the
   same statistical properties as an IP PSN.
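
   The statistical argument can be illustrated with a small simulation
   in which each flow is assigned to one of four paths uniformly at
   random, ignoring flow bandwidth as ECMP does.  The function spread (a
   hypothetical name) reports the flow count on the busiest path
   relative to the mean, so 1.0 means a perfectly even spread.

```python
import random
from collections import Counter


def spread(num_flows: int, num_paths: int, seed: int = 1) -> float:
    """Max flows on any path divided by the mean flows per path."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_paths) for _ in range(num_flows))
    busiest = max(counts[p] for p in range(num_paths))
    return busiest / (num_flows / num_paths)


print(f"10000 flows over 4 paths: {spread(10000, 4):.3f}")
print(f"    8 flows over 4 paths: {spread(8, 4):.3f}")
```

   With many flows the ratio stays close to 1.0, while with only a
   handful of flows it is typically well above 1.0, which is why the
   method relies on a large number of low-bandwidth flows.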

   ECMP is a load-sharing mechanism that is based on sharing the load
   over a number of layer 3 paths through the PSN.  Often, however,
   multiple links exist between a pair of LSRs that are considered by
   the IGP to be a single link.  These are known as link bundles.  The
   mechanism described in this document can also be used to distribute
   the flows within a pseudowire over the members of the link bundle by
   using the flow label value to identify candidate flows.  How that
   mapping takes place is outside the scope of this specification.
   Similar considerations apply to link aggregation groups.

   In the ECMP case and the link bundling case, the NSP may attempt to
   take bandwidth into consideration when allocating groups of flows to
   a common path.  That is permitted, but it must be borne in mind that
   the semantics of a label stack entry (LSE) as defined by [RFC3032]
   cannot be modified, the value of the flow label cannot be modified at
   any point on the LSP, and the interpretation of bit patterns in, or
   values of, the flow label by an LSR is undefined.

   A different type of load balancing is the desire to carry a
   pseudowire over a set of PSN links in which the bandwidth of members
   of the link set is less than the bandwidth of the pseudowire.  This
   problem is addressed in [I-D.stein-pwe3-pwbonding].  Such a mechanism
   can be considered complementary to this mechanism.

6.2.  Link Aggregation Groups

   A Link Aggregation Group (LAG) is used to bond together several
   physical circuits between two adjacent nodes so they appear to
   higher-layer protocols as a single, higher bandwidth "virtual" pipe.
   These may co-exist in various parts of a given network.  An advantage
   of LAGs is that they reduce the number of routing and signalling
   protocol adjacencies between devices, reducing control plane
   processing overhead.  As with ECMP, the key problem related to LAGs
   is that, due to inefficiencies in LAG load-distribution algorithms, a
   particular component of a LAG may experience congestion.  The
   mechanism proposed here may be able to assist in producing a more
   uniform flow distribution.

   The same considerations requiring a flow to go over a single member
   of an ECMP path set apply to a member of a LAG.

6.3.  The Single Large Flow Case

   Clearly the operator should make sure that the service offered using
   PW technology and the method described in this document does not
   exceed the maximum planned link capacity, unless it can be guaranteed
   that it conforms to the Internet traffic profile of a very large
   number of small flows.

   If the payload on a PW is made of a single inner flow (e.g. an
   encrypted connection between two routers), or the flow identifiers
   are too deeply buried in the packet, then the functionality described
   in this document does not give any benefits, though neither does it
   cause harm relative to the existing situation.  The most common case
   where a single flow dominates the traffic on a PW is when it is used
   to transport enterprise traffic.  Enterprise traffic may well consist
   of large single TCP flows, or of encrypted flows that cannot be
   handled by the methods described in this document.

   An operator has five options under these circumstances:

   1.  The operator can do nothing and the system will work as it does
       without the flow label.

   2.  The operator can make the customer aware that the service
       offering has a restriction on flow bandwidth and police flows to
       that restriction.  This would allow customers offering multiple
       flows to use a larger fraction of their access bandwidth, whilst
       preventing a single flow from consuming a fraction of internal
       link bandwidth that the operator considered excessive.

   3.  The operator could configure the ingress PE to assign a constant
       flow label to all high bandwidth flows so that only one path was
       affected by these flows.

   4.  The operator could configure the ingress PE to assign a random
       flow label to all high bandwidth flows so as to minimise the
       disruption to the network, at the cost of out-of-order traffic to
       the user.

   5.  The operator could configure the ingress PE to assign a label of
       special significance (such as a reserved label) to all high
       bandwidth flows so that some other action (not specified in this
       document) could be taken on the flow.
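
   Options 3 and 4 differ only in how the ingress PE picks the label for
   a flow already classified as high bandwidth.  A sketch follows.  How
   a flow is judged to be high bandwidth is outside this document; the
   policy names, HIGH_BW_LABEL and the function names are hypothetical,
   and option 5 is omitted because its action is unspecified.

```python
import hashlib
import random

LABEL_SPACE = 1 << 20
FIRST_UNRESERVED = 16
HIGH_BW_LABEL = 4000  # hypothetical constant label for option 3


def hashed_label(flow_key: bytes) -> int:
    """Normal per-flow label: stable hash folded into 16..2^20-1."""
    h = int.from_bytes(hashlib.sha256(flow_key).digest()[:4], "big")
    return FIRST_UNRESERVED + h % (LABEL_SPACE - FIRST_UNRESERVED)


def choose_flow_label(flow_key: bytes, high_bandwidth: bool,
                      policy: str, rng: random.Random) -> int:
    if not high_bandwidth:
        return hashed_label(flow_key)
    if policy == "constant":   # option 3: confine large flows to one path
        return HIGH_BW_LABEL
    if policy == "random":     # option 4: new label per packet, may reorder
        return rng.randrange(FIRST_UNRESERVED, LABEL_SPACE)
    raise ValueError(f"unknown policy: {policy!r}")
```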

   The issues described above are mitigated by the following two
   factors:

   o  Firstly, the customer of a high-bandwidth PW service has an
      incentive to get the best transport service, because an
      inefficient use of the PSN leads to jitter and eventually to loss
      of the PW's payload.

   o  Secondly, the customer is usually able to tailor their
      applications to generate many flows in the PSN.  A well-known
      example is massive data transport between servers that use many
      parallel TCP sessions.  The same technique can be used by any
      transport protocol: multiple UDP ports, multiple L2TPv3 Session
      IDs, or multiple GRE keys may be used to decompose a large flow
      into smaller components.  This approach may also be applied to
      IPsec [RFC4301], where multiple Security Parameters Indexes (SPIs)
      may be allocated to the same security association.
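   As a toy illustration of the second factor, the sketch below models
   ECMP path selection as a hash of a flow's 5-tuple taken modulo the
   number of paths.  The hash (SHA-256), the addresses, and the ports
   are hypothetical; real forwarders use implementation-specific hash
   functions.  A single TCP session always maps to one path, whereas
   many parallel sessions that differ only in source port are spread
   across the available paths.

```python
import hashlib


def ecmp_path(flow_key: bytes, num_paths: int) -> int:
    """Toy ECMP choice: hash the flow identifier, then reduce modulo paths."""
    digest = hashlib.sha256(flow_key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def paths_used(num_sessions: int, num_paths: int,
               base_src_port: int = 40000) -> set:
    """Paths taken by parallel TCP sessions between the same two servers.

    The sessions differ only in source port; the addresses and the
    destination port are hypothetical documentation values.
    """
    used = set()
    for i in range(num_sessions):
        key = f"192.0.2.1|198.51.100.2|6|{base_src_port + i}|5001".encode()
        used.add(ecmp_path(key, num_paths))
    return used
```

   With a single session, paths_used(1, 4) always contains exactly one
   path; with a few hundred sessions, all four paths are normally used.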

6.4.  MPLS-TP

   The MPLS Transport Profile (MPLS-TP) [I-D.ietf-mpls-tp-requirements]
   requirement 44 states that "MPLS-TP SHOULD support mechanisms to
   enable the reserved bandwidth of a transport path to be decreased
   without impacting the existing traffic on that transport path,
   provided that the level of existing traffic is smaller than the
   reserved bandwidth following the decrease."  Because the flow-aware
   transport of a PW reorders packets (albeit in an
   application-friendly way), it SHOULD NOT be deployed in a network
   conforming to MPLS-TP.

7.  Applicability to MPLS

   A further application of this technique would be to create a basis
   for hash diversity without having to peek below the label stack for
   IP traffic carried over LDP LSPs.  Work on the generalisation of this
   to MPLS has been described in [I-D.kompella-mpls-entropy-label].
   This can be regarded as a complementary but distinct approach:
   although similar considerations may apply to the identification of
   flows and the allocation of flow label values, the flow labels are
   imposed by different network components, and the associated
   signalling mechanisms are different.

8.  Security Considerations

   The generic pseudowire security considerations described in
   [RFC3985] apply, as do the security considerations applicable to a
   specific pseudowire type (for example, [RFC4448] in the case of an
   Ethernet pseudowire).

   The ingress PE SHOULD take steps to ensure that the load-balance
   label is not used as a covert channel.

   It is useful to give consideration to the choice of TTL value in the
   flow label stack entry [RFC3032].  The flow label is at the bottom of
   the label stack.  Therefore, even when penultimate hop popping is
   employed, it will always be preceded by the PW label on arrival at
   the PE.  The flow label TTL will therefore never be considered by
   the MPLS forwarder, and hence MAY be set to a value of 1.  This will
   prevent the packet from being inadvertently forwarded based on the
   value of the flow label.  Note that this may be a departure from the
   considerations that apply to the general MPLS case.
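   A minimal sketch of the two label stack entries discussed above,
   assuming the 32-bit layout of [RFC3032] (20-bit Label, 3-bit Exp,
   1-bit S, 8-bit TTL).  It encodes the PW label followed by the flow
   label, with the S bit and a TTL of 1 set on the flow label.  The PW
   label TTL default of 255 and the label values used below are
   hypothetical, and the PSN tunnel label(s) that would precede these
   entries are omitted.

```python
import struct


def encode_lse(label: int, exp: int, bottom: bool, ttl: int) -> bytes:
    """Encode one 32-bit MPLS label stack entry in network byte order.

    Layout (RFC 3032): Label (20 bits) | Exp (3) | S (1) | TTL (8).
    """
    if not 0 <= label < (1 << 20):
        raise ValueError("label must fit in 20 bits")
    if not 0 <= exp < 8 or not 0 <= ttl < 256:
        raise ValueError("Exp must fit in 3 bits and TTL in 8 bits")
    word = (label << 12) | (exp << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)


def decode_lse(entry: bytes) -> tuple:
    """Return (label, exp, bottom_of_stack, ttl) for one 4-byte entry."""
    (word,) = struct.unpack("!I", entry)
    return word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 0x1), word & 0xFF


def pw_and_flow_entries(pw_label: int, flow_label: int, exp: int = 0,
                        pw_ttl: int = 255) -> bytes:
    """PW label entry followed by the flow label entry.

    The flow label is the bottom of the stack (S = 1) and carries TTL 1,
    so the packet is not inadvertently forwarded on the basis of the
    flow label.  pw_ttl is a hypothetical default.
    """
    pw = encode_lse(pw_label, exp, bottom=False, ttl=pw_ttl)
    flow = encode_lse(flow_label, exp, bottom=True, ttl=1)
    return pw + flow
```

   For example, decode_lse(pw_and_flow_entries(100, 70000)[4:]) yields
   (70000, 0, True, 1).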

9.  IANA Considerations

   IANA is requested to allocate the next available value from the IETF
   Consensus range in the Pseudowire Interface Parameters Sub-TLV type
   Registry as a Flow Label indicator.

   Parameter   Length   Description

   TBD         4        Load Balancing Label

10.  Congestion Considerations

   The congestion considerations applicable to pseudowires as described
   in [RFC3985], and any additional congestion considerations developed
   at the time of publication, apply to this design.

   The ability to explicitly configure a PW to leverage the availability
   of multiple ECMP paths is beneficial to capacity planning because,
   all other parameters being constant, the statistical multiplexing of
   a larger number of smaller flows is more efficient than that of a
   smaller number of larger flows.
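   The statistical multiplexing argument can be illustrated with a toy
   model: a fixed aggregate rate is split into equal flows, each flow is
   hashed onto one of several ECMP paths (SHA-256 is used here as a
   hypothetical hash), and the peak-to-mean ratio of the resulting link
   loads is computed.  A single flow loads one path only, giving a ratio
   equal to the number of paths, while the ratio approaches 1 as the
   aggregate is divided into more, smaller flows.

```python
import hashlib


def ecmp_path(flow_id: int, num_paths: int) -> int:
    """Toy ECMP hash of a numeric flow identifier onto one of num_paths."""
    digest = hashlib.sha256(flow_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def peak_to_mean(total_rate: float, num_flows: int, num_paths: int) -> float:
    """Peak link load divided by mean link load.

    total_rate is split into num_flows equal flows, and each flow is
    placed on the path chosen by ecmp_path().  1.0 means a perfectly
    even spread; num_paths means all traffic landed on a single path.
    """
    loads = [0.0] * num_paths
    per_flow = total_rate / num_flows
    for flow_id in range(num_flows):
        loads[ecmp_path(flow_id, num_paths)] += per_flow
    return max(loads) / (total_rate / num_paths)
```

   For example, peak_to_mean(1000.0, 1, 4) returns 4.0, whereas
   splitting the same aggregate into a few thousand flows typically
   gives a value only slightly above 1.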

   Note that if the classification into flows is only performed on IP
   packets, the behaviour of those flows in the face of congestion will
   be as already defined by the IETF for packets of that type, and no
   additional congestion processing is required.

   Where non-IP flows are classified, pseudowire congestion avoidance
   must be applied to each non-IP load-balance group.

11.  Acknowledgements

   The authors wish to thank Eric Grey, Kireeti Kompella, Joerg
   Kuechemann, Wilfried Maas, Luca Martini, Mark Townsley, and Lucy Yong
   for valuable comments on this document.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3032]  Rosen, E., Tappan, D., Fedorkow, G., Rekhter, Y.,
              Farinacci, D., Li, T., and A. Conta, "MPLS Label Stack
              Encoding", RFC 3032, January 2001.

   [RFC4379]  Kompella, K. and G. Swallow, "Detecting Multi-Protocol
              Label Switched (MPLS) Data Plane Failures", RFC 4379,
              February 2006.

   [RFC4385]  Bryant, S., Swallow, G., Martini, L., and D. McPherson,
              "Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for
              Use over an MPLS PSN", RFC 4385, February 2006.

   [RFC4447]  Martini, L., Rosen, E., El-Aawar, N., Smith, T., and G.
              Heron, "Pseudowire Setup and Maintenance Using the Label
              Distribution Protocol (LDP)", RFC 4447, April 2006.

   [RFC4448]  Martini, L., Rosen, E., El-Aawar, N., and G. Heron,
              "Encapsulation Methods for Transport of Ethernet over MPLS
              Networks", RFC 4448, April 2006.

   [RFC4553]  Vainshtein, A. and YJ. Stein, "Structure-Agnostic Time
              Division Multiplexing (TDM) over Packet (SAToP)",
              RFC 4553, June 2006.

   [RFC4928]  Swallow, G., Bryant, S., and L. Andersson, "Avoiding Equal
              Cost Multipath Treatment in MPLS Networks", BCP 128,
              RFC 4928, June 2007.

   [RFC5085]  Nadeau, T. and C. Pignataro, "Pseudowire Virtual Circuit
              Connectivity Verification (VCCV): A Control Channel for
              Pseudowires", RFC 5085, December 2007.

12.2.  Informative References

   [I-D.ietf-bfd-base]
              Katz, D. and D. Ward, "Bidirectional Forwarding
              Detection", draft-ietf-bfd-base-09 (work in progress),
              February 2009.

   [I-D.ietf-mpls-lsr-self-test]
              Swallow, G., "Label Switching Router Self-Test",
              draft-ietf-mpls-lsr-self-test-07 (work in progress),
              May 2007.

   [I-D.ietf-mpls-tp-requirements]
              Niven-Jenkins, B., Brungard, D., Betts, M., Sprecher, N.,
              and S. Ueno, "MPLS-TP Requirements",
              draft-ietf-mpls-tp-requirements-10 (work in progress),
              August 2009.

   [I-D.kompella-mpls-entropy-label]
              Kompella, K. and S. Amante, "The Use of Entropy Labels in
              MPLS Forwarding", draft-kompella-mpls-entropy-label-00
              (work in progress), July 2008.

   [I-D.stein-pwe3-pwbonding]
              Stein, Y., Mendelsohn, I., and R. Insler, "PW Bonding",
              draft-stein-pwe3-pwbonding-01 (work in progress),
              November 2008.

   [RFC3985]  Bryant, S. and P. Pate, "Pseudo Wire Emulation Edge-to-
              Edge (PWE3) Architecture", RFC 3985, March 2005.

   [RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
              Internet Protocol", RFC 4301, December 2005.

   [RFC4378]  Allan, D. and T. Nadeau, "A Framework for Multi-Protocol
              Label Switching (MPLS) Operations and Management (OAM)",
              RFC 4378, February 2006.

   [RFC5286]  Atlas, A. and A. Zinin, "Basic Specification for IP Fast
              Reroute: Loop-Free Alternates", RFC 5286, September 2008.

Authors' Addresses

   Stewart Bryant (editor)
   Cisco Systems
   250 Longwater Ave
   Reading  RG2 6GB
   United Kingdom

   Phone: +44-208-824-8828
   Email: stbryant@cisco.com

   Clarence Filsfils
   Cisco Systems
   Brussels
   Belgium

   Email: cfilsfil@cisco.com

   Ulrich Drafz
   Deutsche Telekom
   Muenster
   Germany

   Email: Ulrich.Drafz@t-com.net

   Vach Kompella
   Alcatel-Lucent

   Email: vach.kompella@alcatel-lucent.com

   Joe Regan
   Alcatel-Lucent

   Email: joe.regan@alcatel-lucent.com

   Shane Amante
   Level 3 Communications

   Email: shane@castlepoint.net