idnits 2.17.1

draft-ietf-pwe3-fat-pw-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (October 22, 2010) is 4934 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 4379 (Obsoleted by RFC 8029)

  ** Obsolete normative reference: RFC 4447 (Obsoleted by RFC 8077)

  == Outdated reference: A later version (-02) exists of
     draft-kompella-mpls-entropy-label-01

     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

PWE3                                                      S. Bryant, Ed.
Internet-Draft                                               C. Filsfils
Intended status: Standards Track                           Cisco Systems
Expires: April 25, 2011                                         U. Drafz
                                                        Deutsche Telekom
                                                             V. Kompella
                                                                J. Regan
                                                          Alcatel-Lucent
                                                               S. Amante
                                                  Level 3 Communications
                                                        October 22, 2010

          Flow Aware Transport of Pseudowires over an MPLS PSN
                       draft-ietf-pwe3-fat-pw-05

Abstract

Where the payload carried over a pseudowire carries a number of
identifiable flows it can in some circumstances be desirable to carry
those flows over the equal cost multiple paths (ECMPs) that exist in
the packet switched network. Most forwarding engines are able to
hash based on label stacks and use this to balance flows over ECMPs.
This draft describes a method of identifying the flows, or flow
groups, to the label switched routers by including an additional
label in the label stack.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC2119 [RFC2119].

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current
Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 25, 2011.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.
Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

Table of Contents

1. Introduction
  1.1. ECMP in Label Switched Routers
  1.2. Flow Label
2. Native Service Processing Function
3. Pseudowire Forwarder
  3.1. Encapsulation
4. Signaling the Presence of the Flow Label
  4.1. Structure of Flow Label Sub-TLV
5. Multi-Segment Pseudowires
6. OAM
7. Applicability of FAT PWs
  7.1. Equal Cost Multiple Paths
  7.2. Link Aggregation Groups
  7.3. Multiple RSVP-TE Paths
  7.4. The Single Large Flow Case
  7.5. Applicability to MPLS-TP
  7.6. Asymmetric Operation
8. Applicability to MPLS
9. Security Considerations
10. IANA Considerations
11. Congestion Considerations
12. Acknowledgements
13. References
  13.1. Normative References
  13.2. Informative References
Authors' Addresses

1. Introduction

A pseudowire (PW) [RFC3985] is normally transported over a single
network path, even if Equal Cost Multiple Paths (ECMPs) exist
between the ingress and egress PW provider edge (PE) equipment
[RFC4385] [RFC4928]. This is required to preserve the
characteristics of the emulated service (e.g. to avoid misordering
SAToP pseudowire packets [RFC4553] or subjecting the packets to
unusable inter-arrival times). The use of a single path to preserve
order remains the default mode of operation of a pseudowire (PW).
The new capability proposed in this document is an OPTIONAL mode
which may be used when the use of ECMP paths is known to be
beneficial (and not harmful) to the operation of the PW.

Some pseudowires are used to transport large volumes of IP traffic
between routers at two locations. One example of this is the use of
an Ethernet pseudowire to create a virtual direct link between a pair
of routers. Such pseudowires may carry from hundreds of Mbps to
Gbps of traffic. Such pseudowires do not require strict ordering to
be preserved between packets of the pseudowire. They only require
ordering to be preserved within the context of each individual
transported IP flow. Some operators have requested the ability to
explicitly configure such a pseudowire to leverage the availability
of multiple ECMP paths. This allows for better capacity planning as
the statistical multiplexing of a larger number of smaller flows is
more efficient than with a smaller set of larger flows.
Although
Ethernet is used as an example above, the mechanisms described in
this draft are general mechanisms that may be applied to any
pseudowire type in which there are identifiable flows, and in which
there is no requirement to preserve the order between those flows.

Typically, forwarding hardware can deduce that an IP payload is being
directly carried by an MPLS label stack, and is capable of looking at
some fields in packets to construct hash buckets for conversations or
flows. However, an intermediate node has no information on the type
of pseudowire being carried in the packet. This limits the forwarder
at the intermediate node to only being able to make an ECMP choice
based on a hash of the label stack. In the case of a pseudowire
emulating a high bandwidth trunk, the granularity obtained by hashing
the default label stack is inadequate for satisfactory load-balancing.
The ingress node, however, is in the special position of being able
to look at the un-encapsulated packet and spread flows amongst any
available ECMP paths, or even any Loop-Free Alternates [RFC5286].
This draft proposes a method to introduce granularity on the hashing
of traffic running over pseudowires by introducing an additional
label, chosen by the ingress node, and placed at the bottom of the
label stack.

In addition to providing an indication of the flow structure for use
in ECMP forwarding decisions, the mechanism described in this document
may also be used to select flows for distribution over an IEEE 802.1AX
(formerly 802.3ad) link aggregation group that has been used in an
MPLS network.

1.1. ECMP in Label Switched Routers

Label switched routers commonly hash the label stack or some elements
of the label stack as a method of discriminating between flows, in
order to distribute those flows over the available equal cost
multiple paths that exist in the network.
Since the label at the
bottom of stack is usually the label most closely associated with the
flow, this normally provides the greatest entropy, and hence is
usually included in the hash. This draft describes a method of
adding an additional label at the bottom of stack in order to
facilitate the load balancing of the flows within a pseudowire over
the available ECMPs. A similar design for general MPLS use has also
been proposed [I-D.kompella-mpls-entropy-label]; however, that is
outside the scope of this draft.

An alternative method of load balancing by creating a number of
pseudowires and distributing the flows amongst them was considered,
but was rejected because:

o  It did not introduce as much entropy as the load balance label
   method.

o  It required additional pseudowires to be set up and maintained.

1.2. Flow Label

An additional label [RFC3032] is interposed between the pseudowire
label and the control word, or if the control word is not present,
between the pseudowire label and the pseudowire payload. This
additional label is called the flow label. Indivisible flows within
the pseudowire MUST be mapped to the same flow label by the ingress
PE. The flow label stimulates the correct ECMP load balancing
behaviour in the packet switched network (PSN). On receipt of the
pseudowire packet at the egress PE (which knows this additional label
is present) the flow label is discarded without processing.

Note that the flow label MUST NOT be an MPLS reserved label (values
in the range 0..15) [RFC3032], but is otherwise unconstrained by the
protocol.

Considerations of the TTL value are described in the Security
Considerations section of this document. The flow label can never
become the top label in normal operation, and hence the TTL in the
flow label is never used to determine whether the packet should be
discarded due to TTL expiry.
Therefore there is no lower limit on the TTL value.

The use of the TC bits (formerly known as the EXP bits) in the flow
label is outside the scope of this document. Unless otherwise agreed
by the ingress and egress PEs these bits MUST be set to zero by the
ingress PE and MUST be ignored by the egress PE.

2. Native Service Processing Function

The Native Service Processing (NSP) function [RFC3985] is a component
of a PE that has knowledge of the structure of the emulated service
and is able to take action on the service outside the scope of the
pseudowire. In this case it is required that the NSP in the ingress
PE identify flows, or groups of flows within the service, and
indicate the flow (group) identity of each packet as it is passed to
the pseudowire forwarder. As an example, where the PW type is
Ethernet, the NSP might parse the ingress Ethernet traffic and
consider all of the IP traffic. This traffic could then be
categorised into flows by considering all traffic with the same
source and destination address pair to be a single indivisible flow.
Since this is an NSP function, by definition, the method used to
identify a flow is outside the scope of the pseudowire design.
Similarly, since the NSP is internal to the PE, the method of flow
indication to the pseudowire forwarder is outside the scope of this
document.

3. Pseudowire Forwarder

The pseudowire forwarder must be provided with a method of mapping
flows to load balanced paths.

The forwarder must generate a label for the flow or group of flows.
How the load balance label values are determined is outside the scope
of this document; however, the load balance label allocated to a flow
MUST NOT be an MPLS reserved label and SHOULD remain constant for the
life of the flow.
It is recommended that the method chosen to
generate the load balancing labels introduces a high degree of
entropy in their values, to maximise the entropy presented to the
ECMP path selection mechanism in the LSRs in the PSN, and hence
distribute the flows as evenly as possible over the available PSN
ECMP paths. The forwarder at the ingress PE prepends the pseudowire
control word (if applicable), and then pushes the flow label,
followed by the pseudowire label.

The forwarder at the egress PE uses the pseudowire label to identify
the pseudowire. From the context associated with the pseudowire
label, the egress PE can determine whether a flow label is present.
If a flow label is present, the label is discarded.

All other pseudowire forwarding operations are unmodified by the
inclusion of the flow label.

3.1. Encapsulation

The PWE3 Protocol Stack Reference Model modified to include the flow
label is shown in Figure 1 below.

+-------------+                                +-------------+
|  Emulated   |                                |  Emulated   |
|  Ethernet   |                                |  Ethernet   |
| (including  |        Emulated Service        | (including  |
|  VLAN)      |<==============================>|  VLAN)      |
|  Services   |                                |  Services   |
+-------------+                                +-------------+
|    Flow     |                                |    Flow     |
+-------------+           Pseudowire           +-------------+
|Demultiplexer|<==============================>|Demultiplexer|
+-------------+                                +-------------+
|    PSN      |           PSN Tunnel           |    PSN      |
|   MPLS      |<==============================>|   MPLS      |
+-------------+                                +-------------+
|  Physical   |                                |  Physical   |
+-----+-------+                                +-----+-------+

        Figure 1: PWE3 Protocol Stack Reference Model

The encapsulation of a pseudowire with a flow label is shown in
Figure 2 below.

+-------------------------------+
|                               |
|           Payload             |
|                               | n octets
|                               |
+-------------------------------+
|     Optional Control Word     | 4 octets
+-------------------------------+
|          Flow label           | 4 octets
+-------------------------------+
|           PW label            | 4 octets
+-------------------------------+
|     MPLS Tunnel label(s)      | n*4 octets (four octets per label)
+-------------------------------+

    Figure 2: Encapsulation of a pseudowire with a flow label

4. Signaling the Presence of the Flow Label

When using the signalling procedures in [RFC4447], a Pseudowire
Interface Parameter Flow Label Sub-TLV (FL Sub-TLV) type is used to
synchronise the flow label states between the ingress and egress PEs.

The absence of a FL Sub-TLV indicates that the PE is unable to process
flow labels. A PE that is using PW signalling and that does not send
a FL Sub-TLV MUST NOT include a flow label in the PW packet. A PE
that is using PW signalling and which does not receive a FL Sub-TLV
from its peer MUST NOT include a flow label in the PW packet. This
preserves backwards compatibility with existing PW specifications.

A PE that wishes to send a flow label in a PW packet MUST include in
its label mapping message a FL Sub-TLV with T = 1 (see Section 4.1).

A PE that is willing to receive a flow label MUST include in its
label mapping message a FL Sub-TLV with R = 1 (see Section 4.1).

A PE that receives a label mapping message containing a FL Sub-TLV
with R = 0 MUST NOT include a flow label in the PW packet.

Thus a PE sending a FL Sub-TLV with T = 1 and receiving a FL Sub-TLV
with R = 1 MUST include a flow label in the PW packet. Under all
other combinations of FL Sub-TLV signalling a PE MUST NOT include a
flow label in the PW packet.

The signalling procedures in [RFC4447] state that "Processing of the
interface parameters should continue when unknown interface
parameters are encountered, and they MUST be silently ignored." The
signalling procedure described here is therefore backwards compatible
with existing implementations.
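
The signalling rules above reduce to a single decision per direction.
The following is a minimal sketch of that decision, not a normative
implementation. The names FlSubTlv and must_send_flow_label are
hypothetical and do not appear in this document; a value of None is
an assumed way to model a FL Sub-TLV that was not sent or not received.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlSubTlv:
    """T and R bits of a FL Sub-TLV (hypothetical container type)."""
    t: bool  # T = 1: the sender wishes to send a flow label
    r: bool  # R = 1: the sender is willing to receive a flow label


def must_send_flow_label(sent: Optional[FlSubTlv],
                         received: Optional[FlSubTlv]) -> bool:
    """Return True only when the local PE sent T = 1 and the peer sent R = 1.

    None models an absent FL Sub-TLV (assumption of this sketch), which
    always results in no flow label being included.
    """
    if sent is None or received is None:
        return False
    return sent.t and received.r


# Asymmetric case: the local PE sends flow labels but will not receive them.
local = FlSubTlv(t=True, r=False)
peer = FlSubTlv(t=False, r=True)
print(must_send_flow_label(local, peer))   # local PE includes a flow label
print(must_send_flow_label(peer, local))   # peer does not
```

Each PE evaluates the rule independently with its own Sub-TLV as
"sent" and the peer's as "received".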

If PWE3 signalling [RFC4447] is not in use for a pseudowire, then
whether the flow label is used MUST be identically provisioned in
both PEs at the pseudowire endpoints. If there is no provisioning
support for this option, the default behaviour is not to include the
flow label.

Note that what is signalled is the desire to include the flow label
in the label stack. The value of the label is a local matter for the
ingress PE, and the label value itself is not signalled.

4.1. Structure of Flow Label Sub-TLV

The structure of the Flow Label Sub-TLV is shown in Figure 3.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       FL      |    Length     |T|R|         Reserved          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 3: Flow Label Sub-TLV

Where:

o  FL is the flow label sub-TLV identifier assigned by IANA.

o  Length is the length of the TLV in octets and is 4.

o  When T=1 the PE is requesting the ability to send a PW packet that
   includes a flow label. When T=0, the PE is indicating that it
   will not send a PW packet containing a flow label.

o  When R=1 the PE is able to receive a PW packet with a flow label
   present. When R=0 the PE is unable to receive a PW packet with
   the flow label present.

o  Reserved bits MUST be zero on transmit and MUST be ignored on
   receive.

5. Multi-Segment Pseudowires

The flow label mechanism described in this document works on
multi-segment PWs without requiring modification to the Switching PEs
(S-PEs). This is because the flow label is transparent to the label
swap operation, and because interface parameter Sub-TLV signalling is
transitive.

6. OAM

The following OAM considerations apply to this method of load
balancing.

Where the OAM is only to be used to perform a basic test that the
pseudowires have been configured at the PEs, VCCV [RFC5085] messages
may be sent using any load balance pseudowire path, i.e. using any
value for the flow label.

Where it is required to verify that a pseudowire is fully functional
for all flows, VCCV [RFC5085] connection verification messages MUST be
sent over each ECMP path to the pseudowire egress PE. This problem
is difficult to solve and scales poorly. We believe that this
problem is addressed by the following two methods:

1. If a failure occurs within the PSN, this failure will normally be
   detected by the PSN's Interior Gateway Protocol (IGP) link/node
   failure detection mechanism (loss of light, bidirectional
   forwarding detection [RFC5880] or IGP hello detection), and the
   IGP convergence will naturally modify the ECMP set of network
   paths between the ingress and egress PEs. Hence the PW is only
   impacted during the normal IGP convergence time.

2. If the failure is related to the individual corruption of a
   Label Forwarding Information Base (LFIB) entry in a router,
   then only the network path using that specific entry is impacted.
   If the PW is load balanced over multiple network paths, then this
   failure can only be detected if, by chance, the transported OAM
   flow is mapped onto the impacted network path, or all paths are
   tested. This type of error may be better solved by other means
   such as LSP self test [I-D.ietf-mpls-lsr-self-test].

To troubleshoot the MPLS PSN, including multiple paths, the
techniques described in [RFC4378] and [RFC4379] can be used.

Where the pseudowire OAM is carried out of band (VCCV Type 2)
[RFC5085] it is necessary to insert an "MPLS Router Alert Label" in
the label stack.
The resultant label stack is as follows:

+-------------------------------+
|                               |
|           Payload             |
|                               | n octets
|                               |
+-------------------------------+
|     Optional Control Word     | 4 octets
+-------------------------------+
|          Flow label           | 4 octets
+-------------------------------+
|           PW label            | 4 octets
+-------------------------------+
|      Router Alert label       | 4 octets
+-------------------------------+
|     MPLS Tunnel label(s)      | n*4 octets (four octets per label)
+-------------------------------+

             Figure 4: Use of Router Alert Label

7. Applicability of FAT PWs

A node within the PSN is not able to perform deep-packet-inspection
(DPI) of the PW as the PW technology is not self-describing: the
structure of the PW payload is only known to the ingress and egress
PE devices. The method proposed in this document provides a
statistical mitigation of the problem of load balancing in those cases
where a PE is able to discern flows embedded in the traffic received
on the attachment circuit.

The methods described in this document are transparent to the PSN and
as such do not require any new capability from the PSN.

The requirement to load-balance over multiple PSN paths occurs when
the ratio between the PW access speed and the PSN's core link
bandwidth is large (e.g. >= 10%). ATM and FR are unlikely to meet
this property. Ethernet may have this property, and for that reason
this document focuses on Ethernet. Applications for other
high-access-bandwidth PWs (e.g. Fibre Channel) may be defined in the
future.

This design applies to MPLS pseudowires where it is meaningful to
deconstruct the packets presented to the ingress PE into flows. The
mechanism described in this document promotes the distribution of
flows within the pseudowire over different network paths.
This in
turn means that whilst packets within a flow are delivered in order
(subject to normal IP delivery perturbations due to topology
variation), order is not maintained amongst packets of different
flows. It is not proposed to associate a different sequence number
with each flow. If sequence number support is required this
mechanism is not applicable.

Where it is known that the traffic carried by the Ethernet pseudowire
is IP, the methods of identifying the flows are well known and can be
applied. Such methods typically include hashing on the source and
destination addresses, the protocol ID and higher-layer
flow-dependent fields such as TCP/UDP ports, L2TPv3 Session IDs, etc.

Where it is known that the traffic carried by the Ethernet pseudowire
is non-IP, techniques used for link bundling between Ethernet
switches may be reused. In this case, however, the latency
distribution would be larger than is found in the link bundle case.
The acceptability of the increased latency is for further study. Of
particular importance, Ethernet control frames SHOULD always be
mapped to the same PSN path to ensure in-order delivery.

7.1. Equal Cost Multiple Paths

ECMP in packet switched networks is statistical in nature. The
mapping of flows to a particular path does not take into account the
bandwidth of the flow being mapped or the current bandwidth usage of
the members of the ECMP set. This simplification works well when the
distribution of flows is evenly spread over the ECMP set and there
are a large number of flows that have low bandwidth relative to the
paths. The random allocation of a flow to a path provides a good
approximation to an even spread of flows, provided that polarisation
effects are avoided. The method proposed in this document has the
same statistical properties as an IP PSN.
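
As an illustration of the IP flow identification described in
Section 7 and the statistical spreading described above, the
following sketch derives a flow label from the usual IP flow fields
and encodes it as a label stack entry [RFC3032]. It is an
illustration under stated assumptions, not the method mandated by
this document: how label values are determined is out of scope here.
The function names flow_label and encode_lse, the use of SHA-256,
and the example addresses are all hypothetical choices. The sketch
avoids the reserved label range 0..15, sets the TC bits to zero,
marks the entry as bottom of stack, and uses a TTL of 1 as
recommended in the Security Considerations section.

```python
import hashlib
import struct

LABEL_BITS = 20          # width of the MPLS label field [RFC3032]
RESERVED_MAX = 15        # label values 0..15 are reserved
USABLE = (1 << LABEL_BITS) - (RESERVED_MAX + 1)


def flow_label(src: str, dst: str, proto: int,
               sport: int = 0, dport: int = 0) -> int:
    """Map the fields that identify a flow to a non-reserved 20-bit label.

    The same inputs always give the same label, so an indivisible flow
    keeps one label for its lifetime. SHA-256 is an assumption of this
    sketch; any well-mixing hash would serve.
    """
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    value = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return RESERVED_MAX + 1 + (value % USABLE)


def encode_lse(label: int, tc: int = 0, bottom: bool = True,
               ttl: int = 1) -> bytes:
    """Pack a label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    if not RESERVED_MAX < label < (1 << LABEL_BITS):
        raise ValueError("flow label must be a non-reserved 20-bit value")
    word = (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)


# Example flow: TCP (protocol 6) between two documentation addresses.
label = flow_label("192.0.2.1", "198.51.100.7", 6, 49152, 443)
print(label, encode_lse(label).hex())
```

Because the label depends only on the flow fields, all packets of a
flow hash identically at every LSR, while distinct flows are spread
over the usable label space.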

ECMP is a load-sharing mechanism that is based on sharing the load
over a number of layer 3 paths through the PSN. Often, however,
multiple links exist between a pair of LSRs that are considered by
the IGP to be a single link. These are known as link bundles. The
mechanism described in this document can also be used to distribute
the flows within a pseudowire over the members of the link bundle by
using the flow label value to identify candidate flows. How that
mapping takes place is outside the scope of this specification.
Similar considerations apply to link aggregation groups.

In the ECMP case and the link bundling case the NSP may attempt to
take bandwidth into consideration when allocating groups of flows to
a common path. That is permitted, but it must be borne in mind that
the semantics of a label stack entry (LSE) as defined by [RFC3032]
cannot be modified, the value of the flow label cannot be modified at
any point on the LSP, and the interpretation of bit patterns in, or
values of, the flow label by an LSR is undefined.

A different type of load balancing is the desire to carry a
pseudowire over a set of PSN links in which the bandwidth of members
of the link set is less than the bandwidth of the pseudowire. This
problem is addressed in [I-D.stein-pwe3-pwbonding]. Such a mechanism
can be considered complementary to this mechanism.

7.2. Link Aggregation Groups

A Link Aggregation Group (LAG) is used to bond together several
physical circuits between two adjacent nodes so they appear to
higher-layer protocols as a single, higher bandwidth "virtual" pipe.
These may co-exist in various parts of a given network. An advantage
of LAGs is that they reduce the number of routing and signalling
protocol adjacencies between devices, reducing control plane
processing overhead.
As with ECMP, the key problem related to LAGs
is that due to inefficiencies in LAG load-distribution algorithms, a
particular component of a LAG may experience congestion. The
mechanism proposed here may be able to assist in producing a more
uniform flow distribution.

The same considerations requiring a flow to go over a single member
of an ECMP path set apply to a member of a LAG.

7.3. Multiple RSVP-TE Paths

In some networks it is desirable for a Label Edge Router (LER) to be
able to load balance a PW across multiple RSVP-TE tunnels. The flow
label mechanism described in this document may be used to provide the
LER with the required flow information, and necessary entropy to
provide this type of load balancing. An example of such a case is
the use of the flow label mechanism in networks using a link bundle
with the all ones component [RFC4201].

Methods by which the LER is configured to apply this type of ECMP are
outside the scope of this document.

7.4. The Single Large Flow Case

Clearly the operator should make sure that the service offered using
PW technology and the method described in this document does not
exceed the maximum planned link capacity, unless it can be guaranteed
that it conforms to the Internet traffic profile of a very large
number of small flows.

If the payload on a PW is made of a single inner flow (e.g. an
encrypted connection between two routers), or the flow identifiers
are too deeply buried in the packet, then the functionality described
in this document does not give any benefits, though neither does it
cause harm relative to the existing situation. The most common case
where a single flow dominates the traffic on a PW is when it is used
to transport enterprise traffic. Enterprise traffic may well consist
of large single TCP flows, or encrypted flows that cannot be
handled by the methods described in this document.

An operator has five options under these circumstances:

1. The operator can do nothing and the system will work as it does
   without the flow label.

2. The operator can make the customer aware that the service
   offering has a restriction on flow bandwidth and police flows to
   that restriction. This would allow customers offering multiple
   flows to use a larger fraction of their access bandwidth, whilst
   preventing a single flow from consuming a fraction of internal
   link bandwidth that the operator considered excessive.

3. The operator could configure the ingress PE to assign a constant
   flow label to all high bandwidth flows so that only one path was
   affected by these flows.

4. The operator could configure the ingress PE to assign a random
   flow label to all high bandwidth flows so as to minimise the
   disruption to the network at the cost of out-of-order traffic to
   the user.

5. The operator could configure the ingress to assign a label of
   special significance (such as a reserved label) to all high
   bandwidth flows so that some other action (not specified in this
   document) could be taken on the flow.

The issues described above are mitigated by the following two
factors:

o  Firstly, the customer of a high-bandwidth PW service has an
   incentive to get the best transport service because an inefficient
   use of the PSN leads to jitter and eventually to loss to the PW's
   payload.

o  Secondly, the customer is usually able to tailor their
   applications to generate many flows in the PSN. A well-known
   example is massive data transport between servers which use many
   parallel TCP sessions. This same technique can be used by any
   transport protocol: multiple UDP ports, multiple L2TPv3 Session
   IDs, multiple GRE keys may be used to decompose a large flow into
   smaller components.
   This approach may be applied to IPsec
   [RFC4301] where multiple Security Parameters Indexes (SPIs) may
   be allocated to the same security association.

7.5. Applicability to MPLS-TP

The MPLS Transport Profile (MPLS-TP) [RFC5654] requirement 44 states
that "MPLS-TP MUST support mechanisms that ensure the integrity of
the transported customer's service traffic as required by its
associated SLA. Loss of integrity may be defined as packet
corruption, reordering, or loss during normal network conditions."
The flow aware transport of a PW reorders packets, and therefore
MUST NOT be deployed in a network conforming to MPLS-TP unless the
integrity requirements specified in the SLA can be satisfied.

7.6. Asymmetric Operation

The protocol defined in this document supports the asymmetric
inclusion of the FAT label. Asymmetric operation can be expected
when there is asymmetry in the bandwidth requirements making it
unprofitable for one PE to perform the flow classification, or when
that PE is otherwise unable to perform the classification but is able
to receive flow labelled packets from its peer. Asymmetric operation
of the PW may also be required when one PE has a high transmission
bandwidth requirement, but has a need to receive the entire PW on a
single interface in order to perform a processing operation that
requires the context of the complete PW (for example policing of the
egress traffic).

8. Applicability to MPLS

A further application of this technique would be to create a basis
for hash diversity without having to peek below the label stack for
IP traffic carried over LDP LSPs. Work on the generalisation of this
to MPLS has been described in [I-D.kompella-mpls-entropy-label].
   This can be regarded as a complementary, but distinct, approach
   since, although similar considerations may apply to the
   identification of flows and the allocation of flow label values, the
   flow labels are imposed by different network components, and the
   associated signalling mechanisms are different.

9.  Security Considerations

   The pseudowire generic security considerations described in [RFC3985]
   and the security considerations applicable to a specific pseudowire
   type (for example, in the case of an Ethernet pseudowire, [RFC4448])
   apply.

   The ingress PE SHOULD take steps to ensure that the load-balance
   label is not used as a covert channel.

   It is useful to give consideration to the choice of TTL value in the
   flow label stack entry [RFC3032].  The flow label is at the bottom of
   the label stack.  Therefore, even when penultimate hop popping is
   employed, it will always be preceded by the PW label on arrival at
   the PE.  The flow label TTL should therefore never be considered by
   the forwarder, and hence SHOULD be set to a value of 1.  This will
   prevent the packet being inadvertently forwarded based on the value
   of the flow label.  Note that this may be a departure from
   considerations that apply to the general MPLS case.

10.  IANA Considerations

   IANA is requested to allocate the next available value from the IETF
   Consensus range in the Pseudowire Interface Parameters Sub-TLV type
   Registry as a Flow Label indicator.  The allocation of value 17 is
   requested.

   Parameter  Length  Description
   ID

   17         4       Flow Label

11.  Congestion Considerations

   The congestion considerations applicable to pseudowires as described
   in [RFC3985] and any additional congestion considerations developed
   at the time of publication apply to this design.
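   As an illustration of the TTL recommendation in Section 9, the
   following minimal sketch builds a PW label followed by a flow label.
   It assumes the 32-bit label stack entry layout of [RFC3032] (20-bit
   label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL).
   The function names, the label values 1000 and 2000, and the PW label
   TTL of 255 are hypothetical choices made for the example and are not
   defined by this document.

```python
import struct


def encode_lse(label, tc=0, bottom=False, ttl=255):
    """Pack one MPLS label stack entry (RFC 3032 layout) into 4 bytes."""
    if not 0 <= label < (1 << 20):
        raise ValueError("label must fit in 20 bits")
    if not 0 <= tc < 8:
        raise ValueError("traffic class must fit in 3 bits")
    if not 0 <= ttl < 256:
        raise ValueError("TTL must fit in 8 bits")
    # Layout, most significant bit first: label | TC | S | TTL
    word = (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)


def decode_lse(data):
    """Unpack a 4-byte label stack entry into its fields."""
    (word,) = struct.unpack("!I", data)
    return {
        "label": word >> 12,
        "tc": (word >> 9) & 0x7,
        "bottom": bool((word >> 8) & 0x1),
        "ttl": word & 0xFF,
    }


def pw_and_flow_labels(pw_label, flow_label, pw_ttl=255):
    """Return the PW label entry followed by the flow label entry.

    The flow label is the bottom of the stack (S=1) and carries TTL=1,
    so the PW label above it is encoded with S=0. pw_ttl is an
    illustrative default, not a value taken from this document.
    """
    pw_entry = encode_lse(pw_label, bottom=False, ttl=pw_ttl)
    flow_entry = encode_lse(flow_label, bottom=True, ttl=1)
    return pw_entry + flow_entry


# Hypothetical label values, chosen only for the example.
stack = pw_and_flow_labels(1000, 2000)
print(stack.hex())            # 003e80ff007d0101
print(decode_lse(stack[4:]))  # fields of the flow label entry
```

   Decoding the last four bytes shows the flow label entry with the
   bottom-of-stack flag set and a TTL of 1, which is the combination
   that Section 9 recommends.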
   The ability to explicitly configure a PW to leverage the availability
   of multiple ECMP paths is beneficial to capacity planning as, all
   other parameters being constant, the statistical multiplexing of a
   larger number of smaller flows is more efficient than that of a
   smaller number of larger flows.

   Note that if the classification into flows is only performed on IP
   packets, the behaviour of those flows in the face of congestion will
   be as already defined by the IETF for packets of that type, and no
   additional congestion processing is required.

   Where flows that are not IP are classified, pseudowire congestion
   avoidance must be applied to each non-IP load balance group.

12.  Acknowledgements

   The authors wish to thank Eric Grey, Kireeti Kompella, Joerg
   Kuechemann, Wilfried Maas, Luca Martini, Mark Townsley, and Lucy Yong
   for valuable comments on this document.

13.  References

13.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3032]  Rosen, E., Tappan, D., Fedorkow, G., Rekhter, Y.,
              Farinacci, D., Li, T., and A. Conta, "MPLS Label Stack
              Encoding", RFC 3032, January 2001.

   [RFC4379]  Kompella, K. and G. Swallow, "Detecting Multi-Protocol
              Label Switched (MPLS) Data Plane Failures", RFC 4379,
              February 2006.

   [RFC4385]  Bryant, S., Swallow, G., Martini, L., and D. McPherson,
              "Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for
              Use over an MPLS PSN", RFC 4385, February 2006.

   [RFC4447]  Martini, L., Rosen, E., El-Aawar, N., Smith, T., and G.
              Heron, "Pseudowire Setup and Maintenance Using the Label
              Distribution Protocol (LDP)", RFC 4447, April 2006.

   [RFC4448]  Martini, L., Rosen, E., El-Aawar, N., and G. Heron,
              "Encapsulation Methods for Transport of Ethernet over MPLS
              Networks", RFC 4448, April 2006.

   [RFC4553]  Vainshtein, A. and YJ. Stein, "Structure-Agnostic Time
              Division Multiplexing (TDM) over Packet (SAToP)",
              RFC 4553, June 2006.

   [RFC4928]  Swallow, G., Bryant, S., and L. Andersson, "Avoiding Equal
              Cost Multipath Treatment in MPLS Networks", BCP 128,
              RFC 4928, June 2007.

   [RFC5085]  Nadeau, T. and C. Pignataro, "Pseudowire Virtual Circuit
              Connectivity Verification (VCCV): A Control Channel for
              Pseudowires", RFC 5085, December 2007.

13.2.  Informative References

   [I-D.ietf-mpls-lsr-self-test]
              Swallow, G., "Label Switching Router Self-Test",
              draft-ietf-mpls-lsr-self-test-07 (work in progress),
              May 2007.

   [I-D.kompella-mpls-entropy-label]
              Kompella, K. and S. Amante, "The Use of Entropy Labels in
              MPLS Forwarding", draft-kompella-mpls-entropy-label-01
              (work in progress), July 2010.

   [I-D.stein-pwe3-pwbonding]
              Stein, Y., Mendelsohn, I., and R. Insler, "PW Bonding",
              draft-stein-pwe3-pwbonding-01 (work in progress),
              November 2008.

   [RFC3985]  Bryant, S. and P. Pate, "Pseudo Wire Emulation Edge-to-Edge
              (PWE3) Architecture", RFC 3985, March 2005.

   [RFC4201]  Kompella, K., Rekhter, Y., and L. Berger, "Link Bundling
              in MPLS Traffic Engineering (TE)", RFC 4201, October 2005.

   [RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
              Internet Protocol", RFC 4301, December 2005.

   [RFC4378]  Allan, D. and T. Nadeau, "A Framework for Multi-Protocol
              Label Switching (MPLS) Operations and Management (OAM)",
              RFC 4378, February 2006.

   [RFC5286]  Atlas, A. and A. Zinin, "Basic Specification for IP Fast
              Reroute: Loop-Free Alternates", RFC 5286, September 2008.

   [RFC5654]  Niven-Jenkins, B., Brungard, D., Betts, M., Sprecher, N.,
              and S. Ueno, "Requirements of an MPLS Transport Profile",
              RFC 5654, September 2009.

   [RFC5880]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection
              (BFD)", RFC 5880, June 2010.
Authors' Addresses

   Stewart Bryant (editor)
   Cisco Systems
   250 Longwater Ave
   Reading  RG2 6GB
   United Kingdom

   Phone: +44-208-824-8828
   Email: stbryant@cisco.com

   Clarence Filsfils
   Cisco Systems
   Brussels
   Belgium

   Email: cfilsfil@cisco.com

   Ulrich Drafz
   Deutsche Telekom
   Muenster
   Germany

   Email: Ulrich.Drafz@t-com.net

   Vach Kompella
   Alcatel-Lucent

   Email: vach.kompella@alcatel-lucent.com

   Joe Regan
   Alcatel-Lucent

   Email: joe.regan@alcatel-lucent.com

   Shane Amante
   Level 3 Communications

   Email: shane@castlepoint.net