idnits 2.17.1

draft-ietf-idr-bgp-nh-cost-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Introduction section.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 121: '...ll costs in NHIB MUST be comparable wi...'

     RFC 2119 keyword, line 133: '...sed to a peer it SHOULD use cost infor...'

     RFC 2119 keyword, line 155: '...ge next-hop information MUST advertise...'

     RFC 2119 keyword, line 160: '...nd IPv6, then it MUST advertise two ca...'

     RFC 2119 keyword, line 170: '... SAFI messages MUST contain BGP COMM...'

     (13 more instances...)

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer.
     Otherwise, the disclaimer is needed and you can ignore this comment.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)
  -- The document date (March 27, 2012) is 4412 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'BGP-ORR' is mentioned on line 96, but not defined

  == Unused Reference: 'I-D.raszuk-bgp-optimal-route-reflection' is defined
     on line 292, but no explicit reference was found in the text

  == Unused Reference: 'RFC2918' is defined on line 298, but no explicit
     reference was found in the text

     Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2     Internet Engineering Task Force                            I. Varlashkin
3     Internet-Draft                                   Easynet Global Services
4     Intended status: Standards Track                               R. Raszuk
5     Expires: September 28, 2012                                 NTT MCL Inc.
6                                                               March 27, 2012

8                    Carrying next-hop cost information in BGP
9                          draft-ietf-idr-bgp-nh-cost-01

11    Abstract

13       This document describes a new BGP SAFI to exchange cost information to
14       next-hops for the purpose of calculating the best path from a peer's
15       perspective rather than the local BGP speaker's own perspective.

17    Status of this Memo

19       This Internet-Draft is submitted in full conformance with the
20       provisions of BCP 78 and BCP 79.

22       Internet-Drafts are working documents of the Internet Engineering
23       Task Force (IETF).  Note that other groups may also distribute
24       working documents as Internet-Drafts.  The list of current Internet-
25       Drafts is at http://datatracker.ietf.org/drafts/current/.

27       Internet-Drafts are draft documents valid for a maximum of six months
28       and may be updated, replaced, or obsoleted by other documents at any
29       time.  It is inappropriate to use Internet-Drafts as reference
30       material or to cite them other than as "work in progress."

32       This Internet-Draft will expire on September 28, 2012.

34    Copyright Notice

36       Copyright (c) 2012 IETF Trust and the persons identified as the
37       document authors.  All rights reserved.

39       This document is subject to BCP 78 and the IETF Trust's Legal
40       Provisions Relating to IETF Documents
41       (http://trustee.ietf.org/license-info) in effect on the date of
42       publication of this document.  Please review these documents
43       carefully, as they describe your rights and restrictions with respect
44       to this document.  Code Components extracted from this document must
45       include Simplified BSD License text as described in Section 4.e of
46       the Trust Legal Provisions and are provided without warranty as
47       described in the Simplified BSD License.

49       This document may contain material from IETF Documents or IETF
50       Contributions published or made publicly available before November
51       10, 2008.  The person(s) controlling the copyright in some of this
52       material may not have granted the IETF Trust the right to allow
53       modifications of such material outside the IETF Standards Process.
54       Without obtaining an adequate license from the person(s) controlling
55       the copyright in such materials, this document may not be modified
56       outside the IETF Standards Process, and derivative works of it may
57       not be created outside the IETF Standards Process, except to format
58       it for publication as an RFC or to translate it into languages other
59       than English.

61    Table of Contents

63       1.  Motivation
64       2.  NEXT-HOP INFORMATION BASE
65       3.  BGP BEST PATH SELECTION MODIFICATION
66       4.  USING BGP TO POPULATE NHIB
67         4.1.  NEXT-HOP SAFI
68         4.2.  CAPABILITY ADVERTISEMENT
69         4.3.  INFORMATION ENCODING
70         4.4.  SESSION ESTABLISHMENT
71         4.5.  INFORMATION EXCHANGE
72         4.6.  TERMINATION OF NH SAFI SESSION
73         4.7.  GRACEFUL RESTART AND ROUTE REFRESH
74       5.  Security considerations
75       6.  IANA Considerations
76       7.  Acknowledgment
77       8.  References
78         8.1.  Normative References
79         8.2.  Informative References
80       Appendix A.  USAGE SCENARIOS
81         A.1.  Trivial case
82         A.2.  Non-IGP based cost
83         A.3.  Multiple route-reflectors
84         A.4.  Inter-AS MPLS VPN
85         A.5.  Corner case
86       Authors' Addresses

88    1.  Motivation

90       In certain situations route-reflector clients may not get the optimum path
91       to certain destinations.  ADDPATH solves this problem by letting a
92       route-reflector advertise multiple paths for a given prefix.  If the
93       number of advertised paths is sufficiently big, route-reflector clients
94       can choose the same route as they would in the case of a full mesh.  This
95       approach however places an additional burden on the control plane.
96       Solutions proposed by [BGP-ORR] use a different approach: instead of
97       calculating the best path from the local speaker's own perspective, the
98       calculations are done using the cost from the client to the next-hops.
99       Although they eliminate the need for transmitting redundant routing
100      information between peers, there are scenarios where the cost to the
101      next-hop cannot be obtained accurately using these methods.  For
102      example, if the next-hop information itself has been learned via BGP, then
103      a simple SPF run on the link-state database won't be sufficient to obtain
104      cost information.  To address such scenarios this document proposes a
105      solution where cost information to the next-hops is carried within
106      BGP itself using a dedicated SAFI.

108   2.  NEXT-HOP INFORMATION BASE

110      To facilitate further description of the proposed solution we
111      introduce a new table for all known next-hops and the costs to them from
112      various routers on the network.

114      The Next-Hop Information Base (NHIB) stores the cost to reach a next-hop
115      from an arbitrary router on the network.  This information is essential for
116      choosing the best path from a peer's perspective rather than the BGP
117      speaker's own perspective.  In canonical form an NHIB entry is a triplet
118      (router, next-hop, cost); however, this specification does not impose any
119      restriction on how BGP implementations store that information
120      internally.  The cost in NHIB does not have to be an IGP cost, but
121      all costs in NHIB MUST be comparable with each other.

123      NHIB can be populated from various sources, both static and dynamic.
124      This document focuses on populating NHIB using BGP.  However, it is
125      possible that protocols other than BGP could also be used to populate
126      NHIB.

128   3.  BGP BEST PATH SELECTION MODIFICATION

130      This section applies regardless of the method used to populate NHIB.

132      When a BGP speaker conforming to this specification selects routes to
133      be advertised to a peer it SHOULD use cost information from NHIB
134      rather than its own IGP cost to the next-hop after step (d) of
135      Section 9.1.2.2 of [RFC4271].

137   4.  USING BGP TO POPULATE NHIB

139      This section describes an extension to the base BGP specification that
140      allows BGP to be used for exchanging next-hop information between BGP
141      speakers via a new SAFI in order to populate NHIB.  Although next-hop
142      costs are exchanged via a dedicated SAFI, this information is vital to the
143      best path selection process for other AFI/SAFI (e.g. IPv4 and IPv6
144      unicast).  It is therefore recommended that next-hop cost information
145      be exchanged before other AFI/SAFI.

147   4.1.  NEXT-HOP SAFI

149      This document introduces the Next-Hop SAFI (NH SAFI), with a value to be
150      assigned by IANA, for the purpose of exchanging information about the
151      cost to next-hops.

153   4.2.  CAPABILITY ADVERTISEMENT

155      A BGP speaker willing to exchange next-hop information MUST advertise
156      this in the OPEN message using BGP Capability Code 1 (Multiprotocol
157      Extensions, see [RFC4760]), setting AFI appropriately to indicate IPv4
158      or IPv6 and SAFI to the value assigned by IANA for NH SAFI.  Note
159      that if a BGP speaker wishes to exchange cost information for both
160      IPv4 and IPv6, then it MUST advertise two capabilities: one NH SAFI
161      for IPv4 and one NH SAFI for IPv6.

163   4.3.  INFORMATION ENCODING

165      Routers use standard BGP UPDATE messages to exchange NH SAFI
166      information.  The cost to reachable next-hops is communicated using
167      MP_REACH_NLRI (attribute 14) with the NLRI part as described below.
168      Requests are also sent using MP_REACH_NLRI.  Informing a neighbour
169      about an unreachable next-hop is done using MP_UNREACH_NLRI.  All NH
170      SAFI messages MUST contain BGP COMMUNITY attribute with value
171      NO_ADVERTISE (0xFFFFFF02) and their propagation MUST follow normal
172      BGP rules (i.e. they're not to be propagated).

174      To request the cost to a next-hop from a peer, or to inform a peer about
175      the cost to a next-hop, BGP attribute 14 is used as follows:

177      1.  AFI is set to indicate IPv4 or IPv6 (whichever is appropriate)

179      2.  SAFI is set to NH SAFI
180      3.  Network Address of Next-Hop field is zeroed out

182      4.  NLRI field is encoded as shown in the next figure

184      Format of NH SAFI NLRI is as follows:
185      +-----+------+-------+----------+------+
186      | AFI | SAFI | Flags | NEXT_HOP | cost |
187      +-----+------+-------+----------+------+

189      Flags - 1 octet field.  Least significant bit MUST be set to 1 for
190      Request and to zero for Response

192      AFI/SAFI fields can be set to one of the registered values to
193      indicate that the next-hop cost information applies only to the specified
194      AFI/SAFI.  Alternatively, when both fields are set to zero, the cost
195      information applies to any compatible AFI/SAFI negotiated with the given
196      peer.

198      Next-hop - IPv4 or IPv6 address for which cost is being communicated
199      or requested.  Type is determined from context, and length is
200      inferred from total length of attribute.

202      Cost is a 32-bit unsigned integer (value described below), and NEXT_HOP
203      is the AFI-specific address of the next-hop, the cost to which is being
204      communicated or requested.  The size of the NEXT_HOP field is inferred
205      from the total length of attribute 14.

207      To inform a peer that a particular next-hop is unreachable, the
208      MP_UNREACH_NLRI attribute is used with the same NLRI format as described
209      above.  In this case the cost field SHOULD be set to 0xFFFFFFFF.

211   4.4.  SESSION ESTABLISHMENT

213      BGP speakers willing to exchange next-hop information SHOULD NOT
214      establish more than one session for a given AFI and NH SAFI, even using
215      different transport addresses.  This can be ensured, for example, by
216      checking the peer's Router ID.

218   4.5.  INFORMATION EXCHANGE

220      Typically NH SAFI sessions will be established between route-
221      reflectors and their internal peers (both clients and non-clients).  As
222      soon as the NH SAFI session is ESTABLISHED, requests for next-hop cost
223      and information about next-hop costs MAY be sent
224      independently.  That is, a route-reflector MAY send multiple requests
225      without waiting for a response, and its peers MAY send cost information
226      before or after receiving such a request.  On the other hand, route-
227      reflectors SHOULD request cost information from their internal peers
228      as soon as possible (for the reasons stated in section "BGP best path
229      selection modification").  A BGP speaker does not need to track
230      outstanding requests to the peer.

232      When a BGP speaker receives a request for cost information it MUST
233      reply with the actual cost (not necessarily IGP cost, but whatever has
234      been chosen to be carried in NH SAFI) to the given next-hop or with the
235      cost set to all-ones, indicating that the next-hop is unreachable.  If
236      next-hop information is obtained from the sender's routing table, then the
237      sender MUST perform the lookup exactly the same way as it would for
238      resolving the next-hop in a BGP UPDATE message.  For example, for
239      non-labelled destinations (e.g. AFI/SAFI 1/1 or 2/1) the lookup would be
240      done using longest match, whereas for labelled destinations (AFI/SAFI 1/4,
241      1/128 or 2/4) exact match would be used.

243      When a BGP speaker detects a change in cost to a previously advertised
244      next-hop with a delta equal to or exceeding the configured advertisement
245      threshold, it SHOULD inform the peer by sending MP_UNREACH_NLRI as
246      described earlier.

248      When a BGP speaker discovers a new next-hop among candidate routes it
249      SHOULD request cost information from the peer.

251   4.6.  TERMINATION OF NH SAFI SESSION

253      When a BGP speaker terminates (for whatever reason) an NH SAFI session
254      with a peer, it SHOULD remove all cost information received from that
255      peer unless instructed by configuration to do otherwise.

257   4.7.  GRACEFUL RESTART AND ROUTE REFRESH

259      NH SAFI sessions could use graceful restart and route refresh
260      mechanisms in the same way as they are used for IPv4 and IPv6 unicast:
261      preservation and purge of next-hop cost information follows normal GR
262      rules.

264   5.  Security considerations

266      No new security issues are introduced to the BGP protocol by this
267      specification.

269   6.  IANA Considerations

271      IANA is requested to allocate a value for the Next-Hop Subsequent Address
272      Family Identifier.

274   7.  Acknowledgment

276      The authors would like to thank Keyur Patel, Anton Elita, and Nagendra
277      Kumar for critical reviews and feedback.

279   8.  References

281   8.1.  Normative References

283      [RFC4271]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
284                 Protocol 4 (BGP-4)", RFC 4271, January 2006.

286      [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
287                 "Multiprotocol Extensions for BGP-4", RFC 4760,
288                 January 2007.

290   8.2.  Informative References

292      [I-D.raszuk-bgp-optimal-route-reflection]
293                 Raszuk, R., Cassar, C., Aman, E., and B. Decraene, "BGP
294                 Optimal Route Reflection (BGP-ORR)",
295                 draft-raszuk-bgp-optimal-route-reflection-01 (work in
296                 progress), March 2011.

298      [RFC2918]  Chen, E., "Route Refresh Capability for BGP-4", RFC 2918,
299                 September 2000.

301   Appendix A.  USAGE SCENARIOS

303   A.1.  Trivial case

305             --+---NetA---+--
306               |          |
307               r1         r2
308               |          |
309               R1--RR-----R2
310               |     \    |
311               |      +------R4
312               R3

314      In this scenario r1 and r2 along with NetA are part of AS1; and R1-R4
315      along with RR are in AS2.

317      If RR implements non-optimized route-reflection, then it will choose
318      the path to NetA via R1 and advertise it to both R3 and R4.  Such a choice
319      is good from R3's perspective, but it results in suboptimal traffic
320      flow from R4 to NetA.

322      Using NH SAFI the route-reflector will learn that the cost from R4 to R1
323      is 8 whereas to R2 it's only 1.  RR will announce NetA to R4 with
324      next-hop set to R2, while its announcement to R3 will still have R1 as
325      next-hop.  Both R3 and R4 will now send traffic to NetA via the closest
326      exit, achieving the same behaviour as if a full iBGP mesh had been
327      configured.

329   A.2.  Non-IGP based cost

331      When it's desirable to direct traffic over an exit other than the one
332      with the smallest IGP cost, NH SAFI can be used to convey a cost which is
333      not based on IGP.  For example, a network operator may arrange exit
334      points in order of administrative preference and configure routers to
335      send this instead of the IGP cost.  The route reflector will then
336      calculate the best path based on administrative preference rather than
337      IGP metrics.

339      Network operators should exercise care to ensure that all routers up
340      to and including the exit point do not divert packets onto a different
341      path, otherwise routing loops may occur.  One way to achieve this is
342      to have consistent administrative preference among all routers.
343      Another option is to use a tunneling mechanism (e.g. MPLS-TE tunnel)
344      between the source and the exit point, provided that the router serving
345      as exit point will send packets out of the network rather than
346      diverting them to another exit point.

348   A.3.  Multiple route-reflectors

350      This example demonstrates that NH SAFI peerings are necessary only
351      between routers that already exchange other AFI/SAFI.

353                                     |
354          R1----R3---------R5----R7--+
355                |          |         |
356                RR1        |       NetA
357                |          RR2       |
358                |          |         |
359          R2----R4---------R6----R8--+
360                                     |

362      In the above network the routers R1-R4 are clients of RR1, and R5-R8
363      are clients of RR2.  RR1 and RR2 also peer with each other and use
364      ADDPATH.

366      RR2 learns about NetA from R7 and R8.  Since it sends not just the best
367      path but all paths to RR1, there is no need for RR2 to learn cost
368      information from R1 and R2 towards R7 and R8.  On the other hand RR1
369      does exchange NH SAFI information with R1 and R2 so that each of them
370      can receive routes that are best from its perspective.

372      In addition to ADDPATH, a mechanism could be devised that would allow
373      RR2 to learn how many alternative routes it needs to send to RR1.
374      For example, if NetA were also connected to R9 (not shown) but
375      all clients of RR1 prefer R7 as exit point and R9 as next-best, then
376      there is no need for RR2 to send NetA routes with next-hop R8 to RR1.

378      Discussion: the authors would like to solicit discussion on whether there
379      is sufficient interest in such a mechanism.

381   A.4.  Inter-AS MPLS VPN

383      The previous example could be transposed to the Inter-AS MPLS VPN Option C
384      scenario.  In this case route reflectors RR1 and RR2 can be in
385      different autonomous systems.  Essentially the behaviour of routers
386      remains as already described.

388   A.5.  Corner case

390             --+---NetA--+--
391               |         |
392          RR---R1        R2
393                 \      /
394                  R3---R4

396      In the above network the cost from R3 to R1 is 10; all other costs are 1.
397      If RR advertises NetA to R3 based on cost information received from
398      R3, but uses its own cost when advertising NetA to R4, a loop will
399      be formed.  This is the reason why section "BGP best path
400      selection modification" requires RR to have next-hop cost information
401      for every next-hop and every peer.

403      Note that the problem is the same as if RR did not use the extensions
404      described in this document and R3 peered directly with R1 and R2,
405      while R4 peered only with RR.

407   Authors' Addresses

409      Ilya Varlashkin
410      Easynet Global Services

412      Email: ilya.varlashkin@easynet.com
413      Robert Raszuk
414      NTT MCL Inc.
415      101 S Ellsworth Avenue Suite 350
416      San Mateo, CA 94401
417      US

419      Email: robert@raszuk.net
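
Illustrative sketch (not part of the draft)

   The NHIB of Section 2 and the per-peer selection of Section 3 can be
   sketched roughly as follows, using the trivial case of Appendix A.1.
   This is a minimal illustration under stated assumptions, not an
   implementation of the draft.  The names Nhib, best_next_hop_for_peer
   and UNREACHABLE are hypothetical, the costs from R3 are invented for
   the example (the draft only gives the costs from R4), the candidate
   routes are assumed to have tied on every earlier decision step, and
   the final tie-break on next-hop name merely stands in for the
   remaining steps of the BGP decision process.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# All-ones cost marks an unreachable next-hop (Section 4.3 uses 0xFFFFFFFF).
UNREACHABLE = 0xFFFFFFFF


@dataclass
class Nhib:
    """Hypothetical NHIB: canonical (router, next-hop, cost) triplets."""
    costs: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def update(self, router: str, next_hop: str, cost: int) -> None:
        """Record the cost from `router` to `next_hop`."""
        self.costs[(router, next_hop)] = cost

    def withdraw_peer(self, router: str) -> None:
        """Remove all costs learned from one peer (compare Section 4.6)."""
        self.costs = {k: v for k, v in self.costs.items() if k[0] != router}

    def cost(self, router: str, next_hop: str) -> int:
        """Return the stored cost, or UNREACHABLE when nothing is known."""
        return self.costs.get((router, next_hop), UNREACHABLE)


def best_next_hop_for_peer(nhib: Nhib, peer: str,
                           candidates: List[str]) -> Optional[str]:
    """Pick the candidate next-hop that is cheapest as seen from `peer`.

    Assumption: the candidates already tied on all earlier decision steps.
    Ties on cost are broken by next-hop name purely for determinism.
    """
    reachable = [nh for nh in candidates if nhib.cost(peer, nh) != UNREACHABLE]
    if not reachable:
        return None
    return min(reachable, key=lambda nh: (nhib.cost(peer, nh), nh))


# Appendix A.1: costs from R4 are taken from the text, costs from R3 are
# assumed values chosen only so that R1 is the closer exit for R3.
nhib = Nhib()
nhib.update("R4", "R1", 8)
nhib.update("R4", "R2", 1)
nhib.update("R3", "R1", 1)   # assumed
nhib.update("R3", "R2", 3)   # assumed

for peer in ("R3", "R4"):
    print(peer, "->", best_next_hop_for_peer(nhib, peer, ["R1", "R2"]))
```

   With these values the route-reflector would announce NetA to R4 with
   next-hop R2 and to R3 with next-hop R1, matching the outcome described
   in Appendix A.1.  Keying the table on the (router, next-hop) pair is one
   possible layout; the draft leaves internal storage to the
   implementation.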