idnits 2.17.1

draft-sun-idr-bgp-ls-notification-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 8) being 77 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** There are 2 instances of too long lines in the document, the longest one
     being 4 characters in excess of 72.

  ** There are 9 instances of lines with control characters in the document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Unrecognized Status in 'Intended Status: ', assuming Proposed Standard

     (Expected one of 'Standards Track', 'Full Standard', 'Draft Standard',
     'Proposed Standard', 'Best Current Practice', 'Informational',
     'Experimental', 'Informational', 'Historic'.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC 2328' is mentioned on line 171, but not defined

  == Missing Reference: 'RFC 5880' is mentioned on line 171, but not defined

  == Unused Reference: 'RFC2328' is defined on line 314, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC4271' is defined on line 316, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC5880' is defined on line 320, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC7153' is defined on line 323, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3765' is defined on line 331, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC6286' is defined on line 334, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC6608' is defined on line 337, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC7606' is defined on line 340, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC7705' is defined on line 343, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC7752' is defined on line 347, but no explicit
     reference was found in the text

  ** Downref: Normative reference to an Informational RFC: RFC 7938

  -- Obsolete informational reference (is this intentional?): RFC 7752
     (Obsoleted by RFC 9552)


     Summary: 4 errors (**), 0 flaws (~~), 15 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   INTERNET-DRAFT                                                    M. Sun
3   Intended Status:                                            B. Pithawala
4   Expires: December 30, 2017                           HUAWEI Technologies
5                                                                     F. Gao
6                                                                  Baidu Inc
7                                                              June 28, 2017

9
10                    draft-sun-idr-bgp-ls-notification-00

12  Abstract

14     This document describes the use of a Border Gateway Protocol (BGP)
15     community. This optional transitive community instructs a router to
16     monitor its own ports. With this community, the controller only needs
17     to send a route update message once and gets feedback only if the link
18     status changes. In particular, this community can help the controller
19     get link status change notifications much faster than the current
20     method.

22  Status of this Memo

24     This Internet-Draft is submitted to IETF in full conformance with the
25     provisions of BCP 78 and BCP 79.

27     Internet-Drafts are working documents of the Internet Engineering
28     Task Force (IETF), its areas, and its working groups. Note that
29     other groups may also distribute working documents as
30     Internet-Drafts.

32     Internet-Drafts are draft documents valid for a maximum of six months
33     and may be updated, replaced, or obsoleted by other documents at any
34     time. It is inappropriate to use Internet-Drafts as reference
35     material or to cite them other than as "work in progress."

37     The list of current Internet-Drafts can be accessed at
38     http://www.ietf.org/1id-abstracts.html

40     The list of Internet-Draft Shadow Directories can be accessed at
41     http://www.ietf.org/shadow.html

43  Copyright and License Notice

45     Copyright (c) 2017 IETF Trust and the persons identified as the
46     document authors. All rights reserved.

48     This document is subject to BCP 78 and the IETF Trust's Legal
49     Provisions Relating to IETF Documents
50     (http://trustee.ietf.org/license-info) in effect on the date of
51     publication of this document. Please review these documents
52     carefully, as they describe your rights and restrictions with respect
53     to this document. Code Components extracted from this document must
54     include Simplified BSD License text as described in Section 4.e of
55     the Trust Legal Provisions and are provided without warranty as
56     described in the Simplified BSD License.

58  Table of Contents

60     1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
61       1.1  Large-scale DC Routing Solution . . . . . . . . . . . . . .  3
62       1.2  BFD protocol and Hellos Protocol . . . . . . . . . . . . . .  5
63     2. Another Centralized Link Detection Method Based on BGP . . . .  5
64       2.1  Basic Principle . . . . . . . . . . . . . . . . . . . . . .  5
65       2.2  Advantages and Benefits of this solution . . . . . . . . . .  7
66     3  IANA Considerations . . . . . . . . . . . . . . . . . . . . . .  7
67     4  References . . . . . . . . . . . . . . . . . . . . . . . . . .  8
68       4.1  Normative References . . . . . . . . . . . . . . . . . . .  8
69       4.2  Informative References . . . . . . . . . . . . . . . . . .  8
70     Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . .  8

72  1 Introduction

74     With the advent of microservices application architecture and the
75     continued advances in massively scaled distributed systems, the
76     majority of traffic traversing the data center network is within the
77     data center (east-west). This requires the data center network to have
78     deterministic latency (preferably ultra-low), high scalability, high
79     availability and low cost. To meet those requirements, current large-
80     scale data center networks are mostly based on the Clos architecture;
81     [RFC7938] shows a typical 3-layer (5-stage) Clos architecture (in
82     Figure 1, the 3 layers are Leaf, Agg, and Spine).

84                                      Spine
85                                     +-----+
86                                     |     |
87                                  +--|     |--+
88                                  |  +-----+  | +-----------------------+
89                       Agg        |           | |  Agg       POD        |
90                     +-----+      |  +-----+  | |+-----+                |
91       +-------------| DEV |------+--|     |--+-+|     |-------------+  |
92       |       +-----|  C  |------+  |     |  +-+|     |-----+       |  |
93       |       |     +-----+         +-----+    |+-----+     |       |  |
94       |       |                                |            |       |  |
95       |       |     +-----+         +-----+    |+-----+     |       |  |
96       | +-----------| DEV |------+  |     |  +-+|     |-----------+ |  |
97       | |     | +---|  D  |------+--|     |--+-+|     |---+ |     | |  |
98       | |     | |   +-----+      |  +-----+  | |+-----+   | |     | |  |
99       | |     | |                |           | |          | |     | |  |
100    +-----+ +-----+              |  +-----+  | |        +-----+ +-----+|
101    | DEV | | DEV |              +--|     |--+ |        |     | |     ||
102    |  A  | |  B  |Leaf             |     |    |  Leaf  |     | |     ||
103    +-----+ +-----+                 +-----+    |        +-----+ +-----+|
104      | |     | |                              +----------+-+-----+-+--+
105      O O     O O                                         O O     O O
106        Servers                                             Servers

108                     Figure 1 3-Layer Clos Topology

110    Note: Leaf is a switching node that is connected to servers, Agg is
111    an exchange node that aggregates Leafs, and Spine is a core exchange
       node.

113    Nowadays, the scale of this architecture can support 100k servers.
114    The number of links in the network approaches 200k. Managing
115    the large number of switches and links in a data center from a
116    controller is a difficult scaling problem.

118 1.1 Large-scale DC Routing Solution

120    [RFC7938] introduces a link detection solution based on BGP. This RFC
121    uses ebgp to connect switches (physical link) and uses ibgp to connect
122    switches and controller (logical link). The ebgp connections are made
123    using the local loopback addresses of the routers/switches. Since this
124    solution does not have any IGP in the network to convey the local
125    loopback addresses to form the ebgp connection, the solution uses a
126    centralized controller to initiate the messages to convey the loopback
127    address of a router to its neighbor. It uses a combination of ibgp
128    and ebgp connections and messages to achieve this, as shown in Figure
129    2.

131                            +----------+
132    inject Prefix     +-----+Controller+----+
133    for R1 with       |     +----------+    | expect Prefix
134    one-hop           |                     | for R1 from R2
135    community       +-++                  +-++
136                    |R1+------------------+R2|
137                    +--+   Prefix for R1  +-++
138                              relayed       | Prefix for R1
139                                          +-++NOT relayed
140                                          |R3|
141                                          +--+

143              Figure 2 One kind of link detection method

145    In Figure 2, the controller periodically sends update messages to the
146    source of the link and determines link status (the status of the link
147    connecting routers/switches) according to whether it receives the
148    update message from the link's destination node. The controller
149    periodically sends switch R1 a route message that contains only a
150    one-hop community attribute. R1 publishes this message to its neighbor
151    R2 through ebgp with the no_export attribute in it. R2 sends this
152    message to the controller through ibgp instead of sending it to R3
153    because of the no_export attribute. If the controller receives the
154    route message from R2 within the specified time, the R1->R2 status is
155    assumed to be normal. Otherwise, the R1->R2 status is down.

157    However, when link detection packets are sent at a high frequency,
158    the controller load is heavy, i.e. controller processing capacity is
159    not sufficient, and firewall devices do not accept this large flow of
160    traffic. On the other hand, when link detection packets are sent at a
161    low frequency, the network converges slowly, which can lead to loops,
162    network interruption and other issues, so network reliability is
163    unacceptable. With a single controller (multi-threaded exabgp +
164    virtual router vyatta), experimental test data shows that this
165    solution can only support 1k links and 512 servers in a non-blocking
166    network.

168 1.2 BFD protocol and Hellos Protocol

170    Existing mainstream distributed link monitoring methods are Protocol
171    Hellos [RFC 2328] and BFD protocol [RFC 5880].

173    Protocol Hellos: Since a protocol (ebgp) is initiated over the link,
174    the status of the link can be inferred from receiving periodic hellos
175    (or the lack of hellos). Protocol hellos are generally regarded as a
176    slow link detection mechanism. Increasing the frequency of hellos
177    only creates scale issues at many points in the network without
178    really providing sub-second link detection.

180    The BFD solution configures a BFD session at both ends of each link
181    that needs to be monitored. Each end sends BFD detection messages, and
182    the link is reported as failed if a detection message is not received
183    in time. BFD requires extensive configuration on different devices and
184    different ports. With VRRP tracking, 100k servers require configuring
185    200k links and 200k ends. Meanwhile, 100k servers using BFD require
186    configuring 200k links and 400k ends, which may cause unexpected
187    errors at high cost.

189 2. Another Centralized Link Detection Method Based on BGP

191 2.1 Basic Principle

193    Current large-scale DCN link detection relies on periodic detection,
194    which has many problems. When the frequency of
195    sending and receiving messages is high, the controller load will be
196    too heavy: the controller processing capacity is not sufficient and
197    firewall devices cannot accept this large flow of traffic. On the
198    other hand, when the frequency is low, the convergence speed of the
199    network will decrease. This may cause network interruption and worse
200    network reliability.

202    Compared with traditional link detection methods, this solution
203    proposes an efficient optimization that can monitor links
204    automatically. This method can greatly reduce manual configuration
205    work and avoid various types of errors and high cost. Furthermore, it
206    also eases the collection of link status notifications for the
207    controller.

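The collection of link status notifications on the controller can be
illustrated with a minimal Python sketch. It assumes, following steps 3
and 4 of this section, that an announce received from R2 means the link
is normal and a withdraw means it is at fault, and that a link is
identified by the route prefix (marking R1) together with the source IP
of the reporting ibgp peer (R2). The names LinkTable and on_update, the
status strings, and the example prefix and address are hypothetical and
not part of this document; BGP session handling and message parsing are
out of scope.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# A link is keyed by (prefix, src_ip): the prefix marks the upstream device
# (R1) and src_ip is the ibgp peer that sent the message (R2).
LinkKey = Tuple[str, str]


@dataclass
class LinkTable:
    """Hypothetical controller-side table of link status (illustrative only)."""

    status: Dict[LinkKey, str] = field(default_factory=dict)

    def on_update(self, prefix: str, src_ip: str, withdrawn: bool) -> str:
        """Record one notification: announce -> "normal", withdraw -> "fault"."""
        state = "fault" if withdrawn else "normal"
        self.status[(prefix, src_ip)] = state
        return state


table = LinkTable()
# Example values are made up: 192.0.2.1/32 stands for R1, 198.51.100.2 for R2.
table.on_update("192.0.2.1/32", "198.51.100.2", withdrawn=False)
table.on_update("192.0.2.1/32", "198.51.100.2", withdrawn=True)
print(table.status[("192.0.2.1/32", "198.51.100.2")])  # prints "fault"
```

Because the table changes only when a message arrives, the controller
does no per-link work while links are stable, which is the intent of the
non-periodic design.
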
209    In Figure 3, if the controller needs to detect the link status from
210    R1 to R2, the process is as follows.

212                         +------------+
213      +-+    +-----------+ Controller +------+  +-+
214      |1|    | ibgp1     +------------+ ibgp2|  |3|
215      +-+    |                               |  +-+
216          +--+--+                         +--+--+
217          | R1  |          ebgp           | R2  |
218          | AS1 +------------------------/+ AS2 |
219          +--+--+       +-+             / +--+--+
220             |          |2|            /     |
221             |ebgp      +-+           /      |ebgp
222          +--+--+                    /    +--+--+
223          |R5   |                   /     | R3  |
224          |AS5  |   port is        /      | AS3 |
225          +-----+   automatically         +-----+
226                    monitored

228                Figure 3 The principle of this solution

230    Step 1:

232    a) The controller sends route update message A1 to R1 (non-periodic,
233    just once), then they can establish a peer. A1 carries instructions
234    that can enable R1's port (link) status monitoring function.

236    b) The same as a), except that the target is R2.

238    c) The A1 message contains only a one-hop community attribute, and its
239    prefix is used to identify device R1.

241    Step 2:

243    When R1 receives route update message A1 from the controller, it adds
244    a no_export attribute so that A1 is published only to ebgp neighbor
245    R2. R2 publishes this route message to the controller through ibgp
246    instead of to its ebgp neighbor device R3.

248    a) R2 finds that message A1 comes from R1 according to the community
249    in A1.

251    b) Here we need to define a dedicated bit in communities to specify
252    that R2 should start to monitor its link when it receives this
253    indication. Hence, R2 starts to monitor all the links from R1 to R2 in
254    this step.

256    Step 3:

258    If R2 detects that the port (link) status monitored in step 2 b) has
259    changed from normal to fault, R2 sends the controller a withdraw
260    message through ibgp. If the status changes back to normal, R2 sends
261    the controller an announce message through ibgp.

263    Step 4:

265    When the controller receives the route A1 update message from R2:

267    a) Find the corresponding link based on the received A1 update
268    message. Prefix marks network device R1 and srcIP means
269    device R2. The (Prefix, srcIP) pair tells the controller that this
270    is the link from R1 to R2.

272    b) If the message is a route announce, the link status is normal;
273    otherwise, the withdraw type means the link status is fault.

275    It is important to note here that we do not prefer any link
276    detection mechanism, and the BGP implementation on a vendor's device
277    is free to activate any link detection mechanism it chooses (some
278    examples are BFD, an auto-sensing feature, etc.).

280 2.2 Advantages and Benefits of this solution

282    Generally speaking, we need a dedicated bit in communities that can
283    notify R2 to start monitoring the link between R1 and R2. It is quite
284    simple, but the solution has many advantages.

286    1. It needs no extra configuration and can monitor the corresponding
287    ports (links) automatically. It helps the controller learn every
288    link status with the existing BGP protocol. It avoids much manual
289    configuration and the unnecessary errors and costs caused by manual
290    configuration.

292    2. It resolves the conflict between the network's need for fast
293    convergence and the controller's capacity constraint. With this
294    solution, a network with a single controller can support 100k
295    servers, while the other method can only support 512 servers.

297    3. The performance of real-time link failure recovery is better. In
298    experiments, link failure report time drops from 3s to less than
299    50ms, and link failure recovery time drops from 1s to less than 50ms.

301 3 IANA Considerations

303    The IANA has registered Transitive Extended Community Types in
304    RFC7153. This registry contains values of the high-order octet (the
305    "Type" field) of a Transitive Extended Community.

307    This method needs only one unassigned type value to notify a device
308    to monitor the corresponding links (ports).

310 4 References

312 4.1 Normative References

314    [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

316    [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
317              Border Gateway Protocol 4 (BGP-4)", RFC 4271, January
318              2006.

320    [RFC5880] Katz, D. and D. Ward, "Bidirectional Forwarding Detection
321              (BFD)", RFC 5880, June 2010.

323    [RFC7153] Rosen, E. and Y. Rekhter, "IANA Registries for BGP Extended
324              Communities", RFC 7153, March 2014.

326    [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP
327              for Routing in Large-Scale Data Centers", RFC 7938, August 2016.

329 4.2 Informative References

331    [RFC3765] Huston, G., "NOPEER Community for Border Gateway Protocol
332              (BGP) Route Scope Control", RFC 3765, April 2004.

334    [RFC6286] Chen, E. and J. Yuan, "Autonomous-System-Wide Unique BGP
335              Identifier for BGP-4", RFC 6286, June 2011.

337    [RFC6608] Dong, J., Chen, M., and A. Suryanarayana, "Subcodes for BGP
338              Finite State Machine Error", RFC 6608, May 2012.

340    [RFC7606] Chen, E., Ed., Scudder, J., Ed., Mohapatra, P., and K. Patel,
341              "Revised Error Handling for BGP UPDATE Messages", RFC 7606,
              August 2015.

343    [RFC7705] George, W. and S. Amante, "Autonomous System Migration
344              Mechanisms and Their Effects on the BGP AS_PATH Attribute",
345              RFC 7705, November 2015.

347    [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and
348              S. Ray, "North-Bound Distribution of Link-State and Traffic
349              Engineering (TE) Information Using BGP", RFC 7752, March 2016.

351 Authors' Addresses

353    Marcus Sun
354    HUAWEI TECHNOLOGIES CO.,LTD
355    12 E. Mozhou Rd. Nanjing, Jiangsu
356    China

358    EMail: marcus.sun@huawei.com

360    Burjiz Pithawala
361    HUAWEI TECHNOLOGIES CO.,LTD
362    2330 Central Expressway, Santa Clara, CA 95050
363    US

365    EMail: burjiz.pithawala1@huawei.com

367    Feng Gao
368    BAIDU Inc.
369    10 shangdi shijie Haidian, Beijing

371    EMail: gaofeng04@baidu.com