Network Working Group                                             C. Lin
Internet-Draft                                                   M. Chen
Intended status: Standards Track                                   H. Li
Expires: May 12, 2022                                                H3C
                                                        November 8, 2021


Distribution of Device Discovery Information in NVMe Over RoCEv2 Storage
                           Network Using BGP
                     draft-lin-idr-bgp-nof-nlri-00

Abstract

   This document proposes a method of distributing device discovery
   information in an NVMe over RoCEv2 storage network using the BGP
   routing protocol. A new BGP Network Layer Reachability Information
   (NLRI) encoding format, named NoF NLRI, is defined.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 12, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Distribution of Device Discovery Information Using BGP
   3.  BGP Extensions
     3.1.  TLV Format
     3.2.  NoF NLRI
     3.3.  Device Discovery NLRI
       3.3.1.  IPv4 Address TLV
       3.3.2.  IPv6 Address TLV
       3.3.3.  Role Type TLV
       3.3.4.  Online/Offline Status TLV
       3.3.5.  More Device Info TLVs
     3.4.  Device Zone NLRI
     3.5.  Operations
   4.  IANA Considerations
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1. Introduction

   As data center networks keep growing, the performance of
   communication methods needs to improve accordingly. At present, NVMe
   over RoCEv2 is becoming a popular solution for Ethernet-based
   storage networks. In such a network, a host accesses an NVMe storage
   subsystem over an Ethernet fabric using the RoCEv2 protocol.

   Traditionally, hosts and storage subsystems are discovered through
   manual configuration. However, manual configuration is difficult to
   manage and maintain. In addition, the reaction is slow when a device
   goes online or offline, making it hard to support hot-plug and
   failover. To solve these problems, an automatic discovery method
   should be deployed.

   LLDP is generally used for discovery when a host or storage
   subsystem is directly connected to a switch. The device discovery
   information is then distributed to the other switches in the fabric.
   Finally, other devices get the information from the switches to
   which they are directly connected.

   This document proposes a new method of distributing device discovery
   information among switches in an NVMe over RoCEv2 storage network
   using the BGP routing protocol [RFC4271].

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2. Distribution of Device Discovery Information Using BGP

   In a hierarchical topology, a host or storage subsystem is usually
   connected to a switch at the access layer. In a Clos topology, a
   host or storage subsystem is usually connected to a "Leaf" switch.
   To keep terminology uniform, in this document the switches to which
   hosts and storage subsystems are directly connected are referred to
   as access switches.

   When a host or storage subsystem is connected to an access switch,
   it periodically sends LLDP messages to the access switch. Based on
   the received LLDP messages, the access switch maintains the state of
   its directly connected devices. If the state of any device changes,
   such as going online or offline, the access switch notifies the
   other devices connected to it. However, devices on other access
   switches may also need the device discovery information, especially
   in a large-scale storage network. For example, when a storage
   subsystem is newly connected to an access switch, a host attached to
   another access switch needs to learn that it has come online. The
   host then establishes a connection with the storage subsystem and
   transfers data using NVMe over RoCEv2. Therefore, the access
   switches are required to distribute device discovery information
   among themselves.

   In this document, the distribution of device discovery information
   among access switches is achieved by using BGP. All the access
   switches are BGP speakers, and the device discovery information is
   exchanged as BGP routes among them.

   In order to reduce the number of BGP connections, the use of BGP
   Route Reflectors [RFC4456] is recommended. Figure 1 shows an example
   of BGP connections with route reflectors. SW 1 and SW 2 serve as
   reflectors, and SW 3, SW 4, SW 5 and SW 6 are their clients. When a
   client sends a BGP route, which contains device discovery
   information, to a reflector, the reflector reflects the route to the
   other clients. Therefore, all the access switches work as clients,
   and each of them only needs to establish BGP connections to the
   reflectors, rather than establishing BGP connections with each
   other. In this example, there are two reflectors, SW 1 and SW 2,
   which run as hot standbys for each other. It is also fine to deploy
   only one reflector in the network. However, to improve availability,
   deploying more than one reflector is recommended.

          +---------+              +---------+
          |  SW 1   |              |  SW 2   |   BGP Reflector
          +---------+              +---------+
      +-----+ | | |                  | | | |
      |   +---|-|-|------------------+ | | |
      |   |   | | |    +---------------+ | |
      |   |   | | |    |                 | +-----+
      |   |   | | |    |            +----+       |
      |   |   | | +----|------------|--------+   |
      |   |   | +------|--------+   |        |   |
      |   |   +----+   |        |   |        |   |
      |   |        |   |        |   |        |   |
    +-------+    +-------+    +-------+    +-------+
    | SW 3  |    | SW 4  |    | SW 5  |    | SW 6  |   BGP Client
    +-------+    +-------+    +-------+    +-------+
      |   |        |   |        |   |        |   |
      |   |        |   |        |   |        |   |
     H3  SS3      H4  SS4      H5  SS5      H6  SS6

    SW: Switch
    H: Host
    SS: Storage Subsystem

           Figure 1: BGP Connections with Route Reflectors

   In Figure 1, the reflector switches are not directly connected to
   hosts or storage subsystems, and they are not access switches.
   Figure 2 shows another example, in which two of the access switches
   serve as BGP route reflectors. The main difference from Figure 1 is
   that the reflectors, SW 1 and SW 2, also need to establish a BGP
   connection with each other. If any device directly connected to a
   reflector goes online or offline, the reflector not only sends the
   device discovery information to its clients, but also sends it to
   the other reflectors.

            H1  SS1                  H2  SS2
             |   |                    |   |
             |   |                    |   |
          +---------+              +---------+
          |  SW 1   |--------------|  SW 2   |   BGP Reflector
          +---------+              +---------+
      +-----+ | | |                  | | | |
      |   +---|-|-|------------------+ | | |
      |   |   | | |    +---------------+ | |
      |   |   | | |    |                 | +-----+
      |   |   | | |    |            +----+       |
      |   |   | | +----|------------|--------+   |
      |   |   | +------|--------+   |        |   |
      |   |   +----+   |        |   |        |   |
      |   |        |   |        |   |        |   |
    +-------+    +-------+    +-------+    +-------+
    | SW 3  |    | SW 4  |    | SW 5  |    | SW 6  |   BGP Client
    +-------+    +-------+    +-------+    +-------+
      |   |        |   |        |   |        |   |
      |   |        |   |        |   |        |   |
     H3  SS3      H4  SS4      H5  SS5      H6  SS6

    SW: Switch
    H: Host
    SS: Storage Subsystem

            Figure 2: Access Switches Serve as Reflectors

   This document mainly focuses on the method of distributing device
   discovery information among access switches. The interaction between
   an access switch and a host, or between an access switch and a
   storage subsystem, is beyond the scope of this document.

3. BGP Extensions

   This document describes a mechanism by which device discovery
   information can be distributed using the BGP routing protocol. This
   is achieved using a new BGP Network Layer Reachability Information
   (NLRI) encoding format, named NoF NLRI.

3.1. TLV Format

   Information in the NoF NLRI is encoded in Type/Length/Value triplets.
   The TLV format is shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              Type             |             Length            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   //                     Value (variable)                        //
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 3: TLV Format

   The Length field defines the length of the value portion in octets
   (thus, a TLV with no value portion would have a length of zero). The
   TLV is not padded to 4-octet alignment. Unrecognized types MUST be
   preserved and propagated.

3.2. NoF NLRI

   New AFI and SAFI are defined for the NoF NLRI: the NoF AFI/SAFI
   (values to be assigned by the IANA).

   In order for two BGP speakers to exchange NoF NLRI, they MUST use BGP
   Capabilities Advertisement to ensure that they are both capable of
   properly processing such NLRI. This is done as specified in
   [RFC4760].

   The format of the NoF NLRI is shown in the following figure.

   +------------------+
   |       Type       |   2 octets
   +------------------+
   |      Length      |   2 octets
   +------------------+
   |     NoF NLRI     |   variable
   +------------------+

   where:

   o  Type: the type of NoF NLRI.

   o  Length: the length of the rest of the NLRI in octets, not
      including the Type field or itself.

   o  NoF NLRI: carrying the device discovery information in NVMe over
      Fabric networks.

   BGP NoF NLRI for both IPv4 and IPv6 networks can be carried over
   either an IPv4 BGP session or an IPv6 BGP session. If an IPv4 BGP
   session is used, then the next hop in the MP_REACH_NLRI SHOULD be an
   IPv4 address. Similarly, if an IPv6 BGP session is used, then the
   next hop in the MP_REACH_NLRI SHOULD be an IPv6 address. Usually,
   the next hop will be set to the local endpoint address of the BGP
   session. The next-hop address MUST be encoded as described in
   [RFC4760].
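
   As a non-normative illustration, the 4-octet Type/Length header
   shared by the TLV format in Section 3.1 and by the NoF NLRI above
   can be sketched in Python as follows. Network byte order is an
   assumption here (this document does not state the byte order), and
   all function names are hypothetical rather than taken from any BGP
   implementation.

```python
import struct
from typing import List, Tuple

# Assumption: 2-octet Type and 2-octet Length in network byte order.
HEADER = struct.Struct("!HH")


def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    """Encode one TLV. Length counts only the value octets; no padding."""
    return HEADER.pack(tlv_type, len(value)) + value


def decode_tlvs(data: bytes) -> List[Tuple[int, bytes]]:
    """Split a byte string into (type, value) pairs.

    Unrecognized types are returned unchanged so that a caller can
    preserve and propagate them.
    """
    tlvs = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < HEADER.size:
            raise ValueError("truncated TLV header")
        tlv_type, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        if len(data) - offset < length:
            raise ValueError("truncated TLV value")
        tlvs.append((tlv_type, data[offset:offset + length]))
        offset += length
    return tlvs


def encode_nof_nlri(nlri_type: int, body: bytes) -> bytes:
    """Wrap an NLRI body; Length excludes the Type and Length fields."""
    return HEADER.pack(nlri_type, len(body)) + body


def decode_nof_nlri(data: bytes) -> Tuple[int, bytes, bytes]:
    """Return (type, body, remaining octets) for the first NoF NLRI."""
    if len(data) < HEADER.size:
        raise ValueError("truncated NoF NLRI header")
    nlri_type, length = HEADER.unpack_from(data, 0)
    end = HEADER.size + length
    if len(data) < end:
        raise ValueError("truncated NoF NLRI body")
    return nlri_type, data[HEADER.size:end], data[end:]
```

   In this sketch the NLRI body is handled as opaque octets, so a
   receiver that does not understand a given NLRI type can still skip
   over it by using the Length field.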

   The Device Discovery NLRI and Device Zone NLRI are currently defined
   in this document. More types of NLRI will be included in a future
   version.

   +------+---------------------------+
   | Type | NoF NLRI Type             |
   +------+---------------------------+
   |  1   | Device Discovery NLRI     |
   |  2   | Device Zone NLRI          |
   +------+---------------------------+

3.3. Device Discovery NLRI

   The Device Discovery NLRI is used to carry the discovery information
   of directly connected devices. The format of the Device Discovery
   NLRI is shown in the following figure.

   +------------------+
   |    Router ID     |   4 octets
   +------------------+
   |   MAC Address    |   6 octets
   +------------------+
   | Port Name Length |   2 octets
   +------------------+
   |    Port Name     |   variable
   +------------------+
   |   Device Info    |   variable
   +------------------+

   where:

   o  Router ID: the Router ID of the access switch which originates
      this NLRI, usually the same as the BGP Identifier.

   o  MAC Address: the MAC Address of a connected device.

   o  Port Name Length: the length of the following Port Name field in
      octets.

   o  Port Name: the name of the connecting port, used to distinguish
      different ports which share the same MAC Address.

   o  Device Info: the specific information of the connected device and
      its connecting port, which are identified by the above MAC Address
      and Port Name fields.

   The Device Discovery NLRI carries the information of a device which
   is identified by the Router ID of the access switch and the MAC
   Address and Port Name of the connected port.

   For the purpose of BGP route key processing, only the Router ID, MAC
   Address, Port Name Length, and Port Name fields are considered to be
   part of the prefix in the NLRI.

   The Device Info field may contain the following TLVs.

3.3.1. IPv4 Address TLV

   The format of the IPv4 Address TLV is shown in the following figure.

   +------------------+
   |       Type       |   2 octets
   +------------------+
   |      Length      |   2 octets
   +------------------+
   |   IPv4 Address   |   4 octets
   +------------------+

   where:

   o  Type: 1.

   o  Length: 4.

   o  IPv4 Address: the IPv4 Address of the connecting port.

3.3.2. IPv6 Address TLV

   The format of the IPv6 Address TLV is shown in the following figure.

   +------------------+
   |       Type       |   2 octets
   +------------------+
   |      Length      |   2 octets
   +------------------+
   |   IPv6 Address   |   16 octets
   +------------------+

   where:

   o  Type: 2.

   o  Length: 16.

   o  IPv6 Address: the IPv6 Address of the connecting port.

3.3.3. Role Type TLV

   The format of the Role Type TLV is shown in the following figure.

   +------------------+
   |       Type       |   2 octets
   +------------------+
   |      Length      |   2 octets
   +------------------+
   |    Role Type     |   1 octet
   +------------------+

   where:

   o  Type: 3.

   o  Length: 1.

   o  Role Type: the role of the device. The following values are
      defined.

      *  1: storage subsystem.

      *  2: host.

      *  3: the device can serve as both a host and a storage subsystem.

3.3.4. Online/Offline Status TLV

   The format of the Online/Offline Status TLV is shown in the following
   figure.

   +------------------------+
   |          Type          |   2 octets
   +------------------------+
   |         Length         |   2 octets
   +------------------------+
   | Online/Offline Status  |   1 octet
   +------------------------+

   where:

   o  Type: 4.

   o  Length: 1.

   o  Online/Offline Status: indicates whether the device is online or
      offline. The following values are defined.

      *  0: offline.

      *  1: online.

3.3.5. More Device Info TLVs

   More Device Info TLVs will be included in a future version of this
   document.

3.4. Device Zone NLRI

   In storage networks, hosts and storage subsystems are generally
   divided into several zones.
   Only devices in the same zone are allowed to discover and
   communicate with each other.

   The Device Zone NLRI is used to distribute the zone configuration of
   a device. The format of the Device Zone NLRI is shown in the
   following figure.

   +------------------+
   |    Router ID     |   4 octets
   +------------------+
   |    IP Address    |   4 or 16 octets
   +------------------+
   | Zone Name Length |   2 octets
   +------------------+
   |    Zone Name     |   variable
   +------------------+

   where:

   o  Router ID: the Router ID of the access switch which originates
      this NLRI, usually the same as the BGP Identifier.

   o  IP Address: the IPv4 or IPv6 Address of a connected device.

   o  Zone Name Length: the length of the following Zone Name field in
      octets.

   o  Zone Name: the name of the zone to which the connected device
      belongs.

3.5. Operations

   The source of the NoF NLRI can be a dedicated module which receives
   LLDP messages and maintains the states of directly connected
   devices. On the originator of an NoF NLRI route, BGP receives
   information from the relevant module, encapsulates the information
   into an NoF NLRI route, and sends the route to its peers. On the
   receiver of an NoF NLRI route, BGP extracts the NoF NLRI from the
   route and passes the information to the relevant module.

   The NoF NLRI field may be treated as an opaque hexadecimal string,
   depending on the implementation.

4. IANA Considerations

   TBD.

5. Security Considerations

   TBD.

6. References

6.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
              "Multiprotocol Extensions for BGP-4", RFC 4760,
              DOI 10.17487/RFC4760, January 2007.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

6.2. Informative References

   [RFC4456]  Bates, T., Chen, E., and R. Chandra, "BGP Route
              Reflection: An Alternative to Full Mesh Internal BGP
              (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006.

Authors' Addresses

   Changwang Lin
   H3C

   Email: linchangwang.04414@h3c.com


   Mengxiao Chen
   H3C

   Email: chen.mengxiao@h3c.com


   Hao Li
   H3C

   Email: lihao@h3c.com