TCPM working group                                                M. Fox
Internet Draft                                               C. Kassimis
Intended Status: Informational                                J. Stevens
Expires: 11/7/2015                                                   IBM
                                                             May 7, 2015

              IBM's Shared Memory Communications over RDMA
                draft-fox-tcpm-shared-memory-rdma-07.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on November 7, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   This document describes IBM's Shared Memory Communications over
   RDMA (SMC-R) protocol. This protocol provides RDMA communications to
   TCP endpoints in a manner that is transparent to socket applications.
   It further provides for dynamic discovery of partner RDMA
   capabilities and dynamic setup of RDMA connections, transparent high
   availability and load balancing when redundant RDMA network paths are
   available, and it maintains many of the traditional TCP/IP qualities
   of service such as filtering that enterprise users demand, as well as
   TCP socket semantics such as urgent data.

Table of Contents

   1. Introduction
      1.1. Summary of changes in this draft
      1.2. Protocol overview
         1.2.1. Hardware requirements
      1.3. Definition of common terms
   2. Link Architecture
      2.1. Remote Memory Buffers (RMBs)
      2.2. SMC-R Link groups
         2.2.1. Link group types
         2.2.2. Maximum number of links in link group
         2.2.3. Forming and managing link groups
         2.2.4. SMC-R link identifiers
      2.3. SMC-R resilience and load balancing
   3. SMC-R Rendezvous architecture
      3.1. TCP options
      3.2. Connection Layer Control (CLC) messages
      3.3. LLC messages
      3.4. CDC Messages
      3.5. Rendezvous flows
         3.5.1. First contact
            3.5.1.1. TCP Options pre-negotiation
            3.5.1.2. Client Proposal
            3.5.1.3. Server acceptance
            3.5.1.4. Client confirmation
            3.5.1.5. Link (QP) confirmation
            3.5.1.6. Second SMC-R link setup
               3.5.1.6.1. Client processing of "Add Link" LLC message
                          from server
               3.5.1.6.2. Server processing of "Add Link" reply LLC
                          message from the client
               3.5.1.6.3. Exchange of Rkeys on second SMC-R link
               3.5.1.6.4. Aborting SMC-R and falling back to IP
         3.5.2. Subsequent contact
            3.5.2.1. SMC-R proposal
            3.5.2.2. SMC-R acceptance
            3.5.2.3. SMC-R confirmation
            3.5.2.4. TCP data flow race with SMC Confirm CLC message
         3.5.3. First contact variation: creating a parallel link group
         3.5.4. Normal SMC-R link termination
         3.5.5. Link group management flows
            3.5.5.1. Adding and deleting links in an SMC-R link group
               3.5.5.1.1. Server initiated Add Link processing
               3.5.5.1.2. Client initiated Add Link processing
               3.5.5.1.3. Server initiated Delete Link Processing
               3.5.5.1.4. Client initiated Delete Link request
            3.5.5.2. Managing multiple Rkeys over multiple SMC-R links
                     in a link group
               3.5.5.2.1. Adding a new RMB to an SMC-R link group
               3.5.5.2.2. Deleting an RMB from an SMC-R link group
               3.5.5.2.3. Adding a new SMC-R link to a link group with
                          multiple RMBs
            3.5.5.3. Serialization of LLC exchanges, and collisions
               3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK
                          exchange
               3.5.5.3.2. Collisions during DELETE LINK exchange
               3.5.5.3.3. Collisions during CONFIRM_RKEY exchange
   4. SMC-R memory sharing architecture
      4.1. RMB element allocation considerations
      4.2. RMB and RMBE format
      4.3. RMBE control information
      4.4. Use of RMBEs
         4.4.1. Initializing and accessing RMBEs
         4.4.2. RMB element reuse and conflict resolution
      4.5. SMC-R protocol considerations
         4.5.1. SMC-R protocol optimized window size updates
         4.5.2. Small data sends
         4.5.3. TCP Keepalive processing
      4.6. TCP connection failover between SMC-R links
         4.6.1. Validating data integrity
         4.6.2. Resuming the TCP connection on a new SMCR link
      4.7. RMB data flows
         4.7.1. Scenario 1: Send flow, window size unconstrained
         4.7.2. Scenario 2: Send/Receive flow, window unconstrained
         4.7.3. Scenario 3: Send Flow, window constrained
         4.7.4. Scenario 4: Large send, flow control, full window size
                writes
         4.7.5. Scenario 5: Send flow, urgent data, window size
                unconstrained
         4.7.6. Scenario 6: Send flow, urgent data, window size closed
      4.8. Connection termination
         4.8.1. Normal SMC-R connection termination flows
            4.8.1.1. Abnormal SMC-R connection termination flows
            4.8.1.2. Other SMC-R connection termination conditions
   5. Security considerations
      5.1. VLAN considerations
      5.2. Firewall considerations
      5.3. Host-based IP Filters
      5.4. Intrusion Detection Services
      5.5. IP Security (IPSec)
      5.6. TLS/SSL
   6. IANA considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments
   9. Conventions used in this document
   Appendix A. Formats
      A.1. TCP option
      A.2. CLC messages
         A.2.1. Peer ID format
         A.2.2. SMC Proposal CLC message format
         A.2.3. SMC Accept CLC message format
         A.2.4. SMC Confirm CLC message format
         A.2.5. SMC Decline CLC message format
      A.3. LLC messages
         A.3.1. CONFIRM LINK LLC message format
         A.3.2. ADD LINK LLC message format
         A.3.3. ADD LINK CONTINUATION LLC message format
         A.3.4. DELETE LINK LLC message format
         A.3.5. CONFIRM RKEY LLC message format
         A.3.6. CONFIRM RKEY CONTINUATION LLC message format
         A.3.7. DELETE RKEY LLC message format
         A.3.8. TEST LINK LLC message format
   Appendix B. Socket API considerations
   Appendix C. Rendezvous Error scenarios
      C.1. SMC Decline during CLC negotiation
      C.2. SMC Decline during LLC negotiation
      C.3. The SMC Decline window
      C.4. Out of synch conditions during SMC-R negotiation
      C.5. Timeouts during CLC negotiation
      C.6. Protocol errors during CLC negotiation
      C.7. Timeouts during LLC negotiation
         C.7.1. Recovery actions for LLC timeouts and failures
      C.8. Failure to add second SMC-R link to a link group

1. Introduction

   This document specifies IBM's Shared Memory Communications over
   RDMA (SMC-R) protocol. SMC-R is a protocol for Remote Direct Memory
   Access (RDMA) communication between TCP socket endpoints. SMC-R
   runs over networks that support RDMA over Converged Ethernet (RoCE).
   It is designed to permit existing TCP applications to benefit from
   RDMA without requiring modifications to the applications or
   predefinition of RDMA partners.

   SMC-R provides dynamic discovery of the RDMA capabilities of TCP
   peers and automatic setup of RDMA connections that those peers can
   use. SMC-R also provides transparent high availability and load
   balancing capabilities that are demanded by enterprise installations
   but are missing from current RDMA protocols. If redundant RoCE
   capable hardware such as RDMA NICs (RNICs) and RoCE capable switches
   is present, SMC-R can load balance over that redundant hardware and
   can also non-disruptively move TCP traffic from failed paths to
   surviving paths, all seamlessly to the application and the sockets
   layer. Because SMC-R preserves socket semantics and the TCP three-way
   handshake, many TCP qualities of service such as filtering, load
   balancing, and SSL encryption are preserved, as are TCP features such
   as urgent data.

   Because of the dynamic discovery and setup of SMC-R connectivity
   between peers, no RDMA connection manager (RDMA-CM) is required. This
   also means that support for UD queue pairs is not required.

   It is recommended that the SMC-R services be implemented in kernel
   space, which enables optimizations such as resource sharing between
   connections across multiple processes and also permits applications
   using SMC-R to spawn multiple processes (e.g. fork) without losing
   SMC-R functionality. A user space implementation is compatible with
   this architecture, but it may not support spawned processes (i.e.
   fork), which limits sharing and resource optimization to TCP
   connections that originate from the same process. This might be an
   appropriate design choice if the use case is a system that hosts a
   large single process application that creates many TCP connections to
   a peer host, or in implementations where a kernel space
   implementation is not possible or introduces excessive overhead for
   kernel space to user space context switches.

1.1. Summary of changes in this draft

   Changed the title to add "IBM's".

1.2. Protocol overview

   SMC-R defines the concept of the SMC-R Link, which is a logical
   point-to-point link using reliably connected queue pairs between
   TCP/IP stack peers over a RoCE fabric. An SMC-R link is bound to a
   specific hardware path, meaning a specific RNIC on each peer. SMC-R
   links are created and maintained by an SMC-R layer, which may reside
   in kernel or user space depending upon operating system and
   implementation requirements. The SMC-R layer resides below the
   sockets layer and directs data traffic for TCP connections between
   connected peers over the RoCE fabric using RDMA rather than over a
   TCP connection. The TCP/IP stack with its fragmentation,
   packetization, etc. requirements is bypassed and the application data
   is moved between peers using RDMA.

   Multiple SMC-R links between the same two TCP/IP stack peers are also
   supported. A set of SMC-R links called a link group can be logically
   bonded together to provide redundant connectivity. If there is
   redundant hardware, for example two RNICs on each peer, separate
   SMC-R links are created between the peers to exploit that redundant
   hardware. The link group architecture with redundant links provides
   load balancing, increased bandwidth, and seamless failover.

   Each SMC-R link group is associated with areas of memory called
   Remote Memory Buffers (RMBs), which are available for SMC-R peers to
   write into using RDMA writes. Multiple TCP connections between peers
   may be multiplexed over a single SMC-R link, in which case the SMC-R
   layer manages the partitioning of the RMBs between the TCP
   connections. This multiplexing reduces the RDMA resources such as
   queue pairs and RMBs that are required to support multiple
   connections between peers, and also reduces the processing and delays
   related to setting up queue pairs, pinning memory, and other RDMA
   setup tasks when new TCP connections are created. In a kernel space
   SMC-R implementation in which the RMBs reside in kernel storage, this
   sharing and optimization works across multiple processes executing on
   the same host. In a user space SMC-R implementation in which the RMBs
   reside in user space, this sharing and optimization is limited to
   multiple TCP connections created by a single process, as separate
   RMBs and QPs will be required for each process.

   SMC-R also introduces a rendezvous protocol that is used to
   dynamically discover the RDMA capabilities of TCP connection partners
   and exchange credentials necessary to exploit that capability if
   present. TCP connections are set up using the normal TCP 3-way
   handshake, with the addition of a new TCP option that indicates SMC-R
   capability. If both partners indicate SMC-R capability then at the
   completion of the 3-way TCP handshake the SMC-R layers in each peer
   take control of the TCP connection and use it to exchange additional
   connection level control (CLC) messages to negotiate SMC-R
   credentials such as queue pair (QP) information, addressability over
   the RoCE fabric, RMB buffer sizes, keys and addresses for accessing
   RMBs over RDMA, etc. If at any time during this negotiation a
   failure or decline occurs, the TCP connection falls back to using the
   IP fabric.

   If the SMC-R negotiation succeeds and either a new SMC-R link is set
   up or an existing SMC-R link is chosen for the TCP connection, then
   the SMC-R layers open the sockets to the applications and the
   applications use the sockets as normal. The SMC-R layer intercepts
   the socket reads and writes and moves the TCP connection data over
   the SMC-R link, "out of band" to the TCP connection which remains
   open and idle over the IP fabric, except for termination flows and
   possible keepalive flows. Regular TCP sequence numbering methods are
   used for the TCP flows that do occur; data flowing over RDMA does not
   use or affect TCP sequence numbers.

   This architecture does not support fallback of active SMC-R
   connections to IP. Once connection data has completed the switch to
   RDMA, a TCP connection cannot be switched back to IP and will reset
   if RDMA becomes unusable.

   The SMC-R protocol defines the format of the Remote Memory Buffers
   that are used to receive TCP connection data written over RDMA, as
   well as the semantics for managing and writing to these buffers using
   Connection Data Control (CDC) messages.

   Finally, SMC-R defines link level control (LLC) messages that are
   exchanged over the RoCE fabric between peer SMC-R layers to manage
   the SMC-R links and link groups. These include messages to test and
   confirm connectivity over an SMC-R link, add and delete SMC-R links
   to or from the link group, and exchange RMB addressability
   information.

1.2.1. Hardware requirements

   SMC-R does not require full Converged Enhanced Ethernet switch
   functionality. SMC-R functions over standard Ethernet fabrics
   provided that the endpoints have RNICs and that IEEE 802.3x Global
   Pause Frame is supported and enabled in the switch fabric.

   While SMC-R as specified in this document is designed to operate over
   RoCE fabrics, adjustments to the rendezvous methods could enable it
   to run over other RDMA fabrics such as InfiniBand and iWARP.

1.3. Definition of common terms

   This section provides definitions of terms that have a specific
   meaning to the SMC-R protocol and are used throughout this document.

   SMC-R link

      An SMC-R Link is a logical point to point connection over the
      RoCE fabric via specific physical adapters (MAC/GID). The Link
      is formed during the first contact sequence of the TCP/IP 3 way
      handshake sequence that occurs over the IP fabric. During this
      handshake an RDMA RC-QP connection is formed between the two peer
      SMC hosts and is defined as the SMC Link. The SMC Link can then
      support multiple TCP connections between the two peers. An SMC
      link is associated with a single LAN (or VLAN) segment and is not
      routable.

   SMC-R link group

      An SMC-R Link Group is a group of SMC-R Links typically each over
      unique RoCE adapters between the same two SMC-R peers. Each link
      in the link group has equal characteristics such as the same VLAN
      ID (if VLANs are in use), access to the same RMB(s) and the same
      TCP server / client.

   SMC-R peer

      The SMC-R Peer is the peer software stack within the peer
      Operating System with respect to the Shared Memory Communications
      (messaging) protocol.

   SMC-R Rendezvous

      The SMC-R Rendezvous is the SMC-R peer discovery and handshake
      sequence that occurs transparently over the IP (Ethernet) fabric
      during and immediately after the TCP connection 3 way handshake
      by exchanging the SMC capabilities and credentials using an
      experimental TCP option and CLC messages.

   TCP Client

      The TCP socket-based peer that initiates a TCP connection.

   TCP Server

      The TCP socket-based peer that accepts a TCP connection.

   CLC messages

      The SMC-R protocol defines a set of Connection Layer Control
      Messages that flow over the TCP connection that are used to
      manage SMC link rendezvous at TCP connection setup time. This
      mechanism is analogous to SSL setup messages.

   LLC Commands

      The SMC-R protocol defines a set of RoCE Link Layer Control
      Commands that flow over the RoCE fabric using RDMA sendmsg, that
      are used to manage SMC Links, SMC Link Groups and SMC Link Group
      RMB expansion and contraction.

   CDC message

      The SMC-R protocol defines a Connection Data Control message that
      flows over the RoCE fabric using RDMA sendmsg that is used to
      manage the SMC-R connection data. This message provides
      information about data being transferred over the out of band
      RDMA connection, such as data cursors, sequence numbers, and data
      flags (for example urgent data). The receipt of this message
      also provides an interrupt to inform the receiver that it has
      received RDMA data.

   RMB

      A Remote (RDMA) Memory Buffer is a fixed or pinned buffer
      allocated in each of the peer hosts for a TCP (via SMC-R)
      connection. The RMB is registered to the RNIC and allows remote
      access by the remote peer using RDMA semantics. Each host is
      passed the peer's RMB specific access information (RKey and RMB
      Element offset) during the SMC-R rendezvous process. The host
      stores socket application user data directly into the peer's RMB
      using RDMA over RoCE.

   Rtoken

      The combination of an RMB's Rkey and RDMA virtual addressing, an
      Rtoken provides addressability to an RMB to an RDMA peer.

   RMBE

      The Remote Memory Buffer Element is an area of an RMB that is
      allocated to a specific TCP connection. The RMBE contains data
      for the TCP connection. The RMBE represents the TCP receive
      buffer whereby the remote peer writes into the RMBE and the local
      peer reads from the local RMBE. The alert token resolves to a
      specific RMBE.

   Alert Token

      The SMC-R alert token is a four byte value that uniquely
      identifies the TCP connection over an SMC-R connection. The
      alert token allows the SMC peer to quickly identify the target
      TCP connection that now has new work. The format of the token is
      defined by the owning SMC-R end point and is considered opaque to
      the remote peer. However the token should not simply be an index
      to an RMB element; it should reference a TCP connection and be
      able to be validated to avoid reading data from stale
      connections.

   RNIC

      The RDMA capable Network Interface Card (RNIC) is an Ethernet NIC
      that supports RDMA semantics and verbs using RoCE.

   First Contact

      Describes an SMC-R negotiation to set up the first link in a link
      group.

   Subsequent Contact

      Describes an SMC-R negotiation between peers who are using an
      already existing SMC-R link group.

2. Link Architecture

   An SMC-R link is based on reliably connected queue pairs (QPs) that
   form a "logical point to point link" between the two SMC-R peers over
   a RoCE fabric. An SMC-R link extends from SMC-R peer to SMC-R peer,
   where typically each peer is a TCP/IP stack residing on a separate
   host.

                         ,,.--..,_
   +----+            _-``         `-,            +-----+
   |QP 8|           -      RoCE      ',          |QP 64|
   |    |          /      VLAN M       .         |     |
   +----+--------+/                     \+-------+-----+
   |   RNIC 1    |      SMC-R Link       |   RNIC 2    |
   |             |<--------------------->|             |
   +-------------+ ,                    /+-------------+
    MAC A (GID A)                         MAC B (GID B)
                    .                 .`
                      `',         ,-`
                        ``''--''``

                    Figure 1 SMC-R Link Overview

   Figure 1 illustrates an overview of the basic concepts of SMC-R peer
   to peer connectivity which is called the SMC-R Link. The SMC-R Link
   forms a logical point to point connection between two SMC-R peers via
   RoCE. The SMC Link is defined and identified by the following
   attributes:

   SMC-R Link = RC QPs (source VMAC GID QP + target VMAC GID QP + VLAN
   ID)

   The SMC-R Link can optionally be associated with a VLAN ID. If VLANs
   are in use for the associated IP (LAN) connection then the VLAN
   attribute is carried over on the SMC-R link. When VLANs are in use
   each SMC-R link group is associated with a single and specific VLAN.
   The RoCE fabric is the same physical Ethernet LAN used for standard
   TCP/IP over Ethernet communications, with switches as described in
   1.2.1.

   An SMC-R Link is designed to support multiple TCP connections between
   the same two peers. An SMC Link is intended to be long lived while
   the underlying TCP connections can dynamically come and go. The
   associated RMBs can also be dynamically added and removed from the
   link as needed. The first TCP connection between the peers
   establishes the SMC-R link. Subsequent TCP connections then use the
   previously established link. When the last TCP connection terminates
   the link can then be terminated, typically after an implementation
   defined idle time-out period has elapsed. The TCP server is
   responsible for initiating and terminating the SMC Link.

2.1. Remote Memory Buffers (RMBs)

   Figure 2 shows the hosts X and Y and their associated RMBs within
   each host. With the SMC-R link and the associated RMB keys (Rkeys)
   and RDMA virtual addresses each SMC-R enabled TCP/IP stack can
   remotely access its peer's RMBs using RDMA. The RKeys and virtual
   addresses are exchanged during the rendezvous processing when the
   link is established. The combination of the Rkey and the virtual
   address is the Rtoken. Note that the SMC-R Link ends at the QP
   providing access to the RMB (via the Link + RToken).

          Host X                                     Host Y
   +-------------------+       ,.--.,_        +-------------------+
   |                   |     .'`      '.      |                   |
   |    Protection     |   ,'            `,   |    Protection     |
   |     Domain X      |  /                \  |     Domain Y      |
   |            +------+ /                  \ +------+            |
   |  QP 8      |RNIC 1| |    SMC-R Link    | |RNIC 2|      QP 64 |
   |   |        |      |<-------------------->|      |        |   |
   |   |        |      ||                    ||      |        |   |
   |   |        +------+|       VLAN A       |+------+        |   |
   |   |               ||                    ||               |   |
   |   |               | |       RoCE       | |               |   |
   |   |RToken(X)      |  \                /  |      RToken(Y)|   |
   |   |               |   \              /   |               |   |
   |   V               |    `.          ,'    |               V   |
   | +--------+        |      '._     ,'      |        +--------+ |
   | |        |        |        `''-'``       |        |        | |
   | |  RMB   |        |                      |        |  RMB   | |
   | |        |        |                      |        |        | |
   | +--------+        |                      |        +--------+ |
   +-------------------+                      +-------------------+

                    Figure 2 SMC link and RMBs

   An SMC-R link can support multiple RMBs which are independently
   managed by each peer. The number of and the size of RMBs are managed
   by the peers based on host unique memory management requirements;
   however the maximum number of RMBs that can be associated to a link
   group on one peer is 255. The QP has a single protection domain, but
   each RMB has a unique RToken. All RTokens must be exchanged with the
   peer.

   Each peer manages the RMBs in its local memory for its remote SMC-R
   peer by sharing access to the RMBs via Rtokens with its peers. The
   remote peer writes into the RMBs via RDMA and the local peer (RMB
   owner) then reads from the RMBs.

   When two peers decide to use SMC-R for a given TCP connection, they
   each allocate a local RMB Element for the TCP connection and
   communicate the location of this local RMB Element during rendezvous
   processing. To that end, RMB elements are created in pairs, with one
   RMB element allocated locally on each peer of the SMC-R link.
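
   As an illustration only, the Rtoken and the per-peer limit of 255
   RMBs per link group described above can be modeled with the Python
   sketch below. The names RToken, LinkGroupRmbs, and add_rmb are
   hypothetical and are not defined by SMC-R. The sketch covers only
   local bookkeeping; RNIC memory registration and the exchange of
   RTokens with the peer are outside its scope.

```python
from dataclasses import dataclass, field

# Limit stated in this section: at most 255 RMBs per link group on one peer.
MAX_RMBS_PER_LINK_GROUP = 255


@dataclass(frozen=True)
class RToken:
    """Rkey plus RDMA virtual address; together they address one RMB."""
    rkey: int
    virtual_address: int


@dataclass
class LinkGroupRmbs:
    """Hypothetical record of one peer's local RMBs for one link group."""
    rtokens: list = field(default_factory=list)

    def add_rmb(self, rkey: int, virtual_address: int) -> RToken:
        # Enforce the 255-RMB ceiling before accepting a new buffer.
        if len(self.rtokens) >= MAX_RMBS_PER_LINK_GROUP:
            raise RuntimeError("link group already holds 255 RMBs")
        token = RToken(rkey, virtual_address)
        # Each RMB has a unique RToken, so duplicates are rejected.
        if token in self.rtokens:
            raise ValueError("RToken already in use by another RMB")
        self.rtokens.append(token)
        # The caller is expected to send this RToken to the peer.
        return token
```

   A peer that hits the ceiling would, per the text, fall back to IP
   for subsequent connections or, if it is the server, create a
   parallel link group; the sketch simply raises an error at that
   point.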
558 --- +-----------+----------------+ 559 /\ |Eyecatcher | | 560 | +-----------+ | 561 | | | 562 RMB Element 1 | | 563 | | Receive Buffer | 564 | | | 565 | | | 566 \/ | | 567 --- +-----------+----------------+ 568 /\ |Eyecatcher | | 569 | +-----------+ | 570 | | | 571 RMB Element 2 | | 572 | | Receive Buffer | 573 | | | 574 | | | 575 \/ | | 576 --- +----------------------------+ 577 | . | 578 | . | 579 | . | 580 | . | 581 | (up to 255 elements) | 582 +----------------------------+ 583 Figure 3 RMB Format 585 Figure 3 illustrates the basic format of an RMB. The RMB is a virtual 586 memory buffer whose backing real memory is pinned, which can support 587 up to 255 TCP connections to exactly one remote SMC-R peer. Each RMB 588 is therefore associated with the SMC-R links within a link group for 589 the two peers and a specific RoCE Protection Domain. Other than the 2 590 peers identified by the SMC-R link no other SMC-R peers can have RDMA 591 access to an RMB; this requires a unique Protection Domain for every 592 SMC-R Link. This is critical to ensure integrity of SMC-R 593 communications. 595 RMBs are subdivided into multiple elements for efficiency, with each 596 RMBE element (RMBE) is associated with a single TCP connection. 597 Therefore multiple TCP connections across an SMC link group can share 598 the same memory for RDMA purposes, reducing the overhead of having to 599 register additional memory with the RNIC for every new TCP 600 connection. The number of elements in an RMB and the size of each RMB 601 Element is entirely governed by the owning peer subject to the SMC-R 602 architecture rules, however, all RMB elements within a given RMB must 603 be the same size. Each peer can decide the level of resource sharing 604 that is desirable across TCP connections based on local constraints 605 such as available system memory, etc. 
An RMB Element is identified to 606 the remote SMC-R peer via an RMB Element Token which consists of the 607 following: 609 o RMB RToken: The combination of the Rkey and virtual address 610 provided by the RNIC that identifies the start of the RMB for RDMA 611 operations. 613 o RMB Index: Identifies the RMB element index in the RMB. Used to 614 locate a specific RMB element within an RMB. Valid value range is 615 1-255. 617 o RMB element length: The length of the RMB element's eyecatcher 618 plus the length of receive buffer. This length is equal for all 619 RMB elements in a given RMB. This length can be variable across 620 different RMBs. 622 Multiple RMBs can be associated to an SMC-R link group and each peer 623 in an SMC-R link group manages allocation of its RMBs. RMB allocation 624 can be asymmetric. For example, server X can allocate 2 RMBs to an 625 SMC-R link group while server Y allocates 5. This provides maximum 626 implementation flexibility to allow hosts optimize RMB management for 627 their own local requirements. The maximum number of RMBs that can be 628 allocated on one peer to a link group is 255. If more RMBs are 629 required, the peer may fall back to IP for subsequent connections or, 630 if the peer is the server, create a parallel link group. 632 One use case for multiple RMBs is multiple receive buffer sizes. 633 Since every element in an RMB must be the same size, multiple RMBs 634 with different element sizes can be allocated if varying receive 635 buffer sizes are required. 637 Also since the maximum number of TCP connections whose receive 638 buffers can be allocated to an RMB is 255, multiple RMBs may be 639 required to provide capacity for large numbers of TCP connections 640 between two peers. 642 Separately from the RMB, the TCP/IP stack that owns each RMB 643 maintains control data for each RMB element within its local control 644 structures. 
   The control data contains flags for maintaining the state of the TCP
   data (for example, urgent indicator) and most importantly, two
   cursors which are illustrated in Figure 4:

   o  The peer producer cursor: This is a wrapping offset into the RMB
      element's receive buffer that points to the next byte of data to
      be written by the remote peer. This cursor is provided by the
      remote peer in a Connection Data Control (CDC) message, which is
      sent using RDMA sendmsg processing, and tells the local peer how
      far it can consume data in the RMBE buffer.

   o  The peer consumer cursor: This is a wrapping offset into the
      remote peer's RMB element's receive buffer that points to the
      next byte of data to be consumed by the remote peer in its own
      RMBE. The local peer cannot write into the remote peer's RMBE
      beyond this point without causing data loss. This cursor is also
      provided by the peer using a Connection Data Control message.

   Each TCP connection peer maintains its cursors for a TCP
   connection's RMBE in its local control structures. In other words,
   the peer who writes into a remote peer's RMBE provides its producer
   cursor to the peer whose RMBE it has written into. The peer who
   reads from its RMBE provides its consumer cursor to the writing
   peer. In this manner the reads and writes between peers are kept
   coordinated.

   For example, referring to Figure 4, peer B writes the hatched data
   into the receive buffer of peer A's RMBE. After that write
   completes, peer B uses a CDC message to update its producer cursor
   to peer A, to indicate to peer A how much data is available for peer
   A to consume. The CDC message that peer B sends to peer A wakes up
   peer A and notifies it that there is data to be consumed.
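
   The cursor bookkeeping can be sketched as follows. This is only an
   illustration in Python, not part of the architecture: the Cursor
   type, its wrap counter, and the helper names are assumptions
   introduced here so that a full buffer can be told apart from an
   empty one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cursor:
    """Hypothetical cursor: a wrapping offset plus an assumed wrap counter."""
    wrap: int    # how many times the cursor has wrapped the receive buffer
    offset: int  # byte offset into the RMBE receive buffer


def advance(cur: Cursor, nbytes: int, size: int) -> Cursor:
    """Move a cursor forward by nbytes, wrapping at the buffer size."""
    total = cur.offset + nbytes
    return Cursor(cur.wrap + total // size, total % size)


def unconsumed(producer: Cursor, consumer: Cursor, size: int) -> int:
    """Bytes written by the remote peer that the owner has not yet consumed."""
    return (producer.wrap - consumer.wrap) * size + producer.offset - consumer.offset


def writable(producer: Cursor, consumer: Cursor, size: int) -> int:
    """Bytes the writer may still place in the RMBE without overrunning
    the consumer cursor (which would cause data loss)."""
    return size - unconsumed(producer, consumer, size)


# Peer B writes into peer A's 16-byte receive buffer; peer A consumes.
SIZE = 16
producer = consumer = Cursor(0, 0)
producer = advance(producer, 10, SIZE)   # B writes 10 bytes, then sends a CDC message
consumer = advance(consumer, 10, SIZE)   # A consumes them, then sends a CDC message
producer = advance(producer, 12, SIZE)   # B writes 12 more bytes, wrapping the buffer
```

   In this sketch the writer checks writable() before each RDMA write
   and the owner reads at most unconsumed() bytes; both values depend
   only on the two cursors exchanged in CDC messages.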

   Similarly, when peer A consumes data written by peer B, it uses a
   CDC message to update its consumer cursor to peer B to let peer B
   know how much data it has consumed, so peer B knows how much space
   is available for further writes. If peer B were to write enough data
   to peer A that it would wrap the RMBE receive buffer and exceed the
   consumer cursor, data loss would result.

   Note that this is a simplistic description of the control flows and
   they are optimized to minimize the number of CDC messages required,
   as described in 4.7. RMB data flows.

    Peer A's RMBE Control Info             Peer B's RMBE Control Info
   +--------------------------+           +--------------------------+
   |                          |           |                          |
   /----Peer producer cursor  |     +-----+-Peer consumer cursor     |
  /|                          |     |     |                          |
 | +--------------------------+     |     +--------------------------+
 |        Peer A's RMBE             |
 | +--------------------------+     |
 | |             +------------------+
 | |             |            |
 | |            \/            |
 | |             +------------|
 | |-------------+/////////// |
 | |//RDMA data written by ///|
 | |/// peer B that is ////// |
 | |/available to be consumed/|
 | |///////////////////////// |
 | |///////// +---------------|
 | |----------+/\             |
 | |           |              |
  \|           |              |
   \           /              |
   |\---------/               |
   |                          |
   |                          |

                       Figure 4 RMBE cursors

   Additional flags and indicators are communicated between peers. In
   all cases, these flags and indicators are updated by the peer using
   CDC messages with the control information contained in inline data.
   More details on these additional flags and indicators are described
   in 4.3. RMBE control information.

2.2. SMC-R Link groups

   SMC-R links are logically grouped together to form an SMC-R Link
   Group.
   The purpose of the Link Group is to support multiple links between
   the same two peers to provide for:

   o  Resilience: Provides transparent and dynamic switching of the
      link used by existing TCP connections during link failures,
      typically hardware related. TCP traffic using the failing link
      can be switched to an active link within the link group, avoiding
      disruptions to application workloads.

   o  Link utilization: Provides an active/active link usage model
      allowing TCP traffic to be balanced across the links, which
      increases bandwidth and avoids hardware imbalances and
      bottlenecks. Note that both adapter and switch utilization can
      become potential resource constraint issues.

   SMC-R Link Group support is required. Resilience is not optional.
   However, the user can elect to provision a single RNIC (on one or
   both hosts).

   Multiple links that are formed between the same two peers fall into
   two distinct categories:

   1. Equal Links: Links providing equal access to the same RMB(s) at
      both endpoints whereby all TCP connections associated with the
      links must have the same VLAN ID and have the same TCP server
      and TCP client roles or relationship.

   2. Unequal Links: Links providing access to unique, unrelated and
      isolated RMB(s) (i.e. for unique VLANs or unique and isolated
      application workloads, etc.) or that have unique TCP server or
      client roles.

   Links that are logically grouped together forming an SMC Link Group
   must be equal links.

2.2.1. Link group types

   Equal links within a link group also have another "Link Group Type"
   attribute based on the link's associated underlying physical path.
   The following SMC-R link types are defined:

   1. Single Link: the only active link within a link group

   2. Parallel Link: not allowed - SMC Links having the same physical
      RNIC at both hosts

   3. Asymmetric Link: links that have unique RNIC adapters at one
      host but share a single adapter at the peer host

   4. Symmetric Link: links that have unique RNIC adapters at both
      hosts

   These link group types are further explained in the following
   figures and descriptions.

   Figure 2 above shows the single link case. The single link
   illustrated in Figure 2 also establishes the SMC-R Link Group. Link
   groups are supposed to have multiple links, but when only one RNIC
   is available at both hosts then only a single link can be created.
   This is expected to be a transient case.

   Figure 5 shows the symmetric link case. Both hosts have unique and
   redundant RNIC adapters. This configuration meets the objectives for
   providing full RoCE redundancy required to provide the level of
   resilience required for high availability for SMC-R. While this
   configuration is not required, it is a strongly recommended "best
   practice" for the exploitation of SMC-R. Single and asymmetric links
   must be supported but are intended to provide for short term
   transient conditions, for example during a temporary outage or
   recycle of an RNIC.

          Host X                                      Host Y
   +-------------------+                      +-------------------+
   |                   |                      |                   |
   |    Protection     |                      |    Protection     |
   |     Domain X      |                      |     Domain Y      |
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|     SMC-R Link 1     |RNIC 2|     QP 64  |
   |RToken X|   |      |<-------------------->|      |   |        |
   |        |   |      |                      |      |   |RToken Y|
   |       \/   +------+                      +------+   \/       |
   |+--------+         |                      |        +--------+ |
   ||        |         |                      |        |        | |
   ||  RMB   |         |                      |        |  RMB   | |
   ||        |         |                      |        |        | |
   |+--------+         |                      |        +--------+ |
   |       /\   +------+                      +------+   /\       |
   |RToken Z|   |      |     SMC-R Link 2     |      |   |RToken W|
   |        |   |RNIC 3|<-------------------->|RNIC 4|   |        |
   |  QP 9      |      |                      |      |     QP 65  |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

                    Figure 5 Symmetric SMC-R links

          Host X                                      Host Y
   +-------------------+                      +-------------------+
   |                   |                      |                   |
   |    Protection     |                      |    Protection     |
   |     Domain X      |                      |     Domain Y      |
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|     SMC-R Link 1     |RNIC 2|     QP 64  |
   |RToken X|   |      |<-------------------->|      |   |        |
   |        |   |      |                   .->|      |   |RToken Y|
   |       \/   +------+                 .`   +------+   \/       |
   |+--------+         |               .`     |        +--------+ |
   ||        |         |             .`       |        |        | |
   ||  RMB   |         |           .`         |        |  RMB   | |
   ||        |         |         .`SMC-R      |        |        | |
   |+--------+         |       .` Link 2      |        +--------+ |
   |       /\   +------+     .`               +------+            |
   |RToken Z|   |      |   .`                 |      |down or     |
   |        |   |RNIC 3|<-`                   |RNIC 4|unavailable |
   |  QP 9      |      |                      |      |            |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

                   Figure 6 Asymmetric SMC-R links

   In the example provided by Figure 6, Host X has two RNICs but Host Y
   only has one RNIC available. This configuration allows for the
   creation of an asymmetric link. While an asymmetric link will
   provide some resilience (e.g. when RNIC 1 fails), ideally each host
   should provide two redundant RNICs. This should be a transient case,
   and when RNIC 4 becomes available, this configuration must
   transition to a symmetric link configuration.
   This transition is accomplished by first creating the new symmetric
   link, then deleting the asymmetric link with reason code "Asymmetric
   link no longer needed" specified in the DELETE LINK LLC message.

          Host X                                      Host Y
   +-------------------+                      +-------------------+
   |                   |                      |                   |
   |    Protection     |                      |    Protection     |
   |     Domain X      |                      |     Domain Y      |
   |            +------+     SMC-R link 1     +------+            |
   |  QP 8      |RNIC 1|<-------------------->|RNIC 2|     QP 64  |
   |RToken X|   |      |                      |      |   |        |
   |        |   |      |<-------------------->|      |   |RToken Y|
   |       \/   +------+     SMC-R link 2     +------+   \/       |
   |+--------+  QP 9   |                      |  QP 65 +--------+ |
   ||        |   |     |                      |    |   |        | |
   ||  RMB   |<--+     |                      |    +-->|  RMB   | |
   ||        |         |                      |        |        | |
   |+--------+         |                      |        +--------+ |
   |            +------+                      +------+            |
   |     down or|      |                      |      |down or     |
   | unavailable|RNIC 3|                      |RNIC 4|unavailable |
   |            |      |                      |      |            |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

             Figure 7 SMC-R parallel links (not supported)

   Figure 7 shows parallel links, which are two links in the link group
   that use the same hardware. This configuration is not permitted.
   Because SMC-R multiplexes multiple TCP connections over an SMC-R
   link and both links are using the exact same hardware, there is no
   additional redundancy or capacity benefit obtained from this
   configuration. However, this configuration does add the unnecessary
   overhead of additional queue pairs, generation of additional Rkeys,
   etc.

2.2.2. Maximum number of links in link group

   The SMC-R protocol defines a maximum of 8 symmetric SMC-R links
   within a single SMC-R link group. This allows for support for up to
   8 unique physical paths between peer hosts. However, in terms of
   meeting the basic requirements for redundancy, support for at least
   2 symmetric links must be implemented.
   Supporting more than 2 links also simplifies implementation for
   practical matters relating to dynamically adding and removing links,
   for example starting a third SMC-R link prior to taking down one of
   the two existing links. Recall that all links within a link group
   must have equal access to all associated RMBs.

   The SMC-R protocol allows an implementation to choose an
   implementation-specific and appropriate value for maximum symmetric
   links. The implementation value must not exceed the architecture
   limit of 8 and must not be lower than 2, because the SMC-R protocol
   requires redundancy. This does not mean that two RNICs are
   physically required to enable SMC-R connectivity, but at least two
   RNICs for redundancy are strongly recommended.

   The SMC-R peers exchange their implementation maximum link values
   during link group establishment using the defined maximum link value
   in the CONFIRM LINK LLC command. Once the initial exchange completes
   the value is set for the life of the link group. The maximum link
   value can be provided by both the server and client. The server must
   supply a value, whereas the client maximum link value is optional.
   When the client does not supply a value, it indicates that the
   client accepts the server supplied maximum value. If the client
   provides a value it cannot exceed the server maximum value. If the
   client passes a lower value, this lower value becomes the final
   negotiated maximum number of symmetric links for this link group.
   Again, the minimum value is 2.

   During run time the client must never request that the server add a
   symmetric link to a link group that would exceed the negotiated
   maximum link value. Likewise the server must never attempt to add a
   symmetric link to a link group that would exceed the negotiated
   maximum value.
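
   The negotiation rules above can be summarized in a short sketch.
   The Python below is illustrative only; the function and constant
   names are hypothetical, and raising an error on an out-of-range
   value is an assumption made here, not behavior defined by the
   architecture.

```python
from typing import Optional

ARCH_MAX_LINKS = 8  # architecture limit on symmetric links per link group
MIN_LINKS = 2       # minimum an implementation must support for redundancy


def negotiate_max_links(server_max: int, client_max: Optional[int] = None) -> int:
    """Return the negotiated maximum number of symmetric links.

    server_max is mandatory; client_max is optional and, when supplied,
    must not exceed the server value.
    """
    if not MIN_LINKS <= server_max <= ARCH_MAX_LINKS:
        raise ValueError("server maximum must be between 2 and 8")
    if client_max is None:
        # The client accepts the server supplied maximum.
        return server_max
    if not MIN_LINKS <= client_max <= server_max:
        raise ValueError("client maximum must be between 2 and the server maximum")
    # A lower client value becomes the final negotiated maximum.
    return client_max
```

   For instance, negotiate_max_links(8) yields 8, while
   negotiate_max_links(8, 3) yields 3 for the life of the link group.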

   In terms of counting the active links within a link group, the
   initial (or the only / last) link is always counted as 1. Then as
   additional links are added they are either symmetric or asymmetric
   links.

   With regard to enforcing the maximum link rules, asymmetric links
   are an exception having a unique set of rules:

   o  At most one asymmetric link is allowed per link group.

   o  Asymmetric links must not be counted in the maximum symmetric
      link count calculation. When tracking the current count or
      enforcing the negotiated maximum number of links, an asymmetric
      link is not to be counted.

2.2.3. Forming and managing link groups

   SMC-R link groups are self-defining. The first SMC-R link in a link
   group is created using TCP option flows on the TCP three-way
   handshake followed by CLC message flows over the TCP connection.
   Subsequent SMC-R links in the link group are created by sending LLC
   messages over an SMC-R link that already exists in the link group.
   Once an SMC-R link group is created, no additional SMC-R links in
   that group are created using TCP and CLC negotiation. Because
   subsequent SMC-R links are created exclusively by sending LLC
   messages over an existing SMC-R link in a link group, the membership
   of SMC-R links to a link group is self-defining.

   This architecture does not define a specific identifier for an SMC-R
   link group. This identification may be useful for network management
   and may be assigned in a platform specific manner, or in an
   extension to this architecture.

   In each SMC-R link group, one peer is the server for all TCP
   connections and the other peer is the client. If there are
   additional TCP connections between the peers that use SMC-R and have
   the client and server roles reversed, another SMC-R link group is
   set up between them with the opposite client-server relationship.

   This is required because there are specific responsibilities divided
   between the client and server in the management of an SMC-R link
   group.

   In this architecture, the decision of whether to use an existing
   SMC-R link group or create a new SMC-R link group for a TCP
   connection is made exclusively by the server.

   Management of the links in an SMC-R link group is also a server
   responsibility. The server is responsible for adding and deleting
   links in a link group. The client may request that the server take
   certain actions but the final responsibility is the server's.

2.2.4. SMC-R link identifiers

   This architecture defines multiple identifiers to identify SMC-R
   links and peers.

   o  Link number: This is a one-byte value that identifies an SMC-R
      link within a link group. Both the server and the client use this
      number to distinguish an SMC-R link from other links within the
      same link group. It is only unique within a link group. In order
      to prevent timing windows that may occur when a server creates a
      new link while the client is still cleaning up a previously
      existing link, link numbers cannot be reused until the entire
      link numbering space has been exhausted.

   o  Link User ID: This is an architecturally opaque four-byte value
      that a peer uses to uniquely define an SMC-R link within its own
      space. This means that a link user ID is unique within one peer
      only. Each peer defines its own link user ID for a link. The
      peers exchange this information once during link setup and it is
      never used architecturally again. The purpose of this identifier
      is for network management, display, and debugging purposes. For
      example, an operator on a client could provide the operator on
      the server with the server's link user ID when asking the
      server's operator to check on the operation of a link that the
      client is having trouble with.

   o  Peer ID: The SMC-R peer ID uniquely identifies a specific
      instance of a specific TCP/IP stack. It is required because in
      clustered and load balancing environments, an IP address does not
      uniquely identify a TCP/IP stack. An RNIC's MAC/GID also doesn't
      uniquely or reliably identify a TCP/IP stack because RNICs can go
      up and down and even be redeployed to other TCP/IP stacks in a
      multiple partitioned or virtualized environment. The peer ID is
      not only unique per TCP/IP stack but is also unique per instance
      of a TCP/IP stack, meaning that if a TCP/IP stack is restarted,
      its peer ID changes.

2.3. SMC-R resilience and load balancing

   The SMC-R multi-link architecture provides resilience for network
   high availability via failover capability to an alternate RoCE
   adapter.

   The SMC-R multi-link architecture does not define primary, secondary
   or alternate roles for the links. Instead there are multiple active
   links representing multiple redundant RoCE paths over the same LAN.

   Assignment of TCP connections to links is unidirectional and
   asymmetric. This means that the client and server may each choose a
   separate link for their RDMA writes associated with a specific TCP
   connection.

   If a hardware failure or a QP failure associated with an individual
   link occurs, then the TCP connections that were associated with the
   failing link are dynamically and transparently switched to use
   another available link. The server or the client can detect a
   failure and immediately move their TCP connections and then notify
   their peer via the DELETE LINK LLC command. While the client can
   notify the server of an apparent link failure with the DELETE LINK
   LLC command, the server performs the actual link deletion.

   The movement of TCP connections to another link can be accomplished
   with minimal coordination between the peers.
   The TCP connection movement is also transparent and non-disruptive
   to the TCP socket application workloads for most failure scenarios.
   After a failure, the surviving links and all associated hardware
   must handle the link group's workload.

   As each SMC-R peer begins to move active TCP connections to another
   link, all current RDMA write operations must be allowed to complete.
   Then the moving peer sends a signal to verify receipt of the last
   successful write by its peer. If this verification fails, the TCP
   connection must be reset. Once this verification is complete, all
   writes that failed may then be retried, in order, over the new link.
   Any data writes or CDC messages for which the sender did not receive
   write completion must be replayed before any subsequent data or CDC
   write operations are sent. LLC messages are not retried over the new
   link because they are dependent on a known link configuration, which
   has just changed because of the failure. The initiator of an LLC
   message exchange that fails will be responsible for retrying once
   the link group configuration stabilizes.

   When a new link becomes available and is re-added to the link group,
   each peer is free to rebalance its current TCP connections as needed
   or only assign new TCP connections to the newly added link. Both the
   server and client are free to manage TCP connections across the link
   group as needed. TCP connection movement does not have to be
   stimulated by a link failure.

   The SMC-R architecture also defines orderly vs. disorderly failover.
   The type is communicated in the LLC Delete Link command and is
   simply a means to indicate that the link has terminated (disorderly)
   or link termination is imminent (orderly). The orderly link deletion
   could be initiated via operator command or programmatically to bring
   down an idle link.
   For example, an operator command could initiate orderly shut down of
   an adapter for service. Implementation of the two types is based on
   implementation requirements and is beyond the scope of the SMC-R
   architecture.

3. SMC-R Rendezvous architecture

   Rendezvous is the process that SMC-R capable peers use to
   dynamically discover each other's capabilities, negotiate SMC-R
   connections, set up SMC-R links and link groups, and manage those
   link groups. A key aspect of SMC-R rendezvous is that it occurs
   dynamically and automatically, without requiring SMC link
   configuration to be defined by an administrator.

   SMC-R Rendezvous starts with the TCP/IP three-way handshake during
   which connection peers use TCP options to announce their SMC-R
   capabilities. If both endpoints are SMC-R capable, then Connection
   Layer Control (CLC) messages are exchanged between the peers' SMC-R
   layers over the newly established TCP connection to negotiate SMC-R
   credentials. The CLC message mechanism is analogous to the messages
   exchanged by SSL for its handshake processing.

   If a new SMC-R link is being set up, Link Layer Control (LLC)
   messages are used to confirm RDMA connectivity. LLC messages are
   also used by the SMC-R layers at each peer to manage the links and
   link groups.

   Once an SMC-R link is set up or agreed to by the peers, the TCP
   sockets are passed to the peer applications which use them as
   normal. The SMC-R layer, which resides under the sockets layer,
   transmits the socket data between peers over RDMA using the SMC-R
   protocol, bypassing the TCP/IP stack.

3.1. TCP options

   During the TCP/IP three-way handshake, the client and server
   indicate their support for SMC-R by including experimental TCP
   option 254 on the three-way handshake flows, in accordance with RFC
   6994 "Shared Use of Experimental TCP Options".
   The ExID value used is the string 'SMCR' in EBCDIC (IBM-1047)
   encoding (0xE2D4C3D9). This ExID has been registered in the TCP
   ExIDs registry maintained by IANA.

   After completion of the 3-way TCP handshake each peer queries its
   peer's options. If both peers set the TCP option on the three-way
   handshake, inline SMC-R negotiation occurs using CLC messages. If
   neither peer or only one peer set the TCP option, SMC-R cannot be
   used for the TCP connection, and the TCP connection completes setup
   using the IP fabric.

3.2. Connection Layer Control (CLC) messages

   CLC messages are sent as data payload over the IP network using the
   TCP connection between SMC-R layers at the peers. They are analogous
   to the messages used to exchange parameters for SSL.

   Use of CLC messages is detailed in the following sections. The
   following list provides a summary of the defined CLC messages and
   their purposes:

   o  SMC PROPOSAL: Sent from the client to propose that this TCP
      connection is eligible to be moved to SMC-R. The client
      identifies itself and its subnet to the server and passes the
      SMC-R elements for a suggested RoCE path via the MAC and GID.

   o  SMC ACCEPT: Sent from the server to accept the client's TCP
      connection SMC proposal. The server responds to the client's
      proposal by identifying itself to the client and passing the
      elements of a RoCE path that the client can use to perform RDMA
      writes to the server. This consists of SMC-R link elements such
      as RoCE MAC, GID, RMB information etc.

   o  SMC CONFIRM: Sent from the client to confirm the server's
      acceptance of the SMC connection. The client responds to the
      server's acceptance by passing the elements of a RoCE path that
      the server can use to perform RDMA writes to the client. This
      consists of SMC-R link elements such as RoCE MAC, GID, RMB
      information etc.

   o  SMC DECLINE: Sent from either the server or the client to reject
      the SMC connection, indicating the reason the peer must decline
      the SMC proposal and allowing the TCP connection to revert back
      to IP connectivity.

3.3. LLC messages

   Link Layer Control (LLC) messages are sent between peer SMC-R layers
   over an SMC-R link to manage the link or the link group. LLC
   messages are sent using RoCE sendmsg with inline data and are 44
   bytes long. The 44-byte size is based on what can fit into a RoCE
   Work Queue Element (WQE) without requiring the posting of receive
   buffers.

   LLC messages generally follow a request-reply semantic. Each message
   has a request flavor and a reply flavor, and each request must be
   confirmed with a reply, except where otherwise noted. Use of LLC
   messages is detailed in the following sections. The following list
   provides a summary of the defined LLC messages and their purposes:

   o  ADD LINK: Add a new link to a link group. Sent from the server to
      the client to initiate addition of a new link to the link group,
      or from the client to the server to request that the server
      initiate addition of a new link.

   o  ADD LINK CONTINUATION: This is a continuation of ADD LINK that
      allows the ADD LINK to span multiple commands, because all the
      link information cannot be contained in a single ADD LINK
      message.

   o  CONFIRM LINK: Used to confirm that RoCE connectivity over a newly
      created SMC-R link is working correctly. Initiated by the server,
      and both this message and its reply must flow over the SMC-R link
      being confirmed.

   o  DELETE LINK: When initiated by the server, deletes a specific
      link from the link group or deletes the entire link group. When
      initiated by the client, requests that the server delete a
      specific link or the entire link group.

   o  CONFIRM RKEY: Informs the peer on the SMC-R link of the addition
      of an RMB to the link group.

   o  CONFIRM RKEY CONTINUATION: This is a continuation of CONFIRM RKEY
      that allows the CONFIRM RKEY to span multiple commands, in the
      event that all of the information cannot be contained in a single
      CONFIRM RKEY message.

   o  DELETE RKEY: Informs the peer on the SMC-R link of the deletion
      of one or more RMBs from the link group.

   o  TEST LINK: Verifies that an already-active SMC-R link is active
      and healthy.

   o  Optional LLC message: Any LLC message in which the two high order
      bits of the opcode are b'10' is an optional message and must be
      silently discarded by a receiving peer that does not support the
      opcode. No such messages are defined in this version of the
      architecture; however, the concept is defined to allow for
      toleration of possible advanced, optional functions.

   CONFIRM LINK and TEST LINK are sensitive to which link they flow on
   and must flow on the link being confirmed or tested. The other flows
   may flow over any active link in the link group. When there are
   multiple links in a link group, a response to an LLC message must
   flow over the same link that the original message flowed over, with
   the following exceptions:

   o  ADD LINK request from a server in response to an ADD LINK from a
      client

   o  DELETE LINK request from a server in response to a DELETE LINK
      from a client

3.4. CDC Messages

   Connection Data Control (CDC) messages are sent over the RoCE fabric
   between peers using RoCE sendmsg with inline data, and are 44 bytes
   long, which is based on the size that can fit into a RoCE Work Queue
   Element (WQE) without requiring the posting of receive buffers.
   CDC messages are used to describe the socket application data passed
   via RDMA write operations, and TCP connection state information
   including producer and consumer cursors, RMBE state information, and
   failover data validation.

3.5. Rendezvous flows

   Rendezvous information for SMC-R is exchanged as TCP options on the
   TCP 3-way handshake flows to indicate capability, followed by inline
   TCP negotiation messages to actually do the SMC-R setup. Formats of
   all rendezvous options and messages discussed in this section are
   detailed in Appendix A.

3.5.1. First contact

   First contact between RoCE peers occurs when a new SMC-R link group
   is being set up. This could be because no SMC-R links already exist
   between the peers, or the server decides to create a new SMC-R link
   group in parallel with an existing one.

3.5.1.1. TCP Options pre-negotiation

   The client and server indicate their SMC-R capability to each other
   using TCP option 254 on the TCP 3-way handshake flows.

   A client who wishes to do SMC-R will include TCP option 254 using an
   ExID equal to the EBCDIC (codepage IBM-1047) encoding of "SMCR" on
   its SYN flow.

   A server that supports SMC-R will include TCP option 254 with the
   ExID value of EBCDIC "SMCR" on its SYN-ACK flow. Because the server
   is listening for connections and does not know where client
   connections will come from, the server implementation may choose to
   unconditionally include this TCP option if it supports SMC-R. This
   may be required for server implementations where extensions to the
   TCP stack are not practical. For server implementations which can
   add code to examine and react to packets during the three-way
   handshake, the server should only include the SMC-R TCP option on
   SYN-ACK if the client included it on its SYN packet.
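
   As an illustration of the option carried on these flows, the sketch
   below builds and recognizes the SMC-R option using the general RFC
   6994 layout of Kind, Length, and ExID. It is a Python sketch under
   stated assumptions: the function names are hypothetical, and the
   authoritative option format is the one given in Appendix A.

```python
import struct

TCP_OPT_EXPERIMENTAL = 254               # experimental option kind used by SMC-R
SMCR_EXID = bytes.fromhex("E2D4C3D9")    # "SMCR" in EBCDIC, as stated in 3.1


def build_smcr_option() -> bytes:
    """Encode Kind, Length, and the 4-byte ExID (6 bytes in total)."""
    length = 2 + len(SMCR_EXID)
    return struct.pack("!BB", TCP_OPT_EXPERIMENTAL, length) + SMCR_EXID


def has_smcr_option(options: bytes) -> bool:
    """Walk a raw TCP options field and report whether the SMC-R ExID is present."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # End of Option List
            break
        if kind == 1:            # No-Operation is a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                # truncated option, stop parsing
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                # malformed length, stop parsing
        if (kind == TCP_OPT_EXPERIMENTAL and length >= 6
                and options[i + 2:i + 6] == SMCR_EXID):
            return True
        i += length
    return False
```

   A peer would apply a check of this kind to the options seen on the
   SYN and SYN-ACK flows before deciding whether to proceed to CLC
   negotiation.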

   A client who supports SMC-R and meets the three conditions outlined
   above may optionally include the TCP option for SMC-R on its ACK
   flow, regardless of whether or not the server included it on its
   SYN-ACK flow. Some TCP/IP stacks may have to include it if the SMC-R
   layer cannot modify the options on the socket until the 3-way
   handshake completes. Proprietary servers should not include this
   option on the ACK flow, since including it on the SYN flow was
   sufficient to indicate the client's capabilities.

   Once the initial three-way TCP handshake is completed, each peer
   examines the socket options. SMC-R implementations may do this by
   examining what was actually provided on the SYN and SYN-ACK packets
   or by performing a getsockopt() operation to determine the options
   set by the peer. If neither peer, or only one peer, specified the
   TCP option for SMC-R, then SMC-R cannot be used on this connection
   and it proceeds using normal IP flows and processing.

   If both peers specified the TCP option for SMC-R, then the TCP
   connection is not started yet and the peers proceed to SMC-R
   negotiation using inline data flows. The socket is not yet turned
   over to the applications; instead the respective SMC layers exchange
   CLC messages over the newly formed TCP connection.

3.5.1.2. Client Proposal

   If SMC-R is supported by both peers, the client sends an SMC
   Proposal CLC message to the server. On this flow from client to
   server it is not immediately apparent if this is a new or existing
   SMC-R link because in clustered environments a single IP address may
   represent multiple hosts. This type of cluster virtual IP address
   can be owned by a network based or host based layer 4 load balancer
   that distributes incoming TCP connections across a cluster of
   servers/hosts.
   Other clustered environments may also support the movement of a
   virtual IP address dynamically from one host in the cluster to
   another for high availability purposes. In summary, the client
   cannot pre-determine that a connection is targeting the same host
   simply by matching the destination IP address for outgoing TCP
   connections. Therefore it cannot pre-determine the SMC-R link that
   will be used for a new TCP connection. This information will be
   dynamically learned and the appropriate actions will be taken as the
   SMC-R negotiation handshake unfolds.

   On the SMC-R proposal message, the initiator (client) proposes use
   of SMC-R by including its peer ID and GID and MAC addresses, as well
   as the IP subnet number of the outgoing interface (if IPv4) or the
   IP prefix list for the network that the proposal is sent over (if
   IPv6). At this point in the flow, the client makes no local
   commitments of resources for SMC-R.

   When the server receives the SMC Proposal CLC message, it uses the
   peer ID provided by the client plus subnet or prefix information
   provided by the client, to determine if it already has a usable
   SMC-R link with this SMC-R peer. If there are one or more existing
   SMC-R links with this SMC-R peer, the server then decides which SMC
   link it will use for this TCP connection. See subsequent sections
   for the cases of reusing an existing SMC-R link or creating a
   parallel SMC link group between SMC-R peers.

   If this is a first contact between SMC-R peers the server must
   validate that it is on the same LAN as the client before continuing.
   For IPv4, the server does this by verifying that it has an interface
   with an IP subnet number that matches the subnet number set by the
   client on the SMC Proposal.
For IPv6, it does this by verifying that 1329 it is directly attached to at least one IP prefix that was listed by 1330 the client in its SMC Proposal message. 1332 If the server agrees to use SMC-R, the server begins setup of a new SMC-R 1333 link by allocating local QP and RMB resources (setting its QP state 1334 to INIT) and providing its full SMC-R information in an SMC Accept 1335 CLC message to the client over the TCP connection, along with a flag 1336 set indicating that this is a first contact flow. While the SMC 1337 Accept message could flow over any route back to the client depending 1338 upon IP routing, the SMC-R credentials provided must be for the 1339 common subnet or prefix between the server and client, as determined 1340 above. If the server cannot or does not want to do SMC-R with the 1341 client, it sends an SMC Decline CLC message to the client and the 1342 connection data may begin flowing using normal TCP/IP flows. 1344 3.5.1.3. Server acceptance 1346 When the client receives the SMC Accept from the server, it uses the 1347 combination of the first contact flag, its GID/MAC and the GID/MAC 1348 returned by the server, plus the LAN that the connection is being set up 1349 over and the QP number provided by the server, to determine if this is 1350 a new or existing SMC-R link. 1352 If it is an existing SMC-R link, and the client agrees to use that 1353 link for the TCP connection, see 3.5.2. Subsequent contact below. If 1354 it is a new SMC-R link between peers that already have an SMC link, 1355 then the server is starting a new SMC link group. 
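As a non-normative illustration, the client's classification of an SMC Accept might be sketched as follows. The names Link, Accept, and classify_accept are hypothetical and are not defined by this document. The LAN comparison is omitted for brevity, and the sketch assumes the client tracks the server-side identity (peer ID, GID, MAC, QP number) of each link it already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Server-side identity of an SMC-R link known to the client (hypothetical)."""
    peer_id: str
    gid: str
    mac: str
    qp: int


@dataclass(frozen=True)
class Accept:
    """Fields of an SMC Accept used for classification (hypothetical)."""
    peer_id: str
    gid: str
    mac: str
    qp: int
    first_contact: bool


def classify_accept(known_links, accept):
    """Classify an SMC Accept as reusing a link, a new link group, or first contact."""
    offered = Link(accept.peer_id, accept.gid, accept.mac, accept.qp)
    # Existing link: identity matches and the first contact flag is not set.
    if offered in known_links and not accept.first_contact:
        return "existing link"
    # New link to a peer we already have a link with: new (parallel) link group.
    if any(link.peer_id == accept.peer_id for link in known_links):
        return "new link group"
    # No prior relationship with this peer.
    return "first contact"
```

Both the "new link group" and "first contact" results lead to the same first contact processing described in the following sections; they are distinguished here only to mirror the text.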
1357 Assuming this is either a first contact between peers or the server 1358 is starting a new SMC link group, the client now allocates local QP 1359 and RMB resources for the SMC-R link (setting the QP state to RTR or 1360 "ready to receive"), associates them with the server QP as learned on 1361 the SMC Accept CLC message, and sends an SMC Confirm CLC message to 1362 the server over the TCP connection with its SMC-R link information 1363 included. The client also starts a timer to wait for the server to 1364 confirm the reliable connected QP as described below. 1366 3.5.1.4. Client confirmation 1368 Upon receipt of the client's SMC Confirm CLC message, the server 1369 associates its QP for this SMC-R link with the client's QP as learned 1370 on the SMC Confirm CLC message and sets its QP state to RTS (ready to 1371 send). Now the client and the server have reliable connected QPs. 1373 3.5.1.5. Link (QP) confirmation 1375 Since setting up the SMC-R link and its QPs did not require any 1376 network flows on the RoCE fabric, the client and server must now 1377 confirm connectivity over the RoCE fabric. To accomplish this, the 1378 server will send a "Confirm Link" Link Layer Control (LLC) message to 1379 the client over the RoCE fabric. The "Confirm Link" LLC message will 1380 provide the server's MAC, GID, and QP information for the connection, 1381 allow each partner to communicate the maximum number of links it can 1382 tolerate in this link group (the "link limit"), and will additionally 1383 provide two link IDs: 1385 o a one-byte server-assigned Link number that is used by both peers 1386 to identify the link within the link group and is only unique 1387 within a link group. 1389 o a four byte link user id. This opaque value is assigned by the 1390 server for the server's local use and is provided to the client 1391 for management purposes, for example to use in network management 1392 displays and products. 
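The contents of the "Confirm Link" LLC message listed above might be modeled as shown in this sketch. It is illustrative only: the class and field names are assumptions, the wire format is not represented, and only the one-byte and four-byte size constraints stated above are enforced.

```python
from dataclasses import dataclass


@dataclass
class ConfirmLink:
    """Illustrative model of the Confirm Link fields (names are hypothetical)."""
    mac: str
    gid: str
    qp_number: int
    link_limit: int      # maximum number of links tolerated in the link group
    link_number: int     # one byte, assigned by the server
    link_user_id: int    # four bytes, opaque, local to the sender

    def __post_init__(self):
        if not 0 <= self.link_number <= 0xFF:
            raise ValueError("link number must fit in one byte")
        if not 0 <= self.link_user_id <= 0xFFFFFFFF:
            raise ValueError("link user id must fit in four bytes")


def confirm_link_response(request, mac, gid, qp_number, link_limit, link_user_id):
    """Build the client's response: echo the server's link number and
    supply the client's own MAC, GID, QP number, link limit, and user id."""
    return ConfirmLink(mac, gid, qp_number, link_limit,
                       request.link_number, link_user_id)
```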
1394 When the server sends this message, it will set a timer for receiving 1395 confirmation from the client. 1397 When the client receives the server's confirmation "Confirm Link" LLC 1398 message it will cancel the confirmation timer it set when it sent the 1399 SMC Confirm message. It will also advance its QP state to RTS and 1400 respond over the RoCE fabric with a "Confirm Link" response LLC 1401 message, providing its MAC, GID, QP number, link limit, confirming 1402 the one byte link number sent by the server, and providing its own 1403 four byte link user id to the server. 1405 Host X -- Server Host Y -- Client 1406 +-------------------+ +-------------------+ 1407 | PeerID = PS1 | | PeerID = PC1 | 1408 | +------+ +------+ | 1409 | QP 8 |RNIC 1| |RNIC 2| QP 64 | 1410 |RToken X| |MAC MA| |MAC MB| | | 1411 | | |GID GA| |GID GB| |Rtoken Y| 1412 | \/ +------+ (Subnet S1) +------+ \/ | 1413 |+--------+ | | +--------+ | 1414 || RMB | | | | RMB | | 1415 |+--------+ | | +--------+ | 1416 | +------+ +------+ | 1417 | |RNIC 3| |RNIC 4| | 1418 | |MAC MC| |MAC MD| | 1419 | |GID GC| |GID GD| | 1420 | +------+ +------+ | 1421 +-------------------+ +-------------------+ 1423 SYN TCP options(254,"SMCR") 1424 <--------------------------------------------------------- 1426 SYN-ACK TCP options(254, "SMCR") 1427 ---------------------------------------------------------> 1429 ACK [TCP options(254, "SMCR")] 1430 <-------------------------------------------------------- 1432 SMC Proposal(PC1,MB,GB,S1) 1433 <-------------------------------------------------------- 1435 SMC Accept(PS1,first contact,MA,GA,MTU,QP8,RToken=X,RMB elem ndx) 1436 ---------------------------------------------------------> 1438 SMC Confirm(PC1,MB,GB,MTU,QP64,RToken=Y, RMB element index) 1439 <-------------------------------------------------------- 1441 Confirm Link (MA,GA,QP8, link lim, server's link userid, linknum) 1442 .........................................................> 1444 Confirm Link 
Rsp(MB,GB,QP64, link lim, client link userid, linknum) 1445 <........................................................ 1447 Legend: 1448 ------------ TCP/IP and CLC flows 1449 ............ RoCE (LLC) flows 1451 Figure 8 First contact rendezvous flows 1453 Technically, the data for the TCP connection could now flow over the 1454 RoCE path. However, if this is first contact, there is no alternate 1455 for this recently established RoCE path. Since in the current 1456 architecture there is no failover from RoCE to IP once connection 1457 data starts flowing, a failure of this path would 1458 disrupt the TCP connection, meaning that the level of redundancy and 1459 failover is less than that provided by IP. If the network has 1460 alternate RoCE paths available, they would not be usable at this 1461 point, which is an unacceptable condition. 1463 3.5.1.6. Second SMC-R link setup 1465 Because of the unacceptable situation described above, TCP data will 1466 not be allowed to flow on the newly established SMC-R link until a 1467 second path has been set up, or at least attempted. 1469 If the server has a second RNIC available on the same LAN, it 1470 attempts to set up the second SMC-R link over that second RNIC. If 1471 it only has one RNIC available on the LAN, it will attempt to set up 1472 the second SMC-R link over that one RNIC. In the latter case, the 1473 server is attempting to set up an asymmetric link, in case the client 1474 does have a second RNIC on the LAN. 1476 In either case the server allocates a new QP over the RNIC it is 1477 attempting to use for the second link, assigns a link number to the 1478 new link and also creates an RToken for the RMB over this second QP 1479 (note that this means that the first and second QP each has its own 1480 RToken to represent the same RMB). 
The server provides this 1481 information, as well as the MAC and GID of the RNIC it is attempting 1482 to set up the second link over, in an "Add Link" LLC message which it 1483 sends to the client over the SMC-R link that is already set up. 1485 3.5.1.6.1. Client processing of "Add Link" LLC message from server 1487 When the client receives the server's "Add Link" LLC message, it 1488 examines the GID and MAC provided by the server to determine if the 1489 server is attempting to use the same server-side RNIC as the existing 1490 SMC-R link, or a different one. 1492 If the server is attempting to use the same server-side RNIC as the 1493 existing SMC-R link, then the client verifies that it has a second 1494 RNIC on the same LAN. If it does not, the client rejects the "Add 1495 Link" request from the server, because the resulting link would be a 1496 parallel link which is not supported within a link group. If the 1497 client does have a second RNIC on the same LAN, it accepts the 1498 request and an asymmetric link will be set up. 1500 If the server is using a different server-side RNIC from the existing 1501 SMC-R link then the client will accept the request and a second SMC-R 1502 link will be set up in this SMC-R link group. If the client has a 1503 second RNIC on the same LAN, that second RNIC will be used for the 1504 second SMC-R link, creating symmetric links. If the client does not 1505 have a second RNIC on the same LAN, it will use the same RNIC as was 1506 used for the initial SMC-R link, resulting in the setup of an 1507 asymmetric link in the SMC-R link group. 1509 In either case, when the client accepts the server's "Add Link" 1510 request, it allocates a new QP on the chosen RNIC and creates an Rkey 1511 over that new QP for the client-side RMB for the SMC link group, then 1512 sends an "Add Link" reply LLC message to the server providing that 1513 information as well as echoing the Link number that was set by the 1514 server. 
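The client's accept/reject decision described above reduces to two inputs: whether the server is reusing its existing RNIC, and whether the client has a second RNIC on the same LAN. A minimal sketch follows; the function name and return values are hypothetical and chosen only for illustration.

```python
def client_add_link_decision(server_same_rnic, client_has_second_rnic):
    """Return (accepted, resulting_link_type) for a server "Add Link" request.

    server_same_rnic: the GID/MAC in the request match the server RNIC
        already used by the existing SMC-R link.
    client_has_second_rnic: the client has another RNIC on the same LAN.
    """
    if server_same_rnic:
        if not client_has_second_rnic:
            # Same RNIC on both sides would be a parallel link: reject.
            return (False, "parallel")
        # Server reuses its RNIC, client uses its second RNIC.
        return (True, "asymmetric")
    if client_has_second_rnic:
        # Distinct RNICs on both sides.
        return (True, "symmetric")
    # Server uses a new RNIC, client reuses its only RNIC.
    return (True, "asymmetric")
```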
1516 If the client rejects the server's "Add Link" request, it sends an 1517 "Add Link" reply LLC message to the server with the reason code for 1518 the rejection. 1520 3.5.1.6.2. Server processing of "Add Link" reply LLC message from the 1521 client 1523 If the client sends a negative response to the server or no reply is 1524 received, the server frees the RoCE resources it had allocated for 1525 the new link. Having a single link in an SMC-R link group is 1526 undesirable and the server's recovery is detailed in C.8. Failure to 1527 add second SMC-R link to a link group. 1529 If the client sends a positive reply to the server with 1530 MAC/GID/QP/Rkey information, the server associates its QP for the new 1531 SMC-R link to the QP that the client provided. Now the new SMC-R 1532 link is in the same situation that the first was in after the client 1533 sent its ACK packet - there is a reliable connected QP over the new 1534 RoCE path, but there have been no RoCE flows to confirm that it's 1535 actually usable. So at this point the client and server will 1536 exchange "Confirm Link" LLC messages just like they did on the first 1537 SMC-R link. 1539 If either peer receives a failure during this second "Confirm Link" 1540 LLC exchange (either an immediate failure which implies that the 1541 message did not reach the partner, or a timeout), it sends a "Delete 1542 Link" LLC message to the partner over the first (and now only) link 1543 in the link group which must be acknowledged before data can flow on 1544 the single link in the link group. 
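The server's handling of the "Add Link" reply and the subsequent "Confirm Link" exchange can be summarized as a small decision function. This is an illustrative sketch only: the reply values and action strings are hypothetical labels for the steps described above, not protocol elements.

```python
def server_actions_after_add_link(reply, confirm_link_ok=True):
    """List the server's next steps after sending an "Add Link" request.

    reply: "positive", "negative", or None if no reply was received.
    confirm_link_ok: outcome of the second "Confirm Link" exchange; only
        meaningful when the reply was positive.
    """
    if reply != "positive":
        # Negative reply or no reply: release what was set aside for the link.
        return ["free RoCE resources for new link",
                "recover as described in C.8"]
    actions = ["associate server QP with client QP",
               "exchange Confirm Link over new link"]
    if not confirm_link_ok:
        # Immediate failure or timeout on the second Confirm Link exchange.
        actions.append("send Delete Link over first link")
    return actions
```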
1546 Host X -- Server Host Y -- Client 1547 +-------------------+ +-------------------+ 1548 | PeerID = PS1 | | PeerID = PC1 | 1549 | +------+ +------+ | 1550 | QP 8 |RNIC 1| |RNIC 2| QP 64 | 1551 |RToken X| |MAC MA| |MAC MB| | | 1552 | | |GID GA| |GID GB| |RToken Y| 1553 | \/ +------+ +------+ \/ | 1554 |+--------+ | | +--------+ | 1555 || | | | | | | 1556 || RMB | | | | RMB | | 1557 || | | | | | | 1558 |+--------+ | | +--------+ | 1559 | /\ +------+ +------+ /\ | 1560 | | |RNIC 3| |RNIC 4| | | 1561 |RToken Z| |MAC MC| |MAC MD| |RToken W| 1562 | QP 9 |GID GC| |GID GD| QP 65 | 1563 | +------+ +------+ | 1564 +-------------------+ +-------------------+ 1566 First SMC-R link setup as shown in Figure 8 1567 <-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-> 1569 ADD link request (QP9,MC,GC, link number=2) 1570 ............................................> 1572 ADD link response (QP65,MD,GD, link number=2) 1573 <............................................ 1575 ADD link continuation request (RToken=Z) 1576 ............................................> 1578 ADD link continuation response(RToken=W) 1579 <............................................ 1581 Confirm Link(MC,GC,QP9,link number=2, link userid) 1582 .............................................> 1584 Confirm Link response(MD,GD,QP65,link number=2, link userid) 1585 <............................................. 1587 Legend: 1588 ------------ TCP/IP and CLC flows 1589 ............ RoCE (LLC) flows 1591 Figure 9 First contact, second link setup 1593 3.5.1.6.3. Exchange of Rkeys on second SMC-R link 1595 Note that in the scenario described here, first contact, there is 1596 only one RMB Rkey to exchange on the second SMC-R link and it is 1597 exchanged in the Add Link Continuation request and reply. In 1598 scenarios other than first contact, for example, adding a new SMC-R 1599 link to a longstanding link group with multiple RMBs, additional 1600 flows will be required to exchange additional RMB Rkeys. 
See 1601 3.5.5.2.3. Adding a new SMC-R link to a link group with multiple RMBs 1602 for more details on these flows. 1604 3.5.1.6.4. Aborting SMC-R and falling back to IP 1606 Unless both partners provide the SMC-R TCP option during the 3 way 1607 TCP handshake, the connection falls back to normal TCP/IP. During 1608 the SMC-R negotiation that occurs after the 3 way TCP handshake, 1609 either partner may break off SMC-R by sending an SMC Decline CLC 1610 message. The SMC Decline CLC message may be sent in place of any 1611 expected message, and may also be sent during the Confirm Link LLC 1612 exchange if there is a failure before any application data has flowed 1613 over the RoCE fabric. For more detail on exactly when an SMC Decline 1614 can flow during link group setup, see C.1. SMC Decline during CLC 1615 negotiation and C.2. SMC Decline during LLC negotiation. 1617 If this fallback to IP happens while setting up a new SMC-R link 1618 group, the RoCE resources allocated for this SMC-R link group 1619 relationship are torn down and it will be retried as a new SMC-R link 1620 group next time a connection starts between these peers with SMC-R 1621 proposed. Note that if this happens because one side doesn't support 1622 SMC-R, there will be very little to tear down as the TCP option will 1623 have failed to flow either on the initial SYN or the SYN-ACK, before 1624 either side had reserved any local RoCE resources. 1626 3.5.2. Subsequent contact 1628 "Subsequent contact" means setting up a new TCP connection between 1629 two peers that already have an SMC-R link group between them, and 1630 reusing the existing SMC-R link group. In this case it is not 1631 necessary to allocate new QPs. However, it is possible that a new RMB 1632 has been allocated for this TCP connection, if the previous TCP 1633 connection used the last element available in the previously used 1634 RMB, or for any other implementation-dependent reason. 
For this 1635 reason, and for convenience and error checking, the same TCP option 1636 254 followed by inline negotiation method described for initial 1637 contact will be used for subsequent contact, but the processing 1638 differs in some ways. That processing is described below. 1640 3.5.2.1. SMC-R proposal 1642 When the client begins the inline negotiation with the server, it 1643 does not know if this is a first contact or a subsequent contact. 1644 The client cannot know this information until it sees the server's 1645 peer ID to determine whether or not it already has an SMC-R link with 1646 this peer that it can use. There are several reasons why it is not 1647 sufficient to use the partner IP address, subnet, VLAN or other IP 1648 information to make this determination. The most obvious reason is 1649 distributed systems: if the server IP address is actually a virtual 1650 IP address representing a distributed cluster, the actual host 1651 serving this TCP connection may not be the same as the host that 1652 served the last TCP connection to this same IP address. 1654 After the TCP three way handshake, assuming both partners indicate 1655 SMC-R capability, the client builds and sends the SMC Proposal CLC 1656 message to the server in exactly the same manner as it does in the 1657 first contact case, and in fact at this point doesn't know if it's 1658 first contact or subsequent contact. As in the first contact case, 1659 the client sends its Peer ID value, suggested RNIC GID/MAC, and IP 1660 subnet or prefix information. 1662 Upon receiving the client's proposal, the server looks up the peer ID 1663 provided to determine if it already has a usable SMC-R link group 1664 with this peer. If it does already have a usable SMC-R link group, 1665 the server then needs to decide if it will use the existing SMC-R 1666 link group, or create a new link group. For the new link group 1667 case, see 3.5.3. 
First contact variation: creating a parallel link 1668 group, below. 1670 For this discussion assume the server decides to use the existing 1671 SMC-R link group for the TCP connection, which is expected to be the 1672 most common case. The server is responsible for making this decision. 1673 Then the server needs to communicate that information to the client, 1674 but it is not necessary to allocate, associate, and confirm QPs for 1675 the chosen SMC-R link. All that remains to be done is to set up RMB 1676 space for this TCP connection. 1678 If one of the RMBs already in use for this SMC-R link group has an 1679 available element that uses the appropriate buffer size, the server 1680 merely chooses one for this TCP connection and then sends an SMC 1681 Accept CLC message, providing the full RoCE information for the 1682 chosen SMC-R link to the client, using the same format as the SMC 1683 Accept CLC message described in the initial contact section above. 1685 The server may choose to use the SMC-R link that matches the 1686 suggested MAC/GID provided by the client on the SMC Proposal for its 1687 RDMA writes but is not obligated to. The final decision on which 1688 specific SMC-R link to assign a TCP connection to is an independent 1689 server and client decision. 1691 It may be necessary for the server to allocate a new RMB for this 1692 connection. The reasons for this are implementation dependent and 1693 could include: no available space in existing RMB or RMBs, or desire 1694 to allocate a new RMB that uses a different buffer size from the ones 1695 already created, or any other implementation dependent reason. In 1696 this case the server will allocate the new RMB and then perform the 1697 flows described in 3.5.5.2.1. Adding a new RMB to an SMC-R link 1698 group. Once that processing is complete, the server then provides the 1699 full RoCE information, including the new Rkey, for this connection 1700 on an SMC Accept CLC message to the client. 
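The server's choice between reusing an element of an existing RMB and allocating a new RMB might look like the following sketch. Everything here is an assumption made for illustration: the RMB class, zero-based element indexes, the default of four elements per new RMB, and the exact-match rule on buffer size. Actual allocation policy is implementation dependent.

```python
from dataclasses import dataclass, field


@dataclass
class RMB:
    """Illustrative RMB with fixed-size elements (hypothetical structure)."""
    buffer_size: int
    num_elements: int
    in_use: set = field(default_factory=set)

    def take_free_element(self):
        """Reserve and return the lowest free element index, or None if full."""
        for idx in range(self.num_elements):
            if idx not in self.in_use:
                self.in_use.add(idx)
                return idx
        return None


def choose_rmb_element(rmbs, wanted_size, new_rmb_elements=4):
    """Return (rmb, element_index, newly_allocated) for a new TCP connection.

    When newly_allocated is True, the flows for adding a new RMB to the
    link group must complete before the SMC Accept is sent.
    """
    for rmb in rmbs:
        if rmb.buffer_size == wanted_size:
            idx = rmb.take_free_element()
            if idx is not None:
                return rmb, idx, False
    # No suitable free element: allocate a new RMB of the wanted size.
    rmb = RMB(wanted_size, new_rmb_elements)
    rmbs.append(rmb)
    return rmb, rmb.take_free_element(), True
```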
1702 3.5.2.2. SMC-R acceptance 1704 Upon receiving the SMC Accept CLC message from the server, the client 1705 examines the RoCE information provided by the server to determine if 1706 this is a first contact for a new SMC link group, or subsequent 1707 contact for an existing SMC-R link group. It is subsequent contact 1708 if the server side peer ID, GID, MAC and QP number provided on the 1709 packet match a known SMC-R link, and the "first contact" flag is not 1710 set. If this is not the case, for example the GID and MAC match but 1711 the QP is new, then the server is creating a new, parallel SMC-R link 1712 group and this is treated as a first contact. 1714 A different RMB RToken does not indicate a first contact as the 1715 server may have allocated a new RMB, or be using several RMBs for 1716 this SMC-R link. The client needs the server's RMB information only 1717 for its RDMA writes to the server, and since there is no requirement 1718 for symmetric RMBs, this information is simply control information 1719 for the RDMA writes on this SMC-R link. 1721 The client must validate that the RMB element being provided by the 1722 server is not in use by another TCP connection on this SMC-R link 1723 group. This validation must check the new RMB element against 1724 all RMB elements known to be in use on this link group. See 4.4.2. RMB element 1725 reuse and conflict resolution for the case in which the server tries 1726 to use an RMB element that is already in use on this link group. 1728 Once the client has determined that this TCP connection is a 1729 subsequent contact over an existing SMC link, it performs a similar 1730 RMB allocation process as the server did: it either allocates an 1731 element from an RMB already associated with this SMC-R link, or it 1732 allocates a new RMB and associates it with this SMC-R link and then 1733 chooses an element out of it. 1735 If the client allocates a new RMB for this TCP connection, it 1736 performs the processing described in 3.5.5.2.1. 
Adding a new RMB to 1737 an SMC-R link group. Once that processing is complete, the client 1738 provides its full RoCE information for this TCP connection on an SMC 1739 Confirm CLC message. 1741 Because an SMC-R link with a verified connected QP already exists and 1742 is being reused, there is no need for verification or alternate QP 1743 selection flows or timers. 1745 3.5.2.3. SMC-R confirmation 1747 When the server receives the client's SMC Confirm CLC message on a 1748 subsequent contact, it verifies the following: 1750 o the RMB element provided by the client is not already in use by 1751 another TCP connection on this SMC-R link group (see section 1752 4.4.2. RMB element reuse and conflict resolution for the case in 1753 which it is). 1755 o The MAC/GID/QP info provided by the client matches an active link 1756 within the link group. The client is free to select any valid / 1757 active link. The client is not required to select the same link as 1758 the server. 1760 If this validation passes, the server stores the client's RMB 1761 information for this connection and the RoCE setup of the TCP 1762 connection is complete. 1764 3.5.2.4. TCP data flow race with SMC Confirm CLC message 1766 On a subsequent contact TCP/IP connection, a peer may send data as 1767 soon as it has received the peer RMB information for the connection. 1768 There are no additional RoCE confirmation flows, since the QPs on the 1769 SMC link are already reliably connected and verified. 1771 In the majority of cases the first data will flow from the client to 1772 the server. The client must send the SMC Confirm CLC message before 1773 sending any connection data over the chosen SMC-R link, however the 1774 client need not wait for confirmation of this message, and in fact 1775 there will be no such confirmation. 
Since the server is required to 1776 have the RMB fully set up and ready to receive data from the client 1777 before sending SMC Accept CLC message, the client can begin sending 1778 data over the SMC-R link immediately upon completing the send of the 1779 SMC Confirm CLC message. 1781 It is possible that data from the client will arrive into the server 1782 side RMB before the SMC Confirm CLC message from the client has been 1783 processed. In this case the server must handle this race condition, 1784 and not provide the arrived TCP data to the socket application until 1785 the SMC Confirm CLC message has been received and fully processed, 1786 opening the socket. 1788 If the server has initial data to send to the client which is not a 1789 response to the client (this case should be rare), it can send the 1790 data immediately upon receiving and processing the SMC Confirm CLC 1791 message from the client. The client must have opened the TCP socket 1792 to the client application upon sending of SMC Confirm CLC message so 1793 the client will be ready to process data from the server. 1795 3.5.3. First contact variation: creating a parallel link group 1797 Recall that parallel SMC-R links within an SMC-R link group are not 1798 supported. These are multiple SMC-R links within a link group that 1799 use the same network path. However, multiple SMC-R link groups 1800 between the same peers are supported. This means that if multiple 1801 SMC-R links over the same RoCE path are desired, it is necessary to 1802 use multiple SMC-R link groups. While not a recommended practice, 1803 this could be done for platform specific reasons, like QP separation 1804 of different workloads. Only the server can drive the creation of 1805 multiple SMC-R link groups between peers. 
1807 At a high level, when the server decides to create an additional SMC- 1808 R link group with a client it already has an SMC-R link group with, 1809 the flows are basically the same as the normal "first contact" case 1810 described above. The following provides more detail and 1811 clarification of processing in this case. 1813 When the server receives the SMC Proposal CLC message from the client 1814 and using the GID/MAC info determines that it already has an SMC-R 1815 link group with this client, the server can either reuse the existing 1816 SMC-R link group (detailed in 3.5.2. Subsequent contact above) or it 1817 can create a new SMC-R link group in addition to the existing one. 1819 If the server decides to create a new SMC-R link group, it does the 1820 same processing it would have done for first contact: allocate QP and 1821 RMB resources as well as alternate QP resources, and communicate the 1822 QP and RMB information to the client on the SMC Accept CLC message 1823 with the "first contact" flag set. 1825 When the client receives the server's SMC Accept CLC message with the 1826 new QP information and the "first contact" flag, it knows the server 1827 is creating a new SMC-R link group even though it already has an SMC- 1828 R link group with the server. In this case the client will also 1829 allocate a new QP for this new SMC link and allocate an RMB for this 1830 link and generate an Rkey for it. 1832 Note that multiple SMC-R link groups between the same peers must 1833 access different RMB resources, so new RMBs will be required. Using 1834 the same RMBs that are in use in another SMC-R link group is not 1835 permitted. 1837 The client then associates its new QP with the server's new QP and 1838 sends its SMC Confirm CLC message back to the server providing the 1839 new QP/RMB information and sets its confirmation timer for the new 1840 SMC-R link. 
1842 When the server receives the client's SMC Confirm CLC message it 1843 associates its QP with the client's QP as learned on the SMC Confirm 1844 CLC message and sends a confirmation LLC message. The rest of the 1845 flow, with the confirmation QP and setup of additional SMC-R links, 1846 unfolds just like the first contact case. 1848 3.5.4. Normal SMC-R link termination 1850 The normal sockets API trigger points are used by the SMC-R layer to 1851 initiate SMC-R connection termination flows. The main design point 1852 for SMC-R normal connection flows is to use the SMC-R protocol to 1853 first shutdown the SMC-R connection and free up any SMC-R RDMA 1854 resources and then allow the normal TCP connection termination 1855 protocol (i.e. FIN processing) to drive cleanup of the TCP connection 1856 that exists on the IP fabric. This design point is very important in 1857 ensuring that RDMA resources such as the RMBEs are only freed and 1858 reused when both SMC-R end points are completely done with their RDMA 1859 Write operations to the partner's RMBE. 1861 When the last TCP connection over an SMC-R link group terminates, the 1862 link group can be terminated. Similar to creation of SMC-R links and 1863 link groups, the primary responsibility for determining that normal 1864 termination is needed and initiating it lies with the server. 1865 Implementations may opt to set timers to keep SMC-R link groups up 1866 for a specified time after the last TCP connection ends, to avoid 1867 churn in cases when TCP connections come and go regularly. 1869 The link or link group may also be terminated as a result of an 1870 operator initiated command. This command can be entered at either 1871 the client or the server. If entered at the client, the client 1872 requests that the server perform link or link group termination, and 1873 the responsibility for doing so ultimately lies with the server. 
1875 When the server determines that the SMC-R link group is to be 1876 terminated, it sends a DELETE LINK LLC message to the client, with a 1877 flag set indicating that all links in the link group are to be 1878 terminated. After receiving confirmation from the adapter that the 1879 DELETE LINK LLC message has been sent, the server can clean up its 1880 end of the link group (QPs, RMBs, etc). Upon receipt of the DELETE 1881 LINK message from the server, the client must immediately comply and 1882 clean up its end of the link group. Any TCP connections that the 1883 client believes to be active on the link group must be immediately 1884 terminated. 1886 The client can request that the server delete the link group as well. 1887 The client does this by sending a DELETE LINK message to the server 1888 indicating that cleanup of all links is requested. The server must 1889 comply by sending a DELETE LINK to the client and processing as 1890 described above. If there are TCP connections active on the link 1891 group when the server receives this request, they are immediately 1892 terminated by sending a RST flow over the IP fabric. 1894 3.5.5. Link group management flows 1896 3.5.5.1. Adding and deleting links in an SMC-R link group 1898 The server has the lead role in managing the composition of the link 1899 group. Links are added to the link group by the server. The client may 1900 notify the server of new conditions that may result in the server 1901 adding a new link, but the server is ultimately responsible. In 1902 general, links are deleted from the link group by the server; however, 1903 in certain error cases the client may inform the server that a link 1904 must be deleted and treat it as deleted without waiting for action 1905 from the server. These flows are detailed in the following sections. 1907 3.5.5.1.1. 
Server initiated Add Link processing 1909 As described in previous sections, the server initiates an Add Link 1910 exchange to create redundancy in a newly created link group. Once a 1911 link group is established the server may also initiate Add Link for 1912 other reasons, including: 1914 o Availability of additional resources on the server host to support 1915 an additional SMC-R link. This may include the provisioning of an 1916 additional RNIC, more storage becoming available to support 1917 additional QP resources, operator command, or any other 1918 implementation dependent reason. Note that, to be available for 1919 an existing link group, a new RNIC must be attached to the same 1920 RoCE LAN that the link group is using. 1922 o Receipt of notification from the client that additional resources 1923 on the client are available to support an additional SMC-R link. 1924 See 3.5.5.1.2. Client initiated Add Link processing. 1926 Server initiated Add Link processing in an established SMC-R link 1927 group is the same as the Add Link processing described in 3.5.1.6. 1928 Second SMC-R link setup with the following changes: 1930 o If an asymmetric SMC-R link already exists in the link group a 1931 second asymmetric link will not be created. Only one asymmetric 1932 link is permitted in a link group. 1934 o TCP data flow on already existing link(s) in the link group is not 1935 halted or otherwise affected during the process of setting up the 1936 additional link. 1938 In no case will the server initiate Add Link processing if the link 1939 group already has the maximum number of links negotiated by the 1940 partners. 1942 3.5.5.1.2. Client initiated Add Link processing 1944 If an additional RNIC becomes available for an existing SMC-R link 1945 group on the client's side, the client notifies the server by sending 1946 an Add Link request LLC message to the server. 
Unlike an Add Link 1947 request sent by the server to the client, this Add Link request 1948 merely informs the server that the client has a new RNIC. If the 1949 link group lacks redundancy, or has redundancy only on an asymmetric 1950 link with a single RNIC on the client side, the server must initiate 1951 an Add Link exchange in response to this message, to create or 1952 improve the link group's redundancy. 1954 If the link group already has symmetric link redundancy but has fewer 1955 than the negotiated maximum number of links, the server may respond 1956 by initiating an Add Link exchange to create a new link using the 1957 client's new resource but is not required to. 1959 If the link group already has the negotiated maximum number of links, 1960 the server must ignore the client's Add Link request LLC message. 1962 Because the server is not required to respond to the client's Add 1963 Link LLC message in all cases, the client must not wait for a 1964 response or throw an error if one does not come. 1966 3.5.5.1.3. Server initiated Delete Link Processing 1968 Reasons that a server may delete a link include: 1970 o The link has not been used for TCP connections for an 1971 implementation defined time interval, and deleting the link will 1972 not cause the link group to lack redundancy 1974 o An error in resources supporting the link. These may include but 1975 are not limited to: RNIC errors, QP errors, software errors 1977 o The RNIC supporting this SMC-R link is being taken down, either 1978 because of an error case or because of an operator or software 1979 command. 1981 If a link being deleted is supporting TCP connections, and there are 1982 one or more surviving links in the link group, the TCP connections 1983 are moved to the surviving links. For more information on this 1984 processing see 2.3. SMC-R resilience and load balancing. 

   The server deletes a link from the link group by sending a Delete
   Link request LLC message to the client over any of the usable links
   in the link group. Because the Delete Link LLC message specifies
   which link is to be deleted, it may flow over any link in the link
   group. The server must not clean up its RoCE resources for the link
   until the client responds.

   The client responds to the server's Delete Link request LLC message
   by sending the server a Delete Link response LLC message. The client
   must respond positively; it cannot decline to delete the link. Once
   the server has received the client's Delete Link response, both sides
   may clean up their resources for the link.

   Positive write completion or other indication from the RNIC on the
   client's side is sufficient to indicate to the client that the server
   has received the Delete Link response.

          Host X                                     Host Y
   +-------------------+                      +-------------------+
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|    SMC-R Link 1      |RNIC 2|   QP 9     |
   |RToken X|   |Failed|<--X----X----X----X-->|      |            |
   |    |       |      |                      |      |            |
   |    \/      +------+                      +------+            |
   |+--------+         |                      |                   |
   || deleted|         |                      |                   |
   ||  RMB   |         |                      |                   |
   ||        |         |                      |                   |
   |+--------+         |                      |                   |
   |   /\       +------+                      +------+            |
   |RToken Z|   |      |    SMC-R Link 2      |      |            |
   |    |       |RNIC 3|<-------------------->|RNIC 4|            |
   |       QP 64|      |                      |      |   QP 65    |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

             DELETE LINK(Request, link number = 1,
      ................................................>
                   reason code = RNIC failure)

             DELETE LINK(Response, link number = 1)
      <................................................

      (note, architecturally this exchange can flow over either
       SMC-R link but most likely flows over link 2 since
       the RNIC for link 1 has failed)

            Figure 10 Server initiated Delete Link flow

3.5.5.1.4. Client initiated Delete Link request

   The client may request that the server delete a link for the same
   reasons that the server may delete a link, except for inactivity
   timeout.

   Because the client depends on the server to delete links, there are
   two types of delete requests from client to server:

   o  Orderly: the client is requesting that the server delete the link
      when able. This would result from an operator command to bring
      down the RNIC or some other nonfatal reason. In this case the
      server is required to delete the link, but may not do it right
      away.

   o  Disorderly: the server must delete the link right away, because
      the client has experienced a fatal error with the link.

   In either case the server responds by initiating a Delete Link
   exchange with the client as described in the previous section. The
   difference between the two is whether the server must do so
   immediately or can delay for an opportunity to gracefully delete the
   link.

          Host X                                     Host Y
   +-------------------+                      +-------------------+
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|    SMC-R Link 1      |RNIC 2|   QP 9     |
   |RToken X|   |      |<---X--X--X--X--X--X->|Failed|            |
   |    |       |      |                      |      |            |
   |    \/      +------+                      +------+            |
   |+--------+         |                      |                   |
   || deleted|         |                      |                   |
   ||  RMB   |         |                      |                   |
   ||        |         |                      |                   |
   |+--------+         |                      |                   |
   |   /\       +------+                      +------+            |
   |RToken Z|   |      |    SMC-R Link 2      |      |            |
   |    |       |RNIC 3|<-------------------->|RNIC 4|            |
   |       QP 64|      |                      |      |   QP 65    |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

         DELETE LINK(Request, link number = 1, disorderly,
      <...............................................
                   reason code = RNIC failure)

             DELETE LINK(Request, link number = 1,
      ................................................>
                   reason code = RNIC failure)

             DELETE LINK(Response, link number = 1)
      <................................................

      (note, architecturally this exchange can flow over either
       SMC-R link but most likely flows over link 2 since
       the RNIC for link 1 has failed)

            Figure 11 Client-initiated Delete Link

3.5.5.2. Managing multiple Rkeys over multiple SMC-R links in a link
         group

   After the initial contact sequence completes and the number of TCP
   connections increases it is possible that the SMC peers could add
   additional RMBs to the Link Group. Recall that each peer
   independently manages its RMBs. Also recall that an RMB's RToken is
   specific to a QP, which means that when there are multiple SMC-R
   links in a link group, each RMB accessed with the link group requires
   a separate RToken for each SMC-R link in the group.

   Each RMB that is added to a link must be added to all links within
   the Link Group. The set of RMBs created for the Link is called the
   "RToken Set". The RTokens must be exchanged with the peer. As RMBs
   are added and deleted, the RToken Set must remain in sync.

3.5.5.2.1. Adding a new RMB to an SMC-R link group

   A new RMB can be added to an SMC-R link group on either the client or
   the server side. When an additional RMB is added to an existing
   SMC-R link group, that RMB must be associated with the QPs for each
   link in the link group. Therefore when an RMB is added to an SMC-R
   link group, its RMB RToken for each SMC-R link's QP must be
   communicated to the peer.

   The tokens for a new RMB added to an existing SMC-R link group are
   communicated using "Confirm Rkey" LLC messages, as shown in Figure
   12. The RToken set is specified as pairs: an SMC link number, paired
   with the new RMB's RToken over that SMC Link. To preserve failover
   capability, any TCP connection that uses a newly added RMB cannot go
   active until all RTokens for the RMB have been communicated for all
   the links in the link group.
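
   As a rough illustration of the pairing and of the "all links
   confirmed" condition described above, consider the following sketch.
   The function names and the use of plain integers and strings for link
   numbers and RTokens are assumptions made for the example; the actual
   encoding is given by the Confirm Rkey LLC message format.

```python
def build_rtoken_set(rtokens_by_link):
    """Return the RToken set as (link number, RToken) pairs.

    rtokens_by_link maps each SMC-R link number to the new RMB's
    RToken over that link. Pairs are ordered by link number.
    """
    return sorted(rtokens_by_link.items())


def rmb_usable(link_numbers, confirmed_pairs):
    """True once an RToken is confirmed for every link in the group.

    Until this holds, a TCP connection using the new RMB must not go
    active, since failover to an unconfirmed link would not be possible.
    """
    confirmed_links = {link for link, _rtoken in confirmed_pairs}
    return set(link_numbers) <= confirmed_links
```

   For a link group with links 1 and 2, the set built from RToken X on
   link 1 and RToken Z on link 2 contains two pairs, and the RMB becomes
   usable only after both have been confirmed by the peer.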

          Host X                                     Host Y
   +-------------------+                      +-------------------+
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|    SMC-R Link 1      |RNIC 2|   QP 9     |
   |RToken X|   |      |<-------------------->|      |            |
   |    |       |      |                      |      |            |
   |    \/      +------+                      +------+            |
   |+--------+         |                      |                   |
   ||  new   |         |                      |                   |
   ||  RMB   |         |                      |                   |
   ||        |         |                      |                   |
   |+--------+         |                      |                   |
   |   /\       +------+                      +------+            |
   |RToken Z|   |      |    SMC-R Link 2      |      |            |
   |    |       |RNIC 3|<-------------------->|RNIC 4|            |
   |       QP 64|      |                      |      |   QP 65    |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

                  CONFIRM RKEY(Request, Add,
      ................................................>
         RToken set((Link 1,RToken X),(Link 2,RToken Z)))

                  CONFIRM RKEY(Response, Add,
      <................................................
         RToken set((Link 1,RToken X),(Link 2,RToken Z)))

      (note, this exchange can flow over either SMC-R link)

            Figure 12 Add RMB to existing link group

   Implementations may choose to proactively add RMBs to link groups in
   anticipation of need. For example, an implementation may add a new
   RMB when all of its existing RMBs are over a certain threshold
   percentage used.

   A new RMB may also be added to an existing link group on an as needed
   basis. For example, a new TCP connection may be added to the link
   group when there are no available RMB elements. In this case the CLC
   exchange is paused while the peer that requires the new RMB adds it.
   An example of this is illustrated in Figure 13.

     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   | PeerID = PS1      |                      |     PeerID = PC1  |
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|    SMC-R link 1      |RNIC 2|  QP 64     |
   |RToken X|   |MAC MA|<-------------------->|MAC MB|   |        |
   |    |       |GID GA|                      |GID GB|   |RTokenY2|
   |    \/      +------+                      +------+   \/       |
   |+--------+         |                      |        +--------+ |
   ||        |         |      SUBNET S1       |        | New    | |
   ||  RMB   |         |                      |        | RMB    | |
   |+--------+         |                      |        +--------+ |
   |   /\       +------+                      +------+   /\       |
   |    |       |RNIC 3|    SMC-R link 2      |RNIC 4|   |RTokenW2|
   |    |       |MAC MC|<-------------------->|MAC MD|   |        |
   |  QP 9      |GID GC|                      |GID GD|  QP65      |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

       SYN / SYN-ACK / ACK TCP 3-way handshake with TCP option
   <--------------------------------------------------------->

                   SMC Proposal(PC1,MB,GB,S1)
   <---------------------------------------------------------

   SMC Accept(PS1,not 1st contact,MA,GA,QP8,RToken=X,RMB elem index)
   --------------------------------------------------------->

                   Confirm Rkey(Request, Add,
   <.........................................................
        RToken set((Link1, RToken Y2),(Link2, RToken W2)))

                   Confirm Rkey(Response, Add,
   .........................................................>
        RToken set((Link1, RToken Y2),(Link2, RToken W2)))

       SMC Confirm(PC1,MB,GB,QP64,RToken=Y2, RMB element index)
   <---------------------------------------------------------

   Legend:
   ------------  TCP/IP and CLC flows
   ............  RoCE (LLC) flows

       Figure 13 Client adds RMB during TCP connection setup

3.5.5.2.2. Deleting an RMB from an SMC-R link group

   Either peer can delete one or more of its RMBs as long as they are
   not being used for any TCP connections.
   Ideally an SMC-R peer would use a timer to avoid freeing an RMB
   immediately after the last TCP connection stops using it, to keep
   the RMB available for later TCP connections and avoid thrashing with
   addition and deletion of RMBs. Once an SMC-R peer decides to delete
   an RMB, it sends a DELETE RKEY LLC message to its peer. It can then
   free the RMB once it receives a response from the peer. Multiple
   RMBs can be deleted in a DELETE RKEY exchange.

   Note that in a DELETE RKEY message, it is not necessary to specify
   the full RToken for a deleted RMB. The RMB's Rkey over one link in
   the link group is sufficient to specify which RMB is being deleted.

          Host X                                     Host Y
   +-------------------+                      +-------------------+
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|    SMC-R Link 1      |RNIC 2|   QP 9     |
   |RToken X|   |      |<-------------------->|      |            |
   |    |       |      |                      |      |            |
   |    \/      +------+                      +------+            |
   |+--------+         |                      |                   |
   || deleted|         |                      |                   |
   ||  RMB   |         |                      |                   |
   ||        |         |                      |                   |
   |+--------+         |                      |                   |
   |   /\       +------+                      +------+            |
   |RToken Z|   |      |    SMC-R Link 2      |      |            |
   |    |       |RNIC 3|<-------------------->|RNIC 4|            |
   |  QP 9      |      |                      |      |            |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

             DELETE RKEY(Request, Rkey list(Rkey X))
      ................................................>

             DELETE RKEY(Response, Rkey list(Rkey X))
      <................................................

      (note, this exchange can flow over either SMC-R link)

            Figure 14 Delete RMB from SMC-R link group

3.5.5.2.3. Adding a new SMC-R link to a link group with multiple RMBs

   When a new SMC-R link is added to an existing link group, there could
   be multiple RMBs on each side already associated with the link group.
   There could also be a different number of RMBs on one side as on the
   other, because each peer manages its RMBs independently.
   Each of these RMBs will require a new RToken to be used on the new
   SMC-R link, and then those new RTokens must be communicated to the
   peer. This requires two-way communication as the server will have to
   communicate its RTokens to the client and vice versa.

   RTokens are communicated between peers in pairs. Each RToken pair
   consists of:

   o  The RToken for the RMB, as is already known on an existing SMC-R
      link in the link group.

   o  The RToken for the same RMB, to be used on the new SMC-R link.

   These pairs are required to ensure that each peer knows which RTokens
   across QPs are equivalent.

   The "Add Link" request and response LLC messages do not have room to
   contain any RToken pairs. "Add Link continuation" LLC messages are
   used to communicate these pairs, as shown in Figure 15. The "Add
   Link Continuation" LLC messages are sent on the same SMC-R link that
   the "Add Link" LLC messages were sent over, and in both the "Add
   Link" and the "Add Link Continuation" LLC messages, the first RToken
   in each RToken pair will be the RToken for the RMB as known on the
   SMC-R link that the LLC message is being sent over.
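
   The pairing rule can be sketched as below. This is an illustration
   under stated assumptions: the rtoken_pairs() helper and the use of
   local RMB identifiers as dictionary keys are hypothetical, and the
   sketch ignores how pairs are split across multiple continuation
   messages.

```python
def rtoken_pairs(existing_link_tokens, new_link_tokens):
    """Build RToken pairs for an Add Link continuation message.

    Both arguments map a local RMB identifier to that RMB's RToken:
    existing_link_tokens on the link the LLC message is sent over,
    new_link_tokens on the link being added. The first element of each
    pair is the RToken already known to the peer.
    """
    missing = set(existing_link_tokens) - set(new_link_tokens)
    if missing:
        # Every RMB in the link group needs an RToken on the new link.
        raise ValueError(f"no new-link RToken for RMBs: {sorted(missing)}")
    return [(existing_link_tokens[rmb], new_link_tokens[rmb])
            for rmb in existing_link_tokens]
```

   Applied to a server with three RMBs known as X, Y, Z on the existing
   link and U, V, W on the new link, this yields the pairs (X,U), (Y,V),
   (Z,W).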

     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   | PeerID = PS1      |                      |     PeerID = PC1  |
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|                      |RNIC 2|  QP 64     |
   |Rkey Set|   |MAC MA|                      |MAC MB|   |Rkey set|
   |X,Y,Z   |   |GID GA|                      |GID GB|   |Q,R,S,T |
   |    \/      +------+                      +------+   \/       |
   |+--------+         |                      |        +--------+ |
   || 3 RMBs |         |                      |        | 4 RMBs | |
   |+--------+         |                      |        +--------+ |
   |   /\       +------+                      +------+   /\       |
   |Rkey set|   |RNIC 3|                      |RNIC 4|   |Rkey set|
   |U,V,W   |   |MAC MC|                      |MAC MD|   |L,M,N,P |
   |  QP 9      |GID GC|                      |GID GD|  QP 65     |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

        ADD link request (QP9,MC,GC, link number=2)
   ............................................>

        ADD link response (QP65,MD,GD, link number=2)
   <............................................

   ADD link continuation req(RToken Pairs=((X,U),(Y,V),(Z,W)))
   ............................................>

   ADD link continuation rsp(RToken Pairs=((Q,L),(R,M),(S,N),(T,P)))
   <.............................................

        Confirm Link Req/Rsp exchange on link 2
   <.............................................>

   Legend:
   ------------  TCP/IP and CLC flows
   ............  RoCE (LLC) flows

   Figure 15 Exchanging Rkeys when a new link is added to a link group

3.5.5.3. Serialization of LLC exchanges, and collisions

   LLC flows can be divided into two main groups for serialization
   considerations.

   The first group is LLC messages that are independent and can flow at
   any time. These are one-time, unsolicited messages that either do
   not have a required response, or that have a simple response that
   does not interfere with the operations of another group of messages.
   These messages are:

   o  TEST LINK from either the client or the server: This message
      requires a TEST LINK response to be returned, but does not affect
      the configuration of the link group or the Rkeys.

   o  ADD LINK from the client to the server: This message is provided
      as an "FYI" to the server to let it know that the client has an
      additional RNIC available. The server is not required to act upon
      or respond to this message.

   o  DELETE LINK from the client to the server: This message informs
      the server that the client has either experienced an error or
      problem that requires a link or link group to be terminated, or
      that an operator has commanded that a link or link group be
      terminated. The server does not respond directly to the message;
      rather, it initiates a DELETE LINK exchange as a result of
      receiving it.

   o  DELETE LINK from the server to the client with the "delete entire
      link group" flag set: This message informs the client that the
      entire link group is being deleted.

   The second group is LLC messages that are part of an exchange that
   affects link group configuration. Such an exchange must complete
   before another exchange of LLC messages that affects link group
   configuration can be processed. When a peer knows that one of these
   exchanges is in progress, it must not start another exchange. These
   exchanges are:

   o  ADD LINK / ADD LINK response / ADD LINK CONTINUATION / ADD LINK
      CONTINUATION response / CONFIRM LINK / CONFIRM LINK RESPONSE:
      This exchange, by adding a new link, changes the configuration of
      the link group.

   o  DELETE LINK / DELETE LINK response initiated by the server: This
      exchange, by deleting a link, changes the configuration of the
      link group.

   o  CONFIRM RKEY / CONFIRM RKEY response or DELETE RKEY / DELETE RKEY
      response: This exchange changes the RMB configuration of the link
      group.
      RKeys cannot change while links are being added or deleted (while
      ADD or DELETE LINK is in progress). However, CONFIRM RKEY and
      DELETE RKEY are unique in that both the client and server can
      independently manage (add or remove) their own RMBs. This allows
      each peer to concurrently change its RKeys and therefore
      concurrently send CONFIRM RKEY or DELETE RKEY requests. The
      concurrent CONFIRM RKEY or DELETE RKEY requests can be
      independently processed and do not represent a collision.

   Because the server is in control of the configuration of the link
   group, many timing windows and collisions are avoided but there are
   still some that must be handled.

3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK exchange

   Colliding LLC message: TEST LINK

      Action to resolve: Send immediate TEST LINK reply.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Server ignores the ADD LINK message. When
      client receives server's ADD LINK, client will consider that
      message to be in response to its ADD LINK message and the flow
      works. Since both client and server know not to start this
      exchange if an ADD LINK operation is already underway, this can
      only occur if the client sends this message before receiving the
      server's ADD LINK and this message crosses with the server's ADD
      LINK message; therefore the server's ADD LINK arrives at the
      client immediately after the client sent this message.

   Colliding LLC message: DELETE LINK from client to server, specific
   link specified

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the ADD LINK exchange completes. If it is an
      orderly link termination, it can wait until after this exchange
      completes.
      If it is disorderly and the link affected is the one that the
      current exchange is using, the server will discover the outage
      when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from client to server, entire link
   group to be deleted

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM RKEY from the client

      Action to resolve: Send negative CONFIRM RKEY response to the
      client. Once the current exchange finishes, client will have to
      recompute its Rkey set to include the new link, and start a new
      CONFIRM RKEY exchange.

3.5.5.3.2. Collisions during DELETE LINK exchange

   Colliding LLC message: TEST LINK from either peer

      Action to resolve: Send immediate TEST LINK response.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Server queues the ADD LINK and processes it
      after the current exchange completes.

   Colliding LLC message: DELETE LINK from client to server (specific
   link)

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the current exchange completes. If it is an
      orderly link termination, it can wait until after this exchange
      completes. If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from either client or server,
   deleting the entire link group

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM RKEY from client to server

      Action to resolve: Send negative CONFIRM RKEY response to the
      client. Once the current exchange finishes, client will have to
      recompute its Rkey set to reflect the deleted link, and start a
      new CONFIRM RKEY exchange.

3.5.5.3.3. Collisions during CONFIRM RKEY exchange

   Colliding LLC message: TEST LINK

      Action to resolve: Send immediate TEST LINK reply.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Queue the ADD LINK and process it after the
      current exchange completes.

   Colliding LLC message: ADD LINK from server to client (CONFIRM RKEY
   exchange was initiated by the client and it crossed with the server
   initiating an ADD LINK exchange)

      Action to resolve: Process the ADD LINK. Client will receive a
      negative CONFIRM RKEY from the server and will have to redo this
      CONFIRM RKEY exchange after the ADD LINK exchange completes.

   Colliding LLC message: DELETE LINK from client to server, specific
   link to be deleted (CONFIRM RKEY exchange was initiated by the server
   and it crossed with the client's DELETE LINK request)

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the current exchange completes. If it is an
      orderly link termination, it can wait until after this exchange
      completes. If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from server to client, specific
   link deleted (CONFIRM RKEY exchange was initiated by the client and
   it crossed with the server's DELETE LINK)

      Action to resolve: Process the DELETE LINK. Client will receive a
      negative CONFIRM RKEY from the server and will have to redo this
      CONFIRM RKEY exchange after the DELETE LINK exchange completes.

   Colliding LLC message: DELETE LINK from either client or server,
   entire link group deleted

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM LINK from the peer that did not start
   the current CONFIRM LINK exchange

      Action to resolve: Queue the request and process it after the
      current exchange completes.

4. SMC-R memory sharing architecture

4.1. RMB element allocation considerations

   Each TCP connection using SMC-R must be allocated an RMBE by each
   SMC-R peer. This allocation is performed by each end point
   independently to allow each end point to select an RMBE that best
   matches the characteristics of its TCP socket end point. The RMBE
   associated with a TCP socket endpoint must have a receive buffer
   that is at least as large as the TCP receive buffer size in effect
   for that connection. The receive buffer size can be determined by
   what is specified explicitly by the application using setsockopt()
   or implicitly via the system configured default value. This will
   allow sufficient data to be RDMA written by the SMC-R peer to fill
   an entire receive buffer size worth of data on a given data flow.
   Given that each RMB must have fixed length RMBEs, this implies that
   an SMC-R end point may need to maintain multiple RMBs of various
   sizes for SMC-R connections on a given SMC link and can then select
   an RMBE that most closely fits a connection.

4.2. RMB and RMBE format

   An RMB is a virtual memory buffer whose backing real memory is
   pinned, which is divided into a whole number of equal sized RMB
   Elements (RMBEs). Each RMBE begins with a four byte eye catcher for
   diagnostic and service purposes, followed by the receive data buffer.
   The contents of this diagnostic eyecatcher are implementation
   dependent and should be used by the local SMC-R peer to check for
   overlay errors by verifying an intact eyecatcher with every RMBE
   access.

   The RMBE is a wrapping receive buffer for receiving RDMA writes from
   the peer. Cursors, as described below, are exchanged between peers
   to manage and track RDMA writes and local data reads from the RMBE
   for a TCP connection.

4.3. RMBE control information

   RMBE control information consists of consumer and producer cursors,
   wrap counts, CDC message sequence numbers, control flags such as
   urgent data and writer blocked indicators, and TCP connection
   information such as termination flags. This information is exchanged
   between SMC-R peers using CDC messages, which are passed using RDMA
   message passing with inline data, with the control information
   contained in the inline data. A TCP/IP stack implementing SMC-R must
   receive and store this information in its internal data structures as
   it is used to manage the RMBE and its data buffer.

   The format and contents of the CDC message are described in detail in
   4.3. RMBE control information. The following is a high level
   description of what this control information contains.

   o  Connection state flags such as sending done, connection closed,
      failover data validation, and abnormal close.

   o  A sequence number that is managed by the sender. This sequence
      number starts at 1, is increased each send, and wraps to 0. This
      sequence number tracks the CDC message sent and is not related to
      the number of bytes sent. It is used for failover data
      validation.

   o  Producer cursor: a wrapping offset into the receiver's RMBE data
      area. Set by the peer that is writing into the RMBE, it points to
      where the writing peer will write the next byte of data into an
      RMBE.
      This cursor is accompanied by a wrap sequence number to help the
      RMBE owner (the receiver) identify full window size wrapping
      writes. Note that this cursor must account for (i.e., skip over)
      the RMBE eyecatcher that is in the beginning of the data area.

   o  Consumer cursor: a wrapping offset into the receiver's RMBE data
      area. Set by the owner of the RMBE (the peer that is reading from
      it), this cursor points to the offset of the next byte of data to
      be consumed by the peer in its own RMBE. The sender cannot write
      beyond this cursor into the receiver's RMBE without causing data
      loss. Like the producer cursor, this is accompanied by a wrap
      count to help the writer identify full window size wrapping reads.
      Note that this cursor must account for (i.e., skip over) the RMBE
      eyecatcher that is in the beginning of the data area.

   o  Data flags such as urgent data, writer blocked indicator, and
      cursor update requests.

4.4. Use of RMBEs

4.4.1. Initializing and accessing RMBEs

   The RMBE eyecatcher is initialized by the RMB owner prior to
   assigning it to a specific TCP connection and communicating its RMB
   index to the SMC-R partner. After an RMBE index is communicated to
   the SMC-R partner the RMBE can only be referenced in "read only mode"
   by the owner and all updates to it are performed by the remote SMC-R
   partner via RDMA write operations.

   Initialization of an RMBE must include the following:

   o  Zeroing out the entire RMBE receive buffer, which helps minimize
      data integrity issues (e.g. data from a previous connection
      somehow being presented to the current connection).

   o  Setting the beginning RMBE eye catcher. This eye catcher plays an
      important role in helping detect accidental overlays of the RMBE.
      The RMB owner must always validate these eye catchers before each
      new reference to the RMBE.
      If the eye catchers are found to be corrupted the local host must
      reset the TCP connection associated with this RMBE and log the
      appropriate diagnostic information.

4.4.2. RMB element reuse and conflict resolution

   RMB elements can be reused once their associated TCP and SMC-R
   connections are terminated. Under normal and abnormal SMC-R
   connection termination processing both SMC-R peers must explicitly
   acknowledge that they are done using an RMBE before that element can
   be freed and reassigned to another SMC-R connection instance. For
   more details on SMC-R connection termination refer to section 4.8.
   However, there are some error scenarios where this two-way explicit
   acknowledgement may not be completed. In these scenarios (mentioned
   explicitly elsewhere in this document) an RMBE owner may choose to
   reassign this RMBE to a new SMC-R connection instance on this SMC
   link group. When this occurs the partner SMC-R peer must detect this
   condition during SMC-R rendezvous processing when presented with an
   RMBE that it believes is already in use for a different SMC-R
   connection. In this case, the SMC-R peer must abort the existing
   SMC-R connection associated with this RMBE. The abort processing
   resets the TCP connection (if it is still active) but it must not
   attempt to perform any RDMA writes to this RMBE and must also ignore
   any data sitting in the local RMBE associated with the existing
   connection. It then proceeds to free up the local RMBE and notify
   the local application that the connection is being abnormally reset.

   The remote SMC-R peer then proceeds to normal processing for this new
   SMC-R connection.

4.5. SMC-R protocol considerations

   The following sections describe considerations for the SMC-R protocol
   as compared to the TCP protocol.

4.5.1. SMC-R protocol optimized window size updates

   An SMC-R receiver host sends its Consumer Cursor information to the
   sender to convey the progress that the receiving application has made
   in consuming the sent data. The difference between the writer's
   Producer Cursor and the associated receiver's Consumer Cursor
   indicates the window size available for the sender to write into.
   This is somewhat similar to TCP window update processing and
   therefore has some similar considerations, such as silly window
   syndrome avoidance, whereby the TCP protocol has an optimization that
   minimizes the overhead of very small, unproductive window size
   updates associated with sub-optimal socket applications consuming
   very small amounts of data on every receive() invocation. For SMC-R,
   the receiver only updates its Consumer Cursor via a unique CDC
   message under the following conditions:

   o  The current window size (from a sender's perspective) is less than
      half of the Receive Buffer space and the Consumer Cursor update
      will result in a minimum increase in the window size of 10% of the
      Receive Buffer space. Some examples:

      a. Receive Buffer size: 64K, Current window size (from a
         sender's perspective): 50K. No need to update the Consumer
         Cursor. Plenty of space is available for the sender.

      b. Receive Buffer size: 64K, Current window size (from a
         sender's perspective): 30K, Current window size from a
         receiver's perspective: 31K. No need to update the Consumer
         Cursor; even though the sender's window size < 1/2 of the
         64K, the window update would only increase that by 1K which
         is < 1/10th of the 64K buffer size.

      c. Receive Buffer size: 64K, Current window size (from a
         sender's perspective): 30K, Current window size from a
         receiver's perspective: 64K.
         The receiver updates the Consumer Cursor (sender's window
         size < 1/2 of the 64K, the window update would increase that
         by > 6.4K).

   o  The receiver must always include a Consumer Cursor update whenever
      it sends a CDC message to the partner for another flow (i.e. send
      flow in the opposite direction). This allows the window size
      update to be delivered with no additional overhead. This is
      somewhat similar to TCP DelayAck processing and quite effective
      for request/response data patterns.

   o  If a peer has set the B-bit in a CDC message then any consumption
      of data by the receiver causes a CDC message to be sent updating
      the consumer cursor until a CDC message with that bit cleared is
      received from the peer.

   o  The optimized window size updates are overridden when the sender
      sets the Consumer Cursor Update Requested flag in a CDC message to
      the receiver. When this indicator is on the consumer must send a
      Consumer Cursor update immediately when data is consumed by the
      local application or if the cursor has not been updated for a
      while (i.e. local copy consumer cursor does not match the last
      consumer cursor value sent to the partner). This allows the
      sender to perform optional diagnostics for detecting a stalled
      receiver application (data has been sent but not consumed). It is
      recommended that the Consumer Cursor Update Requested flag only be
      sent for diagnostic procedures as it may result in non-optimal
      data path performance.

4.5.2. Small data sends

   The SMC-R protocol makes no special provisions for handling small
   data segments sent across a stream socket. Data is always sent if
   sufficient window space is available. There are no special provisions
   for coalescing small data segments, similar to the TCP Nagle
   algorithm.
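
   Both the send rule above and the window update rule of 4.5.1 depend
   on the window the sender currently sees. As an informal sketch only,
   the two can be expressed as follows. The function names and
   byte-count parameters are assumptions made for the example, and the
   first function models only the first condition of 4.5.1, not the
   piggybacked update, the B-bit, or the Consumer Cursor Update
   Requested flag.

```python
def should_send_cursor_update(buf_size, sender_window, receiver_window):
    """First condition of 4.5.1 for a stand-alone Consumer Cursor update.

    buf_size:        receive buffer size in bytes
    sender_window:   window size as last advertised to the sender
    receiver_window: window size the receiver would advertise now
    """
    increase = receiver_window - sender_window
    below_half = 2 * sender_window < buf_size   # sender sees < 1/2 buffer
    large_enough = 10 * increase >= buf_size    # gain is >= 10% of buffer
    return below_half and large_enough


def bytes_to_write_now(pending, sender_window):
    """4.5.2: data is sent as soon as window space is available."""
    return min(pending, sender_window)
```

   With a 64K receive buffer, the three examples a, b and c in 4.5.1
   evaluate to no update, no update, and update, respectively.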

   An implementation of SMC-R may optimize its sending processing by
   coalescing outbound data for a given SMC-R connection so that it can
   reduce the number of RDMA write operations it performs, in a similar
   fashion to Nagle's algorithm. However, any such coalescing would
   require a timer on the sending host that would ensure that data was
   eventually sent. The sending host would also have to opt out of this
   processing if Nagle's algorithm had been disabled (programmatically
   or via system configuration).

4.5.3. TCP Keepalive processing

   TCP keepalive processing allows applications to direct the local
   TCP/IP host to periodically "test" the viability of an idle TCP
   connection. Since SMC-R connections have both a TCP representation
   and an SMC-R representation there are unique keepalive processing
   considerations:

   o  SMC-R layer keepalive processing: If keepalive is enabled for an
      SMC-R connection the local host maintains a keepalive timer that
      reflects how long an SMC-R connection has been idle. The local
      host also maintains a timestamp of last activity for each SMC
      link (for any SMC-R connection on that link). When it is
      determined that an SMC-R connection has been idle longer than the
      keepalive interval the host checks whether the SMC-R link has
      been idle for a duration longer than the keepalive timeout. If
      both conditions are met, the local host then performs a Test Link
      LLC command to test the viability of the SMC link over the RoCE
      fabric (RC-QPs). If a Test Link LLC command response is received
      within a reasonable amount of time then the link is considered
      viable and all connections using this link are considered viable
      as well.
      If however a response is not received in a reasonable amount of
      time or there's a failure in sending the Test Link LLC command
      then this is considered a failure in the SMC link and failover
      processing to an alternate SMC link must be triggered. If no
      alternate SMC link exists in the SMC link group then all the
      SMC-R connections on this link are abnormally terminated by
      resetting the TCP connections represented by these SMC-R
      connections. Given that multiple SMC-R connections can share the
      same SMC link, implementing an SMC link level probe using the
      Test Link LLC command will help reduce the amount of unproductive
      keepalive traffic for SMC-R connections; as long as some SMC-R
      connections on a given SMC link are active (i.e. have had I/O
      activity within the keepalive interval) then there is no need to
      perform additional link viability testing.

   o  TCP layer keepalive processing: Traditional TCP "keepalive"
      packets are not as relevant for SMC-R connections given that the
      TCP path is not used for these connections once the SMC-R
      rendezvous processing is completed. All SMC-R connections by
      default have associated TCP connections that are idle. Are TCP
      keepalive probes still needed for these connections? There are
      two main scenarios to consider:

      1. TCP keepalives that are used to determine whether the peer TCP
         endpoint is still active. This is not needed for SMC-R
         connections as the SMC-R level keepalives mentioned above will
         determine whether the remote endpoint connections are still
         active.

      2. TCP keepalives that are used to ensure that TCP connections
         traversing an intermediate proxy maintain an active state. For
         example, stateful firewalls typically maintain state
         representing every valid TCP connection that traverses the
         firewall.
         These types of firewalls are known to expire idle connections
         by removing their state in the firewall to conserve memory.
         TCP keepalives are often used in this scenario to prevent
         firewalls from timing out otherwise idle connections. When
         using SMC-R, both end points must reside in the same layer 2
         network (i.e. the same subnet). As a result, firewalls cannot
         be injected in the path between two SMC-R endpoints. However,
         other intermediate proxies, such as TCP/IP layer load
         balancers, may be injected in the path of two SMC-R endpoints.
         These types of load balancers also maintain connection state
         so that they can forward TCP connection traffic to the
         appropriate cluster end point. When using SMC-R these TCP
         connections will appear to be completely idle, making them
         susceptible to potential timeouts at the LB proxy. As a
         result, for this scenario, TCP keepalives may still be
         relevant.

   The following are the TCP level keepalive processing requirements
   for SMC-R enabled hosts:

   o  SMC-R peers should allow TCP keepalives to flow on the TCP path
      of SMC-R connections based on existing TCP keepalive
      configuration and programming options. However, it is strongly
      recommended that platforms that provide the ability to specify
      very granular keepalive timers (for example, single digit second
      timers) consider providing a configuration option that limits the
      minimum keepalive timer that will be used for TCP layer
      keepalives on SMC-R connections. This is important to minimize
      the amount of TCP keepalive packets transmitted in the network
      for SMC-R connections.

   o  SMC-R peers must always respond to inbound TCP layer keepalives
      (by sending ACKs for these packets) even if the connection is
      using SMC-R.
      Typically, once a TCP connection has completed the SMC-R
      rendezvous processing and is using SMC-R for data flows, no new
      inbound TCP segments are expected on that TCP connection other
      than TCP termination segments (FIN, RST, etc). TCP keepalives are
      the one exception that must be supported. And since TCP keepalive
      probes do not carry any application layer data this has no
      adverse impact on the application's inbound data stream.

4.6. TCP connection failover between SMC-R links

   A peer may change which SMC-R link within a link group it sends its
   writes over in the event of a link failure. Since each peer
   independently chooses which link to send writes over for a specific
   TCP connection, this process is done independently by each peer.

4.6.1. Validating data integrity

   Even though RoCE is a reliable transport there is a small subset of
   failure modes that could cause unrecoverable loss of data. When an
   RNIC acknowledges receipt of an RDMA write to its peer, that creates
   a write completion event to the sending peer, which allows the
   sender to release any buffers it is holding for that write. In
   normal operation and in most failures, this operation is reliable.

   However there are failure modes possible in which a receiving RNIC
   has acknowledged an RDMA write but then was not able to place the
   received data into its host memory, for example because of a sudden,
   disorderly failure of the interface between the RNIC and the host.
   While rare, these types of events must be guarded against to ensure
   data integrity. The process for switching SMC-R links during
   failover that is described in this section guards against this
   possibility, and is mandatory.

   Each peer must track the current state of the CDC sequence numbers
   for a TCP connection.
   The sender must keep track of SS, which is the sequence number of
   the CDC message that described the last write acknowledged by the
   peer RNIC. In other words, SS describes the last write that the
   sender believes its peer has successfully received. The receiver
   must keep track of SR, the sequence number of the CDC message that
   described the last write that it has successfully received, i.e.,
   the data has been successfully placed into an RMBE.

   When an RNIC fails and the sender changes SMC-R links, the sender
   must first send a CDC message with the 'F' flag set over the new
   SMC-R link. This is the failover data validation message. The
   sequence number in this CDC message is equal to SS. The CDC message
   key, the length, and SMC-R alert token are the only other fields in
   this CDC message that are significant. No reply is expected from
   this validation message, and once the sender has sent it, the sender
   may resume sending on the new SMC-R link as described in 4.6.2
   below.

   Upon receipt of the failover validation message, the receiver must
   verify that its SR value for the TCP connection is equal to or
   greater than the sequence number in the failover validation message.
   If so, no further action is required and the TCP connection resumes
   on the new SMC-R link. If SR is less than the sequence number value
   in the validation message, data has been lost and the receiver must
   immediately reset the TCP connection.

4.6.2. Resuming the TCP connection on a new SMC-R link

   When a connection is moved to a new SMC-R link and the failover
   validation message has been sent, the sender can immediately resume
   normal transmission.
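
   The receiver-side check of the failover validation message described
   in 4.6.1 can be sketched as follows. This is an illustration under
   stated assumptions, not an implementation: the class and method
   names are hypothetical, sequence numbers are treated as plain
   integers, and wraparound of the CDC sequence number is ignored.

```python
from dataclasses import dataclass


@dataclass
class FailoverReceiver:
    """Receiver-side state for one TCP connection (hypothetical names)."""

    # SR: sequence number of the last CDC message whose data was
    # successfully placed into the RMBE.
    sr: int = 0

    def data_placed(self, cdc_seq: int) -> None:
        """Record that the write described by cdc_seq reached the RMBE."""
        self.sr = cdc_seq

    def on_validation_message(self, ss: int) -> str:
        """Handle a CDC message with the 'F' flag; ss is its sequence number."""
        # SR >= SS: everything the sender believes was delivered is present.
        # SR <  SS: an acknowledged write never reached host memory.
        return "resume" if self.sr >= ss else "reset"


rx = FailoverReceiver()
rx.data_placed(41)
print(rx.on_validation_message(41))
print(rx.on_validation_message(42))
```

   Returning a string keeps the sketch short; a real receiver would
   reset the TCP connection in the "reset" case.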
   In order to preserve the application message stream the sender must
   replay any RDMA writes (and their associated CDC messages) that were
   in progress or failed when the previous SMC-R link failed, before
   sending new data on the new SMC-R link. The sender has two options
   for accomplishing this:

   o  Preserve the sequence numbers "as is": Retry all failed and
      pending operations as they were originally done, including
      reposting all associated RDMA write operations and their
      associated CDC messages without making any changes. Then resume
      sending new data using new sequence numbers.

   o  Combine pending messages and possibly add new data: Combine
      failed and pending messages into a single new write with a new
      sequence number. This allows the sender to combine pending
      messages into fewer operations. As a further optimization this
      write can also include new data, as long as all failed and
      pending data is also included. If this approach is taken, the
      sequence number must be increased beyond the last failed or
      pending sequence number.

4.7. RMB data flows

   The following sections describe the RDMA wire flows for the SMC-R
   protocol after a TCP connection has switched into SMC-R mode (i.e.
   SMC-R rendezvous processing is complete and a pair of RMB elements
   has been assigned and communicated by the SMC-R peers). The ladder
   diagrams below include the following:

   o  RMBE control information kept by each peer. Only a subset of the
      information is depicted, specifically only the fields that
      reflect the stream of data written by Host A and read by Host B.

   o  Time line 0-x that shows the wire flows in a time relative
      fashion.

   o  Note that RMBE control information is only shown in a time
      interval if its value changed (otherwise assume the value is
      unchanged from the previously depicted value).

   o  The local copy of the producer and consumer cursors that is
      maintained by each host is not depicted in these figures. Note
      that the cursor values in the diagram reflect the necessity of
      skipping over the eyecatcher in the RMBE data area. They start
      and wrap at 4, not 0.

4.7.1. Scenario 1: Send flow, window size unconstrained

        SMC Host A                                  SMC Host B
        RMBE A Info                                 RMBE B Info
    (Consumer Cursors)                           (Producer Cursors)
   Cursor Wrap Seq# Time                    Time Cursor Wrap Seq# Flags
    4        0       0                       0    4        0      0
    0        0       1    --------------->   1    0        0      0
                            RDMA-WR Data
                              (4:1003)
    4        0       2    ...............>   2    1004     0      0
                            CDC Message

      Figure 16 Scenario 1: Send flow, window size unconstrained

   Scenario assumptions:

   o  Kernel implementation

   o  New SMC-R connection, no data has been sent on the connection

   o  Host A: Application issues send for 1,000 bytes to Host B

   o  Host B: RMBE receive buffer size is 10,000, application has
      issued a recv for 10,000 bytes

   Flow description:

   1. Application issues send() for 1,000 bytes, SMC-R layer copies
      data into a kernel send buffer. It then schedules an RDMA write
      operation to move the data into the peer's RMBE receive buffer,
      at relative position 4-1003 (to skip the four byte eyecatcher in
      the RMBE data area). Note that no immediate data or alert (i.e.
      interrupt) is provided to host B for this RDMA operation.

   2. Host A sends a CDC message to update the Producer Cursor to byte
      1004. This CDC message will deliver an interrupt to Host B. At
      this point, the SMC-R layer can return control back to the
      application.
      Host B, once notified of the completion of the previous RDMA
      operation, locates the RMBE associated with the RMBE alert token
      that was included in the message and proceeds to perform normal
      receive side processing, waking up the suspended application read
      thread, copying the data into the application's receive buffer,
      etc. It will use the Producer Cursor as an indicator of how much
      data is available to be delivered to the local application. After
      this processing is complete, the SMC-R layer will also update its
      local Consumer Cursor to match the Producer Cursor (i.e.
      indicating that all data has been consumed). Note that a message
      to the peer updating the Consumer Cursor is not needed at this
      time as the window size is unconstrained (> 1/2 of the receive
      buffer size). The window size is calculated by taking the
      difference between the Producer and the Consumer cursors in the
      RMBEs (10,000-1,004=8,996).

4.7.2. Scenario 2: Send/Receive flow, window unconstrained

        SMC Host A                                  SMC Host B
        RMBE A Info                                 RMBE B Info
    (Consumer Cursors)                           (Producer Cursors)
   Cursor Wrap Seq# Time                    Time Cursor Wrap Seq# Flags
    4        0       0                       0    4        0      0
    0        0       1    --------------->   1    0        0      0
                            RDMA-WR Data
                              (4:1003)
    4        0       2    ...............>   2    1004     0      0
                            CDC Message

    0        0       3    <---------------   3    1004     0      0
                            RDMA-WR Data
                              (4:503)
    1004     0       4    <...............   4    1004     0      0
                            CDC Message

      Figure 17 Scenario 2: Send/Recv flow, window size unconstrained

   Scenario assumptions:

   o  New SMC-R connection, no data has been sent on the connection

   o  Host A: Application issues send for 1,000 bytes to Host B

   o  Host B: RMBE receive buffer size is 10,000, application has
      already issued a recv for 10,000 bytes. Once the receive is
      completed, the application sends a 500 byte response to Host A.

   Flow description:

   1. Application issues send() for 1,000 bytes, SMC-R layer copies
      data into a kernel send buffer. It then schedules an RDMA write
      operation to move the data into the peer's RMBE receive buffer,
      at relative position 4-1003. Note that no immediate data or alert
      (i.e. interrupt) is provided to host B for this RDMA operation.

   2. Host A sends a CDC message to update the Producer Cursor to byte
      1004. This CDC message will deliver an interrupt to Host B. At
      this point, the SMC-R layer can return control back to the
      application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and
      proceeds to perform normal receive side processing, waking up the
      suspended application read thread, copying the data into the
      application's receive buffer, etc. After this processing is
      complete, the SMC-R layer will also update its local Consumer
      Cursor to match the Producer Cursor (i.e. indicating that all
      data has been consumed). Note that an update of the Consumer
      Cursor to the peer is not needed at this time as the window size
      is unconstrained (> 1/2 of the receive buffer size). The
      application then performs a send() for 500 bytes to Host A. The
      SMC-R layer will copy the data into a kernel buffer and then
      schedule an RDMA Write into the partner's RMBE receive buffer.
      Note that this RDMA write operation includes no immediate data or
      notification to Host A.

   4. Host B sends a CDC message to update the partner's RMBE Control
      information with the latest Producer Cursor (set to 504 and not
      shown in the diagram above) and to also inform the peer that the
      Consumer Cursor value is now 1004. It also updates the local
      Current Consumer Cursor and Last Sent Consumer Cursor to 1004.
      This CDC message includes notification since we are updating our
      Producer Cursor which requires attention by the peer host.

4.7.3. Scenario 3: Send Flow, window constrained

        SMC Host A                                  SMC Host B
        RMBE A Info                                 RMBE B Info
    (Consumer Cursors)                           (Producer Cursors)
   Cursor Wrap Seq# Time                    Time Cursor Wrap Seq# Flags
    4        0       0                       0    4        0      0
    4        0       1    --------------->   1    4        0      0
                            RDMA-WR Data
                              (4:3003)
    4        0       2    ...............>   2    3004     0      0
                            CDC Message
    4        0       3                       3    3004     0      0
    4        0       4    --------------->   4    3004     0      0
                            RDMA-WR Data
                            (3004:7003)
    4        0       5    ...............>   5    7004     0      0
                            CDC Message
    7004     0       6    <...............   6    7004     0      0
                            CDC Message

      Figure 18 Scenario 3: Send Flow, window size constrained

   Scenario assumptions:

   o  New SMC-R connection, no data has been sent on this connection

   o  Host A: Application issues send for 3,000 bytes to Host B and
      then another send for 4,000

   o  Host B: RMBE receive buffer size is 10,000. Application has
      already issued a recv for 10,000 bytes

   Flow description:

   1. Application issues send() for 3,000 bytes, SMC-R layer copies
      data into a kernel send buffer. It then schedules an RDMA write
      operation to move the data into the peer's RMBE receive buffer,
      at relative position 4-3003. Note that no immediate data or alert
      (i.e. interrupt) is provided to host B for this RDMA operation.

   2. Host A sends a CDC message to update its Producer Cursor to byte
      3004. This CDC message will deliver an interrupt to Host B. At
      this point, the SMC-R layer can return control back to the
      application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and
      proceeds to perform normal receive side processing, waking up the
      suspended application read thread, copying the data into the
      application's receive buffer, etc.
      After this processing is complete, the SMC-R layer will also
      update its local Consumer Cursor to match the Producer Cursor
      (i.e. indicating that all data has been consumed). It will not
      however update the partner with this information as the window
      size is not constrained (10,000-3,000=7,000 bytes of available
      space). The application on Host B also issues a new recv() for
      10,000 bytes.

   4. On Host A, application issues a send() for 4,000 bytes. The
      SMC-R layer copies the data into a kernel buffer and schedules an
      async RDMA write into the peer's RMBE receive buffer at relative
      position 3004-7003. Note that no alert is provided to host B for
      this flow.

   5. Host A sends a CDC message to update the Producer Cursor to byte
      7004. This CDC message will deliver an interrupt to Host B. At
      this point, the SMC-R layer can return control back to the
      application.

   6. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and
      proceeds to perform normal receive side processing, waking up the
      suspended application read thread, copying the data into the
      application's receive buffer, etc. After this processing is
      complete, the SMC-R layer will also update its local Consumer
      Cursor to match the Producer Cursor (i.e. indicating that all
      data has been consumed). It will then determine whether it needs
      to update the Consumer Cursor to the peer. The available window
      size is now 3,000 (10,000 - (Producer Cursor - Last Sent Consumer
      Cursor)) which is < 1/2 receive buffer size (10,000/2=5,000) and
      the advance of the window size is > 10% of the receive buffer
      size (1,000). Therefore a CDC message is issued to update the
      Consumer Cursor to peer A.

4.7.4. Scenario 4: Large send, flow control, full window size writes

        SMC Host A                                  SMC Host B
        RMBE A Info                                 RMBE B Info
    (Consumer Cursors)                           (Producer Cursors)
   Cursor Wrap Seq# Time                    Time Cursor Wrap Seq# Flags
    1004     1       0                       0    1004     1      0
    1004     1       1    --------------->   1    1004     1      0
                            RDMA-WR Data
                            (1004:9999)
    1004     1       2    --------------->   2    1004     1      0
                            RDMA-WR Data
                              (4:1003)
    1004     1       3    ...............>   3    1004     2      Wrt
                            CDC Message
    1004     2       4    <...............   4    1004     2      Wrt
                            CDC Message
    1004     2       5    --------------->   5    1004     2      Wrt
                            RDMA-WR Data                          Blk
                            (1004:9999)
    1004     2       6    --------------->   6    1004     2      Wrt
                            RDMA-WR Data                          Blk
                              (4:1003)
    1004     2       7    ...............>   7    1004     3      Wrt
                            CDC Message
    1004     3       8    <...............   8    1004     3      Wrt
                            CDC Message

      Figure 19 Scenario 4: Large send, flow control, full window size
                                  writes

   Scenario assumptions:

   o  Kernel implementation

   o  Existing SMC-R connection, Host B's receive window size is fully
      open (Peer Consumer Cursor = Peer Producer Cursor).

   o  Host A: Application issues send for 20,000 bytes to Host B

   o  Host B: RMB receive buffer size is 10,000, application has issued
      a recv for 10,000 bytes

   Flow description:

   1. Application issues send() for 20,000 bytes, SMC-R layer copies
      data into a kernel send buffer (assumes send buffer space of
      20,000 is available for this connection). It then schedules an
      RDMA write operation to move the data into the peer's RMBE
      receive buffer, at relative position 1004-9999. Note that no
      immediate data or alert (i.e. interrupt) is provided to host B
      for this RDMA operation.

   2. Host A then schedules an RDMA write operation to fill the
      remaining 1000 bytes of available space in the peer's RMBE
      receive buffer, at relative position 4-1003. Note that no
      immediate data or alert (i.e. interrupt) is provided to host B
      for this RDMA operation.
      Also note that an implementation of SMC-R may optimize this
      processing by combining steps 1 and 2 into a single RDMA Write
      operation (with 2 different data sources).

   3. Host A sends a CDC message to update the Producer Cursor to byte
      1004. Since the entire receive buffer space is filled, the
      Producer Writer Blocked flag (WrtBlk indicator above) is set and
      the Producer Window Wrap Sequence Number (Producer WrapSeq#
      above) is incremented. This CDC message will deliver an interrupt
      to Host B. At this point, the SMC-R layer can return control back
      to the application.

   4. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and
      proceeds to perform normal receive side processing, waking up the
      suspended application read thread, copying the data into the
      application's receive buffer, etc. In this scenario, Host B
      notices that the Producer Cursor has not been advanced (same
      value as the Consumer Cursor); however, it also notices that the
      Producer Window Wrap Sequence Number is different from its local
      value (1), indicating that a full window of new data is
      available. All the data in the receive buffer can be processed,
      the first segment (1004-9999) followed by the second segment
      (4-1003). Because the Producer Writer Blocked indicator was set,
      Host B schedules a CDC message to update its latest information
      to the peer: Consumer Cursor (1004), Consumer Window Wrap
      Sequence Number (2: the current Producer Window Wrap Sequence
      Number is used).

   5. Host A, upon receipt of the CDC message, locates the TCP
      connection associated with the alert token, and upon examining
      the control information provided notices that Host B has consumed
      all of the data (based on the Consumer Cursor and the Consumer
      Window Wrap Sequence Number) and initiates the next RDMA write to
      fill the receive buffer at offset 1004-9999.

   6. Host A then moves the next 1000 bytes into the beginning of the
      receive buffer (4-1003) by scheduling an RDMA write operation.
      Note at this point there are still 8 bytes remaining to be
      written.

   7. Host A then sends a CDC message to set the Producer Writer
      Blocked indicator and to increment the Producer Window Wrap
      Sequence Number (3).

   8. Host B, upon notification, completes the same processing as step
      4 above, including sending a CDC message to update the peer to
      indicate that all data has been consumed. At this point Host A
      can write the final 8 bytes to host B's RMBE into positions
      1004-1011 (not shown).

4.7.5. Scenario 5: Send flow, urgent data, window size unconstrained

        SMC Host A                                  SMC Host B
        RMBE A Info                                 RMBE B Info
    (Consumer Cursors)                           (Producer Cursors)
   Cursor Wrap Seq# Time                    Time Cursor Wrap Seq# Flags
    1000     1       0                       0    1000     1      0
    1000     1       1    --------------->   1    1000     1      0
                            RDMA-WR Data
                            (1000:1499)
    1000     1       2    ...............>   2    1500     1      UrgP
                            CDC Message                           UrgA

    1500     1       3    <...............   3    1500     1      UrgP
                            CDC Message                           UrgA

    1500     1       4    --------------->   4    1500     1      UrgP
                            RDMA-WR Data                          UrgA
                            (1500:2499)
    1500     1       5    ...............>   5    2500     1      0
                            CDC Message

      Figure 20 Scenario 5: Send flow, urgent data, window size open

   Scenario assumptions:

   o  Kernel implementation

   o  Existing SMC-R connection, window size open, all data has been
      consumed by receiver.

   o  Host A: Application issues send for 500 bytes with urgent data
      indicator (OOB) to Host B, then sends 1000 bytes of normal data

   o  Host B: RMBE Receive buffer size is 10,000, application has
      issued a recv for 10,000 bytes and is also monitoring the socket
      for urgent data

   Flow description:

   1. Application issues send() for 500 bytes of urgent data. SMC-R
      layer copies data into a kernel send buffer. It then schedules an
      RDMA write operation to move the data into the peer's RMBE
      receive buffer, at relative position 1000-1499. Note that no
      immediate data or alert (i.e. interrupt) is provided to host B
      for this RDMA operation.

   2. Host A sends a CDC message to update its Producer Cursor to byte
      1500 and to turn on the Producer Urgent Data Pending (UrgP) and
      Urgent Data Present (UrgA) flags. This CDC message will deliver
      an interrupt to Host B. At this point, the SMC-R layer can return
      control back to the application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token, notices
      that the Urgent Data Pending flag is on and proceeds with Out of
      Band socket API notification. For example, satisfying any
      outstanding select() or poll() requests on the socket by
      indicating that urgent data is pending (i.e. by setting the
      exception bit on). The Urgent Data Present indicator allows Host
      B to also determine the position of the urgent data (Producer
      cursor points one byte beyond the last byte of urgent data). Host
      B can then perform normal receive side processing (including
      specific urgent data processing), copying the data into the
      application's receive buffer, etc. Host B then sends a CDC
      message to update the partner's RMBE Control area with its latest
      Consumer Cursor (1500).
      Note this CDC message must occur regardless of the current local
      window size that is available. The partner host (Host A) cannot
      initiate any additional RDMA writes until it receives
      acknowledgement that the urgent data has been processed (or at
      least processed/remembered at the SMC-R layer).

   4. Upon receipt of the message, Host A wakes up, sees that the peer
      consumed all data up to and including the last byte of urgent
      data and now resumes sending any pending data. In this case, the
      application had previously issued a send for 1000 bytes of normal
      data which would have been copied in the send buffer and control
      would have been returned to the application. Host A now initiates
      an RDMA write to move that data to the peer's receive buffer at
      position 1500-2499.

   5. Host A then sends a CDC message with inline data to update its
      Producer Cursor value (2500) and turn off the Urgent Data Pending
      and Urgent Data Present flags. Host B wakes up, processes the new
      data (resumes application, copies data into the application
      receive buffer) and then proceeds to update the local current
      Consumer Cursor (2500). Given that the window size is
      unconstrained there is no need for a Consumer Cursor update in
      the peer's RMBE.

4.7.6. Scenario 6: Send flow, urgent data, window size closed

        SMC Host A                                  SMC Host B
        RMBE A Info                                 RMBE B Info
    (Consumer Cursors)                           (Producer Cursors)
   Cursor Wrap Seq# Time                    Time Cursor Wrap Seq# Flags
    1000     1       0                       0    1000     2      Wrt
                                                                  Blk

    1000     1       1    ...............>   1    1000     2      Wrt
                            CDC Message                           Blk
                                                                  UrgP

    1000     2       2    <...............   2    1000     2      Wrt
                            CDC Message                           Blk
                                                                  UrgP

    1000     2       3    --------------->   3    1000     2      Wrt
                            RDMA-WR Data                          Blk
                            (1000:1499)                           UrgP

    1000     2       4    ...............>   4    1500     2      UrgP
                            CDC Message                           UrgA

    1500     2       5    <...............   5    1500     2      UrgP
                            CDC Message                           UrgA

    1500     2       6    --------------->   6    1500     2      UrgP
                            RDMA-WR Data                          UrgA
                            (1500:2499)
    1000     2       7    ...............>   7    2500     2      0
                            CDC Message

      Figure 21 Scenario 6: Send flow, urgent data, window size closed

   Scenario assumptions:

   o  Kernel implementation

   o  Existing SMC-R connection, window size closed, writer is blocked.

   o  Host A: Application issues send for 500 bytes with urgent data
      indicator (OOB) to Host B, then sends 1000 bytes of normal data.

   o  Host B: RMBE Receive buffer size is 10,000, application has no
      outstanding recv() (for normal data) and is monitoring the socket
      for urgent data.

   Flow description:

   1. Application issues send() for 500 bytes of urgent data. SMC-R
      layer copies data into a kernel send buffer (if available). Since
      the writer is blocked (window size closed) it cannot send the
      data immediately. It then sends a CDC message to notify the peer
      of the Urgent Data Pending (UrgP) indicator (the Writer Blocked
      indicator remains on as well). This serves as a signal to Host B
      that urgent data is pending in the stream. Control is also
      returned to the application at this point.

   2. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token, notices
      that the Urgent Data Pending flag is on and proceeds with Out of
      Band socket API notification. For example, satisfying any
      outstanding select() or poll() requests on the socket by
      indicating that urgent data is pending (i.e. by setting the
      exception bit on). At this point it is expected that the
      application will enter urgent data mode processing, expeditiously
      processing all normal data (by issuing recv API calls) so that it
      can get to the urgent data byte.
      Whether the application has this urgent mode processing or not,
      at some point the application will consume some or all of the
      pending data in the receive buffer. When this occurs, Host B will
      also send a CDC message with inline data to update its Consumer
      Cursor and Consumer Window Wrap Sequence Number to the peer. In
      the example above, a full window worth of data was consumed.

   3. Host A, once awakened by the message, will notice that the window
      size is now open on this connection (based on the Consumer Cursor
      and the Consumer Window Wrap Sequence Number which now matches
      the Producer Window Wrap Sequence Number) and resume sending of
      the urgent data segment by scheduling an RDMA write into relative
      position 1000-1499.

   4. Host A then sends a CDC message to advance its Producer Cursor
      (1500) and to also notify Host B of the Urgent Data Present
      (UrgA) indicator (and turn off the Writer Blocked indicator).
      This signals to Host B that the urgent data is now in the local
      receive buffer and that the Producer Cursor points one byte
      beyond the last byte of urgent data.

   5. Host B wakes up, processes the urgent data and once the urgent
      data is consumed sends a CDC message with inline data to update
      its Consumer Cursor (1500).

   6. Host A wakes up, sees that Host B has consumed the sequence
      number associated with the urgent data and then initiates the
      next RDMA write operation to move the 1000 bytes associated with
      the next send() of normal data into the peer's receive buffer at
      position 1500-2499. Note that the send() API would have likely
      completed earlier in the process by copying the 1000 bytes into a
      send buffer and returning back to the application even though we
      could not send any new data until the urgent data was processed
      and acknowledged by Host B.

   7. Host A sends a CDC message to advance its Producer Cursor to 2500
      and to reset the Urgent Data Pending and Present flags. Host B
      wakes up and processes the inbound data.

4.8. Connection termination

   Just as SMC-R connections are established using a combination of TCP
   connection establishment flows and SMC-R protocol flows, the
   termination of SMC-R connections also uses a similar combination of
   SMC-R protocol termination flows and normal TCP protocol connection
   termination flows. The following sections describe the SMC-R
   protocol normal and abnormal connection termination flows.

4.8.1. Normal SMC-R connection termination flows

   Normal SMC-R connection termination flows are triggered via the
   normal stream socket API semantics, namely by the application
   issuing a close() or shutdown() API. Most applications, after
   consuming all incoming data and after sending any outbound data,
   will then issue a close() API to indicate that they are done both
   sending and receiving data. Some applications, typically a small
   percentage, make use of the shutdown() API that allows them to
   indicate that the application is done sending data, receiving data
   or both sending and receiving data. The main use of this API is
   scenarios where a TCP application wants to alert its partner end
   point that it is done sending data, yet is still receiving data on
   its socket (shutdown for Write). Issuing shutdown for both sending
   and receiving data is really no different than issuing a close() and
   can therefore be treated in a similar fashion. Shutdown for read is
   typically not a very useful operation and in normal circumstances
   does not trigger any network flows to notify the partner TCP end
   point of this operation.

   These same trigger points will be used by the SMC-R layer to
   initiate SMC-R connection termination flows.
   The main design point for SMC-R normal connection termination flows
   is to use the SMC-R protocol to first shut down the SMC-R connection
   and free up any SMC-R RDMA resources and then allow the normal TCP
   connection termination protocol (i.e. FIN processing) to drive
   cleanup of the TCP connection. This design point is very important
   in ensuring that RDMA resources such as the RMBEs are only freed and
   reused when both SMC-R end points are completely done with their
   RDMA Write operations to the partner's RMBE.

                                  1
                         +-----------------+
          |------------->|     CLOSED      |<-------------|
      3D  |              |                 |              |  4D
          |              +-----------------+              |
          |                       |                       |
          |                     2 |                       |
          |                       V                       |
  +----------------+     +-----------------+      +----------------+
  |AppFinCloseWait |     |     ACTIVE      |      |PeerFinCloseWait|
  |                |     |                 |      |                |
  +----------------+     +-----------------+      +----------------+
          |                |             |                |
          |  Active Close  | 3A   |   4A | Passive Close  |
          |                V      |      V                |
          |      +--------------+ | +-------------+       |
          |--<---|PeerCloseWait1| | |AppCloseWait1|--->---|
      3C  |      |              | | |             |       |  4C
          |      +--------------+ | +-------------+       |
          |                |      |      |                |
          |                | 3B   |   4B |                |
          |                V      |      V                |
          |      +--------------+ | +-------------+       |
          |--<---|PeerCloseWait2| | |AppCloseWait2|--->---|
                 |              | | |             |
                 +--------------+ | +-------------+
                                  |
                                  |
                   Figure 22 SMC-R connection states

   Figure 22 describes the states that an SMC-R connection typically
   goes through. Note that there are variations to these states that can
   occur when an SMC-R connection is abnormally terminated, similar in a
   way to when a TCP connection is reset. The following are the high
   level state transitions for an SMC-R connection:

   1. An SMC-R connection begins in the Closed state. This state is
      meant to reflect an RMBE that is not currently in use (one that
      was previously in use but no longer is, or one that was never
      allocated).

   2. An SMC-R connection progresses to the Active state once the
      SMC-R rendezvous processing has successfully completed, RMB
      element indices have been exchanged and SMC-R links have been
      activated. In this state, the TCP connection is fully
      established, rendezvous processing has been completed and SMC-R
      peers can begin exchange of data via RDMA.

   3. Active close processing (on SMC-R peer that is initiating the
      connection termination)

      A. When an application on one of the SMC-R connection peers issues
      a close() or shutdown(write or both) the SMC-R layer on that host
      will initiate SMC-R connection termination processing. First, if
      close() or shutdown(both) is issued, it will check to see that
      there's no data in the local RMB element that has not been read
      by the application. If unread data is detected, the SMC-R
      connection must be abnormally reset; for more detail on this
      refer to "SMC-R connection reset". If no unread data is pending,
      it then checks to see whether any outstanding data is waiting to
      be written to the peer or if any outstanding RDMA writes for this
      SMC-R connection have not yet completed. If either of these two
      scenarios is true, an indicator that this connection is in a
      pending close state is saved in internal data structures
      representing this SMC-R connection and control is returned to the
      application. If all data to be written to the partner has
      completed, this peer will send a CDC message to notify the peer
      of either the PeerConnectionClosed indicator (close or shutdown
      for both was issued) or the PeerDoneWriting indicator. This will
      provide stimulus to the partner SMC-R peer that the connection is
      terminating. At this point the local side of the SMC-R connection
      transitions into the PeerCloseWait1 state and control can be
      returned to the application.
      If this process could not be completed synchronously (close
      pending condition mentioned above) it is completed when all RDMA
      writes for data and control cursors have been completed.

      B. At some point the SMC-R peer application (passive close) will
      consume all incoming data, realize that the partner is done
      sending data on this connection and proceed to initiate its own
      close of the connection once it has completed sending all data
      from its end. The partner application can initiate this
      connection termination processing via the close() or shutdown()
      APIs. If the application does so by issuing a shutdown() for
      write, then the partner SMC-R layer will send a CDC message to
      notify the peer (active close side) of the PeerDoneWriting
      indicator. When the "active close" SMC-R peer wakes up as a
      result of the previous CDC message, it will notice that the
      PeerDoneWriting indicator is now on and transition to the
      PeerCloseWait2 state. This state indicates that the peer is done
      sending data and may still be reading data. The "active close"
      peer will also at this point need to ensure that any outstanding
      recv() calls for this socket are woken up and remember that no
      more data is forthcoming on this connection (in case the local
      connection was shutdown() for write only).

      C. This flow is a common transition from 3A or 3B above. When the
      SMC-R peer (passive close) consumes all data, updates all
      necessary cursors to the peer and the application closes its
      socket (close or shutdown for both) it will send a CDC message to
      the peer (the active close side) with the PeerConnectionClosed
      indicator set. At this point the connection can transition back
      to Closed state if the local application has already closed (or
      issued shutdown for both) the socket.
      Once in the Closed state, the RMBE can now safely be reused for a
      new SMC-R connection. When the PeerConnectionClosed indicator is
      turned on, the SMC-R peer is indicating that it is done updating
      the partner's RMBE.

      D. Conditional State: If the local application has not yet issued
      a close() or shutdown(both), we need to wait until the
      application does so (AppFinCloseWait state). Once it does, the
      local host will send a CDC message to notify the peer of the
      PeerConnectionClosed indicator and then transition to the Closed
      state.

   4. Passive close processing (on SMC-R peer that receives an
      indication that the partner is closing the connection)

      A. Upon receipt of an inbound RDMA write notice the SMC-R layer
      will detect that the PeerConnectionClosed indicator or
      PeerDoneWriting indicator is on. If any outstanding recv() calls
      are pending they are completed with an indicator that the partner
      has closed the connection (zero length data presented to
      application). If there is any pending data to be written and
      PeerConnectionClosed is on, then an SMC-R connection reset must
      be performed. The connection then enters the AppCloseWait1 state
      on the passive close side, waiting for the local application to
      initiate its own close processing.

      B. If the local application issues a shutdown() for writing then
      the SMC-R layer will send a CDC message to notify the partner of
      the PeerDoneWriting indicator and transition the local side of
      the SMC-R connection to the AppCloseWait2 state.

      C. When the application issues a close() or shutdown() for both,
      the local SMC-R peer will send a message informing the peer of
      the PeerConnectionClosed indicator and transition to the Closed
      state if the remote peer has also sent the local peer the
      PeerConnectionClosed indicator.
      If the peer has not sent the PeerConnectionClosed indicator, we
      transition into the PeerFinCloseWait state.

      D. The local SMC-R connection stays in this state until the peer
      sends the PeerConnectionClosed indicator in our RMBE. When the
      indicator is sent we transition to the Closed state and are then
      free to reuse this RMBE.

   Note that each SMC-R peer needs to provide some logic that will
   prevent being stranded in termination state indefinitely. For
   example, if an Active Close SMC-R peer is in a PeerCloseWait (1 or 2)
   state awaiting the remote SMC-R peer to update its connection
   termination status, it needs to provide a timer that will prevent it
   from waiting in that state indefinitely should the remote SMC-R peer
   not respond to this termination request. This could occur in error
   scenarios; for example, if the remote SMC-R peer suffered a failure
   prior to being able to respond to the termination request or the
   remote application is not responding to this connection termination
   request by closing its own socket. This latter scenario is similar
   to the TCP FINWAIT2 state that has been known to sometimes cause
   issues when remote TCP/IP hosts lose track of established connections
   and neglect to close them. Even though the TCP standards do not
   mandate a timeout from the TCP FINWAIT2 state, most TCP/IP
   implementations implement a timeout for this state. A similar
   timeout will be required for SMC-R connections. When this timeout
   occurs, the local SMC-R peer performs TCP reset processing for this
   connection. However, no additional RDMA writes to the partner RMBE
   can occur at this point (we have already indicated that we are done
   updating the peer's RMBE). After the TCP connection is reset, the
   RMBE can be returned to the free pool for reallocation. See section
   3.2.5 for more details.
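
   The normal termination transitions described above (paths 3A-3D and
   4A-4D) can be sketched as a small state machine. The following
   Python is a minimal, non-normative sketch: the class, method and
   event names are hypothetical, RDMA writes and the guard timer are
   omitted, and leaving the state unchanged when close() is issued in
   PeerCloseWait1/2 after an earlier shutdown(write) is an assumption,
   since the text does not spell that case out.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    ACTIVE = auto()
    PEER_CLOSE_WAIT1 = auto()
    PEER_CLOSE_WAIT2 = auto()
    APP_FIN_CLOSE_WAIT = auto()
    APP_CLOSE_WAIT1 = auto()
    APP_CLOSE_WAIT2 = auto()
    PEER_FIN_CLOSE_WAIT = auto()


class Event(Enum):
    APP_CLOSE = auto()           # local close() or shutdown(both)
    APP_SHUTDOWN_WRITE = auto()  # local shutdown(write)
    PEER_DONE_WRITING = auto()   # PeerDoneWriting indicator received
    PEER_CONN_CLOSED = auto()    # PeerConnectionClosed indicator received


class SmcrConnection:
    """Hypothetical model of one SMC-R connection's termination state."""

    def __init__(self):
        self.state = State.ACTIVE
        self.local_closed = False  # we have sent PeerConnectionClosed
        self.peer_closed = False   # peer has sent PeerConnectionClosed
        self.sent = []             # indicators sent in CDC messages

    def handle(self, event):
        s = self.state
        if s is State.CLOSED:
            return s
        if event is Event.APP_CLOSE and not self.local_closed:
            self.local_closed = True
            self.sent.append("PeerConnectionClosed")
            if s is State.ACTIVE:
                self.state = State.PEER_CLOSE_WAIT1                # 3A
            elif s is State.APP_FIN_CLOSE_WAIT:
                self.state = State.CLOSED                          # 3D
            elif s in (State.APP_CLOSE_WAIT1, State.APP_CLOSE_WAIT2):
                self.state = (State.CLOSED if self.peer_closed
                              else State.PEER_FIN_CLOSE_WAIT)      # 4C
            # Assumption: in PeerCloseWait1/2 the state is unchanged.
        elif event is Event.APP_SHUTDOWN_WRITE:
            if s is State.ACTIVE:
                self.sent.append("PeerDoneWriting")
                self.state = State.PEER_CLOSE_WAIT1                # 3A
            elif s is State.APP_CLOSE_WAIT1:
                self.sent.append("PeerDoneWriting")
                self.state = State.APP_CLOSE_WAIT2                 # 4B
        elif event is Event.PEER_DONE_WRITING:
            if s is State.ACTIVE:
                self.state = State.APP_CLOSE_WAIT1                 # 4A
            elif s is State.PEER_CLOSE_WAIT1:
                self.state = State.PEER_CLOSE_WAIT2                # 3B
        elif event is Event.PEER_CONN_CLOSED:
            self.peer_closed = True
            if s is State.ACTIVE:
                self.state = State.APP_CLOSE_WAIT1                 # 4A
            elif s in (State.PEER_CLOSE_WAIT1, State.PEER_CLOSE_WAIT2):
                self.state = (State.CLOSED if self.local_closed
                              else State.APP_FIN_CLOSE_WAIT)       # 3C, 3D
            elif s is State.PEER_FIN_CLOSE_WAIT:
                self.state = State.CLOSED                          # 4D
        return self.state


# Active close walk-through: close(), then the peer's two indicators.
conn = SmcrConnection()
for ev in (Event.APP_CLOSE, Event.PEER_DONE_WRITING, Event.PEER_CONN_CLOSED):
    print(ev.name, "->", conn.handle(ev).name)
```

   Tracking "local_closed" and "peer_closed" separately from the state
   mirrors the rule that an RMBE returns to Closed only after both
   peers have signalled PeerConnectionClosed.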

   Also note that it is possible to have two SMC-R end points initiate
   an Active close concurrently. In that scenario the flows above still
   apply; however, both end points follow the active close path (path
   3).

4.8.1.1. Abnormal SMC-R connection termination flows

   Abnormal SMC-R connection termination can occur for a variety of
   reasons, including:

   o The TCP connection associated with an SMC-R connection is reset.
     In the TCP protocol either end point can send a RST segment to
     abort an existing TCP connection when error conditions are
     detected for the connection or the application overtly requests
     that the connection be reset.

   o Normal SMC-R connection termination processing has unexpectedly
     stalled for a given connection. When the stall is detected
     (connection termination timeout condition) an abnormal SMC-R
     connection termination flow is initiated.

   In these scenarios it is very important that resources associated
   with the affected SMC-R connections are properly cleaned up to ensure
   that there are no orphaned resources and that resources can reliably
   be reused for new SMC-R connections. Given that SMC-R relies heavily
   on the RDMA Write processing, special care needs to be taken to
   ensure that an RMBE is no longer being used by an SMC-R peer before
   logically reassigning that RMBE to a new SMC-R connection.

   When an SMC-R peer initiates a TCP connection reset it also initiates
   an SMC-R abnormal connection termination flow at the same time. The
   SMC-R peers explicitly signal their intent to abnormally terminate an
   SMC-R connection and await explicit acknowledgement that the peer has
   received this notification and has also completed abnormal connection
   termination on its end. Note that TCP connection reset processing can
   occur in parallel to these flows.
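
   The explicit signal-and-acknowledge exchange described above can be
   sketched as follows. This Python is a minimal, non-normative sketch:
   the class and method names are hypothetical, the single "IN_USE"
   state stands in for any state reached before the connection is
   closed, and the guard timer and its 30 second default are
   assumptions that reflect the stall protection this section calls
   for.

```python
class AbortHandshake:
    """Hypothetical model of the PeerConnAbort exchange for one RMBE."""

    def __init__(self, wait_limit=30.0):
        self.state = "IN_USE"
        self.wait_limit = wait_limit  # assumed guard timer, in seconds
        self.deadline = None
        self.sent = []                # indicators sent in CDC messages

    def local_abort(self, now):
        """Active abort: signal the peer, then wait for its answer."""
        if self.state == "IN_USE":
            self.sent.append("PeerConnAbort")
            self.state = "PEER_ABORT_WAIT"
            self.deadline = now + self.wait_limit

    def peer_abort_received(self):
        """The peer's PeerConnAbort indicator arrived in a CDC message."""
        if self.state == "IN_USE":
            # Passive abort: answer with our own indicator first.
            self.sent.append("PeerConnAbort")
        # In PEER_ABORT_WAIT this is the awaited acknowledgement.
        self.state = "CLOSED"

    def tick(self, now):
        """Stop waiting if the peer never answers (assumed timer)."""
        if self.state == "PEER_ABORT_WAIT" and now >= self.deadline:
            self.state = "CLOSED"

    @property
    def rmbe_reusable(self):
        # The RMBE may be reassigned only once the exchange is complete.
        return self.state == "CLOSED"


# Two peers aborting concurrently both follow the active path.
left, right = AbortHandshake(), AbortHandshake()
left.local_abort(now=0.0)
right.local_abort(now=0.0)
left.peer_abort_received()
right.peer_abort_received()
print(left.rmbe_reusable, right.rmbe_reusable)
```

   Keeping "rmbe_reusable" false until the peer answers (or the timer
   expires) is the point of the exchange: the RMBE is not reassigned
   while the peer might still be writing into it.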

                         +-----------------+
          |------------->|     CLOSED      |<-------------|
          |              |                 |              |
          |              +-----------------+              |
          |                                               |
          |                                               |
          |                                               |
          |              +-----------------+              |
          |              |    Any State    |              |
          |1B            | (before setting |            2B|
          |              | PeerConnClosed  |              |
          |              | Indicator in    |              |
          |              | Peer's RMBE)    |              |
          |              +-----------------+              |
          |             1A |             | 2A             |
          |   Active Abort |             | Passive Abort  |
          |                V             V                |
          |      +--------------+   +--------------+      |
          |------|PeerAbortWait |   | Process Abort|------|
                 |              |   |              |
                 +--------------+   +--------------+

      Figure 23 SMC-R abnormal connection termination state diagram

   Figure 23 above shows the SMC-R abnormal connection termination state
   diagram:

   1. Active abort designates the SMC-R peer that is initiating the
      TCP RST processing. At the time that the TCP RST is sent the
      active abort side must also:

      A. Send the PeerConnAbort indicator to the partner via RDMA
      messaging with inline data and then transition to the
      PeerAbortWait state. During this state it will monitor this
      SMC-R connection waiting for the peer to send its corresponding
      PeerConnAbort indicator but will ignore any other activity in
      this connection (i.e. new incoming data). It will also surface an
      appropriate error to any socket API calls issued against this
      socket (e.g. ECONNABORTED, ECONNRESET, etc.).

      B. Once the peer sends the PeerConnAbort indicator to the local
      host, the local host can transition this SMC-R connection to the
      Closed state and reuse this RMBE. Note that the SMC-R peer that
      goes into the Active abort state must provide some protection
      against staying in that state indefinitely should the remote
      SMC-R peer not respond by sending its own PeerConnAbort indicator
      to the local host.
      While this should be a rare scenario, it could occur if the
      remote SMC-R peer (passive abort) suffered a failure right after
      the local SMC-R peer (active abort) sent the PeerConnAbort
      indicator. To protect against these types of failures, a timer
      can be set after entering the PeerAbortWait state. If that timer
      pops before the peer has sent its local PeerConnAbort indicator
      (to the active abort side) then this RMBE can be returned to the
      free pool for possible reallocation. See section 3.2.5 for more
      details.

   2. Passive abort designates the SMC-R peer that is the recipient of
      an SMC-R abort from the peer, designated by the PeerConnAbort
      indicator being sent by the peer in a CDC message. Upon
      receiving this request, the local peer must:

      A. Indicate to the socket application that this connection has
      been aborted using the appropriate error codes, and purge all
      in-flight data for this connection that is waiting to be read or
      waiting to be sent.

      B. Send a CDC message to notify the peer of the PeerConnAbort
      indicator and once that is completed transition this RMBE to the
      Closed state.

   If an SMC-R peer receives a TCP RST for a given SMC-R connection it
   also initiates SMC-R abnormal connection termination processing if it
   has not already been notified (via the PeerConnAbort indicator) that
   the partner is severing the connection. It is possible to have two
   SMC-R endpoints concurrently be in an Active abort role for a given
   connection. In that scenario the flows above still apply but both
   end points take the active abort path (path 1).

4.8.1.2. Other SMC-R connection termination conditions

   The following are additional conditions that have implications for
   SMC-R connection termination:

   o An SMC-R peer being gracefully shut down.
     If an SMC-R peer supports a graceful shutdown operation it should
     attempt to terminate all SMC-R connections as part of shutdown
     processing. This could be accomplished via LLC Delete Link
     requests on all active SMC Links.

   o Abnormal termination of an SMC-R peer. In this example, there may
     be no opportunity for the host to perform any SMC-R cleanup
     processing. In this scenario it is up to the remote peer to
     detect a RoCE communications failure with the failing host. This
     could trigger an SMC link switch but that would also surface RoCE
     errors causing the remote host to eventually terminate all
     existing SMC-R connections to this peer.

   o Loss of RoCE connectivity between two SMC-R peers. If two peers
     are no longer reachable across any links in their SMC Link group
     then both peers perform a TCP reset for the connections, surface
     an error to the local applications and free up all QP resources
     associated with the link group.

5. Security considerations

5.1. VLAN considerations

   The concepts and access control of virtual LANs (VLANs) must be
   extended to also cover the RoCE network traffic flowing across the
   Ethernet.

   The RoCE VLAN configuration and accesses must mirror the IP VLAN
   configuration and accesses over the CEE fabric. This means that
   hosts, routers and switches that have access to specific VLANs on the
   IP fabric must also have the same VLAN access across the RoCE
   fabric. In other words, the SMC-R connectivity will follow the same
   virtual network access permissions as normal TCP/IP traffic.

5.2. Firewall considerations

   As mentioned above, the RoCE fabric inherits the same VLAN
   topology/access as the IP fabric. RoCE is a layer 2 protocol that
   requires both end points to reside in the same layer 2 network (i.e.
   VLAN).
   RoCE traffic cannot traverse multiple VLANs as there is no
   support for routing RoCE traffic beyond a single VLAN. As a result,
   SMC-R communications will also be confined to peers that are members
   of the same VLAN. IP based firewalls are typically inserted between
   VLANs (or physical LANs) and rely on normal IP routing to insert
   themselves in the data path. Since RoCE (and by extension SMC-R) is
   not routable beyond the local VLAN, there is no ability to insert a
   firewall in the network path of two SMC-R peers.

5.3. Host-based IP Filters

   Because SMC-R maintains the TCP three-way handshake for connection
   setup before switching to RoCE out of band, existing IP filters that
   control connection setup flows remain effective in an SMC-R
   environment. IP filters that operate on traffic flowing in an active
   TCP connection are not supported, because the connection data does
   not flow over IP.

5.4. Intrusion Detection Services

   Similar to IP filters, intrusion detection services that operate on
   TCP connection setups are compatible with SMC-R with no changes
   required. However, once the TCP connection has switched to RoCE out
   of band, packets are not available for examination.

5.5. IP Security (IPSec)

   IP Security is not compatible with SMC-R because there are no IP
   packets to operate on. TCP connections that require IP security must
   opt out of SMC-R.

5.6. TLS/SSL

   TLS/SSL is preserved in an SMC-R environment. The TLS/SSL layer
   resides above the SMC-R layer and outgoing connection data is
   encrypted before being passed down to the SMC-R layer for RDMA write.
   Similarly, incoming connection data goes through the SMC-R layer
   encrypted and is decrypted by the TLS/SSL layer as it is today.

   The TLS/SSL handshake messages flow over the TCP connection after the
   connection has switched to SMC-R, and so are exchanged using RDMA
   writes by the SMC-R layer, transparently to the TLS/SSL layer.

6. IANA considerations

   The scarcity of TCP option codes available for assignment is
   understood and this architecture uses experimental TCP options
   following the conventions of RFC 6994 "Shared Use of Experimental TCP
   Options".

   If this protocol achieves wide acceptance a discrete option code may
   be requested by subsequent versions of this protocol.

7. References

7.1. Normative References

   [ROCE]    RDMA over Converged Ethernet specification, URL,
             http://members.infinibandta.org/kwspub/spec/
             Annex_RoCE_final.pdf

   [IBTA]    Infiniband Architecture specification, URL,
             http://www.infinibandta.org/specs

   [RFC793]  University of Southern California Information Sciences
             Institute, "Transmission Control Protocol", RFC 793,
             September 1981.

   [RFC4727] Fenner, B., "Experimental Values in IPv4, IPv6, ICMPv4,
             ICMPv6, UDP, and TCP Headers", RFC 4727, November 2006.

7.2. Informative References

   [RFC6994] Touch, J., "Shared Use of Experimental TCP Options",
             RFC 6994, URL, https://tools.ietf.org/html/rfc6994

8. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.

9. Conventions used in this document

   In the rendezvous flow diagrams, dashed lines (----) are used to
   indicate flows over the TCP/IP fabric and dotted lines (....) are
   used to indicate flows over the RoCE fabric.

   In the data transfer ladder diagrams, dashed lines (----) are used to
   indicate RDMA write operations and dotted lines (....) are used to
   indicate CDC messages, which are RDMA messages with inline data that
   contain control information for the connection.

Appendix A. Formats

A.1. TCP option

   The SMC-R TCP option is formatted in accordance with RFC 6994 "Shared
   Use of Experimental TCP Options". The ExID value is the IBM-1047
   (EBCDIC) encoding for 'SMCR'.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Kind = 254   |  Length = 6   |     x'E2'     |     x'D4'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                  Figure 24 SMC-R TCP option format

A.2. CLC messages

   The following rules apply to all CLC messages:

   General rules on formats:

   o Reserved fields must be set to zero and not validated

   o Each message has an eyecatcher at the start and another eyecatcher
     at the end. These must both be validated by the receiver.

   o SMC version indicator: The only SMC-R version defined in this
     architecture is version 1. In the future, if peers have a
     mismatch of versions, the lowest common version number is used.

A.2.1. Peer ID format

   All CLC messages contain a peer ID that uniquely identifies an
   instance of a TCP/IP stack. This peer ID is required to be
   universally unique across TCP/IP stacks and instances (including
   restarts) of TCP/IP stacks.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Instance ID          |   RoCE MAC (first two bytes)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  RoCE MAC (last four bytes)                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                       Figure 25 Peer ID format

   Instance ID

      A two-byte instance count that ensures that if the same RNIC MAC
      is later used in the peer ID for a different TCP/IP stack, for
      example if an RNIC is redeployed to another stack, the values are
      unique.
      It also ensures that if a TCP/IP stack is restarted, the
      instance ID changes. Value is implementation defined, with one
      suggestion being two bytes of the system clock.

   RoCE MAC

      The RoCE MAC address for one of the peer's RNICs. Note that in a
      virtualized environment this will be the virtual MAC of one of
      the peer's RNICs.

A.2.2. SMC Proposal CLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'E2'     |     x'D4'     |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type = 1    |            Length             |Version| Rsrvd |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                      Client's Peer ID                       -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                   Client's preferred GID                    -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Client's preferred RoCE                    |
   +-         MAC address          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |Offset to mask/prefix area (0) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   .                    Area for future growth                     .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       IPv4 Subnet Mask                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | IPv4 Mask Lgth|           Reserved            | Num IPv6 prfx |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                                                               :
   :           (Variable length) array of IPv6 Prefixes            :
   :                                                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'E2'     |     x'D4'     |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Figure 26 SMC Proposal CLC message format

   The fields present in the SMC Proposal CLC message are:

   Eyecatchers

      Like all CLC messages, the SMC Proposal has beginning and ending
      eyecatchers to aid with verification and parsing. The hex digits
      spell 'SMCR' in IBM-1047 (EBCDIC).

   Type

      CLC message type 1 indicates SMC Proposal.

   Length

      The length of this CLC message. If this is an IPv4 flow, this
      value is 52. Otherwise it is variable depending upon how many
      prefixes are listed.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   Client's Peer ID

      As described in A.2.1. above.

   Client's preferred RoCE GID

      This is the IPv6 address of the client's preferred RNIC on the
      RoCE fabric.

   Client's preferred RoCE MAC address

      The MAC address of the client's preferred RNIC on the RoCE
      fabric. It is required as some operating systems do not have
      neighbor discovery or ARP support for RoCE RNICs.

   Offset to mask/prefix area

      Provides the number of bytes that must be skipped after this
      field, to access the IPv4 Subnet Mask and the fields that follow
      it. Allows for future growth of this signal. In this version of
      the architecture, this value is always zero.

   Area for future growth

      In this version of the architecture, this field does not exist.
      This indicates where additional information may be inserted into
      the signal in the future. The "Offset to mask/prefix area" field
      must be used to skip over this area.

   IPv4 Subnet mask

      If this message is flowing over an IPv4 TCP connection, the value
      of the subnet mask associated with the interface the client sent
      this message over. If this is an IPv6 flow this field is all
      zeroes.

      This field, along with all fields that follow it in this signal,
      must be accessed by skipping the number of bytes listed in the
      "Offset to mask/prefix area" field after the end of that field.

   IPv4 Mask Lgth

      If this message is flowing over an IPv4 TCP connection, the
      number of significant bits in the IPv4 subnet mask. If this is an
      IPv6 flow, this field is zero.

   Num IPv6 prfx

      If this message is flowing over an IPv6 TCP connection, the
      number of IPv6 prefixes that follow, with a maximum value of 8.
      If this is an IPv4 flow this field is zero and is immediately
      followed by the ending eyecatcher.

   Array of IPv6 Prefixes

      For IPv6 TCP connections, a list of the IPv6 prefixes associated
      with the network the client sent this message over, up to a
      maximum of 8 prefixes.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                                                               |
   +                       IPv6 Prefix value                       +
   |                                                               |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Prefix Length |
   +-+-+-+-+-+-+-+-+

            Figure 27 Format for IPv6 Prefix array element

A.2.3. SMC Accept CLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'E2'     |     x'D4'     |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type = 2    |          Length = 68          |Version|F|Rsvd |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                      Server's Peer ID                       -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                      Server's RoCE GID                      -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Server's RoCE                         |
   +-         MAC address          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |     Server QP (bytes 1-2)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Srvr QP byte 3 |          Server RMB Rkey (bytes 1-3)          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Srvr RMB byte 4|Server RMB indx| Srvr RMB alert tkn (bytes 1-2)|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Srvr RMB alert tkn (bytes 3-4)| Bsize |  MTU  |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                Server's RMB virtual address                 -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |    Server's initial packet sequence number    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'E2'     |     x'D4'     |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               Figure 28 SMC Accept CLC message format

   The fields present on the SMC Accept CLC message are:

   Eyecatchers

      Like all CLC messages, the SMC Accept has beginning and ending
      eyecatchers to aid with verification and parsing.
      The hex digits spell 'SMCR' in IBM-1047 (EBCDIC).

   Type

      CLC message type 2 indicates SMC Accept.

   Length

      The SMC Accept CLC message is 68 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   F-bit

      First Contact flag: A 1-bit flag that indicates that the server
      believes this TCP connection is the first SMC-R contact for this
      link group.

   Server's Peer ID

      As described in A.2.1. above.

   Server's RoCE GID

      This is the IPv6 address of the RNIC that the server chose for
      this SMC Link.

   Server's RoCE MAC address

      The MAC address of the server's RNIC for the SMC link. It is
      required as some operating systems do not have neighbor discovery
      or ARP support for RoCE RNICs.

   Server's QP number

      The number for the reliably connected queue pair that the server
      created for this SMC link.

   Server's RMB Rkey

      The RDMA Rkey for the RMB that the server created or chose for
      this TCP connection.

   Server's RMB element index

      This indexes which element within the server's RMB will represent
      this TCP connection.

   Server's RMB element alert token

      A platform defined, architecturally opaque token that identifies
      this TCP connection. Added by the client as immediate data on
      RDMA writes from the client to the server to inform the server
      that there is data for this connection to retrieve from the RMB
      element.

   Bsize

      Server's RMB element buffer size in 4-bit compressed notation,
      where x is the 4-bit value. Actual buffer size value is
      (2^(x+4)) * 1K. Smallest possible value is 16K. Largest size
      supported by this architecture is 512K.

   MTU

      An enumerated value indicating this peer's QP MTU size. The two
      peers exchange this value and the minimum of the two peers'
      values will be used for the QP. This field should only be
      validated on a first contact exchange.
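
      The Bsize and MTU fields described above can be decoded as
      sketched below. This Python is a minimal, non-normative sketch:
      the function and table names are hypothetical, the MTU table
      repeats the enumerated MTU values of this field, and rejecting
      Bsize values above 5 (512K) is an assumption drawn from the
      stated architectural maximum.

```python
# Enumerated MTU values of the MTU field (0 and 6-15 are reserved).
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def decode_bsize(x):
    """Return the RMB element buffer size in bytes for a 4-bit Bsize."""
    if not 0 <= x <= 5:
        # Assumption: values beyond 512K are treated as invalid.
        raise ValueError("Bsize outside the 16K..512K range")
    return (2 ** (x + 4)) * 1024


def negotiate_mtu(local_code, peer_code):
    """Return the QP MTU in bytes: the minimum of the two peers' values."""
    for code in (local_code, peer_code):
        if code not in MTU_BYTES:
            raise ValueError("reserved MTU value")
    return min(MTU_BYTES[local_code], MTU_BYTES[peer_code])


print(decode_bsize(0), decode_bsize(5), negotiate_mtu(3, 5))
```

      Because the enumeration is ordered by size, taking the minimum of
      the two byte values is equivalent to taking the minimum of the
      two enumerated codes.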
4217 The enumerated MTU values are: 4219 0: reserved 4221 1: 256 4223 2: 512 4225 3: 1024 4227 4: 2048 4229 5: 4096 4231 6-15: reserved 4233 Server's RMB virtual address 4235 The virtual address of the server's RMB as assigned by the 4236 server's RNIC. 4238 Server's initial packet sequence number 4240 The starting packet sequence number that this peer will use when 4241 sending to the other peer, so that the other peer can prepare its 4242 QP for the sequence number to expect. 4244 A.2.4. SMC Confirm CLC message format 4246 0 1 2 3 4247 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4248 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4249 | x'E2' | x'D4' | x'C3' | x'D9' | 4250 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4251 | Type = 3 | Length = 68 |Version| Rsrvd | 4252 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4253 | | 4254 +- Client's Peer ID -+ 4255 | | 4256 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4257 | | 4258 +- -+ 4259 | | 4260 +- Client's RoCE GID -+ 4261 | | 4262 +- -+ 4263 | | 4264 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4265 | Client's RoCE | 4266 +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4267 | | Client QP (bytes 1-2) | 4268 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---+ 4269 |Clnt QP byte 3 | Client RMB Rkey (bytes 1-3) | 4270 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4271 |Clnt RMB byte 4|Client RMB indx| Clnt RMB alert tkn (bytes 1-2)| 4272 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4273 | Clnt RMB alert tkn (bytes 3-4)|Bsize | MTU | Reserved | 4274 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4275 | | 4276 +- Client's RMB Virtual Address -+ 4277 | | 4278 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4279 | Reserved | Client's initial packet sequence number | 4280 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4281 | x'E2' | x'D4' | x'C3' | x'D9' | 4282 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4283 Figure 29 SMC Confirm CLC message format 4285 The SMC Confirm CLC message is nearly identical to the SMC Accept 4286 except that it contains client information and lacks a first contact 4287 flag. 4289 The fields present on the SMC Confirm CLC message are: 4291 Eyecatchers 4293 Like all CLC messages, the SMC Confirm has beginning and ending 4294 eyecatchers to aid with verification and parsing. The hex digits 4295 spell 'SMCR' in IBM-1047 (EBCDIC) 4297 Type 4299 CLC message type 3 indicates SMC Confirm 4301 Length 4303 The SMC Confirm CLC message is 68 bytes long 4305 Version 4307 Version of the SMC-R protocol. Version 1 is the only currently 4308 defined value. 4310 Client's Peer ID 4312 As described in A.2.1. above 4314 Clients's RoCE GID 4316 This is the IPv6 address of the RNIC that the client chose for 4317 this SMC Link 4319 Client's RoCE MAC address 4321 The MAC address of the client's RNIC for the SMC link. It is 4322 required as some operating systems do not have neighbor discovery 4323 or ARP support for RoCE RNICs. 4325 Client's QP number 4327 The number for the reliably connected queue pair that the client 4328 created for this SMC link 4330 Client's RMB Rkey 4331 The RDMA Rkey for the RMB that the client created or chose for 4332 this TCP connection 4334 Client's RMB element index 4336 This indexes which element within the client's RMB will represent 4337 this TCP connection 4339 Client's RMB element alert token 4341 A platform defined, architecturally opaque token that identifies 4342 this TCP connection. 
      Added by the server as immediate data on RDMA writes from the
      server to the client to inform the client that there is data for
      this connection to retrieve from the RMB element.

   Bsize:

      Client's RMB element buffer size in 4-bit compressed notation:
      x = 4 bits. Actual buffer size value is (2^(x+4)) * 1K.
      Smallest possible value is 16K. Largest size supported by this
      architecture is 512K.

   MTU

      An enumerated value indicating this peer's QP MTU size. The two
      peers exchange this value, and the minimum of the two peers'
      values will be used for the QP. The values are enumerated in
      A.2.3. This value should only be validated on the first contact
      exchange.

   Client's RMB virtual address

      The virtual address of the client's RMB as assigned by the
      client's RNIC.

   Client's initial packet sequence number

      The starting packet sequence number that this peer will use when
      sending to the other peer, so that the other peer can prepare its
      QP for the sequence number to expect.

A.2.5.
SMC Decline CLC message format 4376 0 1 2 3 4377 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4378 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4379 | x'E2' | x'D4' | x'C3' | x'D9' | 4380 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4381 | Type = 4 | Length = 28 |Version|S|Rsrvd| 4382 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4383 | | 4384 +- Sender's Peer ID -+ 4385 | | 4386 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4387 | Peer Diagnosis Information | 4388 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4389 | | 4390 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4391 | x'E2' | x'D4' | x'C3' | x'D9' | 4392 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4393 Figure 30 SMC Decline CLC message format 4395 The fields present on the SMC Decline CLC message are: 4397 Eyecatchers 4399 Like all CLC messages, the SMC Decline has beginning and ending 4400 eyecatchers to aid with verification and parsing. The hex digits 4401 spell 'SMCR' in IBM-1047 (EBCDIC) 4403 Type 4405 CLC message type 4 indicates SMC Decline 4407 Length 4409 The SMC Decline CLC message is 28 bytes long 4411 Version 4413 Version of the SMC-R protocol. Version 1 is the only currently 4414 defined value. 4416 S-bit 4418 Synch Bit. Indicates that the link group is out of synch and 4419 receiving peer must clean up its representation of the link group 4421 Sender's Peer ID 4423 As described in A.2.1. above 4425 Peer Diagnosis Information 4427 Four bytes of diagnosis information provided by the peer. These 4428 values are defined by the individual peers and it is necessary to 4429 consult the peer's system documentation to interpret the results. 4431 A.3. 
LLC messages

   LLC messages are sent over an existing SMC-R link using RoCE message
   passing and are always 44 bytes long so that they fit into the space
   available in a single WQE without requiring the receiver to post
   receive buffers. If all 44 bytes are not needed, they are padded out
   with zeroes. LLC messages are in a request/response format. The
   message type is the same for request and response, and a flag
   indicates whether a message is flowing as a request or a response.

   The two high order bits of an LLC message opcode indicate how it is
   to be handled by a peer that does not support the opcode.

   If the high order bits of the opcode are b'00' then the peer must
   support the LLC message and indicate a protocol error if it does not.

   If the high order bits of the opcode are b'10' then the peer must
   silently discard the LLC message if it does not support the opcode.
   This requirement allows peers to tolerate advanced but optional
   function.

   High order bits of b'11' indicate a Connection Data Control (CDC)
   message as described in A.4.

A.3.1.
CONFIRM LINK LLC message format 4457 0 1 2 3 4458 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4459 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4460 | type = 1 | length = 44 | Reserved |R| Reserved | 4461 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4462 | Sender's RoCE | 4463 +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4464 | | | 4465 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 4466 | | 4467 +- -+ 4468 | Sender's RoCE GID | 4469 +- -+ 4470 | | 4471 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4472 | |Sender's QP number, bytes 1-2 | 4473 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4474 |Sender QP byte3| Link number |Sender's link userid, bytes 1-2| 4475 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4476 |Sender's link userid bytes, 3-4| Max links | Reserved | 4477 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4478 | | 4479 +- Reserved -+ 4480 | | 4481 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4482 Figure 31 CONFIRM LINK LLC message format 4484 The CONFIRM LINK LLC message is required to be exchanged between the 4485 server and client over a newly created SMC-R link to complete the 4486 setup of an SMC link. Its purpose is to confirm that the RoCE path 4487 is actually usable. 4489 On first contact this flows after the server receives the SMC Confirm 4490 CLC message from the client over the IP connection. For additional 4491 links added to an SMC link group, it flows after the ADD LINK and ADD 4492 LINK CONTINUATION exchange. This flow provides confirmation that the 4493 queue pair is in fact usable. Each peer echoes its RoCE information 4494 back to the other. 4496 Type 4498 Type 1 indicates CONFIRM LINK 4500 Length 4501 All LLC messages are 44 bytes long 4503 R 4505 Reply flag. When set indicates this is a CONFIRM LINK REPLY 4507 Sender's RoCE MAC address 4509 The MAC address of the sender's RNIC for the SMC link. 
      It is required as some operating systems do not have neighbor
      discovery or ARP support for RoCE RNICs.

   Sender's RoCE GID

      This is the IPv6 address of the RNIC that the sender is using for
      this SMC-R link.

   Sender's QP number

      The number for the reliably connected queue pair that the sender
      created for this SMC-R link.

   Link number

      An identifier assigned by the server that uniquely identifies the
      link within the link group. This identifier is ONLY unique
      within a link group. Provided by the server and echoed back by
      the client.

   Link User ID

      An opaque, implementation defined identifier assigned by the
      sender and provided to the receiver solely for purposes of
      display, diagnosis, network management, etc. The link user ID
      should be unique across the sender's entire software space,
      including all other link groups.

   Max Links

      The maximum number of links the sender can support in a link
      group. The maximum for this link group is the smaller of the
      values provided by the two peers.

A.3.2.
ADD LINK LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 2    |   length = 44 | Rsrvd |RsnCode|R|Z| Reserved  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Sender's RoCE                         |
   +-          MAC address         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   +-                                                             -+
   |                       Sender's RoCE GID                       |
   +-                                                             -+
   |                                                               |
   +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               | Sender's QP number, bytes 1-2 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Sender QP byte3|  Link number  | Rsrvd |  MTU  |  Initial PSN  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Initial PSN, continued     |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
   |                           Reserved                            |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 32 ADD LINK LLC message format

   The ADD LINK LLC message is sent over an existing link in the link
   group when a peer wishes to add an SMC-R link to an existing SMC-R
   link group. It is sent by the server to add a new SMC-R link to the
   group, or by the client to request that the server add a new link,
   for example when a new RNIC becomes active. When sent from the
   client to the server, it represents a request that the server
   initiate an ADD LINK exchange.

   This message is sent immediately after the initial SMC link in the
   group completes, as described in 3.5.1. First contact. It can also be
   sent over an existing SMC-R link group at any time as new RNICs are
   added and become available. Therefore there can be as few as one new
   RMB RToken to communicate, or several. RTokens will be
   communicated using ADD LINK CONTINUATION messages.
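
   As an illustration only, the layout in Figure 32 can be sketched in
   Python as follows. The function name pack_add_link and its parameter
   names are hypothetical and are not part of this specification. The
   byte offsets are read off Figure 32; network (big-endian) byte order
   for the multi-byte fields is an assumption of this sketch.

```python
LLC_LEN = 44         # every LLC message is 44 bytes long
LLC_ADD_LINK = 2     # type = 2 indicates ADD LINK


def pack_add_link(mac, gid, qp_num, link_num, mtu, psn,
                  reply=False, reject=False, rsn_code=0):
    """Build a 44-byte ADD LINK message (hypothetical helper).

    Offsets follow Figure 32; big-endian field encoding is assumed.
    """
    if len(mac) != 6 or len(gid) != 16:
        raise ValueError("MAC must be 6 bytes and GID 16 bytes")
    if not 0 <= qp_num < (1 << 24) or not 0 <= psn < (1 << 24):
        raise ValueError("QP number and initial PSN are 3-byte fields")
    # Byte 3: R is the high order bit, Z the next bit, rest reserved.
    flags = (0x80 if reply else 0) | (0x40 if reject else 0)
    msg = bytes([LLC_ADD_LINK, LLC_LEN, rsn_code & 0x0F, flags])
    msg += bytes(mac) + bytes(gid)          # bytes 4-9 and 10-25
    msg += qp_num.to_bytes(3, "big")        # bytes 26-28
    msg += bytes([link_num, mtu & 0x0F])    # bytes 29 and 30
    msg += psn.to_bytes(3, "big")           # bytes 31-33
    return msg.ljust(LLC_LEN, b"\x00")      # reserved bytes are zero
```

   The unused trailing bytes are zero-filled, matching the rule that
   LLC messages shorter than 44 bytes are padded out with zeroes.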
4588 The contents of the ADD LINK LLC message are: 4590 Type 4592 Type 2 indicates ADD LINK 4594 Length 4596 All LLC messages are 44 bytes long 4598 RsnCode 4600 If the Z (rejection) flag is set, this field provides the reason 4601 code. Values can be: 4603 X'1' - no alternate path available: set when the server provides 4604 the same MAC/GID as an existing SMC-R link in the group, and the 4605 client does not have any additional RNICs available (i.e., server 4606 is attempting to set up an asymmetric link but none is available) 4608 X'2' - Invalid MTU value specified 4610 R 4612 Reply flag. When set indicates this is an ADD LINK REPLY 4614 Z 4616 Rejection flag. When set on reply indicates that the server's 4617 ADD LINK was rejected by the client. When this flag is set, the 4618 reason code will also be set. 4620 Sender's RoCE MAC address 4622 The MAC address of the sender's RNIC for the new SMC-R link. It 4623 is required as some operating systems do not have neighbor 4624 discovery or ARP support for RoCE RNICs. 4626 Sender's RoCE GID 4628 The IPv6 address of the RNIC that the sender is using for the new 4629 SMC-R Link 4631 Sender's QP number 4633 The number for the reliably connected queue pair that the sender 4634 created for the new SMC-R link 4636 Link number 4637 An identifier for the new SMC-R link. This is assigned by the 4638 server and uniquely identifies the link within the link group. 4639 This identifier is ONLY unique within a link group. Provided by 4640 the server and echoed back by the client 4642 MTU 4644 An enumerated value indicating this peer's QP MTU size. The two 4645 peers exchange this value and the minimum of the peer's value 4646 will be used for the QP. The values are enumerated in A.2.3. 4648 Initial PSN 4650 The starting packet sequence number that this peer will use when 4651 sending to the other peer, so that the other peer can prepare its 4652 QP for the sequence number to expect. 4654 A.3.3. 
ADD LINK CONTINUATION LLC message format 4656 0 1 2 3 4657 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4658 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4659 | type = 3 | length = 44 | Reserved |R| Reserved | 4660 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4661 | Linknum | NumRTokens | Reserved | 4662 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4663 | | 4664 +- -+ 4665 | | 4666 +- Rkey/Rtoken Pair -+ 4667 | | 4668 +- -+ 4669 | | 4670 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4671 | | 4672 +- -+ 4673 | | 4674 +- Rkey/Rtoken Pair or zeroes -+ 4675 | | 4676 +- -+ 4677 | | 4678 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4679 | Reserved | 4680 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4681 Figure 33 ADD LINK CONTINUATION LLC message format 4683 When a new SMC-R link is added to an SMC-R link group, it is 4684 necessary to communicate the new link's RTokens for the RMBs that the 4685 SMC-r link group can access. This message follows the ADD LINK and 4686 provides the RTokens. 4688 The server kicks off this exchange by sending the first ADD LINK 4689 CONTINUATION LLC message, and the server controls the exchange as 4690 described below. 4692 o If the client and the server require the same number of ADD LINK 4693 CONTINUATION messages to communicate their RTokens, the server 4694 starts the exchange by sending the client the first ADD LINK 4695 CONTINUATION request to the client with its RTokens, then the 4696 client responds with an ADD LINK CONTINUATION response with its 4697 RTokens, and so on until the exchange is completed. 4699 o If the server requires more ADD LINK CONTINUATION messages than 4700 the client, then after the client has communicated all its 4701 RTokens, the server continues to send ADD LINK CONTINUATION 4702 request messages to the client. 
The client continues to respond, 4703 using empty (number of RTokens to be communicated = 0) ADD LINK 4704 CONTINUATION response messages. 4706 o If the client requires more ADD LINK CONTINUATION messages than 4707 the server, then after communicating all its RTokens the server 4708 will continue to send empty ADD LINK CONTINUATION messages to the 4709 client to solicit replies with the client's RTokens, until all 4710 have been communicated. 4712 The contents of this message are: 4714 Type 4716 Type 3 indicates ADD LINK CONTINUATION 4718 Length 4720 All LLC messages are 44 bytes long 4722 R 4724 Reply flag. When set indicates this is an ADD LINK CONTINUATION 4725 REPLY 4727 LinkNum 4729 The link number of the new link within the SMC link group that 4730 Rkeys are being communicated for 4732 NumRTokens 4734 Number of RTokens remaining to be communicated (including the 4735 ones in this message). If the value is less than or equal to 2, 4736 this is the last message. If it is greater than 2, another 4737 continuation message will be required, and its value will be the 4738 value in this message minus 2, and so on until all Rkeys are 4739 communicated. The maximum value for this field is 255. 4741 Up to 2 Rkey/RToken pairs 4743 These consist of an Rkey for an RMB that is known on the SMC-R 4744 link that this message was sent over (the reference Rkey), paired 4745 with the same RMB's RToken over the new SMC link. A full RToken 4746 is not required for the reference because it is only being used 4747 to distinguish which RMB it applies to, not address it. 
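
   The NumRTokens countdown and the use of empty messages described
   above can be sketched as follows. This is a non-normative
   illustration; the function names continuation_counts and
   exchange_rounds are hypothetical and do not appear in this
   specification.

```python
def continuation_counts(total_rtokens, per_message=2):
    """Return the NumRTokens value carried by each successive message.

    Each ADD LINK CONTINUATION message carries up to two pairs, and
    NumRTokens counts the pairs still to be communicated, including
    those in the current message.
    """
    if not 0 <= total_rtokens <= 255:
        raise ValueError("NumRTokens must be in the range 0..255")
    counts = [total_rtokens]
    while counts[-1] > per_message:
        counts.append(counts[-1] - per_message)
    return counts


def exchange_rounds(server_rtokens, client_rtokens):
    """Pair each server request with the client response.

    The side that runs out of RTokens first keeps sending empty
    messages (NumRTokens = 0) until the other side has finished.
    """
    server = continuation_counts(server_rtokens)
    client = continuation_counts(client_rtokens)
    rounds = max(len(server), len(client))
    server += [0] * (rounds - len(server))
    client += [0] * (rounds - len(client))
    return list(zip(server, client))
```

   For example, a server with five RTokens and a client with one
   RToken need three request/response rounds, and the client's second
   and third responses are empty.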
4749 0 1 2 3 4750 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4751 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4752 | Reference Rkey | 4753 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4754 | New Rkey | 4755 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4756 | | 4757 +- New Virtual Address -+ 4758 | | 4759 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4760 Figure 34 Rkey/Rtoken pair format 4762 The contents of the RKey/RToken pair are: 4764 Reference Rkey 4766 The Rkey of the RMB as it is already known on the SMC-R link over 4767 which this message is being sent. Required so that the peer knows 4768 which RMB to associate the new Rtoken with. 4770 New Rkey 4772 The Rkey of this RMB as it is known over the new SMC-R link 4774 New Virtual Address 4776 The virtual address of this RMB as it is known over the new SMC-R 4777 link. 4779 A.3.4. DELETE LINK LLC message format 4781 0 1 2 3 4782 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4783 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4784 | type = 4 | length = 44 | Reserved |R|A|O| Rsrvd | 4785 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4786 | Linknum | Reason code (bytes 1-3) | 4787 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4788 |RsnCode byte 4 | | 4789 +-+-+-+-+-+-+-+-+ -+ 4790 | | 4791 +- -+ 4792 | | 4793 +- -+ 4794 | | 4795 +- Reserved -+ 4796 | | 4797 +- -+ 4798 | | 4799 +- -+ 4800 | | 4801 +- -+ 4802 | | 4803 +- -+ 4804 | | 4805 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4806 Figure 35 DELETE LINK LLC message format 4808 When the client or server detects that a QP or SMC-R link goes down 4809 or needs to come down, it sends this message over one of the other 4810 links in the link group. 
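
   As a rough, non-normative sketch, a receiver could decode the fixed
   fields of Figure 35 as shown below. The function name
   parse_delete_link and the dictionary keys are hypothetical. The bit
   positions of the R, A, and O flags are read off Figure 35, and
   big-endian encoding of the 4-byte reason code is an assumption of
   this sketch.

```python
def parse_delete_link(msg):
    """Decode a DELETE LINK LLC message (hypothetical helper)."""
    if len(msg) != 44 or msg[0] != 4 or msg[1] != 44:
        raise ValueError("not a DELETE LINK LLC message")
    flags = msg[3]
    return {
        "reply": bool(flags & 0x80),       # R flag
        "all_links": bool(flags & 0x40),   # A flag
        "orderly": bool(flags & 0x20),     # O flag
        "link_num": msg[4],
        # Reason code, for example X'00010000' for lost path
        "reason": int.from_bytes(msg[5:9], "big"),
    }
```

   The remaining 35 bytes of the message are reserved and are ignored
   by this sketch.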

   When the DELETE LINK is sent from the client it only serves as a
   notification, and the client expects the server to send a DELETE LINK
   Request in response. To avoid races, only the server will initiate
   the actual DELETE LINK Request and Response sequence that results
   from notification from the client.

   The server can also initiate the DELETE LINK without notification
   from the client if it detects an error or if orderly link termination
   was initiated.

   The client may also request termination of the entire link group and
   the server may terminate the entire link group using this message.

   The contents of this message are:

   Type

      Type 4 indicates DELETE LINK

   Length

      All LLC messages are 44 bytes long

   R

      Reply flag. When set indicates this is a DELETE LINK REPLY

   A

      All flag. When set indicates that all links in the link group
      are to be terminated. This terminates the link group.

   O

      Orderly flag. Indicates orderly termination. Orderly termination
      is generally caused by an operator command rather than an error
      on the link. When the client requests orderly termination, the
      server may wait to complete other work before terminating.

   LinkNum

      The link number of the link to be terminated. If the A flag is
      set, this field has no meaning and is set to 0.

   RsnCode

      The termination reason code. Currently defined reason codes are:

      Request Reason Codes:

      o  X'00010000' = Lost path

      o  X'00020000' = Operator initiated termination

      o  X'00030000' = Program initiated termination (link inactivity)

      o  X'00040000' = LLC protocol violation

      o  X'00050000' = Asymmetric link no longer needed

      Response Reason Codes:

      o  X'00100000' = Unknown Link ID (no link)

      o  Others TBD

A.3.5.
CONFIRM RKEY LLC message format 4879 0 1 2 3 4880 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4881 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4882 | type = 6 | length = 44 | Reserved |R|0|Z|C|Rsrvd | 4883 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4884 | NumTkns | New RMB Rkey for this link (bytes 1-3) | 4885 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4886 |ThisLink byte 4| | 4887 +-+-+-+-+-+-+-+-+ -+ 4888 | New RMB virtual address for this link | 4889 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4890 | | | 4891 +-+-+-+-+-+-+-+-+ -+ 4892 | | 4893 +- Other link RMB specification or zeros -+ 4894 | | 4895 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4896 | | | 4897 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+ 4898 | | 4899 +- -+ 4900 | Other link RMB specification or zeroes | 4901 +- +-+-+-+-+-+-+-+-+ 4902 | | Reserved | 4903 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4904 Figure 36 CONFIRM RKEY LLC message format 4906 The CONFIRM_RKEY flow can be sent at any time from either the client 4907 or the server, to inform the peer that an RMB has been created or 4908 deleted. The creator of a new RMB must inform its peer of the new 4909 RMB's RToken for all SMC-R links in the SMC-R link group. The 4910 deleter of an RMB must inform its peer of the deleted RMB's RToken 4911 for all SMC-R links. 4913 For RMB creation, the creator sends this message over the SMC link 4914 that the first TCP connection that uses the new RMB is using. This 4915 message contains the new RMB RToken for the SMC link that the message 4916 is sent over, then it lists the sender's SMC links in the link group 4917 paired with the new RToken for the new RMB for that link. This 4918 message can communicate the new RTokens for 3 QPs: the QP for the 4919 link this message is sent over, and 2 others. 
   If there are more than 3 links in the SMC-R link group, a CONFIRM
   RKEY CONTINUATION message will be required.

   For RMB deletion, the creator sends the same format of message with a
   delete flag set, to inform the peer that the RMB's RTokens on all
   links in the group are deleted.

   In both cases, the peer responds by simply echoing the message with
   the response flag set. If the response is a negative response, the
   sender must recalculate the RToken set and start a new CONFIRM RKEY
   exchange from the beginning. The timing of this retry is controlled
   by the C flag as described below.

   The contents of this message are:

   Type

      Type 6 indicates CONFIRM RKEY

   Length

      All LLC messages are 44 bytes long

   R

      Reply flag. When set indicates this is a CONFIRM RKEY REPLY

   0

      Reserved bit

   Z

      Negative response flag

   C

      Configuration Retry bit. If this is a negative response and this
      flag is set, the originator should recalculate the Rkey set and
      retry this exchange as soon as the current configuration change
      is completed. If this flag is not set on a negative response, the
      originator must wait for the next natural stimulus (for example,
      a new TCP connection started that requires a new RMB) before
      retrying.

   NumTkns

      The number of other link/RToken pairs, including those provided
      in this message, to be communicated. Note that this value does
      not include the RToken for the link this message was sent on
      (i.e., the maximum value is 2). If this value is three or fewer
      this is the only message in the exchange. If this value is
      greater than three, a CONFIRM RKEY CONTINUATION message will be
      required.

      Note: in this version of the architecture, 8 is the maximum
      number of links supported in a link group.

   New RMB Rkey for this link

      The new RMB's Rkey as assigned on the link this message is being
      sent over.

   New RMB virtual address for this link

      The new RMB's virtual address as assigned on the link this
      message is being sent over.

   Other link RMB specification

      The new RMB's specification on the other links in the link group,
      as shown in Figure 37.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Link number  | RMB's Rkey for the specified link (bytes 1-3) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |New Rkey byte 4|                                               |
   +-+-+-+-+-+-+-+-+                                              -+
   |         RMB's virtual address for the specified link          |
   +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |
   +-+-+-+-+-+-+-+-+

             Figure 37 Format of link number/Rkey pairs

   Link number

      The link number for a link in the link group

   RMB's Rkey for the specified link

      The Rkey used to reach the RMB over the link whose number was
      specified in the link number field.

   RMB's virtual address for the specified link

      The virtual address used to reach the RMB over the link whose
      number was specified in the link number field.

A.3.6.
CONFIRM RKEY CONTINUATION LLC message format 5022 0 1 2 3 5023 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 5024 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5025 | type = 8 | length = 44 | Reserved |R|0|Z| Rsrvd | 5026 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5027 | NumTknsLeft | | 5028 +-+-+-+-+-+-+-+-+ -+ 5029 | | 5030 +- Other link RMB specification -+ 5031 | | 5032 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5033 | | | 5034 +-+-+-+-+-+-+-+-+ -+ 5035 | | 5036 +- Other link RMB specification or zeros -+ 5037 | | 5038 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5039 | | | 5040 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+ 5041 | | 5042 +- -+ 5043 | Other link RMB specification or zeroes | 5044 +- +-+-+-+-+-+-+-+-+ 5045 | | Reserved | 5046 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5048 The CONFIRM RKEY CONTINUATION LLC message is used to communicate any 5049 additional RMB RTokens that did not fit into the CONFIRM RKEY 5050 message. Each of these messages can hold up to 3 RMB RTokens. The 5051 Numlinks field indicates how many RMB RTokens are to be communicated, 5052 including the ones in this message. If the value is 3 or less, this 5053 is the last message of the group. If the value is 4 or higher, 5054 additional CONFIRM RKEY CONTINUATION messages will follow, and the 5055 Numlinks value will be a countdown until all are communicated. 5057 Like the CONFIRM RKEY message, the peer responds by echoing the 5058 message back with the reply flag set. 5060 The contents of this message are: 5062 Type 5064 Type 8 indicates CONFIRM RKEY CONTINUATION 5066 Length 5068 All LLC messages are 44 bytes long 5070 R 5072 Reply flag. When set indicates this is a CONFIRM RKEY 5073 CONTINUATION REPLY 5075 0 5077 Reserved bit 5079 Z 5081 Negative response flag 5083 NumTknsLeft 5085 The number of link/RToken pairs, including those provided in this 5086 message, that are remaining to be communicated. 
If this value is 5087 three or fewer this is the last message in the exchange. If this 5088 value is greater than three, another CONFIRM RKEY CONTINUATION 5089 message will be required. Note that in this version of the 5090 architecture, 8 is the maximum number of links supported in a 5091 link group. 5093 Other link RMB specifications 5095 The new RMB's specification on other links in the link group, as 5096 shown in Figure 38. 5098 A.3.7. DELETE RKEY LLC message format 5100 0 1 2 3 5101 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 5102 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5103 | type = 9 | length = 44 | Reserved |R|0|Z| Rsrvd | 5104 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5105 | Count | Error Mask | Reserved | 5106 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5107 | First deleted Rkey | 5108 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5109 | Second deleted Rkey or zeros | 5110 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5111 | Third deleted Rkey or zeros | 5112 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5113 | Fourth deleted Rkey or zeros | 5114 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5115 | Fifth deleted Rkey or zeros | 5116 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5117 | Sixth deleted Rkey or zeros | 5118 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5119 | Seventh deleted Rkey or zeros | 5120 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5121 | Eighth deleted Rkey or zeros | 5122 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5123 | Reserved | 5124 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5126 The DELETE_RKEY flow can be sent at any time from either the client 5127 or the server, to inform the peer that one or more RMBs have been 5128 deleted. 
   Because the peer already knows every RMB's Rkey on each link in the
   link group, this message only specifies one Rkey for each RMB being
   deleted. The Rkey provided for each deleted RMB will be its Rkey as
   known on the SMC-R link that this message is sent over.

   It is not necessary to provide the entire RToken. The Rkey alone is
   sufficient for identifying an existing RMB.

   The peer responds by simply echoing the message with the response
   flag set. If the peer did not recognize an Rkey, a negative response
   flag will be set; however, no aggressive recovery action beyond
   logging the error will be taken.

   The contents of this message are:

   Type

      Type 9 indicates DELETE RKEY

   Length

      All LLC messages are 44 bytes long

   R

      Reply flag. When set indicates this is a DELETE RKEY REPLY

   0

      Reserved bit

   Z

      Negative response flag

   Count

      Number of RMBs being deleted by this message. Maximum value is 8.

   Error Mask

      If this is a negative response, indicates which RMBs were not
      successfully deleted. Each bit corresponds to a listed RMB. For
      example, b'01010000' indicates that the second and fourth Rkeys
      were not successfully deleted.

   Deleted Rkeys

      A list of Count Rkeys. Provided on the request flow and echoed
      back on the response flow. Each Rkey is valid on the link this
      message is sent over, and represents a deleted RMB. Up to eight
      RMBs can be deleted in this message.

A.3.8.
TEST LINK LLC message format 5183 0 1 2 3 5184 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 5185 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5186 | type = 7 | length = 44 | Reserved |R| Reserved | 5187 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5188 | | 5189 +- -+ 5190 | | 5191 +- User Data -+ 5192 | | 5193 +- -+ 5194 | | 5195 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5196 | | 5197 +- -+ 5198 | | 5199 +- -+ 5200 | Reserved | 5201 +- -+ 5202 | | 5203 +- -+ 5204 | | 5205 +- -+ 5206 | | 5207 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5208 Figure 38 TEST LINK LLC message format 5210 The TEST_LINK request can be sent from either peer to the other on an 5211 existing SMC-R link at any time to test that the SMC-R link is active 5212 and healthy at the software level. A peer that receives a TEST_LINK 5213 LLC message immediately sends back a TEST_LINK reply, echoing back 5214 the user data. Also refer to 4.5.3. TCP Keepalive processing. 5216 The contents of this message are: 5218 Type 5220 Type 7 indicates TEST LINK 5222 Length 5224 All LLC messages are 44 bytes long 5226 R 5227 Reply flag. When set indicates this is a TEST LINK REPLY 5229 User Data 5231 The receiver of this message echoes the sender's data back in a 5232 TEST_LINK response LLC message 5234 A.4. Connection Data Control (CDC) message format 5236 The RMBE control data is communicated using Connection Data Control 5237 (CDC) messages, which use RDMA message passing using inline data, 5238 similar to LLC messages. Also similar to LLC messages, this data 5239 block is 44 bytes long to ensure that it can fit into private data 5240 areas of receive WQEs, without requiring the receiver to post receive 5241 buffers. 5243 Unlike LLC messages, this data is integral to the data path so its 5244 processing must be prioritized and optimized similarly to other data 5245 path processing.
While LLC messages may be processed on a slower 5246 path than data, these messages cannot be. 5248 0 1 2 3 5249 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 5250 0 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5251 | Type = x'FE' | Length = 44 | Sequence number | 5252 4 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5253 | SMC-R alert token | 5254 8 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5255 | Reserved | Producer cursor wrap seqno | 5256 12 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5257 | Producer Cursor | 5258 16 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5259 | Reserved | Consumer cursor wrap seqno | 5260 20 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5261 | Consumer Cursor | 5262 24 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5263 |B|P|U|R|F|Rsrvd|D|C|A| Reserved | 5264 28 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5265 | | 5266 32 +- -+ 5267 | | 5268 36 +- Reserved -+ 5269 | | 5270 40 +- -+ 5271 | | 5272 44 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5274 Figure 39 Connection Data Control (CDC) Message Format 5276 Type = x'FE' 5278 This type number has the two high order bits turned on to enable 5279 processing to quickly distinguish it from an LLC message 5281 Length = 44 5283 The length of inline data that does not require posting of a 5284 receive buffer. 5286 Sequence number 5288 A 2 byte unsigned integer that represents a wrapping sequence 5289 number. The initial value is one and this value can wrap to 0. 5290 Incremented with every control message send, except for the 5291 failover data validation message, and used to guard against 5292 processing an old control message out of sequence, and also used 5293 in failover data validation. In normal usage, if this number is 5294 less than the last received value, discard this message. 
If 5295 greater, process this message. Old control messages can be 5296 lost with no ill effect, but cannot be processed after newer 5297 ones. 5299 If this is a failover validation CDC message (F flag set), then 5300 the receiver must verify that it has received and fully processed 5301 the RDMA write that was described by the CDC message with the 5302 sequence number in this message. If not, the TCP connection must 5303 be reset, to guard against data loss. Details of this processing 5304 are in section 4.6.1. 5306 SMC-R alert token 5308 The endpoint-assigned alert token that identifies which TCP 5309 connection on the link group this control message refers to. 5311 Producer cursor wrap seqno 5313 A 2 byte unsigned integer that represents a wrapping counter 5314 incremented by the producer whenever the data written into this 5315 RMBE receiver buffer causes a wrap (i.e. the producer cursor 5316 wraps). This is used by the receiver to determine when new data 5317 is available even though the cursors appear unchanged, such as 5318 when a full window size write is completed (Producer cursor of 5319 this RMBE sent by peer = Local Consumer Cursor) or in scenarios 5320 where the Producer Cursor sent for this RMBE is less than the Local Consumer 5321 Cursor. 5323 Producer cursor 5325 Unsigned, 4 byte integer that is a wrapping offset into the RMBE 5326 data area. Points to the next byte of data to be written by the 5327 sender. Can advance up to the receiver's Consumer Cursor as known 5328 by the sender. When the urgent data present indicator is on, it 5329 points one byte beyond the last byte of urgent data. When 5330 computing this cursor, the presence of the eyecatcher in the RMBE 5331 data area must be accounted for. The first writeable data 5332 location in the RMBE is at offset 4, so this cursor begins at 4 5333 and wraps to 4.
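The cursor and wrap-counter behavior described above can be illustrated with a minimal sketch. It is not part of the protocol definition: the function name, the parameter names, and the assumption that the RMBE length passed in includes the 4-byte eyecatcher are all hypothetical choices made for illustration.

```python
EYECATCHER_LEN = 4  # first writeable offset in the RMBE, per the text above


def advance_cursor(cursor, wrap_seqno, nbytes, rmbe_len):
    """Advance a producer (or consumer) cursor by nbytes.

    Hypothetical helper. rmbe_len is ASSUMED to be the full RMBE
    length, including the 4-byte eyecatcher. Returns the new cursor
    and the (possibly incremented) 2-byte wrap sequence number.
    """
    data_len = rmbe_len - EYECATCHER_LEN
    if not EYECATCHER_LEN <= cursor < rmbe_len:
        raise ValueError("cursor outside the writeable RMBE area")
    if not 0 <= nbytes <= data_len:
        raise ValueError("cannot move more than one full window")

    # Work in offsets relative to the first writeable byte.
    pos = (cursor - EYECATCHER_LEN) + nbytes
    if pos >= data_len:
        # The cursor wrapped: it restarts at offset 4 and the wrap
        # counter is bumped, itself wrapping as a 2-byte unsigned value.
        pos -= data_len
        wrap_seqno = (wrap_seqno + 1) & 0xFFFF
    return pos + EYECATCHER_LEN, wrap_seqno
```

With a 68-byte RMBE (64 writeable bytes), a full window write starting at offset 4 returns the cursor to 4 and changes only the wrap counter. This is the case in which the receiver needs the wrap seqno to tell that new data has arrived.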
5335 Consumer cursor wrap seqno 5337 2 byte unsigned integer that mirrors the value of the Producer 5338 cursor wrap sequence number when the last read from this RMBE 5339 occurred. Used as an indicator of how far along the consumer is 5340 in reading data (i.e. processed last wrap point or not). The 5341 producer side can use this indicator to detect whether more data 5342 can be written to the partner in full window write scenarios 5343 (where the Producer Cursor = Consumer Cursor as known on the 5344 remote RMBE). In this scenario, if the consumer sequence number 5345 equals the local producer sequence number, the producer knows that 5346 more data can be written. 5348 Consumer Cursor 5350 Unsigned 4 byte integer that is a wrapping offset into the 5351 sender's RMBE data area. Points to the offset of the next byte 5352 of data to be consumed by the peer in its own RMBE. When 5353 computing this cursor, the presence of the eyecatcher in the RMBE 5354 data area must be accounted for. The first writeable data 5355 location in the RMBE is at offset 4, so this cursor begins at 4 5356 and wraps to 4. The sender cannot write beyond this cursor into 5357 the peer's RMBE without causing data loss. 5359 B-bit 5361 Writer blocked indicator: Sender is blocked for writing, requires 5362 explicit notification when receive buffer space is available. 5364 P-bit 5366 Urgent data pending: Sender has urgent data pending for this 5367 connection 5369 U-bit 5371 Urgent data present: Indicates that urgent data is present in the 5372 RMBE data area, and the producer cursor points to one byte beyond 5373 the last byte of urgent data. 5375 R-bit 5377 Request for consumer cursor update: Indicates that a consumer 5378 cursor update is requested bypassing any window size optimization 5379 algorithms. 5381 F-bit 5383 Failover validation indicator: sent by a peer to guard against 5384 data loss during failover when the TCP connection is being moved 5385 to another SMC-R link in the link group.
When this bit is set, 5386 the only other fields in the CDC message that are significant are 5387 the type, length, SMC-R alert token and the sequence number. The 5388 receiver must validate that it has fully processed the RDMA write 5389 described by the previous CDC message bearing the same sequence 5390 number as this validation message. If it has, no further action 5391 is required. If it has not, the TCP connection must be reset. 5392 This processing is described in detail in section 4.6.1. 5394 D-bit 5396 Sending done indicator: Sent by a peer when it is done writing 5397 new data into the receiver's RMBE data area. 5399 C-bit 5401 Peer Closed Connection indicator: Sent by a peer when it is 5402 completely done with this connection and will no longer be making 5403 any updates to the receiver's RMBE, and will also not be sending 5404 any more control messages. 5406 A-bit 5408 Abnormal Close indicator: Sent by a peer when the connection is 5409 abnormally terminated (for example, the TCP connection was 5410 reset). When sent it indicates that the peer is completely done 5411 with this connection and will no longer be making any updates to 5412 this RMBE or sending any more control messages. It also indicates 5413 that the RMBE owner must flush any remaining data on this 5414 connection and surface an error return code to any outstanding 5415 socket APIs on this connection (same processing as receiving an 5416 RST segment on a TCP connection). 5418 Appendix B. Socket API considerations 5420 A key design goal for SMC-R is to require no application changes for 5421 exploitation. It is confined to socket applications using stream 5422 (i.e. TCP protocol) sockets over IPv4 or IPv6. Because 5423 the switch to the SMC-R protocol occurs after a TCP connection 5424 is established, no changes are required in socket address family or in 5425 the IP addresses and ports that the socket application is using.
5426 Existing socket APIs that allow the application to retrieve local and 5427 remote socket address structures for an established TCP connection 5428 (for example, getsockname() and getpeername()) will continue to 5429 function as they have before. Existing DNS setup and APIs for 5430 resolving hostnames to IP addresses and vice versa also continue to 5431 function without any changes. In general all of the usual socket APIs 5432 that are used for TCP communications (send APIs, recv APIs, etc.) will 5433 continue to function as they do today even if SMC-R is used as the 5434 underlying protocol. 5436 Each SMC-R enabled implementation does however need to pay special 5437 attention to any socket APIs that have a reliance on the underlying 5438 TCP and IP protocols and ensure that their behavior in an SMC-R 5439 environment is reasonable and minimizes impact to the application. 5440 While the basic socket API set is fairly similar across different 5441 Operating Systems, when it comes to advanced socket API options there 5442 is more variability. Each implementation needs to perform a detailed 5443 analysis of its API options and SMC-R impact and implications. As 5444 part of that step a discussion or review with other implementations 5445 supporting SMC-R would be useful to ensure a consistent 5446 implementation. 5448 setsockopt()/getsockopt() considerations 5450 These APIs allow socket applications to manipulate socket, transport 5451 (TCP/UDP) and IP level options associated with a given socket. 5452 Typically, a platform restricts the number of IP options available to 5453 stream (TCP) socket applications given their connection oriented 5454 nature. The general guideline here is to continue processing these 5455 APIs in a manner that allows for application compatibility. Some 5456 options will be relevant to the SMC-R protocol and will require 5457 special processing under the covers.
For example, the ability to 5458 manipulate TCP send and receive buffer sizes is still valid for SMC- 5459 R. However, other options may have no meaning for SMC-R. For 5460 example, if an application enabled the TCP_NODELAY option to disable 5461 Nagle's algorithm it should have no real effect in SMC-R 5462 communications as there is no notion of Nagle's algorithm with this 5463 new protocol. But the implementation must accept the TCP_NODELAY 5464 option as it does today and save it so that it can be later extracted 5465 via getsockopt() processing. Note that any TCP or IP level options 5466 will still have an effect on any TCP/IP packets flowing for an SMC-R 5467 connection (i.e. as part of TCP/IP connection establishment and 5468 TCP/IP connection termination packet flows). 5470 Under the covers manipulation of the TCP options will also include 5471 the SMC layer setting and reading the SMC-R experimental option 5472 before and after completion of the 3 way TCP handshake. 5474 Appendix C. Rendezvous Error scenarios 5476 Error scenarios in setting up and managing SMC-R links are discussed 5477 in this section. 5479 C.1. SMC Decline during CLC negotiation 5481 A peer to the SMC-R CLC negotiation can send SMC Decline in lieu of 5482 any expected CLC message to decline SMC and force the TCP connection 5483 back to IP fabric. There can be several reasons for an SMC Decline 5484 during the CLC negotiation including: RNIC went down, SMC-R forbidden 5485 by local policy, subnet (IPv4) or prefix (IPv6) doesn't match, lack 5486 of resources to perform SMC-R. In all cases when an SMC Decline is 5487 sent in lieu of an expected CLC message, no confirmation is required 5488 and the TCP connection immediately falls back to using the IP fabric. 5490 To prevent ambiguity between CLC messages and application data, an 5491 SMC Decline cannot "chase" another CLC message. SMC Decline can only 5492 be sent in lieu of an expected CLC message. 
For example, if the 5493 client sends SMC Proposal and then its RNIC goes down, it must wait for 5494 the SMC Accept from the server and then it can reply to that with an 5495 SMC Decline. 5497 This "no chase" rule means that if this TCP connection is not a first 5498 contact between RoCE peers, a server cannot send SMC Decline after 5499 sending SMC Accept; it can only break the TCP connection. 5500 Similarly, once the client sends SMC Confirm on a TCP connection that 5501 isn't first contact, it is committed to SMC-R for this TCP connection 5502 and cannot fall back to IP. 5504 C.2. SMC Decline during LLC negotiation 5506 For a TCP connection that represents first contact between RoCE 5507 peers, it is possible for SMC to fall back to IP during the LLC 5508 negotiation. This is possible until the first contact SMC link is 5509 confirmed. For example, see Figure 40. After a first contact SMC 5510 link is confirmed, fallback to IP is no longer possible. The rule 5511 that this translates to is: a first contact peer can send SMC Decline 5512 at any time during LLC negotiation until it has successfully sent its 5513 CONFIRM LINK (request or response) flow. After that point, it cannot 5514 fall back to IP.
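The rules in C.1 and C.2 can be summarized as a small decision function. This is a sketch of one reading of those two sections only, not a normative state machine; the function name and the four state flags are hypothetical and would be tracked by the implementation.

```python
def may_send_smc_decline(first_contact, clc_complete, clc_message_owed,
                         confirm_link_sent):
    """Return True if this peer may still send SMC Decline.

    All parameters are hypothetical state flags kept by the caller:
      first_contact     - this TCP connection is first contact between peers
      clc_complete      - the Proposal/Accept/Confirm exchange has finished
      clc_message_owed  - the partner is currently expecting a CLC message
                          from this peer
      confirm_link_sent - this peer's CONFIRM LINK (request or response)
                          has been successfully sent
    """
    if not clc_complete:
        # "No chase" rule: SMC Decline may only take the place of a CLC
        # message that this peer is expected to send next.
        return clc_message_owed
    if not first_contact:
        # Subsequent contact: once CLC negotiation ends, the connection
        # is committed to SMC-R and can only be broken.
        return False
    # First contact: fallback stays possible during LLC negotiation
    # until this peer has sent its CONFIRM LINK flow.
    return not confirm_link_sent
```

For example, a client that has sent SMC Proposal and is waiting for SMC Accept owes no CLC message, so the function returns False until the Accept arrives. Once it does, the client may answer with SMC Decline in place of SMC Confirm.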
5516 Host X -- Server Host Y -- Client 5517 +-------------------+ +-------------------+ 5518 | PeerID = PS1 | | PeerID = PC1 | 5519 | +------+ +------+ | 5520 | QP 8 |RNIC 1| SMC-R link 1 |RNIC 2| QP 64 | 5521 | RKey X | |MAC MA|<-------------------->|MAC MB| | | 5522 | | |GID GA| attempted setup |GID GB| | RKey Y2| 5523 | \/ +------+ +------+ \/ | 5524 |+--------+ | | +--------+ | 5525 || RMB | | | | RMB | | 5526 |+--------+ | | +--------+ | 5527 | /\ +------+ +------+ /\ | 5528 | | |RNIC 3| |RNIC 4| | Rkey W2| 5529 | | |MAC MC| |MAC MD| | | 5530 | QP 9 |GID GC| |GID GD| QP65 | 5531 | +------+ +------+ | 5532 +-------------------+ +-------------------+ 5534 SYN / SYN-ACK / ACK TCP 3-way handshake with TCP option 5535 <---------------------------------------------------------> 5537 SMC Proposal / SMC Accept / SMC Confirm exchange 5538 <--------------------------------------------------------> 5540 CONFIRM LINK(request, link 1) 5541 .........................................................> 5543 CONFIRM LINK(response, link 1) 5544 X................................... 5545 : 5546 : RoCE write failure 5547 :.................................> 5549 SMC Decline(PC1, reason code) 5550 <-------------------------------------------------------- 5552 Connection data flows over IP fabric 5553 <-------------------------------------------------------> 5555 Legend: 5556 ------------ TCP/IP and CLC flows 5557 ............ RoCE (LLC) flows 5559 Figure 40 SMC Decline during LLC negotiation 5561 C.3. The SMC Decline window 5563 Because SMC-R does not support fall-back to IP for a TCP connection 5564 that is already using RDMA, there are specific rules on when SMC 5565 Decline, which signals a fall-back to IP because of an error or 5566 problem with the RoCE fabric, can be sent during TCP connection 5567 setup.
There is a point of no return after which a connection cannot 5568 fall back to IP, and RoCE errors that occur after this point require 5569 the connection to be broken with a RST flow in the IP fabric. 5571 For first contact, that point of no return is after the Add Link LLC 5572 message has been successfully sent for the second SMC-R link. 5573 Specifically, the server cannot fall back to IP after receiving 5574 either a positive write completion indication for the Add Link 5575 request, or after receiving the Add Link response from the client, 5576 whichever comes first. The client cannot fall back to IP after 5577 either sending a negative Add Link response, receiving a positive 5578 write complete on a positive Add Link response, or receiving a 5579 Confirm Link for the second SMC-R link from the server, whichever 5580 comes first. 5582 For subsequent contact, that point of no return is after the last 5583 send of the CLC negotiation completes. This, in combination with the 5584 rule that error "chasers" are not allowed during CLC negotiation, 5585 means that the server cannot send SMC Decline after sending an SMC 5586 Accept, and the client cannot send an SMC Decline after sending an 5587 SMC Confirm. 5589 C.4. Out of synch conditions during SMC-R negotiation 5591 The SMC Accept CLC message contains a "first contact" flag that 5592 indicates to the client whether or not the server believes it is 5593 setting up a new link group, or using an existing link group. This 5594 flag is used to detect an out of synch condition between the client 5595 and the server. The scenario detected is as follows: There is a 5596 single existing SMC-R link between the peers. After the client sends 5597 the SMC Proposal CLC message, the existing SMC-R link between the 5598 client and the server fails. 
The client cannot chase the SMC 5599 Proposal CLC message with an SMC Decline CLC message in this case 5600 because the client does not yet know that the server would have 5601 wanted to choose the SMC-R link that just crashed. The QP that 5602 failed recovers before the server returns its SMC Accept CLC message. 5603 This means that there is a QP but no SMC link. Since the server had 5604 not yet learned of the SMC link failure when it sent the SMC Accept 5605 CLC message, it attempts to re-use the SMC link that just failed. 5606 This means the server would not set the "first contact" flag, 5607 indicating to the client that the server thinks it is reusing an SMC- 5608 R link. However the client does not have an SMC-R link that matches 5609 the server's specification. Because the "first contact" flag is off, 5610 the client realizes it is out of synch with the server and sends SMC 5611 Decline to cause the connection to fall back to IP. 5613 C.5. Timeouts during CLC negotiation 5615 Because the SMC-R negotiation flows as TCP data, there are built-in 5616 timeouts and retransmits at the TCP layer for individual messages. 5617 Implementations must also protect the overall TCP/CLC handshake 5618 with a timer or timers to prevent connections from hanging 5619 indefinitely due to SMC-R processing. This can be done with 5620 individual timers for individual CLC messages or an overall timer for 5621 the entire exchange, which may include the TCP handshake and the CLC 5622 handshake under one timer or separate timers. This decision is 5623 implementation dependent. 5625 If the TCP and/or CLC handshakes time out, the TCP connection must be 5626 terminated as it would be in a legacy IP environment when connection 5627 setup doesn't complete in a timely manner. Because the CLC flows are 5628 TCP messages, if they cannot be sent and received in a timely 5629 fashion, the TCP connection is not healthy and would not work if 5630 fallback to IP were attempted. 5632 C.6.
Protocol errors during CLC negotiation 5634 Protocol errors occur during CLC negotiation when a message is 5635 received that is not expected. For example, a peer that is expecting 5636 a CLC message but instead receives application data has experienced a 5637 protocol error, which also indicates a likely software error as the two 5638 sides are out of synch. When application data is expected, this data 5639 is not parsed to ensure it's not a CLC message. 5641 When a peer is expecting a CLC negotiation message, any parsing error 5642 except a bad enumerated value in that message must be treated as 5643 application data. The CLC negotiation messages are designed with 5644 beginning and ending eyecatchers to help verify that they are 5645 actually the expected message. If other parsing errors in an 5646 expected CLC message occur, such as incorrect length fields or 5647 incorrectly formatted fields, the message must be treated as 5648 application data. 5650 All protocol errors with the exception of bad enumerated values must 5651 result in termination of the TCP connection. No fallback to IP is 5652 allowed in the case of a protocol error because if the protocols are 5653 out of synch, mismatched, or corrupted, then data and security 5654 integrity cannot be ensured. 5656 The exception to this rule is enumerated values, for example the QP 5657 MTU values on SMC Accept and SMC Confirm. If a reserved value is 5658 received, the proper error response is to send SMC Decline and fall 5659 back to IP. The reason for this is that use of a reserved enumerated 5660 value indicates that the other partner likely has additional support 5661 that the receiving partner does not have. This indicated mismatch of 5662 SMC-R capabilities is not an integrity problem, but indicates that 5663 SMC-R cannot be used for this connection. 5665 C.7.
Timeouts during LLC negotiation 5667 Whenever a peer sends an LLC message to which a reply is expected, it 5668 sets a timer after the send posts to wait for the reply. An expected 5669 response may be a reply flavor of the LLC message (for example 5670 CONFIRM LINK REPLY) or a new LLC message (for example an ADD LINK 5671 CONTINUATION expected from the server by the client if there are more 5672 Rkeys to communicate). 5674 On LLC flows that are part of a first contact setup of a link group, 5675 the value of the timer is implementation dependent but should be long 5676 enough to allow the other peer to have a write complete timeout and 2-3 5677 retransmits of an SMC Decline on the TCP fabric. For LLC flows 5678 that are maintaining the link group and not part of first contact 5679 setup of a link group, the timers may be shorter. Upon receipt of an 5680 expected reply the timer is cancelled. If a timer pops without a 5681 reply having been received, the sender must initiate a recovery 5682 action. 5684 During first contact processing, failure of an LLC verification timer 5685 is a should-not-occur which indicates a problem with one of the 5686 endpoints. The reason for this is that if there is a "routine" 5687 failure in the RoCE fabric that causes an LLC verification send to 5688 fail, the sender will get a write completion failure and will then 5689 send SMC Decline to the partner. The only time an LLC verification 5690 timer will expire on a first contact is when the sender thinks the 5691 send succeeded but it actually didn't. Because of the reliable 5692 connected nature of QP connections on the RoCE fabric, this 5693 indicates a problem with one of the peers, not with the RoCE fabric. 5695 After the reliable connected QP for the first SMC-R link in a link 5696 group is set up on initial contact, the client sets a timer to wait 5697 for a RoCE verification message from the server that the QP is 5698 actually connected and usable.
If the server experiences a failure 5699 sending its QP confirmation message, it will send SMC Decline, which 5700 should arrive at the client before the client's verification timer 5701 expires. If the client's timer expires without receiving either an 5702 SMC Decline or a RoCE message confirmation from the server, there is 5703 a problem either with the server or with the TCP fabric. In either 5704 case the client must break the TCP connection and clean up the SMC-R 5705 link. 5707 There are two scenarios in which the client's response to the QP 5708 verification message fails to reach the server. The main difference 5709 is whether or not the client has successfully completed the send of 5710 the CONFIRM LINK response. 5712 In the normal case of a problem with the RoCE path, the client will 5713 learn of the failure by getting a write completion failure, before 5714 the server's timer expires. In this case, the client sends an SMC 5715 Decline CLC message to the server and the TCP connection falls back 5716 to IP. 5718 If the client's send of the Confirmation message receives a positive 5719 return code but for some reason still does not reach the server, or 5720 the client's SMC Decline CLC message fails to reach the server after 5721 the client fails to send its RoCE confirmation message, then the 5722 server's timer will time out and the server must break the TCP 5723 connection by sending RST. This is expected to be a very rare case, 5724 because if the client cannot send its CONFIRM LINK RSP LLC message, 5725 the client should get a negative return code and initiate fallback to 5726 IP. A client receiving a positive return code on a send that fails 5727 to reach the server should be extremely rare. 5729 C.7.1. Recovery actions for LLC timeouts and failures 5731 The following table describes recovery actions for LLC timeouts. 
A 5732 write completion failure or other indication of failure to send on 5733 the send of the LLC command is treated the same as a timeout. 5735 LLC Message: CONFIRM LINK from server (first contact, first link in 5736 the link group) 5738 Timer waits for: CONFIRM LINK reply from client 5740 Recovery action: Break the TCP connection by sending RST and 5741 clean up the link. The server should have received an SMC 5742 Decline from the client by now if the client had an LLC send 5743 failure. 5745 LLC Message: CONFIRM LINK from server (first contact, second link in 5746 the link group) 5748 Timer waits for: CONFIRM LINK reply from client 5749 Recovery action: The second link was not successfully set up. 5750 Send DELETE LINK to the client. Connection data cannot flow in 5751 the first link in the link group, until the reply to this DELETE 5752 LINK is received, to prevent the peers from being out of synch on 5753 the state of the link group. 5755 LLC Message: CONFIRM LINK from server (not first contact) 5757 Timer waits for: CONFIRM LINK reply from client 5759 Recovery action: Clean up the new link and set a timer to retry. 5760 Send DELETE LINK to the client, in case the client has a longer 5761 timer interval, so the client can stop waiting 5763 LLC Message: CONFIRM LINK REPLY from client (first contact) 5765 Timer waits for: ADD LINK from server 5767 Recovery action: Clean up the SMC-R link and break the TCP 5768 connection by sending RST over the IP fabric. There is a problem 5769 with the server. If the server had a send failure, it should 5770 have sent SMC Decline by now. 5772 LLC Message: ADD LINK from server (first contact) 5774 Timer waits for: ADD LINK reply from client 5776 Recovery action: Break the TCP connection with RST and clean up 5777 RoCE resources. The connection is past the point where the 5778 server can fall back to IP, and if the client had a send problem 5779 it should have sent SMC Decline by now.
5781 LLC Message: ADD LINK from server (not first contact) 5783 Timer waits for: ADD LINK reply from client 5785 Recovery action: Clean up resources (QP, RMB keys, etc) for the 5786 new link and treat the link that the ADD LINK was sent over as if 5787 it had failed. If there is another link available to resend the 5788 ADD LINK and the link group still needs another link, retry the 5789 ADD LINK over another link in the link group. 5791 LLC Message: ADD LINK REPLY from client (and there are more Rkeys to 5792 be communicated) 5794 Timer waits for: ADD LINK CONTINUATION from server 5795 Recovery action: Treat the same as ADD LINK timer failure 5797 LLC Message: ADD LINK REPLY or ADD LINK CONTINUATION reply from 5798 client (and there are no more Rkeys to be communicated, for the 5799 second link in a first contact scenario) 5801 Timer waits for: CONFIRM LINK from the server on the new link 5803 Recovery action: The new link has failed to set up. Send DELETE 5804 LINK to the server. Do not consider the socket opened to the 5805 client application until receiving confirmation from the server 5806 in the form of a DELETE LINK request for this link and sending 5807 the reply (to prevent the partners from being out of synch on the 5808 state of the link group). 5810 Set a timer to send another ADD LINK to the server if there is 5811 still an unused RNIC on the client side. 5813 LLC Message: ADD LINK REPLY or ADD LINK CONTINUATION reply from the 5814 client (and there are no more Rkeys to be communicated) 5816 Timer waits for: CONFIRM LINK from the server, over the new link 5818 Recovery action: Send a DELETE LINK to the server for the new 5819 link, then clean up any resource allocated for the new link and 5820 set a timer to send ADD LINK to the server if there is still an 5821 unused RNIC on the client side. The new link has failed to set 5822 up, but the link that the ADD LINK exchange occurred over is 5823 unaffected. 
5825 LLC Message: ADD LINK CONTINUATION from server 5827 Timer waits for: ADD LINK CONTINUATION REPLY from client 5829 Recovery action: Treat the same as ADD LINK timer failure 5831 LLC Message: ADD LINK CONTINUATION reply from client (first contact, 5832 and RMB count fields indicate that the server owes more ADD LINK 5833 CONTINUATION messages) 5835 Timer waits for: ADD LINK CONTINUATION from the server 5837 Recovery action: Clean up the SMC link and break the TCP 5838 connection by sending RST. There is a problem with the server. 5840 If the server had a send failure, it should have sent SMC 5841 Decline by now. 5843 LLC Message: ADD LINK CONTINUATION reply from client (not first 5844 contact and RMB count fields indicate that the server owes more ADD 5845 LINK CONTINUATION messages) 5847 Timer waits for: ADD LINK CONTINUATION from server 5849 Recovery action: Treat as if the client detected link failure on 5850 the link the ADD LINK exchange is using. Send DELETE LINK to 5851 the server over another active link if one exists, otherwise 5852 clean up the link group. 5854 LLC Message: DELETE LINK from client 5856 Timer waits for: DELETE LINK request from server 5858 Recovery action: If the scope of the request is to delete a 5859 single link, the surviving link, over which the client sent the 5860 DELETE LINK, is no longer usable either. If this is the last link 5861 in the link group, end TCP connections over the link group by 5862 sending RST packets. If there are other surviving links in the 5863 link group, resend over a surviving link. Also send a DELETE 5864 LINK over a surviving link for the link that the client attempted 5865 to send the initial DELETE LINK message over. If the scope of 5866 the request is to delete the entire link group, try resending on 5867 other links in the link group until success is achieved. If all 5868 sends fail, tear down the link group and any TCP connections that 5869 exist on it.
   LLC Message: DELETE LINK from server (scope: entire link group)

      Timer waits for: Confirmation from the adapter that the message
      was delivered.

      Recovery action: Tear down the link group and any TCP connections
      that exist over it.

   LLC Message: DELETE LINK from server (scope: single link)

      Timer waits for: DELETE LINK reply from the client

      Recovery action: The link over which the server sent the DELETE
      LINK is no longer usable either. If this is the last link in the
      link group, end TCP connections over the link group by sending
      RST packets. If there are other surviving links in the link
      group, resend over a surviving link. Also send a DELETE LINK
      over a surviving link for the link that the server attempted to
      send the initial DELETE LINK message over. If the scope of the
      request is to delete the entire link group, try resending on
      other links in the link group until success is achieved. If all
      sends fail, tear down the link group and any TCP connections that
      exist on it.

   LLC Message: CONFIRM RKEY from the client

      Timer waits for: CONFIRM RKEY REPLY from the server

      Recovery action: Perform normal client procedures for detection
      of failed link. The link over which the message was sent has
      failed.

   LLC Message: CONFIRM RKEY from the server

      Timer waits for: CONFIRM RKEY REPLY from the client

      Recovery action: Perform normal server procedures for detection
      of failed link. The link over which the message was sent has
      failed.

   LLC Message: TEST LINK from the client

      Timer waits for: TEST LINK REPLY from the server

      Recovery action: Perform normal client procedures for detection
      of failed link. The link over which the message was sent has
      failed.
   LLC Message: TEST LINK from the server

      Timer waits for: TEST LINK REPLY from the client

      Recovery action: Perform normal server procedures for detection
      of failed link. The link over which the message was sent has
      failed.

   The following table describes recovery actions for invalid LLC
   messages. These could be misformatted or contain out-of-sync data.

   LLC Message received: CONFIRM LINK from server

      What could be bad: Incorrect link information

      Recovery action: Protocol error. The link must be brought down
      by sending a DELETE LINK for the link over another link in the
      link group if one exists. If this is first contact, fall back to
      IP by sending SMC Decline to the server.

   LLC Message received: ADD LINK

      What could be bad: Undefined enumerated MTU value

      Recovery action: Send negative ADD LINK reply with reason code
      x'2'.

   LLC Message received: ADD LINK reply from client

      What could be bad: Client side link information that would result
      in a parallel link being set up

      Recovery action: Parallel links are not permitted. Delete the
      link by sending DELETE LINK to the client over another link in
      the link group.

   LLC Message received: Any link group command from the server except
   DELETE LINK for the entire link group

      What could be bad: Client has sent DELETE LINK for the link that
      the message was received on

      Recovery action: Ignore the LLC message. In the worst case, the
      server will time out. In the best case, the DELETE LINK crosses
      with the command from the server and the server realizes that
      its command failed.

   LLC Message received: ADD LINK CONTINUATION from the server or ADD
   LINK CONTINUATION REPLY from the client

      What could be bad: Number of RMBs provided doesn't match the
      count given on the initial ADD LINK or ADD LINK reply message

      Recovery action: Protocol error.
      Treat as if a link outage had been detected.

   LLC Message received: DELETE LINK from client

      What could be bad: Link indicated doesn't exist

      Recovery action: If the link is in the process of being cleaned
      up, assume a timing window and ignore the message. Otherwise,
      send DELETE LINK REPLY with reason code 1.

   LLC Message received: DELETE LINK from server

      What could be bad: Link indicated doesn't exist

      Recovery action: Send DELETE LINK REPLY with reason code 1.

   LLC Message received: CONFIRM RKEY from either client or server

      What could be bad: No RKey provided for one or more of the links
      in the link group

      Recovery action: Treat as if a failure had been detected on each
      link for which no RKey was provided.

   LLC Message received: DELETE RKEY

      What could be bad: Specified RKey doesn't exist

      Recovery action: Send negative DELETE RKEY response.

   LLC Message received: TEST LINK reply

      What could be bad: User data doesn't match what was sent in the
      TEST LINK request

      Recovery action: Treat as if detected that the link has gone
      down. This is a protocol error.

   LLC Message received: Unknown LLC type with high order bits of
   opcode equal to b'10'

      What could be bad: This is an optional LLC message which the
      receiver does not support

      Recovery action: Ignore (silently discard) the message.

   LLC Message received: Any unambiguously incorrect or out-of-sync LLC
   message

      What it indicates: Link is out of sync

      Recovery action: Treat as if detected that the link has gone
      down. Note that an unsupported or unknown LLC opcode whose two
      high order bits are b'10' is not an error, and must be silently
      discarded. Any other unknown or unsupported LLC opcode is an
      error.

C.8. Failure to add second SMC-R link to a link group

   When there is any failure in setting up the second SMC-R link in an
   SMC-R link group, including confirmation timer expiration, the SMC-R
   link group is allowed to continue, without available failover.
   However, this situation is extremely undesirable and the server must
   endeavor to correct it as soon as it can.

   The server peer in the SMC-R link group must set a timer to drive it
   to retry setup of a failed additional SMC-R link. The server will
   immediately retry the SMC-R link setup when the first of the
   following events occurs:

   o  The retry timer expires

   o  A new RNIC becomes available to the server, on the same LAN as
      the SMC-R link group

   o  An ADD LINK LLC request message is received from the client,
      which indicates availability of a new RNIC on the client side.

Authors' Addresses

   Mike Fox
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709

   Email: mjfox@us.ibm.com

   Constantinos (Gus) Kassimis
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709

   Email: kassimis@us.ibm.com

   Jerry Stevens
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709

   Email: sjerry@us.ibm.com