TCPM working group                                                M. Fox
Internet Draft                                               C. Kassimis
Intended Status: Informational                                J. Stevens
Expires: 4/1/2015                                                    IBM
                                                         October 1, 2014

                 Shared Memory Communications over RDMA
                draft-fox-tcpm-shared-memory-rdma-05.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 1, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   This document describes the Shared Memory Communications over RDMA
   (SMC-R) protocol. This protocol provides RDMA communications to TCP
   endpoints in a manner that is transparent to socket applications. It
   further provides for dynamic discovery of partner RDMA capabilities
   and dynamic setup of RDMA connections, transparent high availability
   and load balancing when redundant RDMA network paths are available,
   and it maintains many of the traditional TCP/IP qualities of service
   such as filtering that enterprise users demand, as well as TCP socket
   semantics such as urgent data.

Table of Contents

   1. Introduction
      1.1. Summary of changes in this draft
      1.2. Protocol overview
         1.2.1. Hardware requirements
      1.3. Definition of common terms
   2. Link Architecture
      2.1. Remote Memory Buffers (RMBs)
      2.2. SMC-R Link groups
         2.2.1. Link group types
         2.2.2. Maximum number of links in link group
         2.2.3. Forming and managing link groups
         2.2.4. SMC-R link identifiers
      2.3. SMC-R resilience and load balancing
   3. SMC-R Rendezvous architecture
      3.1. TCP options
      3.2. Connection Layer Control (CLC) messages
      3.3. LLC messages
      3.4. CDC Messages
      3.5. Rendezvous flows
         3.5.1. First contact
            3.5.1.1. TCP Options pre-negotiation
            3.5.1.2. Client Proposal
            3.5.1.3. Server acceptance
            3.5.1.4. Client confirmation
            3.5.1.5. Link (QP) confirmation
            3.5.1.6. Second SMC-R link setup
               3.5.1.6.1. Client processing of "Add Link" LLC message
                          from server
               3.5.1.6.2. Server processing of "Add Link" reply LLC
                          message from the client
               3.5.1.6.3. Exchange of Rkeys on second SMC-R link
               3.5.1.6.4. Aborting SMC-R and falling back to IP
         3.5.2. Subsequent contact
            3.5.2.1. SMC-R proposal
            3.5.2.2. SMC-R acceptance
            3.5.2.3. SMC-R confirmation
            3.5.2.4. TCP data flow race with SMC Confirm CLC message
         3.5.3. First contact variation: creating a parallel link group
         3.5.4. Normal SMC-R link termination
         3.5.5. Link group management flows
            3.5.5.1. Adding and deleting links in an SMC-R link group
               3.5.5.1.1. Server initiated Add Link processing
               3.5.5.1.2. Client initiated Add Link processing
               3.5.5.1.3. Server initiated Delete Link Processing
               3.5.5.1.4. Client initiated Delete Link request
            3.5.5.2. Managing multiple Rkeys over multiple SMC-R links
                     in a link group
               3.5.5.2.1. Adding a new RMB to an SMC-R link group
               3.5.5.2.2. Deleting an RMB from an SMC-R link group
               3.5.5.2.3. Adding a new SMC-R link to a link group with
                          multiple RMBs
            3.5.5.3. Serialization of LLC exchanges, and collisions
               3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK
                          exchange
               3.5.5.3.2. Collisions during DELETE LINK exchange
               3.5.5.3.3. Collisions during CONFIRM_RKEY exchange
   4. SMC-R memory sharing architecture
      4.1. RMB element allocation considerations
      4.2. RMB and RMBE format
      4.3. RMBE control information
      4.4. Use of RMBEs
         4.4.1. Initializing and accessing RMBEs
         4.4.2. RMB element reuse and conflict resolution
      4.5. SMC-R protocol considerations
         4.5.1. SMC-R protocol optimized window size updates
         4.5.2. Small data sends
         4.5.3. TCP Keepalive processing
      4.6. TCP connection failover between SMC-R links
         4.6.1. Validating data integrity
         4.6.2. Resuming the TCP connection on a new SMCR link
      4.7. RMB data flows
         4.7.1. Scenario 1: Send flow, window size unconstrained
         4.7.2. Scenario 2: Send/Receive flow, window unconstrained
         4.7.3. Scenario 3: Send Flow, window constrained
         4.7.4. Scenario 4: Large send, flow control, full window size
                writes
         4.7.5. Scenario 5: Send flow, urgent data, window size
                unconstrained
         4.7.6. Scenario 6: Send flow, urgent data, window size closed
      4.8. Connection termination
         4.8.1. Normal SMC-R connection termination flows
            4.8.1.1. Abnormal SMC-R connection termination flows
            4.8.1.2. Other SMC-R connection termination conditions
   5. Security considerations
      5.1. VLAN considerations
      5.2. Firewall considerations
      5.3. Host-based IP Filters
      5.4. Intrusion Detection Services
      5.5. IP Security (IPSec)
      5.6. TLS/SSL
   6. IANA considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments
   9. Conventions used in this document
   Appendix A. Formats
      A.1. TCP option
      A.2. CLC messages
         A.2.1. Peer ID format
         A.2.2. SMC Proposal CLC message format
         A.2.3. SMC Accept CLC message format
         A.2.4. SMC Confirm CLC message format
         A.2.5. SMC Decline CLC message format
      A.3. LLC messages
         A.3.1. CONFIRM LINK LLC message format
         A.3.2. ADD LINK LLC message format
         A.3.3. ADD LINK CONTINUATION LLC message format
         A.3.4. DELETE LINK LLC message format
         A.3.5. CONFIRM RKEY LLC message format
         A.3.6. CONFIRM RKEY CONTINUATION LLC message format
         A.3.7. DELETE RKEY LLC message format
         A.3.8. TEST LINK LLC message format
   Appendix B. Socket API considerations
   Appendix C. Rendezvous Error scenarios
      C.1. SMC Decline during CLC negotiation
      C.2. SMC Decline during LLC negotiation
      C.3. The SMC Decline window
      C.4. Out of synch conditions during SMC-R negotiation
      C.5. Timeouts during CLC negotiation
      C.6. Protocol errors during CLC negotiation
      C.7. Timeouts during LLC negotiation
         C.7.1. Recovery actions for LLC timeouts and failures
      C.8. Failure to add second SMC-R link to a link group

1. Introduction

   This document is a specification of the Shared Memory Communications
   over RDMA (SMC-R) protocol. SMC-R is a protocol for Remote Direct
   Memory Access (RDMA) communication between TCP socket endpoints.
   SMC-R runs over networks that support RDMA over Converged Ethernet
   (RoCE). It is designed to permit existing TCP applications to
   benefit from RDMA without requiring modifications to the applications
   or predefinition of RDMA partners.

   SMC-R provides dynamic discovery of the RDMA capabilities of TCP
   peers and automatic setup of RDMA connections that those peers can
   use. SMC-R also provides transparent high availability and load
   balancing capabilities that are demanded by enterprise installations
   but are missing from current RDMA protocols. If redundant RoCE
   capable hardware such as RDMA NICs (RNICs) and RoCE capable switches
   is present, SMC-R can load balance over that redundant hardware and
   can also non-disruptively move TCP traffic from failed paths to
   surviving paths, all seamlessly to the application and the sockets
   layer. Because SMC-R preserves socket semantics and the TCP three-way
   handshake, many TCP qualities of service such as filtering, load
   balancing, and SSL encryption are preserved, as are TCP features such
   as urgent data.

   Because of the dynamic discovery and setup of SMC-R connectivity
   between peers, no RDMA connection manager (RDMA-CM) is required. This
   also means that support for UD queue pairs is not required.

   It is recommended that the SMC-R services be implemented in kernel
   space, which enables optimizations such as resource sharing between
   connections across multiple processes and also permits applications
   using SMC-R to spawn multiple processes (e.g., fork) without losing
   SMC-R functionality. A user space implementation is compatible with
   this architecture, but it may not support spawned processes (i.e.,
   fork), which limits sharing and resource optimization to TCP
   connections that originate from the same process. This might be an
   appropriate design choice if the use case is a system that hosts a
   large single process application that creates many TCP connections to
   a peer host, or in implementations where a kernel space
   implementation is not possible or introduces excessive overhead for
   kernel space to user space context switches.

1.1. Summary of changes in this draft

   Significant changes in this architecture since the previous draft:

   o  Updated SMC Proposal format to allow for future growth in the
      fixed part of the message

   o  Clarified the relationship between consumer and producer cursors
      and the RMBE eyecatcher

1.2. Protocol overview

   SMC-R defines the concept of the SMC-R Link, which is a logical
   point-to-point link using reliably connected queue pairs between
   TCP/IP stack peers over a RoCE fabric. An SMC-R link is bound to a
   specific hardware path, meaning a specific RNIC on each peer. SMC-R
   links are created and maintained by an SMC-R layer, which may reside
   in kernel or user space depending upon operating system and
   implementation requirements. The SMC-R layer resides below the
   sockets layer and directs data traffic for TCP connections between
   connected peers over the RoCE fabric using RDMA rather than over a
   TCP connection. The TCP/IP stack with its fragmentation,
   packetization, etc. requirements is bypassed and the application data
   is moved between peers using RDMA.

   Multiple SMC-R links between the same two TCP/IP stack peers are also
   supported. A set of SMC-R links called a link group can be logically
   bonded together to provide redundant connectivity. If there is
   redundant hardware, for example two RNICs on each peer, separate
   SMC-R links are created between the peers to exploit that redundant
   hardware.
   The link group architecture with redundant links provides load
   balancing and increased bandwidth, as well as seamless failover.

   Each SMC-R link group is associated with areas of memory called
   Remote Memory Buffers (RMBs), which are available for SMC-R peers to
   write into using RDMA writes. Multiple TCP connections between peers
   may be multiplexed over a single SMC-R link, in which case the SMC-R
   layer manages the partitioning of the RMBs between the TCP
   connections. This multiplexing reduces the RDMA resources such as
   queue pairs and RMBs that are required to support multiple
   connections between peers, and also reduces the processing and delays
   related to setting up queue pairs, pinning memory, and other RDMA
   setup tasks when new TCP connections are created. In a kernel space
   SMC-R implementation in which the RMBs reside in kernel storage, this
   sharing and optimization works across multiple processes executing on
   the same host. In a user space SMC-R implementation in which the RMBs
   reside in user space, this sharing and optimization is limited to
   multiple TCP connections created by a single process, as separate
   RMBs and QPs will be required for each process.

   SMC-R also introduces a rendezvous protocol that is used to
   dynamically discover the RDMA capabilities of TCP connection partners
   and exchange credentials necessary to exploit that capability if
   present. TCP connections are set up using the normal TCP 3-way
   handshake, with the addition of a new TCP option that indicates SMC-R
   capability.
   If both partners indicate SMC-R capability then at the completion of
   the 3-way TCP handshake the SMC-R layers in each peer take control of
   the TCP connection and use it to exchange additional connection level
   control (CLC) messages to negotiate SMC-R credentials such as queue
   pair (QP) information, addressability over the RoCE fabric, RMB
   buffer sizes, keys and addresses for accessing RMBs over RDMA, etc.
   If at any time during this negotiation a failure or decline occurs,
   the TCP connection falls back to using the IP fabric.

   If the SMC-R negotiation succeeds and either a new SMC-R link is set
   up or an existing SMC-R link is chosen for the TCP connection, then
   the SMC-R layers open the sockets to the applications and the
   applications use the sockets as normal. The SMC-R layer intercepts
   the socket reads and writes and moves the TCP connection data over
   the SMC-R link, "out of band" to the TCP connection which remains
   open and idle over the IP fabric, except for termination flows and
   possible keepalive flows. Regular TCP sequence numbering methods are
   used for the TCP flows that do occur; data flowing over RDMA does not
   use or affect TCP sequence numbers.

   This architecture does not support fallback of active SMC-R
   connections to IP. Once connection data has completed the switch to
   RDMA, a TCP connection cannot be switched back to IP and will reset
   if RDMA becomes unusable.

   The SMC-R protocol defines the format of the Remote Memory Buffers
   that are used to receive TCP connection data written over RDMA, as
   well as the semantics for managing and writing to these buffers using
   Connection Data Control (CDC) messages.

   Finally, SMC-R defines link level control (LLC) messages that are
   exchanged over the RoCE fabric between peer SMC-R layers to manage
   the SMC-R links and link groups.
   These include messages to test and confirm connectivity over an SMC-R
   link, add and delete SMC-R links to or from the link group, and
   exchange RMB addressability information.

1.2.1. Hardware requirements

   SMC-R does not require full Converged Enhanced Ethernet switch
   functionality. SMC-R functions over standard Ethernet fabrics
   provided that endpoint RNICs are present and IEEE 802.3x Global Pause
   Frame is supported and enabled in the switch fabric.

   While SMC-R as specified in this document is designed to operate over
   RoCE fabrics, adjustments to the rendezvous methods could enable it
   to run over other RDMA fabrics such as InfiniBand and iWARP.

1.3. Definition of common terms

   This section provides definitions of terms that have a specific
   meaning to the SMC-R protocol and are used throughout this document.

   SMC-R link

      An SMC-R Link is a logical point to point connection over the
      RoCE fabric via specific physical adapters (MAC/GID). The Link
      is formed during the first contact sequence of the TCP/IP 3 way
      handshake sequence that occurs over the IP fabric. During this
      handshake an RDMA RC-QP connection is formed between the two peer
      SMC hosts and is defined as the SMC Link. The SMC Link can then
      support multiple TCP connections between the two peers. An SMC
      link is associated with a single LAN (or VLAN) segment and is not
      routable.

   SMC-R link group

      An SMC-R Link Group is a group of SMC-R Links, typically each over
      unique RoCE adapters, between the same two SMC-R peers. Each link
      in the link group has equal characteristics such as the same VLAN
      ID (if VLANs are in use), access to the same RMB(s) and the same
      TCP server / client.

   SMC-R peer

      The SMC-R Peer is the peer software stack within the peer
      Operating System with respect to the Shared Memory Communications
      (messaging) protocol.

   SMC-R Rendezvous

      The SMC-R Rendezvous is the SMC-R peer discovery and handshake
      sequence that occurs transparently over the IP (Ethernet) fabric
      during and immediately after the TCP connection 3 way handshake
      by exchanging the SMC capabilities and credentials using an
      experimental TCP option and CLC messages.

   TCP Client

      The TCP socket-based peer that initiates a TCP connection.

   TCP Server

      The TCP socket-based peer that accepts a TCP connection.

   CLC messages

      The SMC-R protocol defines a set of Connection Layer Control
      Messages that flow over the TCP connection that are used to
      manage SMC link rendezvous at TCP connection setup time. This
      mechanism is analogous to SSL setup messages.

   LLC Commands

      The SMC-R protocol defines a set of RoCE Link Layer Control
      Commands that flow over the RoCE fabric using RDMA sendmsg, that
      are used to manage SMC Links, SMC Link Groups and SMC Link Group
      RMB expansion and contraction.

   CDC message

      The SMC-R protocol defines a Connection Data Control message that
      flows over the RoCE fabric using RDMA sendmsg that is used to
      manage the SMC-R connection data. This message provides
      information about data being transferred over the out of band
      RDMA connection, such as data cursors, sequence numbers, and data
      flags (for example urgent data). The receipt of this message
      also provides an interrupt to inform the receiver that it has
      received RDMA data.

   RMB

      A Remote (RDMA) Memory Buffer is a fixed or pinned buffer
      allocated in each of the peer hosts for a TCP (via SMC-R)
      connection. The RMB is registered to the RNIC and allows remote
      access by the remote peer using RDMA semantics. Each host is
      passed the peer's RMB specific access information (RKey and RMB
      Element offset) during the SMC-R rendezvous process. The host
      stores socket application user data directly into the peer's RMB
      using RDMA over RoCE.

   Rtoken

      The combination of an RMB's Rkey and RDMA virtual addressing, an
      Rtoken provides addressability to an RMB to an RDMA peer.

   RMBE

      The Remote Memory Buffer Element is an area of an RMB that is
      allocated to a specific TCP connection. The RMBE contains data
      for the TCP connection. The RMBE represents the TCP receive
      buffer whereby the remote peer writes into the RMBE and the local
      peer reads from the local RMBE. The alert token resolves to a
      specific RMBE.

   Alert Token

      The SMC-R alert token is a four-byte value that uniquely
      identifies the TCP connection over an SMC-R connection. The
      alert token allows the SMC peer to quickly identify the target
      TCP connection that now has new work. The format of the token is
      defined by the owning SMC-R end point and is considered opaque to
      the remote peer. However, the token should not simply be an index
      to an RMBE element; it should reference a TCP connection and be
      able to be validated to avoid reading data from stale
      connections.

   RNIC

      The RDMA capable Network Interface Card (RNIC) is an Ethernet NIC
      that supports RDMA semantics and verbs using RoCE.

   First Contact

      Describes an SMC-R negotiation to set up the first link in a link
      group.

   Subsequent Contact

      Describes an SMC-R negotiation between peers who are using an
      already existing SMC-R link group.

2. Link Architecture

   An SMC-R link is based on reliably connected queue pairs (QPs) that
   form a "logical point to point link" between the two SMC-R peers over
   a RoCE fabric. An SMC-R link extends from SMC-R peer to SMC-R peer,
   where typically each peer would be a TCP/IP stack residing on a
   separate host.

                         ,,.--..,_
   +----+            _-``        `-,             +-----+
   |QP 8|           -      RoCE     ',           |QP 64|
   |    |          /      VLAN M       .         |     |
   +----+--------+/                     \+-------+-----+
   |    RNIC 1   |      SMC-R Link       |    RNIC 2   |
   |             |<--------------------->|             |
   +------------+ ,                     /+------------+
   MAC A (GID A)                         MAC B (GID B)
                   .                  .`
                    `',            ,-`
                        ``''--''``

                    Figure 1 SMC-R Link Overview

   Figure 1 illustrates an overview of the basic concepts of SMC-R peer
   to peer connectivity, which is called the SMC-R Link. The SMC-R Link
   forms a logical point to point connection between two SMC-R peers via
   RoCE. The SMC Link is defined and identified by the following
   attributes:

   SMC-R Link = RC QPs (source VMAC GID QP + target VMAC GID QP + VLAN
   ID)

   The SMC-R Link can optionally be associated with a VLAN ID. If VLANs
   are in use for the associated IP (LAN) connection then the VLAN
   attribute is carried over on the SMC-R link. When VLANs are in use
   each SMC-R link group is associated with a single and specific VLAN.
   The RoCE fabric is the same physical Ethernet LAN used for standard
   TCP/IP over Ethernet communications, with switches as described in
   1.2.1.

   An SMC-R Link is designed to support multiple TCP connections between
   the same two peers. An SMC Link is intended to be long lived while
   the underlying TCP connections can dynamically come and go. The
   associated RMBs can also be dynamically added and removed from the
   link as needed. The first TCP connection between the peers
   establishes the SMC-R link. Subsequent TCP connections then use the
   previously established link. When the last TCP connection terminates
   the link can then be terminated, typically after an implementation
   defined idle time-out period has elapsed. The TCP server is
   responsible for initiating and terminating the SMC Link.

2.1. Remote Memory Buffers (RMBs)

   Figure 2 shows the hosts X and Y and their associated RMBs within
   each host.
   With the SMC-R link and the associated RMB keys (Rkeys) and RDMA
   virtual addresses each SMC-R enabled TCP/IP stack can remotely access
   its peer's RMBs using RDMA. The RKeys and virtual addresses are
   exchanged during the rendezvous processing when the link is
   established. The combination of the Rkey and the virtual address is
   the Rtoken. Note that the SMC-R Link ends at the QP providing access
   to the RMB (via the Link + RToken).

          Host X                                     Host Y
   +-------------------+        ,.--.,_       +-------------------+
   |                   |     .'`       '.     |                   |
   |   Protection      |   ,'            `,   |    Protection     |
   |    Domain X       |  /                \  |     Domain Y      |
   |            +------+ /                  \ +------+            |
   |  QP 8      |RNIC 1| |    SMC-R Link    | |RNIC 2|    QP 64   |
   |   |        |      |<-------------------->|      |      |     |
   |   |        |      ||                    ||      |      |     |
   |   |        +------+|       VLAN A       |+------+      |     |
   |   |               ||                    ||             |     |
   |   |               | |       RoCE       | |             |     |
   |   |RToken (X)     |  \                /  |   RToken (Y)|     |
   |   |               |   \              /   |             |     |
   |   V               |    `.          ,'    |             V     |
   | +--------+        |      '._    ,'       |        +--------+ |
   | |        |        |         `''-'``      |        |        | |
   | |  RMB   |        |                      |        |  RMB   | |
   | |        |        |                      |        |        | |
   | +--------+        |                      |        +--------+ |
   +-------------------+                      +-------------------+

                     Figure 2 SMC link and RMBs

   An SMC-R link can support multiple RMBs which are independently
   managed by each peer. The number of and the size of RMBs are managed
   by the peers based on host unique memory management requirements;
   however, the maximum number of RMBs that can be associated with a
   link group on one peer is 255. The QP has a single protection domain,
   but each RMB has a unique RToken. All RTokens must be exchanged with
   the peer.

   Each peer manages the RMBs in its local memory for its remote SMC-R
   peer by sharing access to the RMBs via Rtokens with its peers. The
   remote peer writes into the RMBs via RDMA and the local peer (RMB
   owner) then reads from the RMBs.
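
   The RToken bookkeeping described above can be sketched as follows.
   This is only an illustration under stated assumptions, not part of
   the SMC-R specification: the names RToken and LinkGroupRmbTable and
   the duplicate check are hypothetical. Only the composition of an
   RToken (Rkey plus virtual address), the uniqueness of each RMB's
   RToken, and the limit of 255 RMBs per peer per link group come from
   the text.

```python
from dataclasses import dataclass, field

MAX_RMBS_PER_LINK_GROUP = 255  # per-peer limit stated in the text


@dataclass(frozen=True)
class RToken:
    """Rkey plus RDMA virtual address; together they address one RMB."""
    rkey: int
    vaddr: int


@dataclass
class LinkGroupRmbTable:
    """Hypothetical table of the RMBs one peer owns in a link group."""
    rtokens: list = field(default_factory=list)

    def register(self, token: RToken) -> None:
        # Each RMB has a unique RToken, so a repeat is treated as an error.
        if token in self.rtokens:
            raise ValueError("each RMB must have a unique RToken")
        # Enforce the per-peer limit of 255 RMBs in one link group.
        if len(self.rtokens) >= MAX_RMBS_PER_LINK_GROUP:
            raise OverflowError("link group already holds 255 RMBs")
        self.rtokens.append(token)

    def tokens_to_exchange(self) -> list:
        # All RTokens must be exchanged with the peer.
        return list(self.rtokens)
```

   When the limit is reached, a caller would have to take one of the
   actions the architecture allows for that case rather than register
   another RMB.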

   When two peers decide to use SMC-R for a given TCP connection, they
   each allocate a local RMB Element for the TCP connection and
   communicate the location of this local RMB Element during rendezvous
   processing. To that end, RMB elements are created in pairs, with one
   RMB element allocated locally on each peer of the SMC-R link.

        ---       +-----------+----------------+
        /\        |Eyecatcher |                |
        |         +-----------+                |
        |         |                            |
   RMB Element 1  |                            |
        |         |       Receive Buffer       |
        |         |                            |
        |         |                            |
        \/        |                            |
        ---       +-----------+----------------+
        /\        |Eyecatcher |                |
        |         +-----------+                |
        |         |                            |
   RMB Element 2  |                            |
        |         |       Receive Buffer       |
        |         |                            |
        |         |                            |
        \/        |                            |
        ---       +----------------------------+
                  |             .              |
                  |             .              |
                  |             .              |
                  |             .              |
                  |    (up to 255 elements)    |
                  +----------------------------+

                        Figure 3 RMB Format

   Figure 3 illustrates the basic format of an RMB. The RMB is a virtual
   memory buffer whose backing real memory is pinned, which can support
   up to 255 TCP connections to exactly one remote SMC-R peer. Each RMB
   is therefore associated with the SMC-R links within a link group for
   the two peers and a specific RoCE Protection Domain. Other than the
   two peers identified by the SMC-R link, no other SMC-R peers can have
   RDMA access to an RMB; this requires a unique Protection Domain for
   every SMC-R Link. This is critical to ensure integrity of SMC-R
   communications.

   RMBs are subdivided into multiple elements for efficiency, with each
   RMB element (RMBE) associated with a single TCP connection.
   Therefore multiple TCP connections across an SMC link group can share
   the same memory for RDMA purposes, reducing the overhead of having to
   register additional memory with the RNIC for every new TCP
   connection.
   The number of elements in an RMB and the size of each RMB Element are
   entirely governed by the owning peer subject to the SMC-R
   architecture rules; however, all RMB elements within a given RMB must
   be the same size. Each peer can decide the level of resource sharing
   that is desirable across TCP connections based on local constraints
   such as available system memory, etc. An RMB Element is identified to
   the remote SMC-R peer via an RMB Element Token which consists of the
   following:

   o  RMB RToken: The combination of the Rkey and virtual address
      provided by the RNIC that identifies the start of the RMB for RDMA
      operations.

   o  RMB Index: Identifies the RMB element index in the RMB. Used to
      locate a specific RMB element within an RMB. Valid value range is
      1-255.

   o  RMB element length: The length of the RMB element's eyecatcher
      plus the length of the receive buffer. This length is equal for
      all RMB elements in a given RMB. This length can be variable
      across different RMBs.

   Multiple RMBs can be associated with an SMC-R link group and each
   peer in an SMC-R link group manages allocation of its RMBs. RMB
   allocation can be asymmetric. For example, server X can allocate 2
   RMBs to an SMC-R link group while server Y allocates 5. This provides
   maximum implementation flexibility to allow hosts to optimize RMB
   management for their own local requirements. The maximum number of
   RMBs that can be allocated on one peer to a link group is 255. If
   more RMBs are required, the peer may fall back to IP for subsequent
   connections or, if the peer is the server, create a parallel link
   group.

   One use case for multiple RMBs is multiple receive buffer sizes.
   Since every element in an RMB must be the same size, multiple RMBs
   with different element sizes can be allocated if varying receive
   buffer sizes are required.
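
   As a rough illustration of how the three parts of the RMB Element
   Token described above could be used to locate an element, consider
   the sketch below. It assumes that elements are packed back to back
   from the start of the RMB, as Figure 3 suggests, so that element N
   starts at (N - 1) times the element length. The class name, method
   names, and the eyecatcher_length parameter are hypothetical; the
   eyecatcher size is not given in this section and is therefore
   supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RmbElementToken:
    rkey: int            # Rkey half of the RMB RToken
    rmb_vaddr: int       # virtual address of the start of the RMB
    index: int           # RMB element index, valid range 1-255
    element_length: int  # eyecatcher length + receive buffer length

    def __post_init__(self):
        if not 1 <= self.index <= 255:
            raise ValueError("RMB index must be in the range 1-255")
        if self.element_length <= 0:
            raise ValueError("RMB element length must be positive")

    def element_vaddr(self) -> int:
        # Assumption: equal-sized elements laid out contiguously,
        # with index 1 at the start of the RMB.
        return self.rmb_vaddr + (self.index - 1) * self.element_length

    def receive_buffer_vaddr(self, eyecatcher_length: int) -> int:
        # Assumption: the eyecatcher sits at the front of the element
        # and the receive buffer follows it directly.
        if not 0 <= eyecatcher_length < self.element_length:
            raise ValueError("eyecatcher must fit inside the element")
        return self.element_vaddr() + eyecatcher_length
```

   Because the element length is fixed within an RMB, the index alone
   is enough to find an element once the RToken is known; RMBs with
   different element lengths simply carry different tokens.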
643 Also, since the maximum number of TCP connections whose receive 644 buffers can be allocated to an RMB is 255, multiple RMBs may be 645 required to provide capacity for large numbers of TCP connections 646 between two peers. 648 Separately from the RMB, the TCP/IP stack that owns each RMB 649 maintains control data for each RMB element within its local control 650 structures. The control data contains flags for maintaining the 651 state of the TCP data (for example, urgent indicator) and, most 652 importantly, two cursors which are illustrated in Figure 4: 654 o The peer producer cursor: This is a wrapping offset into the RMB 655 element's receive buffer that points to the next byte of data to 656 be written by the remote peer. This cursor is provided by the 657 remote peer in a Connection Data Control (CDC) message, which is 658 sent using RDMA sendmsg processing, and tells the local peer how 659 far it can consume data in the RMBE buffer. 661 o The peer consumer cursor: This is a wrapping offset into the 662 remote peer's RMB element's receive buffer that points to the next 663 byte of data to be consumed by the remote peer in its own RMBE. 664 The local peer cannot write into the remote peer's RMBE beyond this 665 point without causing data loss. This cursor is also provided by 666 the peer using a Connection Data Control message. 668 Each TCP connection peer maintains its cursors for a TCP connection's 669 RMBE in its local control structures. In other words, the peer who 670 writes into a remote peer's RMBE provides its producer cursor to the 671 peer whose RMBE it has written into. The peer who reads from its 672 RMBE provides its consumer cursor to the writing peer. In this 673 manner the reads and writes between peers are kept coordinated. 675 For example, referring to Figure 4, peer B writes the hashed data 676 into the receive buffer of peer A's RMBE.
After that write 677 completes, peer B uses a CDC message to update its producer cursor to 678 peer A, to indicate to peer A how much data is available for peer A 679 to consume. The CDC message that peer B sends to peer A wakes up 680 peer A and notifies it that there is data to be consumed. 682 Similarly, when peer A consumes data written by peer B, it uses a CDC 683 message to update its consumer cursor to peer B to let peer B know 684 how much data it has consumed, so peer B knows how much space is 685 available for further writes. If peer B were to write enough data to 686 peer A that it would wrap the RMBE receive buffer and exceed the 687 consumer cursor, data loss would result. 689 Note that this is a simplified description of the control flows; 690 they are optimized to minimize the number of CDC messages required, 691 as described in 4.7. RMB data flows. 693 Peer A's RMBE Control Info Peer B's RMBE Control Info 694 +--------------------------+ +--------------------------+ 695 | | | | 696 /----Peer producer cursor | +-----+-Peer consumer cursor | 697 /| | | | | 698 | +--------------------------+ | +--------------------------+ 699 | Peer A's RMBE | 700 | +--------------------------+ | 701 | | +------------------+ 702 | | | | 703 | | \/ | 704 | | +------------| 705 | |-------------+/////////// | 706 | |//RMA data written by /// | 707 | |/// peer B that is ////// | 708 | |/available to be consumed/| 709 | |///////////////////////// | 710 | |///////// +---------------| 711 | |----------+/\ | 712 | | | | 713 \| | | 714 \ / | 715 |\---------/ | 716 | | 717 | | 718 Figure 4 RMBE cursors 720 Additional flags and indicators are communicated between peers. In 721 all cases, these flags and indicators are updated by the peer using 722 CDC messages with the control information contained in inline data. 723 More details on these additional flags and indicators are described 724 in 4.3. RMBE control information. 726 2.2.
SMC-R Link groups 728 SMC-R links are logically grouped together to form an SMC-R Link 729 Group. The purpose of the Link Group is for supporting multiple links 730 between the same two peers to provide for: 732 o Resilience: Provides transparent and dynamic switching of the link 733 used by existing TCP connections during link failures, typically 734 hardware related. TCP traffic using the failing link can be 735 switched to an active link within the link group avoiding 736 disruptions to application workloads. 738 o Link utilization: Provides an active/active link usage model 739 allowing TCP traffic to be balanced across the links, which 740 increases bandwidth and avoids hardware imbalances and 741 bottlenecks. Note that both adapter and switch utilization can 742 become potential resource constraint issues 744 SMC-R Link Group support is required. Resilience is not optional. 745 However, the user can elect to provision a single RNIC (on one or 746 both hosts). 748 Multiple links that are formed between the same two peers fall into 749 two distinct categories: 751 1. Equal Links: Links providing equal access to the same RMB(s) at 752 both endpoints whereby all TCP connections associated with the 753 links must have the same VLAN ID and have the same TCP server 754 and TCP client roles or relationship. 756 2. Unequal Links: Links providing access to unique, unrelated and 757 isolated RMB(s) (i.e. for unique VLANs or unique and isolated 758 application workloads, etc.) or have unique TCP server or client 759 roles. 761 Links that are logically grouped together forming an SMC Link Group 762 must be equal links. 764 2.2.1. Link group types 766 Equal links within a link group also have another "Link Group Type" 767 attribute based on the link's associated underlying physical path. 768 The following SMC-R link types are defined: 770 1. Single Link: the only active link within a link group 772 2. 
Parallel Link: not allowed - SMC Links having the same physical 773 RNIC at both hosts 775 3. Asymmetric Link: links that have unique RNIC adapters at one 776 host but share a single adapter at the peer host 778 4. Symmetric Link: links that have unique RNIC adapters at both 779 hosts 781 These link group types are further explained in the following figures 782 and descriptions. 784 Figure 2 above shows the single link case. The single link 785 illustrated in Figure 2 also establishes the SMC-R Link Group. Link 786 groups are supposed to have multiple links, but when only one RNIC is 787 available at both hosts then only a single link can be created. This 788 is expected to be a transient case. 790 Figure 5 shows the symmetric link case. Both hosts have unique and 791 redundant RNIC adapters. This configuration meets the objectives for 792 providing full RoCE redundancy required to provide the level of 793 resilience required for high availability for SMC-R. While this 794 configuration is not required, it is a strongly recommended "best 795 practice" for the exploitation of SMC-R. Single and asymmetric links 796 must be supported but are intended to provide for short term 797 transient conditions, for example during a temporary outage or 798 recycle of a RNIC. 
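The link group types listed above can be illustrated with a small classification routine. This is only a sketch under stated assumptions: the `Link` tuple, the RNIC name strings, and the function name are hypothetical, and each link is reduced to the pair of RNICs it uses at the two hosts. A link group containing only one link is the single link case and needs no comparison.

```python
from enum import Enum
from typing import NamedTuple


class LinkType(Enum):
    PARALLEL = "parallel"      # same RNIC at both hosts (not allowed)
    ASYMMETRIC = "asymmetric"  # unique RNICs at one host, shared RNIC at the other
    SYMMETRIC = "symmetric"    # unique RNICs at both hosts


class Link(NamedTuple):
    """Hypothetical description of a link by the RNIC used at each host."""
    local_rnic: str
    peer_rnic: str


def classify_pair(a: Link, b: Link) -> LinkType:
    """Classify two links of one link group by the hardware they share."""
    same_local = a.local_rnic == b.local_rnic
    same_peer = a.peer_rnic == b.peer_rnic
    if same_local and same_peer:
        return LinkType.PARALLEL
    if same_local or same_peer:
        return LinkType.ASYMMETRIC
    return LinkType.SYMMETRIC


# Two links over four distinct adapters form a symmetric pair.
print(classify_pair(Link("RNIC 1", "RNIC 2"), Link("RNIC 3", "RNIC 4")))
```

An implementation following the rules in this section would refuse to create a link that classifies as parallel against any existing link in the group.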
800 Host X Host Y 801 +-------------------+ +-------------------+ 802 | | | | 803 | Protection | | Protection | 804 | Domain X | | Domain Y | 805 | +------+ +------+ | 806 | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 64 | 807 |RToken X| | |<-------------------->| | | | 808 | | | | | | |RToken Y| 809 | \/ +------+ +------+ \/ | 810 |+--------+ | | +--------+ | 811 || | | | | | | 812 || RMB | | | | RMB | | 813 || | | | | | | 814 |+--------+ | | +--------+ | 815 | /\ +------+ +------+ /\ | 816 |RToken Z| | | SMC-R Link 2 | | |RToken W| 817 | | |RNIC 3|<-------------------->|RNIC 4| | | 818 | QP 9 | | | | QP 65 | 819 | +------+ +------+ | 820 +-------------------+ +-------------------+ 821 Figure 5 Symmetric SMC-R links 823 Host X Host Y 824 +-------------------+ +-------------------+ 825 | | | | 826 | Protection | | Protection | 827 | Domain X | | Domain Y | 828 | +------+ +------+ | 829 | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 64 | 830 |RToken X| | |<-------------------->| | | | 831 | | | | .->| | |RToken Y| 832 | \/ +------+ .` +------+ \/ | 833 |+--------+ | .` | +--------+ | 834 || | | .` | | | | 835 || RMB | | .` | | RMB | | 836 || | | .`SMC-R | | | | 837 |+--------+ | .` Link 2 | +--------+ | 838 | /\ +------+ .` +------+ | 839 |Rtoken Z| | | .` | |down or | 840 | | |RNIC 3|<-` |RNIC 4|unavailable | 841 | QP 9 | | | | | 842 | +------+ +------+ | 843 +-------------------+ +-------------------+ 844 Figure 6 Asymmetric SMC-R links 846 In the example provided by Figure 6, host X has two RNICs but Host Y 847 only has one RNIC. This configuration allows for the creation of an 848 asymmetric link. While an asymmetric link will provide some 849 resilience (i.e. when RNIC 1 fails) ideally each host should provide 850 two redundant RNICs. This should be a transient case, and when RNIC 851 4 becomes available, this configuration must transition to a 852 symmetric link configuration. 
This transition is accomplished by 853 first creating the new symmetric link, then deleting the asymmetric 854 link with reason code "Asymmetric link no longer needed" specified in 855 the DELETE LINK LLC message. 857 Host X Host Y 858 +-------------------+ +-------------------+ 859 | | | | 860 | Protection | | Protection | 861 | Domain X | | Domain Y | 862 | +------+ SMC-R link 1 +------+ | 863 | QP 8 |RNIC 1|<-------------------->|RNIC 2| QP 64 | 864 |RToken X| | | | | | | 865 | | | |<-------------------->| | |Rtoken Y| 866 | \/ +------+ SMC-R link 2 +------+ \/ | 867 |+--------+ QP 9 | | QP 65 +--------+ | 868 || | | | | | | | | 869 || RMB |<-- + | | +---->| RMB | | 870 || | | | | | | 871 |+--------+ | | +--------+ | 872 | +------+ +------+ | 873 | down or| | | |down or | 874 | unavailale|RNIC 3| |RNIC 4|unavailable | 875 | | | | | | 876 | +------+ +------+ | 877 +-------------------+ +-------------------+ 878 Figure 7 SMC-R parallel links (not supported) 880 Figure 7 shows parallel links, which are two links in the link group 881 that use the same hardware. This configuration is not permitted. 882 Because SMC-R multiplexes multiple TCP connections over an SMC-R link 883 and both links are using the exact same hardware, there is no 884 additional redundancy or capacity benefit obtained from this 885 configuration. However this configuration does add unnecessary 886 overhead of additional queue pairs, generation of additional Rkeys, 887 etc. 889 2.2.2. Maximum number of links in link group 891 The SMC-R protocol defines a maximum of 8 symmetric SMC-R links 892 within a single SMC-R link group. This allows for support for up to 893 8 unique physical paths between peer hosts. However, in terms of 894 meeting the basic requirements for redundancy support for at least 2 895 symmetric links must be implemented. 
Supporting more than 2 896 links also simplifies implementation for practical matters relating 897 to dynamically adding and removing links, for example starting a 898 third SMC-R link prior to taking down one of the two existing links. 899 Recall that all links within a link group must have equal access to 900 all associated RMBs. 902 The SMC-R protocol allows an implementation to choose an 903 implementation-specific and appropriate value for maximum symmetric 904 links. The implementation value must not exceed the architecture 905 limit of 8 and must not be lower than 2, because 906 the SMC-R protocol requires redundancy. This does not mean that two 907 RNICs are physically required to enable SMC-R connectivity, but at 908 least two RNICs for redundancy are strongly recommended. 910 The SMC-R peers exchange their implementation maximum link values 911 during the link group establishment using the defined maximum link 912 value in the CONFIRM LINK LLC command. Once the initial exchange 913 completes, the value is set for the life of the link group. The 914 maximum link value can be provided by both the server and client. The 915 server must supply a value, whereas the client maximum link value is 916 optional. When the client does not supply a value, it indicates that 917 the client accepts the server supplied maximum value. If the client 918 provides a value, it cannot exceed the server maximum value. If the 919 client passes a lower value, this lower value becomes the 920 final negotiated maximum number of symmetric links for this link 921 group. Again, the minimum value is 2. 923 During run time the client must never request that the server add a 924 symmetric link to a link group that would exceed the negotiated 925 maximum link value. Likewise the server must never attempt to add a 926 symmetric link to a link group that would exceed the negotiated 927 maximum value.
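The negotiation of the maximum symmetric link value described above can be sketched as follows. The function and constant names are hypothetical, and raising an exception for an out-of-range value is an assumption made for illustration; this section does not say how a peer reacts to an invalid value.

```python
from typing import Optional

ARCH_MAX_LINKS = 8  # architecture limit on symmetric links per link group
MIN_LINKS = 2       # minimum value, because redundancy support is required


def negotiate_max_links(server_max: int, client_max: Optional[int] = None) -> int:
    """Return the negotiated maximum number of symmetric links."""
    if not MIN_LINKS <= server_max <= ARCH_MAX_LINKS:
        raise ValueError("server maximum must be between 2 and 8")
    if client_max is None:
        # The client supplied no value, so it accepts the server's value.
        return server_max
    if not MIN_LINKS <= client_max <= server_max:
        raise ValueError("client maximum must be between 2 and the server maximum")
    # A client value at or below the server's becomes the final value.
    return client_max


def may_add_symmetric_link(current_symmetric_links: int, negotiated_max: int) -> bool:
    """True if one more symmetric link stays within the negotiated maximum."""
    return current_symmetric_links < negotiated_max


limit = negotiate_max_links(server_max=8, client_max=4)
print(limit, may_add_symmetric_link(3, limit), may_add_symmetric_link(4, limit))
```

Since the value is fixed for the life of the link group, both peers would store the result once and consult it before every later attempt to add a symmetric link.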
929 In terms of counting the active links within a link group, the 930 initial link (or the only or last link) is always counted as 1. Then 931 as additional links are added they are either symmetric or asymmetric 932 links. 934 With regard to enforcing the maximum link rules, asymmetric links 935 are an exception having a unique set of rules: 937 o At most one asymmetric link is allowed 938 per link group. 940 o Asymmetric links must not be counted in the maximum symmetric link 941 count calculation. When tracking the current count or enforcing 942 the negotiated maximum number of links, an asymmetric link is not 943 to be counted. 945 2.2.3. Forming and managing link groups 947 SMC-R link groups are self-defining. The first SMC-R link in a link 948 group is created using TCP option flows on the TCP three-way 949 handshake followed by CLC message flows over the TCP connection. 950 Subsequent SMC-R links in the link group are created by sending LLC 951 messages over an SMC-R link that already exists in the link group. 952 Once an SMC-R link group is created, no additional SMC-R links in 953 that group are created using TCP and CLC negotiation. Because 954 subsequent SMC-R links are created exclusively by sending LLC 955 messages over an existing SMC-R link in a link group, the membership 956 of SMC-R links to a link group is self-defining. 958 This architecture does not define a specific identifier for an SMC-R 959 link group. This identification may be useful for network management 960 and may be assigned in a platform specific manner, or in an extension 961 to this architecture. 963 In each SMC-R link group, one peer is the server for all TCP 964 connections and the other peer is the client. If there are 965 additional TCP connections between the peers that use SMC-R and have 966 the client and server roles reversed, another SMC-R link group is set 967 up between them with the opposite client-server relationship.
969 This is required because there are specific responsibilities divided 970 between the client and server in the management of an SMC-R link 971 group. 973 In this architecture, the decision of whether or not to use an 974 existing SMC-R link group or create a new SMC-R link group for a TCP 975 connection is made exclusively by the server. 977 Management of the links in an SMC-R link group is also a server 978 responsibility. The server is responsible for adding and deleting 979 links in a link group. The client may request that the server take 980 certain actions but the final responsibility is the server's. 982 2.2.4. SMC-R link identifiers 984 This architecture defines multiple identifiers to identify SMC-R 985 links and peers. 987 o Link number: This is a one-byte value that identifies an SMC-R 988 link within a link group. Both the server and the client use this 989 number to distinguish an SMC-R link from other links within the 990 same link group. It is only unique within a link group. In order 991 to prevent timing windows that may occur when a server creates a 992 new link while the client is still cleaning up a previously 993 existing link, link numbers cannot be reused until the entire link 994 numbering space has been exhausted. 996 o Link User ID: This is an architecturally opaque four byte value 997 that a peer uses to uniquely define an SMC-R link within its own 998 space. This means that a link user ID is unique within one peer 999 only. Each peer defines its own link user ID for a link. The 1000 peers exchange this information once during link setup and it is 1001 never used architecturally again. The purpose of this identifier 1002 is for network management, display, and debugging purposes. For 1003 example an operator on a client could provide the operator on the 1004 server with the server's link user ID if he requires the server's 1005 operator to check on the operation of a link that the client is 1006 having trouble with. 
1008 o Peer ID: The SMC-R peer ID uniquely identifies a specific instance 1009 of a specific TCP/IP stack. It is required because in clustered 1010 and load balancing environments, an IP address does not uniquely 1011 identify a TCP/IP stack. An RNIC's MAC/GID also doesn't uniquely 1012 or reliably identify a TCP/IP stack because RNICs can go up and 1013 down and even be redeployed to other TCP/IP stacks in a multiple 1014 partitioned or virtualized environment. The peer ID is not only 1015 unique per TCP/IP stack but is also unique per instance of a 1016 TCP/IP stack, meaning that if a TCP/IP stack is restarted, its 1017 peer ID changes. 1019 2.3. SMC-R resilience and load balancing 1021 The SMC-R multi-link architecture provides resilience for network 1022 high availability via failover capability to an alternate RoCE 1023 adapter. 1025 The SMC-R multi-link architecture does not define primary, secondary 1026 or alternate roles to the links. Instead there are multiple active 1027 links representing multiple redundant RoCE paths over the same LAN. 1029 Assignment of TCP connections to links is unidirectional and 1030 asymmetric. This means that the client and server may each choose a 1031 separate link for their RDMA writes associated with a specific TCP 1032 connection. 1034 If a hardware failure or a QP failure associated with an 1035 individual link occurs, then the TCP connections that were associated with 1036 the failing link are dynamically and transparently switched to use 1037 another available link. The server or the client can detect a 1038 failure and immediately move their TCP connections and then notify 1039 their peer via the DELETE LINK LLC command. While the client can 1040 notify the server of an apparent link failure with the DELETE LINK 1041 LLC command, the server performs the actual link deletion. 1043 The movement of TCP connections to another link can be accomplished 1044 with minimal coordination between the peers.
The TCP connection 1045 movement is also transparent and nondisruptive to the TCP socket 1046 application workloads for most failure scenarios. After a failure, 1047 the surviving links and all associated hardware must handle the link 1048 group's workload. 1050 As each SMC-R peer begins to move active TCP connections to another 1051 link, all current RDMA write operations must be allowed to complete. 1052 Then the moving peer sends a signal to verify receipt of the last 1053 successful write by its peer. If this verification fails, the TCP 1054 connection must be reset. Once this verification is complete, all 1055 writes that failed may then be retried, in order, over the new link. 1056 Any data writes or CDC messages for which the sender did not receive 1057 write completion must be replayed before any subsequent data or CDC 1058 write operations are sent. LLC messages are not retried over the new 1059 link because they are dependent on a known link configuration, which 1060 has just changed because of the failure. The initiator of an LLC 1061 message exchange that fails will be responsible for retrying once the 1062 link group configuration stabilizes. 1064 When a new link becomes available and is re-added to the link group, 1065 each peer is free to rebalance its current TCP connections as 1066 needed or only assign new TCP connections to the newly added link. 1067 Both the server and client are free to manage TCP connections across 1068 the link group as needed. TCP connection movement does not have to be 1069 stimulated by a link failure. 1071 The SMC-R architecture also defines orderly vs. disorderly failover. 1072 The type is communicated in the LLC Delete Link command and is simply 1073 a means to indicate that the link has terminated (disorderly) or link 1074 termination is imminent (orderly). The orderly link deletion could 1075 be initiated via operator command or programmatically to bring down 1076 an idle link.
For example, an operator command could initiate orderly 1077 shut down of an adapter for service. Implementation of the two types 1078 is based on implementation requirements and is beyond the scope of 1079 the SMC-R architecture. 1081 3. SMC-R Rendezvous architecture 1083 Rendezvous is the process that SMC-R capable peers use to dynamically 1084 discover each other's capabilities, negotiate SMC-R connections, set 1085 up SMC-R links and link groups, and manage those link groups. A key 1086 aspect of SMC-R rendezvous is that it occurs dynamically and 1087 automatically, without requiring SMC link configuration to be defined 1088 by an administrator. 1090 SMC-R Rendezvous starts with the TCP/IP three-way handshake during 1091 which connection peers use TCP options to announce their SMC-R 1092 capabilities. If both endpoints are SMC-R capable, then Connection 1093 Layer Control (CLC) messages are exchanged between the peers' SMC-R 1094 layers over the newly established TCP connection to negotiate SMC-R 1095 credentials. The CLC message mechanism is analogous to the messages 1096 exchanged by SSL for its handshake processing. 1098 If a new SMC-R link is being set up, Link Layer Control (LLC) 1099 messages are used to confirm RDMA connectivity. LLC messages are 1100 also used by the SMC-R layers at each peer to manage the links and 1101 link groups. 1103 Once an SMC-R link is set up or agreed to by the peers, the TCP 1104 sockets are passed to the peer applications which use them as normal. 1105 The SMC-R layer, which resides under the sockets layer, transmits the 1106 socket data between peers over RDMA using the SMC-R protocol, 1107 bypassing the TCP/IP stack. 1109 3.1. TCP options 1111 During the TCP/IP three-way handshake, the client and server indicate 1112 their support for SMC-R by including experimental TCP option 254 on 1113 the three-way handshake flows, in accordance with RFC 6994 "Shared 1114 Use of Experimental TCP Options".
The ExID value used is the string 1115 'SMCR' in EBCDIC (IBM-1047) encoding (0xE2D4C3D9). This ExID has 1116 been registered in the TCP ExIDs registry maintained by IANA. 1118 After completion of the 3-way TCP handshake each peer queries its 1119 peer's options. If both peers set the TCP option on the three-way 1120 handshake, inline SMC-R negotiation occurs using CLC messages. If 1121 neither peer, or only one peer, set the TCP option, SMC-R cannot be 1122 used for the TCP connection, and the TCP connection completes setup 1123 using the IP fabric. 1125 3.2. Connection Layer Control (CLC) messages 1127 CLC messages are sent as data payload over the IP network using the 1128 TCP connection between SMC-R layers at the peers. They are analogous 1129 to the messages used to exchange parameters for SSL. 1131 Use of CLC messages is detailed in the following sections. The 1132 following list provides a summary of the defined CLC messages and 1133 their purposes: 1135 o SMC PROPOSAL: Sent from the client to propose that this TCP 1136 connection is eligible to be moved to SMC-R. The client identifies 1137 itself and its subnet to the server and passes the SMC-R elements 1138 for a suggested RoCE path via the MAC and GID. 1140 o SMC ACCEPT: Sent from the server to accept the client's TCP 1141 connection SMC proposal. The server responds to the client's 1142 proposal by identifying itself to the client and passing the 1143 elements of a RoCE path that the client can use to perform RDMA 1144 writes to the server. This consists of SMC-R link elements such as 1145 RoCE MAC, GID, RMB information, etc. 1147 o SMC CONFIRM: Sent from the client to confirm the server's 1148 acceptance of the SMC connection. The client responds to the server's 1149 acceptance by passing the elements of a RoCE path that the server 1150 can use to perform RDMA writes to the client. This consists of 1151 SMC-R link elements such as RoCE MAC, GID, RMB information, etc.
1153 o SMC DECLINE: Sent from either the server or the client to reject 1154 the SMC connection, indicating the reason the peer must decline 1155 the SMC proposal and allowing the TCP connection to revert back to 1156 IP connectivity. 1158 3.3. LLC messages 1160 Link Layer Control (LLC) messages are sent between peer SMC-R layers 1161 over an SMC-R link to manage the link or the link group. LLC 1162 messages are sent using RoCE sendmsg with inline data and are 44 1163 bytes long. The 44 bytes size is based on what can fit into a RoCE 1164 Work Queue Element (WQE) without requiring the posting of receive 1165 buffers. 1167 LLC messages generally follow a request-reply semantic. Each message 1168 has a request flavor and a reply flavor, and each request must be 1169 confirmed with a reply, except where otherwise noted. Use of LLC 1170 messages is detailed in the following sections. The following list 1171 provides a summary of the defined LLC messages and their purposes: 1173 o ADD LINK: Add a new link to a link group. Sent from the server to 1174 the client to initiate addition of a new link to the link group, 1175 or from the client to the server to request that the server 1176 initiate addition of a new link. 1178 o ADD LINK CONTINUATION: This is a continuation of ADD link that 1179 allows the ADD link to span multiple commands, because all the 1180 link information cannot be contained in a single ADD LINK message 1182 o CONFIRM LINK: Used to confirm that RoCE connectivity over a newly 1183 created SMC-R link is working correctly. Initiated by the server, 1184 and both this message and its reply must flow over the SMC-R link 1185 being confirmed. 1187 o DELETE LINK: When initiated by the server, deletes a specific link 1188 from the link group or deletes the entire link group. When 1189 initiated by the client, requests that the server delete a 1190 specific link or the entire link group. 
1192 o CONFIRM RKEY: Informs the peer on the SMC-R link of the addition 1193 of an RMB to the link group. 1195 o CONFIRM RKEY CONTINUATION: This is a continuation of CONFIRM RKEY 1196 that allows the CONFIRM RKEY to span multiple commands, in the event 1197 that all of the information cannot be contained in a single 1198 CONFIRM RKEY message. 1200 o DELETE RKEY: Informs the peer on the SMC-R link of the deletion of 1201 one or more RMBs from the link group. 1203 o TEST LINK: Verifies that an already-active SMC-R link is active 1204 and healthy. 1206 o Optional LLC message: Any LLC message in which the two high order 1207 bits of the opcode are b'10' is an optional message and must be 1208 silently discarded by a receiving peer that does not support the 1209 opcode. No such messages are defined in this version of the 1210 architecture; however, the concept is defined to allow for 1211 toleration of possible advanced, optional functions. 1213 CONFIRM LINK and TEST LINK are sensitive to which link they flow on 1214 and must flow on the link being confirmed or tested. The other flows 1215 may flow over any active link in the link group. When there are 1216 multiple links in a link group, a response to an LLC message must 1217 flow over the same link that the original message flowed over, with 1218 the following exceptions: 1220 o ADD LINK request from a server in response to an ADD LINK from a 1221 client 1223 o DELETE LINK request from a server in response to a DELETE LINK 1224 from a client 1226 3.4. CDC Messages 1228 Connection Data Control (CDC) messages are sent over the RoCE fabric 1229 between peers using RoCE sendmsg with inline data, and are 44 bytes 1230 long, which is based on the size that can fit into a RoCE Work Queue 1231 Element (WQE) without requiring the posting of receive buffers.
CDC 1232 messages are used to describe the socket application data passed via 1233 RDMA write operations, and TCP connection state information including 1234 producer and consumer cursors, RMBE state information, and failover 1235 data validation. 1237 3.5. Rendezvous flows 1239 Rendezvous information for SMC-R is exchanged as TCP options on 1240 the TCP 3-way handshake flows to indicate capability, followed by 1241 inline TCP negotiation messages to perform the SMC-R setup. Formats 1242 of all rendezvous options and messages discussed in this section are 1243 detailed in Appendix A. 1245 3.5.1. First contact 1247 First contact between RoCE peers occurs when a new SMC-R link group 1248 is being set up. This could be because no SMC-R links already exist 1249 between the peers, or the server decides to create a new SMC-R link 1250 group in parallel with an existing one. 1252 3.5.1.1. TCP Options pre-negotiation 1254 The client and server indicate their SMC-R capability to each other 1255 using TCP option 254 on the TCP 3-way handshake flows. 1257 A client who wishes to do SMC-R will include TCP option 254 using an 1258 ExID equal to the EBCDIC (codepage IBM-1047) encoding of "SMCR" on 1259 its SYN flow. 1261 A server that supports SMC-R will include TCP option 254 with the 1262 ExID value of EBCDIC "SMCR" on its SYN-ACK flow. Because the server 1263 is listening for connections and does not know where client 1264 connections will come from, the server implementation may choose to 1265 unconditionally include this TCP option if it supports SMC-R. This 1266 may be required for server implementations where extensions to the 1267 TCP stack are not practical. For server implementations which can 1268 add code to examine and react to packets during the three-way 1269 handshake, the server should only include the SMC-R TCP option on 1270 SYN-ACK if the client included it on its SYN packet.
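As an illustration of the TCP option described above, the following Python sketch builds the experimental option and scans a raw TCP options field for it. It assumes the RFC 6994 layout with a 4-byte ExID (Kind, Length, ExID), which gives a total option length of 6 bytes; the function names are hypothetical, and the ExID bytes are taken from the value 0xE2D4C3D9 given in 3.1 rather than computed from a codepage.

```python
import struct

TCPOPT_EOL = 0            # end of option list
TCPOPT_NOP = 1            # no-operation padding
TCPOPT_EXPERIMENTAL = 254
SMCR_EXID = 0xE2D4C3D9    # 'SMCR' in EBCDIC (IBM-1047), per 3.1


def build_smcr_option() -> bytes:
    """Kind (1 byte), Length (1 byte, here 6), ExID (4 bytes, network order)."""
    return struct.pack("!BBI", TCPOPT_EXPERIMENTAL, 6, SMCR_EXID)


def has_smcr_option(options: bytes) -> bool:
    """Walk a raw TCP options field and look for option 254 with the SMC-R ExID."""
    exid = struct.pack("!I", SMCR_EXID)
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCPOPT_EOL:
            break
        if kind == TCPOPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option header
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed length
        if kind == TCPOPT_EXPERIMENTAL and length >= 6 and options[i + 2:i + 6] == exid:
            return True
        i += length
    return False


def smcr_eligible(syn_options: bytes, synack_options: bytes) -> bool:
    """SMC-R negotiation proceeds only if both peers sent the option."""
    return has_smcr_option(syn_options) and has_smcr_option(synack_options)


print(build_smcr_option().hex())
```

A stack that cannot inspect the SYN and SYN-ACK packets directly would obtain the same answer from whatever interface it offers for reading the peer's options; the byte-level scan here is only one possible approach.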
1272 A client who supports SMC-R and meets the three conditions outlined 1273 above may optionally include the TCP option for SMC-R on its ACK 1274 flow, regardless of whether or not the server included it on its SYN- 1275 ACK flow. Some TCP/IP stacks may have to include it if the SMC-R 1276 layer cannot modify the options on the socket until the 3-way 1277 handshake completes. Proprietary servers should not include this 1278 option on the ACK flow, since including it on the SYN flow was 1279 sufficient to indicate the client's capabilities. 1281 Once the initial three-way TCP handshake is completed, each peer 1282 examines the socket options. SMC-R implementations may do this by 1283 examining what was actually provided on the SYN and SYN-ACK packets 1284 or by performing a getsockopt() operation to determine the options 1285 set by the peer. If neither peer, or only one peer, specified the TCP 1286 option for SMC-R, then SMC-R cannot be used on this connection and it 1287 proceeds using normal IP flows and processing. 1289 If both peers specified the TCP option for SMC-R, then the TCP 1290 connection is not started yet and the peers proceed to SMC-R 1291 negotiation using inline data flows. The socket is not yet turned 1292 over to the applications; instead the respective SMC layers exchange 1293 CLC messages over the newly formed TCP connection. 1295 3.5.1.2. Client Proposal 1297 If SMC-R is supported by both peers, the client sends an SMC Proposal 1298 CLC message to the server. On this flow from client to server it is 1299 not immediately apparent if this is a new or existing SMC-R link 1300 because in clustered environments a single IP address may represent 1301 multiple hosts. This type of cluster virtual IP address can be owned 1302 by a network based or host based layer 4 load balancer that 1303 distributes incoming TCP connections across a cluster of 1304 servers/hosts. 
   Other clustered environments may also support the
   movement of a virtual IP address dynamically from one host in the
   cluster to another for high availability purposes. In summary, the
   client cannot pre-determine that a connection is targeting the same
   host simply by matching the destination IP address for outgoing TCP
   connections. Therefore it cannot pre-determine the SMC-R link that
   will be used for a new TCP connection. This information will be
   dynamically learned and the appropriate actions will be taken as the
   SMC-R negotiation handshake unfolds.

   On the SMC-R proposal message, the initiator (client) proposes use of
   SMC-R by including its peer ID and GID and MAC addresses, as well as
   the IP subnet number of the outgoing interface (if IPv4) or the IP
   prefix list for the network that the proposal is sent over (if IPv6).
   At this point in the flow, the client makes no local commitments of
   resources for SMC-R.

   When the server receives the SMC Proposal CLC message, it uses the
   peer ID provided by the client, plus the subnet or prefix information
   provided by the client, to determine if it already has a usable SMC-R
   link with this SMC-R peer. If there are one or more existing SMC-R
   links with this SMC-R peer, the server then decides which SMC link it
   will use for this TCP connection. See subsequent sections for the
   cases of reusing an existing SMC-R link or creating a parallel SMC
   link group between SMC-R peers.

   If this is a first contact between SMC-R peers, the server must
   validate that it is on the same LAN as the client before continuing.
   For IPv4, the server does this by verifying that it has an interface
   with an IP subnet number that matches the subnet number set by the
   client on the SMC Proposal.
   For IPv6 it does this by verifying that
   it is directly attached to at least one IP prefix that was listed by
   the client in its SMC Proposal message.

   If the server agrees to use SMC-R, it begins setup of a new SMC-R
   link by allocating local QP and RMB resources (setting its QP state
   to INIT) and providing its full SMC-R information in an SMC Accept
   CLC message to the client over the TCP connection, along with a flag
   set indicating that this is a first contact flow. While the SMC
   Accept message could flow over any route back to the client depending
   upon IP routing, the SMC-R credentials provided must be for the
   common subnet or prefix between the server and client, as determined
   above. If the server cannot or does not want to do SMC-R with the
   client it sends an SMC Decline CLC message to the client and the
   connection data may begin flowing using normal TCP/IP flows.

3.5.1.3. Server acceptance

   When the client receives the SMC Accept from the server, it uses the
   combination of the first contact flag, its GID/MAC and the GID/MAC
   returned by the server plus the LAN that the connection is setting up
   over and the QP number provided by the server to determine if this is
   a new or existing SMC-R link.

   If it is an existing SMC-R link, and the client agrees to use that
   link for the TCP connection, see 3.5.2. Subsequent contact below. If
   it is a new SMC-R link between peers that already have an SMC link,
   then the server is starting a new SMC link group.
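
   The client's determination described above can be sketched as
   follows. This is an illustrative sketch only: the type and function
   names are hypothetical, identifiers are modelled as plain strings and
   integers, and treating a set first contact flag as overriding any
   match with a known link is an assumption of the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class AcceptKind(Enum):
    FIRST_CONTACT = "first contact"
    NEW_LINK_GROUP = "new link group with a known peer"
    EXISTING_LINK = "existing SMC-R link"


@dataclass(frozen=True)
class LinkIdentity:
    """Values the client compares when it receives an SMC Accept."""
    peer_id: str
    local_gid: str
    local_mac: str
    peer_gid: str
    peer_mac: str
    lan: str
    peer_qp: int


def classify_accept(first_contact_flag: bool,
                    candidate: LinkIdentity,
                    known_links: Iterable[LinkIdentity]) -> AcceptKind:
    known = list(known_links)
    # Existing link: every compared value matches and the flag is clear.
    if not first_contact_flag and candidate in known:
        return AcceptKind.EXISTING_LINK
    # A new link to a peer we already have a link with means the server
    # is starting a new link group.
    if any(link.peer_id == candidate.peer_id for link in known):
        return AcceptKind.NEW_LINK_GROUP
    return AcceptKind.FIRST_CONTACT
```

   Only the EXISTING_LINK result leads to the subsequent contact
   processing; the other two results are handled as first contact.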

   Assuming this is either a first contact between peers or the server
   is starting a new SMC link group, the client now allocates local QP
   and RMB resources for the SMC-R link (setting the QP state to RTR or
   "ready to receive"), associates them with the server QP as learned on
   the SMC Accept CLC message, and sends an SMC Confirm CLC message to
   the server over the TCP connection with its SMC-R link information
   included. The client also starts a timer to wait for the server to
   confirm the reliable connected QP as described below.

3.5.1.4. Client confirmation

   Upon receipt of the client's SMC Confirm CLC message, the server
   associates its QP for this SMC-R link with the client's QP as learned
   on the SMC Confirm CLC message and sets its QP state to RTS (ready to
   send). Now the client and the server have reliable connected QPs.

3.5.1.5. Link (QP) confirmation

   Since setting up the SMC-R link and its QPs did not require any
   network flows on the RoCE fabric, the client and server must now
   confirm connectivity over the RoCE fabric. To accomplish this, the
   server will send a "Confirm Link" Link Layer Control (LLC) message to
   the client over the RoCE fabric. The "Confirm Link" LLC message will
   provide the server's MAC, GID, and QP information for the connection,
   allow each partner to communicate the maximum number of links it can
   tolerate in this link group (the "link limit"), and will additionally
   provide two link IDs:

   o  a one-byte server-assigned Link number that is used by both peers
      to identify the link within the link group and is only unique
      within a link group.

   o  a four byte link user id. This opaque value is assigned by the
      server for the server's local use and is provided to the client
      for management purposes, for example to use in network management
      displays and products.

   When the server sends this message, it will set a timer for receiving
   confirmation from the client.

   When the client receives the server's confirmation "Confirm Link" LLC
   message it will cancel the confirmation timer it set when it sent the
   SMC Confirm message. It will also advance its QP state to RTS and
   respond over the RoCE fabric with a "Confirm Link" response LLC
   message, providing its MAC, GID, QP number, link limit, confirming
   the one byte link number sent by the server, and providing its own
   four byte link user id to the server.

     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   | PeerID = PS1      |                      |      PeerID = PC1 |
   |                +------+              +------+                |
   |          QP 8  |RNIC 1|              |RNIC 2|  QP 64         |
   |RToken X|       |MAC MA|              |MAC MB|       |        |
   |        |       |GID GA|              |GID GB|       |RToken Y|
   |       \/       +------+ (Subnet S1)  +------+       \/       |
   |+--------+         |                      |         +--------+|
   ||  RMB   |         |                      |         |  RMB   ||
   |+--------+         |                      |         +--------+|
   |                +------+              +------+                |
   |                |RNIC 3|              |RNIC 4|                |
   |                |MAC MC|              |MAC MD|                |
   |                |GID GC|              |GID GD|                |
   |                +------+              +------+                |
   +-------------------+                      +-------------------+

   SYN TCP options(254,"SMCR")
   <---------------------------------------------------------

   SYN-ACK TCP options(254, "SMCR")
   --------------------------------------------------------->

   ACK [TCP options(254, "SMCR")]
   <---------------------------------------------------------

   SMC Proposal(PC1,MB,GB,S1)
   <---------------------------------------------------------

   SMC Accept(PS1,first contact,MA,GA,MTU,QP8,RToken=X,RMB elem ndx)
   --------------------------------------------------------->

   SMC Confirm(PC1,MB,GB,MTU,QP64,RToken=Y, RMB element index)
   <---------------------------------------------------------

   Confirm Link (MA,GA,QP8, link lim, server's link userid, linknum)
   .........................................................>

   Confirm Link Rsp(MB,GB,QP64, link lim, client link userid, linknum)
   <.........................................................

   Legend:
   ------------ TCP/IP and CLC flows
   ............ RoCE (LLC) flows

              Figure 8 First contact rendezvous flows

   Technically, the data for the TCP connection could now flow over the
   RoCE path. However, if this is first contact, there is no alternate
   for this recently established RoCE path. Since in the current
   architecture there is no failover from RoCE to IP once connection
   data starts flowing, this means that a failure of this path would
   disrupt the TCP connection, meaning that the level of redundancy and
   failover is less than that provided by IP. If the network has
   alternate RoCE paths available, they would not be usable at this
   point, which is an unacceptable condition.

3.5.1.6. Second SMC-R link setup

   Because of the unacceptable situation described above, TCP data will
   not be allowed to flow on the newly established SMC-R link until a
   second path has been set up, or at least attempted.

   If the server has a second RNIC available on the same LAN, it
   attempts to set up the second SMC-R link over that second RNIC. If
   it only has one RNIC available on the LAN, it will attempt to set up
   the second SMC-R link over that one RNIC. In the latter case, the
   server is attempting to set up an asymmetric link, in case the client
   does have a second RNIC on the LAN.

   In either case the server allocates a new QP over the RNIC it is
   attempting to use for the second link, assigns a link number to the
   new link and also creates an RToken for the RMB over this second QP
   (note that this means that the first and second QP each has its own
   RToken to represent the same RMB).
   The server provides this
   information, as well as the MAC and GID of the RNIC over which it is
   attempting to set up the second link, in an "Add Link" LLC message
   which it sends to the client over the SMC-R link that is already set
   up.

3.5.1.6.1. Client processing of "Add Link" LLC message from server

   When the client receives the server's "Add Link" LLC message, it
   examines the GID and MAC provided by the server to determine if the
   server is attempting to use the same server-side RNIC as the existing
   SMC-R link, or a different one.

   If the server is attempting to use the same server-side RNIC as the
   existing SMC-R link, then the client verifies that it has a second
   RNIC on the same LAN. If it does not, the client rejects the "Add
   Link" request from the server, because the resulting link would be a
   parallel link which is not supported within a link group. If the
   client does have a second RNIC on the same LAN, it accepts the
   request and an asymmetric link will be set up.

   If the server is using a different server-side RNIC from the existing
   SMC-R link then the client will accept the request and a second SMC-R
   link will be set up in this SMC-R link group. If the client has a
   second RNIC on the same LAN, that second RNIC will be used for the
   second SMC-R link, creating symmetric links. If the client does not
   have a second RNIC on the same LAN, it will use the same RNIC as was
   used for the initial SMC-R link, resulting in the setup of an
   asymmetric link in the SMC-R link group.

   In either case, when the client accepts the server's "Add Link"
   request, it allocates a new QP on the chosen RNIC and creates an Rkey
   over that new QP for the client-side RMB for the SMC link group, then
   sends an "Add Link" reply LLC message to the server providing that
   information as well as echoing the Link number that was set by the
   server.
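
   The accept/reject rules above reduce to a small decision table. The
   sketch below is illustrative only; the names are hypothetical, and an
   RNIC is identified by its (GID, MAC) pair as carried on the "Add
   Link" message.

```python
from enum import Enum
from typing import Tuple

Rnic = Tuple[str, str]  # (GID, MAC) identifying one RNIC


class AddLinkResult(Enum):
    REJECT = "reject: would create a parallel link"
    ASYMMETRIC = "accept: asymmetric link"
    SYMMETRIC = "accept: symmetric link"


def client_add_link_decision(existing_server_rnic: Rnic,
                             proposed_server_rnic: Rnic,
                             client_has_second_rnic: bool) -> AddLinkResult:
    same_server_rnic = proposed_server_rnic == existing_server_rnic
    if same_server_rnic:
        # Server reuses its RNIC: only acceptable if the client can
        # supply a different RNIC on the same LAN.
        if client_has_second_rnic:
            return AddLinkResult.ASYMMETRIC
        return AddLinkResult.REJECT
    # Server offers a different RNIC: always accepted.
    if client_has_second_rnic:
        return AddLinkResult.SYMMETRIC
    # Client reuses the RNIC of the initial link.
    return AddLinkResult.ASYMMETRIC
```

   Of the four combinations, only one (same server-side RNIC and no
   second client-side RNIC) results in a rejection.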

   If the client rejects the server's "Add Link" request, it sends an
   "Add Link" reply LLC message to the server with the reason code for
   the rejection.

3.5.1.6.2. Server processing of "Add Link" reply LLC message from the
           client

   If the client sends a negative response to the server or no reply is
   received, the server frees the RoCE resources it had allocated for
   the new link. Having a single link in an SMC-R link group is
   undesirable and the server's recovery is detailed in C.8. Failure to
   add second SMC-R link to a link group.

   If the client sends a positive reply to the server with
   MAC/GID/QP/Rkey information, the server associates its QP for the new
   SMC-R link to the QP that the client provided. Now the new SMC-R
   link is in the same situation that the first was in after the client
   sent its ACK packet - there is a reliable connected QP over the new
   RoCE path, but there have been no RoCE flows to confirm that it's
   actually usable. So at this point the client and server will
   exchange "Confirm Link" LLC messages just like they did on the first
   SMC-R link.

   If either peer receives a failure during this second "Confirm Link"
   LLC exchange (either an immediate failure which implies that the
   message did not reach the partner, or a timeout), it sends a "Delete
   Link" LLC message to the partner over the first (and now only) link
   in the link group which must be acknowledged before data can flow on
   the single link in the link group.

     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   | PeerID = PS1      |                      |      PeerID = PC1 |
   |                +------+              +------+                |
   |          QP 8  |RNIC 1|              |RNIC 2|  QP 64         |
   |RToken X|       |MAC MA|              |MAC MB|       |        |
   |        |       |GID GA|              |GID GB|       |RToken Y|
   |       \/       +------+              +------+       \/       |
   |+--------+         |                      |         +--------+|
   ||        |         |                      |         |        ||
   ||  RMB   |         |                      |         |  RMB   ||
   ||        |         |                      |         |        ||
   |+--------+         |                      |         +--------+|
   |       /\       +------+              +------+       /\       |
   |        |       |RNIC 3|              |RNIC 4|       |        |
   |RToken Z|       |MAC MC|              |MAC MD|       |RToken W|
   |          QP 9  |GID GC|              |GID GD|  QP 65         |
   |                +------+              +------+                |
   +-------------------+                      +-------------------+

   First SMC-R link setup as shown in Figure 8
   <-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.->

   ADD link request (QP9,MC,GC, link number=2)
   ............................................>

   ADD link response (QP65,MD,GD, link number=2)
   <............................................

   ADD link continuation request (RToken=Z)
   ............................................>

   ADD link continuation response(RToken=W)
   <............................................

   Confirm Link(MC,GC,QP9,link number=2, link userid)
   .............................................>

   Confirm Link response(MD,GD,QP65,link number=2, link userid)
   <.............................................

   Legend:
   ------------ TCP/IP and CLC flows
   ............ RoCE (LLC) flows

              Figure 9 First contact, second link setup

3.5.1.6.3. Exchange of Rkeys on second SMC-R link

   Note that in the scenario described here, first contact, there is
   only one RMB Rkey to exchange on the second SMC-R link and it is
   exchanged in the Add Link Continuation request and reply. In
   scenarios other than first contact, for example, adding a new SMC-R
   link to a longstanding link group with multiple RMBs, additional
   flows will be required to exchange additional RMB Rkeys.
   See
   3.5.5.2.3. Adding a new SMC-R link to a link group with multiple RMBs
   for more details on these flows.

3.5.1.6.4. Aborting SMC-R and falling back to IP

   If either partner does not provide the SMC-R TCP option during the
   3-way TCP handshake, the connection falls back to normal TCP/IP.
   During the SMC-R negotiation that occurs after the 3-way TCP
   handshake, either partner may break off SMC-R by sending an SMC
   Decline CLC message. The SMC Decline CLC message may be sent in
   place of any expected message, and may also be sent during the
   Confirm Link LLC exchange if there is a failure before any
   application data has flowed over the RoCE fabric. For more detail on
   exactly when an SMC Decline can flow during link group setup, see
   C.1. SMC Decline during CLC negotiation and C.2. SMC Decline during
   LLC negotiation.

   If this fallback to IP happens while setting up a new SMC-R link
   group, the RoCE resources allocated for this SMC-R link group
   relationship are torn down and it will be retried as a new SMC-R link
   group next time a connection starts between these peers with SMC-R
   proposed. Note that if this happens because one side doesn't support
   SMC-R, there will be very little to tear down as the TCP option will
   have failed to flow either on the initial SYN or the SYN-ACK, before
   either side had reserved any local RoCE resources.

3.5.2. Subsequent contact

   "Subsequent contact" means setting up a new TCP connection between
   two peers that already have an SMC-R link group between them, and
   reusing the existing SMC-R link group. In this case it is not
   necessary to allocate new QPs. However it is possible that a new RMB
   has been allocated for this TCP connection, if the previous TCP
   connection used the last element available in the previously used
   RMB, or for any other implementation-dependent reason.
   For this
   reason, and for convenience and error checking, the same TCP option
   254 followed by inline negotiation method described for initial
   contact will be used for subsequent contact, but the processing
   differs in some ways. That processing is described below.

3.5.2.1. SMC-R proposal

   When the client begins the inline negotiation with the server, it
   does not know if this is a first contact or a subsequent contact.
   The client cannot know this information until it sees the server's
   peer ID to determine whether or not it already has an SMC-R link with
   this peer that it can use. There are several reasons why it is not
   sufficient to use the partner IP address, subnet, VLAN or other IP
   information to make this determination. The most obvious reason is
   distributed systems: if the server IP address is actually a virtual
   IP address representing a distributed cluster, the actual host
   serving this TCP connection may not be the same as the host that
   served the last TCP connection to this same IP address.

   After the TCP three way handshake, assuming both partners indicate
   SMC-R capability, the client builds and sends the SMC Proposal CLC
   message to the server in exactly the same manner as it does in the
   first contact case, and in fact at this point doesn't know if it's
   first contact or subsequent contact. As in the first contact case,
   the client sends its Peer ID value, suggested RNIC GID/MAC, and IP
   subnet or prefix information.

   Upon receiving the client's proposal, the server looks up the peer ID
   provided to determine if it already has a usable SMC-R link group
   with this peer. If it does already have a usable SMC-R link group,
   the server then needs to decide if it will use the existing SMC-R
   link group, or create a new link group. For the new link group
   case, see 3.5.3.
   First contact variation: creating a parallel link
   group, below.

   For this discussion assume the server decides to use the existing
   SMC-R link group for the TCP connection, which is expected to be the
   most common case. The server is responsible for making this decision.
   Then the server needs to communicate that information to the client,
   but it is not necessary to allocate, associate, and confirm QPs for
   the chosen SMC-R link. All that remains to be done is to set up RMB
   space for this TCP connection.

   If one of the RMBs already in use for this SMC-R link group has an
   available element that uses the appropriate buffer size, the server
   merely chooses one for this TCP connection and then sends an SMC
   Accept CLC message, providing the full RoCE information for the
   chosen SMC-R link to the client, using the same format as the SMC
   Accept CLC message described in the initial contact section above.

   The server may choose to use the SMC-R link that matches the
   suggested MAC/GID provided by the client on the SMC Proposal for its
   RDMA writes but is not obligated to. The final decision on which
   specific SMC-R link to assign a TCP connection to is an independent
   server and client decision.

   It may be necessary for the server to allocate a new RMB for this
   connection. The reasons for this are implementation dependent and
   could include: no available space in existing RMB or RMBs, or desire
   to allocate a new RMB that uses a different buffer size from the ones
   already created, or any other implementation dependent reason. In
   this case the server will allocate the new RMB and then perform the
   flows described in 3.5.5.2.1. Adding a new RMB to an SMC-R link
   group. Once that processing is complete, the server then provides the
   full RoCE information, including the new Rkey, for this connection
   on an SMC Accept CLC message to the client.
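
   The server's RMB selection for a subsequent contact, as described
   above, can be sketched as follows. This is a minimal illustration
   under stated assumptions: the Rmb class, the function name, and the
   elements_per_rmb default are hypothetical, and an RMB is modelled
   only by its buffer size and a per-element "in use" flag. Matching on
   exact buffer size is one possible reading of "appropriate buffer
   size".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rmb:
    buffer_size: int
    in_use: List[bool]  # one flag per RMB element

    def take_free_element(self) -> Optional[int]:
        """Mark the first free element as used and return its index."""
        for index, used in enumerate(self.in_use):
            if not used:
                self.in_use[index] = True
                return index
        return None


def choose_rmb_element(rmbs: List[Rmb], buffer_size: int,
                       elements_per_rmb: int = 4) -> Tuple[Rmb, int, bool]:
    """Return (rmb, element index, new_rmb_allocated).

    When new_rmb_allocated is True, the flows for adding a new RMB to
    the link group must complete before the SMC Accept is sent.
    """
    if elements_per_rmb < 1:
        raise ValueError("an RMB needs at least one element")
    for rmb in rmbs:
        if rmb.buffer_size == buffer_size:
            index = rmb.take_free_element()
            if index is not None:
                return rmb, index, False
    # No suitable free element: allocate a new RMB for the link group.
    new_rmb = Rmb(buffer_size, [False] * elements_per_rmb)
    rmbs.append(new_rmb)
    index = new_rmb.take_free_element()
    assert index is not None
    return new_rmb, index, True
```

   The returned flag separates the common case, in which the SMC Accept
   can be sent immediately, from the case in which the new RMB must
   first be made known to the peer.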

3.5.2.2. SMC-R acceptance

   Upon receiving the SMC Accept CLC message from the server, the client
   examines the RoCE information provided by the server to determine if
   this is a first contact for a new SMC link group, or subsequent
   contact for an existing SMC-R link group. It is subsequent contact
   if the server side peer ID, GID, MAC and QP number provided on the
   packet match a known SMC-R link, and the "first contact" flag is not
   set. If this is not the case, for example the GID and MAC match but
   the QP is new, then the server is creating a new, parallel SMC-R link
   group and this is treated as a first contact.

   A different RMB RToken does not indicate a first contact as the
   server may have allocated a new RMB, or be using several RMBs for
   this SMC-R link. The client needs the server's RMB information only
   for its RDMA writes to the server, and since there is no requirement
   for symmetric RMBs, this information is simply control information
   for the RDMA writes on this SMC-R link.

   The client must validate that the RMB element being provided by the
   server is not in use by another TCP connection on this SMC-R link
   group. This validation must check the new RToken and RMB element
   index pair against all known pairs on this link group. See 4.4.2.
   RMB element reuse and conflict resolution for the case in which the
   server tries to use an RMB element that is already in use on this
   link group.

   Once the client has determined that this TCP connection is a
   subsequent contact over an existing SMC link, it performs a similar
   RMB allocation process as the server did: it either allocates an
   element from an RMB already associated with this SMC-R link, or it
   allocates a new RMB and associates it with this SMC-R link and then
   chooses an element out of it.

   If the client allocates a new RMB for this TCP connection, it
   performs the processing described in 3.5.5.2.1.
   Adding a new RMB to
   an SMC-R link group. Once that processing is complete, the client
   provides its full RoCE information for this TCP connection on an SMC
   Confirm CLC message.

   Because an SMC-R link with a verified connected QP already exists and
   is being reused, there is no need for verification or alternate QP
   selection flows or timers.

3.5.2.3. SMC-R confirmation

   When the server receives the client's SMC Confirm CLC message on a
   subsequent contact, it verifies the following:

   o  the RMB element provided by the client is not already in use by
      another TCP connection on this SMC-R link group (see section
      4.4.2. RMB element reuse and conflict resolution for the case in
      which it is).

   o  The MAC/GID/QP info provided by the client matches an active link
      within the link group. The client is free to select any valid /
      active link. The client is not required to select the same link as
      the server.

   If this validation passes, the server stores the client's RMB
   information for this connection and the RoCE setup of the TCP
   connection is complete.

3.5.2.4. TCP data flow race with SMC Confirm CLC message

   On a subsequent contact TCP/IP connection, a peer may send data as
   soon as it has received the peer RMB information for the connection.
   There are no additional RoCE confirmation flows, since the QPs on the
   SMC link are already reliably connected and verified.

   In the majority of cases the first data will flow from the client to
   the server. The client must send the SMC Confirm CLC message before
   sending any connection data over the chosen SMC-R link, however the
   client need not wait for confirmation of this message, and in fact
   there will be no such confirmation.
   Since the server is required to
   have the RMB fully set up and ready to receive data from the client
   before sending SMC Accept CLC message, the client can begin sending
   data over the SMC-R link immediately upon completing the send of the
   SMC Confirm CLC message.

   It is possible that data from the client will arrive into the server
   side RMB before the SMC Confirm CLC message from the client has been
   processed. In this case the server must handle this race condition,
   and not provide the arrived TCP data to the socket application until
   the SMC Confirm CLC message has been received and fully processed,
   opening the socket.

   If the server has initial data to send to the client which is not a
   response to the client (this case should be rare), it can send the
   data immediately upon receiving and processing the SMC Confirm CLC
   message from the client. The client must have opened the TCP socket
   to the client application upon sending of SMC Confirm CLC message so
   the client will be ready to process data from the server.

3.5.3. First contact variation: creating a parallel link group

   Recall that parallel SMC-R links within an SMC-R link group are not
   supported. These are multiple SMC-R links within a link group that
   use the same network path. However, multiple SMC-R link groups
   between the same peers are supported. This means that if multiple
   SMC-R links over the same RoCE path are desired, it is necessary to
   use multiple SMC-R link groups. While not a recommended practice,
   this could be done for platform specific reasons, like QP separation
   of different workloads. Only the server can drive the creation of
   multiple SMC-R link groups between peers.

   At a high level, when the server decides to create an additional
   SMC-R link group with a client it already has an SMC-R link group
   with, the flows are basically the same as the normal "first contact"
   case described above. The following provides more detail and
   clarification of processing in this case.

   When the server receives the SMC Proposal CLC message from the client
   and using the GID/MAC info determines that it already has an SMC-R
   link group with this client, the server can either reuse the existing
   SMC-R link group (detailed in 3.5.2. Subsequent contact above) or it
   can create a new SMC-R link group in addition to the existing one.

   If the server decides to create a new SMC-R link group, it does the
   same processing it would have done for first contact: allocate QP and
   RMB resources as well as alternate QP resources, and communicate the
   QP and RMB information to the client on the SMC Accept CLC message
   with the "first contact" flag set.

   When the client receives the server's SMC Accept CLC message with the
   new QP information and the "first contact" flag, it knows the server
   is creating a new SMC-R link group even though it already has an
   SMC-R link group with the server. In this case the client will also
   allocate a new QP for this new SMC link and allocate an RMB for this
   link and generate an Rkey for it.

   Note that multiple SMC-R link groups between the same peers must
   access different RMB resources, so new RMBs will be required. Using
   the same RMBs that are in use in another SMC-R link group is not
   permitted.

   The client then associates its new QP with the server's new QP and
   sends its SMC Confirm CLC message back to the server providing the
   new QP/RMB information and sets its confirmation timer for the new
   SMC-R link.

   When the server receives the client's SMC Confirm CLC message it
   associates its QP with the client's QP as learned on the SMC Confirm
   CLC message and sends a confirmation LLC message. The rest of the
   flow, with the confirmation QP and setup of additional SMC-R links,
   unfolds just like the first contact case.

3.5.4. Normal SMC-R link termination

   The normal sockets API trigger points are used by the SMC-R layer to
   initiate SMC-R connection termination flows. The main design point
   for SMC-R normal connection flows is to use the SMC-R protocol to
   first shut down the SMC-R connection and free up any SMC-R RDMA
   resources and then allow the normal TCP connection termination
   protocol (i.e. FIN processing) to drive cleanup of the TCP connection
   that exists on the IP fabric. This design point is very important in
   ensuring that RDMA resources such as the RMBEs are only freed and
   reused when both SMC-R end points are completely done with their RDMA
   Write operations to the partner's RMBE.

   When the last TCP connection over an SMC-R link group terminates, the
   link group can be terminated. Similar to creation of SMC-R links and
   link groups, the primary responsibility for determining that normal
   termination is needed and initiating it lies with the server.
   Implementations may opt to set timers to keep SMC-R link groups up
   for a specified time after the last TCP connection ends, to avoid
   churn in cases when TCP connections come and go regularly.

   The link or link group may also be terminated as a result of an
   operator initiated command. This command can be entered at either
   the client or the server. If entered at the client, the client
   requests that the server perform link or link group termination, and
   the responsibility for doing so ultimately lies with the server.

   When the server determines that the SMC-R link group is to be
   terminated, it sends a DELETE LINK LLC message to the client, with a
   flag set indicating that all links in the link group are to be
   terminated. After receiving confirmation from the adapter that the
   DELETE LINK LLC message has been sent, the server can clean up its
   end of the link group (QPs, RMBs, etc.). Upon receipt of the DELETE
   LINK message from the server, the client must immediately comply and
   clean up its end of the link group. Any TCP connections that the
   client believes to be active on the link group must be immediately
   terminated.

   The client can request that the server delete the link group as well.
   The client does this by sending a DELETE LINK message to the server
   indicating that cleanup of all links is requested. The server must
   comply by sending a DELETE LINK to the client and processing as
   described above. If there are TCP connections active on the link
   group when the server receives this request, they are immediately
   terminated by sending an RST flow over the IP fabric.

3.5.5. Link group management flows

3.5.5.1. Adding and deleting links in an SMC-R link group

   The server has the lead role in managing the composition of the link
   group. Links are added to the link group by the server. The client
   may notify the server of new conditions that may result in the server
   adding a new link, but the server is ultimately responsible. In
   general, links are deleted from the link group by the server;
   however, in certain error cases the client may inform the server that
   a link must be deleted and treat it as deleted without waiting for
   action from the server. These flows are detailed in the following
   sections.

3.5.5.1.1. Server initiated Add Link processing

   As described in previous sections, the server initiates an Add Link
   exchange to create redundancy in a newly created link group. Once a
   link group is established the server may also initiate Add Link for
   other reasons, including:

   o  Availability of additional resources on the server host to support
      an additional SMC-R link. This may include the provisioning of an
      additional RNIC, more storage becoming available to support
      additional QP resources, operator command, or any other
      implementation dependent reason. Note that, to be available for
      an existing link group, a new RNIC must be attached to the same
      RoCE LAN that the link group is using.

   o  Receipt of notification from the client that additional resources
      on the client are available to support an additional SMC-R link.
      See 3.5.5.1.2. Client initiated Add Link processing.

   Server initiated Add Link processing in an established SMC-R link
   group is the same as the Add Link processing described in 3.5.1.6.
   Second SMC-R link setup with the following changes:

   o  If an asymmetric SMC-R link already exists in the link group a
      second asymmetric link will not be created. Only one asymmetric
      link is permitted in a link group.

   o  TCP data flow on already existing link(s) in the link group is not
      halted or otherwise affected during the process of setting up the
      additional link.

   In no case will the server initiate Add Link processing if the link
   group already has the maximum number of links negotiated by the
   partners.

3.5.5.1.2. Client initiated Add Link processing

   If an additional RNIC becomes available for an existing SMC-R link
   group on the client's side, the client notifies the server by sending
   an Add Link request LLC message to the server.
   Unlike an Add Link
   request sent by the server to the client, this Add Link request
   merely informs the server that the client has a new RNIC. If the
   link group lacks redundancy, or has redundancy only on an asymmetric
   link with a single RNIC on the client side, the server must initiate
   an Add Link exchange in response to this message, to create or
   improve the link group's redundancy.

   If the link group already has symmetric link redundancy but has fewer
   than the negotiated maximum number of links, the server may respond
   by initiating an Add Link exchange to create a new link using the
   client's new resource, but is not required to.

   If the link group already has the negotiated maximum number of links,
   the server must ignore the client's Add Link request LLC message.

   Because the server is not required to respond to the client's Add
   Link LLC message in all cases, the client must not wait for a
   response or treat the absence of a response as an error.

3.5.5.1.3. Server initiated Delete Link Processing

   Reasons that a server may delete a link include:

   o The link has not been used for TCP connections for an
     implementation-defined time interval, and deleting the link will
     not cause the link group to lack redundancy.

   o An error in resources supporting the link. These may include but
     are not limited to: RNIC errors, QP errors, software errors.

   o The RNIC supporting this SMC-R link is being taken down, either
     because of an error case or because of an operator or software
     command.

   If a link being deleted is supporting TCP connections, and there are
   one or more surviving links in the link group, the TCP connections
   are moved to the surviving links. For more information on this
   processing see 2.3. SMC-R resilience and load balancing.
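
   As a non-normative illustration of the rules in 3.5.5.1.2, the
   following Python sketch shows how a server might classify an Add
   Link request received from the client. The names Redundancy, Action,
   and on_client_add_link are hypothetical and are not defined by this
   document. Treating every redundancy state other than the two named
   in 3.5.5.1.2 as "may add" is an assumption.

```python
from enum import Enum


class Redundancy(Enum):
    """Hypothetical summary of a link group's current redundancy."""
    NONE = 1                    # the link group lacks redundancy
    ASYMMETRIC_CLIENT_SIDE = 2  # asymmetric link, single RNIC on the client
    SYMMETRIC = 3               # symmetric link redundancy


class Action(Enum):
    MUST_ADD_LINK = 1  # server must initiate an Add Link exchange
    MAY_ADD_LINK = 2   # server may initiate one but is not required to
    IGNORE = 3         # server must ignore the client's request


def on_client_add_link(redundancy, link_count, max_links):
    """Classify a client Add Link request (sketch of 3.5.5.1.2)."""
    # Never add a link beyond the negotiated maximum.
    if link_count >= max_links:
        return Action.IGNORE
    # Missing or client-limited redundancy must be created or improved.
    if redundancy in (Redundancy.NONE, Redundancy.ASYMMETRIC_CLIENT_SIDE):
        return Action.MUST_ADD_LINK
    # Assumption: any other state is handled as the optional case.
    return Action.MAY_ADD_LINK
```

   Checking the negotiated maximum first reflects the statement in
   3.5.5.1.1 that the server never initiates Add Link processing once
   the link group is at its maximum size.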
   The server deletes a link from the link group by sending a Delete
   Link request LLC message to the client over any of the usable links
   in the link group. Because the Delete Link LLC message specifies
   which link is to be deleted, it may flow over any link in the link
   group. The server must not clean up its RoCE resources for the link
   until the client responds.

   The client responds to the server's Delete Link request LLC message
   by sending the server a Delete Link response LLC message. The client
   must respond positively; it cannot decline to delete the link. Once
   the server has received the client's Delete Link response, both sides
   may clean up their resources for the link.

   Positive write completion or other indication from the RNIC on the
   client's side is sufficient to indicate to the client that the server
   has received the Delete Link response.

          Host X                                     Host Y
   +-------------------+                      +-------------------+
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|     SMC-R Link 1     |RNIC 2|   QP 9     |
   |RToken X|   |Failed|<--X----X----X----X-->|      |            |
   |        |   |      |                      |      |            |
   |       \/   +------+                      +------+            |
   |+--------+         |                      |                   |
   || deleted|         |                      |                   |
   ||   RMB  |         |                      |                   |
   ||        |         |                      |                   |
   |+--------+         |                      |                   |
   |       /\   +------+                      +------+            |
   |RToken Z|   |      |     SMC-R Link 2     |      |            |
   |        |   |RNIC 3|<-------------------->|RNIC 4|            |
   |   QP 64|   |      |                      |      |   QP 65    |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

             DELETE LINK(Request, link number = 1,
        ................................................>
                  reason code = RNIC failure)

             DELETE LINK(Response, link number = 1)
        <................................................

      (note, architecturally this exchange can flow over either
       SMC-R link but most likely flows over link 2 since
       the RNIC for link 1 has failed)

              Figure 10 Server initiated Delete Link flow

3.5.5.1.4. Client initiated Delete Link request

   The client may request that the server delete a link for the same
   reasons that the server may delete a link, except for inactivity
   timeout.

   Because the client depends on the server to delete links, there are
   two types of delete requests from client to server:

   o Orderly: the client is requesting that the server delete the link
     when able. This would result from an operator command to bring
     down the RNIC or some other nonfatal reason. In this case the
     server is required to delete the link, but need not do so
     immediately.

   o Disorderly: the server must delete the link right away, because
     the client has experienced a fatal error with the link.

   In either case the server responds by initiating a Delete Link
   exchange with the client as described in the previous section. The
   difference between the two is whether the server must do so
   immediately or can delay for an opportunity to gracefully delete the
   link.

          Host X                                     Host Y
   +-------------------+                      +-------------------+
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|     SMC-R Link 1     |RNIC 2|   QP 9     |
   |RToken X|   |      |<---X--X--X--X--X--X->|Failed|            |
   |        |   |      |                      |      |            |
   |       \/   +------+                      +------+            |
   |+--------+         |                      |                   |
   || deleted|         |                      |                   |
   ||   RMB  |         |                      |                   |
   ||        |         |                      |                   |
   |+--------+         |                      |                   |
   |       /\   +------+                      +------+            |
   |RToken Z|   |      |     SMC-R Link 2     |      |            |
   |        |   |RNIC 3|<-------------------->|RNIC 4|            |
   |   QP 64|   |      |                      |      |   QP 65    |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

         DELETE LINK(Request, link number = 1, disorderly,
        <...............................................
                  reason code = RNIC failure)

             DELETE LINK(Request, link number = 1,
        ................................................>
                  reason code = RNIC failure)

             DELETE LINK(Response, link number = 1)
        <................................................
      (note, architecturally this exchange can flow over either
       SMC-R link but most likely flows over link 2 since
       the RNIC for link 1 has failed)

                Figure 11 Client-initiated Delete Link

3.5.5.2. Managing multiple Rkeys over multiple SMC-R links in a link
         group

   After the initial contact sequence completes and the number of TCP
   connections increases, it is possible that the SMC peers could add
   additional RMBs to the link group. Recall that each peer
   independently manages its RMBs. Also recall that an RMB's RToken is
   specific to a QP, which means that when there are multiple SMC-R
   links in a link group, each RMB accessed with the link group requires
   a separate RToken for each SMC-R link in the group.

   Each RMB that is added to a link must be added to all links within
   the link group. The set of RTokens created for the RMB, one per
   link, is called the "RToken Set". The RTokens must be exchanged with
   the peer. As RMBs are added and deleted, the RToken Set must remain
   in sync.

3.5.5.2.1. Adding a new RMB to an SMC-R link group

   A new RMB can be added to an SMC-R link group on either the client or
   the server side. When an additional RMB is added to an existing
   SMC-R link group, that RMB must be associated with the QPs for each
   link in the link group. Therefore, when an RMB is added to an SMC-R
   link group, its RMB RToken for each SMC-R link's QP must be
   communicated to the peer.

   The tokens for a new RMB added to an existing SMC-R link group are
   communicated using "Confirm Rkey" LLC messages, as shown in Figure
   12. The RToken set is specified as pairs: an SMC link number, paired
   with the new RMB's RToken over that SMC link. To preserve failover
   capability, any TCP connection that uses a newly added RMB cannot go
   active until all RTokens for the RMB have been communicated for all
   the links in the link group.
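
   The rule that a newly added RMB cannot be used until its RTokens
   have been communicated for every link can be sketched as follows.
   This is a minimal, non-normative Python illustration; the class
   RmbTokens and its method names are hypothetical, and RTokens are
   represented as plain strings for readability.

```python
class RmbTokens:
    """Hypothetical record of one RMB's RTokens, keyed by link number."""

    def __init__(self):
        self.local = {}         # link number -> RToken created locally
        self.confirmed = set()  # link numbers confirmed by the peer

    def add_token(self, link, rtoken):
        """Record the RToken created for this RMB on one link's QP."""
        self.local[link] = rtoken

    def confirm_rkey_request(self):
        """Return the RToken set as (link number, RToken) pairs."""
        return sorted(self.local.items())

    def on_confirm_rkey_response(self, pairs):
        """Note which links the peer has confirmed."""
        self.confirmed.update(link for link, _ in pairs)

    def usable(self, group_links):
        """True once every link in the group has a confirmed RToken."""
        return set(group_links) <= self.confirmed
```

   In this sketch, a TCP connection would be allowed to go active on
   the RMB only after usable() returns True for the full list of link
   numbers in the link group.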
          Host X                                     Host Y
   +-------------------+                      +-------------------+
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|     SMC-R Link 1     |RNIC 2|   QP 9     |
   |RToken X|   |      |<-------------------->|      |            |
   |        |   |      |                      |      |            |
   |       \/   +------+                      +------+            |
   |+--------+         |                      |                   |
   ||   new  |         |                      |                   |
   ||   RMB  |         |                      |                   |
   ||        |         |                      |                   |
   |+--------+         |                      |                   |
   |       /\   +------+                      +------+            |
   |RToken Z|   |      |     SMC-R Link 2     |      |            |
   |        |   |RNIC 3|<-------------------->|RNIC 4|            |
   |   QP 64|   |      |                      |      |   QP 65    |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

                  CONFIRM RKEY(Request, Add,
        ................................................>
         RToken set((Link 1,RToken X),(Link2,RToken Z)))

                  CONFIRM RKEY(Response, Add,
        <................................................
         RToken set((Link 1,RToken X),(Link2,RToken Z)))

        (note, this exchange can flow over either SMC-R link)

               Figure 12 Add RMB to existing link group

   Implementations may choose to proactively add RMBs to link groups in
   anticipation of need. For example, an implementation may add a new
   RMB when all of its existing RMBs are over a certain threshold
   percentage used.

   A new RMB may also be added to an existing link group on an as-needed
   basis. This can happen, for example, when a new TCP connection is
   added to the link group but there are no available RMB elements. In
   this case the CLC exchange is paused while the peer that requires the
   new RMB adds it. An example of this is illustrated in Figure 13.
     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   |   PeerID = PS1    |                      |   PeerID = PC1    |
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|     SMC-R link 1     |RNIC 2|   QP 64    |
   |RToken X|   |MAC MA|<-------------------->|MAC MB|   |        |
   |        |   |GID GA|                      |GID GB|   |RTokenY2|
   |       \/   +------+                      +------+   \/       |
   |+--------+         |                      |         +--------+|
   ||        |         |      SUBNET S1       |         |  New   ||
   ||   RMB  |         |                      |         |  RMB   ||
   |+--------+         |                      |         +--------+|
   |       /\   +------+                      +------+   /\       |
   |        |   |RNIC 3|     SMC-R link 2     |RNIC 4|   |RTokenW2|
   |        |   |MAC MC|<-------------------->|MAC MD|   |        |
   |  QP 9      |GID GC|                      |GID GD|   QP65     |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

      SYN / SYN-ACK / ACK TCP 3-way handshake with TCP option
   <--------------------------------------------------------->

                   SMC Proposal(PC1,MB,GB,S1)
   <--------------------------------------------------------

   SMC Accept(PS1,not 1st contact,MA,GA,QP8,RToken=X,RMB elem index)
   --------------------------------------------------------->

                   Confirm Rkey(Request, Add,
   <........................................................
       RToken set((Link1, RToken Y2),(Link2, RToken W2)))

                   Confirm Rkey(Response, Add,
   ........................................................>
       RToken set((Link1, RToken Y2),(Link2, RToken W2)))

      SMC Confirm(PC1,MB,GB,QP64,RToken=Y2, RMB element index)
   <--------------------------------------------------------

   Legend:
   ------------ TCP/IP and CLC flows
   ............ RoCE (LLC) flows

         Figure 13 Client adds RMB during TCP connection setup

3.5.5.2.2. Deleting an RMB from an SMC-R link group

   Either peer can delete one or more of its RMBs as long as they are
   not being used for any TCP connections.
   Ideally an SMC-R peer would use
   a timer to avoid freeing an RMB immediately after the last TCP
   connection stops using it, to keep the RMB available for later TCP
   connections and avoid thrashing with addition and deletion of RMBs.
   Once an SMC-R peer decides to delete an RMB, it sends a DELETE RKEY
   LLC message to its peer. It can then free the RMB once it receives a
   response from the peer. Multiple RMBs can be deleted in a DELETE
   RKEY exchange.

   Note that in a DELETE RKEY message, it is not necessary to specify
   the full RToken for a deleted RMB. The RMB's Rkey over one link in
   the link group is sufficient to specify which RMB is being deleted.

          Host X                                     Host Y
   +-------------------+                      +-------------------+
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|     SMC-R Link 1     |RNIC 2|   QP 9     |
   |RToken X|   |      |<-------------------->|      |            |
   |        |   |      |                      |      |            |
   |       \/   +------+                      +------+            |
   |+--------+         |                      |                   |
   || deleted|         |                      |                   |
   ||   RMB  |         |                      |                   |
   ||        |         |                      |                   |
   |+--------+         |                      |                   |
   |       /\   +------+                      +------+            |
   |RToken Z|   |      |     SMC-R Link 2     |      |            |
   |        |   |RNIC 3|<-------------------->|RNIC 4|            |
   |  QP 9      |      |                      |      |            |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

             DELETE RKEY(Request, Rkey list(Rkey X))
        ................................................>

             DELETE RKEY(Response, Rkey list(Rkey X))
        <................................................

        (note, this exchange can flow over either SMC-R link)

              Figure 14 Delete RMB from SMC-R link group

3.5.5.2.3. Adding a new SMC-R link to a link group with multiple RMBs

   When a new SMC-R link is added to an existing link group, there could
   be multiple RMBs on each side already associated with the link group.
   There could also be a different number of RMBs on one side than on
   the other, because each peer manages its RMBs independently.
   Each of
   these RMBs will require a new RToken to be used on the new SMC-R
   link, and then those new RTokens must be communicated to the peer.
   This requires two-way communication as the server will have to
   communicate its RTokens to the client and vice versa.

   RTokens are communicated between peers in pairs. Each RToken pair
   consists of:

   o The RToken for the RMB, as is already known on an existing SMC-R
     link in the link group

   o The RToken for the same RMB, to be used on the new SMC-R link.

   These pairs are required to ensure that each peer knows which RTokens
   across QPs are equivalent.

   The "Add Link" request and response LLC messages do not have room to
   contain any RToken pairs. "Add Link Continuation" LLC messages are
   used to communicate these pairs, as shown in Figure 15. The "Add
   Link Continuation" LLC messages are sent on the same SMC-R link that
   the "Add Link" LLC messages were sent over, and in both the "Add
   Link" and the "Add Link Continuation" LLC messages, the first RToken
   in each RToken pair will be the RToken for the RMB as known on the
   SMC-R link that the LLC message is being sent over.
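
   The pairing rule above can be sketched in a few lines of Python.
   This is a non-normative illustration: the function rtoken_pairs and
   the dictionary layout are hypothetical, RTokens are shown as single
   letters, and the assumption that the existing link is link number 1
   is made only for the example.

```python
def rtoken_pairs(rmbs, sending_link, new_link):
    """Build (RToken on sending link, RToken on new link) pairs.

    rmbs is a list with one dict per RMB, mapping link number -> RToken.
    The first element of each pair is the RToken as known on the link
    that carries the LLC message.
    """
    return [(tokens[sending_link], tokens[new_link]) for tokens in rmbs]


# Three RMBs known as X, Y, Z on link 1 and as U, V, W on new link 2.
server_rmbs = [{1: "X", 2: "U"}, {1: "Y", 2: "V"}, {1: "Z", 2: "W"}]
print(rtoken_pairs(server_rmbs, sending_link=1, new_link=2))
# prints [('X', 'U'), ('Y', 'V'), ('Z', 'W')]
```

   Each peer builds its own list from its own RMBs, so the two
   directions can carry a different number of pairs.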
     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   |   PeerID = PS1    |                      |   PeerID = PC1    |
   |            +------+                      +------+            |
   |  QP 8      |RNIC 1|                      |RNIC 2|   QP 64    |
   |Rkey Set|   |MAC MA|                      |MAC MB|   |Rkey set|
   |X,Y,Z   |   |GID GA|                      |GID GB|   |Q,R,S,T |
   |       \/   +------+                      +------+   \/       |
   |+--------+         |                      |         +--------+|
   || 3 RMBs |         |                      |         | 4 RMBs ||
   |+--------+         |                      |         +--------+|
   |       /\   +------+                      +------+   /\       |
   |Rkey set|   |RNIC 3|                      |RNIC 4|   |Rkey set|
   |U,V,W   |   |MAC MC|                      |MAC MD|   |L,M,N,P |
   |  QP 9      |GID GC|                      |GID GD|   QP 65    |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

          ADD link request (QP9,MC,GC, link number=2)
        ............................................>

          ADD link response (QP65,MD,GD, link number=2)
        <............................................

    ADD link continuation req(RToken Pairs=((X,U),(Y,V),(Z,W)))
        ............................................>

   ADD link continuation rsp(RToken Pairs=((Q,L),(R,M),(S,N),(T,P)))
        <.............................................

             Confirm Link Req/Rsp exchange on link 2
        <.............................................>

   Legend:
   ------------ TCP/IP and CLC flows
   ............ RoCE (LLC) flows

   Figure 15 Exchanging Rkeys when a new link is added to a link group

3.5.5.3. Serialization of LLC exchanges, and collisions

   LLC flows can be divided into two main groups for serialization
   considerations.

   The first group is LLC messages that are independent and can flow at
   any time. These are one-time, unsolicited messages that either do
   not have a required response, or that have a simple response that
   does not interfere with the operations of another group of messages.
   These messages are:

   o TEST LINK from either the client or the server: This message
     requires a TEST LINK response to be returned, but does not affect
     the configuration of the link group or the Rkeys.

   o ADD LINK from the client to the server: This message is provided
     as an "FYI" to the server to let it know that the client has an
     additional RNIC available. The server is not required to act upon
     or respond to this message.

   o DELETE LINK from the client to the server: This message informs
     the server that the client has either experienced an error or
     problem that requires a link or link group to be terminated, or
     that an operator has commanded that a link or link group be
     terminated. The server does not respond directly to the message;
     rather, it initiates a DELETE LINK exchange as a result of
     receiving it.

   o DELETE LINK from the server to the client with the "delete entire
     link group" flag set: This message informs the client that the
     entire link group is being deleted.

   The second group is LLC messages that are part of an exchange of LLC
   messages that affects link group configuration and that must complete
   before another exchange of LLC messages that affects link group
   configuration can be processed. When a peer knows that one of these
   exchanges is in progress, it must not start another exchange. These
   exchanges are:

   o ADD LINK / ADD LINK response / ADD LINK CONTINUATION / ADD LINK
     CONTINUATION response / CONFIRM LINK / CONFIRM LINK RESPONSE:
     This exchange, by adding a new link, changes the configuration of
     the link group.

   o DELETE LINK / DELETE LINK response initiated by the server: This
     exchange, by deleting a link, changes the configuration of the
     link group.

   o CONFIRM RKEY / CONFIRM RKEY response or DELETE RKEY / DELETE RKEY
     response: This exchange changes the RMB configuration of the link
     group.
     Rkeys cannot change while links are being added or
     deleted (while ADD or DELETE LINK is in progress). However,
     CONFIRM RKEY and DELETE RKEY are unique in that both the client
     and server can independently manage (add or remove) their own
     RMBs. This allows each peer to concurrently change its Rkeys
     and therefore concurrently send CONFIRM RKEY or DELETE RKEY
     requests. The concurrent CONFIRM RKEY or DELETE RKEY requests can
     be independently processed and do not represent a collision.

   Because the server is in control of the configuration of the link
   group, many timing windows and collisions are avoided, but there are
   still some that must be handled.

3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK exchange

   Colliding LLC message: TEST LINK

      Action to resolve: Send immediate TEST LINK reply

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Server ignores the ADD LINK message. When
      client receives server's ADD LINK, client will consider that
      message to be in response to its ADD LINK message and the flow
      works. Since both client and server know not to start this
      exchange if an ADD LINK operation is already underway, this can
      only occur if the client sends this message before receiving the
      server's ADD LINK and this message crosses with the server's ADD
      LINK message; therefore, the server's ADD LINK arrives at the
      client immediately after the client sent this message.

   Colliding LLC message: DELETE LINK from client to server, specific
   link specified

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the ADD LINK exchange completes. If it is an
      orderly link termination, it can wait until after this exchange
      completes.
      If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from client to server, entire link
   group to be deleted

      Action to resolve: Immediately clean up the link group

   Colliding LLC message: CONFIRM RKEY from the client

      Action to resolve: Send negative CONFIRM RKEY response to the
      client. Once the current exchange finishes, client will have to
      recompute its Rkey set to include the new link, and start a new
      CONFIRM RKEY exchange.

3.5.5.3.2. Collisions during DELETE LINK exchange

   Colliding LLC message: TEST LINK from either peer

      Action to resolve: Send immediate TEST LINK response

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Server queues the ADD LINK and processes it
      after the current exchange completes

   Colliding LLC message: DELETE LINK from client to server (specific
   link)

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the current exchange completes. If it is an
      orderly link termination, it can wait until after this exchange
      completes. If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from either client or server,
   deleting the entire link group

      Action to resolve: Immediately clean up the link group

   Colliding LLC message: CONFIRM RKEY from client to server

      Action to resolve: Send negative CONFIRM RKEY response to the
      client. Once the current exchange finishes, client will have to
      recompute its Rkey set to account for the deleted link, and
      start a new CONFIRM RKEY exchange.

3.5.5.3.3. Collisions during CONFIRM RKEY exchange

   Colliding LLC message: TEST LINK

      Action to resolve: Send immediate TEST LINK reply

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Queue the ADD LINK and process it after the
      current exchange completes

   Colliding LLC message: ADD LINK from server to client (CONFIRM RKEY
   exchange was initiated by the client and it crossed with the server
   initiating an ADD LINK exchange)

      Action to resolve: Process the ADD LINK. Client will receive a
      negative CONFIRM RKEY from the server and will have to redo this
      CONFIRM RKEY exchange after the ADD LINK exchange completes.

   Colliding LLC message: DELETE LINK from client to server, specific
   link to be deleted (CONFIRM RKEY exchange was initiated by the server
   and it crossed with the client's DELETE LINK request)

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the current exchange completes. If it is an
      orderly link termination, it can wait until after this exchange
      completes. If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from server to client, specific
   link deleted (CONFIRM RKEY exchange was initiated by the client and
   it crossed with the server's DELETE LINK)

      Action to resolve: Process the DELETE LINK. Client will receive a
      negative CONFIRM RKEY from the server and will have to redo this
      CONFIRM RKEY exchange after the DELETE LINK exchange completes.
   Colliding LLC message: DELETE LINK from either client or server,
   entire link group deleted

      Action to resolve: Immediately clean up the link group

   Colliding LLC message: CONFIRM LINK from the peer that did not start
   the current CONFIRM LINK exchange

      Action to resolve: Queue the request and process it after the
      current exchange completes.

4. SMC-R memory sharing architecture

4.1. RMB element allocation considerations

   Each TCP connection using SMC-R must be allocated an RMBE by each
   SMC-R peer. This allocation is performed by each endpoint
   independently to allow each endpoint to select an RMBE that best
   matches the characteristics of its TCP socket endpoint. The RMBE
   associated with a TCP socket endpoint must have a receive buffer that
   is at least as large as the TCP receive buffer size in effect for
   that connection. The receive buffer size is either specified
   explicitly by the application using setsockopt() or taken implicitly
   from the system-configured default value. This will allow sufficient
   data to be RDMA written by the SMC-R peer to fill an entire receive
   buffer size worth of data on a given data flow. Given that each RMB
   must have fixed-length RMBEs, this implies that an SMC-R endpoint may
   need to maintain multiple RMBs of various sizes for SMC-R connections
   on a given SMC link, and can then select an RMBE that most closely
   fits a connection.

4.2. RMB and RMBE format

   An RMB is a virtual memory buffer whose backing real memory is
   pinned and which is divided into a whole number of equal-sized RMB
   Elements (RMBEs). Each RMBE begins with a four-byte eyecatcher for
   diagnostic and service purposes, followed by the receive data buffer.
   The contents of this diagnostic eyecatcher are implementation
   dependent and should be used by the local SMC-R peer to check for
   overlay errors by verifying an intact eyecatcher with every RMBE
   access.

   The RMBE is a wrapping receive buffer for receiving RDMA writes from
   the peer. Cursors, as described below, are exchanged between peers
   to manage and track RDMA writes and local data reads from the RMBE
   for a TCP connection.

4.3. RMBE control information

   RMBE control information consists of consumer and producer cursors,
   wrap counts, CDC message sequence numbers, control flags such as
   urgent data and writer blocked indicators, and TCP connection
   information such as termination flags. This information is exchanged
   between SMC-R peers using CDC messages, which are passed using RDMA
   message passing with inline data, with the control information
   contained in the inline data. A TCP/IP stack implementing SMC-R must
   receive and store this information in its internal data structures as
   it is used to manage the RMBE and its data buffer.

   The format and contents of the CDC message are described in detail in
   4.3. RMBE control information. The following is a high level
   description of what this control information contains.

   o Connection state flags such as sending done, connection closed,
     failover data validation, and abnormal close

   o A sequence number that is managed by the sender. This sequence
     number starts at 1, is incremented on each send, and wraps to 0.
     This sequence number tracks the CDC messages sent and is not
     related to the number of bytes sent. It is used for failover data
     validation.

   o Producer cursor: a wrapping offset into the receiver's RMBE data
     area. Set by the peer that is writing into the RMBE, it points to
     where the writing peer will write the next byte of data into an
     RMBE.
     This cursor is accompanied by a wrap sequence number to help
     the RMBE owner (the receiver) identify full window size wrapping
     writes. Note that this cursor must account for (i.e., skip over)
     the RMBE eyecatcher that is at the beginning of the data area.

   o Consumer cursor: a wrapping offset into the receiver's RMBE data
     area. Set by the owner of the RMBE (the peer that is reading from
     it), this cursor points to the offset of the next byte of data to
     be consumed by the peer in its own RMBE. The sender cannot write
     beyond this cursor into the receiver's RMBE without causing data
     loss. Like the producer cursor, this is accompanied by a wrap
     count to help the writer identify full window size wrapping reads.
     Note that this cursor must account for (i.e., skip over) the RMBE
     eyecatcher that is at the beginning of the data area.

   o Data flags such as urgent data, writer blocked indicator, and
     cursor update requests.

4.4. Use of RMBEs

4.4.1. Initializing and accessing RMBEs

   The RMBE eyecatcher is initialized by the RMB owner prior to
   assigning the RMBE to a specific TCP connection and communicating its
   RMBE index to the SMC-R partner. After an RMBE index is communicated
   to the SMC-R partner, the RMBE can only be referenced in "read only
   mode" by the owner, and all updates to it are performed by the remote
   SMC-R partner via RDMA write operations.

   Initialization of an RMBE must include the following:

   o Zeroing out the entire RMBE receive buffer, which helps minimize
     data integrity issues (e.g., data from a previous connection
     somehow being presented to the current connection).

   o Setting the beginning RMBE eyecatcher. This eyecatcher plays an
     important role in helping detect accidental overlays of the RMBE.
     The RMB owner must always validate the eyecatcher before each
     new reference to the RMBE.
     If the eyecatcher is found to be
     corrupted, the local host must reset the TCP connection associated
     with this RMBE and log the appropriate diagnostic information.

4.4.2. RMB element reuse and conflict resolution

   RMB elements can be reused once their associated TCP and SMC-R
   connections are terminated. Under normal and abnormal SMC-R
   connection termination processing, both SMC-R peers must explicitly
   acknowledge that they are done using an RMBE before that element can
   be freed and reassigned to another SMC-R connection instance. For
   more details on SMC-R connection termination refer to section 4.8.
   However, there are some error scenarios where this two-way explicit
   acknowledgement may not be completed. In these scenarios (mentioned
   explicitly elsewhere in this document) an RMBE owner may choose to
   reassign this RMBE to a new SMC-R connection instance on this SMC
   link group. When this occurs, the partner SMC-R peer must detect this
   condition during SMC-R rendezvous processing when presented with an
   RMBE that it believes is already in use for a different SMC-R
   connection. In this case, the SMC-R peer must abort the existing
   SMC-R connection associated with this RMBE. The abort processing
   resets the TCP connection (if it is still active), but it must not
   attempt to perform any RDMA writes to this RMBE and must also ignore
   any data sitting in the local RMBE associated with the existing
   connection. It then proceeds to free up the local RMBE and notify
   the local application that the connection is being abnormally reset.

   The remote SMC-R peer then proceeds to normal processing for this new
   SMC-R connection.

4.5. SMC-R protocol considerations

   The following sections describe considerations for the SMC-R protocol
   as compared to the TCP protocol.

4.5.1. SMC-R protocol optimized window size updates

   An SMC-R receiver host sends its Consumer Cursor information to the
   sender to convey the progress that the receiving application has made
   in consuming the sent data. The difference between the writer's
   Producer Cursor and the associated receiver's Consumer Cursor
   indicates the window size available for the sender to write into.
   This is somewhat similar to TCP window update processing and
   therefore has some similar considerations, such as silly window
   syndrome avoidance, whereby the TCP protocol has an optimization that
   minimizes the overhead of very small, unproductive window size
   updates associated with sub-optimal socket applications consuming
   very small amounts of data on every receive() invocation. For SMC-R,
   the receiver only updates its Consumer Cursor via a unique CDC
   message under the following conditions:

   o The current window size (from a sender's perspective) is less than
     half of the receive buffer space and the Consumer Cursor update
     will result in a minimum increase in the window size of 10% of the
     receive buffer space. Some examples:

     a. Receive buffer size: 64K, current window size (from a
        sender's perspective): 50K. No need to update the Consumer
        Cursor. Plenty of space is available for the sender.

     b. Receive buffer size: 64K, current window size (from a
        sender's perspective): 30K, current window size from a
        receiver's perspective: 31K. No need to update the Consumer
        Cursor; even though the sender's window size < 1/2 of the
        64K, the window update would only increase that by 1K which
        is < 1/10th of the 64K buffer size.

     c. Receive buffer size: 64K, current window size (from a
        sender's perspective): 30K, current window size from a
        receiver's perspective: 64K.
         The receiver updates the Consumer Cursor (the sender's window
         size is < 1/2 of the 64K, and the window update would
         increase that by > 6.4K).

   o  The receiver must always include a Consumer Cursor update
      whenever it sends a CDC message to the partner for another flow
      (i.e. a send flow in the opposite direction). This allows the
      window size update to be delivered with no additional overhead.
      This is somewhat similar to TCP delayed ACK processing and is
      quite effective for request/response data patterns.

   o  If a peer has set the B-bit in a CDC message, then any
      consumption of data by the receiver causes a CDC message to be
      sent updating the Consumer Cursor until a CDC message with that
      bit cleared is received from the peer.

   o  The optimized window size updates are overridden when the sender
      sets the Consumer Cursor Update Requested flag in a CDC message
      to the receiver. When this indicator is on, the consumer must
      send a Consumer Cursor update immediately when data is consumed
      by the local application or if the cursor has not been updated
      for a while (i.e. the local copy of the consumer cursor does not
      match the last consumer cursor value sent to the partner). This
      allows the sender to perform optional diagnostics for detecting a
      stalled receiver application (data has been sent but not
      consumed). It is recommended that the Consumer Cursor Update
      Requested flag only be sent for diagnostic procedures, as it may
      result in non-optimal data path performance.

4.5.2. Small data sends

   The SMC-R protocol makes no special provisions for handling small
   data segments sent across a stream socket. Data is always sent if
   sufficient window space is available. There are no special
   provisions for coalescing small data segments comparable to the TCP
   Nagle algorithm.
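
   As a rough illustration of the Consumer Cursor update conditions
   listed in section 4.5.1, the receiver's decision could be sketched
   as follows. This is only a sketch under stated assumptions, not a
   normative algorithm: the function name, its parameters, and the
   reduction of each condition to a flag or a byte count are
   hypothetical and do not appear in this specification. The receiver
   window used for example (a) is also an assumed value, since the
   text does not give one.

```python
def should_send_cursor_update(buf_size, sender_window, receiver_window,
                              piggyback=False, peer_blocked=False,
                              update_requested=False):
    """Decide whether the receiver sends a Consumer Cursor update.

    All names here are hypothetical. Sizes are in bytes.
      buf_size         -- Receive Buffer size
      sender_window    -- window size as the sender currently sees it
      receiver_window  -- window size as the receiver currently sees it
      piggyback        -- a CDC message is being sent for another flow
      peer_blocked     -- the peer has set the B-bit
      update_requested -- Consumer Cursor Update Requested flag is on
    """
    # Space freed locally that the sender does not know about yet.
    gain = receiver_window - sender_window

    # An update always rides along on a CDC message sent anyway.
    if piggyback:
        return True

    # B-bit or explicit request: report any consumption at once.
    if (peer_blocked or update_requested) and gain > 0:
        return True

    # Optimized case: sender sees less than half of the buffer AND the
    # update would open the window by at least 10% of the buffer.
    return 2 * sender_window < buf_size and 10 * gain >= buf_size


K = 1024
# Examples a, b and c from section 4.5.1 (60K in "a" is assumed).
print(should_send_cursor_update(64 * K, 50 * K, 60 * K))  # a: False
print(should_send_cursor_update(64 * K, 30 * K, 31 * K))  # b: False
print(should_send_cursor_update(64 * K, 30 * K, 64 * K))  # c: True
```

   Integer comparisons (2 * window < size, 10 * gain >= size) are used
   instead of division so that the thresholds are exact for any buffer
   size.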

   An implementation of SMC-R may optimize its sending processing by
   coalescing outbound data for a given SMC-R connection so that it can
   reduce the number of RDMA write operations it performs, in a similar
   fashion to Nagle's algorithm. However, any such coalescing would
   require a timer on the sending host to ensure that data is
   eventually sent. The sending host would also have to opt out of this
   processing if Nagle's algorithm had been disabled (programmatically
   or via system configuration).

4.5.3. TCP Keepalive processing

   TCP keepalive processing allows applications to direct the local
   TCP/IP host to periodically "test" the viability of an idle TCP
   connection. Since SMC-R connections have both a TCP representation
   and an SMC-R representation, there are unique keepalive processing
   considerations:

   o  SMC-R layer keepalive processing: If keepalive is enabled for an
      SMC-R connection, the local host maintains a keepalive timer that
      reflects how long an SMC-R connection has been idle. The local
      host also maintains a timestamp of last activity for each SMC
      link (for any SMC-R connection on that link). When it is
      determined that an SMC-R connection has been idle longer than the
      keepalive interval, the host checks whether the SMC-R link has
      been idle for a duration longer than the keepalive timeout. If
      both conditions are met, the local host then performs a Test Link
      LLC command to test the viability of the SMC link over the RoCE
      fabric (RC-QPs). If a Test Link LLC command response is received
      within a reasonable amount of time, then the link is considered
      viable and all connections using this link are considered viable
      as well.
      If, however, a response is not received in a reasonable amount of
      time or there is a failure in sending the Test Link LLC command,
      then this is considered a failure in the SMC link, and failover
      processing to an alternate SMC link must be triggered. If no
      alternate SMC link exists in the SMC link group, then all the
      SMC-R connections on this link are abnormally terminated by
      resetting the TCP connections represented by these SMC-R
      connections. Given that multiple SMC-R connections can share the
      same SMC link, implementing an SMC link level probe using the
      Test Link LLC command will help reduce the amount of unproductive
      keepalive traffic for SMC-R connections; as long as some SMC-R
      connections on a given SMC link are active (i.e. have had I/O
      activity within the keepalive interval), there is no need to
      perform additional link viability testing.

   o  TCP layer keepalive processing: Traditional TCP "keepalive"
      packets are not as relevant for SMC-R connections given that the
      TCP path is not used for these connections once the SMC-R
      rendezvous processing is completed. All SMC-R connections by
      default have associated TCP connections that are idle. Are TCP
      keepalive probes still needed for these connections? There are
      two main scenarios to consider:

      1. TCP keepalives that are used to determine whether the peer TCP
         endpoint is still active. This is not needed for SMC-R
         connections, as the SMC-R level keepalives mentioned above
         will determine whether the remote endpoint connections are
         still active.

      2. TCP keepalives that are used to ensure that TCP connections
         traversing an intermediate proxy maintain an active state. For
         example, stateful firewalls typically maintain state
         representing every valid TCP connection that traverses the
         firewall.
         These types of firewalls are known to expire idle connections
         by removing their state in the firewall to conserve memory.
         TCP keepalives are often used in this scenario to prevent
         firewalls from timing out otherwise idle connections. When
         using SMC-R, both end points must reside in the same layer 2
         network (i.e. the same subnet). As a result, firewalls cannot
         be injected in the path between two SMC-R endpoints. However,
         other intermediate proxies, such as TCP/IP layer load
         balancers, may be injected in the path of two SMC-R endpoints.
         These types of load balancers also maintain connection state
         so that they can forward TCP connection traffic to the
         appropriate cluster end point. When using SMC-R, these TCP
         connections will appear to be completely idle, making them
         susceptible to potential timeouts at the load balancer proxy.
         As a result, for this scenario, TCP keepalives may still be
         relevant.

   The following are the TCP level keepalive processing requirements
   for SMC-R enabled hosts:

   o  SMC-R peers should allow TCP keepalives to flow on the TCP path
      of SMC-R connections based on existing TCP keepalive
      configuration and programming options. However, it is strongly
      recommended that platforms that provide the ability to specify
      very granular keepalive timers (for example, single-digit second
      timers) consider providing a configuration option that limits the
      minimum keepalive timer that will be used for TCP layer
      keepalives on SMC-R connections. This is important to minimize
      the number of TCP keepalive packets transmitted in the network
      for SMC-R connections.

   o  SMC-R peers must always respond to inbound TCP layer keepalives
      (by sending ACKs for these packets) even if the connection is
      using SMC-R.
      Typically, once a TCP connection has completed the SMC-R
      rendezvous processing and is using SMC-R for data flows, no new
      inbound TCP segments are expected on that TCP connection other
      than TCP termination segments (FIN, RST, etc.). TCP keepalives
      are the one exception that must be supported. Since TCP keepalive
      probes do not carry any application layer data, this has no
      adverse impact on the application's inbound data stream.

4.6. TCP connection failover between SMC-R links

   A peer may change which SMC-R link within a link group it sends its
   writes over in the event of a link failure. Since each peer
   independently chooses which link to send writes over for a specific
   TCP connection, this process is done independently by each peer.

4.6.1. Validating data integrity

   Even though RoCE is a reliable transport, there is a small subset of
   failure modes that could cause unrecoverable loss of data. When an
   RNIC acknowledges receipt of an RDMA write to its peer, that creates
   a write completion event to the sending peer, which allows the
   sender to release any buffers it is holding for that write. In
   normal operation and in most failures, this operation is reliable.

   However, there are failure modes possible in which a receiving RNIC
   has acknowledged an RDMA write but then was not able to place the
   received data into its host memory, for example a sudden, disorderly
   failure of the interface between the RNIC and the host. While rare,
   these types of events must be guarded against to ensure data
   integrity. The process for switching SMC-R links during failover
   that is described in this section guards against this possibility,
   and is mandatory.

   Each peer must track the current state of the CDC sequence numbers
   for a TCP connection.
   The sender must keep track of SS, which is the sequence number of
   the CDC message that described the last write acknowledged by the
   peer RNIC. In other words, SS describes the last write that the
   sender believes its peer has successfully received. The receiver
   must keep track of SR, the sequence number of the CDC message that
   described the last write that it has successfully received, i.e.,
   the data has been successfully placed into an RMBE.

   When an RNIC fails and the sender changes SMC-R links, the sender
   must first send a CDC message with the 'F' flag set over the new
   SMC-R link. This is the failover data validation message. The
   sequence number in this CDC message is equal to SS. The CDC message
   key, the length, and the SMC-R alert token are the only other fields
   in this CDC message that are significant. No reply is expected from
   this validation message, and once the sender has sent it, the sender
   may resume sending on the new SMC-R link as described in section
   4.6.2 below.

   Upon receipt of the failover validation message, the receiver must
   verify that its SR value for the TCP connection is equal to or
   greater than the sequence number in the failover validation message.
   If so, no further action is required and the TCP connection resumes
   on the new SMC-R link. If SR is less than the sequence number value
   in the validation message, data has been lost and the receiver must
   immediately reset the TCP connection.

4.6.2. Resuming the TCP connection on a new SMC-R link

   When a connection is moved to a new SMC-R link and the failover
   validation message has been sent, the sender can immediately resume
   normal transmission.
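
   The failover validation exchange of section 4.6.1 can be sketched
   as follows. This is a minimal illustration under stated
   assumptions, not the wire format: the FailoverValidation class, its
   field names, and the "resume"/"reset" return values are
   hypothetical, and the comparison deliberately ignores sequence
   number wraparound, which a real implementation would have to
   handle.

```python
from dataclasses import dataclass


@dataclass
class FailoverValidation:
    """Hypothetical view of the CDC message sent with the 'F' flag."""
    f_flag: bool  # marks this CDC message as failover data validation
    seq: int      # carries the sender's SS value


def build_failover_validation(ss):
    """Sender side: the first message on the new SMC-R link carries SS."""
    return FailoverValidation(f_flag=True, seq=ss)


def receiver_check(sr, msg):
    """Receiver side: compare the local SR with the sequence number sent.

    Returns "resume" if every write the sender believes was delivered
    was in fact placed into the RMBE (SR >= SS), otherwise "reset",
    meaning the TCP connection must be reset because data was lost.
    Assumption: no sequence number wraparound.
    """
    if not msg.f_flag:
        raise ValueError("not a failover validation message")
    return "resume" if sr >= msg.seq else "reset"


# Receiver placed everything the sender saw acknowledged: resume.
print(receiver_check(10, build_failover_validation(10)))
# RNIC acknowledged write 10 but only write 9 reached host memory: reset.
print(receiver_check(9, build_failover_validation(10)))
```

   No reply message is modeled because the protocol does not define
   one; the receiver either continues silently or resets the TCP
   connection.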
   In order to preserve the application message stream, the sender
   must replay any RDMA writes (and their associated CDC messages) that
   were in progress or failed when the previous SMC-R link failed,
   before sending new data on the new SMC-R link. The sender has two
   options for accomplishing this:

   o  Preserve the sequence numbers "as is": Retry all failed and
      pending operations as they were originally done, including
      reposting all associated RDMA write operations and their
      associated CDC messages without making any changes. Then resume
      sending new data using new sequence numbers.

   o  Combine pending messages and possibly add new data: Combine
      failed and pending messages into a single new write with a new
      sequence number. This allows the sender to combine pending
      messages into fewer operations. As a further optimization, this
      write can also include new data, as long as all failed and
      pending data is also included. If this approach is taken, the
      sequence number must be increased beyond the last failed or
      pending sequence number.

4.7. RMB data flows

   The following sections describe the RDMA wire flows for the SMC-R
   protocol after a TCP connection has switched into SMC-R mode (i.e.
   SMC-R rendezvous processing is complete and a pair of RMB elements
   has been assigned and communicated by the SMC-R peers). The ladder
   diagrams below include the following:

   o  RMBE control information kept by each peer. Only a subset of the
      information is depicted, specifically only the fields that
      reflect the stream of data written by Host A and read by Host B.

   o  Time line 0-x that shows the wire flows in a time relative
      fashion.

   o  Note that RMBE control information is only shown in a time
      interval if its value changed (otherwise assume the value is
      unchanged from the previously depicted value).

   o  The local copy of the producer and consumer cursors that is
      maintained by each host is not depicted in these figures. Note
      that the cursor values in the diagram reflect the necessity of
      skipping over the eyecatcher in the RMBE data area. They start
      and wrap at 4, not 0.

4.7.1. Scenario 1: Send flow, window size unconstrained

        SMC Host A                                 SMC Host B
        RMBE A Info                                RMBE B Info
     (Consumer Cursors)                         (Producer Cursors)
   Cursor Wrap Seq# Time                   Time Cursor Wrap Seq# Flags
       4      0       0                     0       4      0     0
       4      0       1   --------------->  1       4      0     0
                            RDMA-WR Data
                              (4:1003)
       4      0       2   ...............>  2    1004      0     0
                             CDC Message

      Figure 16 Scenario 1: Send flow, window size unconstrained

   Scenario assumptions:

   o  Kernel implementation

   o  New SMC-R connection, no data has been sent on the connection

   o  Host A: Application issues send for 1,000 bytes to Host B

   o  Host B: RMBE receive buffer size is 10,000, application has
      issued a recv for 10,000 bytes

   Flow description:

   1. Application issues send() for 1,000 bytes; the SMC-R layer copies
      data into a kernel send buffer. It then schedules an RDMA write
      operation to move the data into the peer's RMBE receive buffer,
      at relative position 4-1003 (to skip the four byte eyecatcher in
      the RMBE data area). Note that no immediate data or alert (i.e.
      interrupt) is provided to Host B for this RDMA operation.

   2. Host A sends a CDC message to update the Producer Cursor to byte
      1004. This CDC message will deliver an interrupt to Host B. At
      this point, the SMC-R layer can return control back to the
      application.
      Host B, once notified of the completion of the previous RDMA
      operation, locates the RMBE associated with the RMBE alert token
      that was included in the message and proceeds to perform normal
      receive side processing, waking up the suspended application read
      thread, copying the data into the application's receive buffer,
      etc. It will use the Producer Cursor as an indicator of how much
      data is available to be delivered to the local application. After
      this processing is complete, the SMC-R layer will also update its
      local Consumer Cursor to match the Producer Cursor (i.e.
      indicating that all data has been consumed). Note that a message
      to the peer updating the Consumer Cursor is not needed at this
      time as the window size is unconstrained (> 1/2 of the receive
      buffer size). The window size is calculated by taking the
      difference between the Producer and the Consumer cursors in the
      RMBEs (10,000-1,004=8,996).

4.7.2. Scenario 2: Send/Receive flow, window unconstrained

        SMC Host A                                 SMC Host B
        RMBE A Info                                RMBE B Info
     (Consumer Cursors)                         (Producer Cursors)
   Cursor Wrap Seq# Time                   Time Cursor Wrap Seq# Flags
       4      0       0                     0       4      0     0
       4      0       1   --------------->  1       4      0     0
                            RDMA-WR Data
                              (4:1003)
       4      0       2   ...............>  2    1004      0     0
                             CDC Message

       4      0       3   <---------------  3    1004      0     0
                            RDMA-WR Data
                              (4:503)
    1004      0       4   <...............  4    1004      0     0
                             CDC Message

      Figure 17 Scenario 2: Send/Recv flow, window size unconstrained

   Scenario assumptions:

   o  New SMC-R connection, no data has been sent on the connection

   o  Host A: Application issues send for 1,000 bytes to Host B

   o  Host B: RMBE receive buffer size is 10,000, application has
      already issued a recv for 10,000 bytes. Once the receive is
      completed, the application sends a 500 byte response to Host A.

   Flow description:

   1. Application issues send() for 1,000 bytes; the SMC-R layer copies
      data into a kernel send buffer. It then schedules an RDMA write
      operation to move the data into the peer's RMBE receive buffer,
      at relative position 4-1003. Note that no immediate data or
      alert (i.e. interrupt) is provided to Host B for this RDMA
      operation.

   2. Host A sends a CDC message to update the Producer Cursor to byte
      1004. This CDC message will deliver an interrupt to Host B. At
      this point, the SMC-R layer can return control back to the
      application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and
      proceeds to perform normal receive side processing, waking up the
      suspended application read thread, copying the data into the
      application's receive buffer, etc. After this processing is
      complete, the SMC-R layer will also update its local Consumer
      Cursor to match the Producer Cursor (i.e. indicating that all
      data has been consumed). Note that an update of the Consumer
      Cursor to the peer is not needed at this time as the window size
      is unconstrained (> 1/2 of the receive buffer size). The
      application then performs a send() for 500 bytes to Host A. The
      SMC-R layer will copy the data into a kernel buffer and then
      schedule an RDMA write into the partner's RMBE receive buffer.
      Note that this RDMA write operation includes no immediate data
      or notification to Host A.

   4. Host B sends a CDC message to update the partner's RMBE Control
      information with the latest Producer Cursor (set to 504 and not
      shown in the diagram above) and to also inform the peer that the
      Consumer Cursor value is now 1004. It also updates the local
      Current Consumer Cursor and Last Sent Consumer Cursor to 1004.
      This CDC message includes notification since we are updating our
      Producer Cursor, which requires attention by the peer host.

4.7.3. Scenario 3: Send Flow, window constrained

        SMC Host A                                 SMC Host B
        RMBE A Info                                RMBE B Info
     (Consumer Cursors)                         (Producer Cursors)
   Cursor Wrap Seq# Time                   Time Cursor Wrap Seq# Flags
       4      0       0                     0       4      0     0
       4      0       1   --------------->  1       4      0     0
                            RDMA-WR Data
                              (4:3003)
       4      0       2   ...............>  2    3004      0     0
                             CDC Message
       4      0       3                     3    3004      0     0
       4      0       4   --------------->  4    3004      0     0
                            RDMA-WR Data
                            (3004:7003)
       4      0       5   ...............>  5    7004      0     0
                             CDC Message
    7004      0       6   <...............  6    7004      0     0
                             CDC Message

      Figure 18 Scenario 3: Send Flow, window size constrained

   Scenario assumptions:

   o  New SMC-R connection, no data has been sent on this connection

   o  Host A: Application issues send for 3,000 bytes to Host B and
      then another send for 4,000 bytes

   o  Host B: RMBE receive buffer size is 10,000. Application has
      already issued a recv for 10,000 bytes

   Flow description:

   1. Application issues send() for 3,000 bytes; the SMC-R layer copies
      data into a kernel send buffer. It then schedules an RDMA write
      operation to move the data into the peer's RMBE receive buffer,
      at relative position 4-3003. Note that no immediate data or
      alert (i.e. interrupt) is provided to Host B for this RDMA
      operation.

   2. Host A sends a CDC message to update its Producer Cursor to byte
      3004. This CDC message will deliver an interrupt to Host B. At
      this point, the SMC-R layer can return control back to the
      application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and
      proceeds to perform normal receive side processing, waking up the
      suspended application read thread, copying the data into the
      application's receive buffer, etc.
      After this processing is complete, the SMC-R layer will also
      update its local Consumer Cursor to match the Producer Cursor
      (i.e. indicating that all data has been consumed). It will not,
      however, update the partner with this information as the window
      size is not constrained (10,000-3,000=7,000 of available space).
      The application on Host B also issues a new recv() for 10,000
      bytes.

   4. On Host A, the application issues a send() for 4,000 bytes. The
      SMC-R layer copies the data into a kernel buffer and schedules an
      async RDMA write into the peer's RMBE receive buffer at relative
      position 3004-7003. Note that no alert is provided to Host B for
      this flow.

   5. Host A sends a CDC message to update the Producer Cursor to byte
      7004. This CDC message will deliver an interrupt to Host B. At
      this point, the SMC-R layer can return control back to the
      application.

   6. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and
      proceeds to perform normal receive side processing, waking up the
      suspended application read thread, copying the data into the
      application's receive buffer, etc. After this processing is
      complete, the SMC-R layer will also update its local Consumer
      Cursor to match the Producer Cursor (i.e. indicating that all
      data has been consumed). It will then determine whether it needs
      to update the Consumer Cursor to the peer. The available window
      size is now 3,000 (10,000 - (Producer Cursor - Last Sent Consumer
      Cursor)), which is < 1/2 of the receive buffer size
      (10,000/2=5,000), and the advance of the window size is > 10% of
      the receive buffer size (1,000). Therefore a CDC message is
      issued to update the Consumer Cursor to Host A.

4.7.4. Scenario 4: Large send, flow control, full window size writes

        SMC Host A                                 SMC Host B
        RMBE A Info                                RMBE B Info
     (Consumer Cursors)                         (Producer Cursors)
   Cursor Wrap Seq# Time                   Time Cursor Wrap Seq# Flags
    1004      1       0                     0    1004      1     0
    1004      1       1   --------------->  1    1004      1     0
                            RDMA-WR Data
                            (1004:9999)
    1004      1       2   --------------->  2    1004      1     0
                            RDMA-WR Data
                              (4:1003)
    1004      1       3   ...............>  3    1004      2     WrtBlk
                             CDC Message
    1004      2       4   <...............  4    1004      2     WrtBlk
                             CDC Message
    1004      2       5   --------------->  5    1004      2     WrtBlk
                            RDMA-WR Data
                            (1004:9999)
    1004      2       6   --------------->  6    1004      2     WrtBlk
                            RDMA-WR Data
                              (4:1003)
    1004      2       7   ...............>  7    1004      3     WrtBlk
                             CDC Message
    1004      3       8   <...............  8    1004      3     WrtBlk
                             CDC Message

      Figure 19 Scenario 4: Large send, flow control, full window size
      writes

   Scenario assumptions:

   o  Kernel implementation

   o  Existing SMC-R connection, Host B's receive window size is fully
      open (Peer Consumer Cursor = Peer Producer Cursor).

   o  Host A: Application issues send for 20,000 bytes to Host B

   o  Host B: RMBE receive buffer size is 10,000, application has
      issued a recv for 10,000 bytes

   Flow description:

   1. Application issues send() for 20,000 bytes; the SMC-R layer
      copies data into a kernel send buffer (assumes send buffer space
      of 20,000 is available for this connection). It then schedules an
      RDMA write operation to move the data into the peer's RMBE
      receive buffer, at relative position 1004-9999. Note that no
      immediate data or alert (i.e. interrupt) is provided to Host B
      for this RDMA operation.

   2. Host A then schedules an RDMA write operation to fill the
      remaining 1000 bytes of available space in the peer's RMBE
      receive buffer, at relative position 4-1003. Note that no
      immediate data or alert (i.e. interrupt) is provided to Host B
      for this RDMA operation.
      Also note that an implementation of SMC-R may optimize this
      processing by combining steps 1 and 2 into a single RDMA write
      operation (with 2 different data sources).

   3. Host A sends a CDC message to update the Producer Cursor to byte
      1004. Since the entire receive buffer space is filled, the
      Producer Writer Blocked flag (WrtBlk indicator above) is set and
      the Producer Window Wrap Sequence Number (Producer Wrap Seq#
      above) is incremented. This CDC message will deliver an interrupt
      to Host B. At this point, the SMC-R layer can return control back
      to the application.

   4. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and
      proceeds to perform normal receive side processing, waking up the
      suspended application read thread, copying the data into the
      application's receive buffer, etc. In this scenario, Host B
      notices that the Producer Cursor has not been advanced (same
      value as the Consumer Cursor); however, it notices that the
      Producer Window Wrap Sequence Number is different from its local
      value (1), indicating that a full window of new data is
      available. All the data in the receive buffer can be processed,
      the first segment (1004-9999) followed by the second segment
      (4-1003). Because the Producer Writer Blocked indicator was set,
      Host B schedules a CDC message to update its latest information
      to the peer: Consumer Cursor (1004), Consumer Window Wrap
      Sequence Number (2: the current Producer Window Wrap Sequence
      Number is used).

   5. Host A, upon receipt of the CDC message, locates the TCP
      connection associated with the alert token and, upon examining
      the control information provided, notices that Host B has
      consumed all of the data (based on the Consumer Cursor and the
      Consumer Window Wrap Sequence Number) and initiates the next RDMA
      write to fill the receive buffer at offset 1004-9999.

   6. Host A then moves the next 1000 bytes into the beginning of the
      receive buffer (4-1003) by scheduling an RDMA write operation.
      Note at this point there are still 8 bytes remaining to be
      written.

   7. Host A then sends a CDC message to set the Producer Writer
      Blocked indicator and to increment the Producer Window Wrap
      Sequence Number (3).

   8. Host B, upon notification, completes the same processing as step
      4 above, including sending a CDC message to update the peer to
      indicate that all data has been consumed. At this point Host A
      can write the final 8 bytes to Host B's RMBE into positions
      1004-1011 (not shown).

4.7.5. Scenario 5: Send flow, urgent data, window size unconstrained

        SMC Host A                                 SMC Host B
        RMBE A Info                                RMBE B Info
     (Consumer Cursors)                         (Producer Cursors)
   Cursor Wrap Seq# Time                   Time Cursor Wrap Seq# Flags
    1000      1       0                     0    1000      1     0
    1000      1       1   --------------->  1    1000      1     0
                            RDMA-WR Data
                            (1000:1499)
    1000      1       2   ...............>  2    1500      1     UrgP
                             CDC Message                         UrgA

    1500      1       3   <...............  3    1500      1     UrgP
                             CDC Message                         UrgA

    1500      1       4   --------------->  4    1500      1     UrgP
                            RDMA-WR Data                         UrgA
                            (1500:2499)
    1500      1       5   ...............>  5    2500      1     0
                             CDC Message

      Figure 20 Scenario 5: Send flow, urgent data, window size open

   Scenario assumptions:

   o  Kernel implementation

   o  Existing SMC-R connection, window size open, all data has been
      consumed by receiver.

   o  Host A: Application issues send for 500 bytes with urgent data
      indicator (OOB) to Host B, then sends 1,000 bytes of normal data

   o  Host B: RMBE receive buffer size is 10,000, application has
      issued a recv for 10,000 bytes and is also monitoring the socket
      for urgent data

   Flow description:

   1. Application issues send() for 500 bytes of urgent data. The SMC-R
      layer copies data into a kernel send buffer. It then schedules an
      RDMA write operation to move the data into the peer's RMBE
      receive buffer, at relative position 1000-1499. Note that no
      immediate data or alert (i.e. interrupt) is provided to Host B
      for this RDMA operation.

   2. Host A sends a CDC message to update its Producer Cursor to byte
      1500 and to turn on the Producer Urgent Data Pending (UrgP) and
      Urgent Data Present (UrgA) flags. This CDC message will deliver
      an interrupt to Host B. At this point, the SMC-R layer can return
      control back to the application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token, notices
      that the Urgent Data Pending flag is on and proceeds with Out of
      Band socket API notification, for example, satisfying any
      outstanding select() or poll() requests on the socket by
      indicating that urgent data is pending (i.e. by setting the
      exception bit on). The Urgent Data Present indicator allows Host
      B to also determine the position of the urgent data (the Producer
      Cursor points one byte beyond the last byte of urgent data). Host
      B can then perform normal receive side processing (including
      specific urgent data processing), copying the data into the
      application's receive buffer, etc. Host B then sends a CDC
      message to update the partner's RMBE Control area with its latest
      Consumer Cursor (1500).
      Note this CDC message must occur regardless of the current local
      window size that is available. The partner host (Host A) cannot
      initiate any additional RDMA writes until acknowledgement that
      the urgent data has been processed (or at least
      processed/remembered at the SMC-R layer).

   4. Upon receipt of the message, Host A wakes up, sees that the peer
      consumed all data up to and including the last byte of urgent
      data, and now resumes sending any pending data. In this case, the
      application had previously issued a send for 1000 bytes of normal
      data, which would have been copied into the send buffer, and
      control would have been returned to the application. Host A now
      initiates an RDMA write to move that data to the peer's receive
      buffer at position 1500-2499.

   5. Host A then sends a CDC message with inline data to update its
      Producer Cursor value (2500) and turn off the Urgent Data Pending
      and Urgent Data Present flags. Host B wakes up, processes the new
      data (resumes application, copies data into the application
      receive buffer) and then proceeds to update the local current
      Consumer Cursor (2500). Given that the window size is
      unconstrained, there is no need for a Consumer Cursor update in
      the peer's RMBE.

4.7.6. Scenario 6: Send flow, urgent data, window size closed

        SMC Host A                                 SMC Host B
        RMBE A Info                                RMBE B Info
     (Consumer Cursors)                         (Producer Cursors)
   Cursor Wrap Seq# Time                   Time Cursor Wrap Seq# Flags
    1000      1       0                     0    1000      2     WrtBlk

    1000      1       1   ...............>  1    1000      2     WrtBlk
                             CDC Message                         UrgP

    1000      2       2   <...............  2    1000      2     WrtBlk
                             CDC Message                         UrgP

    1000      2       3   --------------->  3    1000      2     WrtBlk
                            RDMA-WR Data                         UrgP
                            (1000:1499)

    1000      2       4   ...............>  4    1500      2     UrgP
                             CDC Message                         UrgA

    1500      2       5   <...............  5    1500      2     UrgP
                             CDC Message                         UrgA

    1500      2       6   --------------->  6    1500      2     UrgP
                            RDMA-WR Data                         UrgA
                            (1500:2499)
    1500      2       7   ...............>  7    2500      2     0
                             CDC Message

      Figure 21 Scenario 6: Send flow, urgent data, window size closed

   Scenario assumptions:

   o  Kernel implementation

   o  Existing SMC-R connection, window size closed, writer is blocked.

   o  Host A: Application issues send for 500 bytes with urgent data
      indicator (OOB) to Host B, then sends 1,000 bytes of normal data.

   o  Host B: RMBE receive buffer size is 10,000, application has no
      outstanding recv() (for normal data) and is monitoring the socket
      for urgent data.

   Flow description:

   1. Application issues send() for 500 bytes of urgent data. The SMC-R
      layer copies data into a kernel send buffer (if available). Since
      the writer is blocked (window size closed), it cannot send the
      data immediately. It then sends a CDC message to notify the peer
      of the Urgent Data Pending (UrgP) indicator (the Writer Blocked
      indicator remains on as well). This serves as a signal to Host B
      that urgent data is pending in the stream. Control is also
      returned to the application at this point.

   2. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token, notices
      that the Urgent Data Pending flag is on and proceeds with Out of
      Band socket API notification, for example, satisfying any
      outstanding select() or poll() requests on the socket by
      indicating that urgent data is pending (i.e. by setting the
      exception bit on). At this point it is expected that the
      application will enter urgent data mode processing, expeditiously
      processing all normal data (by issuing recv API calls) so that it
      can get to the urgent data byte.
      Whether the application has this urgent mode processing or not,
      at some point the application will consume some or all of the
      pending data in the receive buffer. When this occurs, Host B
      will also send a CDC message with inline data to update its
      Consumer Cursor and Consumer Window Wrap Sequence Number to the
      peer. In the example above, a full window worth of data was
      consumed.

   3. Host A, once awakened by the message, will notice that the
      window size is now open on this connection (based on the
      Consumer Cursor and the Consumer Window Wrap Sequence Number,
      which now matches the Producer Window Wrap Sequence Number) and
      resume sending of the urgent data segment by scheduling an RDMA
      write into relative position 1000-1499.

   4. Host A then sends a CDC message to advance its Producer Cursor
      (1500) and to also notify Host B of the Urgent Data Present
      (UrgA) indicator (and turn off the Writer Blocked indicator).
      This signals to Host B that the urgent data is now in the local
      receive buffer and that the Producer Cursor points to the last
      byte of urgent data.

   5. Host B wakes up, processes the urgent data and, once the urgent
      data is consumed, sends a CDC message with inline data to update
      its Consumer Cursor (1500).

   6. Host A wakes up, sees that Host B has consumed the sequence
      number associated with the urgent data and then initiates the
      next RDMA write operation to move the 1000 bytes associated with
      the next send() of normal data into the peer's receive buffer at
      position 1500-2499. Note that the send() API would likely have
      completed earlier in the process by copying the 1000 bytes into
      a send buffer and returning to the application, even though no
      new data could be sent until the urgent data was processed and
      acknowledged by Host B.

   7. Host A sends a CDC message to advance its Producer Cursor to
      2500 and to reset the Urgent Data Pending and Urgent Data
      Present flags. Host B wakes up and processes the inbound data.

4.8. Connection termination

   Just as SMC-R connections are established using a combination of TCP
   connection establishment flows and SMC-R protocol flows, the
   termination of SMC-R connections also uses a similar combination of
   SMC-R protocol termination flows and normal TCP protocol connection
   termination flows. The following sections describe the SMC-R protocol
   normal and abnormal connection termination flows.

4.8.1. Normal SMC-R connection termination flows

   Normal SMC-R connection termination flows are triggered via the
   normal stream socket API semantics, namely by the application
   issuing a close() or shutdown() API. Most applications, after
   consuming all incoming data and after sending any outbound data,
   will then issue a close() API to indicate that they are done both
   sending and receiving data. Some applications, typically a small
   percentage, make use of the shutdown() API that allows them to
   indicate that the application is done sending data, receiving data
   or both sending and receiving data. The main use of this API is in
   scenarios where a TCP application wants to alert its partner end
   point that it is done sending data, yet is still receiving data on
   its socket (shutdown for write). Issuing shutdown for both sending
   and receiving data is really no different than issuing a close() and
   can therefore be treated in a similar fashion. Shutdown for read is
   typically not a very useful operation and in normal circumstances
   does not trigger any network flows to notify the partner TCP end
   point of this operation.

   These same trigger points will be used by the SMC-R layer to initiate
   SMC-R connection termination flows.
   The main design point for SMC-R normal connection termination flows
   is to use the SMC-R protocol to first shut down the SMC-R connection
   and free up any SMC-R RDMA resources and then allow the normal TCP
   connection termination protocol (i.e. FIN processing) to drive
   cleanup of the TCP connection. This design point is very important
   in ensuring that RDMA resources such as the RMBEs are only freed and
   reused when both SMC-R end points are completely done with their
   RDMA Write operations to the partner's RMBE.

                                    1
                           +-----------------+
           |-------------->|     CLOSED      |<--------------|
       3D  |               |                 |               | 4D
           |               +-----------------+               |
           |                        |                        |
           |                     2  |                        |
           |                        V                        |
   +----------------+      +-----------------+      +----------------+
   |AppFinCloseWait |      |     ACTIVE      |      |PeerFinCloseWait|
   |                |      |                 |      |                |
   +----------------+      +-----------------+      +----------------+
           |                 |             |                 |
           | Active Close    | 3A   |  4A  | Passive Close   |
           |                 V      |      V                 |
           |       +--------------+ | +-------------+        |
           |--<----|PeerCloseWait1| | |AppCloseWait1|--->----|
       3C  |       |              | | |             |        | 4C
           |       +--------------+ | +-------------+        |
           |                 |      |      |                 |
           |                 | 3B   |  4B  |                 |
           |                 V      |      V                 |
           |       +--------------+ | +-------------+        |
           |--<----|PeerCloseWait2| | |AppCloseWait2|--->----|
                   |              | | |             |
                   +--------------+ | +-------------+
                                    |
                                    |

                  Figure 22 SMC-R connection states

   Figure 22 describes the states that an SMC-R connection typically
   goes through. Note that there are variations to these states that can
   occur when an SMC-R connection is abnormally terminated, similar in a
   way to when a TCP connection is reset. The following are the high
   level state transitions for an SMC-R connection:

   1. An SMC-R connection begins in the Closed state. This state is
      meant to reflect an RMBE that is not currently in use (one that
      was previously in use but no longer is, or one that was never
      allocated).

   2. An SMC-R connection progresses to the Active state once the
      SMC-R rendezvous processing has successfully completed, RMB
      element indices have been exchanged and SMC-R links have been
      activated. In this state, the TCP connection is fully
      established, rendezvous processing has been completed and the
      SMC-R peers can begin exchanging data via RDMA.

   3. Active close processing (on the SMC-R peer that is initiating
      the connection termination)

      A. When an application on one of the SMC-R connection peers
      issues a close() or shutdown(write or both), the SMC-R layer on
      that host will initiate SMC-R connection termination processing.
      First, if close() or shutdown(both) is issued, it will check to
      see that there is no data in the local RMB element that has not
      been read by the application. If unread data is detected, the
      SMC-R connection must be abnormally reset; for more detail on
      this refer to "SMC-R connection reset". If no unread data is
      pending, it then checks to see whether any outstanding data is
      waiting to be written to the peer or if any outstanding RDMA
      writes for this SMC-R connection have not yet completed. If
      either of these two scenarios is true, an indicator that this
      connection is in a pending close state is saved in the internal
      data structures representing this SMC-R connection and control
      is returned to the application. If all data to be written to the
      partner has completed, this peer will send a CDC message to
      notify the peer of either the PeerConnectionClosed indicator
      (close or shutdown for both was issued) or the PeerDoneWriting
      indicator. This will provide stimulus to the partner SMC-R peer
      that the connection is terminating. At this point the local side
      of the SMC-R connection transitions into the PeerCloseWait1
      state and control can be returned to the application.
      If this process could not be completed synchronously (the close
      pending condition mentioned above), it is completed when all
      RDMA writes for data and control cursors have been completed.

      B. At some point the SMC-R peer application (passive close) will
      consume all incoming data, realize that the partner is done
      sending data on this connection and proceed to initiate its own
      close of the connection once it has completed sending all data
      from its end. The partner application can initiate this
      connection termination processing via the close() or shutdown()
      APIs. If the application does so by issuing a shutdown() for
      write, then the partner SMC-R layer will send a CDC message to
      notify the peer (active close side) of the PeerDoneWriting
      indicator. When the "active close" SMC-R peer wakes up as a
      result of the previous CDC message, it will notice that the
      PeerDoneWriting indicator is now on and transition to the
      PeerCloseWait2 state. This state indicates that the peer is done
      sending data and may still be reading data. The "active close"
      peer will also at this point need to ensure that any outstanding
      recv() calls for this socket are woken up and remember that no
      more data is forthcoming on this connection (in case the local
      connection was shutdown() for write only).

      C. This flow is a common transition from 3A or 3B above. When
      the SMC-R peer (passive close) consumes all data, updates all
      necessary cursors to the peer and the application closes its
      socket (close or shutdown for both), it will send a CDC message
      to the peer (the active close side) with the
      PeerConnectionClosed indicator set. At this point the connection
      can transition back to the Closed state if the local application
      has already closed (or issued shutdown for both) the socket.
      Once in the Closed state, the RMBE can safely be reused for a
      new SMC-R connection. When the PeerConnectionClosed indicator is
      turned on, the SMC-R peer is indicating that it is done updating
      the partner's RMBE.

      D. Conditional state: If the local application has not yet
      issued a close() or shutdown(both), we need to wait until the
      application does so (AppFinCloseWait state). Once it does, the
      local host will send a CDC message to notify the peer of the
      PeerConnectionClosed indicator and then transition to the Closed
      state.

   4. Passive close processing (on the SMC-R peer that receives an
      indication that the partner is closing the connection)

      A. Upon receipt of an inbound RDMA write notice, the SMC-R layer
      will detect that the PeerConnectionClosed indicator or
      PeerDoneWriting indicator is on. If any outstanding recv() calls
      are pending, they are completed with an indicator that the
      partner has closed the connection (zero length data presented to
      the application). If there is any pending data to be written and
      PeerConnectionClosed is on, then an SMC-R connection reset must
      be performed. The connection then enters the AppCloseWait1 state
      on the passive close side, waiting for the local application to
      initiate its own close processing.

      B. If the local application issues a shutdown() for writing,
      then the SMC-R layer will send a CDC message to notify the
      partner of the PeerDoneWriting indicator and transition the
      local side of the SMC-R connection to the AppCloseWait2 state.

      C. When the application issues a close() or shutdown() for both,
      the local SMC-R peer will send a message informing the peer of
      the PeerConnectionClosed indicator and transition to the Closed
      state if the remote peer has also sent the local peer the
      PeerConnectionClosed indicator.
      If the peer has not sent the PeerConnectionClosed indicator, we
      transition into the PeerFinCloseWait state.

      D. The local SMC-R connection stays in this state until the peer
      sends the PeerConnectionClosed indicator in our RMBE. When the
      indicator is sent, we transition to the Closed state and are
      then free to reuse this RMBE.

   Note that each SMC-R peer needs to provide some logic that will
   prevent it from being stranded in a termination state indefinitely.
   For example, if an Active Close SMC-R peer is in a PeerCloseWait (1
   or 2) state awaiting the remote SMC-R peer to update its connection
   termination status, it needs to provide a timer that will prevent it
   from waiting in that state indefinitely should the remote SMC-R peer
   not respond to this termination request. This could occur in error
   scenarios; for example, if the remote SMC-R peer suffered a failure
   prior to being able to respond to the termination request, or if the
   remote application is not responding to this connection termination
   request by closing its own socket. This latter scenario is similar
   to the TCP FINWAIT2 state, which has been known to sometimes cause
   issues when remote TCP/IP hosts lose track of established connections
   and neglect to close them. Even though the TCP standards do not
   mandate a timeout from the TCP FINWAIT2 state, most TCP/IP
   implementations implement a timeout for this state. A similar
   timeout will be required for SMC-R connections. When this timeout
   occurs, the local SMC-R peer performs TCP reset processing for this
   connection. However, no additional RDMA writes to the partner RMBE
   can occur at this point (we have already indicated that we are done
   updating the peer's RMBE). After the TCP connection is reset, the
   RMBE can be returned to the free pool for reallocation. See section
   3.2.5 for more details.
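
   The normal termination transitions described above can be sketched
   as a small event-driven state machine. This is an illustrative
   sketch only, not part of the protocol definition. The state names
   follow Figure 22; the event names, the Connection class and its
   handle() method are hypothetical names introduced here. The sketch
   deliberately omits the unread-data and pending-write checks, the
   CDC messages themselves and the termination timer.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    ACTIVE = auto()
    PEER_CLOSE_WAIT1 = auto()
    PEER_CLOSE_WAIT2 = auto()
    APP_FIN_CLOSE_WAIT = auto()
    APP_CLOSE_WAIT1 = auto()
    APP_CLOSE_WAIT2 = auto()
    PEER_FIN_CLOSE_WAIT = auto()


class Event(Enum):
    # Hypothetical event names; they are not defined by the draft.
    APP_CLOSE = auto()           # local close() or shutdown(both)
    APP_SHUTDOWN_WRITE = auto()  # local shutdown(write)
    PEER_DONE_WRITING = auto()   # PeerDoneWriting indicator received
    PEER_CONN_CLOSED = auto()    # PeerConnectionClosed indicator received


ACTIVE_SIDE = (State.PEER_CLOSE_WAIT1, State.PEER_CLOSE_WAIT2)
PASSIVE_SIDE = (State.APP_CLOSE_WAIT1, State.APP_CLOSE_WAIT2)


class Connection:
    """Tracks one SMC-R connection through normal termination."""

    def __init__(self):
        self.state = State.ACTIVE
        self.local_closed = False  # local application fully closed
        self.peer_closed = False   # peer sent PeerConnectionClosed

    def handle(self, event):
        s = self.state
        if event is Event.APP_CLOSE:
            self.local_closed = True
            if s is State.ACTIVE:
                self.state = State.PEER_CLOSE_WAIT1           # 3A
            elif s in PASSIVE_SIDE:                           # 4C
                self.state = (State.CLOSED if self.peer_closed
                              else State.PEER_FIN_CLOSE_WAIT)
            elif s is State.APP_FIN_CLOSE_WAIT:
                self.state = State.CLOSED                     # 3D
        elif event is Event.APP_SHUTDOWN_WRITE:
            if s is State.ACTIVE:
                self.state = State.PEER_CLOSE_WAIT1           # 3A
            elif s is State.APP_CLOSE_WAIT1:
                self.state = State.APP_CLOSE_WAIT2            # 4B
        elif event is Event.PEER_DONE_WRITING:
            if s is State.ACTIVE:
                self.state = State.APP_CLOSE_WAIT1            # 4A
            elif s is State.PEER_CLOSE_WAIT1:
                self.state = State.PEER_CLOSE_WAIT2           # 3B
        elif event is Event.PEER_CONN_CLOSED:
            self.peer_closed = True
            if s is State.ACTIVE:
                self.state = State.APP_CLOSE_WAIT1            # 4A
            elif s in ACTIVE_SIDE:                            # 3C
                self.state = (State.CLOSED if self.local_closed
                              else State.APP_FIN_CLOSE_WAIT)
            elif s is State.PEER_FIN_CLOSE_WAIT:
                self.state = State.CLOSED                     # 4D
        return self.state
```

   For example, an active close followed by the peer's
   PeerConnectionClosed indicator moves a connection from Active
   through PeerCloseWait1 to Closed, at which point the RMBE may be
   reused.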

   Also note that it is possible to have two SMC-R end points initiate
   an Active close concurrently. In that scenario the flows above still
   apply; however, both end points follow the active close path (path
   3).

4.8.1.1. Abnormal SMC-R connection termination flows

   Abnormal SMC-R connection termination can occur for a variety of
   reasons, including:

   o  The TCP connection associated with an SMC-R connection is reset.
      In the TCP protocol either end point can send a RST segment to
      abort an existing TCP connection when error conditions are
      detected for the connection or the application overtly requests
      that the connection be reset.

   o  Normal SMC-R connection termination processing has unexpectedly
      stalled for a given connection. When the stall is detected
      (connection termination timeout condition), an abnormal SMC-R
      connection termination flow is initiated.

   In these scenarios it is very important that resources associated
   with the affected SMC-R connections are properly cleaned up to ensure
   that there are no orphaned resources and that resources can reliably
   be reused for new SMC-R connections. Given that SMC-R relies heavily
   on RDMA Write processing, special care needs to be taken to ensure
   that an RMBE is no longer being used by an SMC-R peer before
   logically reassigning that RMBE to a new SMC-R connection.

   When an SMC-R peer initiates a TCP connection reset, it also
   initiates an SMC-R abnormal connection termination flow at the same
   time. The SMC-R peers explicitly signal their intent to abnormally
   terminate an SMC-R connection and await explicit acknowledgement
   that the peer has received this notification and has also completed
   abnormal connection termination on its end. Note that TCP connection
   reset processing can occur in parallel to these flows.
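
   The explicit signal and acknowledgement described above can be
   sketched as follows. This is an illustrative sketch under stated
   assumptions, not a definitive implementation: the class and method
   names are hypothetical, the CDC transport is replaced by a list that
   records the indicators "sent" to the peer, and only the
   PeerConnAbort indicator and the PeerAbortWait state are taken from
   this section. The timeout handler models the protective timer that
   keeps a peer from waiting indefinitely.

```python
class AbortingConnection:
    """Models abnormal termination of one SMC-R connection."""

    def __init__(self):
        self.state = "ACTIVE"
        self.sent = []  # indicators sent to the peer (stand-in for CDC)

    def local_abort(self):
        """Active abort: signal the peer, then wait for its indicator."""
        if self.state in ("CLOSED", "PEER_ABORT_WAIT"):
            return
        self.sent.append("PeerConnAbort")
        self.state = "PEER_ABORT_WAIT"

    def on_peer_conn_abort(self):
        """The peer's PeerConnAbort indicator has arrived."""
        if self.state == "PEER_ABORT_WAIT":
            # Active abort side: the peer has acknowledged the abort.
            self.state = "CLOSED"
        elif self.state != "CLOSED":
            # Passive abort side: answer with our own indicator, then
            # close. Purging in-flight data is omitted from the sketch.
            self.sent.append("PeerConnAbort")
            self.state = "CLOSED"

    def on_abort_timeout(self):
        """Protective timer expired while waiting for the peer."""
        if self.state == "PEER_ABORT_WAIT":
            self.state = "CLOSED"
```

   In every path each side sends PeerConnAbort exactly once before
   its RMBE becomes reusable, including the case where both sides
   abort concurrently.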

                           +-----------------+
           |-------------->|     CLOSED      |<--------------|
           |               |                 |               |
           |               +-----------------+               |
           |                                                 |
           |                                                 |
           |                                                 |
           |               +-----------------+               |
           |               |    Any State    |               |
           |1B             | (before setting |             2B|
           |               | PeerConnClosed  |               |
           |               |  Indicator in   |               |
           |               |  Peer's RMBE)   |               |
           |               +-----------------+               |
           |              1A  |           |  2A              |
           |   Active Abort   |           |  Passive Abort   |
           |                  V           V                  |
           |       +--------------+    +--------------+      |
           |-------|PeerAbortWait |    | Process Abort|------|
                   |              |    |              |
                   +--------------+    +--------------+

     Figure 23 SMC-R abnormal connection termination state diagram

   Figure 23 above shows the SMC-R abnormal connection termination state
   diagram:

   1. Active abort designates the SMC-R peer that is initiating the
      TCP RST processing. At the time that the TCP RST is sent, the
      active abort side must also:

      A. Send the PeerConnAbort indicator to the partner via RDMA
      messaging with inline data and then transition to the
      PeerAbortWait state. During this state it will monitor this
      SMC-R connection, waiting for the peer to send its corresponding
      PeerConnAbort indicator, but will ignore any other activity on
      this connection (i.e. new incoming data). It will also surface
      an appropriate error to any socket API calls issued against this
      socket (e.g. ECONNABORTED, ECONNRESET, etc.).

      B. Once the peer sends the PeerConnAbort indicator to the local
      host, the local host can transition this SMC-R connection to the
      Closed state and reuse this RMBE. Note that the SMC-R peer that
      goes into the Active abort state must provide some protection
      against staying in that state indefinitely should the remote
      SMC-R peer not respond by sending its own PeerConnAbort
      indicator to the local host.
      While this should be a rare scenario, it could occur if the
      remote SMC-R peer (passive abort) suffered a failure right after
      the local SMC-R peer (active abort) sent the PeerConnAbort
      indicator. To protect against these types of failures, a timer
      can be set after entering the PeerAbortWait state. If that timer
      pops before the peer has sent its PeerConnAbort indicator (to
      the active abort side), then this RMBE can be returned to the
      free pool for possible reallocation. See section 3.2.5 for more
      details.

   2. Passive abort designates the SMC-R peer that is the recipient of
      an SMC-R abort from the peer, signaled by the PeerConnAbort
      indicator being sent by the peer in a CDC message. Upon
      receiving this request, the local peer must:

      A. Indicate to the socket application that this connection has
      been aborted using the appropriate error codes, and purge all
      in-flight data for this connection that is waiting to be read or
      waiting to be sent.

      B. Send a CDC message to notify the peer of the PeerConnAbort
      indicator and, once that is completed, transition this RMBE to
      the Closed state.

   If an SMC-R peer receives a TCP RST for a given SMC-R connection, it
   also initiates SMC-R abnormal connection termination processing if it
   has not already been notified (via the PeerConnAbort indicator) that
   the partner is severing the connection. It is possible to have two
   SMC-R endpoints concurrently be in an Active abort role for a given
   connection. In that scenario the flows above still apply but both
   end points take the active abort path (path 1).

4.8.1.2. Other SMC-R connection termination conditions

   The following are additional conditions that have implications for
   SMC-R connection termination:

   o  An SMC-R peer being gracefully shut down. If an SMC-R peer
      supports a graceful shutdown operation, it should attempt to
      terminate all SMC-R connections as part of shutdown processing.
      This could be accomplished via LLC Delete Link requests on all
      active SMC Links.

   o  Abnormal termination of an SMC-R peer. In this case, there may
      be no opportunity for the host to perform any SMC-R cleanup
      processing. It is then up to the remote peer to detect a RoCE
      communications failure with the failing host. This could trigger
      an SMC link switch, but that would also surface RoCE errors,
      causing the remote host to eventually terminate all existing
      SMC-R connections to this peer.

   o  Loss of RoCE connectivity between two SMC-R peers. If two peers
      are no longer reachable across any links in their SMC Link group,
      then both peers perform a TCP reset for the connections, surface
      an error to the local applications and free up all QP resources
      associated with the link group.

5. Security considerations

5.1. VLAN considerations

   The concepts and access control of virtual LANs (VLANs) must be
   extended to also cover the RoCE network traffic flowing across the
   Ethernet.

   The RoCE VLAN configuration and accesses must mirror the IP VLAN
   configuration and accesses over the CEE fabric. This means that
   hosts, routers and switches that have access to specific VLANs on the
   IP fabric must also have the same VLAN access across the RoCE
   fabric. In other words, the SMC-R connectivity will follow the same
   virtual network access permissions as normal TCP/IP traffic.

5.2. Firewall considerations

   As mentioned above, the RoCE fabric inherits the same VLAN
   topology/access as the IP fabric. RoCE is a layer 2 protocol that
   requires both end points to reside in the same layer 2 network (i.e.
   VLAN).
   RoCE traffic cannot traverse multiple VLANs as there is no support
   for routing RoCE traffic beyond a single VLAN. As a result, SMC-R
   communications will also be confined to peers that are members of
   the same VLAN. IP based firewalls are typically inserted between
   VLANs (or physical LANs) and rely on normal IP routing to insert
   themselves in the data path. Since RoCE (and by extension SMC-R) is
   not routable beyond the local VLAN, there is no ability to insert a
   firewall in the network path of two SMC-R peers.

5.3. Host-based IP Filters

   Because SMC-R maintains the TCP three-way handshake for connection
   setup before switching to RoCE out of band, existing IP filters that
   control connection setup flows remain effective in an SMC-R
   environment. IP filters that operate on traffic flowing in an active
   TCP connection are not supported, because the connection data does
   not flow over IP.

5.4. Intrusion Detection Services

   Similar to IP filters, intrusion detection services that operate on
   TCP connection setups are compatible with SMC-R with no changes
   required. However, once the TCP connection has switched to RoCE out
   of band, packets are not available for examination.

5.5. IP Security (IPSec)

   IP Security is not compatible with SMC-R because there are no IP
   packets to operate on. TCP connections that require IP security must
   opt out of SMC-R.

5.6. TLS/SSL

   TLS/SSL is preserved in an SMC-R environment. The TLS/SSL layer
   resides above the SMC-R layer and outgoing connection data is
   encrypted before being passed down to the SMC-R layer for RDMA write.
   Similarly, incoming connection data goes through the SMC-R layer
   encrypted and is decrypted by the TLS/SSL layer as it is today.

   The TLS/SSL handshake messages flow over the TCP connection after the
   connection has switched to SMC-R, and so are exchanged using RDMA
   writes by the SMC-R layer, transparently to the TLS/SSL layer.

6. IANA considerations

   The scarcity of TCP option codes available for assignment is
   understood and this architecture uses experimental TCP options
   following the conventions of RFC 6994 "Shared Use of Experimental TCP
   Options".

   If this protocol achieves wide acceptance, a discrete option code may
   be requested by subsequent versions of this protocol.

7. References

7.1. Normative References

   [ROCE]    RDMA over Converged Ethernet specification, URL,
             http://members.infinibandta.org/kwspub/spec/
             Annex_RoCE_final.pdf

   [IBTA]    InfiniBand Architecture specification, URL,
             http://www.infinibandta.org/specs

   [RFC793]  University of Southern California Information Sciences
             Institute, "Transmission Control Protocol", RFC 793,
             September 1981.

   [RFC4727] Fenner, B., "Experimental Values in IPv4, IPv6, ICMPv4,
             ICMPv6, UDP, and TCP Headers", RFC 4727, November 2006.

7.2. Informative References

   [RFC6994] Touch, J., "Shared Use of Experimental TCP Options",
             RFC 6994, August 2013.

8. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.

9. Conventions used in this document

   In the rendezvous flow diagrams, dashed lines (----) are used to
   indicate flows over the TCP/IP fabric and dotted lines (....) are
   used to indicate flows over the RoCE fabric.

   In the data transfer ladder diagrams, dashed lines (----) are used to
   indicate RDMA write operations and dotted lines (....) are used to
   indicate CDC messages, which are RDMA messages with inline data that
   contain control information for the connection.

Appendix A. Formats

A.1. TCP option

   The SMC-R TCP option is formatted in accordance with RFC 6994 "Shared
   Use of Experimental TCP Options". The ExID value is the IBM-1047
   (EBCDIC) encoding of 'SMCR'.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Kind = 254   |  Length = 6   |     x'E2'     |     x'D4'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 24 SMC-R TCP option format

A.2. CLC messages

   The following rules apply to all CLC messages:

   General rules on formats:

   o  Reserved fields must be set to zero and not validated.

   o  Each message has an eyecatcher at the start and another eyecatcher
      at the end. These must both be validated by the receiver.

   o  SMC version indicator: The only SMC-R version defined in this
      architecture is version 1. In the future, if peers have a
      mismatch of versions, the lowest common version number is used.

A.2.1. Peer ID format

   All CLC messages contain a peer ID that uniquely identifies an
   instance of a TCP/IP stack. This peer ID is required to be
   universally unique across TCP/IP stacks and instances (including
   restarts) of TCP/IP stacks.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Instance ID          |   RoCE MAC (first two bytes)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  RoCE MAC (last four bytes)                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 25 Peer ID format

   Instance ID

      A two-byte instance count that ensures that if the same RNIC MAC
      is later used in the peer ID for a different TCP/IP stack, for
      example if an RNIC is redeployed to another stack, the values are
      unique.
      It also ensures that if a TCP/IP stack is restarted, the
      instance ID changes. The value is implementation defined, with
      one suggestion being two bytes of the system clock.

   RoCE MAC

      The RoCE MAC address for one of the peer's RNICs. Note that in a
      virtualized environment this will be the virtual MAC of one of
      the peer's RNICs.

A.2.2. SMC Proposal CLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'E2'     |     x'D4'     |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type = 1    |            Length             |Version| Rsrvd |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                      Client's Peer ID                       -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                   Client's preferred GID                    -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Client's preferred RoCE                    |
   +-         MAC address          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |Offset to mask/prefix area (0) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   .                    Area for future growth                     .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       IPv4 Subnet Mask                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | IPv4 Mask Lgth|           Reserved            | Num IPv6 prfx |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                                                               :
   :           (Variable length) array of IPv6 Prefixes            :
   :                                                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'E2'     |     x'D4'     |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Figure 26 SMC Proposal CLC message format

   The fields present in the SMC Proposal CLC message are:

   Eyecatchers

      Like all CLC messages, the SMC Proposal has beginning and ending
      eyecatchers to aid with verification and parsing. The hex digits
      spell 'SMCR' in IBM-1047 (EBCDIC).

   Type

      CLC message type 1 indicates SMC Proposal.

   Length

      The length of this CLC message. If this is an IPv4 flow, this
      value is 52. Otherwise it is variable, depending upon how many
      prefixes are listed.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   Client's Peer ID

      As described in A.2.1. above.

   Client's preferred RoCE GID

      This is the IPv6 address of the client's preferred RNIC on the
      RoCE fabric.

   Client's preferred RoCE MAC address

      The MAC address of the client's preferred RNIC on the RoCE
      fabric. It is required as some operating systems do not have
      neighbor discovery or ARP support for RoCE RNICs.

   Offset to mask/prefix area

      Provides the number of bytes that must be skipped after this
      field to access the IPv4 Subnet Mask and the fields that follow
      it. Allows for future growth of this signal. In this version of
      the architecture, this value is always zero.

   Area for future growth

      In this version of the architecture, this field does not exist.
      This indicates where additional information may be inserted into
      the signal in the future. The "Offset to mask/prefix area" field
      must be used to skip over this area.

   IPv4 Subnet mask

      If this message is flowing over an IPv4 TCP connection, the value
      of the subnet mask associated with the interface the client sent
      this message over. If this is an IPv6 flow, this field is all
      zeroes.

      This field, along with all fields that follow it in this signal,
      must be accessed by skipping the number of bytes listed in the
      "Offset to mask/prefix area" field after the end of that field.

   IPv4 Mask Lgth

      If this message is flowing over an IPv4 TCP connection, the
      number of significant bits in the IPv4 subnet mask. If this is an
      IPv6 flow, this field is zero.

   Num IPv6 prfx

      If this message is flowing over an IPv6 TCP connection, the
      number of IPv6 prefixes that follow, with a maximum value of 8.
      If this is an IPv4 flow, this field is zero and is immediately
      followed by the ending eyecatcher.

   Array of IPv6 Prefixes

      For IPv6 TCP connections, a list of the IPv6 prefixes associated
      with the network the client sent this message over, up to a
      maximum of 8 prefixes.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                                                               |
   +                       IPv6 Prefix value                       +
   |                                                               |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Prefix Length |
   +-+-+-+-+-+-+-+-+

           Figure 27 Format for IPv6 Prefix array element

A.2.3. SMC Accept CLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'E2'     |     x'D4'     |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type = 2    |          Length = 68          |Version|F|Rsvd |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                      Server's Peer ID                       -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                      Server's RoCE GID                      -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Server's RoCE                         |
   +-         MAC address          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |     Server QP (bytes 1-2)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Srvr QP byte 3 |          Server RMB Rkey (bytes 1-3)          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Srvr RMB byte 4|Server RMB indx| Srvr RMB alert tkn (bytes 1-2)|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Srvr RMB alert tkn (bytes 3-4)| Bsize |  MTU  |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                Server's RMB virtual address                 -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |    Server's initial packet sequence number    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     x'E2'     |     x'D4'     |     x'C3'     |     x'D9'     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 28 SMC Accept CLC message format

   The fields present in the SMC Accept CLC message are:

   Eyecatchers

      Like all CLC messages, the SMC Accept has beginning and ending
      eyecatchers to aid with verification and parsing.
      The hex digits spell 'SMCR' in IBM-1047 (EBCDIC).

   Type

      CLC message type 2 indicates SMC Accept.

   Length

      The SMC Accept CLC message is 68 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   F-bit

      First Contact flag: A 1-bit flag that indicates that the server
      believes this TCP connection is the first SMC-R contact for this
      link group.

   Server's Peer ID

      As described in A.2.1. above.

   Server's RoCE GID

      This is the IPv6 address of the RNIC that the server chose for
      this SMC link.

   Server's RoCE MAC address

      The MAC address of the server's RNIC for the SMC link. It is
      required as some operating systems do not have neighbor discovery
      or ARP support for RoCE RNICs.

   Server's QP number

      The number for the reliably connected queue pair that the server
      created for this SMC link.

   Server's RMB Rkey

      The RDMA Rkey for the RMB that the server created or chose for
      this TCP connection.

   Server's RMB element index

      This indexes which element within the server's RMB will represent
      this TCP connection.

   Server's RMB element alert token

      A platform defined, architecturally opaque token that identifies
      this TCP connection. Added by the client as immediate data on
      RDMA writes from the client to the server to inform the server
      that there is data for this connection to retrieve from the RMB
      element.

   Bsize

      Server's RMB element buffer size in four bits compressed
      notation: x=4 bits. Actual buffer size value is (2^(x+4)) * 1K.
      Smallest possible value is 16K. Largest size supported by this
      architecture is 512K.

   MTU

      An enumerated value indicating this peer's QP MTU size. The two
      peers exchange this value, and the minimum of the two peers'
      values will be used for the QP. This field should only be
      validated on a first contact exchange.
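
   The Bsize and MTU arithmetic described above can be sketched in a
   few lines of Python. This is only an illustration under stated
   assumptions: the function names are hypothetical and not part of
   SMC-R, and the MTU table simply mirrors the enumerated values given
   for the MTU field.

```python
# Illustrative helpers only; these names are hypothetical and are not
# defined by the SMC-R protocol.

# Enumerated QP MTU values (code -> bytes); all other codes are reserved.
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def decode_bsize(x: int) -> int:
    """Return the RMB element buffer size in bytes for a 4-bit Bsize."""
    if not 0 <= x <= 0xF:
        raise ValueError("Bsize is a 4-bit field")
    size = (2 ** (x + 4)) * 1024          # (2^(x+4)) * 1K
    if size > 512 * 1024:
        raise ValueError("exceeds the 512K architectural maximum")
    return size


def negotiate_mtu(local_code: int, peer_code: int) -> int:
    """Return the QP MTU in bytes: the minimum of both peers' values."""
    for code in (local_code, peer_code):
        if code not in MTU_BYTES:
            raise ValueError(f"reserved MTU value {code}")
    return min(MTU_BYTES[local_code], MTU_BYTES[peer_code])
```

   With these definitions, a Bsize of 0 decodes to 16K and a Bsize of 5
   decodes to 512K, which matches the smallest and largest sizes stated
   for the field.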
4223 The enumerated MTU values are: 4225 0: reserved 4227 1: 256 4229 2: 512 4231 3: 1024 4233 4: 2048 4235 5: 4096 4237 6-15: reserved 4239 Server's RMB virtual address 4241 The virtual address of the server's RMB as assigned by the 4242 server's RNIC. 4244 Server's initial packet sequence number 4246 The starting packet sequence number that this peer will use when 4247 sending to the other peer, so that the other peer can prepare its 4248 QP for the sequence number to expect. 4250 A.2.4. SMC Confirm CLC message format 4252 0 1 2 3 4253 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4254 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4255 | x'E2' | x'D4' | x'C3' | x'D9' | 4256 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4257 | Type = 3 | Length = 68 |Version| Rsrvd | 4258 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4259 | | 4260 +- Client's Peer ID -+ 4261 | | 4262 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4263 | | 4264 +- -+ 4265 | | 4266 +- Client's RoCE GID -+ 4267 | | 4268 +- -+ 4269 | | 4270 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4271 | Client's RoCE | 4272 +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4273 | | Client QP (bytes 1-2) | 4274 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---+ 4275 |Clnt QP byte 3 | Client RMB Rkey (bytes 1-3) | 4276 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4277 |Clnt RMB byte 4|Client RMB indx| Clnt RMB alert tkn (bytes 1-2)| 4278 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4279 | Clnt RMB alert tkn (bytes 3-4)|Bsize | MTU | Reserved | 4280 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4281 | | 4282 +- Client's RMB Virtual Address -+ 4283 | | 4284 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4285 | Reserved | Client's initial packet sequence number | 4286 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4287 | x'E2' | x'D4' | x'C3' | x'D9' | 4288 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4289 Figure 29 SMC Confirm CLC message format 4291 The SMC Confirm CLC message is nearly identical to the SMC Accept 4292 except that it contains client information and lacks a first contact 4293 flag. 4295 The fields present on the SMC Confirm CLC message are: 4297 Eyecatchers 4299 Like all CLC messages, the SMC Confirm has beginning and ending 4300 eyecatchers to aid with verification and parsing. The hex digits 4301 spell 'SMCR' in IBM-1047 (EBCDIC) 4303 Type 4305 CLC message type 3 indicates SMC Confirm 4307 Length 4309 The SMC Confirm CLC message is 68 bytes long 4311 Version 4313 Version of the SMC-R protocol. Version 1 is the only currently 4314 defined value. 4316 Client's Peer ID 4318 As described in A.2.1. above 4320 Clients's RoCE GID 4322 This is the IPv6 address of the RNIC that the client chose for 4323 this SMC Link 4325 Client's RoCE MAC address 4327 The MAC address of the client's RNIC for the SMC link. It is 4328 required as some operating systems do not have neighbor discovery 4329 or ARP support for RoCE RNICs. 4331 Client's QP number 4333 The number for the reliably connected queue pair that the client 4334 created for this SMC link 4336 Client's RMB Rkey 4337 The RDMA Rkey for the RMB that the client created or chose for 4338 this TCP connection 4340 Client's RMB element index 4342 This indexes which element within the client's RMB will represent 4343 this TCP connection 4345 Client's RMB element alert token 4347 A platform defined, architecturally opaque token that identifies 4348 this TCP connection. 
      Added by the server as immediate data on RDMA writes from the
      server to the client to inform the client that there is data for
      this connection to retrieve from the RMB element.

   Bsize

      Client's RMB element buffer size in four bits compressed
      notation: x=4 bits. Actual buffer size value is (2^(x+4)) * 1K.
      Smallest possible value is 16K. Largest size supported by this
      architecture is 512K.

   MTU

      An enumerated value indicating this peer's QP MTU size. The two
      peers exchange this value, and the minimum of the two peers'
      values will be used for the QP. The values are enumerated in
      A.2.3. This value should only be validated on the first contact
      exchange.

   Client's RMB virtual address

      The virtual address of the client's RMB as assigned by the
      client's RNIC.

   Client's initial packet sequence number

      The starting packet sequence number that this peer will use when
      sending to the other peer, so that the other peer can prepare its
      QP for the sequence number to expect.

A.2.5.
SMC Decline CLC message format 4382 0 1 2 3 4383 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4384 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4385 | x'E2' | x'D4' | x'C3' | x'D9' | 4386 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4387 | Type = 4 | Length = 28 |Version|S|Rsrvd| 4388 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4389 | | 4390 +- Sender's Peer ID -+ 4391 | | 4392 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4393 | Peer Diagnosis Information | 4394 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4395 | | 4396 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4397 | x'E2' | x'D4' | x'C3' | x'D9' | 4398 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4399 Figure 30 SMC Decline CLC message format 4401 The fields present on the SMC Decline CLC message are: 4403 Eyecatchers 4405 Like all CLC messages, the SMC Decline has beginning and ending 4406 eyecatchers to aid with verification and parsing. The hex digits 4407 spell 'SMCR' in IBM-1047 (EBCDIC) 4409 Type 4411 CLC message type 4 indicates SMC Decline 4413 Length 4415 The SMC Decline CLC message is 28 bytes long 4417 Version 4419 Version of the SMC-R protocol. Version 1 is the only currently 4420 defined value. 4422 S-bit 4424 Synch Bit. Indicates that the link group is out of synch and 4425 receiving peer must clean up its representation of the link group 4427 Sender's Peer ID 4429 As described in A.2.1. above 4431 Peer Diagnosis Information 4433 Four bytes of diagnosis information provided by the peer. These 4434 values are defined by the individual peers and it is necessary to 4435 consult the peer's system documentation to interpret the results. 4437 A.3. 
LLC messages

   LLC messages are sent over an existing SMC-R link using RoCE message
   passing and are always 44 bytes long so that they fit into the space
   available in a single WQE without requiring the receiver to post
   receive buffers. If all 44 bytes are not needed, they are padded out
   with zeroes. LLC messages are in a request/response format. The
   message type is the same for request and response, and a flag
   indicates whether a message is flowing as a request or a response.

   The two high order bits of an LLC message opcode indicate how it is
   to be handled by a peer that does not support the opcode.

   If the high order bits of the opcode are b'00' then the peer must
   support the LLC message and indicate a protocol error if it does not.

   If the high order bits of the opcode are b'10' then the peer must
   silently discard the LLC message if it does not support the opcode.
   This requirement allows peers to tolerate advanced but optional
   function.

   High order bits of b'11' indicate a Connection Data Control (CDC)
   message as described in A.4.

A.3.1.
CONFIRM LINK LLC message format 4463 0 1 2 3 4464 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4465 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4466 | type = 1 | length = 44 | Reserved |R| Reserved | 4467 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4468 | Sender's RoCE | 4469 +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4470 | | | 4471 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 4472 | | 4473 +- -+ 4474 | Sender's RoCE GID | 4475 +- -+ 4476 | | 4477 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4478 | |Sender's QP number, bytes 1-2 | 4479 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4480 |Sender QP byte3| Link number |Sender's link userid, bytes 1-2| 4481 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4482 |Sender's link userid bytes, 3-4| Max links | Reserved | 4483 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4484 | | 4485 +- Reserved -+ 4486 | | 4487 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4488 Figure 31 CONFIRM LINK LLC message format 4490 The CONFIRM LINK LLC message is required to be exchanged between the 4491 server and client over a newly created SMC-R link to complete the 4492 setup of an SMC link. Its purpose is to confirm that the RoCE path 4493 is actually usable. 4495 On first contact this flows after the server receives the SMC Confirm 4496 CLC message from the client over the IP connection. For additional 4497 links added to an SMC link group, it flows after the ADD LINK and ADD 4498 LINK CONTINUATION exchange. This flow provides confirmation that the 4499 queue pair is in fact usable. Each peer echoes its RoCE information 4500 back to the other. 4502 Type 4504 Type 1 indicates CONFIRM LINK 4506 Length 4507 All LLC messages are 44 bytes long 4509 R 4511 Reply flag. When set indicates this is a CONFIRM LINK REPLY 4513 Sender's RoCE MAC address 4515 The MAC address of the sender's RNIC for the SMC link. 
      It is required as some operating systems do not have neighbor
      discovery or ARP support for RoCE RNICs.

   Sender's RoCE GID

      This is the IPv6 address of the RNIC that the sender is using for
      this SMC-R link.

   Sender's QP number

      The number for the reliably connected queue pair that the sender
      created for this SMC-R link.

   Link number

      An identifier assigned by the server that uniquely identifies the
      link within the link group. This identifier is ONLY unique
      within a link group. Provided by the server and echoed back by
      the client.

   Link User ID

      An opaque, implementation defined identifier assigned by the
      sender and provided to the receiver solely for purposes of
      display, diagnosis, network management, etc. The link user ID
      should be unique across the sender's entire software space,
      including all other link groups.

   Max Links

      The maximum number of links the sender can support in a link
      group. The maximum for this link group is the smaller of the
      values provided by the two peers.

A.3.2.
ADD LINK LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 2    |  length = 44  | Rsrvd |RsnCode|R|Z| Reserved  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Sender's RoCE                         |
   +-        MAC address           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   +-                                                             -+
   |                       Sender's RoCE GID                       |
   +-                                                             -+
   |                                                               |
   +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |Sender's QP number, bytes 1-2  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Sender QP byte3|  Link number  | Rsrvd |  MTU  |  Initial PSN  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Initial PSN, continued     |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
   |                           Reserved                            |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                Figure 32 ADD LINK LLC message format

   The ADD LINK LLC message is sent over an existing link in the link
   group when a peer wishes to add an SMC-R link to an existing SMC-R
   link group. It is sent by the server to add a new SMC-R link to the
   group, or by the client to request that the server add a new link,
   for example when a new RNIC becomes active. When sent from the
   client to the server, it represents a request that the server
   initiate an ADD LINK exchange.

   This message is sent immediately after the initial SMC link in the
   group completes, as described in 3.5.1. First contact. It can also be
   sent over an existing SMC-R link group at any time as new RNICs are
   added and become available. Therefore, there can be as few as one
   new RMB RToken to communicate, or several. RTokens are communicated
   using ADD LINK CONTINUATION messages.
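
   As a rough illustration of the layout in Figure 32, the Python
   sketch below packs and parses an ADD LINK message. The byte offsets
   are one reading of the figure (most significant bit first, so R is
   the high bit of the fourth byte), and big-endian encoding of the
   24-bit QP number and initial PSN is an assumption. The function
   names are hypothetical and not part of the protocol.

```python
# Sketch only: offsets are derived from a reading of Figure 32, and the
# helper names below are hypothetical.
LLC_LEN = 44       # every LLC message is 44 bytes long
ADD_LINK = 2       # LLC message type for ADD LINK


def pack_add_link(mac: bytes, gid: bytes, qp: int, link_num: int,
                  mtu: int, psn: int, reply: bool = False,
                  reject: bool = False, rsn_code: int = 0) -> bytes:
    """Build a 44-byte ADD LINK message; reserved areas stay zero."""
    if len(mac) != 6 or len(gid) != 16:
        raise ValueError("MAC must be 6 bytes and GID 16 bytes")
    msg = bytearray(LLC_LEN)
    msg[0] = ADD_LINK
    msg[1] = LLC_LEN
    msg[2] = rsn_code & 0x0F                  # low nibble: RsnCode
    msg[3] = (0x80 if reply else 0) | (0x40 if reject else 0)  # R, Z
    msg[4:10] = mac                           # sender's RoCE MAC
    msg[10:26] = gid                          # sender's RoCE GID
    msg[26:29] = qp.to_bytes(3, "big")        # 24-bit QP number (assumed big-endian)
    msg[29] = link_num
    msg[30] = mtu & 0x0F                      # low nibble: MTU enumeration
    msg[31:34] = psn.to_bytes(3, "big")       # 24-bit initial PSN (assumed big-endian)
    return bytes(msg)


def parse_add_link(msg: bytes) -> dict:
    """Extract the ADD LINK fields from a 44-byte message."""
    if len(msg) != LLC_LEN or msg[0] != ADD_LINK or msg[1] != LLC_LEN:
        raise ValueError("not an ADD LINK LLC message")
    return {
        "rsn_code": msg[2] & 0x0F,
        "reply": bool(msg[3] & 0x80),
        "reject": bool(msg[3] & 0x40),
        "mac": msg[4:10],
        "gid": msg[10:26],
        "qp": int.from_bytes(msg[26:29], "big"),
        "link_num": msg[29],
        "mtu": msg[30] & 0x0F,
        "psn": int.from_bytes(msg[31:34], "big"),
    }
```

   Building the message in a zero-filled 44-byte buffer keeps every
   reserved area padded with zeroes, as A.3 requires for LLC messages.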
   The contents of the ADD LINK LLC message are:

   Type

      Type 2 indicates ADD LINK.

   Length

      All LLC messages are 44 bytes long.

   RsnCode

      If the Z (rejection) flag is set, this field provides the reason
      code. Values can be:

      X'1' - No alternate path available: set when the server provides
      the same MAC/GID as an existing SMC-R link in the group, and the
      client does not have any additional RNICs available (i.e., the
      server is attempting to set up an asymmetric link but none is
      available).

      X'2' - Invalid MTU value specified.

   R

      Reply flag. When set, indicates this is an ADD LINK REPLY.

   Z

      Rejection flag. When set on a reply, indicates that the server's
      ADD LINK was rejected by the client. When this flag is set, the
      reason code will also be set.

   Sender's RoCE MAC address

      The MAC address of the sender's RNIC for the new SMC-R link. It
      is required as some operating systems do not have neighbor
      discovery or ARP support for RoCE RNICs.

   Sender's RoCE GID

      The IPv6 address of the RNIC that the sender is using for the new
      SMC-R link.

   Sender's QP number

      The number for the reliably connected queue pair that the sender
      created for the new SMC-R link.

   Link number

      An identifier for the new SMC-R link. This is assigned by the
      server and uniquely identifies the link within the link group.
      This identifier is ONLY unique within a link group. Provided by
      the server and echoed back by the client.

   MTU

      An enumerated value indicating this peer's QP MTU size. The two
      peers exchange this value, and the minimum of the two peers'
      values will be used for the QP. The values are enumerated in
      A.2.3.

   Initial PSN

      The starting packet sequence number that this peer will use when
      sending to the other peer, so that the other peer can prepare its
      QP for the sequence number to expect.

A.3.3.
ADD LINK CONTINUATION LLC message format 4662 0 1 2 3 4663 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4664 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4665 | type = 3 | length = 44 | Reserved |R| Reserved | 4666 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4667 | Linknum | NumRTokens | Reserved | 4668 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4669 | | 4670 +- -+ 4671 | | 4672 +- Rkey/Rtoken Pair -+ 4673 | | 4674 +- -+ 4675 | | 4676 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4677 | | 4678 +- -+ 4679 | | 4680 +- Rkey/Rtoken Pair or zeroes -+ 4681 | | 4682 +- -+ 4683 | | 4684 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4685 | Reserved | 4686 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4687 Figure 33 ADD LINK CONTINUATION LLC message format 4689 When a new SMC-R link is added to an SMC-R link group, it is 4690 necessary to communicate the new link's RTokens for the RMBs that the 4691 SMC-r link group can access. This message follows the ADD LINK and 4692 provides the RTokens. 4694 The server kicks off this exchange by sending the first ADD LINK 4695 CONTINUATION LLC message, and the server controls the exchange as 4696 described below. 4698 o If the client and the server require the same number of ADD LINK 4699 CONTINUATION messages to communicate their RTokens, the server 4700 starts the exchange by sending the client the first ADD LINK 4701 CONTINUATION request to the client with its RTokens, then the 4702 client responds with an ADD LINK CONTINUATION response with its 4703 RTokens, and so on until the exchange is completed. 4705 o If the server requires more ADD LINK CONTINUATION messages than 4706 the client, then after the client has communicated all its 4707 RTokens, the server continues to send ADD LINK CONTINUATION 4708 request messages to the client. 
The client continues to respond, 4709 using empty (number of RTokens to be communicated = 0) ADD LINK 4710 CONTINUATION response messages. 4712 o If the client requires more ADD LINK CONTINUATION messages than 4713 the server, then after communicating all its RTokens the server 4714 will continue to send empty ADD LINK CONTINUATION messages to the 4715 client to solicit replies with the client's RTokens, until all 4716 have been communicated. 4718 The contents of this message are: 4720 Type 4722 Type 3 indicates ADD LINK CONTINUATION 4724 Length 4726 All LLC messages are 44 bytes long 4728 R 4730 Reply flag. When set indicates this is an ADD LINK CONTINUATION 4731 REPLY 4733 LinkNum 4735 The link number of the new link within the SMC link group that 4736 Rkeys are being communicated for 4738 NumRTokens 4740 Number of RTokens remaining to be communicated (including the 4741 ones in this message). If the value is less than or equal to 2, 4742 this is the last message. If it is greater than 2, another 4743 continuation message will be required, and its value will be the 4744 value in this message minus 2, and so on until all Rkeys are 4745 communicated. The maximum value for this field is 255. 4747 Up to 2 Rkey/RToken pairs 4749 These consist of an Rkey for an RMB that is known on the SMC-R 4750 link that this message was sent over (the reference Rkey), paired 4751 with the same RMB's RToken over the new SMC link. A full RToken 4752 is not required for the reference because it is only being used 4753 to distinguish which RMB it applies to, not address it. 
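
   The NumRTokens countdown and the server-driven exchange described
   above can be sketched as follows. This is an illustration only: the
   function names are hypothetical, and the second function assumes
   that a peer with nothing left to send keeps answering with empty
   (NumRTokens = 0) messages, as the three cases above describe.

```python
# Sketch of the ADD LINK CONTINUATION countdown; names are hypothetical.
PAIRS_PER_MESSAGE = 2   # each message carries up to 2 Rkey/RToken pairs


def continuation_counts(total_rtokens: int) -> list:
    """NumRTokens value carried by each message one peer sends."""
    if not 0 <= total_rtokens <= 255:
        raise ValueError("NumRTokens must be in the range 0..255")
    counts = []
    remaining = total_rtokens
    while True:
        counts.append(remaining)          # includes the pairs in this message
        if remaining <= PAIRS_PER_MESSAGE:
            break                         # this is the last message
        remaining -= PAIRS_PER_MESSAGE
    return counts


def exchange_rounds(server_rtokens: int, client_rtokens: int) -> int:
    """Request/response rounds needed; the server always sends requests."""
    return max(len(continuation_counts(server_rtokens)),
               len(continuation_counts(client_rtokens)))
```

   For example, a peer with 5 RTokens sends messages carrying
   NumRTokens values of 5, 3, and 1. If the other peer has only 1
   RToken, the exchange still takes 3 rounds, with that peer's later
   messages being empty.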
4755 0 1 2 3 4756 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4757 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4758 | Reference Rkey | 4759 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4760 | New Rkey | 4761 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4762 | | 4763 +- New Virtual Address -+ 4764 | | 4765 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4766 Figure 34 Rkey/Rtoken pair format 4768 The contents of the RKey/RToken pair are: 4770 Reference Rkey 4772 The Rkey of the RMB as it is already known on the SMC-R link over 4773 which this message is being sent. Required so that the peer knows 4774 which RMB to associate the new Rtoken with. 4776 New Rkey 4778 The Rkey of this RMB as it is known over the new SMC-R link 4780 New Virtual Address 4782 The virtual address of this RMB as it is known over the new SMC-R 4783 link. 4785 A.3.4. DELETE LINK LLC message format 4787 0 1 2 3 4788 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4789 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4790 | type = 4 | length = 44 | Reserved |R|A|O| Rsrvd | 4791 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4792 | Linknum | Reason code (bytes 1-3) | 4793 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4794 |RsnCode byte 4 | | 4795 +-+-+-+-+-+-+-+-+ -+ 4796 | | 4797 +- -+ 4798 | | 4799 +- -+ 4800 | | 4801 +- Reserved -+ 4802 | | 4803 +- -+ 4804 | | 4805 +- -+ 4806 | | 4807 +- -+ 4808 | | 4809 +- -+ 4810 | | 4811 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4812 Figure 35 DELETE LINK LLC message format 4814 When the client or server detects that a QP or SMC-R link goes down 4815 or needs to come down, it sends this message over one of the other 4816 links in the link group. 
   When the DELETE LINK is sent from the client, it only serves as a
   notification, and the client expects the server to send a DELETE
   LINK Request in response. To avoid races, only the server will
   initiate the actual DELETE LINK Request and Response sequence that
   results from notification from the client.

   The server can also initiate the DELETE LINK without notification
   from the client if it detects an error or if orderly link termination
   was initiated.

   The client may also request termination of the entire link group and
   the server may terminate the entire link group using this message.

   The contents of this message are:

   Type

      Type 4 indicates DELETE LINK.

   Length

      All LLC messages are 44 bytes long.

   R

      Reply flag. When set, indicates this is a DELETE LINK REPLY.

   A

      All flag. When set, indicates that all links in the link group
      are to be terminated. This terminates the link group.

   O

      Orderly flag. Indicates orderly termination. Orderly termination
      is generally caused by an operator command rather than an error
      on the link. When the client requests orderly termination, the
      server may wait to complete other work before terminating.

   LinkNum

      The link number of the link to be terminated. If the A flag is
      set, this field has no meaning and is set to 0.

   RsnCode

      The termination reason code. Currently defined reason codes are:

      Request Reason Codes:

      o  X'00010000' = Lost path

      o  X'00020000' = Operator initiated termination

      o  X'00030000' = Program initiated termination (link inactivity)

      o  X'00040000' = LLC protocol violation

      o  X'00050000' = Asymmetric link no longer needed

      Response Reason Codes:

      o  X'00100000' = Unknown Link ID (no link)

      o  Others TBD

A.3.5.
CONFIRM RKEY LLC message format 4885 0 1 2 3 4886 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4887 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4888 | type = 6 | length = 44 | Reserved |R|0|Z|C|Rsrvd | 4889 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4890 | NumTkns | New RMB Rkey for this link (bytes 1-3) | 4891 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4892 |ThisLink byte 4| | 4893 +-+-+-+-+-+-+-+-+ -+ 4894 | New RMB virtual address for this link | 4895 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4896 | | | 4897 +-+-+-+-+-+-+-+-+ -+ 4898 | | 4899 +- Other link RMB specification or zeros -+ 4900 | | 4901 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4902 | | | 4903 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+ 4904 | | 4905 +- -+ 4906 | Other link RMB specification or zeroes | 4907 +- +-+-+-+-+-+-+-+-+ 4908 | | Reserved | 4909 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4910 Figure 36 CONFIRM RKEY LLC message format 4912 The CONFIRM_RKEY flow can be sent at any time from either the client 4913 or the server, to inform the peer that an RMB has been created or 4914 deleted. The creator of a new RMB must inform its peer of the new 4915 RMB's RToken for all SMC-R links in the SMC-R link group. The 4916 deleter of an RMB must inform its peer of the deleted RMB's RToken 4917 for all SMC-R links. 4919 For RMB creation, the creator sends this message over the SMC link 4920 that the first TCP connection that uses the new RMB is using. This 4921 message contains the new RMB RToken for the SMC link that the message 4922 is sent over, then it lists the sender's SMC links in the link group 4923 paired with the new RToken for the new RMB for that link. This 4924 message can communicate the new RTokens for 3 QPs: the QP for the 4925 link this message is sent over, and 2 others. 
If there are more than 4926 3 links in the SMC-R link group, CONFIRM_RKEY_CONTINUATION will be 4927 required. 4929 For RMB deletion, the creator sends the same format of message with a 4930 delete flag set, to inform the peer that the RMB's RTokens on all 4931 links in the group are deleted. 4933 In both cases, the peer responds by simply echoing the message with 4934 the response flag set. If the response is a negative response, the 4935 sender must recalculate the RToken set and start a new CONFIRM_RKEY 4936 exchange from the beginning. The timing of this retry is controlled 4937 by the C flag as described below. 4939 The contents of this message are: 4941 Type 4943 Type 6 indicates CONFIRM RKEY 4945 Length 4947 All LLC messages are 44 bytes long 4949 R 4951 Reply flag. When set indicates this is a CONFIRM RKEY REPLY 4953 0 4955 Reserved bit 4957 Z 4959 Negative response flag 4961 C 4963 Configuration Retry bit. If this is a negative response and this 4964 flag is set, the originator should recalculate the Rkey set and 4965 retry this exchange as soon as the current configuration change 4966 is completed. If this flag is not set on a negative response, the 4967 originator must wait for the next natural stimulus (for example, 4968 a new TCP connection started that requires a new RMB) before 4969 retrying. 4971 NumTkns 4973 The number of other link/RToken pairs, including those provided 4974 in this message, to be communicated. Note that this value does 4975 not include the Rtoken for the link this message was sent on 4976 (i.e., the maximum value is 2). If this value is three or fewer 4977 this is the only message in the exchange. If this value is 4978 greater than three, a CONFIRM RKEY CONTINUATION message will be 4979 required. 4981 Note: in this version of the architecture, 8 is the maximum 4982 number of links supported in a link group. 4984 New RMB Rkey for this link 4986 The new RMB's Rkey as assigned on the link this message is being 4987 sent over. 
   New RMB virtual address for this link

      The new RMB's virtual address as assigned on the link this
      message is being sent over.

   Other link RMB specification

      The new RMB's specification on the other links in the link group,
      as shown in Figure 37.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Link number  | RMB's Rkey for the specified link (bytes 1-3) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |New Rkey byte 4|                                               |
   +-+-+-+-+-+-+-+-+                                              -+
   |         RMB's virtual address for the specified link          |
   +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |
   +-+-+-+-+-+-+-+-+

             Figure 37 Format of link number/Rkey pairs

   Link number

      The link number for a link in the link group.

   RMB's Rkey for the specified link

      The Rkey used to reach the RMB over the link whose number was
      specified in the link number field.

   RMB's virtual address for the specified link

      The virtual address used to reach the RMB over the link whose
      number was specified in the link number field.

A.3.6.
CONFIRM RKEY CONTINUATION LLC message format 5028 0 1 2 3 5029 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 5030 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5031 | type = 8 | length = 44 | Reserved |R|0|Z| Rsrvd | 5032 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5033 | NumTknsLeft | | 5034 +-+-+-+-+-+-+-+-+ -+ 5035 | | 5036 +- Other link RMB specification -+ 5037 | | 5038 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5039 | | | 5040 +-+-+-+-+-+-+-+-+ -+ 5041 | | 5042 +- Other link RMB specification or zeros -+ 5043 | | 5044 +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5045 | | | 5046 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+ 5047 | | 5048 +- -+ 5049 | Other link RMB specification or zeroes | 5050 +- +-+-+-+-+-+-+-+-+ 5051 | | Reserved | 5052 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5054 The CONFIRM RKEY CONTINUATION LLC message is used to communicate any 5055 additional RMB RTokens that did not fit into the CONFIRM RKEY 5056 message. Each of these messages can hold up to 3 RMB RTokens. The 5057 Numlinks field indicates how many RMB RTokens are to be communicated, 5058 including the ones in this message. If the value is 3 or less, this 5059 is the last message of the group. If the value is 4 or higher, 5060 additional CONFIRM RKEY CONTINUATION messages will follow, and the 5061 Numlinks value will be a countdown until all are communicated. 5063 Like the CONFIRM RKEY message, the peer responds by echoing the 5064 message back with the reply flag set. 5066 The contents of this message are: 5068 Type 5070 Type 8 indicates CONFIRM RKEY CONTINUATION 5072 Length 5074 All LLC messages are 44 bytes long 5076 R 5078 Reply flag. When set indicates this is a CONFIRM RKEY 5079 CONTINUATION REPLY 5081 0 5083 Reserved bit 5085 Z 5087 Negative response flag 5089 NumTknsLeft 5091 The number of link/RToken pairs, including those provided in this 5092 message, that are remaining to be communicated. 
      If this value is three or fewer, this is the last message in the
      exchange. If this value is greater than three, another CONFIRM
      RKEY CONTINUATION message will be required. Note that in this
      version of the architecture, 8 is the maximum number of links
      supported in a link group.

   Other link RMB specifications

      The new RMB's specification on other links in the link group, as
      shown in Figure 37.

A.3.7. DELETE RKEY LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 9    |  length = 44  |   Reserved    |R|0|Z|  Rsrvd  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Count     |  Error Mask   |           Reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      First deleted Rkey                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Second deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Third deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Fourth deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Fifth deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Sixth deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Seventh deleted Rkey or zeros                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Eighth deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Reserved                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The DELETE_RKEY flow can be sent at any time from either the client
   or the server, to inform the peer that one or more RMBs have been
   deleted.
Because the peer already knows every RMB's Rkey on each 5135 link in the link group, this message only specifies one Rkey for each 5136 RMB being deleted. The Rkey provided for each deleted RMB will be its 5137 Rkey as known on the SMC-R link that this message is sent over. 5139 It is not necessary to provide the entire RToken. The Rkey alone is 5140 sufficient for identifying an existing RMB. 5142 The peer responds by simply echoing the message with the response 5143 flag set. If the peer did not recognize an Rkey, a negative response 5144 flag will be set, however no aggressive recovery action beyond 5145 logging the error will be taken. 5147 The contents of this message are: 5149 Type 5151 Type 9 indicates DELETE RKEY 5153 Length 5155 All LLC messages are 44 bytes long 5157 R 5159 Reply flag. When set indicates this is a DELETE RKEY REPLY 5161 0 5163 Reserved bit 5165 Z 5167 Negative response flag 5169 Count 5171 Number of RMBs being deleted by this message. Maximum value is 8 5173 Error Mask 5175 If this is a negative response, indicates which RMBs were not 5176 successfully deleted. Each bit corresponds to a listed RMB. So 5177 for example b'01010000' indicates that the second and fourth 5178 Rkeys weren't successfully deleted. 5180 Deleted Rkeys 5182 A list of Count Rkeys. Provided on the request flow and echoed 5183 back on the response flow. Each Rkey is valid on the link this 5184 message is sent over, and represents a deleted RMB. Up to eight 5185 RMBs can be deleted in this message. 5187 A.3.8. 
TEST LINK LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 7    |  length = 44  |    Reserved   |R|   Reserved  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                          User Data                          -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                           Reserved                            |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 38 TEST LINK LLC message format

   The TEST_LINK request can be sent from either peer to the other on an
   existing SMC-R link at any time to test that the SMC-R link is active
   and healthy at the software level. A peer which receives a TEST_LINK
   LLC message immediately sends back a TEST_LINK reply, echoing back
   the user data. Also refer to section 4.5.3, TCP Keepalive
   processing.

   The contents of this message are:

   Type

      Type 7 indicates TEST LINK

   Length

      All LLC messages are 44 bytes long

   R

      Reply flag. When set, indicates this is a TEST LINK REPLY

   User Data

      The receiver of this message echoes the sender's data back in a
      TEST_LINK response LLC message

A.4. Connection Data Control (CDC) message format

   The RMBE control data is communicated using Connection Data Control
   (CDC) messages, which use RDMA message passing using inline data,
   similar to LLC messages. Also similar to LLC messages, this data
   block is 44 bytes long to ensure that it can fit into private data
   areas of receive WQEs, without requiring the receiver to post receive
   buffers.

   Unlike LLC messages, this data is integral to the data path so its
   processing must be prioritized and optimized similarly to other data
   path processing.
While LLC messages may be processed on a slower 5252 path than data, these messages cannot be. 5254 0 1 2 3 5255 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 5256 0 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5257 | Type = x'FE' | Length = 44 | Sequence number | 5258 4 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5259 | SMC-R alert token | 5260 8 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5261 | Reserved | Producer cursor wrap seqno | 5262 12 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5263 | Producer Cursor | 5264 16 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5265 | Reserved | Consumer cursor wrap seqno | 5266 20 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5267 | Consumer Cursor | 5268 24 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5269 |B|P|U|R|F|Rsrvd|D|C|A| Reserved | 5270 28 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5271 | | 5272 32 +- -+ 5273 | | 5274 36 +- Reserved -+ 5275 | | 5276 40 +- -+ 5277 | | 5278 44 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5280 Figure 39 Connection Data Control (CDC) Message Format 5282 Type = x'FE' 5284 This type number has the two high order bits turned on to enable 5285 processing to quickly distinguish it from an LLC message 5287 Length = 44 5289 The length of inline data that does not require posting of a 5290 receive buffer. 5292 Sequence number 5294 A 2 byte unsigned integer that represents a wrapping sequence 5295 number. The initial value is one and this value can wrap to 0. 5296 Incremented with every control message send, except for the 5297 failover data validation message, and used to guard against 5298 processing an old control message out of sequence, and also used 5299 in failover data validation. In normal usage, if this number is 5300 less than the last received value, discard this message. 
      If greater, process this message. Old control messages can be
      lost with no ill effect, but cannot be processed after newer
      ones.

      If this is a failover validation CDC message (F flag set), then
      the receiver must verify that it has received and fully processed
      the RDMA write that was described by the CDC message with the
      sequence number in this message. If not, the TCP connection must
      be reset, to guard against data loss. Details of this processing
      are in section 4.6.1.

   SMC-R alert token

      The endpoint-assigned alert token that identifies which TCP
      connection on the link group this control message refers to.

   Producer cursor wrap seqno

      A 2 byte unsigned integer that represents a wrapping counter
      incremented by the producer whenever the data written into this
      RMBE receiver buffer causes a wrap (i.e. the producer cursor
      wraps). This is used by the receiver to determine when new data
      is available even though the cursors appear unchanged, such as
      when a full window size write is completed (Producer Cursor of
      this RMBE sent by peer = Local Consumer Cursor) or in scenarios
      where the Producer Cursor sent for this RMBE is less than the
      Local Consumer Cursor.

   Producer cursor

      Unsigned, 4 byte integer that is a wrapping offset into the RMBE
      data area. Points to the next byte of data to be written by the
      sender. Can advance up to the receiver's Consumer Cursor as known
      by the sender. When the urgent data present indicator is on, it
      points one byte beyond the last byte of urgent data. When
      computing this cursor, the presence of the eyecatcher in the RMBE
      data area must be accounted for. The first writeable data
      location in the RMBE is at offset 4, so this cursor begins at 4
      and wraps to 4.
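
      The cursor arithmetic described above can be illustrated with a
      minimal sketch. It is not part of the protocol definition. The
      function name advance_cursor and the parameter rmbe_len are
      hypothetical, and the sketch assumes that rmbe_len is the total
      length of the RMBE data area including the 4 byte eyecatcher,
      and that a single advance never exceeds the usable space.

```python
EYECATCHER_LEN = 4  # first writeable offset in the RMBE data area, per the text


def advance_cursor(cursor, wrap_seqno, nbytes, rmbe_len):
    """Advance a producer or consumer cursor by nbytes (illustrative only).

    Assumptions (not taken from the specification):
      - rmbe_len is the total RMBE data area length, eyecatcher included
      - nbytes never exceeds the usable space, rmbe_len - EYECATCHER_LEN
    Returns the new (cursor, wrap_seqno) pair.
    """
    usable = rmbe_len - EYECATCHER_LEN
    if not EYECATCHER_LEN <= cursor < rmbe_len:
        raise ValueError("cursor outside the writeable RMBE area")
    if not 0 <= nbytes <= usable:
        raise ValueError("nbytes exceeds the usable RMBE space")

    new_cursor = cursor + nbytes
    if new_cursor >= rmbe_len:
        # Wrap back to the first writeable byte, skipping the eyecatcher,
        # and bump the 2 byte wrapping counter.
        new_cursor = EYECATCHER_LEN + (new_cursor - rmbe_len)
        wrap_seqno = (wrap_seqno + 1) & 0xFFFF
    return new_cursor, wrap_seqno
```

      With a 20 byte data area, a full window write of 16 bytes
      starting at cursor 4 leaves the cursor at 4 and changes only the
      wrap sequence number, which is the case the wrap seqno field is
      meant to disambiguate.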

   Consumer cursor wrap seqno

      2 byte unsigned integer that mirrors the value of the Producer
      cursor wrap sequence number when the last read from this RMBE
      occurred. Used as an indicator of how far along the consumer is
      in reading data (i.e. processed last wrap point or not). The
      producer side can use this indicator to detect whether more data
      can be written to the partner in full window write scenarios
      (where the Producer Cursor = Consumer Cursor as known on the
      remote RMBE). In this scenario, if the consumer sequence number
      equals the local producer sequence number, the producer knows
      that more data can be written.

   Consumer Cursor

      Unsigned 4 byte integer that is a wrapping offset into the
      sender's RMBE data area. Points to the offset of the next byte
      of data to be consumed by the peer in its own RMBE. When
      computing this cursor, the presence of the eyecatcher in the RMBE
      data area must be accounted for. The first writeable data
      location in the RMBE is at offset 4, so this cursor begins at 4
      and wraps to 4. The sender cannot write beyond this cursor into
      the peer's RMBE without causing data loss.

   B-bit

      Writer blocked indicator: Sender is blocked for writing, requires
      explicit notification when receive buffer space is available.

   P-bit

      Urgent data pending: Sender has urgent data pending for this
      connection

   U-bit

      Urgent data present: Indicates that urgent data is present in the
      RMBE data area, and the producer cursor points to one byte beyond
      the last byte of urgent data.

   R-bit

      Request for consumer cursor update: Indicates that a consumer
      cursor update is requested, bypassing any window size
      optimization algorithms.

   F-bit

      Failover validation indicator: sent by a peer to guard against
      data loss during failover when the TCP connection is being moved
      to another SMC-R link in the link group.
      When this bit is set, the only other fields in the CDC message
      that are significant are the type, length, SMC-R alert token and
      the sequence number. The receiver must validate that it has fully
      processed the RDMA write described by the previous CDC message
      bearing the same sequence number as this validation message. If
      it has, no further action is required. If it has not, the TCP
      connection must be reset. This processing is described in detail
      in section 4.6.1.

   D-bit

      Sending done indicator: Sent by a peer when it is done writing
      new data into the receiver's RMBE data area.

   C-bit

      Peer Closed Connection indicator: Sent by a peer when it is
      completely done with this connection and will no longer be making
      any updates to the receiver's RMBE, and will also not be sending
      any more control messages.

   A-bit

      Abnormal Close indicator: Sent by a peer when the connection is
      abnormally terminated (for example, the TCP connection was
      Reset). When sent, it indicates that the peer is completely done
      with this connection and will no longer be making any updates to
      this RMBE or sending any more control messages. It also indicates
      that the RMBE owner must flush any remaining data on this
      connection and surface an error return code to any outstanding
      socket APIs on this connection (same processing as receiving an
      RST segment on a TCP connection).

Appendix B. Socket API considerations

   A key design goal for SMC-R is to require no application changes for
   exploitation. It is confined to socket applications using stream
   (i.e. TCP protocol) sockets over IPv4 or IPv6. Because the switch to
   the SMC-R protocol occurs after a TCP connection is established, no
   changes are required in the socket address family or in the IP
   addresses and ports that the socket application is using.
   Existing socket APIs that allow the application to retrieve local and
   remote socket address structures for an established TCP connection
   (for example, getsockname() and getpeername()) will continue to
   function as they have before. Existing DNS setup and APIs for
   resolving hostnames to IP addresses and vice versa also continue to
   function without any changes. In general, all of the usual socket
   APIs that are used for TCP communications (send APIs, recv APIs,
   etc.) will continue to function as they do today even if SMC-R is
   used as the underlying protocol.

   Each SMC-R enabled implementation does, however, need to pay special
   attention to any socket APIs that have a reliance on the underlying
   TCP and IP protocols and ensure that their behavior in an SMC-R
   environment is reasonable and minimizes impact to the application.
   While the basic socket API set is fairly similar across different
   Operating Systems, when it comes to advanced socket API options there
   is more variability. Each implementation needs to perform a detailed
   analysis of its API options and SMC-R impact and implications. As
   part of that step, a discussion or review with other implementations
   supporting SMC-R would be useful to ensure a consistent
   implementation.

   setsockopt() / getsockopt() considerations

   These APIs allow socket applications to manipulate socket, transport
   (TCP/UDP) and IP level options associated with a given socket.
   Typically, a platform restricts the number of IP options available to
   stream (TCP) socket applications given their connection oriented
   nature. The general guideline here is to continue processing these
   APIs in a manner that allows for application compatibility. Some
   options will be relevant to the SMC-R protocol and will require
   special processing under the covers.
For example, the ability to 5464 manipulate TCP send and receive buffer sizes is still valid for SMC- 5465 R. However, other options may have no meaning for SMC-R. For 5466 example, if an application enabled the TCP_NODELAY option to disable 5467 Nagle's algorithm it should have no real effect in SMC-R 5468 communications as there is no notion of Nagle's algorithm with this 5469 new protocol. But the implementation must accept the TCP_NODELAY 5470 option as it does today and save it so that it can be later extracted 5471 via getsockopt() processing. Note that any TCP or IP level options 5472 will still have an effect on any TCP/IP packets flowing for an SMC-R 5473 connection (i.e. as part of TCP/IP connection establishment and 5474 TCP/IP connection termination packet flows). 5476 Under the covers manipulation of the TCP options will also include 5477 the SMC layer setting and reading the SMC-R experimental option 5478 before and after completion of the 3 way TCP handshake. 5480 Appendix C. Rendezvous Error scenarios 5482 Error scenarios in setting up and managing SMC-R links are discussed 5483 in this section. 5485 C.1. SMC Decline during CLC negotiation 5487 A peer to the SMC-R CLC negotiation can send SMC Decline in lieu of 5488 any expected CLC message to decline SMC and force the TCP connection 5489 back to IP fabric. There can be several reasons for an SMC Decline 5490 during the CLC negotiation including: RNIC went down, SMC-R forbidden 5491 by local policy, subnet (IPv4) or prefix (IPv6) doesn't match, lack 5492 of resources to perform SMC-R. In all cases when an SMC Decline is 5493 sent in lieu of an expected CLC message, no confirmation is required 5494 and the TCP connection immediately falls back to using the IP fabric. 5496 To prevent ambiguity between CLC messages and application data, an 5497 SMC Decline cannot "chase" another CLC message. SMC Decline can only 5498 be sent in lieu of an expected CLC message. 
   For example, if the client sends SMC Proposal and then its RNIC goes
   down, it must wait for the SMC Accept from the server and then it can
   reply to that with an SMC Decline.

   This "no chase" rule means that if this TCP connection is not a first
   contact between RoCE peers, a server cannot send SMC Decline after
   sending SMC Accept; it can only break the TCP connection. Similarly,
   once the client sends SMC Confirm on a TCP connection that isn't
   first contact, it is committed to SMC-R for this TCP connection and
   cannot fall back to IP.

C.2. SMC Decline during LLC negotiation

   For a TCP connection that represents first contact between RoCE
   peers, it is possible for SMC to fall back to IP during the LLC
   negotiation. This is possible until the first contact SMC link is
   confirmed. For example, see Figure 40. After a first contact SMC
   link is confirmed, fallback to IP is no longer possible. The rule
   that this translates to is: a first contact peer can send SMC Decline
   at any time during LLC negotiation until it has successfully sent its
   CONFIRM LINK (request or response) flow. After that point, it cannot
   fall back to IP.
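
   The rules stated in C.1 and C.2 can be summarized as a small
   decision function. The following is a minimal sketch under stated
   assumptions, not a definitive implementation: the Phase enumeration,
   the function name, and its parameters are hypothetical, and the
   sketch models only the "no chase" rule and the CONFIRM LINK cutoff
   as described in these two sections.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical view of where one peer stands in the rendezvous."""
    CLC_MY_TURN = 1   # this peer is expected to send the next CLC message
    CLC_WAITING = 2   # this peer is waiting for the other side's CLC message
    CLC_DONE = 3      # the CLC exchange has completed


def may_send_smc_decline(phase, first_contact, confirm_link_sent):
    """Return True if this peer may still send SMC Decline.

    Assumed inputs (illustrative, not from the specification):
      phase             -- Phase value for this peer
      first_contact     -- True if this TCP connection is first contact
      confirm_link_sent -- True once this peer has successfully sent its
                           CONFIRM LINK (request or response) flow
    """
    if phase is Phase.CLC_MY_TURN:
        # SMC Decline may be sent in lieu of the expected CLC message.
        return True
    if phase is Phase.CLC_WAITING:
        # "No chase" rule: a Decline cannot follow a CLC message already sent.
        return False
    # After the CLC exchange, only a first contact peer that has not yet
    # sent CONFIRM LINK can still fall back to IP.
    return first_contact and not confirm_link_sent
```

   When the function returns False and an error occurs, the remaining
   option described in the text is to break the TCP connection.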

     Host X -- Server                               Host Y -- Client
   +-------------------+                          +-------------------+
   | PeerID = PS1      |                          |      PeerID = PC1 |
   |              +------+                      +------+              |
   |   QP 8       |RNIC 1|     SMC-R link 1     |RNIC 2|       QP 64  |
   | RKey X |     |MAC MA|<-------------------->|MAC MB|     |        |
   |        |     |GID GA|   attempted setup    |GID GB|     | RKey Y2|
   |        \/    +------+                      +------+    \/        |
   |+--------+         |                          |         +--------+|
   ||  RMB   |         |                          |         |  RMB   ||
   |+--------+         |                          |         +--------+|
   |        /\    +------+                      +------+    /\        |
   |        |     |RNIC 3|                      |RNIC 4|     | Rkey W2|
   |        |     |MAC MC|                      |MAC MD|     |        |
   |   QP 9       |GID GC|                      |GID GD|       QP 65  |
   |              +------+                      +------+              |
   +-------------------+                          +-------------------+

      SYN / SYN-ACK / ACK TCP 3-way handshake with TCP option
   <--------------------------------------------------------->

         SMC Proposal / SMC Accept / SMC Confirm exchange
   <--------------------------------------------------------->

      CONFIRM LINK(request, link 1)
   ..........................................................>

                                CONFIRM LINK(response, link 1)
                           X..................................
                           :
                           :  RoCE write failure
                           :.................................>

                                 SMC Decline(PC1, reason code)
   <----------------------------------------------------------

             Connection data flows over IP fabric
   <--------------------------------------------------------->

   Legend:
   ------------  TCP/IP and CLC flows
   ............  RoCE (LLC) flows

            Figure 40 SMC Decline during LLC negotiation

C.3. The SMC Decline window

   Because SMC-R does not support fall-back to IP for a TCP connection
   that is already using RDMA, there are specific rules on when SMC
   Decline, which signals a fall-back to IP because of an error or
   problem with the RoCE fabric, can be sent during TCP connection
   setup.
There is a point of no return after which a connection cannot 5574 fall back to IP, and RoCE errors that occur after this point require 5575 the connection to be broken with a RST flow in the IP fabric. 5577 For first contact, that point of no return is after the Add Link LLC 5578 message has been successfully sent for the second SMC-R link. 5579 Specifically, the server cannot fall back to IP after receiving 5580 either a positive write completion indication for the Add Link 5581 request, or after receiving the Add Link response from the client, 5582 whichever comes first. The client cannot fall back to IP after 5583 either sending a negative Add Link response, receiving a positive 5584 write complete on a positive Add Link response, or receiving a 5585 Confirm Link for the second SMC-R link from the server, whichever 5586 comes first. 5588 For subsequent contact, that point of no return is after the last 5589 send of the CLC negotiation completes. This, in combination with the 5590 rule that error "chasers" are not allowed during CLC negotiation, 5591 means that the server cannot send SMC Decline after sending an SMC 5592 Accept, and the client cannot send an SMC Decline after sending an 5593 SMC Confirm. 5595 C.4. Out of synch conditions during SMC-R negotiation 5597 The SMC Accept CLC message contains a "first contact" flag that 5598 indicates to the client whether or not the server believes it is 5599 setting up a new link group, or using an existing link group. This 5600 flag is used to detect an out of synch condition between the client 5601 and the server. The scenario detected is as follows: There is a 5602 single existing SMC-R link between the peers. After the client sends 5603 the SMC Proposal CLC message, the existing SMC-R link between the 5604 client and the server fails. 
   The client cannot chase the SMC Proposal CLC message with an SMC
   Decline CLC message in this case because the client does not yet know
   that the server would have wanted to choose the SMC-R link that just
   crashed. The QP that failed recovers before the server returns its
   SMC Accept CLC message. This means that there is a QP but no SMC
   link. Since the server had not yet learned of the SMC link failure
   when it sent the SMC Accept CLC message, it attempts to re-use the
   SMC link that just failed. This means the server would not set the
   "first contact" flag, indicating to the client that the server thinks
   it is reusing an SMC-R link. However, the client does not have an
   SMC-R link that matches the server's specification. Because the
   "first contact" flag is off, the client realizes it is out of synch
   with the server and sends SMC Decline to cause the connection to fall
   back to IP.

C.5. Timeouts during CLC negotiation

   Because the SMC-R negotiation flows as TCP data, there are built-in
   timeouts and retransmits at the TCP layer for individual messages.
   Implementations must also protect the overall TCP/CLC handshake with
   a timer or timers to prevent connections from hanging indefinitely
   due to SMC-R processing. This can be done with individual timers for
   individual CLC messages or an overall timer for the entire exchange,
   which may include the TCP handshake and the CLC handshake under one
   timer or separate timers. This decision is implementation dependent.

   If the TCP and/or CLC handshakes time out, the TCP connection must be
   terminated as it would be in a legacy IP environment when connection
   setup doesn't complete in a timely manner. Because the CLC flows are
   TCP messages, if they cannot be sent and received in a timely
   fashion, the TCP connection is not healthy and would not work if
   fallback to IP were attempted.

C.6.
Protocol errors during CLC negotiation

   Protocol errors occur during CLC negotiation when a message is
   received that is not expected. For example, a peer that is expecting
   a CLC message but instead receives application data has experienced a
   protocol error, which also indicates a likely software error because
   the two sides are out of synch. When application data is expected,
   this data is not parsed to ensure it's not a CLC message.

   When a peer is expecting a CLC negotiation message, any parsing error
   except a bad enumerated value in that message must be treated as
   application data. The CLC negotiation messages are designed with
   beginning and ending eyecatchers to help verify that they are
   actually the expected message. If other parsing errors in an
   expected CLC message occur, such as incorrect length fields or
   incorrectly formatted fields, the message must be treated as
   application data.

   All protocol errors, with the exception of bad enumerated values,
   must result in termination of the TCP connection. No fallback to IP
   is allowed in the case of a protocol error because if the protocols
   are out of synch, mismatched, or corrupted, then data and security
   integrity cannot be ensured.

   The exception to this rule is enumerated values, for example the QP
   MTU values on SMC Accept and SMC Confirm. If a reserved value is
   received, the proper error response is to send SMC Decline and fall
   back to IP. The reason for this is that use of a reserved enumerated
   value indicates that the other partner likely has additional support
   that the receiving partner does not have. This indicated mismatch of
   SMC-R capabilities is not an integrity problem, but indicates that
   SMC-R cannot be used for this connection.

C.7.
Timeouts during LLC negotiation

   Whenever a peer sends an LLC message to which a reply is expected, it
   sets a timer after the send posts to wait for the reply. An expected
   response may be a reply flavor of the LLC message (for example
   CONFIRM LINK REPLY) or a new LLC message (for example an ADD LINK
   CONTINUATION expected from the server by the client if there are more
   Rkeys to communicate).

   On LLC flows that are part of a first contact setup of a link group,
   the value of the timer is implementation dependent but should be long
   enough to allow the other peer to have a write complete timeout and
   2-3 retransmits of an SMC Decline on the TCP fabric. For LLC flows
   that are maintaining the link group and not part of first contact
   setup of a link group, the timers may be shorter. Upon receipt of an
   expected reply, the timer is cancelled. If a timer pops without a
   reply having been received, the sender must initiate a recovery
   action.

   During first contact processing, failure of an LLC verification timer
   is a should-not-occur condition that indicates a problem with one of
   the endpoints. The reason for this is that if there is a "routine"
   failure in the RoCE fabric that causes an LLC verification send to
   fail, the sender will get a write completion failure and will then
   send SMC Decline to the partner. The only time an LLC verification
   timer will expire on a first contact is when the sender thinks the
   send succeeded but it actually didn't. Because of the reliable
   connected nature of QP connections on the RoCE fabric, this indicates
   a problem with one of the peers, not with the RoCE fabric.

   After the reliable connected QP for the first SMC-R link in a link
   group is set up on initial contact, the client sets a timer to wait
   for a RoCE verification message from the server that the QP is
   actually connected and usable.
If the server experiences a failure 5705 sending its QP confirmation message, it will send SMC Decline, which 5706 should arrive at the client before the client's verification timer 5707 expires. If the client's timer expires without receiving either an 5708 SMC Decline or a RoCE message confirmation from the server, there is 5709 a problem either with the server or with the TCP fabric. In either 5710 case the client must break the TCP connection and clean up the SMC-R 5711 link. 5713 There are two scenarios in which the client's response to the QP 5714 verification message fails to reach the server. The main difference 5715 is whether or not the client has successfully completed the send of 5716 the CONFIRM LINK response. 5718 In the normal case of a problem with the RoCE path, the client will 5719 learn of the failure by getting a write completion failure, before 5720 the server's timer expires. In this case, the client sends an SMC 5721 Decline CLC message to the server and the TCP connection falls back 5722 to IP. 5724 If the client's send of the Confirmation message receives a positive 5725 return code but for some reason still does not reach the server, or 5726 the client's SMC Decline CLC message fails to reach the server after 5727 the client fails to send its RoCE confirmation message, then the 5728 server's timer will time out and the server must break the TCP 5729 connection by sending RST. This is expected to be a very rare case, 5730 because if the client cannot send its CONFIRM LINK RSP LLC message, 5731 the client should get a negative return code and initiate fallback to 5732 IP. A client receiving a positive return code on a send that fails 5733 to reach the server should be extremely rare. 5735 C.7.1. Recovery actions for LLC timeouts and failures 5737 The following table describes recovery actions for LLC timeouts. 
   A write completion failure or other indication of failure to send on
   the send of the LLC command is treated the same as a timeout.

   LLC Message: CONFIRM LINK from server (first contact, first link in
   the link group)

      Timer waits for: CONFIRM LINK reply from client

      Recovery action: Break the TCP connection by sending RST and
      clean up the link. The server should have received an SMC
      Decline from the client by now if the client had an LLC send
      failure.

   LLC Message: CONFIRM LINK from server (first contact, second link in
   the link group)

      Timer waits for: CONFIRM LINK reply from client

      Recovery action: The second link was not successfully set up.
      Send DELETE LINK to the client. Connection data cannot flow in
      the first link in the link group until the reply to this DELETE
      LINK is received, to prevent the peers from being out of synch on
      the state of the link group.

   LLC Message: CONFIRM LINK from server (not first contact)

      Timer waits for: CONFIRM LINK reply from client

      Recovery action: Clean up the new link and set a timer to retry.
      Send DELETE LINK to the client, in case the client has a longer
      timer interval, so the client can stop waiting.

   LLC Message: CONFIRM LINK REPLY from client (first contact)

      Timer waits for: ADD LINK from server

      Recovery action: Clean up the SMC-R link and break the TCP
      connection by sending RST over the IP fabric. There is a problem
      with the server. If the server had a send failure, it should
      have sent SMC Decline by now.

   LLC Message: ADD LINK from server (first contact)

      Timer waits for: ADD LINK reply from client

      Recovery action: Break the TCP connection with RST and clean up
      RoCE resources. The connection is past the point where the
      server can fall back to IP, and if the client had a send problem
      it should have sent SMC Decline by now.
5787 LLC Message: ADD LINK from server (not first contact) 5789 Timer waits for: ADD LINK reply from client 5791 Recovery action: Clean up resources (QP, RMB keys, etc) for the 5792 new link and treat the link that the ADD LINK was sent over as if 5793 it had failed. If there is another link available to resend the 5794 ADD LINK and the link group still needs another link, retry the 5795 ADD LINK over another link in the link group. 5797 LLC Message: ADD LINK REPLY from client (and there are more Rkeys to 5798 be communicated) 5800 Timer waits for: ADD LINK CONTINUATION from server 5801 Recovery action: Treat the same as ADD LINK timer failure 5803 LLC Message: ADD LINK REPLY or ADD LINK CONTINUATION reply from 5804 client (and there are no more Rkeys to be communicated, for the 5805 second link in a first contact scenario) 5807 Timer waits for: CONFIRM LINK from the server on the new link 5809 Recovery action: The new link has failed to set up. Send DELETE 5810 LINK to the server. Do not consider the socket opened to the 5811 client application until receiving confirmation from the server 5812 in the form of a DELETE LINK request for this link and sending 5813 the reply (to prevent the partners from being out of synch on the 5814 state of the link group). 5816 Set a timer to send another ADD LINK to the server if there is 5817 still an unused RNIC on the client side. 5819 LLC Message: ADD LINK REPLY or ADD LINK CONTINUATION reply from the 5820 client (and there are no more Rkeys to be communicated) 5822 Timer waits for: CONFIRM LINK from the server, over the new link 5824 Recovery action: Send a DELETE LINK to the server for the new 5825 link, then clean up any resource allocated for the new link and 5826 set a timer to send ADD LINK to the server if there is still an 5827 unused RNIC on the client side. The new link has failed to set 5828 up, but the link that the ADD LINK exchange occurred over is 5829 unaffected. 

   LLC Message: ADD LINK CONTINUATION from server

      Timer waits for: ADD LINK CONTINUATION REPLY from client

      Recovery action: Treat the same as ADD LINK timer failure

   LLC Message: ADD LINK CONTINUATION reply from client (first contact,
   and RMB count fields indicate that the server owes more ADD LINK
   CONTINUATION messages)

      Timer waits for: ADD LINK CONTINUATION from the server

      Recovery action: Clean up the SMC link and break the TCP
      connection by sending RST. There is a problem with the server.
      If the server had a send failure, it should have sent SMC
      Decline by now.

   LLC Message: ADD LINK CONTINUATION reply from client (not first
   contact and RMB count fields indicate that the server owes more ADD
   LINK CONTINUATION messages)

      Timer waits for: ADD LINK CONTINUATION from server

      Recovery action: Treat as if the client detected link failure on
      the link the ADD LINK exchange is using. Send DELETE LINK to
      the server over another active link if one exists, otherwise
      clean up the link group.

   LLC Message: DELETE LINK from client

      Timer waits for: DELETE LINK request from server

      Recovery action: If the scope of the request is to delete a
      single link, the surviving link over which the client sent the
      DELETE LINK is no longer usable either. If this is the last link
      in the link group, end TCP connections over the link group by
      sending RST packets. If there are other surviving links in the
      link group, resend over a surviving link. Also send a DELETE
      LINK over a surviving link for the link that the client attempted
      to send the initial DELETE LINK message over. If the scope of
      the request is to delete the entire link group, try resending on
      other links in the link group until success is achieved. If all
      sends fail, tear down the link group and any TCP connections that
      exist on it.
5877     LLC Message: DELETE LINK from server (scope: entire link group)

5879        Timer waits for: Confirmation from the adapter that the message
5880        was delivered.

5882        Recovery action: Tear down the link group and any TCP connections
5883        that exist over it.

5885     LLC Message: DELETE LINK from server (scope: single link)

5887        Timer waits for: DELETE LINK reply from the client

5889        Recovery action: The link over which the server sent the DELETE
5890        LINK is no longer usable either. If this is the last link in the
5891        link group, end TCP connections over the link group by sending
5892        RST packets. If there are other surviving links in the link
5893        group, resend over a surviving link. Also send a DELETE LINK
5894        over a surviving link for the link that the server attempted to
5895        send the initial DELETE LINK message over. If the scope of the
5896        request is to delete the entire link group, try resending on
5897        other links in the link group until success is achieved. If all
5898        sends fail, tear down the link group and any TCP connections that
5899        exist on it.

5901     LLC Message: CONFIRM RKEY from the client

5903        Timer waits for: CONFIRM RKEY REPLY from the server

5905        Recovery action: Perform normal client procedures for detection
5906        of a failed link. The link over which the message was sent has
5907        failed.

5909     LLC Message: CONFIRM RKEY from the server

5911        Timer waits for: CONFIRM RKEY REPLY from the client

5913        Recovery action: Perform normal server procedures for detection
5914        of a failed link. The link over which the message was sent has
5915        failed.

5917     LLC Message: TEST LINK from the client

5919        Timer waits for: TEST LINK REPLY from the server

5921        Recovery action: Perform normal client procedures for detection
5922        of a failed link. The link over which the message was sent has
5923        failed.
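
For simple request/reply exchanges such as CONFIRM RKEY and TEST LINK, the timer behavior reduces to "no reply before the deadline means the link that carried the request has failed". One way this might be modeled is sketched below. The `ReplyTimer` class, its method names, and the timeout value are assumptions made for illustration; this document does not prescribe a timer implementation.

```python
import time


class ReplyTimer:
    """Tracks outstanding LLC requests that expect a reply (illustrative only)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical timeout value, in seconds
        self.clock = clock          # injectable clock, useful for testing
        self.pending = {}           # (link_id, msg_type) -> deadline

    def sent(self, link_id, msg_type):
        # Start (or restart) the timer for a request sent on this link.
        self.pending[(link_id, msg_type)] = self.clock() + self.timeout_s

    def reply_received(self, link_id, msg_type):
        # The matching reply arrived in time: cancel the timer.
        self.pending.pop((link_id, msg_type), None)

    def expired_links(self):
        # Links with at least one overdue reply are treated as failed.
        now = self.clock()
        failed = sorted({link for (link, _), deadline in self.pending.items()
                         if deadline <= now})
        # Drop every pending entry for a failed link; normal failed-link
        # procedures take over from here.
        self.pending = {key: deadline for key, deadline in self.pending.items()
                        if key[0] not in failed}
        return failed
```

A caller would poll `expired_links()` periodically and run its normal client or server failed-link procedures for each link returned.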
5925     LLC Message: TEST LINK from the server

5927        Timer waits for: TEST LINK REPLY from the client

5929        Recovery action: Perform normal server procedures for detection
5930        of a failed link. The link over which the message was sent has
5931        failed.

5933     The following table describes recovery actions for invalid LLC
5934     messages. These could be misformatted or contain out-of-sync data.

5936     LLC Message received: CONFIRM LINK from server

5938        What could be bad: Incorrect link information
5939        Recovery action: Protocol error. The link must be brought down
5940        by sending a DELETE LINK for the link over another link in the
5941        link group if one exists. If this is first contact, fall back to
5942        IP by sending SMC Decline to the server.

5944     LLC Message received: ADD LINK

5946        What could be bad: Undefined enumerated MTU value

5948        Recovery action: Send negative ADD LINK reply with reason code
5949        x'2'

5951     LLC Message received: ADD LINK reply from client

5953        What could be bad: Client side link information that would result
5954        in a parallel link being set up

5956        Recovery action: Parallel links are not permitted. Delete the
5957        link by sending DELETE LINK to the client over another link in
5958        the link group.

5960     LLC Message received: Any link group command from the server except
5961     DELETE LINK for the entire link group

5963        What could be bad: Client has sent DELETE LINK for the link that
5964        the message was received on

5966        Recovery action: Ignore the LLC message. Worst case, the server
5967        will time out. Best case, the DELETE LINK crosses with the
5968        command from the server and the server realizes it failed.

5970     LLC Message received: ADD LINK CONTINUATION from the server or ADD
5971     LINK CONTINUATION REPLY from the client

5973        What could be bad: Number of RMBs provided doesn't match count
5974        given on initial ADD LINK or ADD LINK reply message

5976        Recovery action: Protocol error.
            Treat as if detected link outage

5978     LLC Message received: DELETE LINK from client

5980        What could be bad: Link indicated doesn't exist

5982        Recovery action: If the link is in the process of being cleaned
5983        up, assume timing window and ignore message. Otherwise, send
5984        DELETE LINK REPLY with reason code 1.

5986     LLC Message received: DELETE LINK from server

5988        What could be bad: Link indicated doesn't exist

5990        Recovery action: Send DELETE LINK REPLY with reason code 1.

5992     LLC Message received: CONFIRM RKEY from either client or server

5994        What could be bad: No Rkey provided for one or more of the links
5995        in the link group

5997        Recovery action: Treat as if detected failure of the link(s) for
5998        which no Rkey was provided

6000     LLC Message received: DELETE RKEY

6002        What could be bad: Specified Rkey doesn't exist

6004        Recovery action: Send negative DELETE RKEY response.

6006     LLC Message received: TEST LINK reply

6008        What could be bad: User data doesn't match what was sent in the
6009        TEST LINK request

6011        Recovery action: Treat as if detected that the link has gone
6012        down. This is a protocol error.

6014     LLC Message received: Unknown LLC type with high order bits of opcode
6015     equal b'10'

6017        What could be bad: This is an optional LLC message which the
6018        receiver does not support

6020        Recovery action: Ignore (silently discard) the message

6022     LLC Message received: Any unambiguously incorrect or out-of-sync LLC
6023     message

6025        What it indicates: Link is out of sync

6027        Recovery action: Treat as if detected that the link has gone
6028        down. Note that an unsupported or unknown LLC opcode whose two
6029        high order bits are b'10' is not an error, and must be silently
6030        discarded. Any other unknown or unsupported LLC opcode is an
6031        error.

6033  C.8. Failure to add second SMC-R link to a link group

6035     When there is any failure in setting up the second SMC-R link in an
6036     SMC-R link group, including confirmation timer expiration, the SMC-R
6037     link group is allowed to continue, without available failover.
6038     However, this situation is extremely undesirable and the server must
6039     endeavor to correct it as soon as it can.

6041     The server peer in the SMC-R link group must set a timer to drive it
6042     to retry setup of a failed additional SMC-R link. The server will
6043     immediately retry the SMC-R link setup when the first of the
6044     following events occurs:

6046     o The retry timer expires

6048     o A new RNIC becomes available to the server, on the same LAN as the
6049       SMC-R link group

6051     o An "Add Link" LLC request message is received from the client,
6052       which indicates availability of a new RNIC on the client side.

6054  Authors' Addresses

6056     Mike Fox
6057     IBM
6058     3039 Cornwallis Rd.
6059     Research Triangle Park, NC 27709

6061     Email: mjfox@us.ibm.com

6063     Constantinos (Gus) Kassimis
6064     IBM
6065     3039 Cornwallis Rd.
6066     Research Triangle Park, NC 27709

6068     Email: kassimis@us.ibm.com

6070     Jerry Stevens
6071     IBM
6072     3039 Cornwallis Rd.
6073     Research Triangle Park, NC 27709

6075     Email: sjerry@us.ibm.com