TCPM working group                                                M. Fox
Internet Draft                                               C. Kassimis
Intended Status: Informational                                J. Stevens
Expires: 5/31/2014                                                   IBM
                                                       November 13, 2013


                 Shared Memory Communications over RDMA
                draft-fox-tcpm-shared-memory-rdma-03.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 1, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   This document describes the Shared Memory Communications over RDMA
   (SMC-R) protocol. This protocol provides RDMA communications to TCP
   endpoints in a manner that is transparent to socket applications. It
   further provides for dynamic discovery of partner RDMA capabilities
   and dynamic setup of RDMA connections, transparent high availability
   and load balancing when redundant RDMA network paths are available,
   and it maintains many of the traditional TCP/IP qualities of service
   such as filtering that enterprise users demand, as well as TCP socket
   semantics such as urgent data.

Table of Contents

   1. Introduction
      1.1. Summary of changes in this draft
      1.2. Protocol overview
      1.3. Definition of common terms
   2. Link Architecture
      2.1. Remote Memory Buffers (RMBs)
      2.2. SMC-R Link groups
         2.2.1. Link types
         2.2.2. Maximum number of links in link group
         2.2.3. Forming and managing link groups
         2.2.4. SMC-R link identifiers
      2.3. SMC-R resilience and load balancing
   3. SMC-R Rendezvous architecture
      3.1. TCP options
      3.2. Connection Layer Control (CLC) messages
      3.3. LLC messages
      3.4. Rendezvous flows
         3.4.1. First contact
            3.4.1.1. TCP Options pre-negotiation
            3.4.1.2. Client Proposal
            3.4.1.3. Server acceptance
            3.4.1.4. Client confirmation
            3.4.1.5. Link (QP) confirmation
            3.4.1.6. Second SMC-R link setup
               3.4.1.6.1. Client processing of "Add Link" LLC message
                          from server
               3.4.1.6.2. Server processing of "Add Link" reply LLC
                          message from the client
               3.4.1.6.3. Exchange of Rkeys on second SMC-R link
               3.4.1.6.4. Aborting SMC-R and falling back to IP
         3.4.2. Subsequent contact
            3.4.2.1. SMC-R proposal
            3.4.2.2. SMC-R acceptance
            3.4.2.3. SMC-R confirmation
            3.4.2.4. TCP data flow race with SMC Confirm CLC message
         3.4.3. First contact variation: creating a parallel link group
         3.4.4. Normal SMC-R link termination
         3.4.5. Link group management flows
            3.4.5.1. Adding and deleting links in an SMC-R link group
               3.4.5.1.1. Server initiated Add Link processing
               3.4.5.1.2. Client initiated Add Link processing
               3.4.5.1.3. Server initiated Delete Link Processing
               3.4.5.1.4. Client initiated Delete Link request
            3.4.5.2. Managing multiple Rkeys over multiple SMC-R links
                     in a link group
               3.4.5.2.1. Adding a new RMB to an SMC-R link group
               3.4.5.2.2. Deleting an RMB from an SMC-R link group
               3.4.5.2.3. Adding a new SMC-R link to a link group with
                          multiple RMBs
            3.4.5.3. Serialization of LLC exchanges, and collisions
               3.4.5.3.1. Collisions with ADD LINK / CONFIRM LINK
                          exchange
               3.4.5.3.2. Collisions during DELETE LINK exchange
               3.4.5.3.3. Collisions during CONFIRM_RKEY exchange
   4. SMC-R memory sharing architecture
      4.1. RMB element allocation considerations
      4.2. RMB and RMBE format
      4.3. RMBE control information
      4.4. Use of RMBEs
         4.4.1. Initializing and accessing RMBEs
         4.4.2. RMB element reuse and conflict resolution
      4.5. SMC-R protocol considerations
         4.5.1. SMC-R protocol optimized window size updates
         4.5.2. Small data sends
         4.5.3. TCP Keepalive processing
      4.6. RMB data flows
         4.6.1. Scenario 1: Send flow, window size unconstrained
         4.6.2. Scenario 2: Send/Receive flow, window unconstrained
         4.6.3. Scenario 3: Send Flow, window constrained
         4.6.4. Scenario 4: Large send, flow control, full window size
                writes
         4.6.5. Scenario 5: Send flow, urgent data, window size
                unconstrained
         4.6.6. Scenario 6: Send flow, urgent data, window size closed
      4.7. Connection termination
         4.7.1. Normal SMC-R connection termination flows
            4.7.1.1. Abnormal SMC-R connection termination flows
            4.7.1.2. Other SMC-R connection termination conditions
   5. Security considerations
      5.1. VLAN considerations
      5.2. Firewall considerations
      5.3. IP Filters
      5.4. Intrusion Detection Services
      5.5. IP Security (IPSec)
      5.6. TLS/SSL
   6. IANA considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments
   9. Conventions used in this document
   Appendix A. Formats
      A.1. TCP option
      A.2. CLC messages
         A.2.1. Peer ID format
         A.2.2. SMC Proposal CLC message format
         A.2.3. SMC Accept CLC message format
         A.2.4. SMC Confirm CLC message format
         A.2.5. SMC Decline CLC message format
      A.3. LLC messages
         A.3.1. CONFIRM LINK LLC message format
         A.3.2. ADD LINK LLC message format
         A.3.3. ADD LINK CONTINUATION LLC message format
         A.3.4. DELETE LINK LLC message format
         A.3.5. CONFIRM RKEY LLC message format
         A.3.6. CONFIRM RKEY CONTINUATION LLC message format
         A.3.7. DELETE RKEY LLC message format
         A.3.8. TEST LINK LLC message format
   Appendix B. Socket API considerations
   Appendix C. Rendezvous Error scenarios
      C.1. SMC Decline during CLC negotiation
      C.2. SMC Decline during LLC negotiation
      C.3. The SMC Decline window
      C.4. Out of synch conditions during SMC-R negotiation
      C.5. Timeouts during CLC negotiation
      C.6. Protocol errors during CLC negotiation
      C.7. Timeouts during LLC negotiation
         C.7.1. Recovery actions for LLC timeouts and failures
      C.8. Failure to add second SMC-R link to a link group

1. Introduction

   This document is a specification of the Shared Memory Communications
   over RDMA (SMC-R) protocol. SMC-R is a protocol for Remote Direct
   Memory Access (RDMA) communication between TCP socket endpoints.
   SMC-R runs over networks that support RDMA over Converged Ethernet
   (RoCE).
   It is designed to permit existing TCP applications to benefit from
   RDMA without requiring modifications to the applications or
   predefinition of RDMA partners.

   SMC-R provides dynamic discovery of the RDMA capabilities of TCP
   peers and automatic setup of RDMA connections that those peers can
   use. SMC-R also provides transparent high availability and load
   balancing capabilities that are demanded by enterprise installations
   but are missing from current RDMA protocols. If redundant RoCE
   capable hardware such as RDMA NICs (RNICs) and RoCE capable switches
   is present, SMC-R can load balance over that redundant hardware and
   can also non-disruptively move TCP traffic from failed paths to
   surviving paths, all seamlessly to the application and the sockets
   layer. Because SMC-R preserves socket semantics and the TCP three-way
   handshake, many TCP qualities of service such as filtering, load
   balancing, and SSL encryption are preserved, as are TCP features such
   as urgent data.

   Because of the dynamic discovery and setup of SMC-R connectivity
   between peers, no RDMA connection manager (RDMA-CM) is required. This
   also means that support for UD queue pairs is not required.

   It is recommended that the SMC-R services be implemented in kernel
   space, which enables optimizations such as resource sharing between
   connections across multiple processes and also permits applications
   using SMC-R to spawn multiple processes (e.g., fork) without losing
   SMC-R functionality. A user space implementation is compatible with
   this architecture, but it may not support spawned processes (i.e.,
   fork), which limits sharing and resource optimization to TCP
   connections that originate from the same process.
   This might be an appropriate design choice if the use case is a
   system that hosts a large single process application that creates
   many TCP connections to a peer host, or in implementations where a
   kernel space implementation is not possible or introduces excessive
   overhead for kernel space to user space context switches.

   While SMC-R as specified in this document is designed to operate over
   RoCE fabrics, adjustments to the rendezvous methods could enable it
   to run over other RDMA fabrics such as InfiniBand and iWARP.

1.1. Summary of changes in this draft

   Significant changes in this architecture since the previous draft:

   o  Clarified that the TCP option code for SMC-R is 254.

   o  Added a limit of 255 RMBs per peer per link group.

   o  Clarified retry requirements in failover scenarios.

   o  Clarified that real memory backing RMBs need not be contiguous.

   o  Clarified recovery from various parsing and timing errors.

   o  Changed the format of SMC Decline to replace the two byte reason
      code with four bytes of peer diagnosis information.

   o  Modified format of RMB management flows: modified the format of
      the CONFIRM RKEY message and added the CONFIRM RKEY CONTINUATION
      LLC and DELETE RKEY messages.

   o  Added the term Connection Data Control (CDC) to describe RDMA
      messages sent to update cursors and flags on a connection.

1.2. Protocol overview

   SMC-R defines the concept of the SMC-R Link, which is a logical
   point-to-point link between TCP/IP stack peers over a RoCE fabric.
   An SMC-R link is bound to a specific hardware path, meaning a
   specific RNIC on each peer. SMC-R links are created and maintained by
   an SMC-R layer, which may reside in kernel or user space depending
   upon operating system and implementation requirements.
   The SMC-R layer resides below the sockets layer and directs data
   traffic for TCP connections between connected peers over the RoCE
   fabric using RDMA rather than over a TCP connection. The TCP/IP
   stack, with its fragmentation and packetization requirements, is
   bypassed and the application data is moved between peers using RDMA.

   An SMC-R link manages Remote Memory Buffers (RMBs), which are areas
   of memory that are available for SMC-R peers to write into using RDMA
   writes. Multiple TCP connections between peers may be multiplexed
   over a single SMC-R link, in which case the SMC-R layer manages the
   partitioning of the RMBs between the TCP connections. This
   multiplexing reduces the RDMA resources such as queue pairs and RMBs
   that are required to support multiple connections between stack
   peers, and also reduces the processing and delays related to setting
   up queue pairs, pinning memory, and other RDMA setup tasks when new
   TCP connections are created. In a kernel space SMC-R implementation
   in which the RMBs reside in kernel storage, this sharing and
   optimization works across multiple processes executing on the same
   host. In a user space SMC-R implementation in which the RMBs reside
   in user space, this sharing and optimization is limited to multiple
   TCP connections created by a single process, as separate RMBs and QPs
   will be required for each process.

   Multiple SMC-R links between the same two TCP/IP stack peers are also
   supported. If there is redundant hardware, for example two RNICs on
   each peer, separate SMC-R links are created between the peers to
   exploit that redundant hardware. The redundant links are available
   for load balancing as well as seamless failover. A set of SMC-R links
   that provides redundant connectivity is called a link group.

   SMC-R also introduces a rendezvous protocol that is used to
   dynamically discover the RDMA capabilities of TCP connection partners
   and exchange credentials necessary to exploit that capability if
   present. TCP connections are set up using the normal TCP 3-way
   handshake, with the addition of a new TCP option that indicates SMC-R
   capability. If both partners indicate SMC-R capability then at the
   completion of the 3-way TCP handshake the SMC-R layers in each peer
   take control of the TCP connection and use it to exchange additional
   connection level control (CLC) messages to negotiate SMC-R
   credentials such as queue pair (QP) information, addressability over
   the RoCE fabric, RMB buffer sizes, keys and addresses for accessing
   RMBs over RDMA, etc. If at any time during this negotiation a
   failure or decline occurs, the TCP connection falls back to using the
   IP fabric.

   If the SMC-R negotiation succeeds and either a new SMC-R link is set
   up or an existing SMC-R link is chosen for the TCP connection, then
   the SMC-R layers open the sockets to the applications and the
   applications use the sockets as normal. The SMC-R layer intercepts
   the socket reads and writes and moves the TCP connection data over
   the SMC-R link, "out of band" to the TCP connection which remains
   open and idle, except for termination flows and possible keepalive
   flows. Regular TCP sequence numbering methods are used for the TCP
   flows that do occur; data flowing over RDMA does not use or affect
   TCP sequence numbers.

   This architecture does not support fallback of active SMC-R
   connections to IP. Once connection data has completed the switch to
   RDMA, a TCP connection cannot be switched back to IP and will reset
   if RDMA becomes unusable.
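
   The transport decisions described above can be summarized in a short
   sketch. The Python below is illustrative only and is not part of the
   protocol definition: the names Transport, choose_transport and
   on_rdma_failure are hypothetical, and the boolean inputs stand in
   for the outcome of the TCP option exchange and the CLC negotiation.

```python
from enum import Enum


class Transport(Enum):
    IP = "ip"        # data flows over the normal TCP/IP path
    RDMA = "rdma"    # data flows over an SMC-R link
    RESET = "reset"  # connection is reset


def choose_transport(client_has_option, server_has_option, clc_succeeded):
    """Pick the data path once the TCP 3-way handshake has completed.

    Assumption: each argument is a plain boolean summarizing a step that
    the real protocol performs with TCP options and CLC messages.
    """
    if not (client_has_option and server_has_option):
        return Transport.IP   # at least one peer did not indicate SMC-R
    if not clc_succeeded:
        return Transport.IP   # failure or decline during negotiation
    return Transport.RDMA


def on_rdma_failure(current):
    """An active SMC-R connection cannot fall back to IP; it resets."""
    if current is Transport.RDMA:
        return Transport.RESET
    return current
```

   Note the asymmetry this captures: fallback to IP is available only
   while negotiation is still in progress, never after connection data
   has switched to RDMA.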

   The SMC-R protocol defines the format of the Remote Memory Buffers
   that are used to receive TCP connection data written over RDMA, as
   well as the semantics for managing and writing to these buffers.

   Finally, SMC-R defines link level control (LLC) messages that are
   exchanged over the RoCE fabric between peer SMC-R layers to manage
   the SMC-R links and link groups. These include messages to test and
   confirm connectivity over an SMC-R link, add and delete SMC-R links
   to or from the link group, and exchange RMB addressability
   information.

1.3. Definition of common terms

   This section provides definitions of terms that have a specific
   meaning to the SMC-R protocol and are used throughout this document.

   SMC-R link

      An SMC-R Link is a logical point to point connection over the
      RoCE fabric via specific physical adapters (MAC/GID). The Link
      is formed during the first contact sequence of the TCP/IP 3 way
      handshake sequence that occurs over the IP fabric. During this
      handshake an RDMA RC-QP connection is formed between the two peer
      SMC hosts and is defined as the SMC Link. The SMC Link can then
      support multiple TCP connections between the two peers. An SMC
      link is associated with a single VLAN and is not routable.

   SMC-R link group

      An SMC-R Link Group is a group of SMC-R Links, typically each over
      unique RoCE adapters, between the same two SMC-R peers. Each link
      in the link group has equal characteristics such as the same VLAN
      ID, access to the same RMB(s) and the same TCP server / client.

   SMC-R peer

      The SMC-R Peer stack is the peer software stack within the peer
      Operating System with respect to the Shared Memory Communications
      (messaging) protocol.

   SMC-R Rendezvous

      The SMC-R Rendezvous is the SMC-R peer discovery and handshake
      sequence that occurs transparently over the IP (Ethernet) fabric
      during and immediately after the TCP connection 3 way handshake
      by exchanging the SMC capabilities and credentials using an
      experimental TCP option and CLC messages.

   TCP Client

      The TCP socket-based peer that initiates a TCP connection.

   TCP Server

      The TCP socket-based peer that accepts a TCP connection.

   CLC messages

      The SMC-R protocol defines a set of Connection Layer Control
      Messages that flow over the TCP connection and are used to
      manage SMC link rendezvous at TCP connection setup time. This
      mechanism is analogous to SSL setup messages.

   LLC Commands

      The SMC-R protocol defines a set of RoCE Link Layer Control
      Commands that flow over the RoCE fabric using RDMA sendmsg and
      are used to manage SMC Links, SMC Link Groups and SMC Link Group
      RMB expansion and contraction.

   RMB

      A Remote (RDMA) Memory Buffer is a fixed or pinned buffer
      allocated in each of the peer hosts for a TCP (via SMC-R)
      connection. The RMB is registered to the RNIC and allows remote
      access by the remote stack using RDMA semantics. Each host is
      passed the peer's RMB specific access information (RKey and RMB
      Element offset) during the SMC-R rendezvous process. The host
      stores socket application user data directly into the peer's RMB
      using RDMA over RoCE.

   Rtoken

      The combination of an RMB's Rkey and RDMA virtual address. An
      Rtoken gives an RDMA peer addressability to an RMB.

   RMBE

      The Remote Memory Buffer Element is an area of an RMB that is
      allocated to a specific TCP connection. The RMBE contains data
      for the TCP connection. The RMBE represents the TCP receive
      buffer whereby the remote peer writes into the RMBE and the local
      peer reads from the local RMBE. The alert token resolves to a
      specific RMBE.

   Alert Token

      The SMC-R alert token is a four byte value that uniquely
      identifies the TCP connection over an SMC-R connection. The
      alert token allows the SMC peer to quickly identify the target
      TCP connection that now has new work. The format of the token is
      defined by the owning SMC-R end point and is considered opaque to
      the remote peer. However, the token should not simply be an index
      to an RMBE element; it should reference a TCP connection and be
      able to be validated to avoid reading data from stale connections.

   RNIC

      The RDMA capable Network Interface Card (RNIC) is an Ethernet NIC
      that supports RDMA semantics and verbs using RoCE.

   First Contact

      Describes an SMC-R negotiation to set up the first link in a link
      group.

   Subsequent Contact

      Describes an SMC-R negotiation between peers who are using an
      already existing SMC-R link group.

2. Link Architecture

   An SMC-R link is based on reliably connected queue pairs (QPs) that
   form a "logical point to point link" between the two SMC-R peers over
   a RoCE fabric. An SMC-R link extends from one SMC-R stack to the
   other, where typically each peer stack would reside on separate
   hosts.

   [Figure 1 diagram: RNIC 1 (QP 8, MAC A / GID A) and RNIC 2 (QP 64,
   MAC B / GID B) are joined by an SMC-R Link that crosses a RoCE
   fabric on VLAN M.]

                     Figure 1 SMC-R Link Overview

   Figure 1 gives an overview of SMC-R peer to peer connectivity, which
   is called the SMC-R Link. The SMC-R Link forms a logical point to
   point connection between two SMC-R peers via RoCE.
   The SMC Link is defined and identified by the following attributes:

      SMC-R Link = RC QPs (source VMAC GID QP + target VMAC GID QP +
      VLAN ID)

   The SMC-R Link is associated with a single and specific VLAN. VLAN
   exploitation is required for SMC-R as it is a key isolation attribute
   of this architecture. The RoCE fabric is the same physical fabric
   used for standard TCP/IP over Ethernet communications, with Converged
   Enhanced Ethernet (CEE-enabled) switches.

   An SMC-R Link is designed to support multiple TCP connections between
   the same two peers. An SMC Link is intended to be long lived while
   the underlying TCP connections can dynamically come and go. The
   associated RMBs can also be dynamically added and removed from the
   link as needed. The first TCP connection between the peers
   establishes the SMC-R link. Subsequent TCP connections then use the
   previously established link. When the last TCP connection terminates
   the link can then be terminated, typically after an implementation
   defined idle time-out period has elapsed. The TCP server is
   responsible for initiating and terminating the SMC Link.

2.1. Remote Memory Buffers (RMBs)

   Figure 2 shows the hosts X and Y and their associated RMBs within
   each host. With the SMC-R link and the associated RMB keys (Rkeys)
   and RDMA virtual addresses, each SMC stack can remotely access its
   peer's RMBs using RDMA. The RKeys and virtual addresses are exchanged
   during the rendezvous processing when the link is established. The
   combination of the Rkey and the virtual address is the Rtoken. Note
   that the SMC-R Link ends at the QP providing access to the RMB (via
   the Link + RToken).

   [Figure 2 diagram: Host X (Protection Domain X, RNIC 1, QP 8) and
   Host Y (Protection Domain Y, RNIC 2, QP 64) are joined by an SMC-R
   Link over RoCE on VLAN A. Within Host X, QP 8 reaches the local RMB
   through RToken (X); within Host Y, QP 64 reaches the local RMB
   through RToken (Y).]

                     Figure 2 SMC link and RMBs

   An SMC-R link can support multiple RMBs which are independently
   managed by each peer. The number of and the size of RMBs are managed
   by the peers based on host unique memory management requirements;
   however, the maximum number of RMBs that can be associated to a link
   group on one peer is 255. The QP has a single protection domain, but
   each RMB has a unique RToken. All RTokens must be exchanged with the
   peer.

   Each peer manages the RMBs in its local memory for its remote SMC-R
   peer by sharing access to the RMBs via Rtokens with its peers. The
   remote peer writes into the RMBs via RDMA and the local peer (RMB
   owner) then reads from the RMBs.

   When two peers decide to use SMC-R for a given TCP connection, they
   each allocate a local RMB Element for the TCP connection and
   communicate the location of this local RMB Element during rendezvous
   processing. To that end, RMB elements are created in pairs, with one
   RMB element allocated locally on each peer of the SMC-R link.
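
   The allocation limits that apply to RMBs in this section (at most
   255 RMBs per peer per link group, at most 255 equally sized elements
   per RMB, element index in the range 1-255) can be illustrated with a
   minimal sketch. The Python below is not a definitive implementation:
   the class names Rmb and LinkGroupBuffers, the first-fit search and
   the use of connection identifiers as dictionary values are all
   assumptions made for illustration.

```python
MAX_RMBS_PER_LINK_GROUP = 255   # limit per peer per link group
MAX_ELEMENTS_PER_RMB = 255      # valid element index range is 1-255


class Rmb:
    """One RMB whose elements all share the same length (hypothetical)."""

    def __init__(self, element_len):
        self.element_len = element_len
        self.owners = {}            # element index -> connection id

    def allocate(self, conn_id):
        """Return the first free element index, or None if the RMB is full."""
        for index in range(1, MAX_ELEMENTS_PER_RMB + 1):
            if index not in self.owners:
                self.owners[index] = conn_id
                return index
        return None

    def release(self, index):
        self.owners.pop(index, None)


class LinkGroupBuffers:
    """Local RMBs that one peer has associated with a link group."""

    def __init__(self):
        self.rmbs = []

    def allocate(self, conn_id, element_len):
        """Return (rmb number, element index), or None when no RMB can be added."""
        for rmb_no, rmb in enumerate(self.rmbs):
            if rmb.element_len == element_len:
                index = rmb.allocate(conn_id)
                if index is not None:
                    return rmb_no, index
        if len(self.rmbs) >= MAX_RMBS_PER_LINK_GROUP:
            # Caller may fall back to IP or, if it is the server,
            # create a parallel link group.
            return None
        rmb = Rmb(element_len)
        self.rmbs.append(rmb)
        return len(self.rmbs) - 1, rmb.allocate(conn_id)
```

   Each peer would run such an allocator independently for its own
   memory, which is why RMB allocation can be asymmetric between the
   two peers of a link group.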

        ---      +-----------+----------------+
        /\       |Eyecatcher |                |
         |       +-----------+                |
         |       |                            |
   RMB Element 1 |                            |
         |       |       Receive Buffer       |
         |       |                            |
         |       |                            |
        \/       |                            |
        ---      +-----------+----------------+
        /\       |Eyecatcher |                |
         |       +-----------+                |
         |       |                            |
   RMB Element 2 |                            |
         |       |       Receive Buffer       |
         |       |                            |
         |       |                            |
        \/       |                            |
        ---      +----------------------------+
                 |             .              |
                 |             .              |
                 |             .              |
                 |             .              |
                 |    (up to 255 elements)    |
                 +----------------------------+

                        Figure 3 RMB Format

   Figure 3 illustrates the basic format of an RMB. The RMB is a virtual
   memory buffer whose backing real memory is pinned, which can support
   up to 255 TCP connections to exactly one remote SMC-R peer. Each RMB
   is therefore associated with the SMC-R links for the two peers and a
   specific RoCE Protection Domain. Other than the 2 peers identified by
   the SMC-R link, no other SMC-R peers can have RDMA access to an RMB;
   this requires a unique Protection Domain for every SMC-R Link. This
   is critical to ensure integrity of SMC-R communications.

   RMBs are allocated with multiple entries for efficiency; multiple TCP
   connections across an SMC link can share the same memory for RDMA
   purposes, reducing the overhead of having to register additional
   memory with the RNIC for every new TCP connection. The number of
   entries in an RMB and the size of each RMB Element are entirely
   governed by the owning peer subject to the SMC-R architecture rules.
   Each peer can decide the level of resource sharing that is desirable
   across TCP connections based on local constraints such as available
   system memory. Each RMB supports multiple RMB Elements, one per
   TCP connection; however, all RMB elements within a given RMB must
   have the same size.
   An RMB Element is identified to the remote SMC-R peer via an RMB
   Element Token which consists of the following:

   o  RMB RToken: The combination of the Rkey and virtual address
      provided by the RNIC that identifies the start of the RMB for RDMA
      operations.

   o  RMB Index: Identifies the RMB element index in the RMB. Used to
      locate a specific RMB element within an RMB. Valid value range is
      1-255.

   o  RMB element length: The length of the RMB element's control area
      plus the length of the receive buffer. This length is equal for
      all RMB elements in a given RMB, but it can vary across different
      RMBs.

   Multiple RMBs can be associated to an SMC-R link and each peer in an
   SMC-R link manages allocation of its RMBs. RMB allocation can be
   asymmetric. For example, server X can allocate 2 RMBs to an SMC-R
   link while server Y allocates 5. This provides maximum
   implementation flexibility to allow hosts to optimize RMB management
   for their own local requirements. The maximum number of RMBs that can
   be allocated on one peer to a link group is 255. If more RMBs are
   required, the peer may fall back to IP for subsequent connections or,
   if the peer is the server, create a parallel link group.

   One use case for multiple RMBs is multiple receive buffer sizes.
   Since every element in an RMB must be the same size, multiple RMBs
   with different element sizes can be allocated if varying receive
   buffer sizes are required.

   Also, since the maximum number of TCP connections whose receive
   buffers can be allocated to an RMB is 255, multiple RMBs may be
   required to provide capacity for large numbers of TCP connections
   between two peers.

   Separately from the RMB, the stack that owns each RMB maintains
   control data for each RMB element within its local control
   structures.
   The control data contains flags for maintaining the
   state of the TCP data (for example, urgent indicator) and, most
   importantly, two cursors which are illustrated in Figure 4:

   o The peer producer cursor: This is a wrapping offset into the RMB
     element's receive buffer that points to the next byte of data to
     be written by the peer. This cursor is provided by the peer in a
     Connection Data Control (CDC) message, which is sent using RDMA
     message passing, and tells the local stack how far it can consume
     data in the RMBE receive buffer.

   o The peer consumer cursor: This is a wrapping offset into the
     peer's RMB element's receive buffer that points to the next byte
     of data to be consumed by the peer in its own RMBE. This stack
     cannot write into the peer's RMBE beyond this point without
     causing data loss. This cursor is also provided by the peer using
     a Connection Data Control message.

   Each TCP connection peer maintains its cursors for a TCP connection's
   RMBE in its local control structures. In other words, the stack that
   writes into a peer's RMBE provides its producer cursor to the peer
   whose RMB it has written into. The stack that reads from its RMBE
   provides its consumer cursor to the writing peer. In this manner the
   reads and writes between peers are kept coordinated.

   For example, referring to Figure 4, peer B writes the hatched data
   into the receive buffer of peer A's RMBE. After that write
   completes, peer B uses a CDC message to update its producer cursor to
   peer A, to indicate to peer A how much data is available for peer A
   to consume. The CDC message that peer B sends to peer A wakes up
   peer A and notifies it that there is data to be consumed.
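   The cursor bookkeeping described above can be sketched as follows.
   This is a minimal illustration for a single RMBE, not the wire
   format: the Cursor class, its wrap counter, and the helper names are
   assumptions introduced here so that a full buffer can be told apart
   from an empty one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cursor:
    wrap: int    # hypothetical counter: how many times the offset has wrapped
    offset: int  # byte offset into the RMBE receive buffer


def advance(cursor: Cursor, nbytes: int, size: int) -> Cursor:
    """Move a cursor forward by nbytes in a receive buffer of `size` bytes."""
    total = cursor.offset + nbytes
    return Cursor(cursor.wrap + total // size, total % size)


def readable(producer: Cursor, consumer: Cursor, size: int) -> int:
    """Bytes written by the peer that the RMBE owner has not yet consumed."""
    return (producer.wrap - consumer.wrap) * size + producer.offset - consumer.offset


def writable(producer: Cursor, consumer: Cursor, size: int) -> int:
    """Bytes the writer may still place without passing the consumer cursor."""
    return size - readable(producer, consumer, size)


size = 16
producer = consumer = Cursor(0, 0)
producer = advance(producer, 12, size)  # peer B writes 12 bytes, then sends a CDC message
consumer = advance(consumer, 10, size)  # peer A consumes 10 bytes, then sends a CDC message
producer = advance(producer, 6, size)   # a second write wraps past the end of the buffer
print(readable(producer, consumer, size), writable(producer, consumer, size))
```

   In this sketch the writer would check writable() before each RDMA
   write, which is one way to avoid overrunning the consumer cursor.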
   Similarly, when peer A consumes data written by peer B, it uses a CDC
   message to update its consumer cursor to peer B to let peer B know
   how much data it has consumed, so peer B knows how much space is
   available for further writes. If peer B were to write enough data to
   peer A that it would wrap the RMBE receive buffer and exceed the
   consumer cursor, data loss would result.

   Note that this is a simplistic description of the control flows and
   they are optimized to minimize the number of CDC messages required,
   as described in 4.6. RMB data flows.

      Peer A's RMBE Control Info             Peer B's RMBE Control Info
     +--------------------------+           +--------------------------+
     |                          |           |                          |
     /----Peer producer cursor  |     +-----+-Peer consumer cursor     |
    /|                          |     |     |                          |
   | +--------------------------+     |     +--------------------------+
   |        Peer A's RMBE             |
   | +--------------------------+     |
   | |             +------------------+
   | |             |            |
   | |             \/           |
   | |             +------------|
   | |-------------+/////////// |
   | |//RDMA data written by // |
   | |/// peer B that is ////// |
   | |/available to be consumed/|
   | |///////////////////////// |
   | |///////// +---------------|
   | |----------+/\             |
   | |            |             |
    \|            |             |
     \           /              |
     |\---------/               |
     |                          |
     |                          |

                         Figure 4 RMBE cursors

   Additional flags and indicators are communicated between peers. In
   all cases, these flags and indicators are updated by the peer using
   CDC messages with the control information contained in inline data.
   More details on these additional flags and indicators are described
   in 4.3. RMBE control information.

2.2. SMC-R Link groups

   SMC-R links are logically grouped together to form an SMC-R Link
   Group.
   The purpose of the Link Group is to support multiple links
   between the same two peers to provide for:

   o Resilience: Provides transparent and dynamic switching of the link
     used by existing TCP connections during link failures, typically
     hardware related. TCP traffic using the failing link can be
     switched to an active link within the link group, avoiding
     disruptions to application workloads.

   o Link utilization: Provides an active/active link usage model
     allowing TCP traffic to be balanced across the links, which
     increases bandwidth and avoids hardware imbalances and
     bottlenecks. Note that both adapter and switch utilization can
     become potential resource constraint issues.

   SMC-R Link Group support is required. Resilience is not optional.

   Multiple links that are formed between the same two peers fall into
   two distinct categories:

   1. Equal Links: Links providing equal access to the same RMB(s) at
      both endpoints whereby all TCP connections associated with the
      links must have the same VLAN ID and have the same TCP server
      and TCP client roles or relationship.

   2. Unequal Links: Links providing access to unique, unrelated and
      isolated RMB(s) (i.e. for unique VLANs or unique and isolated
      application workloads, etc.) or having unique TCP server or
      client roles.

   Links that are logically grouped together forming an SMC-R Link Group
   must be equal links.

2.2.1. Link types

   Equal links within a link group also have another "Link Type"
   attribute based on the link's associated underlying physical path.
   The following SMC-R link types are defined:

   1. Single Link: the only active link within a link group

   2. Parallel Link: not allowed - SMC-R links having the same physical
      RNIC at both hosts

   3. Asymmetric Link: links that have unique RNIC adapters at one
      host but share a single adapter at the peer host

   4.
      Symmetric Link: links that have unique RNIC adapters at both
      hosts

   These link types are further explained in the following figures and
   descriptions.

   Figure 2 above shows the single link case. The single link
   illustrated in Figure 2 also establishes the SMC-R Link Group. Link
   groups are supposed to have multiple links, but when only one RNIC is
   available at both hosts then only a single link can be created. This
   is expected to be a transient case.

   Figure 5 shows the symmetric link case. Both hosts have unique and
   redundant RNIC adapters. This configuration meets the objectives for
   providing full RoCE redundancy required to provide the level of
   resilience required for high availability for SMC-R. While this
   configuration is not required, it is a strongly recommended "best
   practice" for the exploitation of SMC-R. Single and asymmetric links
   must be supported but are intended to provide for short term
   transient conditions, for example during a temporary outage or
   recycle of an RNIC.
           Host X                                     Host Y
    +-------------------+                      +-------------------+
    |                   |                      |                   |
    |    Protection     |                      |     Protection    |
    |    Domain X       |                      |     Domain Y      |
    |            +------+                      +------+            |
    |      QP 8  |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 64      |
    |RToken X|   |      |<-------------------->|      |   |        |
    |        |   |      |                      |      |   |RToken Y|
    |       \/   +------+                      +------+   \/       |
    |+--------+         |                      |        +--------+ |
    ||        |         |                      |        |        | |
    ||  RMB   |         |                      |        |  RMB   | |
    ||        |         |                      |        |        | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+   /\       |
    |RToken Z|   |      |     SMC-R Link 2     |      |   |RToken W|
    |        |   |RNIC 3|<-------------------->|RNIC 4|   |        |
    |      QP 9  |      |                      |      | QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

                     Figure 5 Symmetric SMC-R links

           Host X                                     Host Y
    +-------------------+                      +-------------------+
    |                   |                      |                   |
    |    Protection     |                      |     Protection    |
    |    Domain X       |                      |     Domain Y      |
    |            +------+                      +------+            |
    |      QP 8  |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 64      |
    |RToken X|   |      |<-------------------->|      |   |        |
    |        |   |      |                   .->|      |   |RToken Y|
    |       \/   +------+                 .`   +------+   \/       |
    |+--------+         |               .`     |        +--------+ |
    ||        |         |             .`       |        |        | |
    ||  RMB   |         |           .`         |        |  RMB   | |
    ||        |         |         .`SMC-R      |        |        | |
    |+--------+         |       .` Link 2      |        +--------+ |
    |       /\   +------+     .`               +------+            |
    |RToken Z|   |      |   .`                 |      |down or     |
    |        |   |RNIC 3|<-`                   |RNIC 4|unavailable |
    |      QP 9  |      |                      |      |            |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

                    Figure 6 Asymmetric SMC-R links

   In the example provided by Figure 6, Host X has two RNICs but Host Y
   only has one RNIC. This configuration allows for the creation of an
   asymmetric link. While an asymmetric link will provide some
   resilience (for example, when RNIC 1 fails), ideally each host should
   provide two redundant RNICs. This should be a transient case, and
   when RNIC 4 becomes available, this configuration must transition to
   a symmetric link configuration.
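   As an illustration only, the four link types could be distinguished
   from the RNIC pair that a link uses, seen from one host. The Link
   tuple and the classify() helper below are hypothetical names, and
   the treatment of a candidate that shares a different RNIC with each
   of two existing links is an assumption, since the text does not
   cover that case.

```python
from typing import NamedTuple, Sequence


class Link(NamedTuple):
    local_rnic: str   # RNIC used at this host
    remote_rnic: str  # RNIC used at the peer host


def classify(candidate: Link, existing: Sequence[Link]) -> str:
    """Classify a candidate link against the links already in the link group."""
    if not existing:
        return "single"
    if candidate in existing:
        # Same physical RNIC at both hosts as an existing link: not allowed.
        return "parallel"
    shares_local = any(l.local_rnic == candidate.local_rnic for l in existing)
    shares_remote = any(l.remote_rnic == candidate.remote_rnic for l in existing)
    if shares_local or shares_remote:
        # Assumption: sharing an adapter at either host counts as asymmetric.
        return "asymmetric"
    return "symmetric"


link_1 = Link("RNIC 1", "RNIC 2")
print(classify(link_1, []))                           # first link in the group
print(classify(Link("RNIC 3", "RNIC 4"), [link_1]))   # Figure 5 layout
print(classify(Link("RNIC 3", "RNIC 2"), [link_1]))   # Figure 6 layout
```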
           Host X                                     Host Y
    +-------------------+                      +-------------------+
    |                   |                      |                   |
    |    Protection     |                      |     Protection    |
    |    Domain X       |                      |     Domain Y      |
    |            +------+     SMC-R Link 1     +------+            |
    |      QP 8  |RNIC 1|<-------------------->|RNIC 2| QP 64      |
    |RToken X|   |      |                      |      |   |        |
    |        |   |      |<-------------------->|      |   |RToken Y|
    |       \/   +------+     SMC-R Link 2     +------+   \/       |
    |+--------+  QP 9   |                      | QP 65  +--------+ |
    ||        |    |    |                      |   |    |        | |
    ||  RMB   |<---+    |                      |   +--->|  RMB   | |
    ||        |         |                      |        |        | |
    |+--------+         |                      |        +--------+ |
    |            +------+                      +------+            |
    |     down or|      |                      |      |down or     |
    | unavailable|RNIC 3|                      |RNIC 4|unavailable |
    |            |      |                      |      |            |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

              Figure 7 SMC-R parallel links (not supported)

   Figure 7 shows parallel links, which are two links in the link group
   that use the same hardware. This configuration is not permitted.
   Because SMC-R multiplexes multiple TCP connections over an SMC-R link
   and both links are using the exact same hardware, there is no
   additional redundancy or capacity benefit obtained from this
   configuration. However, this configuration does add the unnecessary
   overhead of additional queue pairs, generation of additional Rkeys,
   etc.

2.2.2. Maximum number of links in link group

   The SMC-R protocol defines a maximum of 8 symmetric SMC-R links
   within a single SMC-R link group. This allows for support for up to
   8 unique physical paths between peer hosts. However, to meet the
   basic requirements for redundancy, support for at least 2
   symmetric links must be implemented. Supporting more than 2
   links also simplifies implementation for practical matters relating
   to dynamically adding and removing links, for example starting a
   third SMC-R link prior to taking down one of the two existing links.
   Recall that all links within a link group must have equal access to
   all associated RMBs.
   The SMC-R protocol allows an implementation to choose its own
   appropriate value for the maximum number of symmetric links. The
   implementation value must not exceed the architecture limit of 8 and
   must not be lower than 2, because the SMC-R protocol requires
   redundancy. This does not mean that two RNICs are physically
   required to enable SMC-R connectivity, but at least two RNICs for
   redundancy are strongly recommended.

   The SMC-R stacks exchange their implementation maximum link values
   during the link group establishment using the defined maximum link
   value in the CONFIRM LINK LLC command. Once the initial exchange
   completes, the value is set for the life of the link group. The
   maximum link value can be provided by both the server and client. The
   server must supply a value, whereas the client maximum link value is
   optional. When the client does not supply a value, it indicates that
   the client accepts the server supplied maximum value. If the client
   provides a value, it cannot exceed the server maximum value. If the
   client passes a lower value, this lower value becomes the
   final negotiated maximum number of symmetric links for this link
   group. Again, the minimum value is 2.

   During run time the client must never request that the server add a
   symmetric link to a link group that would exceed the negotiated
   maximum link value. Likewise the server must never attempt to add a
   symmetric link to a link group that would exceed the negotiated
   maximum value.

   In terms of counting the active link count within a link group, the
   initial link (or the only or last link) is always counted as 1. Then
   as additional links are added they are either symmetric or asymmetric
   links.
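   The negotiation rules above can be summarized in a short sketch.
   The function and constant names are illustrative, and raising an
   error for an out-of-range value is an assumption; the text states
   the limits but not how a violation is handled.

```python
from typing import Optional

ARCHITECTURE_MAX_LINKS = 8  # architecture limit on symmetric links
MINIMUM_MAX_LINKS = 2       # lowest value an implementation may use


def negotiate_max_links(server_max: int, client_max: Optional[int] = None) -> int:
    """Return the maximum symmetric link count for the life of the link group."""
    if not MINIMUM_MAX_LINKS <= server_max <= ARCHITECTURE_MAX_LINKS:
        raise ValueError("server value must be between 2 and 8")
    if client_max is None:
        # The client supplied no value, so it accepts the server's maximum.
        return server_max
    if not MINIMUM_MAX_LINKS <= client_max <= server_max:
        raise ValueError("client value must be between 2 and the server value")
    # A lower client value becomes the final negotiated maximum.
    return client_max


def may_add_symmetric_link(current_symmetric_links: int, negotiated_max: int) -> bool:
    """True if one more symmetric link stays within the negotiated maximum."""
    return current_symmetric_links < negotiated_max


print(negotiate_max_links(8), negotiate_max_links(8, 4))
```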
   With regard to enforcing the maximum link rules, asymmetric links
   are an exception having a unique set of rules:

   o Asymmetric links are always limited to one asymmetric link allowed
     per link group.

   o Asymmetric links must not be counted in the maximum symmetric link
     count calculation. When tracking the current count or enforcing
     the negotiated maximum number of links, an asymmetric link is not
     to be counted.

2.2.3. Forming and managing link groups

   SMC-R link groups are self-defining. The first SMC-R link in a link
   group is created using TCP option flows on the TCP three-way
   handshake followed by CLC message flows over the TCP connection.
   Subsequent SMC-R links in the link group are created by sending LLC
   messages over an SMC-R link that already exists in the link group.
   Once an SMC-R link group is created, no additional SMC-R links in
   that group are created using TCP and CLC negotiation. Because
   subsequent SMC-R links are created exclusively by sending LLC
   messages over an existing SMC-R link in a link group, the membership
   of SMC-R links to a link group is self-defining.

   This architecture does not define a specific identifier for an SMC-R
   link group. This identification may be useful for network management
   and may be assigned in a platform specific manner, or in an extension
   to this architecture.

   In each SMC-R link group, one peer is the server for all TCP
   connections and the other peer is the client. If there are
   additional TCP connections between the peers that use SMC-R and have
   the client and server roles reversed, another SMC-R link group is set
   up between them with the opposite client-server relationship.

   This is required because there are specific responsibilities divided
   between the client and server in the management of an SMC-R link
   group.
   In this architecture, the decision of whether to use an existing
   SMC-R link group or create a new SMC-R link group for a
   TCP connection is made exclusively by the server.

   Management of the links in an SMC-R link group is also a server
   responsibility. The server is responsible for adding and deleting
   links in a link group. The client may request that the server take
   certain actions but the final responsibility is the server's.

2.2.4. SMC-R link identifiers

   This architecture defines multiple identifiers to identify SMC-R
   links and peers.

   o Link number: This is a one-byte value that identifies an SMC-R
     link within a link group. Both the server and the client use this
     number to distinguish an SMC-R link from other links within the
     same link group. It is only unique within a link group.

   o Link User ID: This is an architecturally opaque four-byte value
     that a peer uses to uniquely define an SMC-R link within its own
     space. This means that a link user ID is unique within one stack
     only. Each peer defines its own link user ID for a link. The
     peers exchange this information once during link setup and it is
     never used architecturally again. The purpose of this identifier
     is for network management, display, and debugging. For example,
     an operator on a client could provide the operator on the server
     with the server's link user ID if the client's operator needs the
     server's operator to check on the operation of a link that the
     client is having trouble with.

   o Peer ID: The SMC-R peer ID uniquely identifies a specific instance
     of a specific stack. It is required because in clustered and load
     balancing environments, an IP address does not uniquely identify a
     stack.
     An RNIC's MAC/GID also doesn't uniquely or reliably
     identify a stack because RNICs can go up and down and even be
     redeployed to other stacks in a multiple partitioned or
     virtualized environment. The peer ID is not only unique per stack
     but is also unique per instance of a stack, meaning that if a
     stack is restarted, its peer ID changes.

2.3. SMC-R resilience and load balancing

   The SMC-R multilink architecture provides resilience for network
   high availability via failover capability to an alternate RoCE
   adapter.

   The SMC-R multilink architecture does not define primary, secondary
   or alternate roles to the links. Instead there are multiple active
   links representing multiple redundant RoCE paths over the same VLAN.

   If a hardware failure or a QP failure associated with an
   individual link occurs, then the TCP connections that were associated
   with the failing link are dynamically and transparently switched to
   use another available link. The server or the client can detect a
   failure and immediately move their TCP connections and then notify
   their peer via the DELETE LINK LLC command. The server must perform
   the actual link deletion.

   The movement of TCP connections to another link can be accomplished
   without notifying or coordinating with the peer. The TCP connection
   movement is also transparent and nondisruptive to the TCP socket
   application workloads. After a failure, the surviving links and all
   associated hardware must handle the link group's workload.

   As each SMC-R stack begins to move active TCP connections to another
   link, all current RDMA write operations must be allowed to complete
   and then may be retried, in order, over the new link if the
   previously attempted RDMA write operation did not successfully
   complete.
   Any data writes or CDC messages for which the sender did
   not receive write completion must be replayed before any subsequent
   data or CDC write operations are sent. LLC messages are not retried
   over the new link because they are dependent on a known link
   configuration, which has just changed because of the failure. The
   initiator of an LLC message exchange that fails will be responsible
   for retrying once the link group configuration stabilizes.

   When a new link becomes available and is re-added to the link group
   then each stack is free to rebalance its current TCP connections as
   needed or only assign new TCP connections to the newly added link.
   Both the server and client are free to manage TCP connections across
   the link group as needed. TCP connection movement does not have to
   be stimulated by a link failure.

   The SMC-R architecture also defines orderly vs. disorderly failover.
   The type is communicated in the LLC Delete Link command and is simply
   a means to indicate that the link has terminated (disorderly) or link
   termination is imminent (orderly). The orderly link deletion could
   be initiated via operator command or programmatically to bring down
   an idle link. For example, an operator command could initiate orderly
   shut down of an adapter for service. Implementation of the two types
   is based on implementation requirements and is beyond the scope of
   the SMC-R architecture.

3. SMC-R Rendezvous architecture

   Rendezvous is the process that SMC-R capable peers use to dynamically
   discover each other's capabilities, negotiate SMC-R connections, set
   up SMC-R links and link groups, and manage those link groups. A key
   aspect of SMC-R rendezvous is that it occurs dynamically and
   automatically, without requiring SMC link configuration to be defined
   by an administrator.
   SMC-R Rendezvous starts with the TCP/IP three-way handshake during
   which connection peers use TCP options to announce their SMC-R
   capabilities. If both endpoints are SMC-R capable, then Connection
   Layer Control (CLC) messages are exchanged between the peers' SMC-R
   layers over the newly established TCP connection to negotiate SMC-R
   credentials. The CLC message mechanism is analogous to the messages
   exchanged by SSL.

   If a new SMC-R link is being set up, Link Layer Control (LLC)
   messages are used to confirm RDMA connectivity. LLC messages are
   also used by the SMC-R layers at each peer to manage the links and
   link groups.

   Once an SMC-R link is set up or agreed to by the peers, the TCP
   sockets are passed to the peer applications which use them as normal.
   The SMC-R layer, which resides under the sockets layer, transmits the
   socket data between peers over RDMA using the SMC-R protocol,
   bypassing the TCP/IP stack.

3.1. TCP options

   During the TCP/IP three-way handshake, the client and server indicate
   their support for SMC-R by including experimental TCP option 254 on
   the three-way handshake flows, in accordance with
   draft-ietf-tcpm-experimental-options-01.txt. The magic number value
   used is the string 'SMCR' in EBCDIC (IBM-1047) encoding (0xE2D4C3D9).

   After completion of the 3-way TCP handshake each peer queries its
   peer's options. If both peers set the TCP option on the three-way
   handshake, inline SMC-R negotiation occurs using CLC messages. If
   neither peer, or only one peer, set the TCP option, SMC-R cannot be
   used for the TCP connection, and the TCP connection completes setup
   using the IP fabric.

3.2. Connection Layer Control (CLC) messages

   CLC messages are sent as data payload over the newly opened TCP
   connection between SMC-R layers at the peers.
   They are analogous to
   the messages used to exchange parameters for SSL.

   Use of CLC messages is detailed in the following sections. The
   following list provides a summary of the defined CLC messages and
   their purposes:

   o SMC PROPOSAL: Sent from the client to propose that this TCP
     connection is eligible to be moved to SMC-R. The client identifies
     itself and its subnet to the server and passes the SMC-R elements
     for a suggested RoCE path via the MAC and GID.

   o SMC ACCEPT: Sent from the server to accept the client's TCP
     connection SMC proposal. The server responds to the client's
     proposal by identifying itself to the client and passing the
     elements of a RoCE path that the client can use to perform RDMA
     writes to the server. This consists of SMC-R link elements such as
     RoCE MAC, GID, RMB information etc.

   o SMC CONFIRM: Sent from the client to confirm the server's
     acceptance of the SMC connection. The client responds to the
     server's acceptance by passing the elements of a RoCE path that the
     server can use to perform RDMA writes to the client. This consists
     of SMC-R link elements such as RoCE MAC, GID, RMB information etc.

   o SMC DECLINE: Sent from either the server or the client to reject
     the SMC connection, indicating the reason the peer must decline
     the SMC proposal and allowing the TCP connection to revert to
     IP connectivity.

3.3. LLC messages

   Link Layer Control (LLC) messages are sent between peer SMC-R layers
   over an SMC-R link to manage the link or the link group. LLC
   messages are sent using RoCE sendmsg with inline data and are 44
   bytes long. The 44-byte size is based on what can fit into a RoCE
   Work Queue Element (WQE) without requiring the posting of receive
   buffers.

   LLC messages generally follow a request-reply semantic.
   Each message
   has a request flavor and a reply flavor, and each request must be
   confirmed with a reply, except where otherwise noted. Use of LLC
   messages is detailed in the following sections. The following list
   provides a summary of the defined LLC messages and their purposes:

   o ADD LINK: Add a new link to a link group. Sent from the server to
     the client to initiate addition of a new link to the link group,
     or from the client to the server to request that the server
     initiate addition of a new link.

   o ADD LINK CONTINUATION: This is a continuation of ADD LINK that
     allows the ADD LINK to span multiple commands, because all the
     link information cannot be contained in a single ADD LINK message.

   o CONFIRM LINK: Used to confirm that RoCE connectivity over a newly
     created SMC-R link is working correctly. Initiated by the server,
     and both this message and its reply must flow over the SMC-R link
     being confirmed.

   o DELETE LINK: When initiated by the server, deletes a specific link
     from the link group or deletes the entire link group. When
     initiated by the client, requests that the server delete a
     specific link or the entire link group.

   o CONFIRM RKEY: Informs the peer on the SMC-R link of the addition
     of an RMB to the link group.

   o CONFIRM RKEY CONTINUATION: This is a continuation of CONFIRM RKEY
     that allows the CONFIRM RKEY to span multiple commands, in the
     event that all of the information cannot be contained in a single
     CONFIRM RKEY message.
   o DELETE RKEY: Informs the peer on the SMC-R link of the deletion of
     one or more RMBs from the link group.

   o TEST LINK: Verifies that an already-active SMC-R link is active
     and healthy.

   o Optional LLC message: Any LLC message in which the two high order
     bits of the opcode are b'10' is an optional message and must be
     silently discarded by a receiving peer that does not support the
     opcode. No such messages are defined in this version of the
     architecture; however, the concept is defined to allow for
     toleration of possible advanced, optional functions.

   CONFIRM LINK and TEST LINK are sensitive to which link they flow on
   and must flow on the link being confirmed or tested. The other flows
   may flow over any active link in the link group. When there are
   multiple links in a link group, a response to an LLC message must
   flow over the same link that the original message flowed over, with
   the following exceptions:

   o ADD LINK request from a server in response to an ADD LINK from a
     client

   o DELETE LINK request from a server in response to a DELETE LINK
     from a client

3.4. Rendezvous flows

   Rendezvous information for SMC-R is exchanged as TCP options on
   the TCP 3-way handshake flows to indicate capability, followed by
   inline TCP negotiation messages to actually do the SMC-R setup.
   Formats of all rendezvous options and messages discussed in this
   section are detailed in Appendix A.

3.4.1. First contact

   First contact between RoCE peers occurs when a new SMC-R link group
   is being set up. This could be because no SMC-R links already exist
   between the peers, or the server decides to create a new SMC-R link
   group in parallel with an existing one.

3.4.1.1. TCP Options pre-negotiation

   The client and server indicate their SMC-R capability to each other
   using TCP option 254 on the TCP 3-way handshake flows.
   A client that wishes to do SMC-R will include TCP option 254 using a
   magic number equal to the EBCDIC (codepage IBM-1047) encoding of
   "SMCR" on its SYN flow.

   A server that supports SMC-R will include TCP option 254 with the
   magic number value of EBCDIC "SMCR" on its SYN-ACK flow. Because the
   server is listening for connections and does not know where client
   connections will come from, the server unconditionally includes this
   TCP option if it supports SMC-R. This may be required for servers
   such as Linux where proprietary extensions to the TCP stack are not
   practical. For proprietary servers which can add code to examine and
   react to packets during the three-way handshake, the server should
   only include the SMC-R TCP option on SYN-ACK if the client included
   it on its SYN packet.

   A client that supports SMC-R and meets the three conditions outlined
   above may optionally include the TCP option for SMC-R on its ACK
   flow, regardless of whether or not the server included it on its
   SYN-ACK flow. Some stacks may have to include it if the SMC-R layer
   cannot modify the options on the socket until the 3-way handshake
   completes. Proprietary client stacks should not include this option
   on the ACK flow, since including it on the SYN flow was sufficient to
   indicate the client's capabilities.

   Once the initial three-way TCP handshake is completed, each peer
   examines the socket options. Proprietary stacks may do this by
   examining what was actually provided on the SYN and SYN-ACK packets,
   and open stacks may do this by performing a getsockopt() operation to
   determine the options set by the peer. If neither peer, or only one
   peer, specified the TCP option for SMC-R, then SMC-R cannot be used
   on this connection and it proceeds using normal IP flows and
   processing.
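   The option check described above might look like the following
   sketch. The option kind (254) and magic number (0xE2D4C3D9) come
   from this document; the option length of 6 (kind, length, and a
   four-byte magic number) and all function names are assumptions made
   for illustration.

```python
import struct

TCP_OPT_EOL = 0
TCP_OPT_NOP = 1
TCP_OPT_EXPERIMENTAL = 254
SMCR_MAGIC = 0xE2D4C3D9  # 'SMCR' in EBCDIC (IBM-1047)
SMCR_OPT_LEN = 6         # assumed: kind + length + 4-byte magic number


def build_smcr_option() -> bytes:
    """Encode the SMC-R capability option as kind, length, magic number."""
    return struct.pack("!BBI", TCP_OPT_EXPERIMENTAL, SMCR_OPT_LEN, SMCR_MAGIC)


def has_smcr_option(options: bytes) -> bool:
    """Walk a raw TCP options field and look for the SMC-R option."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option, stop parsing
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed length, stop parsing
        if kind == TCP_OPT_EXPERIMENTAL and options[i:i + length] == build_smcr_option():
            return True
        i += length
    return False


def both_peers_support_smcr(syn_options: bytes, syn_ack_options: bytes) -> bool:
    """SMC-R negotiation proceeds only if both SYN and SYN-ACK carried the option."""
    return has_smcr_option(syn_options) and has_smcr_option(syn_ack_options)


mss = bytes([2, 4, 0x05, 0xB4])  # an ordinary MSS option placed ahead of the SMC-R option
print(both_peers_support_smcr(mss + build_smcr_option(), build_smcr_option()))
```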
   If both peers specified the TCP option for SMC-R, then the TCP
   connection is not started yet and the peers proceed to SMC-R
   negotiation using inline data flows, similar to the SSL negotiation
   model. The socket is not yet turned over to the applications;
   instead the respective SMC layers exchange CLC messages over the
   newly formed TCP connection.

3.4.1.2. Client Proposal

   If SMC-R is supported by both peers, the client sends an SMC Proposal
   CLC message to the server. On this flow from client to server it is
   not immediately apparent if this is a new or existing SMC-R link
   because in clustered environments a single IP address may represent
   multiple hosts. This type of cluster virtual IP address can be owned
   by a network based or host based layer 4 load balancer that
   distributes incoming TCP connections across a cluster of
   servers/hosts. Other clustered environments may also support the
   movement of a virtual IP address dynamically from one host in the
   cluster to another for high availability purposes. In summary, the
   client cannot pre-determine that a connection is targeting the same
   host simply by matching the destination IP address for outgoing TCP
   connections. Therefore it cannot pre-determine the SMC-R link that
   will be used for a new TCP connection. This information will be
   dynamically learned and the appropriate actions will be taken as the
   SMC-R negotiation handshake unfolds.

   On the SMC-R proposal message, the initiator (client) proposes use of
   SMC-R by including its peer ID and GID and MAC addresses, as well as
   the IP subnet number of the outgoing interface (if IPv4) or the IP
   prefix list for the network that the proposal is sent over (if IPv6).
   At this point in the flow, the client makes no local commitments of
   resources for SMC-R.
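   As a rough illustration of the subnet and prefix information carried
   on the proposal, the sketch below derives an IPv4 subnet number from
   an interface address and shows how a receiver could compare it with
   its own interfaces. The function names and the example addresses are
   hypothetical; the actual encoding of these fields is not shown here.

```python
import ipaddress
from typing import Iterable


def ipv4_subnet_number(interface: str) -> ipaddress.IPv4Network:
    """Subnet of an interface given as 'address/prefixlen'."""
    return ipaddress.IPv4Interface(interface).network


def on_common_ipv4_subnet(proposed: ipaddress.IPv4Network,
                          local_interfaces: Iterable[str]) -> bool:
    """True if any local interface is in the subnet named on the proposal."""
    return any(ipv4_subnet_number(i) == proposed for i in local_interfaces)


def shares_ipv6_prefix(proposed_prefixes: Iterable[str],
                       local_prefixes: Iterable[str]) -> bool:
    """True if at least one proposed IPv6 prefix is directly attached locally."""
    local = {ipaddress.IPv6Network(p) for p in local_prefixes}
    return any(ipaddress.IPv6Network(p) in local for p in proposed_prefixes)


client_subnet = ipv4_subnet_number("192.0.2.10/24")
print(on_common_ipv4_subnet(client_subnet, ["198.51.100.7/24", "192.0.2.99/24"]))
print(shares_ipv6_prefix(["2001:db8:1::/64"], ["2001:db8:2::/64"]))
```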
   When the server receives the SMC Proposal CLC message, it uses the
   peer ID provided by the client plus subnet or prefix information
   provided by the client, to determine if it already has a usable SMC-R
   link with this SMC-R peer. If there are one or more existing SMC-R
   links with this SMC-R peer, the server then decides which SMC-R link
   it will use for this TCP connection. See subsequent sections for the
   cases of reusing an existing SMC-R link or creating a parallel SMC-R
   link group between SMC-R peers.

   If this is a first contact between SMC-R peers, the server must
   validate that it is on the same VLAN as the client before continuing.
   For IPv4, the server does this by verifying that it has an interface
   with an IP subnet number that matches the subnet number set by the
   client on the SMC Proposal. For IPv6 it does this by verifying that
   it is directly attached to at least one IP prefix that was listed by
   the client in its SMC Proposal message.

   If the server agrees to use SMC-R, the server begins setup of a new
   SMC-R link by allocating local QP and RMB resources (setting its QP
   state to INIT) and providing its full SMC-R information in an SMC
   Accept CLC message to the client over the TCP connection, along with
   a flag set indicating that this is a first contact flow. While the
   SMC Accept message could flow over any route back to the client
   depending upon IP routing, the SMC-R credentials provided must be for
   the common subnet or prefix between the server and client, as
   determined above. If the server cannot or does not want to do SMC-R
   with the client, it sends an SMC Decline CLC message to the client
   and the connection data may begin flowing using normal TCP/IP flows.

3.4.1.3.
Server acceptance 1302 When the client receives the SMC Accept from the server, it uses the 1303 combination of the first contact flag, its GID/MAC and the GID/MAC 1304 returned by the server plus the VLAN that the connection is setting 1305 up over and the QP number provided by the server to determine if this 1306 is a new or existing SMC-R link. 1308 If it is an existing SMC-R link, and the client agrees to use that 1309 link for the TCP connection, see 3.4.2. Subsequent contact below. If 1310 it is a new SMC-R link between peers that already have an SMC link, 1311 then the server is starting a new SMC link group. 1313 Assuming this is either a first contact between peers or the server 1314 is starting a new SMC link group, the client now allocates local QP 1315 and RMB resources for the SMC-R link (setting the QP state to RTR or 1316 "ready to receive"), associates them with the server QP as learned on 1317 the SMC Accept CLC message, and sends an SMC Confirm CLC message to 1318 the server over the TCP connection with its SMC-R link information 1319 included. The client also starts a timer to wait for the server to 1320 confirm the reliable connected QP as described below. 1322 3.4.1.4. Client confirmation 1324 Upon receipt of the client's SMC Confirm CLC message, the server 1325 associates its QP for this SMC-R link with the client's QP as learned 1326 on the SMC Confirm CLC message and sets its QP state to RTS (ready to 1327 send). Now the client and the server have reliable connected QPs. 1329 3.4.1.5. Link (QP) confirmation 1331 Since setting up the SMC-R link and its QPs did not require any 1332 network flows on the RoCE fabric, the client and server must now 1333 confirm connectivity over the RoCE fabric. To accomplish this, the 1334 server will send a "Confirm Link" Link Layer Control (LLC) message to 1335 the client over the RoCE fabric. 
   The "Confirm Link" LLC message will provide the server's MAC, GID,
   and QP information for the connection, allow each partner to
   communicate the maximum number of links it can tolerate in this link
   group (the "link limit"), and additionally provide two link IDs:

   o a one-byte server-assigned link number that is used by both peers
     to identify the link within the link group and is unique only
     within a link group.

   o a four-byte link user ID. This opaque value is assigned by the
     server for the server's local use and is provided to the client
     for management purposes, for example to use in network management
     displays and products.

   When the server sends this message, it will set a timer for receiving
   confirmation from the client.

   When the client receives the server's "Confirm Link" LLC message, it
   will cancel the confirmation timer it set when it sent the SMC
   Confirm message. It will also advance its QP state to RTS and
   respond over the RoCE fabric with a "Confirm Link" response LLC
   message, providing its MAC, GID, QP number, and link limit,
   confirming the one-byte link number sent by the server, and providing
   its own four-byte link user ID to the server.
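
   The order of QP state changes and timers during first contact can be
   replayed with a small sketch. It is illustrative only: the names
   QpState, Peer, and first_contact are invented for this example, the
   RESET starting state is an assumption (the text names only INIT, RTR,
   and RTS), and the server cancelling its timer on receipt of the
   response is implied rather than stated.

```python
from enum import Enum


class QpState(Enum):
    RESET = "RESET"  # assumed starting state, not named in the text
    INIT = "INIT"
    RTR = "RTR"      # ready to receive
    RTS = "RTS"      # ready to send


class Peer:
    def __init__(self, name):
        self.name = name
        self.qp_state = QpState.RESET
        self.timer_running = False


def first_contact(server, client):
    """Replay the first contact steps in order; return the message trace."""
    trace = []

    # Server agrees to SMC-R: allocates QP and RMB, QP goes to INIT.
    server.qp_state = QpState.INIT
    trace.append("SMC Accept")

    # Client allocates its QP (RTR), sends SMC Confirm, starts its timer.
    client.qp_state = QpState.RTR
    client.timer_running = True
    trace.append("SMC Confirm")

    # Server associates its QP with the client's QP and moves to RTS.
    server.qp_state = QpState.RTS

    # Server probes the RoCE path and starts its own timer.
    server.timer_running = True
    trace.append("Confirm Link")

    # Client cancels its timer, moves to RTS, and responds over RoCE.
    client.timer_running = False
    client.qp_state = QpState.RTS
    trace.append("Confirm Link Rsp")

    # Assumption: the server cancels its timer when the response arrives.
    server.timer_running = False
    return trace
```

   The first two messages in the trace are CLC messages on the TCP
   connection; the last two are LLC messages on the RoCE fabric.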

     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   | PeerID = PS1      |                      |   PeerID = PC1    |
   |            +------+                      +------+            |
   |      QP 8  |RNIC 1|                      |RNIC 2|   QP 64    |
   |RToken X|   |MAC MA|                      |MAC MB|   |        |
   |        |   |GID GA|                      |GID GB|   |RToken Y|
   |       \/   +------+     (Subnet S1)      +------+   \/       |
   |+--------+         |                      |        +--------+ |
   ||  RMB   |         |                      |        |  RMB   | |
   |+--------+         |                      |        +--------+ |
   |            +------+                      +------+            |
   |            |RNIC 3|                      |RNIC 4|            |
   |            |MAC MC|                      |MAC MD|            |
   |            |GID GC|                      |GID GD|            |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

   SYN TCP options(254,"SMCR")
   <---------------------------------------------------------

   SYN-ACK TCP options(254, "SMCR")
   --------------------------------------------------------->

   ACK [TCP options(254, "SMCR")]
   <--------------------------------------------------------

   SMC Proposal(PC1,MB,GB,S1)
   <--------------------------------------------------------

   SMC Accept(PS1,first contact,MA,GA,QP8,RToken=X,RMB element index)
   --------------------------------------------------------->

   SMC Confirm(PC1,MB,GB,QP64,RToken=Y, RMB element index)
   <--------------------------------------------------------

   Confirm Link (MA,GA,QP8, link lim, server's link userid, linknum)
   .........................................................>

   Confirm Link Rsp(MB,GB,QP64, link lim, client link userid, linknum)
   <........................................................

   Legend:
   ------------ TCP/IP and CLC flows
   ............ RoCE (LLC) flows

            Figure 8 First contact rendezvous flows

   Technically, the data for the TCP connection could now flow over the
   RoCE path. However, if this is first contact, there is no alternate
   for this recently established RoCE path.
   Since in the current architecture there is no failover from RoCE to
   IP once connection data starts flowing, a failure of this path would
   disrupt the TCP connection, meaning that the level of redundancy and
   failover is less than that provided by IP. If the network has
   alternate RoCE paths available, they would not be usable at this
   point, which is an unacceptable condition.

3.4.1.6. Second SMC-R link setup

   Because of the unacceptable situation described above, TCP data will
   not be allowed to flow on the newly established SMC-R link until a
   second path has been set up, or at least attempted.

   If the server has a second RNIC available on the same VLAN, it
   attempts to set up the second SMC-R link over that second RNIC. If
   it only has one RNIC available on the VLAN, it will attempt to set up
   the second SMC-R link over that one RNIC. In the latter case, the
   server is attempting to set up an asymmetric link, in case the client
   does have a second RNIC on the VLAN.

   In either case the server allocates a new QP over the RNIC it is
   attempting to use for the second link, assigns a link number to the
   new link, and also creates an RToken for the RMB over this second QP
   (note that this means that the first and second QP each have their
   own RToken to represent the same RMB). The server provides this
   information, as well as the MAC and GID of the RNIC over which it is
   attempting to set up the second link, in an "Add Link" LLC message
   which it sends to the client over the SMC-R link that is already set
   up.

3.4.1.6.1. Client processing of "Add Link" LLC message from server

   When the client receives the server's "Add Link" LLC message, it
   examines the GID and MAC provided by the server to determine if the
   server is attempting to use the same server-side RNIC as the existing
   SMC-R link, or a different one.

   If the server is attempting to use the same server-side RNIC as the
   existing SMC-R link, then the client verifies that it has a second
   RNIC on the same VLAN. If it does not, the client rejects the "Add
   Link" request from the server, because the resulting link would be a
   parallel link, which is not supported within a link group. If the
   client does have a second RNIC on the same VLAN, it accepts the
   request and an asymmetric link will be set up.

   If the server is using a different server-side RNIC from the existing
   SMC-R link, then the client will accept the request and a second
   SMC-R link will be set up in this SMC-R link group. If the client
   has a second RNIC on the same VLAN, that second RNIC will be used for
   the second SMC-R link, creating symmetric links. If the client does
   not have a second RNIC on the same VLAN, it will use the same RNIC as
   was used for the initial SMC-R link, resulting in the setup of an
   asymmetric link in the SMC-R link group.

   In either case, when the client accepts the server's "Add Link"
   request, it allocates a new QP on the chosen RNIC and creates an Rkey
   over that new QP for the client-side RMB for the SMC-R link group,
   then sends an "Add Link" reply LLC message to the server providing
   that information as well as echoing the link number that was set by
   the server.

   If the client rejects the server's "Add Link" request, it sends an
   "Add Link" reply LLC message to the server with the reason code for
   the rejection.

3.4.1.6.2. Server processing of "Add Link" reply LLC message from the
   client

   If the client sends a negative response to the server or no reply is
   received, the server frees the RoCE resources it had allocated for
   the new link. Having a single link in an SMC-R link group is
   undesirable and the server's recovery is detailed in C.8. Failure to
   add second SMC-R link to a link group.
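
   The outcome of the second link attempt depends only on whether each
   side has a second RNIC on the VLAN. The rules above can be sketched
   as a small decision function. The names AddLinkOutcome,
   client_add_link_decision, and second_link_outcome are hypothetical
   and chosen for this illustration.

```python
from enum import Enum


class AddLinkOutcome(Enum):
    REJECT = "reject"
    ASYMMETRIC = "asymmetric link"
    SYMMETRIC = "symmetric link"


def client_add_link_decision(server_reuses_rnic, client_has_second_rnic):
    """Client-side handling of the server's "Add Link" request."""
    if server_reuses_rnic:
        if not client_has_second_rnic:
            # Same RNIC on both ends would be a parallel link.
            return AddLinkOutcome.REJECT
        return AddLinkOutcome.ASYMMETRIC
    if client_has_second_rnic:
        return AddLinkOutcome.SYMMETRIC
    # New server RNIC, but the client must reuse its only RNIC.
    return AddLinkOutcome.ASYMMETRIC


def second_link_outcome(server_has_second_rnic, client_has_second_rnic):
    """Combine the server's RNIC choice with the client's decision."""
    # The server reuses its RNIC only when it has no second one.
    server_reuses_rnic = not server_has_second_rnic
    return client_add_link_decision(server_reuses_rnic, client_has_second_rnic)
```

   Only the case in which neither side has a second RNIC ends in a
   rejection, which leaves the link group with a single link.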

   If the client sends a positive reply to the server with
   MAC/GID/QP/Rkey information, the server associates its QP for the new
   SMC-R link to the QP that the client provided. Now the new SMC-R
   link is in the same situation that the first was in after the client
   sent its ACK packet: there is a reliable connected QP over the new
   RoCE path, but there have been no RoCE flows to confirm that it is
   actually usable. So at this point the client and server will
   exchange "Confirm Link" LLC messages just like they did on the first
   SMC-R link.

   If either peer receives a failure during this second "Confirm Link"
   LLC exchange (either an immediate failure which implies that the
   message did not reach the partner, or a timeout), it sends a "Delete
   Link" LLC message to the partner over the first (and now only) link
   in the link group, which must be acknowledged before data can flow on
   the single link in the link group.

     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   | PeerID = PS1      |                      |   PeerID = PC1    |
   |            +------+                      +------+            |
   |      QP 8  |RNIC 1|                      |RNIC 2|   QP 64    |
   |RToken X|   |MAC MA|                      |MAC MB|   |        |
   |        |   |GID GA|                      |GID GB|   |RToken Y|
   |       \/   +------+                      +------+   \/       |
   |+--------+         |                      |        +--------+ |
   ||        |         |                      |        |        | |
   ||  RMB   |         |                      |        |  RMB   | |
   ||        |         |                      |        |        | |
   |+--------+         |                      |        +--------+ |
   |       /\   +------+                      +------+   /\       |
   |        |   |RNIC 3|                      |RNIC 4|   |        |
   |RToken Z|   |MAC MC|                      |MAC MD|   |RToken W|
   |      QP 9  |GID GC|                      |GID GD|   QP 65    |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

   First SMC-R link setup as shown in Figure 8
   <-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.->

   ADD link request (QP9,MC,GC, link number=2)
   ............................................>

   ADD link response (QP65,MD,GD, link number=2)
   <............................................

   ADD link continuation request (RToken=Z)
   ............................................>

   ADD link continuation response(RToken=W)
   <............................................

   Confirm Link(MC,GC,QP9,link number=2, link userid)
   .............................................>

   Confirm Link response(MD,GD,QP65,link number=2, link userid)
   <.............................................

   Legend:
   ------------ TCP/IP and CLC flows
   ............ RoCE (LLC) flows

            Figure 9 First contact, second link setup

3.4.1.6.3. Exchange of Rkeys on second SMC-R link

   Note that in the scenario described here, first contact, there is
   only one RMB Rkey to exchange on the second SMC-R link and it is
   exchanged in the Add Link Continuation request and reply. In
   scenarios other than first contact, for example, adding a new SMC-R
   link to a longstanding link group with multiple RMBs, additional
   flows will be required to exchange additional RMB Rkeys. See
   3.4.5.2.3. Adding a new SMC-R link to a link group with multiple RMBs
   for more details on these flows.

3.4.1.6.4. Aborting SMC-R and falling back to IP

   Unless both partners provide the SMC-R TCP option during the
   three-way TCP handshake, the connection falls back to normal TCP/IP.
   During the SMC-R negotiation that occurs after the three-way TCP
   handshake, either partner may break off SMC-R by sending an SMC
   Decline CLC message. The SMC Decline CLC message may be sent in
   place of any expected message, and may also be sent during the
   Confirm Link LLC exchange if there is a failure before any
   application data has flowed over the RoCE fabric. For more detail on
   exactly when an SMC Decline can flow during link group setup, see
   C.1. SMC Decline during CLC negotiation and C.2.
   SMC Decline during LLC negotiation.

   If this fallback to IP happens while setting up a new SMC-R link
   group, the RoCE resources allocated for this SMC-R link group
   relationship are torn down and it will be retried as a new SMC-R link
   group the next time a connection starts between these peers with
   SMC-R proposed. Note that if this happens because one side doesn't
   support SMC-R, there will be very little to tear down, as the TCP
   option will have failed to flow either on the initial SYN or the
   SYN-ACK, before either side had reserved any local RoCE resources.

3.4.2. Subsequent contact

   "Subsequent contact" means setting up a new TCP connection between
   two peers that already have an SMC-R link group between them, and
   reusing the existing SMC-R link group. In this case it is not
   necessary to allocate new QPs. However, it is possible that a new
   RMB has been allocated for this TCP connection, if the previous TCP
   connection used the last element available in the previously used
   RMB, or for any other implementation-dependent reason. For this
   reason, and for convenience and error checking, the same method of
   TCP option 254 followed by inline negotiation that was described for
   initial contact will be used for subsequent contact, but the
   processing differs in some ways. That processing is described below.

3.4.2.1. SMC-R proposal

   When the client begins the inline negotiation with the server, it
   does not know if this is a first contact or a subsequent contact.
   The client cannot know this information until it sees the server's
   peer ID to determine whether or not it already has an SMC-R link with
   this peer that it can use. There are several reasons why it is not
   sufficient to use the partner IP address, subnet, VLAN or other IP
   information to make this determination.
   The most obvious reason is distributed systems: if the server IP
   address is actually a virtual IP address representing a distributed
   cluster, the actual host serving this TCP connection may not be the
   same as the host that served the last TCP connection to this same IP
   address.

   After the TCP three-way handshake, assuming both partners indicate
   SMC-R capability, the client builds and sends the SMC Proposal CLC
   message to the server in exactly the same manner as it does in the
   first contact case, and in fact at this point doesn't know if it's
   first contact or subsequent contact. As in the first contact case,
   the client sends its Peer ID value, suggested RNIC GID/MAC, and IP
   subnet or prefix information.

   Upon receiving the client's proposal, the server looks up the peer ID
   provided to determine if it already has a usable SMC-R link group
   with this peer. If it does already have a usable SMC-R link group,
   the server then needs to decide if it will use the existing SMC-R
   link group, or create a new link group. For the new link group
   case, see 3.4.3. First contact variation: creating a parallel link
   group, below.

   For this discussion assume the server decides to use the existing
   SMC-R link group for the TCP connection, which is expected to be the
   most common case. The server is responsible for making this
   decision. Then the server needs to communicate that information to
   the client, but it is not necessary to allocate, associate, and
   confirm QPs for the chosen SMC-R link. All that remains to be done
   is to set up RMB space for this TCP connection.
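
   The server's handling of an incoming proposal, as described above,
   amounts to a lookup by peer ID followed by a local policy choice. A
   minimal sketch follows. The LinkGroup class, the classify_proposal
   function, and the want_new_group parameter standing in for the
   server's policy are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkGroup:
    """Hypothetical record of an SMC-R link group known to the server."""
    peer_id: str
    usable: bool = True


def classify_proposal(link_groups, peer_id, want_new_group=False):
    """Decide how the server treats a proposal from the given peer ID."""
    usable = [g for g in link_groups if g.peer_id == peer_id and g.usable]
    if not usable:
        # No usable link group with this peer: normal first contact.
        return "first contact"
    if want_new_group:
        # Server policy asks for an additional, parallel link group.
        return "parallel link group"
    # Expected common case: reuse the existing link group.
    return "subsequent contact"
```

   The client plays no part in this choice; it learns the result from
   the server's reply.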

   If one of the RMBs already in use for this SMC-R link group has an
   available element that uses the appropriate buffer size, the server
   merely chooses one for this TCP connection and then sends an SMC
   Accept CLC message, providing the full RoCE information for the
   chosen SMC-R link to the client, using the same format as the SMC
   Accept CLC message described in the initial contact section above.

   The server may choose to use the SMC-R link that matches the
   suggested MAC/GID provided by the client on the SMC Proposal for its
   RDMA writes but is not obligated to. The final decision on which
   specific SMC-R link to assign a TCP connection to is an independent
   server and client decision.

   It may be necessary for the server to allocate a new RMB for this
   connection. The reasons for this are implementation dependent and
   could include: no available space in the existing RMB or RMBs, a
   desire to allocate a new RMB that uses a different buffer size from
   the ones already created, or any other implementation dependent
   reason. In this case the server will allocate the new RMB and then
   perform the flows described in 3.4.5.2.1. Adding a new RMB to an
   SMC-R link group. Once that processing is complete, the server then
   provides the full RoCE information, including the new Rkey, for this
   connection on an SMC Accept CLC message to the client.

3.4.2.2. SMC-R acceptance

   Upon receiving the SMC Accept CLC message from the server, the client
   examines the RoCE information provided by the server to determine if
   this is a first contact for a new SMC-R link group, or subsequent
   contact for an existing SMC-R link group. It is subsequent contact
   if the server side peer ID, GID, MAC and QP number provided on the
   packet match a known SMC-R link, and the "first contact" flag is not
   set.
   If this is not the case, for example the GID and MAC match but the QP
   is new, then the server is creating a new, parallel SMC-R link group
   and this is treated as a first contact.

   A different RMB RToken does not indicate a first contact, as the
   server may have allocated a new RMB, or may be using several RMBs for
   this SMC-R link. The client needs the server's RMB information only
   for its RDMA writes to the server, and since there is no requirement
   for symmetric RMBs, this information is simply control information
   for the RDMA writes on this SMC-R link.

   The client must validate that the RMB element being provided by the
   server is not in use by another TCP connection on this SMC-R link
   group. This validation must check the new RMB element against all
   RMB elements known to be in use on this link group. See 4.4.2. RMB
   element reuse and conflict resolution for the case in which the
   server tries to use an RMB element that is already in use on this
   link group.

   Once the client has determined that this TCP connection is a
   subsequent contact over an existing SMC-R link, it performs an RMB
   allocation process similar to the server's: it either allocates an
   element from an RMB already associated with this SMC-R link, or it
   allocates a new RMB, associates it with this SMC-R link, and then
   chooses an element out of it.

   If the client allocates a new RMB for this TCP connection, it
   performs the processing described in 3.4.5.2.1. Adding a new RMB to
   an SMC-R link group. Once that processing is complete, the client
   provides its full RoCE information for this TCP connection on an SMC
   Confirm CLC message.

   Because an SMC-R link with a verified connected QP already exists and
   is being reused, there is no need for verification or alternate QP
   selection flows or timers.

3.4.2.3. SMC-R confirmation

   When the server receives the client's SMC Confirm CLC message on a
   subsequent contact, it verifies the following:

   o The RMB element provided by the client is not already in use by
     another TCP connection on this SMC-R link group (see section
     4.4.2. RMB element reuse and conflict resolution for the case in
     which it is).

   o The MAC/GID/QP info provided by the client matches an active link
     within the link group. The client is free to select any valid,
     active link. The client is not required to select the same link
     as the server.

   If this validation passes, the server stores the client's RMB
   information for this connection and the RoCE setup of the TCP
   connection is complete.

3.4.2.4. TCP data flow race with SMC Confirm CLC message

   On a subsequent contact TCP/IP connection, a peer may send data as
   soon as it has received the peer RMB information for the connection.
   There are no additional RoCE confirmation flows, since the QPs on the
   SMC-R link are already reliably connected and verified.

   In the majority of cases the first data will flow from the client to
   the server. The client must send the SMC Confirm CLC message before
   sending any TCP data over the chosen SMC-R link; however, the client
   need not wait for confirmation of this message, and in fact there
   will be no such confirmation. Since the server is required to have
   the RMB fully set up and ready to receive data from the client before
   sending the SMC Accept CLC message, the client can begin sending data
   over the SMC-R link immediately upon completing the send of the SMC
   Confirm CLC message.

   It is possible that data from the client will arrive into the server
   side RMB before the SMC Confirm CLC message from the client has been
   processed.
   In this case the server must handle this race condition, and not
   provide the arrived TCP data to the socket application until the SMC
   Confirm CLC message has been received and fully processed, opening
   the socket.

   If the server has initial data to send to the client which is not a
   response to the client (this case should be rare), it can send the
   data immediately upon receiving and processing the SMC Confirm CLC
   message from the client. The client must have opened the TCP socket
   to the client application upon sending the SMC Confirm CLC message,
   so the client will be ready to process data from the server.

3.4.3. First contact variation: creating a parallel link group

   Recall that parallel SMC-R links within an SMC-R link group are not
   supported. These are multiple SMC-R links within a link group that
   use the same network path. However, multiple SMC-R link groups
   between the same peers are supported. This means that if multiple
   SMC-R links over the same RoCE path are desired, it is necessary to
   use multiple SMC-R link groups. While not a recommended practice,
   this could be done for platform specific reasons, like QP separation
   of different workloads. Only the server can drive the creation of
   multiple SMC-R link groups between peers.

   At a high level, when the server decides to create an additional
   SMC-R link group with a client it already has an SMC-R link group
   with, the flows are basically the same as the normal "first contact"
   case described above. The following provides more detail and
   clarification of processing in this case.

   When the server receives the SMC Proposal CLC message from the client
   and, using the GID/MAC info, determines that it already has an SMC-R
   link group with this client, the server can either reuse the existing
   SMC-R link group (detailed in 3.4.2.
   Subsequent contact above) or it can create a new SMC-R link group in
   addition to the existing one.

   If the server decides to create a new SMC-R link group, it does the
   same processing it would have done for first contact: allocate QP and
   RMB resources as well as alternate QP resources, and communicate the
   QP and RMB information to the client on the SMC Accept CLC message
   with the "first contact" flag set.

   When the client receives the server's SMC Accept CLC message with the
   new QP information and the "first contact" flag, it knows the server
   is creating a new SMC-R link group even though it already has an
   SMC-R link group with the server. In this case the client will also
   allocate a new QP for this new SMC-R link, allocate an RMB for this
   link, and generate an Rkey for it.

   Note that multiple SMC-R link groups between the same peers must
   access different RMB resources, so new RMBs will be required. Using
   the same RMBs that are in use in another SMC-R link group is not
   permitted.

   The client then associates its new QP with the server's new QP and
   sends its SMC Confirm CLC message back to the server providing the
   new QP/RMB information and sets its confirmation timer for the new
   SMC-R link.

   When the server receives the client's SMC Confirm CLC message, it
   associates its QP with the client's QP as learned on the SMC Confirm
   CLC message and sends a confirmation LLC message. The rest of the
   flow, with QP confirmation and setup of additional SMC-R links,
   unfolds just like the first contact case.

3.4.4. Normal SMC-R link termination

   The normal sockets API trigger points are used by the SMC-R layer to
   initiate SMC-R connection termination flows.
   The main design point for SMC-R normal connection flows is to use the
   SMC-R protocol to first shut down the SMC-R connection and free up
   any SMC-R RDMA resources and then allow the normal TCP connection
   termination protocol (i.e. FIN processing) to drive cleanup of the
   TCP connection that exists on the IP fabric. This design point is
   very important in ensuring that RDMA resources such as the RMBEs are
   only freed and reused when both SMC-R end points are completely done
   with their RDMA Write operations to the partner's RMBE.

   When the last TCP connection over an SMC-R link group terminates, the
   link group can be terminated. Similar to creation of SMC-R links and
   link groups, the primary responsibility for determining that normal
   termination is needed and initiating it lies with the server.
   Implementations may opt to set timers to keep SMC-R link groups up
   for a specified time after the last TCP connection ends, to avoid
   churn in cases when TCP connections come and go regularly.

   The link or link group may also be terminated as a result of an
   operator initiated command. This command can be entered at either
   the client or the server. If entered at the client, the client
   requests that the server perform link or link group termination, and
   the responsibility for doing so ultimately lies with the server.

   When the server determines that the SMC-R link group is to be
   terminated, it sends a DELETE LINK LLC message to the client, with a
   flag set indicating that all links in the link group are to be
   terminated. After receiving confirmation from the adapter that the
   DELETE LINK LLC message has been sent, the server can clean up its
   end of the link group (QPs, RMBs, etc.). Upon receipt of the DELETE
   LINK message from the server, the client must immediately comply and
   clean up its end of the link group.
   Any TCP connections that the client believes to be active on the link
   group must be immediately terminated.

   The client can request that the server delete the link group as well.
   The client does this by sending a DELETE LINK message to the server
   indicating that cleanup of all links is requested. The server must
   comply by sending a DELETE LINK to the client and processing as
   described above. If there are TCP connections active on the link
   group when the server receives this request, they are immediately
   terminated by sending an RST flow over the IP fabric.

3.4.5. Link group management flows

3.4.5.1. Adding and deleting links in an SMC-R link group

   The server has the lead role in managing the composition of the link
   group. Links are added to the link group by the server. The client
   may notify the server of new conditions that may result in the server
   adding a new link, but the server is ultimately responsible. In
   general, links are deleted from the link group by the server;
   however, in certain error cases the client may inform the server that
   a link must be deleted and treat it as deleted without waiting for
   action from the server. These flows are detailed in the following
   sections.

3.4.5.1.1. Server initiated Add Link processing

   As described in previous sections, the server initiates an Add Link
   exchange to create redundancy in a newly created link group. Once a
   link group is established the server may also initiate Add Link for
   other reasons, including:

   o Availability of additional resources on the server host to support
     an additional SMC-R link. This may include the provisioning of an
     additional RNIC, more storage becoming available to support
     additional QP resources, operator command, or any other
     implementation dependent reason.
     Note that, to be available for an existing link group, a new RNIC
     must be attached to the same RoCE VLAN that the link group is
     using.

   o Receipt of notification from the client that additional resources
     on the client are available to support an additional SMC-R link.
     See 3.4.5.1.2. Client initiated Add Link processing.

   Server initiated Add Link processing in an established SMC-R link
   group is the same as the Add Link processing described in 3.4.1.6.
   Second SMC-R link setup with the following changes:

   o If an asymmetric SMC-R link already exists in the link group, a
     second asymmetric link will not be created. Only one asymmetric
     link is permitted in a link group.

   o TCP data flow on already existing link(s) in the link group is not
     halted or otherwise affected during the process of setting up the
     additional link.

   In no case will the server initiate Add Link processing if the link
   group already has the maximum number of links negotiated by the
   partners.

3.4.5.1.2. Client initiated Add Link processing

   If an additional RNIC becomes available for an existing SMC-R link
   group on the client's side, the client notifies the server by sending
   an Add Link request LLC message to the server. Unlike an Add Link
   request sent by the server to the client, this Add Link request
   merely informs the server that the client has a new RNIC. If the
   link group lacks redundancy, or has redundancy only on an asymmetric
   link with a single RNIC on the client side, the server must initiate
   an Add Link exchange in response to this message, to create or
   improve the link group's redundancy.

   If the link group already has symmetric link redundancy but has fewer
   than the negotiated maximum number of links, the server may respond
   by initiating an Add Link exchange to create a new link using the
   client's new resource but is not required to.
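
   The server's obligations when the client announces a new RNIC can be
   summarized as a small policy function. This is a sketch only: the
   names Redundancy, Action, and on_client_add_link are invented for the
   example, and the case of an asymmetric link with a single RNIC on the
   server side is not modeled because the text does not address it.

```python
from enum import Enum


class Redundancy(Enum):
    NONE = "single link"
    ASYMMETRIC_ONE_CLIENT_RNIC = "asymmetric, one client RNIC"
    SYMMETRIC = "symmetric"


class Action(Enum):
    MUST_ADD_LINK = "must initiate Add Link"
    MAY_ADD_LINK = "may initiate Add Link"
    IGNORE = "no Add Link initiated"


def on_client_add_link(link_count, link_limit, redundancy):
    """Server policy for a client-initiated Add Link request."""
    if link_count >= link_limit:
        # The negotiated maximum is never exceeded.
        return Action.IGNORE
    if redundancy is Redundancy.SYMMETRIC:
        # Already redundant: adding another link is optional.
        return Action.MAY_ADD_LINK
    # No redundancy, or redundancy that the new client RNIC can improve.
    return Action.MUST_ADD_LINK
```

   The link limit check comes first because the negotiated maximum takes
   precedence over any redundancy consideration.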

   If the link group already has the negotiated maximum number of links,
   the server must ignore the client's Add Link request LLC message.

   Because the server is not required to respond to the client's Add
   Link LLC message in all cases, the client must not wait for a
   response or throw an error if one does not come.

3.4.5.1.3. Server initiated Delete Link Processing

   Reasons that a server may delete a link include:

   o The link has not been used for TCP connections for an
     implementation defined time interval, and deleting the link will
     not cause the link group to lack redundancy.

   o An error in resources supporting the link. These may include but
     are not limited to: RNIC errors, QP errors, software errors.

   o The RNIC supporting this SMC-R link is being taken down, either
     because of an error case or because of an operator or software
     command.

   If a link being deleted is supporting TCP connections, and there are
   one or more surviving links in the link group, the TCP connections
   are moved to the surviving links. For more information on this
   processing see 2.3. SMC-R resilience and load balancing.

   The server deletes a link from the link group by sending a Delete
   Link request LLC message to the client over any of the usable links
   in the link group. Because the Delete Link LLC message specifies
   which link is to be deleted, it may flow over any link in the link
   group. The server must not clean up its RoCE resources for the link
   until the client responds.

   The client responds to the server's Delete Link request LLC message
   by sending the server a Delete Link response LLC message. The client
   must respond positively; it cannot decline to delete the link. Once
   the server has received the client's Delete Link response, both sides
   may clean up their resources for the link.
1955 Positive write completion or other indication from the RNIC on the 1956 client's side is sufficient to indicate to the client that the server 1957 has received the Delete Link response. 1959 Host X Host Y 1960 +-------------------+ +-------------------+ 1961 | +------+ +------+ | 1962 | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 9 | 1963 |RToken X| |Failed|<--X----X----X----X-->| | | 1964 | | | | | | | 1965 | \/ +------+ +------+ | 1966 |+--------+ | | | 1967 || deleted| | | | 1968 || RMB | | | | 1969 || | | | | 1970 |+--------+ | | | 1971 | /\ +------+ +------+ | 1972 |RToken Z| | | SMC-R Link 2 | | | 1973 | | |RNIC 3|<-------------------->|RNIC 4| | 1974 | QP 64| | | | QP 65 | 1975 | +------+ +------+ | 1976 +-------------------+ +-------------------+ 1978 DELETE LINK(Request, link number = 1, 1979 ................................................> 1980 reason code = RNIC failure) 1982 DELETE LINK(Response, link number = 1) 1983 <................................................ 1985 (note, architecturally this exchange can flow over either 1986 SMC-R link but most likely flows over link 2 since 1987 the RNIC for link 1 has failed) 1989 Figure 10 Server initiated Delete Link flow 1991 3.4.5.1.4. Client initiated Delete Link request 1993 The client may request that the server delete a link for the same 1994 reasons that the server may delete a link, except for inactivity 1995 timeout. 1997 Because the client depends on the server to delete links, there are 1998 two types of delete requests from client to server: 2000 o Orderly: the client is requesting that the server delete the link 2001 when able. This would result from an operator command to bring 2002 down the RNIC or some other nonfatal reason. In this case the 2003 server is required to delete the link, but may not do it right 2004 away. 2006 o Disorderly: the server must delete the link right away, because 2007 the client has experienced a fatal error with the link. 
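The difference between the two request types can be sketched as follows. This is an illustrative model only: ServerLinkGroup, on_client_delete_request and run_deferred are hypothetical names, and "sending" a Delete Link request is reduced to recording the link number. It assumes that the server answers either request type by starting its own Delete Link exchange, immediately for a disorderly request and at a later, convenient time for an orderly one.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ServerLinkGroup:
    # Link numbers whose orderly deletion has been requested but deferred.
    deferred: deque = field(default_factory=deque)
    # Link numbers for which the server has started a Delete Link exchange.
    sent: list = field(default_factory=list)

    def start_delete_exchange(self, link_number):
        # Stand-in for sending a Delete Link request LLC message.
        self.sent.append(link_number)

    def on_client_delete_request(self, link_number, orderly):
        if orderly:
            # Orderly: the link must be deleted, but not necessarily now.
            self.deferred.append(link_number)
        else:
            # Disorderly: the client hit a fatal error; delete right away.
            self.start_delete_exchange(link_number)

    def run_deferred(self):
        # Called when the server is able to delete links gracefully.
        while self.deferred:
            self.start_delete_exchange(self.deferred.popleft())
```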
2009 In either case the server responds by initiating a Delete Link 2010 exchange with the client as described in the previous section. The 2011 difference between the two is whether the server must do so 2012 immediately or can delay for an opportunity to gracefully delete the 2013 link. 2015 Host X Host Y 2016 +-------------------+ +-------------------+ 2017 | +------+ +------+ | 2018 | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 9 | 2019 |RToken X| | |<---X--X--X--X--X--X->|Failed| | 2020 | | | | | | | 2021 | \/ +------+ +------+ | 2022 |+--------+ | | | 2023 || deleted| | | | 2024 || RMB | | | | 2025 || | | | | 2026 |+--------+ | | | 2027 | /\ +------+ +------+ | 2028 |RToken Z| | | SMC-R Link 2 | | | 2029 | | |RNIC 3|<-------------------->|RNIC 4| | 2030 | QP 64| | | | QP 65 | 2031 | +------+ +------+ | 2032 +-------------------+ +-------------------+ 2034 DELETE LINK(Request, link number = 1, disorderly, 2035 <............................................... 2036 reason code = RNIC failure) 2038 DELETE LINK(Request, link number = 1, 2039 ................................................> 2040 reason code = RNIC failure) 2042 DELETE LINK(Response, link number = 1) 2043 <................................................ 2045 (note, architecturally this exchange can flow over either 2046 SMC-R link but most likely flows over link 2 since 2047 the RNIC for link 1 has failed) 2049 Figure 11 Client-initiated Delete Link 2051 3.4.5.2. Managing multiple Rkeys over multiple SMC-R links in a link 2052 group 2054 After the initial contact sequence completes and the number of TCP 2055 connections increases it is possible that the SMC peers could add 2056 additional RMBs to the Link Group. Recall that each peer 2057 independently manages its RMBs. 
Also recall that an RMB's RToken is 2058 specific to a QP, which means that when there are multiple SMC-R 2059 links in a link group, each RMB accessed with the link group requires 2060 a separate RToken for each SMC-R link in the group. 2062 Each RMB that is added to a link must be added to all links within 2063 the Link Group. The set of RMBs created for the Link is called the 2064 "RToken Set". The RTokens must be exchanged with the peer. As RMBs 2065 are added and deleted, the RToken Set must remain in sync. 2067 3.4.5.2.1. Adding a new RMB to an SMC-R link group 2069 A new RMB can be added to an SMC-R link group on either the client or 2070 the server side. When an additional RMB is added to an existing SMC- 2071 R link group, that RMB must be associated with the QPs for each link 2072 in the link group. Therefore when an RMB is added to an SMC-R link 2073 group, its RMB RToken for each SMC-R link's QP must be communicated 2074 to the peer. 2076 The tokens for a new RMB added to an existing SMC-R link group are 2077 communicated using "Confirm Rkey" LLC messages, as shown in Figure 2078 12. The RToken set is specified as pairs: an SMC link number, paired 2079 with the new RMB's RToken over that SMC Link. To preserve failover 2080 capability, any TCP connection that uses a newly added RMB cannot go 2081 active until all RTokens for the RMB have been communicated for all 2082 the links in the link group. 
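The rule that a connection cannot go active until the RToken set covers every link can be sketched as a simple completeness check. This is a minimal illustration under assumed data shapes: an RToken set is modeled as a list of (link number, RToken) pairs, and the function names are hypothetical.

```python
def missing_links(link_numbers, rtoken_set):
    """Return the link numbers that have no RToken in rtoken_set.

    link_numbers: link numbers currently in the link group.
    rtoken_set: list of (link_number, rtoken) pairs for one RMB.
    """
    covered = {link for link, _ in rtoken_set}
    return sorted(set(link_numbers) - covered)


def rmb_ready_for_connections(link_numbers, confirmed_rtoken_set):
    """True once an RToken has been communicated for every link."""
    return not missing_links(link_numbers, confirmed_rtoken_set)
```

For a link group with links 1 and 2, the set ((1, X), (2, Z)) is complete, while ((1, X)) alone leaves link 2 uncovered, so a connection using that RMB could not survive a failover to link 2.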
2084 Host X Host Y 2085 +-------------------+ +-------------------+ 2086 | +------+ +------+ | 2087 | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 9 | 2088 |RToken X| | |<-------------------->| | | 2089 | | | | | | | 2090 | \/ +------+ +------+ | 2091 |+--------+ | | | 2092 || new | | | | 2093 || RMB | | | | 2094 || | | | | 2095 |+--------+ | | | 2096 | /\ +------+ +------+ | 2097 |RToken Z| | | SMC-R Link 2 | | | 2098 | | |RNIC 3|<-------------------->|RNIC 4| | 2099 | QP 64| | | | QP 65 | 2100 | +------+ +------+ | 2101 +-------------------+ +-------------------+ 2103 CONFIRM RKEY(Request, Add, 2104 ................................................> 2105 RToken set((Link 1,RToken X),(Link2,RToken Z))) 2107 CONFIRM RKEY(Response, Add, 2108 <................................................ 2109 RToken set((Link 1,RToken X),(Link2,RToken Z))) 2111 (note, this exchange can flow over either SMC-R link) 2113 Figure 12 Add RMB to existing link group 2115 Implementations may choose to proactively add RMBs to link groups in 2116 anticipation of need. For example, an implementation may add a new 2117 RMB when all of its existing RMBs are over a certain threshold 2118 percentage used. 2120 A new RMB may also be added to an existing link group on an as needed 2121 basis. For example, when a new TCP connection is added to the link 2122 group but there are no available RMB elements. In this case the CLC 2123 exchange is paused while the peer that requires the new RMB adds it. 2124 An example of this is illustrated in figure 13. 
2126 Host X -- Server Host Y -- Client 2127 +-------------------+ +-------------------+ 2128 | PeerID = PS1 | | PeerID = PC1 | 2129 | +------+ +------+ | 2130 | QP 8 |RNIC 1| SMC-R link 1 |RNIC 2| QP 64 | 2131 |RToken X| |MAC MA|<-------------------->|MAC MB| | | 2132 | | |GID GA| |GID GB| |RTokenY2| 2133 | \/ +------+ +------+ \/ | 2134 |+--------+ | | +--------+ | 2135 || | | SUBNET S1 | | New | | 2136 || RMB | | | | RMB | | 2137 |+--------+ | | +--------+ | 2138 | /\ +------+ +------+ /\ | 2139 | | |RNIC 3| SMC-R link 2 |RNIC 4| |RTokenW2| 2140 | | |MAC MC|<-------------------->|MAC MD| | | 2141 | QP 9 |GID GC| |GID GD| QP65 | 2142 | +------+ +------+ | 2143 +-------------------+ +-------------------+ 2145 SYN / SYN-ACK / ACK TCP 3-way handshake with TCP option 2146 <---------------------------------------------------------> 2148 SMC Proposal(PC1,MB,GB,S1) 2149 <-------------------------------------------------------- 2151 SMC Accept(PS1,not 1st contact,MA,GA,QP8,RToken=X,RMB elem index) 2152 ---------------------------------------------------------> 2154 Confirm Rkey(Request, Add, 2155 <........................................................ 2156 RToken set((Link1, RToken Y2),(Link2, RToken W2))) 2158 Confirm Rkey(Response, Add, 2159 ........................................................> 2160 RToken set((Link1, RToken Y2),(Link2, RToken W2))) 2162 SMC Confirm(PC1,MB,GB,QP64,RToken=Y2, RMB element index) 2163 <-------------------------------------------------------- 2165 Legend: 2166 ------------ TCP/IP and CLC flows 2167 ............ RoCE (LLC) flows 2169 Figure 13 Client adds RMB during TCP connection setup 2171 3.4.5.2.2. Deleting an RMB from an SMC-R link group 2173 Either peer can delete one or more of its RMBs as long as they are not 2174 being used for any TCP connections.
Ideally an SMC-R host would use 2175 a timer to avoid freeing an RMB immediately after the last TCP 2176 connection stops using it, to keep the RMB available for later TCP 2177 connections and avoid thrashing with addition and deletion of RMBs. 2178 Once an SMC-R peer decides to delete an RMB, it sends a DELETE RKEY 2179 LLC message to its peer. It can then free the RMB once it receives a 2180 response from the peer. Multiple RMBs can be deleted in a DELETE 2181 RKEY exchange. 2183 Note that in a DELETE RKEY message, it is not necessary to specify 2184 the full RToken for a deleted RMB. The RMB's Rkey over one link in 2185 the link group is sufficient to specify which RMB is being deleted. 2187 Host X Host Y 2188 +-------------------+ +-------------------+ 2189 | +------+ +------+ | 2190 | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 9 | 2191 |RToken X| | |<-------------------->| | | 2192 | | | | | | | 2193 | \/ +------+ +------+ | 2194 |+--------+ | | | 2195 || deleted| | | | 2196 || RMB | | | | 2197 || | | | | 2198 |+--------+ | | | 2199 | /\ +------+ +------+ | 2200 |RToken Z| | | SMC-R Link 2 | | | 2201 | | |RNIC 3|<-------------------->|RNIC 4| | 2202 | QP 9 | | | | | 2203 | +------+ +------+ | 2204 +-------------------+ +-------------------+ 2206 DELETE RKEY(Request, Rkey list(Rkey X)) 2207 ................................................> 2209 DELETE RKEY(Response, Rkey list(Rkey X)) 2210 <................................................ 2212 (note, this exchange can flow over either SMC-R link) 2214 Figure 14 Delete RMB from SMC-R link group 2216 3.4.5.2.3. Adding a new SMC-R link to a link group with multiple RMBs 2218 When a new SMC-R link is added to an existing link group, there could 2219 be multiple RMBs on each side already associated with the link group. 2220 There could also be a different number of RMBs on one side as on the 2221 other, because each peer manages its RMBs independently. 
Each of 2222 these RMBs will require a new RToken to be used on the new SMC-R 2223 link, and then those new RTokens must be communicated to the peer. 2224 This requires two-way communication as the server will have to 2225 communicate its RTokens to the client and vice versa. 2227 RTokens are communicated between peers in pairs. Each RToken pair 2228 consists of: 2230 o The RToken for the RMB, as is already known on an existing SMC-R 2231 link in the link group 2233 o The RToken for the same RMB, to be used on the new SMC-R link. 2235 These pairs are required to ensure that each peer knows which RTokens 2236 across QPs are equivalent. 2238 The "Add Link" request and response LLC messages do not have room to 2239 contain any RToken pairs. "Add Link continuation" LLC messages are 2240 used to communicate these pairs, as shown in Figure 15. The "Add 2241 Link Continuation" LLC messages are sent on the same SMC-R link that 2242 the "Add Link" LLC messages were sent over, and in both the "Add 2243 Link" and the "Add Link Continuation" LLC messages, the first RToken 2244 in each RToken pair will be the RToken for the RMB as known on the 2245 SMC-R link that the LLC message is being sent over. 
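The pairing rule can be sketched as follows. This is a minimal illustration under assumed data shapes, not a message encoder: each RMB is modeled as a dict from link number to its RToken on that link, and build_rtoken_pairs and apply_rtoken_pairs are hypothetical helper names. The sample values in the usage note mirror the server side of Figure 15.

```python
def build_rtoken_pairs(rmbs, existing_link, new_link):
    """Sender side: build (known RToken, new RToken) pairs.

    rmbs: list of dicts mapping link number -> RToken for one RMB.
    The first element of each pair is the RToken on the existing link,
    that is, the link the continuation message is sent over.
    """
    return [(rmb[existing_link], rmb[new_link]) for rmb in rmbs]


def apply_rtoken_pairs(peer_rmb_by_known_token, pairs):
    """Receiver side: learn the peer's RTokens for the new link.

    peer_rmb_by_known_token: dict mapping an already known RToken to a
    local identifier for the peer's RMB.
    Returns a dict mapping that identifier to the RToken on the new link.
    """
    new_link_tokens = {}
    for known, new in pairs:
        if known not in peer_rmb_by_known_token:
            raise KeyError("unknown RToken in pair: %r" % (known,))
        new_link_tokens[peer_rmb_by_known_token[known]] = new
    return new_link_tokens
```

With three server RMBs whose RTokens are X, Y, Z on link 1 and U, V, W on link 2, build_rtoken_pairs yields ((X,U),(Y,V),(Z,W)); the client can then map each RMB it already knows by X, Y or Z to its equivalent RToken on the new link.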
2247 Host X -- Server Host Y -- Client 2248 +-------------------+ +-------------------+ 2249 | PeerID = PS1 | | PeerID = PC1 | 2250 | +------+ +------+ | 2251 | QP 8 |RNIC 1| |RNIC 2| QP 64 | 2252 |Rkey Set| |MAC MA| |MAC MB| |Rkey set| 2253 |X,Y,Z | |GID GA| |GID GB| |Q,R,S,T | 2254 | \/ +------+ +------+ \/ | 2255 |+--------+ | | +--------+ | 2256 || 3 RMBs | | | | 4 RMBs | | 2257 |+--------+ | | +--------+ | 2258 | /\ +------+ +------+ /\ | 2259 |Rkey set| |RNIC 3| |RNIC 4| | Rkey set| 2260 |U,V,W | |MAC MC| |MAC MD| | L,M,N,P | 2261 | QP 9 |GID GC| |GID GD| QP 65 | 2262 | +------+ +------+ | 2263 +-------------------+ +-------------------+ 2265 ADD link request (QP9,MC,GC, link number=2) 2266 ............................................> 2268 ADD link response (QP65,MD,GD, link number=2) 2269 <............................................ 2271 ADD link continuation req(RToken Pairs=((X,U),(Y,V),(Z,W))) 2272 ............................................> 2274 ADD link continuation rsp(RToken Pairs=((Q,L),(R,M),(S,N),(T,P))) 2275 <............................................. 2277 Confirm Link Req/Rsp exchange on link 2 2278 <.............................................> 2280 Legend: 2281 ------------ TCP/IP and CLC flows 2282 ............ RoCE (LLC) flows 2283 Figure 15 Exchanging Rkeys when a new link is added to a link group 2285 3.4.5.3. Serialization of LLC exchanges, and collisions 2287 LLC flows can be divided into two main groups for serialization 2288 considerations. 2290 The first group is LLC messages that are independent and can flow at 2291 any time. These are one-time, unsolicited messages that either do 2292 not have a required response, or that have a simple response that 2293 does not interfere with the operations of another group of messages.
2294 These messages are: 2296 o TEST LINK from either the client or the server: This message 2297 requires a TEST LINK response to be returned, but does not affect 2298 the configuration of the link group or the Rkeys. 2300 o ADD LINK from the client to the server: This message is provided 2301 as an "FYI" to the server to let it know that the client has an 2302 additional RNIC available. The server is not required to act upon 2303 or respond to this message. 2305 o DELETE_LINK from the client to the server: This message informs 2306 the server that the client has either experienced an error or 2307 problem that requires a link or link group to be terminated, or 2308 that an operator has commanded that a link or link group be 2309 terminated. The server does not respond directly to the message, 2310 rather it initiates a DELETE LINK exchange as a result of 2311 receiving it. 2313 o DELETE LINK from the server to the client with the "delete entire 2314 link group" flag set: This message informs the client that the 2315 entire link group is being deleted. 2317 The second group is LLC messages that are part of an exchange of LLC 2318 messages that affects link group configuration that must complete 2319 before another exchange of LLC messages that affects link group 2320 configuration can be processed. When a peer knows that one of these 2321 exchanges is in progress, it must not start another exchange. These 2322 exchanges are: 2324 o ADD LINK / ADD LINK response / ADD LINK CONTINUATION / ADD LINK 2325 CONTINUATION response / CONFIRM LINK / CONFIRM LINK RESPONSE: 2326 This exchange, by adding a new link, changes the configuration of 2327 the link group. 2329 o DELETE LINK / DELETE LINK response initiated by the server: This 2330 exchange, by deleting a link, changes the configuration of the 2331 link group. 2333 o CONFIRM RKEY / CONFIRM RKEY response or DELETE RKEY / DELETE RKEY 2334 response: This exchange changes the RMB configuration of the link 2335 group. 
RKeys cannot change while links are being added or 2336 deleted (while ADD or DELETE LINK is in progress). However, 2337 CONFIRM RKEY and DELETE RKEY are unique in that both the client 2338 and server can independently manage (add or remove) their own 2339 RMBs. This allows each peer to concurrently change its RKeys 2340 and therefore concurrently send CONFIRM RKEY or DELETE RKEY 2341 requests. The concurrent CONFIRM RKEY or DELETE RKEY requests can 2342 be independently processed and do not represent a collision. 2344 Because the server is in control of the configuration of the link 2345 group, many timing windows and collisions are avoided, but there are 2346 still some that must be handled. 2348 3.4.5.3.1. Collisions with ADD LINK / CONFIRM LINK exchange 2350 Colliding LLC message: TEST LINK 2352 Action to resolve: Send immediate TEST LINK reply 2354 Colliding LLC Message: ADD LINK from client to server 2356 Action to resolve: Server ignores the ADD LINK message. When the 2357 client receives the server's ADD LINK, the client will consider that 2358 message to be in response to its ADD LINK message and the flow 2359 works. Since both client and server know not to start this 2360 exchange if an ADD LINK operation is already underway, this can 2361 only occur if the client sends this message before receiving the 2362 server's ADD LINK and this message crosses with the server's ADD 2363 LINK message; therefore the server's ADD LINK arrives at the 2364 client immediately after the client sent this message. 2366 Colliding LLC Message: DELETE LINK from client to server, specific 2367 link specified 2369 Action to resolve: Server queues the DELETE LINK message and 2370 processes it after the ADD LINK exchange completes. If it is an 2371 orderly link termination, it can wait until after this exchange 2372 completes.
If it is disorderly and the link affected is the one 2373 that the current exchange is using, the server will discover the 2374 outage when a message in this exchange fails. 2376 Colliding LLC Message: DELETE LINK from client to server, entire link 2377 group to be deleted 2379 Action to resolve: Immediately clean up the link group 2381 Colliding LLC message: CONFIRM RKEY from the client 2383 Action to resolve: Send negative CONFIRM_RKEY response to the 2384 client. Once the current exchange finishes, client will have to 2385 recompute its Rkey set to include the new link, and start a new 2386 CONFIRM RKEY exchange. 2388 3.4.5.3.2. Collisions during DELETE LINK exchange 2390 Colliding LLC Message: TEST LINK from either peer 2392 Action to resolve: Send immediate TEST LINK response 2394 Colliding LLC message: ADD LINK from client to server 2396 Action to resolve: Server queues the ADD LINK and processes it 2397 after the current exchange completes. 2399 Colliding LLC message: DELETE LINK from client to server (specific 2400 link) 2402 Action to resolve: Server queues the DELETE LINK message and 2403 processes it after the current exchange completes. If it is an 2404 orderly link termination, it can wait until after this exchange 2405 completes. If it is disorderly and the link affected is the one 2406 that the current exchange is using, the server will discover the 2407 outage when a message in this exchange fails. 2409 Colliding LLC message: DELETE LINK from either client or server, 2410 deleting the entire link group 2412 Action to resolve: Immediately clean up the link group 2414 Colliding LLC message: CONFIRM_RKEY from client to server 2416 Action to resolve: Send negative CONFIRM_RKEY response to the 2417 client. Once the current exchange finishes, client will have to 2418 recompute its Rkey set to include the new link, and start a new 2419 CONFIRM RKEY exchange. 2421 3.4.5.3.3.
Collisions during CONFIRM_RKEY exchange 2423 Colliding LLC Message: TEST LINK 2425 Action to resolve: Send immediate TEST LINK reply 2427 Colliding LLC message: ADD LINK from client to server 2429 Action to resolve: Queue the ADD LINK and process it after the 2430 current exchange completes. 2432 Colliding LLC message: ADD LINK from server to client (CONFIRM RKEY 2433 exchange was initiated by the client and it crossed with the server 2434 initiating an ADD LINK exchange) 2436 Action to resolve: Process the ADD LINK. Client will receive a 2437 negative CONFIRM RKEY from the server and will have to redo this 2438 CONFIRM RKEY exchange after the ADD LINK exchange completes. 2440 Colliding LLC message: DELETE LINK from client to server, specific 2441 link to be deleted (CONFIRM RKEY exchange was initiated by the server 2442 and it crossed with the client's DELETE LINK request) 2444 Action to resolve: Server queues the DELETE LINK message and 2445 processes it after the current exchange completes. If it is an 2446 orderly link termination, it can wait until after this exchange 2447 completes. If it is disorderly and the link affected is the one 2448 that the current exchange is using, the server will discover the 2449 outage when a message in this exchange fails. 2451 Colliding LLC message: DELETE LINK from server to client, specific 2452 link deleted (CONFIRM RKEY exchange was initiated by the client and 2453 it crossed with the server's DELETE LINK) 2455 Action to resolve: Process the DELETE LINK. Client will receive a 2456 negative CONFIRM RKEY from the server and will have to redo this 2457 CONFIRM RKEY exchange after the DELETE LINK exchange completes.
2459 Colliding LLC message: DELETE LINK from either client or server, 2460 entire link group deleted 2462 Action to resolve: Immediately clean up the link group 2464 Colliding LLC message: CONFIRM LINK from the peer that did not start 2465 the current CONFIRM LINK exchange 2467 Action to resolve: Queue the request and process it after the 2468 current exchange completes. 2470 4. SMC-R memory sharing architecture 2472 4.1. RMB element allocation considerations 2474 Each TCP connection using SMC-R must be allocated an RMBE by each 2475 SMC-R peer. This allocation is performed by each end point independently 2476 to allow each end point to select an RMBE that best matches the 2477 characteristics of its TCP socket end point. The RMBE associated with 2478 a TCP socket end point must have a receive buffer that is at least as 2479 large as the TCP receive buffer size in effect for that connection. 2480 The receive buffer size can be determined by what is specified 2481 explicitly by the application using setsockopt() or implicitly via 2482 the system configured default value. This will allow sufficient data 2483 to be RDMA written by the peer SMC-R host to fill an entire receive 2484 buffer size worth of data on a given data flow. Given that each RMB 2485 must have fixed length RMBEs, this implies that an SMC-R end point may 2486 need to maintain multiple RMBs of various sizes for SMC-R connections 2487 on a given SMC link and can then select an RMBE that most closely 2488 fits a connection. 2490 4.2. RMB and RMBE format 2492 An RMB is a virtual memory buffer whose backing real memory is 2493 pinned, which is divided into a whole number of equal sized RMB 2494 Elements (RMBEs). Each RMBE begins with a four byte eye catcher for 2495 diagnostic and service purposes, followed by the receive data buffer. 2496 The diagnostic eye catcher should be used by the local SMC-R stack to 2497 check for overlay errors by verifying an intact eye catcher with 2498 every RMBE access.
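The RMB layout described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not an implementation: the eye catcher value b"RMBE" is hypothetical (this section does not define its contents), the Rmb class and its method names are invented for illustration, and an ordinary bytearray stands in for pinned, RDMA-registered memory.

```python
EYE_CATCHER = b"RMBE"  # hypothetical value, chosen only for this sketch
EYE_LEN = 4            # each RMBE begins with a four byte eye catcher


class Rmb:
    """An RMB split into a whole number of equal sized RMBEs."""

    def __init__(self, element_count, receive_buffer_size):
        self.element_count = element_count
        self.element_size = EYE_LEN + receive_buffer_size
        # Zero-filled backing store, one eye catcher per element.
        self.memory = bytearray(element_count * self.element_size)
        for index in range(element_count):
            start = index * self.element_size
            self.memory[start:start + EYE_LEN] = EYE_CATCHER

    def receive_buffer(self, index):
        """Return a view of one RMBE's receive buffer.

        The eye catcher is verified on every access; a mismatch
        indicates that the element has been overlaid.
        """
        if not 0 <= index < self.element_count:
            raise IndexError("RMBE index out of range")
        start = index * self.element_size
        if bytes(self.memory[start:start + EYE_LEN]) != EYE_CATCHER:
            raise RuntimeError("RMBE %d eye catcher overlaid" % index)
        return memoryview(self.memory)[start + EYE_LEN:start + self.element_size]
```

In this sketch an overlay surfaces as an exception; how a real stack reacts to a damaged eye catcher is a matter for the protocol rules, not for this model.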
2500 The RMBE is a wrapping receive buffer for receiving RDMA writes from 2501 the peer. Cursors, as described below, are exchanged between peers 2502 to manage and track RDMA writes and local data reads from the RMBE 2503 for a TCP connection. 2505 4.3. RMBE control information 2507 RMBE control information consists of consumer and producer cursors, 2508 wrap counts, control flags such as urgent data and writer blocked 2509 indicators, and TCP connection information such as termination flags. 2510 This information is exchanged between SMC-R peers using CDC messages, 2511 which are passed using RDMA message passing with inline data, with 2512 the control information contained in the inline data. An SMC-R stack 2513 must receive and store this information in its internal data 2514 structures as it is used to manage the RMBE and its data buffer. 2516 The format and contents of this inline data are described in detail in 2517 4 ..3. RMBE control information. The following is a high level 2518 description of what this control information contains. 2520 o Connection state flags such as sending done, connection closed, 2521 and abnormal close 2523 o Producer cursor: a wrapping offset into the receiver's RMBE data 2524 area. Set by the peer that is writing into the RMBE, it points to 2525 where the writing peer will write the next byte of data into an 2526 RMBE. This cursor is accompanied by a wrap sequence number to help 2527 the RMBE owner (the receiver) identify full window size wrapping 2528 writes. 2530 o Consumer cursor: a wrapping offset into the receiver's RMBE data 2531 area. Set by the owner of the RMBE (the peer that is reading from 2532 it), this cursor points to the offset of the next byte of data to 2533 be consumed by the peer in its own RMBE. The sender cannot write 2534 beyond this cursor into the receiver's RMBE without causing data 2535 loss.
Like the producer cursor, this is accompanied by a wrap 2536 count to help the writer identify full window size wrapping reads. 2538 o Data flags such as urgent data, writer blocked indicator, and cursor 2539 update requests. 2541 4.4. Use of RMBEs 2543 4.4.1. Initializing and accessing RMBEs 2545 The RMBE eye catcher is initialized by the RMB owner prior to 2546 assigning it to a specific TCP connection and communicating its RMB 2547 index to the SMC-R partner. After an RMBE index is communicated to 2548 the SMC-R partner, the RMBE can only be referenced in "read only mode" 2549 by the owner and all updates to it are performed by the remote SMC-R 2550 partner via RDMA write operations. 2552 Initialization of an RMBE must include the following: 2554 o Zeroing out the entire RMBE receive buffer, which helps minimize 2555 data integrity issues (e.g. data from a previous connection 2556 somehow being presented to the current connection). 2558 o Setting the beginning RMBE eye catcher. This eye catcher plays an 2559 important role in helping detect accidental overlays of the RMBE. 2560 The RMB owner must always validate these eye catchers before each 2561 new reference to the RMBE. If the eye catchers are found to be 2562 corrupted, the local host must reset the TCP connection associated 2563 with this RMBE and log the appropriate diagnostic information. 2565 4.4.2. RMB element reuse and conflict resolution 2566 RMB elements can be reused once their associated TCP and SMC-R 2567 connections are terminated. Under normal and abnormal SMC-R 2568 connection termination processing, both SMC-R peers must explicitly 2569 acknowledge that they are done using an RMBE before that element can 2570 be freed and reassigned to another SMC-R connection instance. For 2571 more details on SMC-R connection termination refer to section 4.7. 2572 However, there are some error scenarios where this two-way explicit 2573 acknowledgement may not be completed.
In these scenarios (mentioned 2574 explicitly elsewhere in this document) an RMBE owner may choose to 2575 reassign this RMBE to a new SMC-R connection instance on this SMC link 2576 group. When this occurs, the partner SMC-R peer must detect this 2577 condition during SMC-R rendezvous processing when presented with an 2578 RMBE that it believes is already in use for a different SMC-R 2579 connection. In this case, the SMC-R peer must abort the existing 2580 SMC-R connection associated with this RMBE. The abort processing 2581 resets the TCP connection (if it is still active) but it must not 2582 attempt to perform any RDMA writes to this RMBE and must also ignore 2583 any data sitting in the local RMBE associated with the existing 2584 connection. It then proceeds to free up the local RMBE and notify 2585 the local application that the connection is being abnormally reset. 2587 The remote SMC-R peer then proceeds to normal processing for this new 2588 SMC-R connection. 2590 4.5. SMC-R protocol considerations 2592 The following sections describe considerations for the SMC-R protocol 2593 as compared to the TCP protocol. 2595 4.5.1. SMC-R protocol optimized window size updates 2597 An SMC-R receiver host sends its Consumer Cursor information to the 2598 sender to convey the progress that the receiving application has made 2599 in consuming the sent data. The difference between the writer's 2600 Producer Cursor and the associated receiver's Consumer Cursor 2601 indicates the window size available for the sender to write into. 2602 This is somewhat similar to TCP window update processing and 2603 therefore has some similar considerations, such as silly window 2604 syndrome avoidance, whereby the TCP protocol has an optimization that 2605 minimizes the overhead of very small, unproductive window size 2606 updates associated with sub-optimal socket applications consuming 2607 very small amounts of data on every receive() invocation.
For SMC-R, 2608 the receiver only updates its Consumer Cursor via a unique CDC 2609 message under the following conditions: 2611 o The current window size (from a sender's perspective) is less than 2612 half of the Receive Buffer space and the Consumer Cursor update 2613 will result in a minimum increase in the window size of 10% of the 2614 Receive Buffer space. Some examples: 2616 a. Receive Buffer size: 64K, Current window size (from a 2617 sender's perspective): 50K. No need to update the Consumer 2618 Cursor. Plenty of space is available for the sender. 2620 b. Receive Buffer size: 64K, Current window size (from a 2621 sender's perspective): 30K, Current window size from a 2622 receiver's perspective: 31K. No need to update the Consumer 2623 Cursor; even though the sender's window size < 1/2 of the 2624 64K, the window update would only increase that by 1K which 2625 is < 1/10th of the 64K buffer size. 2627 c. Receive Buffer size: 64K, Current window size (from a 2628 sender's perspective): 30K, Current window size from a 2629 receiver's perspective: 64K. The receiver updates the 2630 Consumer Cursor (sender's window size < 1/2 of the 64K, the 2631 window update would increase that by > 6.4K). 2633 o The receiver must always include a Consumer Cursor update whenever 2634 it sends a CDC message to the partner for another flow (i.e. send 2635 flow in the opposite direction). This allows the window size 2636 update to be delivered with no additional overhead. This is 2637 somewhat similar to TCP DelayAck processing and quite effective 2638 for request/response data patterns. 2640 o The optimized window size updates are overridden when the sender 2641 sets the Consumer Cursor Update Requested flag in a CDC message to 2642 the receiver. When this indicator is on, the consumer must send a 2643 Consumer Cursor update immediately when data is consumed by the 2644 local application or if the cursor has not been updated for a 2645 while (i.e.
the local copy of the consumer cursor does not match the last 2646 consumer cursor value sent to the partner). This allows the 2647 sender to perform optional diagnostics for detecting a stalled 2648 receiver application (data has been sent but not consumed). It is 2649 recommended that the Consumer Cursor Update Requested flag only be 2650 sent for diagnostic procedures as it may result in non-optimal 2651 data path performance. 2653 4.5.2. Small data sends 2655 The SMC-R protocol makes no special provisions for handling small 2656 data segments sent across a stream socket. Data is always sent if 2657 sufficient window space is available. There are no special provisions 2658 for coalescing small data segments, such as those of the TCP Nagle 2659 algorithm. 2661 An implementation of SMC-R may optimize its sending processing by 2662 coalescing outbound data for a given SMC-R connection so that it can 2663 reduce the number of RDMA write operations it performs, in a similar 2664 fashion to Nagle's algorithm. However, any such coalescing would 2665 require a timer on the sending host that would ensure that data was 2666 eventually sent. The sending host would also have to opt out of this 2667 processing if Nagle's algorithm had been disabled (programmatically 2668 or via system configuration). 2670 4.5.3. TCP Keepalive processing 2672 TCP keepalive processing allows applications to direct the local 2673 TCP/IP host to periodically "test" the viability of an idle TCP 2674 connection. Since SMC-R connections have both a TCP representation 2675 and an SMC-R representation, there are unique keepalive 2676 processing considerations: 2678 o SMC-R layer keepalive processing: If keepalive is enabled for an 2679 SMC-R connection, the local host maintains a keepalive timer that 2680 reflects how long an SMC-R connection has been idle. The local 2681 host also maintains a timestamp of last activity for each SMC link 2682 (for any SMC-R connection on that link).
           When it is determined
2683       that an SMC-R connection has been idle longer than the keepalive
2684       interval the host checks whether the SMC-R link has been idle for
2685       a duration longer than the keepalive timeout. If both conditions
2686       are met, the local host then performs a Test Link LLC command to
2687       test the viability of the SMC link over the RoCE fabric (RC-QPs).
2688       If a Test Link LLC command response is received within a
2689       reasonable amount of time then the link is considered viable and
2690       all connections using this link are considered viable as well. If
2691       however a response is not received in a reasonable amount of time
2692       or there's a failure in sending the Test Link LLC command then
2693       this is considered a failure in the SMC link and failover
2694       processing to an alternate SMC link must be triggered. If no
2695       alternate SMC link exists in the SMC link group then all the SMC-R
2696       connections on this link are abnormally terminated by resetting
2697       the TCP connections represented by these SMC-R connections. Given
2698       that multiple SMC-R connections can share the same SMC link,
2699       implementing an SMC link level probe using the Test Link LLC
2700       command will help reduce the amount of unproductive keepalive
2701       traffic for SMC-R connections; as long as some SMC-R connections
2702       on a given SMC link are active (i.e. have had I/O activity within
2703       the keepalive interval) then there is no need to perform
2704       additional link viability testing.

2706    o  TCP layer keepalive processing: Traditional TCP "keepalive"
2707       packets are not as relevant for SMC-R connections given that the
2708       TCP path is not used for these connections once the SMC-R
2709       rendezvous processing is completed. All SMC-R connections by
2710       default have associated TCP connections that are idle. Are TCP
2711       keepalive probes still needed for these connections? There are
2712       two main scenarios to consider:

2714       1.
              TCP keepalives that are used to determine whether the peer TCP
2715          endpoint is still active. This is not needed for SMC-R
2716          connections as the SMC-R level keepalives mentioned above will
2717          determine whether the remote endpoint connections are still
2718          active.

2720       2. TCP keepalives that are used to ensure that TCP connections
2721          traversing an intermediate proxy maintain an active state. For
2722          example, stateful firewalls typically maintain state
2723          representing every valid TCP connection that traverses the
2724          firewall. These types of firewalls are known to expire idle
2725          connections by removing their state in the firewall to conserve
2726          memory. TCP keepalives are often used in this scenario to
2727          prevent firewalls from timing out otherwise idle connections.
2728          When using SMC-R, both end points must reside in the same layer
2729          2 network (i.e. the same subnet). As a result, firewalls
2730          cannot be injected in the path between two SMC-R endpoints.
2731          However, other intermediate proxies, such as TCP/IP layer load
2732          balancers may be injected in the path of two SMC-R endpoints.
2733          These types of load balancers also maintain connection state so
2734          that they can forward TCP connection traffic to the appropriate
2735          cluster end point. When using SMC-R these TCP connections will
2736          appear to be completely idle making them susceptible to
2737          potential timeouts at the LB proxy. As a result, for this
2738          scenario, TCP keepalives may still be relevant.

2740    The following are the TCP level keepalive processing requirements for
2741    SMC-R enabled hosts:

2743    o  SMC-R hosts should allow TCP keepalives to flow on the TCP path of
2744       SMC-R connections based on existing TCP keepalive configuration
2745       and programming options.
           However, it is strongly recommended that
2746       platforms that provide the ability to specify very granular
2747       keepalive timers (for example, single digit second timers) should
2748       consider providing a configuration option that limits the minimum
2749       keepalive timer that will be used for TCP layer keepalives on SMC-
2750       R connections. This is important to minimize the amount of TCP
2751       keepalive packets transmitted in the network for SMC-R
2752       connections.

2754    o  SMC-R hosts must always respond to inbound TCP layer keepalives
2755       (by sending ACKs for these packets) even if the connection is
2756       using SMC-R. Typically, once a TCP connection has completed the
2757       SMC-R rendezvous processing and is using SMC-R for data flows, no
2758       new inbound TCP segments are expected on that TCP connection other
2759       than TCP termination segments (FIN, RST, etc). TCP keepalives are
2760       the one exception that must be supported. And since TCP keepalive
2761       probes do not carry any application layer data this has no adverse
2762       impact on the application's inbound data stream.

2764 4.6. RMB data flows

2766    The following sections describe the RDMA wire flows for the SMC-R
2767    protocol after a TCP connection has switched into SMC-R mode (i.e.
2768    SMC-R rendezvous processing is complete and a pair of RMB elements
2769    has been assigned and communicated by the partner SMC-R hosts). The
2770    ladder diagrams below include the following:

2772    o  RMBE control information kept by each peer. Only a subset of the
2773       information is depicted, specifically only the fields that reflect
2774       the stream of data written by Host A and read by Host B.
2776    o  Time line 0-x that shows the wire flows in a time relative fashion

2778    o  Note that RMBE control information is only shown in a time
2779       interval if its value changed (otherwise assume the value is
2780       unchanged from the previously depicted value)

2782    o  The local copy of the producer and consumer cursors that is
2783       maintained by each host is not depicted in these figures.

2785 4.6.1. Scenario 1: Send flow, window size unconstrained

2787         SMC Host A                                     SMC Host B
2788         RMBE A Info                                    RMBE B Info
2789     (Consumer Cursors)                             (Producer Cursors)
2790    Cursor Wrap Seq#  Time                     Time  Cursor Wrap Seq#  Flags
2791     0        0        0                        0     0        0       0
2792     0        0        1    --------------->    1     0        0       0
2793                              RDMA-WR Data
2794                                 (0:999)
2795     0        0        2    ...............>    2     1000     0       0
2796                               CDC Message

2798         Figure 16 Scenario 1: Send flow, window size unconstrained

2800    Scenario assumptions:

2802    o  Kernel implementation

2804    o  New SMC-R connection, no data has been sent on the connection

2806    o  Host A: Application issues send for 1,000 bytes to Host B

2808    o  Host B: RMBE receive buffer size is 10,000, application has issued
2809       a recv for 10,000 bytes

2811    Flow description:

2813    1. Application issues send() for 1,000 bytes, SMC-R layer copies
2814       data into a kernel send buffer. It then schedules an RDMA write
2815       operation to move the data into the peer's RMBE receive buffer,
2816       at relative position 0-999. Note that no immediate data or alert
2817       (i.e. interrupt) is provided to host B for this RDMA operation.

2819    2. Host A sends a CDC message to update the Producer Cursor to byte
2820       1000. This CDC message will deliver an interrupt to Host B. At
2821       this point, the SMC-R layer can return control back to the
2822       application.
           Host B, once notified of the completion of the
2823       previous RDMA operation, locates the RMBE associated with the
2824       RMBE alert token that was included in the message and proceeds
2825       to perform normal receive side processing, waking up the
2826       suspended application read thread, copying the data into the
2827       application's receive buffer, etc. It will use the Producer
2828       Cursor as an indicator of how much data is available to be
2829       delivered to the local application. After this processing is
2830       complete, the SMC-R layer will also update its local Consumer
2831       Cursor to match the Producer Cursor (i.e. indicating that all
2832       data has been consumed). Note that a message to the peer
2833       updating the Consumer Cursor is not needed at this time as the
2834       window size is unconstrained (> 1/2 of the receive buffer size).
2835       The window size is calculated by taking the difference
2836       between the Producer and the Consumer cursors in the RMBEs
2837       (10,000-1,000=9,000).

2839 4.6.2. Scenario 2: Send/Receive flow, window unconstrained

2841         SMC Host A                                     SMC Host B
2842         RMBE A Info                                    RMBE B Info
2843     (Consumer Cursors)                             (Producer Cursors)
2844    Cursor Wrap Seq#  Time                     Time  Cursor Wrap Seq#  Flags
2845     0        0        0                        0     0        0       0
2846     0        0        1    --------------->    1     0        0       0
2847                              RDMA-WR Data
2848                                 (0:999)
2849     0        0        2    ...............>    2     1000     0       0
2850                               CDC Message

2852     0        0        3    <--------------     3     1000     0       0
2853                              RDMA-WR Data
2854                                 (0:499)
2855     1000     0        4    <..............     4     1000     0       0
2856                               CDC Message

2858       Figure 17 Scenario 2: Send/Recv flow, window size unconstrained

2860    Scenario assumptions:

2862    o  New SMC-R connection, no data has been sent on the connection

2864    o  Host A: Application issues send for 1,000 bytes to Host B

2866    o  Host B: RMBE receive buffer size is 10,000, application has
2867       already issued a recv for 10,000 bytes. Once the receive is
2868       completed, the application sends a 500 byte response to Host A.

2870    Flow description:

2872    1.
           Application issues send() for 1,000 bytes, SMC-R layer copies
2873       data into a kernel send buffer. It then schedules an RDMA write
2874       operation to move the data into the peer's RMBE receive buffer,
2875       at relative position 0-999. Note that no immediate data or alert
2876       (i.e. interrupt) is provided to host B for this RDMA operation.

2878    2. Host A sends a CDC message to update the Producer Cursor to
2879       byte 1000. This CDC message will deliver an interrupt to Host B.
2880       At this point, the SMC-R layer can return control back to the
2881       application.

2883    3. Host B, once notified of the receipt of the previous CDC
2884       message, locates the RMBE associated with the RMBE alert token
2885       and proceeds to perform normal receive side processing, waking
2886       up the suspended application read thread, copying the data into
2887       the application's receive buffer, etc. After this processing is
2888       complete, the SMC-R layer will also update its local Consumer
2889       Cursor to match the Producer Cursor (i.e. indicating that all
2890       data has been consumed). Note that an update of the Consumer
2891       Cursor to the peer is not needed at this time as the window size
2892       is unconstrained (> 1/2 of the receive buffer size). The
2893       application then performs a send() for 500 bytes to Host A. The
2894       SMC-R layer will copy the data into a kernel buffer and then
2895       schedule an RDMA Write into the partner's RMBE receive buffer.
2896       Note that this RDMA write operation includes no immediate data
2897       or notification to Host A.

2899    4. Host B sends a CDC message to update the partner's RMBE Control
2900       information with the latest Producer Cursor (set to 500 and not
2901       shown in the diagram above) and to also inform the peer that the
2902       Consumer Cursor value is now 1000. It also updates the local
2903       Current Consumer Cursor and Last Sent Consumer Cursor to 1000.
2904       This CDC message includes notification since we are updating
2905       our Producer Cursor which requires attention by the peer host.

2907 4.6.3. Scenario 3: Send Flow, window constrained

2909         SMC Host A                                     SMC Host B
2910         RMBE A Info                                    RMBE B Info
2911     (Consumer Cursors)                             (Producer Cursors)
2912    Cursor Wrap Seq#  Time                     Time  Cursor Wrap Seq#  Flags
2913     0        0        0                        0     0        0       0
2914     0        0        1    --------------->    1     0        0       0
2915                              RDMA-WR Data
2916                                (0:2999)
2917     0        0        2    ...............>    2     3000     0       0
2918                               CDC Message
2919     0        0        3                        3     3000     0       0
2920     0        0        4    --------------->    4     3000     0       0
2921                              RDMA-WR Data
2922                               (3000:6999)
2923     0        0        5    ................>   5     7000     0       0
2924                               CDC Message
2925     7000     0        6    <................   6     7000     0       0
2926                               CDC Message

2928          Figure 18 Scenario 3: Send Flow, window size constrained

2930    Scenario assumptions:

2932    o  New SMC-R connection, no data has been sent on this connection

2934    o  Host A: Application issues send for 3,000 bytes to Host B and then
2935       another send for 4,000

2937    o  Host B: RMBE receive buffer size is 10,000. Application has
2938       already issued a recv for 10,000 bytes

2940    Flow description:

2942    1. Application issues send() for 3,000 bytes, SMC-R layer copies
2943       data into a kernel send buffer. It then schedules an RDMA write
2944       operation to move the data into the peer's RMBE receive buffer,
2945       at relative position 0-2,999. Note that no immediate data or
2946       alert (i.e. interrupt) is provided to host B for this RDMA
2947       operation.

2949    2. Host A sends a CDC message to update its Producer Cursor to byte
2950       3000. This CDC message will deliver an interrupt to Host B. At
2951       this point, the SMC-R layer can return control back to the
2952       application.

2954    3. Host B, once notified of the receipt of the previous CDC
2955       message, locates the RMBE associated with the RMBE alert token
2956       and proceeds to perform normal receive side processing, waking
2957       up the suspended application read thread, copying the data into
2958       the application's receive buffer, etc.
           After this processing is
2959       complete, the SMC-R layer will also update its local Consumer
2960       Cursor to match the Producer Cursor (i.e. indicating that all
2961       data has been consumed). It will not however update the partner
2962       with this information as the window size is not constrained
2963       (10000-3000=7000 of available space). The application on Host B
2964       also issues a new recv() for 10,000.

2966    4. On Host A, application issues a send() for 4,000 bytes. The SMC-
2967       R layer copies the data into a kernel buffer and schedules an
2968       async RDMA write into the peer's RMBE receive buffer at relative
2969       position 3000-6999. Note that no alert is provided to host B for
2970       this flow.

2972    5. Host A sends a CDC message to update the Producer Cursor to
2973       byte 7000. This CDC message will deliver an interrupt to Host B.
2974       At this point, the SMC-R layer can return control back to the
2975       application.

2977    6. Host B, once notified of the receipt of the previous CDC
2978       message, locates the RMBE associated with the RMBE alert token
2979       and proceeds to perform normal receive side processing, waking
2980       up the suspended application read thread, copying the data into
2981       the application's receive buffer, etc. After this processing is
2982       complete, the SMC-R layer will also update its local Consumer
2983       Cursor to match the Producer Cursor (i.e. indicating that all
2984       data has been consumed). It will then determine whether it
2985       needs to update the Consumer Cursor to the peer. The available
2986       window size is now 3,000 (10,000 - (Producer Cursor - Last Sent
2987       Consumer Cursor)) which is < 1/2 of the receive buffer size
2988       (10,000/2=5,000) and the advance of the window size is > 10% of
2989       the receive buffer size (1,000). Therefore a CDC message is
2990       issued to update the Consumer Cursor to peer A.

2992 4.6.4.
Scenario 4: Large send, flow control, full window size writes

2994         SMC Host A                                     SMC Host B
2995         RMBE A Info                                    RMBE B Info
2996     (Consumer Cursors)                             (Producer Cursors)
2997    Cursor Wrap Seq#  Time                     Time  Cursor Wrap Seq#  Flags
2998     1000     1        0                        0     1000     1       0
2999     1000     1        1    --------------->    1     1000     1       0
3000                              RDMA-WR Data
3001                               (1000:9999)
3002     1000     1        2    --------------->    2     1000     1       0
3003                              RDMA-WR Data
3004                                 (0:999)
3005     1000     1        3    ...............>    3     1000     2       Wrt
3006                               CDC Message
3007     1000     2        4    <...............    4     1000     2       Wrt
3008                               CDC Message
3009     1000     2        5    --------------->    5     1000     2       Wrt
3010                              RDMA-WR Data                             Blk
3011                               (1000:9999)
3012     1000     2        6    --------------->    6     1000     2       Wrt
3013                              RDMA-WR Data                             Blk
3014                                 (0:999)
3015     1000     2        7    ...............>    7     1000     3       Wrt
3016                               CDC Message
3017     1000     3        8    <...............    8     1000     3       Wrt
3018                               CDC Message
3019      Figure 19 Scenario 4: Large send, flow control, full window size
3020                                   writes

3022    Scenario assumptions:

3024    o  Kernel implementation

3026    o  Existing SMC-R connection, Host B's receive window size is fully
3027       open (Peer Consumer Cursor = Peer Producer Cursor).

3029    o  Host A: Application issues send for 20,000 bytes to Host B

3031    o  Host B: RMB receive buffer size is 10,000, application has issued
3032       a recv for 10,000 bytes

3034    Flow description:

3036    1. Application issues send() for 20,000 bytes, SMC-R layer copies
3037       data into a kernel send buffer (assumes send buffer space of
3038       20,000 is available for this connection). It then schedules an
3039       RDMA write operation to move the data into the peer's RMBE
3040       receive buffer, at relative position 1000-9999. Note that no
3041       immediate data or alert (i.e. interrupt) is provided to host B
3042       for this RDMA operation.

3044    2. Host A then schedules an RDMA write operation to fill the
3045       remaining 1000 bytes of available data into the peer's RMBE
3046       receive buffer, at relative position 0-999. Note that no
3047       immediate data or alert (i.e. interrupt) is provided to host B
3048       for this RDMA operation.
           Also note that an implementation of
3049       SMC-R may optimize this processing by combining step 1 and 2
3050       into a single RDMA Write operation (with 2 different data
3051       sources).

3053    3. Host A sends a CDC message to update the Producer Cursor to byte
3054       1000. Since the entire receive buffer space is filled, the
3055       Producer Writer Blocked flag (WrtBlk indicator above) is set and
3056       the Producer Window Wrap Sequence Number (Producer WrapSeq#
3057       above) is incremented. This CDC message will deliver an
3058       interrupt to Host B. At this point, the SMC-R layer can return
3059       control back to the application.

3061    4. Host B, once notified of the receipt of the previous CDC
3062       message, locates the RMBE associated with the RMBE alert token
3063       and proceeds to perform normal receive side processing, waking
3064       up the suspended application read thread, copying the data into
3065       the application's receive buffer, etc. In this scenario, Host B
3066       notices that the Producer Cursor has not been advanced (same
3067       value as Consumer Cursor), however, it notices that the Producer
3068       Window Wrap Sequence Number is different from its local
3069       value (1) indicating that a full window of new data is
3070       available. All the data in the receive buffer can be processed,
3071       the first segment (1000-9999) followed by the second segment (0-
3072       999). Because the Producer Writer Blocked indicator was set,
3073       Host B schedules a CDC message to update its latest information
3074       to the peer: Consumer Cursor (1000), Consumer Window Wrap
3075       Sequence Number (2: the current Producer Window Wrap Sequence
3076       Number is used).

3078    5.
           Host A, upon receipt of the CDC message locates the RMBE
3079       associated with the alert token, and upon examining the control
3080       information provided notices that Host B has consumed all of the
3081       data (based on the Consumer Cursor and the Consumer Window Wrap
3082       Sequence Number) and initiates the next RDMA write to fill
3083       the receive buffer at offset 1000-9999.

3085    6. Host A then moves the remaining 1000 bytes into the beginning of
3086       the receive buffer (0-999) by scheduling an RDMA write
3087       operation.

3089    7. Host A then sends a CDC message to set the Producer Writer
3090       Blocked indicator and to increment the Producer Window Wrap
3091       Sequence Number (3).

3093    8. Host B, upon notification completes the same processing as step
3094       4 above, including sending a CDC message to update the peer to
3095       indicate that all data has been consumed.

3097 4.6.5. Scenario 5: Send flow, urgent data, window size unconstrained

3099         SMC Host A                                     SMC Host B
3100         RMBE A Info                                    RMBE B Info
3101     (Consumer Cursors)                             (Producer Cursors)
3102    Cursor Wrap Seq#  Time                     Time  Cursor Wrap Seq#  Flag
3103     1000     1        0                        0     1000     1       0
3104     1000     1        1    --------------->    1     1000     1       0
3105                              RDMA-WR Data
3106                               (1000:1499)
3107     1000     1        2    ...............>    2     1500     1       UrgP
3108                               CDC Message                             UrgA

3110     1500     1        3    <...............    3     1500     1       UrgP
3111                               CDC Message                             UrgA

3113     1500     1        4    --------------->    4     1500     1       UrgP
3114                              RDMA-WR Data                             UrgA
3115                               (1500:2499)
3116     1500     1        5    ...............>    5     2500     1       0
3117                               CDC Message

3119       Figure 20 Scenario 5: Send flow, urgent data, window size open

3121    Scenario assumptions:

3123    o  Kernel implementation

3125    o  Existing SMC-R connection, window size open, all data has been
3126       consumed by receiver.
3128    o  Host A: Application issues send for 500 bytes with urgent data
3129       indicator (OOB) to Host B, then sends 1000 bytes of normal data

3131    o  Host B: RMBE Receive buffer size is 10,000, application has issued
3132       a recv for 10,000 bytes and is also monitoring the socket for
3133       urgent data

3135    Flow description:

3137    1. Application issues send() for 500 bytes of urgent data. SMC-R
3138       layer copies data into a kernel send buffer. It then schedules
3139       an RDMA write operation to move the data into the peer's RMBE
3140       receive buffer, at relative position 1000-1499. Note that no
3141       immediate data or alert (i.e. interrupt) is provided to host B
3142       for this RDMA operation.

3144    2. Host A sends a CDC message to update its Producer Cursor to byte
3145       1500 and to turn on the Producer Urgent Data Pending (UrgP) and
3146       Urgent Data Present (UrgA) flags. This CDC message will deliver
3147       an interrupt to Host B. At this point, the SMC-R layer can
3148       return control back to the application.

3150    3. Host B, once notified of the receipt of the previous CDC
3151       message, locates the RMBE associated with the RMBE alert token,
3152       notices that the Urgent Data Pending flag is on and proceeds
3153       with Out of Band socket API notification. For example,
3154       satisfying any outstanding select() or poll() requests on the
3155       socket by indicating that urgent data is pending (i.e. by
3156       setting the exception bit on). The Urgent Data Present indicator
3157       allows Host B to also determine the position of the urgent data
3158       (Producer cursor points one byte beyond the last byte of urgent
3159       data). Host B can then perform normal receive side processing
3160       (including specific urgent data processing), copying the data
3161       into the application's receive buffer, etc. Host B then sends a
3162       CDC message to update the partner's RMBE Control area with its
3163       latest Consumer Cursor (1500).
           Note this CDC message must occur
3164       regardless of the current local window size that is available.
3165       The partner host (Host A) cannot initiate any additional RDMA
3166       writes until acknowledgement that the urgent data has been
3167       processed (or at least processed/remembered at the SMC-R layer).

3169    4. Upon receipt of the message, Host A wakes up, sees that peer
3170       consumed all data up to and including the last byte of Urgent
3171       data and now resumes sending any pending data. In this case,
3172       the application had previously issued a send for 1000 bytes of
3173       normal data which would have been copied in the send buffer and
3174       control would have been returned to the application. Host A now
3175       initiates an RDMA write to move that data to the Peer's receive
3176       buffer at position 1500-2499.

3178    5. Host A then sends a CDC message with inline data to update its
3179       Producer Cursor value (2500) and turn off the Urgent Data
3180       Pending and Urgent Data Present flags. Host B wakes up,
3181       processes the new data (resumes application, copies data into
3182       the application receive buffer) and then proceeds to update the
3183       Local current consumer cursor (2500). Given that the window size
3184       is unconstrained there is no need for Consumer Cursor update in
3185       the peer's RMBE.

3187 4.6.6. Scenario 6: Send flow, urgent data, window size closed

3189         SMC Host A                                     SMC Host B
3190         RMBE A Info                                    RMBE B Info
3191     (Consumer Cursors)                             (Producer Cursors)
3192    Cursor Wrap Seq#  Time                     Time  Cursor Wrap Seq#  Flag
3193     1000     1        0                        0     1000     2       Wrt
3194                                                                       Blk

3196     1000     1        1    ...............>    1     1000     2       Wrt
3197                               CDC Message                             Blk
3198                                                                       UrgP

3200     1000     2        2    <...............    2     1000     2       Wrt
3201                               CDC Message                             Blk
3202                                                                       UrgP

3204     1000     2        3    --------------->    3     1000     2       Wrt
3205                              RDMA-WR Data                             Blk
3206                               (1000:1499)                             UrgP

3208     1000     2        4    ...............>    4     1500     2       UrgP
3209                               CDC Message                             UrgA

3211     1500     2        5    <...............
                                                    5     1500     2       UrgP
3212                               CDC Message                             UrgA

3214     1500     2        6    --------------->    6     1500     2       UrgP
3215                              RDMA-WR Data                             UrgA
3216                               (1500:2499)
3217     1000     2        7    ...............>    7     2500     2       0
3218                               CDC Message

3220      Figure 21 Scenario 6: Send flow, urgent data, window size closed

3222    Scenario assumptions:

3224    o  Kernel implementation

3226    o  Existing SMC-R connection, window size closed, writer is blocked.

3228    o  Host A: Application issues send for 500 bytes with urgent data
3229       indicator (OOB) to Host B, then sends 1000 bytes of normal data.

3231    o  Host B: RMBE Receive buffer size is 10,000, application has no
3232       outstanding recv() (for normal data) and is monitoring the socket
3233       for urgent data.

3235    Flow description:

3237    1. Application issues send() for 500 bytes of urgent data. SMC-R
3238       layer copies data into a kernel send buffer (if available).
3239       Since the writer is blocked (window size closed) it cannot send
3240       the data immediately. It then sends a CDC message to notify the
3241       peer of the Urgent Data Pending (UrgP) indicator (the Writer
3242       Blocked indicator remains on as well). This serves as a signal
3243       to Host B that urgent data is pending in the stream. Control is
3244       also returned to the application at this point.

3246    2. Host B, once notified of the receipt of the previous CDC
3247       message, locates the RMBE associated with the RMBE alert token,
3248       notices that the Urgent Data Pending flag is on and proceeds
3249       with Out of Band socket API notification. For example,
3250       satisfying any outstanding select() or poll() requests on the
3251       socket by indicating that urgent data is pending (i.e. by
3252       setting the exception bit on). At this point it is expected that
3253       the application will enter urgent data mode processing,
3254       expeditiously processing all normal data (by issuing recv API
3255       calls) so that it can get to the urgent data byte.
           Whether the
3256       application has this urgent mode processing or not, at some
3257       point the application will consume some or all of the pending
3258       data in the receive buffer. When this occurs, Host B will also
3259       send a CDC message with inline data to update its Consumer
3260       Cursor and Consumer Window Wrap Sequence Number to the peer. In
3261       the example above, a full window worth of data was consumed.

3263    3. Host A, once awakened by the message will notice that the window
3264       size is now open on this connection (based on the Consumer
3265       Cursor and the Consumer Window Wrap Sequence Number which now
3266       matches the Producer Window Wrap Sequence Number) and resume
3267       sending of the urgent data segment by scheduling an RDMA write
3268       into relative position 1000-1499.

3270    4. Host A then sends a CDC message to advance its Producer Cursor
3271       (1500) and to also notify Host B of the Urgent Data Present
3272       (UrgA) indicator (and turn off the Writer Blocked indicator).
3273       This signals to Host B that the urgent data is now in the local
3274       receive buffer and that the Producer Cursor points to the last
3275       byte of urgent data.

3277    5. Host B wakes up, processes the urgent data and once the urgent
3278       data is consumed sends a CDC message with inline data to update
3279       its Consumer Cursor (1500).

3281    6. Host A wakes up, sees that Host B has consumed the sequence
3282       number associated with the urgent data and then initiates the
3283       next RDMA write operation to move the 1000 bytes associated with
3284       the next send() of normal data into the peer's receive buffer at
3285       position (1500-2499). Note that the send() API would have likely
3286       completed earlier in the process by copying the 1000 bytes into
3287       a send buffer and returning back to the application even though
3288       we could not send any new data until the urgent data was
3289       processed and acknowledged by Host B.

3291    7.
           Host A sends a CDC message to advance its Producer Cursor to
3292       2500 and to reset the Urgent Data Pending and Present flags.
3293       Host B wakes up and processes the inbound data.

3295 4.7. Connection termination

3297    Just as SMC-R connections are established using a combination of TCP
3298    connection establishment flows and SMC-R protocol flows, the
3299    termination of SMC-R connections also uses a similar combination of
3300    SMC-R protocol termination flows and normal TCP protocol connection
3301    termination flows. The following sections describe the SMC-R protocol
3302    normal and abnormal connection termination flows.

3304 4.7.1. Normal SMC-R connection termination flows

3306    Normal SMC-R connection termination flows are triggered via the
3307    normal stream socket API semantics, namely by the application issuing
3308    a close() or shutdown() API. Most applications, after consuming all
3309    incoming data and after sending any outbound data will then issue a
3310    close() API to indicate that they are done both sending and receiving
3311    data. Some applications, typically a small percentage, make use of the
3312    shutdown() API that allows them to indicate that the application is
3313    done sending data, receiving data or both sending and receiving data.
3314    The main use of this API is in scenarios where a TCP application wants
3315    to alert its partner end point that it is done sending data, yet is
3316    still receiving data on its socket (shutdown for Write). Issuing
3317    shutdown for both sending and receiving data is really no different
3318    than issuing a close() and can therefore be treated in a similar
3319    fashion. Shutdown for read is typically not a very useful operation
3320    and in normal circumstances does not trigger any network flows to
3321    notify the partner TCP end point of this operation.

3323    These same trigger points will be used by the SMC-R layer to initiate
3324    SMC-R connection termination flows.
        The main design point for SMC-R
3325    normal connection termination flows is to use the SMC-R protocol to
3326    first shut down the SMC-R connection and free up any SMC-R RDMA
3327    resources and then allow the normal TCP connection termination
3329    protocol (i.e. FIN processing) to drive cleanup of the TCP connection.
3330    This design point is very important in ensuring that RDMA resources
3331    such as the RMBEs are only freed and reused when both SMC-R end points
3332    are completely done with their RDMA Write operations to the partner's
3333    RMBE.

3335                                  1
3336                         +-----------------+
3337         |-------------->|     CLOSED      |<-------------|
3338     3D  |               |                 |              |  4D
3339         |               +-----------------+              |
3340         |                        |                       |
3341         |                      2 |                       |
3342         |                        V                       |
3343 +----------------+      +-----------------+      +----------------+
3344 |AppFinCloseWait |      |     ACTIVE      |      |PeerFinCloseWait|
3345 |                |      |                 |      |                |
3346 +----------------+      +-----------------+      +----------------+
3347         |              |                  |               |
3348         | Active Close | 3A      |     4A | Passive Close |
3349         |              V         |        V               |
3350         |       +--------------+ | +-------------+        |
3351         |--<----|PeerCloseWait1| | |AppCloseWait1|--->----|
3352     3C  |       |              | | |             |        |  4C
3353         |       +--------------+ | +-------------+        |
3354         |              |         |        |               |
3355         |              | 3B      |     4B |               |
3356         |              V         |        V               |
3357         |       +--------------+ | +-------------+        |
3358         |--<----|PeerCloseWait2| | |AppCloseWait2|--->----|
3359                 |              | | |             |
3360                 +--------------+ | +-------------+
3361                                  |
3362                                  |
3363                    Figure 22 SMC-R connection states

3365    Figure 22 describes the states that an SMC-R connection typically
3366    goes through. Note that there are variations to these states that can
3367    occur when an SMC-R connection is abnormally terminated, similar in a
3368    way to when a TCP connection is reset. The following are the high
3369    level state transitions for an SMC-R connection:

3371    1. An SMC-R connection begins in the Closed state. This state is
3372       meant to reflect an RMBE that is not currently in use (was
3373       previously in use but no longer is or one that was never
3374       allocated)

3376    2.
           An SMC-R connection progresses to the Active state once the SMC-
3377       R rendezvous processing has successfully completed, RMB element
3378       indices have been exchanged and SMC-R links have been activated.
3379       In this state, the TCP connection is fully established,
3380       rendezvous processing has been completed and SMC-R peers can begin
3381       exchange of data via RDMA.

3383    3. Active close processing (on SMC-R peer that is initiating the
3384       connection termination)

3386       A. When an application on one of the SMC-R connection peers issues
3387       a close() or shutdown(write or both) the SMC-R layer on that host
3388       will initiate SMC-R connection termination processing. First, if
3389       close() or shutdown(both) is issued it will check to see that
3390       there's no data in the local RMB element that has not been read
3391       by the application. If unread data is detected, the SMC-R
3392       connection must be abnormally reset - for more detail on this
3393       refer to "SMC-R connection reset". If no unread data is pending,
3394       it then checks to see whether any outstanding data is waiting to
3395       be written to the peer or if any outstanding RDMA writes for this
3396       SMC-R connection have not yet completed. If either of these two
3397       scenarios is true, an indicator that this connection is in a
3398       pending close state is saved in internal data structures
3399       representing this SMC-R connection and control is returned to the
3400       application. If all data to be written to the partner has
3401       completed this peer will send a CDC message to notify the peer of
3402       either the PeerConnectionClosed indicator (close or shutdown for
3403       both was issued) or the PeerDoneWriting indicator. This will
3404       provide stimulus to the partner SMC-R peer that the connection is
3405       terminating. At this point the local side of the SMC-R connection
3406       transitions into the PeerCloseWait1 state and control can be
3407       returned to the application.
If this process could not be 3408 completed synchronously (close pending condition mentioned above), 3409 it is completed when all RDMA writes for data and control cursors 3410 have been completed. 3412 B. At some point the SMC-R peer application (passive close) will 3413 consume all incoming data, realize that the partner is done 3414 sending data on this connection and proceed to initiate its own 3415 close of the connection once it has completed sending all data 3416 from its end. The partner application can initiate this 3417 connection termination processing via the close() or shutdown() 3418 APIs. If the application does so by issuing a shutdown() for 3419 write, then the partner SMC-R layer will send a CDC message to 3420 notify the peer (active close side) of the PeerDoneWriting 3421 indicator. When the "active close" SMC-R peer wakes up as a 3422 result of the previous CDC message, it will notice that the 3423 PeerDoneWriting indicator is now on and transition to the 3424 PeerCloseWait2 state. This state indicates that the peer is done 3425 sending data and may still be reading data. The "active close" 3426 peer will also at this point need to ensure that any outstanding 3427 recv() calls for this socket are woken up and remember that 3428 no more data is forthcoming on this connection (in case the local 3429 connection was shutdown() for write only). 3431 C. This flow is a common transition from 3A or 3B above. When the 3432 SMC-R peer (passive close) consumes all data, updates all 3433 necessary cursors to the peer and the application closes its 3434 socket (close or shutdown for both), it will send a CDC message to 3435 the peer (the active close side) with the PeerConnectionClosed 3436 indicator set. At this point the connection can transition back 3437 to Closed state if the local application has already closed (or 3438 issued shutdown for both) the socket.
Once in the Closed state, 3439 the RMBE can now safely be reused for a new SMC-R connection. 3440 When the PeerConnectionClosed indicator is turned on, the SMC-R 3441 peer is indicating that it is done updating the partner's RMBE. 3443 D. Conditional State: If the local application has not yet issued 3444 a close() or shutdown(both), we need to wait until the 3445 application does so (AppFinCloseWait state). Once it does, the local 3446 host will send a CDC message to notify the peer of the 3447 PeerConnectionClosed indicator and then transition to the Closed 3448 state. 3450 4. Passive close processing (on the SMC-R peer that receives an 3451 indication that the partner is closing the connection) 3453 A. Upon receipt of an inbound RDMA write notice the SMC-R layer 3454 will detect that the PeerConnectionClosed indicator or 3455 PeerDoneWriting indicator is on. If any outstanding recv() calls 3456 are pending, they are completed with an indicator that the partner 3457 has closed the connection (zero length data presented to 3458 application). If any data is pending to be written and 3459 PeerConnectionClosed is on, then an SMC-R connection reset must be 3460 performed. The connection then enters the AppCloseWait1 state on 3461 the passive close side waiting for the local application to 3462 initiate its own close processing. 3463 B. If the local application issues a shutdown() for writing, then 3464 the SMC-R layer will send a CDC message to notify the partner of 3465 the PeerDoneWriting indicator and transition the local side of the 3466 SMC-R connection to the AppCloseWait2 state. 3468 C. When the application issues a close() or shutdown() for both, 3469 the local SMC-R peer will send a message informing the peer of 3470 the PeerConnectionClosed indicator and transition to the Closed 3471 state if the peer has also sent the local stack the 3472 PeerConnectionClosed indicator.
If the peer has not sent the 3473 PeerConnectionClosed indicator, we transition into the 3474 PeerFinCloseWait state. 3476 D. The local SMC-R connection stays in this state until the peer 3477 sends the PeerConnectionClosed indicator in our RMBE. When the 3478 indicator is sent we transition to the Closed state and are then 3479 free to reuse this RMBE. 3481 Note that each SMC-R peer needs to provide some logic that will prevent 3482 being stranded in a termination state indefinitely. For example, if an 3483 Active Close SMC-R host is in a PeerCloseWait (1 or 2) state awaiting 3484 the remote SMC-R peer to update its connection termination status, it 3485 needs to provide a timer that will prevent it from waiting in that 3486 state indefinitely should the remote SMC-R peer not respond to this 3487 termination request. This could occur in error scenarios; for 3488 example, if the remote SMC-R peer suffered a failure prior to being 3489 able to respond to the termination request or the remote application 3490 is not responding to this connection termination request by closing 3491 its own socket. This latter scenario is similar to the TCP FINWAIT2 3492 state that has been known to sometimes cause issues when remote 3493 TCP/IP hosts lose track of established connections and neglect to 3494 close them. Even though the TCP standards do not mandate a timeout 3495 from the TCP FINWAIT2 state, most TCP/IP implementations implement a 3496 timeout for this state. A similar timeout will be required for SMC-R 3497 connections. When this timeout occurs, the local SMC-R host performs 3498 TCP reset processing for this connection. However, no additional 3499 RDMA writes to the partner RMBE can occur at this point (we have 3500 already indicated that we are done updating the peer's RMBE). After 3501 the TCP connection is reset, the RMBE can be returned to the free pool 3502 for reallocation. See section 3.2.5 for more details.
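The normal close transitions of Figure 22, together with the termination timer just described, can be sketched as a small state machine. This is only an illustrative sketch and not part of the protocol definition: the class, event and attribute names are hypothetical, the state names follow the labels in Figure 22, and the checks for unread data, pending RDMA writes and the CDC messages themselves are deliberately not modeled.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    ACTIVE = auto()
    PEER_CLOSE_WAIT_1 = auto()
    PEER_CLOSE_WAIT_2 = auto()
    APP_FIN_CLOSE_WAIT = auto()
    APP_CLOSE_WAIT_1 = auto()
    APP_CLOSE_WAIT_2 = auto()
    PEER_FIN_CLOSE_WAIT = auto()


class Event(Enum):  # hypothetical event names
    LOCAL_SHUTDOWN_WRITE = auto()  # application issued shutdown(write)
    LOCAL_CLOSE = auto()           # application issued close() or shutdown(both)
    PEER_DONE_WRITING = auto()     # peer signalled PeerDoneWriting
    PEER_CONN_CLOSED = auto()      # peer signalled PeerConnectionClosed
    TIMEOUT = auto()               # local termination timer expired


PEER_WAIT = (State.PEER_CLOSE_WAIT_1, State.PEER_CLOSE_WAIT_2)
APP_WAIT = (State.APP_CLOSE_WAIT_1, State.APP_CLOSE_WAIT_2)


class Connection:
    """Tracks one SMC-R connection, starting in the Active state."""

    def __init__(self):
        self.state = State.ACTIVE
        self.local_closed = False  # local close() / shutdown(both) seen
        self.peer_closed = False   # PeerConnectionClosed seen

    def handle(self, ev):
        if ev is Event.LOCAL_CLOSE:
            self.local_closed = True
        elif ev is Event.PEER_CONN_CLOSED:
            self.peer_closed = True

        s = self.state
        if s is State.ACTIVE:
            if ev in (Event.LOCAL_SHUTDOWN_WRITE, Event.LOCAL_CLOSE):
                self.state = State.PEER_CLOSE_WAIT_1          # 3A
            elif ev in (Event.PEER_DONE_WRITING, Event.PEER_CONN_CLOSED):
                self.state = State.APP_CLOSE_WAIT_1           # 4A
        elif s in PEER_WAIT:
            if ev is Event.PEER_DONE_WRITING:
                self.state = State.PEER_CLOSE_WAIT_2          # 3B
            elif ev is Event.PEER_CONN_CLOSED:                # 3C
                self.state = (State.CLOSED if self.local_closed
                              else State.APP_FIN_CLOSE_WAIT)
            elif ev is Event.TIMEOUT:
                # Modeled as: TCP reset performed, RMBE returned to pool.
                self.state = State.CLOSED
        elif s is State.APP_FIN_CLOSE_WAIT:
            if ev is Event.LOCAL_CLOSE:
                self.state = State.CLOSED                     # 3D
        elif s in APP_WAIT:
            if ev is Event.LOCAL_SHUTDOWN_WRITE:
                self.state = State.APP_CLOSE_WAIT_2           # 4B
            elif ev is Event.LOCAL_CLOSE:                     # 4C
                self.state = (State.CLOSED if self.peer_closed
                              else State.PEER_FIN_CLOSE_WAIT)
        elif s is State.PEER_FIN_CLOSE_WAIT:
            if ev is Event.PEER_CONN_CLOSED:
                self.state = State.CLOSED                     # 4D
        return self.state
```

Feeding LOCAL_CLOSE, PEER_DONE_WRITING and PEER_CONN_CLOSED in that order walks the active close path 3A, 3B, 3C. Delivering TIMEOUT while in either PeerCloseWait state models the timer expiry, after which the RMBE may be reused.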
3504 Also note that it is possible to have two SMC-R end points initiate 3505 an Active close concurrently. In that scenario the flows above still 3506 apply; however, both end points follow the active close path (path 3507 3). 3509 4.7.1.1. Abnormal SMC-R connection termination flows 3511 Abnormal SMC-R connection termination can occur for a variety of 3512 reasons, including: 3514 o The TCP connection associated with an SMC-R connection is reset. 3515 In the TCP protocol either end point can send a RST segment to 3516 abort an existing TCP connection when error conditions are 3517 detected for the connection or the application overtly requests 3518 that the connection be reset. 3520 o Normal SMC-R connection termination processing has unexpectedly 3521 stalled for a given connection. When the stall is detected 3522 (connection termination timeout condition) an abnormal SMC-R 3523 connection termination flow is initiated. 3525 In these scenarios it is very important that resources associated 3526 with the affected SMC-R connections are properly cleaned up to ensure 3527 that there are no orphaned resources and that resources can reliably 3528 be reused for new SMC-R connections. Given that SMC-R relies heavily 3529 on the RDMA Write processing, special care needs to be taken to 3530 ensure that an RMBE is no longer being used by an SMC-R peer before 3531 logically reassigning that RMBE to a new SMC-R connection. 3533 When an SMC-R host initiates a TCP connection reset it also initiates 3534 an SMC-R abnormal connection flow at the same time. The SMC-R peers 3535 explicitly signal their intent to abnormally terminate an SMC-R 3536 connection and await explicit acknowledgement that the peer has 3537 received this notification and has also completed abnormal connection 3538 termination on its end. Note that TCP connection reset processing can 3539 occur in parallel to these flows.
3541 +-----------------+ 3542 |-------------->| CLOSED |<-------------| 3543 | | | | 3544 | +-----------------+ | 3545 | | 3546 | | 3547 | | 3548 | +-----------------+ | 3549 | | Any State | | 3550 |1B | (before setting | 2B| 3551 | | PeerConnClosed | | 3552 | | Indicator in | | 3553 | | Peer's RMBE) | | 3554 | +-----------------+ | 3555 | 1A | | 2A | 3556 | Active Abort | | Passive Abort | 3557 | V V | 3558 | +--------------+ +--------------+ | 3559 |-------|PeerAbortWait | | Process Abort|------| 3560 | | | | 3561 +--------------+ +--------------+ 3563 Figure 23 SMC-R abnormal connection termination state diagram 3565 Figure 23 above shows the SMC-R abnormal connection termination state 3566 diagram: 3568 1. Active abort designates the SMC-R peer that is initiating the 3569 TCP RST processing. At the time that the TCP RST is sent the 3570 active abort side must also 3572 A. Send the PeerConnAbort indicator to the partner via RDMA 3573 messaging with inline data and then transition to the 3574 PeerAbortWait state. During this state it will monitor this SMC- 3575 R connection waiting for the peer to send its corresponding 3576 PeerConnAbort indicator but will ignore any other activity in 3577 this connection (i.e. new incoming data). It will also surface an 3578 appropriate error to any socket API calls issued against this 3579 socket (e.g. ECONNABORTED, ECONNRESET, etc.). 3581 B. Once the peer sends the PeerConnAbort indicator to the local 3582 host, the local host can transition this SMC-R connection to the 3583 Closed state and reuse this RMBE. Note that the SMC-R peer that 3584 goes into the Active abort state must provide some protection 3585 against staying in that state indefinitely should the remote SMC- 3586 R peer not respond by sending its own PeerConnAbort indicator to 3587 the local host.
While this should be a rare scenario, it could 3588 occur if the remote SMC-R peer (passive abort) suffered a failure 3589 right after the local SMC-R host (active abort) sent the 3590 PeerConnAbort indicator. To protect against these types of 3591 failures, a timer can be set after entering the PeerAbortWait 3592 state, and if that timer pops before the peer has sent its 3593 local PeerConnAbort indicator (to the active abort side) then 3594 this RMBE can be returned to the free pool for possible re- 3595 allocation. See section 3.2.5 for more details. 3597 2. Passive abort designates the SMC-R peer that is the recipient of 3598 an SMC-R abort from the peer designated by the PeerConnAbort 3599 indicator being sent by the peer in a CDC message. Upon 3600 receiving this request, the local peer must 3602 A. Indicate to the socket application that this connection has 3603 been aborted using the appropriate error codes, and purge all in- 3604 flight data for this connection that is waiting to be read or 3605 waiting to be sent. 3607 B. Send a CDC message to notify the peer of the PeerConnAbort 3608 indicator and once that is completed transition this RMBE to the 3609 Closed state. 3611 If an SMC-R host receives a TCP RST for a given SMC-R connection it 3612 also initiates SMC-R abnormal connection termination processing if it 3613 has not already been notified (via the PeerConnAbort indicator) that 3614 the partner is severing the connection. It is possible to have two 3615 SMC-R endpoints concurrently be in an Active abort role for a given 3616 connection. In that scenario the flows above still apply but both 3617 end points take the active abort path (path 1). 3619 4.7.1.2. Other SMC-R connection termination conditions 3620 The following are additional conditions that have implications for 3621 SMC-R connection termination: 3623 o An SMC-R host being gracefully shut down.
If an SMC-R host supports 3624 a graceful shutdown operation it should attempt to terminate all 3625 SMC-R connections as part of shutdown processing. This could be 3626 accomplished via LLC Delete Link requests on all active SMC Links. 3628 o Abnormal termination of an SMC-R host. In this example, there may 3629 be no opportunity for the host to perform any SMC-R cleanup 3630 processing. In this scenario it is up to the remote peer to 3631 detect a RoCE communications failure with the failing host. This 3632 could trigger an SMC link switch but that would also surface RoCE 3633 errors causing the remote host to eventually terminate all 3634 existing SMC-R connections to this peer. 3636 o Loss of RoCE connectivity between two SMC-R peers. If two peers 3637 are no longer reachable across any links in their SMC Link group 3638 then both peers perform a TCP reset for the connections, surface 3639 an error to the local applications and free up all QP resources 3640 associated with the link group. 3642 5. Security considerations 3644 5.1. VLAN considerations 3646 The concepts and access control of virtual LANs (VLANs) must be 3647 extended to also cover the RoCE network traffic flowing across the 3648 ethernet. 3650 The RoCE VLAN configuration and accesses must mirror the IP VLAN 3651 configuration and accesses over the CEE fabric. This means that 3652 hosts, routers and switches that have access to specific VLANs on the 3653 IP fabric must also have the same VLAN access across the RoCE 3654 fabric. In other words, the SMC-R connectivity will follow the same 3655 virtual network access permissions as normal TCP/IP traffic. 3657 5.2. Firewall considerations 3659 As mentioned above, the RoCE fabric inherits the same VLAN 3660 topology/access as the IP fabric. RoCE is a layer 2 protocol that 3661 requires both end points to reside in the same layer 2 network (i.e. 3662 VLAN). 
RoCE traffic cannot traverse multiple VLANs as there is no 3663 support for routing RoCE traffic beyond a single VLAN. As a result, 3664 SMC-R communications will also be confined to stacks that are members 3665 of the same VLAN. IP-based firewalls are typically inserted between 3666 VLANs (or physical LANs) and rely on normal IP routing to insert 3667 themselves in the data path. Since RoCE (and by extension SMC-R) is 3668 not routable beyond the local VLAN, there is no ability to insert a 3669 firewall in the network path of two SMC-R peers. 3671 5.3. IP Filters 3673 Because SMC-R maintains the TCP three-way handshake for connection 3674 setup before switching to RoCE out of band, existing IP filters that 3675 control connection setup flows remain effective in an SMC-R 3676 environment. IP filters that operate on traffic flowing in an active 3677 TCP connection are not supported, because the connection data does 3678 not flow over IP. 3680 5.4. Intrusion Detection Services 3682 Similar to IP filters, intrusion detection services that operate on 3683 TCP connection setups are compatible with SMC-R with no changes 3684 required. However, once the TCP connection has switched to RoCE out 3685 of band, packets are not available for examination. 3687 5.5. IP Security (IPSec) 3689 IP Security is not compatible with SMC-R because there are no IP 3690 packets to operate on. TCP connections that require IP security must 3691 opt out of SMC-R. 3693 5.6. TLS/SSL 3695 TLS/SSL is preserved in an SMC-R environment. The TLS/SSL layer 3696 resides above the SMC-R layer and outgoing connection data is 3697 encrypted before being passed down to the SMC-R layer for RDMA write. 3698 Similarly, incoming connection data goes through the SMC-R layer 3699 encrypted and is decrypted by the TLS/SSL layer as it is today.
3701 The TLS/SSL handshake messages flow over the TCP connection after the 3702 connection has switched to SMC-R, so are exchanged using RDMA writes 3703 by the SMC-R layer, transparently to the TLS/SSL layer. 3705 6. IANA considerations 3707 The scarcity of TCP option codes available for assignment is 3708 understood and this architecture uses experimental TCP options 3709 following the conventions of draft-ietf-tcpm-experimental-options- 3710 01.txt. 3712 If this protocol achieves wide acceptance a discrete option code may 3713 be requested by subsequent versions of this protocol. 3715 7. References 3717 7.1. Normative References 3719 [ROCE] RDMA over Converged Ethernet specification, URL, 3720 http://members.infinibandta.org/kwspub/spec/Annex_RoCE_fina 3721 l.pdf 3723 [IBTA] Infiniband Architecture specification, URL, 3724 http://www.infinibandta.org/specs 3726 [RFC793] University of Southern California Information Services 3727 Institute, "Transmission Control Protocol", RFC 793, 3728 September 1981. 3730 [RFC4727] Fenner B., "Experimental Values in IPv4, IPv6, ICMPv4, 3731 ICMPv6, UDP, and TCP Headers", RFC 4727, November 2006. 3733 7.2. Informative References 3735 [Tou2012] Touch, J., "Shared use of Experimental TCP Options", draft 3736 URL, http://tools.ietf.org/html/draft-ietf-tcpm- 3737 experimental-options-01 3739 8. Acknowledgments 3741 This document was prepared using 2-Word-v2.0.template.dot. 3743 9. Conventions used in this document 3745 In the rendezvous flow diagrams, dashed lines (----) are used to 3746 indicate flows over the TCP/IP fabric and dotted lines (....) are 3747 used to indicate flows over the RoCE fabric. 3749 In the data transfer ladder diagrams, dashed lines (----) are used to 3750 indicate RDMA write operations and dotted lines (....) are used to 3751 indicate CDC messages, which are RDMA messages with inline data that 3752 contain control information for the connection. 3754 Appendix A. Formats 3756 A.1. 
TCP option 3758 The SMC-R TCP option is formatted in accordance with draft-ietf-tcpm- 3759 experimental-options-01.txt. The magic number is IBM-1047 (EBCDIC) 3760 encoding for 'SMCR' 3762 0 1 2 3 3763 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 3764 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3765 | Kind = 254 | Length = 6 | x'E2' | x'D4' | 3766 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3767 | x'C3' | x'D9' | 3768 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3769 Figure 24 SMC-R TCP option format 3771 A.2. CLC messages 3773 The following rules apply to all CLC messages: 3775 General rules on formats: 3777 o Reserved fields must be set to zero and not validated 3779 o Each message has an eyecatcher at the start and another eyecatcher 3780 at the end. These must both be validated by the receiver. 3782 o SMC version indicator: The only SMC-R version defined in this 3783 architecture is version 1. In the future, if peers have a 3784 mismatch of versions, the lowest common version number is used. 3786 A.2.1. Peer ID format 3788 All CLC messages contain a peer ID that uniquely identifies an 3789 instance of a stack. This peer ID is required to be universally 3790 unique across stacks and instances (including restarts) of stacks. 3792 0 1 2 3 3793 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 3794 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3795 | Instance ID | RoCE MAC (first two bytes) | 3796 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3797 | RoCE MAC (last four bytes) | 3798 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3799 Figure 25 Peer ID format 3801 Instance ID 3803 A two-byte instance count that ensures that if the same RNIC MAC 3804 is later used in the peer ID for a different stack, for example 3805 if an RNIC is redeployed to another stack, the values are unique. 
3806 It also ensures that if a stack is restarted, the instance ID 3807 changes. Value is implementation defined, with one suggestion 3808 being two bytes of the system clock. 3810 RoCE MAC 3812 The RoCE MAC address for one of the stack's RNICs. Note that in 3813 a virtualized environment this will be the virtual MAC of one of 3814 the stack's RNICs. 3816 A.2.2. SMC Proposal CLC message format 3818 0 1 2 3 3819 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 3820 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3821 | x'E2' | x'D4' | x'C3' | x'D9' | 3822 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3823 | Type = 1 | Length |Version| Rsrvd | 3824 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3825 | | 3826 +- Client's Peer ID -+ 3827 | | 3828 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3829 | | 3830 +- -+ 3831 | | 3832 +- Client's preferred GID -+ 3833 | | 3834 +- -+ 3835 | | 3836 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3837 | Client's preferred RoCE | 3838 +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3839 | | Reserved | 3840 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3841 | IPv4 Subnet Mask | 3842 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3843 | IPv4 Mask Lgth| Reserved |Num IPv6 prfx | 3844 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3845 : : 3846 : (Variable length) array of IPv6 Prefixes : 3847 : : 3848 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3849 | x'E2' | x'D4' | x'C3' | x'D9' | 3850 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3852 Figure 26 SMC Proposal CLC message format 3854 The fields present in the SMC Proposal CLC message are: 3856 Eyecatchers 3858 Like all CLC messages, the SMC Proposal has beginning and ending 3859 eyecatchers to aid with verification and parsing. 
The hex digits 3860 spell 'SMCR' in IBM-1047 (EBCDIC). 3862 Type 3863 CLC message type 1 indicates SMC Proposal. 3865 Length 3867 The length of this CLC message. If this is an IPv4 flow, this 3868 value is 52. Otherwise it is variable depending upon how many 3869 prefixes are listed. 3871 Version 3873 Version of the SMC-R protocol. Version 1 is the only currently 3874 defined value. 3876 Client's Peer ID 3878 As described in A.2.1. above 3880 Client's preferred RoCE GID 3882 This is the IPv6 address of the client's preferred RNIC on the 3883 RoCE fabric. 3885 Client's preferred RoCE MAC address 3887 The MAC address of the client's preferred RNIC on the RoCE 3888 fabric. It is required as some operating systems do not have 3889 neighbor discovery or ARP support for RoCE RNICs. 3891 IPv4 Subnet mask 3893 If this message is flowing over an IPv4 TCP connection, the value 3894 of the subnet mask associated with the interface the client sent 3895 this message over. If this is an IPv6 flow, this field is all zeroes. 3897 IPv4 Mask Lgth 3899 If this message is flowing over an IPv4 TCP connection, the 3900 number of significant bits in the IPv4 subnet mask. If this is an 3901 IPv6 flow, this field is zero. 3903 Num IPv6 prfx 3905 If this message is flowing over an IPv6 TCP connection, the 3906 number of IPv6 prefixes that follow, with a maximum value of 8. 3907 If this is an IPv4 flow, this field is zero and is immediately 3908 followed by the ending eyecatcher. 3910 Array of IPv6 Prefixes 3912 For IPv6 TCP connections, a list of the IPv6 prefixes associated 3913 with the network the client sent this message over, up to a 3914 maximum of 8 prefixes.
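As an illustration of the field layout in Figure 26, the following sketch assembles an SMC Proposal CLC message for an IPv4 flow and checks that it comes out at the 52 bytes stated above. It is a sketch under stated assumptions rather than a reference encoder: the function names are hypothetical, all multi-byte fields are assumed to be in network byte order, and the 4-bit Version is assumed to occupy the high-order half of its byte, as the bit ruler in the figure suggests.

```python
import struct

# 'SMCR' in IBM-1047 (EBCDIC), as given in the message figures.
EYECATCHER = bytes([0xE2, 0xD4, 0xC3, 0xD9])
PROPOSAL_TYPE = 1
PROPOSAL_IPV4_LEN = 52


def build_proposal_ipv4(peer_id, gid, mac, subnet_mask, mask_len, version=1):
    """Build an SMC Proposal for an IPv4 flow (hypothetical helper)."""
    if len(peer_id) != 8 or len(gid) != 16:
        raise ValueError("peer ID must be 8 bytes and GID 16 bytes")
    if len(mac) != 6 or len(subnet_mask) != 4:
        raise ValueError("MAC must be 6 bytes and subnet mask 4 bytes")

    msg = EYECATCHER
    # Type (8 bits), Length (16 bits), Version (4 bits) + Reserved (4 bits).
    msg += struct.pack("!BHB", PROPOSAL_TYPE, PROPOSAL_IPV4_LEN,
                       (version & 0x0F) << 4)
    msg += peer_id + gid
    msg += mac + b"\x00\x00"                 # MAC plus 2 reserved bytes
    msg += subnet_mask
    # Mask length, 2 reserved bytes, number of IPv6 prefixes (0 for IPv4).
    msg += struct.pack("!B2xB", mask_len, 0)
    msg += EYECATCHER
    assert len(msg) == PROPOSAL_IPV4_LEN
    return msg


def eyecatchers_valid(msg):
    """Receivers must validate both the leading and trailing eyecatcher."""
    return (len(msg) >= 8 and msg[:4] == EYECATCHER
            and msg[-4:] == EYECATCHER)
```

For an IPv6 flow the same prefix of the message would be followed by the prefix array instead of going straight to the trailing eyecatcher, and the Length field would grow accordingly.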
3916 0 1 2 3 3917 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 3918 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3919 | | 3920 + + 3921 | | 3922 + IPv6 Prefix value + 3923 | | 3924 + + 3925 | | 3926 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3927 | Prefix Length | 3928 +-+-+-+-+-+-+-+-+ 3930 Figure 27 Format for IPv6 Prefix array element 3932 A.2.3. SMC Accept CLC message format 3934 0 1 2 3 3935 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 3936 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3937 | x'E2' | x'D4' | x'C3' | x'D9' | 3938 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3939 | Type = 2 | Length = 68 |Version|F|Rsvd | 3940 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3941 | | 3942 +- Server's Peer ID -+ 3943 | | 3944 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3945 | | 3946 +- -+ 3947 | | 3948 +- Server's RoCE GID -+ 3949 | | 3950 +- -+ 3951 | | 3952 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3953 | Server's RoCE | 3954 +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3955 | | Server QP (bytes 1-2) | 3956 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---+ 3957 |Srvr QP byte 3 | Server RMB Rkey (bytes 1-3) | 3958 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3959 |Srvr RMB byte 4|Server RMB indx| Srvr RMB alert tkn (bytes 1-2)| 3960 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3961 | Srvr RMB alert tkn (bytes 3-4)|Bsize | MTU | Reserved | 3962 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3963 | | 3964 +- Server's RMB virtual address -+ 3965 | | 3966 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3967 | Reserved | Server's initial packet sequence number | 3968 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3969 | x'E2' | x'D4' | x'C3' | x'D9' | 3970 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3971 Figure 28 SMC Accept CLC message format 3973 The fields present on the SMC Accept CLC message are: 3975 Eyecatchers 3976 Like all CLC messages, the SMC Accept has beginning and ending 3977 eyecatchers to aid with verification and parsing. The hex digits 3978 spell 'SMCR' in IBM-1047 (EBCDIC). 3980 Type 3982 CLC message type 2 indicates SMC Accept. 3984 Length 3986 The SMC Accept CLC message is 68 bytes long. 3988 Version 3990 Version of the SMC-R protocol. Version 1 is the only currently 3991 defined value. 3993 F-bit 3995 First Contact flag: A 1-bit flag that indicates that the server 3996 believes this TCP connection is the first SMC-R contact for this 3997 link group. 3999 Server's Peer ID 4001 As described in A.2.1. above 4003 Server's RoCE GID 4005 This is the IPv6 address of the RNIC that the server chose for 4006 this SMC Link. 4008 Server's RoCE MAC address 4010 The MAC address of the server's RNIC for the SMC link. It is 4011 required as some operating systems do not have neighbor discovery 4012 or ARP support for RoCE RNICs. 4014 Server's QP number 4016 The number for the reliably connected queue pair that the server 4017 created for this SMC link. 4019 Server's RMB Rkey 4021 The RDMA Rkey for the RMB that the server created or chose for 4022 this TCP connection. 4024 Server's RMB element index 4026 This indexes which element within the server's RMB will represent 4027 this TCP connection. 4029 Server's RMB element alert token 4031 A platform defined, architecturally opaque token that identifies 4032 this TCP connection. Added by the client as immediate data on 4033 RDMA writes from the client to the server to inform the server 4034 that there is data for this connection to retrieve from the RMB 4035 element. 4037 Bsize: 4039 Server's RMB element buffer size in four-bit compressed 4040 notation: x=4 bits. Actual buffer size value is (2^(x+4)) * 1K. 4041 Smallest possible value is 16K.
Largest size supported by this 4042 architecture is 512K. 4044 MTU 4046 An enumerated value indicating this peer's QP MTU size. The two 4047 peers exchange this value and the minimum of the peer's value 4048 will be used for the QP. This field should only be validated on a 4049 first contact exchange. 4051 The enumerated MTU values are: 4053 0: reserved 4055 1: 256 4057 2: 512 4059 3: 1024 4061 4: 2048 4063 5: 4096 4065 6-15: reserved 4067 Server's RMB virtual address 4069 The virtual address of the server's RMB as assigned by the 4070 server's RNIC. 4072 Server's initial packet sequence number 4074 The starting packet sequence number that this stack will use when 4075 sending to the peer, so that the peer can prepare its QP for the 4076 sequence number to expect. 4078 A.2.4. SMC Confirm CLC message format 4080 0 1 2 3 4081 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4082 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4083 | x'E2' | x'D4' | x'C3' | x'D9' | 4084 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4085 | Type = 3 | Length = 68 |Version| Rsrvd | 4086 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4087 | | 4088 +- Client's Peer ID -+ 4089 | | 4090 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4091 | | 4092 +- -+ 4093 | | 4094 +- Client's RoCE GID -+ 4095 | | 4096 +- -+ 4097 | | 4098 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4099 | Client's RoCE | 4100 +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4101 | | Client QP (bytes 1-2) | 4102 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---+ 4103 |Clnt QP byte 3 | Client RMB Rkey (bytes 1-3) | 4104 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4105 |Clnt RMB byte 4|Client RMB indx| Clnt RMB alert tkn (bytes 1-2)| 4106 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4107 | Clnt RMB alert tkn (bytes 3-4)|Bsize | MTU | Reserved | 4108 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4109 | | 4110 +- Client's RMB Virtual Address -+ 4111 | | 4112 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4113 | Reserved | Client's initial packet sequence number | 4114 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4115 | x'E2' | x'D4' | x'C3' | x'D9' | 4116 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4117 Figure 29 SMC Confirm CLC message format 4119 The SMC Confirm CLC message is nearly identical to the SMC Accept 4120 except that it contains client information and lacks a first contact 4121 flag. 4123 The fields present on the SMC Confirm CLC message are: 4125 Eyecatchers 4127 Like all CLC messages, the SMC Confirm has beginning and ending 4128 eyecatchers to aid with verification and parsing. The hex digits 4129 spell 'SMCR' in IBM-1047 (EBCDIC). 4131 Type 4133 CLC message type 3 indicates SMC Confirm. 4135 Length 4137 The SMC Confirm CLC message is 68 bytes long. 4139 Version 4141 Version of the SMC-R protocol. Version 1 is the only currently 4142 defined value. 4144 Client's Peer ID 4146 As described in A.2.1. above 4148 Client's RoCE GID 4150 This is the IPv6 address of the RNIC that the client chose for 4151 this SMC Link. 4153 Client's RoCE MAC address 4155 The MAC address of the client's RNIC for the SMC link. It is 4156 required as some operating systems do not have neighbor discovery 4157 or ARP support for RoCE RNICs. 4159 Client's QP number 4161 The number for the reliably connected queue pair that the client 4162 created for this SMC link. 4164 Client's RMB Rkey 4165 The RDMA Rkey for the RMB that the client created or chose for 4166 this TCP connection. 4168 Client's RMB element index 4170 This indexes which element within the client's RMB will represent 4171 this TCP connection. 4173 Client's RMB element alert token 4175 A platform defined, architecturally opaque token that identifies 4176 this TCP connection.
Added by the server as immediate data on 4177 RDMA writes from the server to the client to inform the client 4178 that there is data for this connection to retrieve from the RMB 4179 element. 4181 Bsize: 4183 Client's RMB element buffer size in four-bit compressed 4184 notation: x=4 bits. Actual buffer size value is (2^(x+4)) * 1K. 4185 Smallest possible value is 16K. Largest size supported by this 4186 architecture is 512K. 4188 MTU 4190 An enumerated value indicating this peer's QP MTU size. The two 4191 peers exchange this value and the minimum of the peer's value 4192 will be used for the QP. The values are enumerated in A.2.3. This 4193 value should only be validated on the first contact exchange. 4195 Client's RMB virtual address 4197 The virtual address of the client's RMB as assigned by the 4198 client's RNIC. 4200 Client's initial packet sequence number 4202 The starting packet sequence number that this stack will use when 4203 sending to the peer, so that the peer can prepare its QP for the 4204 sequence number to expect. 4208 A.2.5.
SMC Decline CLC message format 4210 0 1 2 3 4211 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4212 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4213 | x'E2' | x'D4' | x'C3' | x'D9' | 4214 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4215 | Type = 4 | Length = 28 |Version| Rsrvd | 4216 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4217 | | 4218 +- Sender's Peer ID -+ 4219 | | 4220 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4221 | Peer Diagnosis Information | 4222 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4223 | | 4224 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4225 | x'E2' | x'D4' | x'C3' | x'D9' | 4226 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4227 Figure 30 SMC Decline CLC message format 4229 The fields present on the SMC Decline CLC message are: 4231 Eyecatchers 4233 Like all CLC messages, the SMC Decline has beginning and ending 4234 eyecatchers to aid with verification and parsing. The hex digits 4235 spell 'SMCR' in IBM-1047 (EBCDIC). 4237 Type 4239 CLC message type 4 indicates SMC Decline. 4241 Length 4243 The SMC Decline CLC message is 28 bytes long. 4245 Version 4247 Version of the SMC-R protocol. Version 1 is the only currently 4248 defined value. 4250 Sender's Peer ID 4252 As described in A.2.1. above 4254 Peer Diagnosis Information 4255 Four bytes of diagnosis information provided by the peer. These 4256 values are defined by the individual peers and it is necessary to 4257 consult the peer's system documentation to interpret the results. 4259 A.3. LLC messages 4261 LLC messages are sent over an existing SMC-R link using RoCE message 4262 passing and are always 44 bytes long so that they fit into the space 4263 available in a single WQE without requiring the receiver to post 4264 receive buffers. If not all 44 bytes are needed, the message is padded out 4265 with zeroes.
   LLC messages are in a request/response format. The message type
   is the same for request and response, and a flag indicates whether
   a message is flowing as a request or a response.

   The two high-order bits of an LLC message opcode indicate how it
   is to be handled by a peer that does not support the opcode.

   If the high-order bits of the opcode are b'00', then the peer must
   support the LLC message and indicate a protocol error if it does
   not.

   If the high-order bits of the opcode are b'10', then the peer must
   silently discard the LLC message if it does not support the
   opcode. This requirement allows for toleration of advanced but
   optional functions.

   High-order bits of b'11' indicate a Connection Data Control (CDC)
   message as described in A.4.

A.3.1. CONFIRM LINK LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 1    |  length = 44  |Version| Rsrvd |R|  Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Sender's RoCE                         |
   +-         MAC address          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   +-                                                             -+
   |                       Sender's RoCE GID                       |
   +-                                                             -+
   |                                                               |
   +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |Sender's QP number, bytes 1-2  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Sender QP byte3|  Link number  |Sender's link userid, bytes 1-2|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Sender's link userid bytes, 3-4|   Max links   |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                          Reserved                           -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
              Figure 31 CONFIRM LINK LLC message format

   The CONFIRM LINK LLC message is required to be exchanged between
   the server and client over a newly created SMC-R link to complete
   the setup of an SMC link. Its purpose is to confirm that the RoCE
   path is actually usable.

   On first contact, this flows after the server receives the SMC
   Confirm CLC message from the client over the IP connection. For
   additional links added to an SMC link group, it flows after the
   ADD LINK and ADD LINK CONTINUATION exchange. This flow provides
   confirmation that the queue pair is in fact usable. Each peer
   echoes its RoCE information back to the other.

   Type

      Type 1 indicates CONFIRM LINK.

   Length

      All LLC messages are 44 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   R

      Reply flag. When set, indicates this is a CONFIRM LINK REPLY.

   Sender's RoCE MAC address

      The MAC address of the sender's RNIC for the SMC link. It is
      required because some operating systems do not have neighbor
      discovery or ARP support for RoCE RNICs.

   Sender's RoCE GID

      This is the IPv6 address of the RNIC that the sender is using
      for this SMC-R link.

   Sender's QP number

      The number for the reliably connected queue pair that the
      sender created for this SMC-R link.

   Link number

      An identifier assigned by the server that uniquely identifies
      the link within the link group. This identifier is ONLY unique
      within a link group. Provided by the server and echoed back by
      the client.

   Link User ID

      An opaque, implementation-defined identifier assigned by the
      sender and provided to the receiver solely for purposes of
      display, diagnosis, network management, etc. The link user ID
      should be unique across the sender's entire stack, including
      all other link groups.

   Max Links

      The maximum number of links the sender can support in a link
      group.
      The maximum for this link group is the smaller of the values
      provided by the two peers.

A.3.2. ADD LINK LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 2    |  length = 44  |Version|RsnCode|R|Z| Reserved  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Sender's RoCE                         |
   +-         MAC address          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   +-                                                             -+
   |                       Sender's RoCE GID                       |
   +-                                                             -+
   |                                                               |
   +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |Sender's QP number, bytes 1-2  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Sender QP byte3|  Link number  | Rsrvd |  MTU  |  Initial PSN  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Initial PSN, continued     |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
   |                           Reserved                            |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                Figure 32 ADD LINK LLC message format

   The ADD LINK LLC message is sent over an existing link in the link
   group when a peer wishes to add an SMC-R link to an existing SMC-R
   link group. It is sent by the server to add a new SMC-R link to
   the group, or by the client to request that the server add a new
   link, for example when a new RNIC becomes active. When sent from
   the client to the server, it represents a request that the server
   initiate an ADD LINK exchange.

   This message is sent immediately after the initial SMC link in the
   group completes, as described in 3.4.1. First contact. It can also
   be sent over an existing SMC-R link group at any time as new RNICs
   are added and become available. Therefore, there can be as few as
   one new RMB RToken to communicate, or several.
   RTokens will be communicated using ADD LINK CONTINUATION messages.

   The contents of the ADD LINK LLC message are:

   Type

      Type 2 indicates ADD LINK.

   Length

      All LLC messages are 44 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   RsnCode

      If the Z (rejection) flag is set, this field provides the
      reason code. Values can be:

      X'1' - no alternate path available: set when the server
      provides the same MAC/GID as an existing SMC-R link in the
      group, and the client does not have any additional RNICs
      available (i.e., the server is attempting to set up an
      asymmetric link but none is available).

   R

      Reply flag. When set, indicates this is an ADD LINK REPLY.

   Z

      Rejection flag. When set on a reply, indicates that the
      server's ADD LINK was rejected by the client. When this flag is
      set, the reason code will also be set.

   Sender's RoCE MAC address

      The MAC address of the sender's RNIC for the new SMC-R link. It
      is required because some operating systems do not have neighbor
      discovery or ARP support for RoCE RNICs.

   Sender's RoCE GID

      The IPv6 address of the RNIC that the sender is using for the
      new SMC-R link.

   Sender's QP number

      The number for the reliably connected queue pair that the
      sender created for the new SMC-R link.

   Link number

      An identifier for the new SMC-R link. This is assigned by the
      server and uniquely identifies the link within the link group.
      This identifier is ONLY unique within a link group. Provided by
      the server and echoed back by the client.

   MTU

      An enumerated value indicating this peer's QP MTU size. The two
      peers exchange this value, and the minimum of the two peers'
      values will be used for the QP. The values are enumerated in
      A.2.3.

   Initial PSN

      The starting packet sequence number that this stack will use
      when sending to the peer, so that the peer can prepare its QP
      for the sequence number to expect.

A.3.3. ADD LINK CONTINUATION LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 3    |  length = 44  |Version| Rsrvd |R|  Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Linknum    |  NumRTokens   |           Reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                      Rkey/Rtoken Pair                       -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                 Rkey/Rtoken Pair or zeroes                  -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Reserved                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          Figure 33 ADD LINK CONTINUATION LLC message format

   When a new SMC-R link is added to an SMC-R link group, it is
   necessary to communicate the new link's RTokens for the RMBs that
   the SMC-R link group can access. This message follows the ADD LINK
   and provides the RTokens.

   The server kicks off this exchange by sending the first ADD LINK
   CONTINUATION LLC message, and the server controls the exchange as
   described below.

   o  If the client and the server require the same number of ADD
      LINK CONTINUATION messages to communicate their RTokens, the
      server starts the exchange by sending the client the first ADD
      LINK CONTINUATION request with its RTokens, then the client
      responds with an ADD LINK CONTINUATION response with its
      RTokens, and so on until the exchange is completed.

   o  If the server requires more ADD LINK CONTINUATION messages than
      the client, then after the client has communicated all its
      RTokens, the server continues to send ADD LINK CONTINUATION
      request messages to the client. The client continues to
      respond, using empty (number of RTokens to be communicated = 0)
      ADD LINK CONTINUATION response messages.

   o  If the client requires more ADD LINK CONTINUATION messages than
      the server, then after communicating all its RTokens the server
      will continue to send empty ADD LINK CONTINUATION messages to
      the client to solicit replies with the client's RTokens, until
      all have been communicated.

   The contents of this message are:

   Type

      Type 3 indicates ADD LINK CONTINUATION.

   Length

      All LLC messages are 44 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   R

      Reply flag. When set, indicates this is an ADD LINK
      CONTINUATION REPLY.

   LinkNum

      The link number of the new link within the SMC link group that
      Rkeys are being communicated for.

   NumRTokens

      Number of RTokens remaining to be communicated (including the
      ones in this message). If the value is less than or equal to 2,
      this is the last message. If it is greater than 2, another
      continuation message will be required, and its value will be
      the value in this message minus 2, and so on until all Rkeys
      are communicated. The maximum value for this field is 255.

   Up to 2 Rkey/RToken pairs

      These consist of an Rkey for an RMB that is known on the SMC-R
      link that this message was sent over (the reference Rkey),
      paired with the same RMB's RToken over the new SMC link. A full
      RToken is not required for the reference because it is only
      being used to distinguish which RMB it applies to, not address
      it.
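
   The NumRTokens countdown and the server-driven exchange described
   above can be sketched as follows. This is an illustrative Python
   sketch only, not part of the protocol definition: the function
   names num_rtokens_fields() and exchange() are hypothetical, and the
   sketch assumes that each message carries at most two Rkey/RToken
   pairs and that an empty message carries NumRTokens = 0.

```python
PAIRS_PER_MESSAGE = 2   # each ADD LINK CONTINUATION holds up to 2 pairs
MAX_NUM_RTOKENS = 255   # maximum value of the NumRTokens field


def num_rtokens_fields(total):
    """Return the NumRTokens value carried by each successive message.

    The field counts the RTokens still to be communicated, including
    the ones in the current message, so it decreases by 2 per message.
    """
    if not 0 <= total <= MAX_NUM_RTOKENS:
        raise ValueError("NumRTokens must be in the range 0..255")
    fields = []
    remaining = total
    while True:
        fields.append(remaining)
        if remaining <= PAIRS_PER_MESSAGE:
            break  # a value of 2 or less marks the last message
        remaining -= PAIRS_PER_MESSAGE
    return fields


def exchange(server_total, client_total):
    """Pair each server request with the client response it solicits.

    The side that runs out of RTokens first keeps the exchange going
    with empty messages (NumRTokens = 0) until the other side is done.
    """
    requests = num_rtokens_fields(server_total)
    responses = num_rtokens_fields(client_total)
    rounds = max(len(requests), len(responses))
    requests += [0] * (rounds - len(requests))
    responses += [0] * (rounds - len(responses))
    return list(zip(requests, responses))
```

   For example, a server with five RTokens and a client with one would
   need three request/response rounds under these assumptions, with
   the client answering the last two requests with empty responses.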

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Reference Rkey                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           New Rkey                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                     New Virtual Address                     -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                  Figure 34 Rkey/RToken pair format

   The contents of the Rkey/RToken pair are:

   Reference Rkey

      The Rkey of the RMB as it is already known on the SMC-R link
      over which this message is being sent. Required so that the
      peer knows which RMB to associate the new RToken with.

   New Rkey

      The Rkey of this RMB as it is known over the new SMC-R link.

   New Virtual Address

      The virtual address of this RMB as it is known over the new
      SMC-R link.

A.3.4. DELETE LINK LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 4    |  length = 44  |Version| Rsrvd |R|A|O|  Rsrvd  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Linknum    |            Reason code (bytes 1-3)            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |RsnCode byte 4 |                                               |
   +-+-+-+-+-+-+-+-+                                              -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                          Reserved                           -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               Figure 35 DELETE LINK LLC message format

   When the client or server detects that a QP or SMC-R link goes
   down or needs to come down, it sends this message over one of the
   other links in the link group.
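
   The layout in Figure 35 can be sketched in Python as follows. This
   is a minimal sketch, not a normative encoding: the helper name
   pack_delete_link() and the flag constants are hypothetical, and the
   sketch assumes network byte order with the R, A, and O flags in the
   three high-order bits of the fourth byte, as the figure suggests.

```python
import struct

LLC_LENGTH = 44          # all LLC messages are 44 bytes long
LLC_TYPE_DELETE_LINK = 4

# Assumed bit positions in byte 3, read from Figure 35 (MSB first).
FLAG_R = 0x80  # reply
FLAG_A = 0x40  # all links in the link group
FLAG_O = 0x20  # orderly termination


def pack_delete_link(link_num, reason, reply=False, all_links=False,
                     orderly=False, version=1):
    """Build a 44-byte DELETE LINK LLC message as laid out in Figure 35."""
    flags = ((FLAG_R if reply else 0)
             | (FLAG_A if all_links else 0)
             | (FLAG_O if orderly else 0))
    header = struct.pack(
        "!BBBBBI",
        LLC_TYPE_DELETE_LINK,
        LLC_LENGTH,
        (version & 0x0F) << 4,   # version in the high nibble, low nibble reserved
        flags,
        link_num,
        reason,                  # 4-byte reason code follows Linknum directly
    )
    # Unused trailing bytes are reserved and padded out with zeroes.
    return header.ljust(LLC_LENGTH, b"\x00")


# Example: orderly, operator-initiated termination of link 2.
message = pack_delete_link(2, 0x00020000, orderly=True)
```

   The reason code occupies bytes 5 through 8 and is therefore not
   aligned on a word boundary, which is why the sketch packs the
   header without alignment padding.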

   When the DELETE LINK is sent from the client, it only serves as a
   notification, and the client expects the server to send a DELETE
   LINK request in response. To avoid races, only the server will
   initiate the actual DELETE LINK request and response sequence that
   results from notification from the client.

   The server can also initiate the DELETE LINK without notification
   from the client if it detects an error or if orderly link
   termination was initiated.

   The client may also request termination of the entire link group,
   and the server may terminate the entire link group, using this
   message.

   The contents of this message are:

   Type

      Type 4 indicates DELETE LINK.

   Length

      All LLC messages are 44 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   R

      Reply flag. When set, indicates this is a DELETE LINK REPLY.

   A

      All flag. When set, indicates that all links in the link group
      are to be terminated. This terminates the link group.

   O

      Orderly flag. Indicates orderly termination. Orderly
      termination is generally caused by an operator command rather
      than an error on the link. When the client requests orderly
      termination, the server may wait to complete other work before
      terminating.

   LinkNum

      The link number of the link to be terminated.

   RsnCode

      The termination reason code. Currently defined reason codes
      are:

      Request Reason Codes:

      o  X'00010000' = lost path

      o  X'00020000' = operator initiated termination

      o  X'00030000' = stack (program) initiated termination (link
         inactivity)

      o  X'00040000' = LLC protocol violation

      o  Others TBD

      Response Reason Codes:

      o  X'00100000' = Unknown Link ID (no link)

      o  X'00200000' = Unknown Link Group (no links)

      o  Others TBD

A.3.5. CONFIRM RKEY LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 6    |  length = 44  |Version| Rsrvd |R|0|Z|C| Rsrvd |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   NumLinks    |    New RMB Rkey for this link (bytes 1-3)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |ThisLink byte 4|                                               |
   +-+-+-+-+-+-+-+-+                                              -+
   |             New RMB virtual address for this link             |
   +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |                                               |
   +-+-+-+-+-+-+-+-+                                              -+
   |                                                               |
   +-            Other link RMB specification or zeros            -+
   |                                                               |
   +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
   |                                                               |
   +-                                                             -+
   |            Other link RMB specification or zeroes             |
   +-                                              +-+-+-+-+-+-+-+-+
   |                                               |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
              Figure 36 CONFIRM RKEY LLC message format

   The CONFIRM_RKEY flow can be sent at any time from either the
   client or the server, to inform the peer that an RMB has been
   created or deleted. The creator of a new RMB must inform its peer
   of the new RMB's RToken for all SMC-R links in the SMC-R link
   group. The deleter of an RMB must inform its peer of the deleted
   RMB's RToken for all SMC-R links.

   For RMB creation, the creator sends this message over the SMC link
   used by the first TCP connection that uses the new RMB. This
   message contains the new RMB RToken for the SMC link that the
   message is sent over, then it lists the sender's SMC links in the
   link group paired with the new RToken for the new RMB for that
   link. This message can communicate the new RTokens for 3 QPs: the
   QP for the link this message is sent over, and 2 others.
   If there are more than 3 links in the SMC-R link group,
   CONFIRM_RKEY_CONTINUATION will be required.

   For RMB deletion, the creator sends the same format of message
   with a delete flag set, to inform the peer that the RMB's RTokens
   on all links in the group are deleted.

   In both cases, the peer responds by simply echoing the message
   with the response flag set. If the response is a negative
   response, the sender must recalculate the RToken set and start a
   new CONFIRM_RKEY exchange from the beginning. The timing of this
   retry is controlled by the C flag as described below.

   The contents of this message are:

   Type

      Type 6 indicates CONFIRM RKEY.

   Length

      All LLC messages are 44 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   R

      Reply flag. When set, indicates this is a CONFIRM RKEY REPLY.

   0

      Reserved bit.

   Z

      Negative response flag.

   C

      Configuration Retry bit. If this is a negative response and
      this flag is set, the originator should recalculate the Rkey
      set and retry this exchange as soon as the current
      configuration change is completed. If this flag is not set on a
      negative response, the originator must wait for the next
      natural stimulus (for example, a new TCP connection started
      that requires a new RMB) before retrying.

   NumLinks

      The number of link/RToken pairs, including those provided in
      this message, to be communicated. If this value is three or
      fewer, this is the only message in the exchange. If this value
      is greater than three, a CONFIRM RKEY CONTINUATION message will
      be required.

      Note: in this version of the architecture, 8 is the maximum
      number of links supported in a link group.

   New RMB Rkey for this link

      The new RMB's Rkey as assigned on the link this message is
      being sent over.

   New RMB virtual address for this link

      The new RMB's virtual address as assigned on the link this
      message is being sent over.

   Other link RMB specification

      The new RMB's specification on the other links in the link
      group, as shown in Figure 37.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Link number  | RMB's Rkey for the specified link (bytes 1-3) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |New Rkey byte 4|                                               |
   +-+-+-+-+-+-+-+-+                                              -+
   |         RMB's virtual address for the specified link          |
   +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |
   +-+-+-+-+-+-+-+-+
             Figure 37 Format of link number/Rkey pairs

   Link number

      The link number for a link in the link group.

   RMB's Rkey for the specified link

      The Rkey used to reach the RMB over the link whose number was
      specified in the link number field.

   RMB's virtual address for the specified link

      The virtual address used to reach the RMB over the link whose
      number was specified in the link number field.

A.3.6. CONFIRM RKEY CONTINUATION LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 8    |  length = 44  |Version| Rsrvd |R|0|Z|C| Rsrvd |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | NumLinksLeft  |                                               |
   +-+-+-+-+-+-+-+-+                                              -+
   |                                                               |
   +-                Other link RMB specification                 -+
   |                                                               |
   +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |                                               |
   +-+-+-+-+-+-+-+-+                                              -+
   |                                                               |
   +-            Other link RMB specification or zeros            -+
   |                                                               |
   +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
   |                                                               |
   +-                                                             -+
   |            Other link RMB specification or zeroes             |
   +-                                              +-+-+-+-+-+-+-+-+
   |                                               |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The CONFIRM RKEY CONTINUATION LLC message is used to communicate
   any additional RMB RTokens that did not fit into the CONFIRM RKEY
   message. Each of these messages can hold up to 3 RMB RTokens. The
   NumLinksLeft field indicates how many RMB RTokens are to be
   communicated, including the ones in this message. If the value is
   3 or less, this is the last message of the group. If the value is
   4 or higher, additional CONFIRM RKEY CONTINUATION messages will
   follow, and the NumLinksLeft value will be a countdown until all
   are communicated.

   Like the CONFIRM RKEY message, the peer responds by echoing the
   message back with the reply flag set.

   The contents of this message are:

   Type

      Type 8 indicates CONFIRM RKEY CONTINUATION.

   Length

      All LLC messages are 44 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   R

      Reply flag. When set, indicates this is a CONFIRM RKEY
      CONTINUATION REPLY.

   0

      Reserved bit.

   Z

      Negative response flag.

   C

      Configuration Retry bit.
      If this is a negative response and this flag is set, the
      originator should recalculate the Rkey set and retry this
      exchange as soon as the current configuration change is
      completed. If this flag is not set on a negative response, the
      originator must wait for the next natural stimulus (for
      example, a new TCP connection started that requires a new RMB)
      before retrying.

   NumLinksLeft

      The number of link/RToken pairs, including those provided in
      this message, that are remaining to be communicated. If this
      value is three or fewer, this is the last message in the
      exchange. If this value is greater than three, another CONFIRM
      RKEY CONTINUATION message will be required. Note that in this
      version of the architecture, 8 is the maximum number of links
      supported in a link group.

   Other link RMB specifications

      The new RMB's specification on other links in the link group,
      as shown in Figure 37.

A.3.7. DELETE RKEY LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 9    |  length = 44  |Version| Rsrvd |R|0|Z|  Rsrvd  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Count     |  Error Mask   |           Reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      First deleted Rkey                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Second deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Third deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Fourth deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Fifth deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Sixth deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Seventh deleted Rkey or zeros                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Eighth deleted Rkey or zeros                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Reserved                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The DELETE_RKEY flow can be sent at any time from either the
   client or the server, to inform the peer that one or more RMBs
   have been deleted. Because the peer already knows every RMB's Rkey
   on each link in the link group, this message only specifies one
   Rkey for each RMB being deleted. The Rkey provided for each
   deleted RMB will be its Rkey as known on the SMC-R link that this
   message is sent over.

   It is not necessary to provide the entire RToken. The Rkey alone
   is sufficient for identifying an existing RMB.

   The peer responds by simply echoing the message with the response
   flag set. If the peer did not recognize an Rkey, the negative
   response flag will be set; however, no aggressive recovery action
   beyond logging the error will be taken.

   The contents of this message are:

   Type

      Type 9 indicates DELETE RKEY.

   Length

      All LLC messages are 44 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   R

      Reply flag. When set, indicates this is a DELETE RKEY REPLY.

   0

      Reserved bit.

   Z

      Negative response flag.

   Count

      Number of RMBs being deleted by this message. The maximum value
      is 8.

   Error Mask

      If this is a negative response, indicates which RMBs were not
      successfully deleted. Each bit corresponds to a listed RMB. For
      example, b'01010000' indicates that the second and fourth Rkeys
      were not successfully deleted.

   Deleted Rkeys

      A list of Count Rkeys.
      Provided on the request flow and echoed back on the response
      flow. Each Rkey is valid on the link this message is sent over,
      and represents a deleted RMB. Up to eight RMBs can be deleted
      in this message.

A.3.8. TEST LINK LLC message format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   type = 7    |  length = 44  |Version| Rsrvd |R|  Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                          User Data                          -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                           Reserved                            |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-                                                             -+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                Figure 38 TEST LINK LLC message format

   The TEST_LINK request can be sent from either peer to the other on
   an existing SMC-R link at any time to test that the SMC-R link is
   active and healthy at the stack level. A stack that receives a
   TEST_LINK LLC message immediately sends back a TEST_LINK reply,
   echoing back the user data. Also refer to 4.5.3. TCP Keepalive
   processing.

   The contents of this message are:

   Type

      Type 7 indicates TEST LINK.

   Length

      All LLC messages are 44 bytes long.

   Version

      Version of the SMC-R protocol. Version 1 is the only currently
      defined value.

   R

      Reply flag. When set, indicates this is a TEST LINK REPLY.

   User Data

      The receiver of this message echoes the sender's data back in a
      TEST_LINK response LLC message.

A.4. Connection Data Control (CDC) message format

   The RMBE control data is communicated using Connection Data
   Control (CDC) messages, which use RDMA message passing using
   inline data, similar to LLC messages.
   Also similar to LLC messages, this data block is 44 bytes long to
   ensure that it can fit into private data areas of receive WQEs
   without requiring the receiver to post receive buffers.

   Unlike LLC messages, this data is integral to the data path, so
   its processing must be prioritized and optimized similarly to
   other data path processing. While LLC messages may be processed on
   a slower path than data, these messages cannot be.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    0 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Type = x'FE'  |  Length = 44  |        Sequence number        |
    4 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                       SMC-R alert token                       |
    8 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           Reserved            |  Producer cursor wrap seqno   |
   12 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        Producer Cursor                        |
   16 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           Reserved            |  Consumer cursor wrap seqno   |
   20 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        Consumer Cursor                        |
   24 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |B|P|U|R| Rsrvd |D|C|A|                Reserved                 |
   28 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
   32 +-                                                             -+
      |                                                               |
   36 +-                          Reserved                           -+
      |                                                               |
   40 +-                                                             -+
      |                                                               |
   44 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

         Figure 39 Connection Data Control (CDC) Message Format

   Type = x'FE'

      This type number has the two high-order bits turned on to
      enable processing to quickly distinguish it from an LLC
      message.

   Length = 44

      The length of inline data that does not require posting of a
      receive buffer.

   Sequence number

      A 2-byte unsigned integer that represents a wrapping sequence
      number.
      Incremented with every control message sent, and used to guard
      against processing an old control message out of sequence. If
      this number is less than the last received value, discard this
      message. If greater, process this message. Old control messages
      can be lost with no ill effect, but cannot be processed after
      newer ones.

   SMC-R alert token

      The endpoint-assigned alert token that identifies which TCP
      connection on the link group this control message refers to.

   Producer cursor wrap seqno

      A 2-byte unsigned integer that represents a wrapping counter
      incremented by the producer whenever the data written into this
      RMBE receive buffer causes a wrap (i.e., the producer cursor
      wraps). This is used by the receiver to determine when new data
      is available even though the cursors appear unchanged, such as
      when a full-window-size write is completed (Producer Cursor of
      this RMBE sent by peer = Local Consumer Cursor), or in
      scenarios where the Producer Cursor sent for this RMBE < Local
      Consumer Cursor.

   Producer cursor

      Unsigned 4-byte integer that is a wrapping offset into the RMBE
      data area. Points to the next byte of data to be written by the
      sender. Can advance up to the receiver's Consumer Cursor as
      known by the sender. When the urgent data present indicator is
      on, it points one byte beyond the last byte of urgent data.

   Consumer cursor wrap seqno

      A 2-byte unsigned integer that mirrors the value of the
      Producer cursor wrap sequence number when the last read from
      this RMBE occurred. Used as an indicator of how far along the
      consumer is in reading data (i.e., whether it has processed the
      last wrap point or not). The producer side can use this
      indicator to detect whether more data can be written to the
      partner in full-window write scenarios (where the Producer
      Cursor = Consumer Cursor as known on the remote RMBE).
In this scenario, if the consumer sequence number 5207 equals the local producer sequence number, the producer knows that 5208 more data can be written. 5210 Consumer Cursor 5212 Unsigned 4 byte integer that is a wrapping offset into the 5213 sender's RMBE data area. Points to the offset of the next byte 5214 of data to be consumed by the peer in its own RMBE. The sender 5215 cannot write beyond this cursor into the peer's RMBE without 5216 causing data loss. 5218 B-bit 5220 Writer blocked indicator: Sender is blocked for writing and requires 5221 explicit notification when receive buffer space is available. 5223 P-bit 5225 Urgent data pending: Sender has urgent data pending for this 5226 connection. 5228 U-bit 5230 Urgent data present: Indicates that urgent data is present in the 5231 RMBE data area, and the producer cursor points to one byte beyond 5232 the last byte of urgent data. 5234 R-bit 5236 Request for consumer cursor update: Indicates that a consumer 5237 cursor update is requested, bypassing any window size optimization 5238 algorithms. 5240 D-bit 5242 Sending done indicator: Sent by a peer when it is done writing 5243 new data into the receiver's RMBE data area. 5245 C-bit 5247 Peer Closed Connection indicator: Sent by a peer when it is 5248 completely done with this connection and will no longer be making 5249 any updates to the receiver's RMBE, and will also not be sending 5250 any more control messages. 5252 A-bit 5254 Abnormal Close indicator: Sent by a peer when the connection is 5255 abnormally terminated (for example, the TCP connection was 5256 reset). When sent, it indicates that the peer is completely done 5257 with this connection and will no longer be making any updates to 5258 this RMBE or sending any more control messages.
It also indicates 5259 that the RMBE owner must flush any remaining data on this 5260 connection and surface an error return code to any outstanding 5261 socket APIs on this connection (same processing as receiving an 5262 RST segment on a TCP connection). 5264 Appendix B. Socket API considerations 5266 A key design goal for SMC-R is to require no application changes for 5267 exploitation. It is confined to socket applications using stream 5268 (i.e. TCP protocol) sockets over IPv4 or IPv6. By virtue of the fact 5269 that the switch to the SMC-R protocol occurs after a TCP connection 5270 is established, no changes are required in the socket address family or in 5271 the IP addresses and ports that the socket application is using. 5272 Existing socket APIs that allow the application to retrieve local and 5273 remote socket address structures for an established TCP connection 5274 (for example, getsockname() and getpeername()) will continue to 5275 function as they have before. Existing DNS setup and APIs for 5276 resolving hostnames to IP addresses and vice versa also continue to 5277 function without any changes. In general, all of the usual socket APIs 5278 that are used for TCP communications (send APIs, recv APIs, etc.) will 5279 continue to function as they do today even if SMC-R is used as the 5280 underlying protocol. 5282 Each SMC-R enabled implementation does, however, need to pay special 5283 attention to any socket APIs that have a reliance on the underlying 5284 TCP and IP protocols and ensure that their behavior in an SMC-R 5285 environment is reasonable and minimizes impact to the application. 5286 While the basic socket API set is fairly similar across different 5287 operating systems, when it comes to advanced socket API options there 5288 is more variability. Each implementation needs to perform a detailed 5289 analysis of its API options and SMC-R impact and implications.
As 5290 part of that step a discussion or review with other implementations 5291 supporting SMC-R would be useful to ensure a consistent 5292 implementation. 5294 setsockopt()/ getsockopt() considerations 5296 These APIs allow socket applications to manipulate socket, transport 5297 (TCP/UDP) and IP level options associated with a given socket. 5298 Typically, a platform restricts the number of IP options available to 5299 stream (TCP) socket applications given their connection oriented 5300 nature. The general guideline here is to continue processing these 5301 APIs in a manner that allows for application compatibility. Some 5302 options will be relevant to the SMC-R protocol and will require 5303 special processing under the covers. For example, the ability to 5304 manipulate TCP send and receive buffer sizes is still valid for SMC- 5305 R. However, other options may have no meaning for SMC-R. For 5306 example, if an application enabled the TCP_NODELAY option to disable 5307 Nagle's algorithm it should have no real effect in SMC-R 5308 communications as there is no notion of Nagle's algorithm with this 5309 new protocol. But the implementation must accept the TCP_NODELAY 5310 option as it does today and save it so that it can be later extracted 5311 via getsockopt() processing. Note that any TCP or IP level options 5312 will still have an effect on any TCP/IP packets flowing for an SMC-R 5313 connection (i.e. as part of TCP/IP connection establishment and 5314 TCP/IP connection termination packet flows). 5316 Under the covers manipulation of the TCP options will also include 5317 the SMC layer setting and reading the SMC-R experimental option 5318 before and after completion of the 3 way TCP handshake. 5320 Appendix C. Rendezvous Error scenarios 5322 Error scenarios in setting up and managing SMC-R links are discussed 5323 in this section. 5325 C.1. 
SMC Decline during CLC negotiation 5327 A peer to the SMC-R CLC negotiation can send SMC Decline in lieu of 5328 any expected CLC message to decline SMC and force the TCP connection 5329 back to the IP fabric. There can be several reasons for an SMC Decline 5330 during the CLC negotiation including: RNIC went down, SMC-R forbidden 5331 by local policy, subnet (IPv4) or prefix (IPv6) doesn't match, lack 5332 of resources to perform SMC-R. In all cases when an SMC Decline is 5333 sent in lieu of an expected CLC message, no confirmation is required 5334 and the TCP connection immediately falls back to using the IP fabric. 5336 To prevent ambiguity between CLC messages and application data, an 5337 SMC Decline cannot "chase" another CLC message. SMC Decline can only 5338 be sent in lieu of an expected CLC message. For example, if the 5339 client sends SMC Proposal and then its RNIC goes down, it must wait for 5340 the SMC Accept from the server and then it can reply to that with an 5341 SMC Decline. 5343 This "no chase" rule means that if this TCP connection is not a first 5344 contact between RoCE peers, a server cannot send SMC Decline after 5345 sending SMC Accept - it can only break the TCP connection. 5346 Similarly, once the client sends SMC Confirm on a TCP connection that 5347 isn't first contact, it is committed to SMC-R for this TCP connection 5348 and cannot fall back to IP. 5350 C.2. SMC Decline during LLC negotiation 5352 For a TCP connection that represents first contact between RoCE 5353 peers, it is possible for SMC to fall back to IP during the LLC 5354 negotiation. This is possible until the first contact SMC link is 5355 confirmed. For example, see Figure 40. After a first contact SMC 5356 link is confirmed, fallback to IP is no longer possible. The rule 5357 that this translates to is: a first contact peer can send SMC Decline 5358 at any time during LLC negotiation until it has successfully sent its 5359 CONFIRM LINK (request or response) flow.
After that point, it cannot 5360 fall back to IP.
5362    Host X -- Server                            Host Y -- Client
5363  +-------------------+                      +-------------------+
5364  | PeerID = PS1      |                      |      PeerID = PC1 |
5365  |            +------+                      +------+            |
5366  |     QP 8   |RNIC 1|     SMC-R link 1     |RNIC 2|   QP 64    |
5367  | RKey X |   |MAC MA|<-------------------->|MAC MB|   |        |
5368  |        |   |GID GA|   attempted setup    |GID GB|   | RKey Y2|
5369  |        \/  +------+                      +------+   \/       |
5370  |+--------+         |                      |        +--------+ |
5371  ||  RMB   |         |                      |        |  RMB   | |
5372  |+--------+         |                      |        +--------+ |
5373  |        /\  +------+                      +------+   /\       |
5374  |        |   |RNIC 3|                      |RNIC 4|   | Rkey W2|
5375  |        |   |MAC MC|                      |MAC MD|   |        |
5376  |     QP 9   |GID GC|                      |GID GD|   QP65     |
5377  |            +------+                      +------+            |
5378  +-------------------+                      +-------------------+
5380      SYN / SYN-ACK / ACK TCP 3-way handshake with TCP option
5381     <--------------------------------------------------------->
5383         SMC Proposal / SMC Accept / SMC Confirm exchange
5384     <-------------------------------------------------------->
5386                   CONFIRM LINK(request, link 1)
5387     .........................................................>
5389                   CONFIRM LINK(response, link 1)
5390                           X...................................
5391                           :
5392                           : RoCE write failure
5393                           :.................................>
5395                   SMC Decline(PC1, reason code)
5396     <--------------------------------------------------------
5398               Connection data flows over IP fabric
5399     <------------------------------------------------------->
5401  Legend:
5402  ------------ TCP/IP and CLC flows
5403  ............ RoCE (LLC) flows
5405 Figure 40 SMC Decline during LLC negotiation
5407 C.3. The SMC Decline window 5409 Because SMC-R does not support fall-back to IP for a TCP connection 5410 that is already using RDMA, there are specific rules on when SMC 5411 Decline, which signals a fall-back to IP because of an error or 5412 problem with the RoCE fabric, can be sent during TCP connection 5413 setup.
There is a point of no return after which a connection cannot 5414 fall back to IP, and RoCE errors that occur after this point require 5415 the connection to be broken with a RST flow in the IP fabric. 5417 For first contact, that point of no return is after the Add Link LLC 5418 message has been successfully sent for the second SMC-R link. 5419 Specifically, the server cannot fall back to IP after receiving 5420 either a positive write completion indication for the Add Link 5421 request, or after receiving the Add Link response from the client, 5422 whichever comes first. The client cannot fall back to IP after 5423 either sending a negative Add Link response, receiving a positive 5424 write complete on a positive Add Link response, or receiving a 5425 Confirm Link for the second SMC-R link from the server, whichever 5426 comes first. 5428 For subsequent contact, that point of no return is after the last 5429 send of the CLC negotiation completes. This, in combination with the 5430 rule that error "chasers" are not allowed during CLC negotiation, 5431 means that the server cannot send SMC Decline after sending an SMC 5432 Accept, and the client cannot send an SMC Decline after sending an 5433 SMC Confirm. 5435 C.4. Out of synch conditions during SMC-R negotiation 5437 The SMC Accept CLC message contains a "first contact" flag that 5438 indicates to the client whether or not the server believes it is 5439 setting up a new link group, or using an existing link group. This 5440 flag is used to detect an out of synch condition between the client 5441 and the server. The scenario detected is as follows: There is a 5442 single existing SMC-R link between the peers. After the client sends 5443 the SMC Proposal CLC message, the existing SMC-R link between the 5444 client and the server fails. 
The client cannot chase the SMC 5445 Proposal CLC message with an SMC Decline CLC message in this case 5446 because the client does not yet know that the server would have 5447 wanted to choose the SMC-R link that just crashed. The QP that 5448 failed recovers before the server returns its SMC Accept CLC message. 5449 This means that there is a QP but no SMC link. Since the server had 5450 not yet learned of the SMC link failure when it sent the SMC Accept 5451 CLC message, it attempts to re-use the SMC link that just failed. 5452 This means the server would not set the "first contact" flag, 5453 indicating to the client that the server thinks it is reusing an SMC-R 5454 link. However, the client does not have an SMC-R link that matches 5455 the server's specification. Because the "first contact" flag is off, 5456 the client realizes it is out of synch with the server and sends SMC 5457 Decline to cause the connection to fall back to IP. 5459 C.5. Timeouts during CLC negotiation 5461 Because the SMC-R negotiation flows as TCP data, there are built-in 5462 timeouts and retransmits at the TCP layer for individual messages. 5463 Implementations must also protect the overall TCP/CLC handshake 5464 with a timer or timers to prevent connections from hanging 5465 indefinitely due to SMC-R processing. This can be done with 5466 individual timers for individual CLC messages or an overall timer for 5467 the entire exchange, which may include the TCP handshake and the CLC 5468 handshake under one timer or separate timers. This decision is 5469 implementation dependent. 5471 If the TCP and/or CLC handshakes time out, the TCP connection must be 5472 terminated as it would be in a legacy IP environment when connection 5473 setup doesn't complete in a timely manner. Because the CLC flows are 5474 TCP messages, if they cannot be sent and received in a timely 5475 fashion, the TCP connection is not healthy and would not work if 5476 fallback to IP were attempted. 5478 C.6.
Protocol errors during CLC negotiation 5480 Protocol errors occur during CLC negotiation when a message is 5481 received that is not expected. For example, a peer that is expecting 5482 a CLC message but instead receives application data has experienced a 5483 protocol error, which also indicates a likely software error because the two 5484 sides are out of synch. When application data is expected, this data 5485 is not parsed to check whether it is a CLC message. 5487 When a peer is expecting a CLC negotiation message, any parsing error 5488 except a bad enumerated value in that message must be treated as 5489 application data. The CLC negotiation messages are designed with 5490 beginning and ending eyecatchers to help verify that they are 5491 actually the expected message. If other parsing errors in an 5492 expected CLC message occur, such as incorrect length fields or 5493 incorrectly formatted fields, the message must be treated as 5494 application data. 5496 All protocol errors with the exception of bad enumerated values must 5497 result in termination of the TCP connection. No fallback to IP is 5498 allowed in the case of a protocol error because if the protocols are 5499 out of synch, mismatched, or corrupted, then data and security 5500 integrity cannot be ensured. 5502 The exception to this rule is enumerated values, for example the QP 5503 MTU values on SMC Accept and SMC Confirm. If a reserved value is 5504 received, the proper error response is to send SMC Decline and fall 5505 back to IP. The reason for this is that use of a reserved enumerated 5506 value indicates that the other partner likely has additional support 5507 that the receiving partner does not have. This indicated mismatch of 5508 SMC-R capabilities is not an integrity problem, but indicates that 5509 SMC-R cannot be used for this connection. 5511 C.7.
Timeouts during LLC negotiation 5513 Whenever a peer sends an LLC message to which a reply is expected, it 5514 sets a timer after the send posts to wait for the reply. An expected 5515 response may be a reply flavor of the LLC message (for example 5516 CONFIRM LINK REPLY) or a new LLC message (for example an ADD LINK 5517 CONTINUATION expected from the server by the client if there are more 5518 Rkeys to communicate). 5520 On LLC flows that are part of a first contact setup of a link group, 5521 the value of the timer is implementation dependent but should be long 5522 enough to allow the other peer to have a write complete timeout and 2-3 5523 retransmits of an SMC Decline on the TCP fabric. For LLC flows 5524 that are maintaining the link group and not part of first contact 5525 setup of a link group, the timers may be shorter. Upon receipt of an 5526 expected reply the timer is cancelled. If a timer pops without a 5527 reply having been received, the sender must initiate a recovery 5528 action. 5530 During first contact processing, failure of an LLC verification timer 5531 is a should-not-occur condition that indicates a problem with one of the 5532 endpoints. The reason for this is that if there is a "routine" 5533 failure in the RoCE fabric that causes an LLC verification send to 5534 fail, the sender will get a write completion failure and will then 5535 send SMC Decline to the partner. The only time an LLC verification 5536 timer will expire on a first contact is when the sender thinks the 5537 send succeeded but it actually didn't. Because of the reliable 5538 connected nature of QP connections on the RoCE fabric, this 5539 indicates a problem with one of the peers, not with the RoCE fabric. 5541 After the reliable connected QP for the first SMC-R link in a link 5542 group is set up on initial contact, the client sets a timer to wait 5543 for a RoCE verification message from the server that the QP is 5544 actually connected and usable.
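The general timer rule stated at the start of this section (set a timer once the send posts, cancel it when the expected reply arrives, and initiate recovery if it pops) can be sketched as follows. This is an illustrative Python sketch only and is not part of the protocol definition: the class LlcReplyTimer, its method names, and the injectable clock are hypothetical names chosen for the example, and the timeout value is left to the implementation as the text requires.

```python
import time


class LlcReplyTimer:
    """Hypothetical helper: tracks one LLC message that awaits a reply."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # implementation dependent
        self.clock = clock                      # injectable for testing
        self.deadline = None                    # None: nothing outstanding

    def on_send_posted(self):
        # The timer is set after the send posts.
        self.deadline = self.clock() + self.timeout_seconds

    def on_reply_received(self):
        # Receipt of the expected reply cancels the timer.
        self.deadline = None

    def expired(self):
        # True when the timer popped without a reply; the sender must
        # then initiate the recovery action for the message it sent.
        return self.deadline is not None and self.clock() >= self.deadline
```

A sender would keep one such timer per outstanding LLC message and, when expired() returns true, apply the recovery action listed for that message in Section C.7.1.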
If the server experiences a failure 5545 sending its QP confirmation message, it will send SMC Decline, which 5546 should arrive at the client before the client's verification timer 5547 expires. If the client's timer expires without receiving either an 5548 SMC Decline or a RoCE message confirmation from the server, there is 5549 a problem either with the server or with the TCP fabric. In either 5550 case the client must break the TCP connection and clean up the SMC-R 5551 link. 5553 There are two scenarios in which the client's response to the QP 5554 verification message fails to reach the server. The main difference 5555 is whether or not the client has successfully completed the send of 5556 the CONFIRM LINK response. 5558 In the normal case of a problem with the RoCE path, the client will 5559 learn of the failure by getting a write completion failure, before 5560 the server's timer expires. In this case, the client sends an SMC 5561 Decline CLC message to the server and the TCP connection falls back 5562 to IP. 5564 If the client's send of the Confirmation message receives a positive 5565 return code but for some reason still does not reach the server, or 5566 the client's SMC Decline CLC message fails to reach the server after 5567 the client fails to send its RoCE confirmation message, then the 5568 server's timer will time out and the server must break the TCP 5569 connection by sending RST. This is expected to be a very rare case, 5570 because if the client cannot send its CONFIRM LINK RSP LLC message, 5571 the client should get a negative return code and initiate fallback to 5572 IP. A client receiving a positive return code on a send that fails 5573 to reach the server should be extremely rare. 5575 C.7.1. Recovery actions for LLC timeouts and failures 5577 The following table describes recovery actions for LLC timeouts. 
A 5578 write completion failure or other indication that the 5579 send of the LLC command failed is treated the same as a timeout. 5581 LLC Message: CONFIRM LINK from server (first contact, first link in 5582 the link group) 5584 Timer waits for: CONFIRM LINK reply from client 5586 Recovery action: Break the TCP connection by sending RST and 5587 clean up the link. The server should have received an SMC 5588 Decline from the client by now if the client had an LLC send 5589 failure. 5591 LLC Message: CONFIRM LINK from server (first contact, second link in 5592 the link group) 5594 Timer waits for: CONFIRM LINK reply from client 5595 Recovery action: The second link was not successfully set up. 5596 Send DELETE LINK to the client. Connection data cannot flow over 5597 the first link in the link group until the reply to this DELETE 5598 LINK is received, to prevent the peers from being out of synch on 5599 the state of the link group. 5601 LLC Message: CONFIRM LINK from server (not first contact) 5603 Timer waits for: CONFIRM LINK reply from client 5605 Recovery action: Clean up the new link and set a timer to retry. 5606 Send DELETE LINK to the client, in case the client has a longer 5607 timer interval, so the client can stop waiting. 5609 LLC Message: CONFIRM LINK REPLY from client (first contact) 5611 Timer waits for: ADD LINK from server 5613 Recovery action: Clean up the SMC-R link and break the TCP 5614 connection by sending RST over the IP fabric. There is a problem 5615 with the server. If the server had a send failure, it should 5616 have sent SMC Decline by now. 5618 LLC Message: ADD LINK from server (first contact) 5620 Timer waits for: ADD LINK reply from client 5622 Recovery action: Break the TCP connection with RST and clean up 5623 RoCE resources. The connection is past the point where the 5624 server can fall back to IP, and if the client had a send problem 5625 it should have sent SMC Decline by now.
5627 LLC Message: ADD LINK from server (not first contact) 5629 Timer waits for: ADD LINK reply from client 5631 Recovery action: Clean up resources (QP, RMB keys, etc) for the 5632 new link and treat the link that the ADD LINK was sent over as if 5633 it had failed. If there is another link available to resend the 5634 ADD LINK and the link group still needs another link, retry the 5635 ADD LINK over another link in the link group. 5637 LLC Message: ADD LINK REPLY from client (and there are more Rkeys to 5638 be communicated) 5640 Timer waits for: ADD LINK CONTINUATION from server 5641 Recovery action: Treat the same as ADD LINK timer failure 5643 LLC Message: ADD LINK REPLY or ADD LINK CONTINUATION reply from 5644 client (and there are no more Rkeys to be communicated, for the 5645 second link in a first contact scenario) 5647 Timer waits for: CONFIRM LINK from the server on the new link 5649 Recovery action: The new link has failed to set up. Send DELETE 5650 LINK to the server. Do not consider the socket opened to the 5651 client application until receiving confirmation from the server 5652 in the form of a DELETE LINK request for this link and sending 5653 the reply (to prevent the partners from being out of synch on the 5654 state of the link group). 5656 Set a timer to send another ADD LINK to the server if there is 5657 still an unused RNIC on the client side. 5659 LLC Message: ADD LINK REPLY or ADD LINK CONTINUATION reply from the 5660 client (and there are no more Rkeys to be communicated) 5662 Timer waits for: CONFIRM LINK from the server, over the new link 5664 Recovery action: Send a DELETE LINK to the server for the new 5665 link, then clean up any resource allocated for the new link and 5666 set a timer to send ADD LINK to the server if there is still an 5667 unused RNIC on the client side. The new link has failed to set 5668 up, but the link that the ADD LINK exchange occurred over is 5669 unaffected. 
5671 LLC Message: ADD LINK CONTINUATION from server 5673 Timer waits for: ADD LINK CONTINUATION REPLY from client 5675 Recovery action: Treat the same as ADD LINK timer failure. 5677 LLC Message: ADD LINK CONTINUATION reply from client (first contact, 5678 and RMB count fields indicate that the server owes more ADD LINK 5679 CONTINUATION messages) 5681 Timer waits for: ADD LINK CONTINUATION from the server 5683 Recovery action: Clean up the SMC link and break the TCP 5684 connection by sending RST. There is a problem with the server. 5686 If the server had a send failure, it should have sent SMC 5687 Decline by now. 5689 LLC Message: ADD LINK CONTINUATION reply from client (not first 5690 contact and RMB count fields indicate that the server owes more ADD 5691 LINK CONTINUATION messages) 5693 Timer waits for: ADD LINK CONTINUATION from server 5695 Recovery action: Treat as if the client detected a link failure on 5696 the link the ADD LINK exchange is using. Send DELETE LINK to 5697 the server over another active link if one exists, otherwise 5698 clean up the link group. 5700 LLC Message: DELETE LINK from client 5702 Timer waits for: DELETE LINK request from server 5704 Recovery action: If the scope of the request is to delete a 5705 single link, the surviving link over which the client sent the 5706 DELETE LINK is no longer usable either. If this is the last link 5707 in the link group, end TCP connections over the link group by 5708 sending RST packets. If there are other surviving links in the 5709 link group, resend over a surviving link. Also send a DELETE 5710 LINK over a surviving link for the link that the client attempted 5711 to send the initial DELETE LINK message over. If the scope of 5712 the request is to delete the entire link group, try resending on 5713 other links in the link group until success is achieved. If all 5714 sends fail, tear down the link group and any TCP connections that 5715 exist on it.
5717 LLC Message: DELETE LINK from server (scope: entire link group) 5719 Timer waits for: Confirmation from the adapter that the message 5720 was delivered. 5722 Recovery action: Tear down the link group and any TCP connections 5723 that exist over it. 5725 LLC Message: DELETE LINK from server (scope: single link) 5727 Timer waits for: DELETE LINK reply from the client 5729 Recovery action: The link over which the server sent the DELETE LINK 5730 is no longer usable either. If this is the last link in the link 5731 group, end TCP connections over the link group by sending RST 5732 packets. If there are other surviving links in the link group, 5733 resend over a surviving link. Also send a DELETE LINK over a 5734 surviving link for the link that the server attempted to send the 5735 initial DELETE LINK message over. If the scope of the request is 5736 to delete the entire link group, try resending on other links in 5737 the link group until success is achieved. If all sends fail, 5738 tear down the link group and any TCP connections that exist on 5739 it. 5741 LLC Message: CONFIRM RKEY from the client 5743 Timer waits for: CONFIRM RKEY REPLY from the server 5745 Recovery action: Perform normal client procedures for detection 5746 of a failed link. The link over which the message was sent has 5747 failed. 5749 LLC Message: CONFIRM RKEY from the server 5751 Timer waits for: CONFIRM RKEY REPLY from the client 5753 Recovery action: Perform normal server procedures for detection 5754 of a failed link. The link over which the message was sent has 5755 failed. 5757 LLC Message: TEST LINK from the client 5759 Timer waits for: TEST LINK REPLY from the server 5761 Recovery action: Perform normal client procedures for detection 5762 of a failed link. The link over which the message was sent has 5763 failed.
5765 LLC Message: TEST LINK from the server 5767 Timer waits for: TEST LINK REPLY from the client 5769 Recovery action: Perform normal server procedures for detection 5770 of a failed link. The link over which the message was sent has 5771 failed. 5773 The following table describes recovery actions for invalid LLC 5774 messages. These could be misformatted or contain out of synch data. 5776 LLC Message received: CONFIRM LINK from server 5778 What could be bad: Incorrect link information 5779 Recovery action: Protocol error. The link must be brought down 5780 by sending a DELETE LINK for the link over another link in the 5781 link group if one exists. If this is first contact, fall back to 5782 IP by sending SMC Decline to the server. 5784 LLC Message received: ADD LINK reply from client 5786 What could be bad: Client side link information that would result 5787 in a parallel link being set up 5789 Recovery action: Parallel links are not permitted. Delete the 5790 link by sending DELETE LINK to the client over another link in 5791 the link group. 5793 LLC Message received: Any link group command from the server except 5794 DELETE LINK for the entire link group 5796 What could be bad: Client has sent DELETE LINK for the link that 5797 the message was received on 5799 Recovery action: Ignore the LLC message. In the worst case, the server 5800 will time out. In the best case, the DELETE LINK crosses with the 5801 command from the server and the server realizes it failed. 5803 LLC Message received: ADD LINK CONTINUATION from the server or ADD 5804 LINK CONTINUATION REPLY from the client 5806 What could be bad: Number of RMBs provided doesn't match count 5807 given on initial ADD LINK or ADD LINK reply message 5809 Recovery action: Protocol error. Treat as if a link outage had been detected. 5811 LLC Message received: DELETE LINK from client 5813 What could be bad: Link indicated doesn't exist 5815 Recovery action: Assume a timing window and ignore the message.
5817 LLC Message received: CONFIRM RKEY from either client or server 5819 What could be bad: No Rkey provided for one or more of the links 5820 in the link group 5822 Recovery action: Treat as if a failure had been detected on the link(s) for 5823 which no Rkey was provided. 5825 LLC Message received: TEST LINK reply 5826 What could be bad: User data doesn't match what was sent in the 5827 TEST LINK request 5829 Recovery action: Treat as if it had been detected that the link has gone 5830 down. This is a protocol error. 5832 LLC Message received: Unknown LLC type with the two high order bits of the opcode 5833 equal to b'10' 5835 What could be bad: This is an optional LLC message which the 5836 receiver does not support 5838 Recovery action: Ignore (silently discard) the message. 5840 LLC Message received: Any unambiguously incorrect or out of synch LLC 5841 message 5843 What it indicates: Link is out of synch 5845 Recovery action: Treat as if it had been detected that the link has gone 5846 down. Note that an unsupported or unknown LLC opcode whose two 5847 high order bits are b'10' is not an error, and must be silently 5848 discarded. Any other unknown or unsupported LLC opcode is an 5849 error. 5851 C.8. Failure to add second SMC-R link to a link group 5853 When there is any failure in setting up the second SMC-R link in an 5854 SMC-R link group, including confirmation timer expiration, the SMC-R 5855 link group is allowed to continue, without available failover. 5856 However, this situation is extremely undesirable and the server must 5857 endeavor to correct it as soon as it can. 5859 The server peer in the SMC-R link group must set a timer to drive it 5860 to retry setup of a failed additional SMC-R link.
The server will 5861 immediately retry the SMC-R link setup when the first of the 5862 following events occurs: 5864 o The retry timer expires 5866 o A new RNIC becomes available to the server, on the same VLAN as 5867 the SMC-R link group 5869 o An "Add Link" LLC request message is received from the client, 5870 which indicates availability of a new RNIC on the client side. 5872 Authors' Addresses 5874 Mike Fox 5875 IBM 5876 3039 Cornwallis Rd. 5877 Research Triangle Park, NC 27709 5879 Email: mjfox@us.ibm.com 5881 Constantinos (Gus) Kassimis 5882 IBM 5883 3039 Cornwallis Rd. 5884 Research Triangle Park, NC 27709 5886 Email: kassimis@us.ibm.com 5888 Jerry Stevens 5889 IBM 5890 3039 Cornwallis Rd. 5891 Research Triangle Park, NC 27709 5893 Email: sjerry@us.ibm.com
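As an illustration of the retry rule in Section C.8, the three events that trigger an immediate retry of the second SMC-R link setup can be expressed as a small predicate. This is a hedged sketch in Python and is not part of the protocol definition: the names LinkEvent and server_should_retry, and the representation of VLANs as plain integers, are assumptions made for the example.

```python
from enum import Enum, auto


class LinkEvent(Enum):
    """Hypothetical event names for the three triggers listed in C.8."""
    RETRY_TIMER_EXPIRED = auto()
    NEW_LOCAL_RNIC = auto()        # a new RNIC became available to the server
    ADD_LINK_FROM_CLIENT = auto()  # "Add Link" LLC request received from client


def server_should_retry(event, rnic_vlan=None, group_vlan=None):
    """Return True if the event should drive an immediate retry of the
    failed additional SMC-R link setup."""
    if event is LinkEvent.RETRY_TIMER_EXPIRED:
        return True
    if event is LinkEvent.NEW_LOCAL_RNIC:
        # Only an RNIC on the same VLAN as the link group qualifies.
        return rnic_vlan is not None and rnic_vlan == group_vlan
    if event is LinkEvent.ADD_LINK_FROM_CLIENT:
        # The request indicates a new RNIC is available on the client side.
        return True
    return False
```

The server acts on whichever qualifying event occurs first; how the retry timer itself is scheduled remains implementation dependent.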