Internet-Draft                                               Tom Talpey
Expires: August 2004                            Network Appliance, Inc.
                                                        Spencer Shepler
                                                  Sun Microsystems, Inc.

                                                          February, 2004

                   NFSv4 RDMA and Session Extensions
                    draft-talpey-nfsv4-rdma-sess-01

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

Abstract

   Extensions are proposed to NFS version 4 which enable it to support
   sessions, connection management, and operation atop either TCP or
   RDMA-capable RPC.  These extensions enable universal support for
   exactly-once semantics by NFSv4 servers, enhanced security, and
   multipathing and trunking of transport connections.  These
   extensions provide identical benefits over both TCP and RDMA
   connection types.

Table of Contents

   1.  Introduction
   1.1.  Motivation
   1.2.  Problem Statement
   1.3.  NFSv4 Session Extension Characteristics
   2.  Transport Issues
   2.1.  Session Model
   2.1.1.  Connection State
   2.1.2.  Channels
   2.1.3.  Reconnection, Trunking, Failover
   2.1.4.  Server Duplicate Request Cache
   2.2.  RDMA
   2.2.1.  RDMA Requirements
   2.2.2.  RDMA Negotiation
   2.2.3.  Connection Resources
   2.2.4.  Inline Transfer Model
   2.2.5.  Direct Transfer Model
   2.3.  Connection Models
   2.3.1.  TCP Connection Model
   2.3.2.  Negotiated RDMA Connection Model
   2.3.3.  Automatic RDMA Connection Model
   2.4.  Buffer Management, Transfer, Flow Control
   2.5.  Retry and Replay
   2.6.  The Back Channel
   2.7.  COMPOUND Sizing Issues
   2.8.  Data Alignment
   3.  NFSv4 Integration
   3.1.  Minor Versioning
   3.2.  Stream Identifiers and Exactly-Once Semantics
   3.3.  COMPOUND and CB_COMPOUND
   3.4.  eXternal Data Representation Efficiency
   3.5.  Effect of Sessions on Existing Operations
   3.6.  Authentication Efficiencies
   4.  Security Considerations
   5.  IANA Considerations
   6.  NFSv4 Protocol Extensions
   6.1.  SESSION_CREATE
   6.2.  SESSION_BIND
   6.3.  SESSION_DESTROY
   6.4.  OPERATION_CONTROL
   6.5.  CB_CREDITRECALL
   7.  Acknowledgements
       References
       Authors' Addresses
       Full Copyright Statement

1.  Introduction

   This draft proposes extensions to NFS version 4 enabling it to
   support sessions and connection management, and to support
   operation atop RDMA-capable RPC over transports such as iWARP
   [RDMAP, DDP].  These extensions enable universal support for
   exactly-once semantics by NFSv4 servers, multipathing and trunking
   of transport connections, and enhanced security.  The ability to
   operate over RDMA enables greatly enhanced performance.  Operation
   over existing TCP is enhanced as well.

   While discussed here with respect to IETF-chartered transports, the
   proposed protocol is intended to function over other standards,
   such as Infiniband [IB].

   The following are the major aspects of this proposal:

   o  Changes are proposed within the framework of NFSv4 minor
      versioning.  RPC, XDR, and the NFSv4 procedures and operations
      are preserved.  The proposed minor version functions equally
      well over existing transports and RDMA, and interoperates
      transparently with existing implementations, both at the local
      programmatic interface and over the wire.

   o  An explicit session is introduced to NFSv4, and four new
      operations are added to support it.  The session allows for
      enhanced trunking, failover and recovery, and authentication
      efficiency, along with necessary support for RDMA.
      The session is implemented as operations within NFSv4 COMPOUND
      and does not impact layering or interoperability with existing
      NFSv4 implementations.  The NFSv4 callback channel is associated
      with a session, and is connected by the client and not the
      server, enhancing security and operation through firewalls.  In
      fact, the callback channel will be enabled to share the same
      connection as the operations channel.

   o  An enhanced RPC layer enables NFSv4 operation atop RDMA.  The
      session is RDMA-aware, and additional facilities are provided
      for managing RDMA resources at both NFSv4 server and client.
      Existing NFSv4 operations continue to function as before, though
      certain size limits are negotiated.  A companion draft to this
      document, "RDMA Transport for ONC RPC" [RPCRDMA], is to be
      referenced for details of RPC RDMA support.

   o  Support for exactly-once semantics ("EOS") is enabled by the new
      session facilities, providing to the server a way to bound the
      size of the duplicate request cache for a single client, and to
      manage its persistent storage.

   Block Diagram

   +-------------------+------------------------------------+
   |       NFSv4       |        NFSv4 + extensions          |
   +-------------------+-----+----------------+-------------+
   |      Operations         |    Session     |             |
   +-------------------------+----------------+             |
   |             RPC/XDR              |       |             |
   +---------------------------------+--------+             |
   |        Stream Transport         |    RDMA Transport    |
   +---------------------------------+----------------------+

1.1.  Motivation

   NFS version 4 [RFC3530] has been granted "Proposed Standard"
   status.
   The NFSv4 protocol was developed along several design points,
   important among them: effective operation over wide-area networks,
   including the Internet itself; strong security integrated into the
   protocol; extensive cross-platform interoperability including
   integrated locking semantics compatible with multiple operating
   systems; and protocol extensibility.

   The NFS version 4 protocol, however, does not provide support for
   certain important transport aspects.  For example, the protocol
   does not provide a way to implement exactly-once semantics for
   clients, nor an interoperable way to support trunking and
   multipathing of connections.  This leads to inefficiencies,
   especially where trunking and multipathing are concerned, and
   presents additional difficulties in supporting RDMA fabrics, in
   which endpoints may require dedicated or specialized resources.

   Sessions can be employed to unify NFS-level constructs such as the
   clientid with transport-level constructs such as transport
   endpoints.  The transport endpoint is abstracted to be a member of
   the session.  Resource management can be more strictly maintained,
   leading to greater server efficiency in implementing the protocol.
   The enhanced operation over a session affords an opportunity to the
   server to implement highly reliable and exactly-once semantics.

   NFSv4 advances the state of high-performance local sharing, by
   virtue of its integrated security, locking, and delegation, and its
   excellent coverage of the sharing semantics of multiple operating
   systems.  It is precisely this environment where exactly-once
   semantics become a fundamental requirement.

   Additionally, efforts to standardize a set of protocols for Remote
   Direct Memory Access (RDMA) over the Internet Protocol Suite have
   made significant progress.
   RDMA is a general solution to the problem of CPU overhead incurred
   due to data copies, primarily at the receiver.  Substantial
   research has addressed this and has borne out the efficacy of the
   approach.  An overview of this is the RDDP Problem Statement
   document [RDDPPS].

   Numerous upper layer protocols achieve extremely high bandwidth and
   low overhead through the use of RDMA.  Products from a wide variety
   of vendors employ RDMA to advantage, and prototypes have
   demonstrated the effectiveness of many more.  Here, we are
   concerned specifically with NFS and NFS-style upper layer
   protocols; examples from Network Appliance [DAFS, DCK+03], Fujitsu
   Prime Software Technologies [FJNFS, FJDAFS] and Harvard University
   [KM02] are all relevant.

   By layering a session binding for NFS version 4 directly atop a
   standard RDMA transport, a greatly enhanced level of performance
   and transparency can be supported on a wide variety of operating
   system platforms.  These combined capabilities alter the landscape
   between local filesystems and network attached storage, enable a
   new level of performance, and lead new classes of applications to
   take advantage of NFS.

1.2.  Problem Statement

   Two issues drive the current proposal: correctness and performance.
   Both are instances of "raising the bar" for NFS, whereby the desire
   to use NFS in new classes of applications can be accommodated by
   providing the basic features to make such use feasible.  Such
   applications include tightly coupled sharing environments such as
   cluster computing, high performance computing (HPC) and information
   processing such as databases.  These trends are explored in depth
   in [NFSPS].

   The first issue, correctness, is exemplified by an attribute of
   local filesystems: support for exactly-once semantics.  Such
   semantics have not been reliably available with NFS.
   Server-based duplicate request caches [CJ89] help, but do not
   reliably provide strict correctness.  For the type of application
   which is expected to make extensive use of the high-performance
   RDMA-enabled environment, the reliable provision of such semantics
   is a fundamental requirement.

   Introduction of a session to NFSv4 will address these issues.  With
   higher performance and enhanced semantics comes the problem of
   enabling advanced endpoint management, for example high-speed
   trunking, multipathing and failover.  These characteristics enable
   availability and performance.  RFC3530 presents some issues in
   permitting a single clientid to access a server over multiple
   connections.

   A second issue encountered in common by NFS implementations is the
   CPU overhead required to implement the protocol.  Primary among the
   sources of this overhead is the movement of data from NFS protocol
   messages to its eventual destination in user buffers or aligned
   kernel buffers.  These data copies consume system bus bandwidth and
   CPU time, reducing the available system capacity for applications
   [RDDPPS].  Achieving zero-copy with NFS has to date required
   sophisticated, "header cracking" hardware and/or extensive
   platform-specific virtual memory mapping tricks.

   Combined in this way, NFSv4, RDMA and the emerging high-speed
   network fabrics will enable delivery of performance which matches
   that of the fastest local filesystems, while preserving the key
   existing local filesystem semantics.

   RDMA implementations generally have other interesting properties,
   such as hardware assisted protocol access, and support for user
   space access to I/O.  RDMA is compelling here for another reason:
   hardware offloaded networking support in itself does not avoid data
   copies without resorting to implementing part of the NFS protocol
   in the NIC.
   Support of RDMA by NFS enables the highest performance at the
   architecture level rather than by implementation; this enables
   ubiquitous and interoperable solutions.

   By providing file access performance equivalent to that of local
   file systems, NFSv4 over RDMA will enable applications running on a
   set of client machines to interact through an NFSv4 file system,
   just as applications running on a single machine might interact
   through a local file system.

   This raises the issue of whether additional protocol enhancements
   to enable such interaction would be desirable and what such
   enhancements would be.  This is a complicated issue which the
   working group needs to address and will not be further discussed in
   this document.

1.3.  NFSv4 Session Extension Characteristics

   This draft will present a solution based upon minor versioning of
   NFSv4.  It will introduce a session to collect transport issues
   together, which in turn enables enhancements such as trunking,
   failover and recovery.  It will describe use of RDMA by employing
   support within an underlying RPC layer [RPCRDMA].  Most
   importantly, it will focus on making the best possible use of an
   RDMA transport.

   These extensions are proposed as elements of a new minor revision
   of NFS version 4.  In this draft, NFS version 4 will be referred to
   generically as "NFSv4" when describing properties common to all
   minor versions.  When referring specifically to properties of the
   original, minor version 0 protocol, "NFSv4.0" will be used, and
   changes proposed here for minor version 1 will be referred to as
   "NFSv4.1".

   This draft proposes only changes which are strictly upward-
   compatible with existing RPC and NFS Application Programming
   Interfaces (APIs).

2.  Transport Issues

   The Transport Issues section of the document explores the details
   of utilizing the various supported transports.

2.1.  Session Model

   The first and most evident issue in supporting diverse transports
   is how to provide for their differences.  This draft proposes
   introducing an explicit session.

   An initialized session will be required before processing requests
   contained within COMPOUND and CB_COMPOUND procedures of NFSv4.1.  A
   session introduces minimal protocol requirements, and provides for
   a highly useful and convenient way to manage numerous endpoint-
   related issues.  The session is a local construct; it represents a
   named, higher-layer object to which connections can refer, and
   encapsulates properties important to each transport layer endpoint.

   A session is a dynamically created, persistent object created by a
   client and used over time from one or more transport connections.
   Its function is to maintain the server's state relative to any
   single client instance.  This state is entirely independent of the
   connection itself.  The session in effect becomes the "top-level"
   object representing an active client.

   The session enables several things immediately.  Clients may
   disconnect and reconnect (voluntarily or not) without loss of
   context at the server.  (Of course, locks, delegations and related
   associations require special handling, and generally expire without
   an open connection.)  Clients may connect multiple transport
   endpoints to this common state.  The endpoints may all have the
   same attributes, for instance when trunked on multiple physical
   network links for bandwidth aggregation or path failover.  Or, the
   endpoints can have specific, special purpose attributes such as
   channels for callbacks.

   The NFSv4 specification does not provide for any form of flow
   control; instead it relies on the windowing provided by TCP to
   throttle requests.
   This unfortunately does not work with RDMA, which in general
   provides no operation flow control and will terminate a connection
   in error when limits are exceeded.  Flow control limits are
   therefore exchanged when a connection is bound to a session; they
   are then managed within these limits as described in [RPCRDMA].
   The bound state of a connection will be described in this document
   as a "channel".

   The presence of deterministic flow control on the channels
   belonging to a given session bounds the requirements of the
   duplicate request cache.  This can be used to advantage by a
   server, which can accurately determine any storage needs, enabling
   it to maintain persistence and to provide reliable exactly-once
   semantics.

   Finally, given adequate connection-oriented transport security
   semantics, authentication and authorization may be cached on a per-
   session basis, enabling greater efficiency in the issuing and
   processing of requests on both client and server.  A proposal for
   transparent, server-driven implementation of this in NFSv4 has been
   made [CCM].  The existence of the session greatly adds to the
   convenience of this approach.  This is discussed in detail in the
   Authentication Efficiencies section later in this draft.

2.1.1.  Connection State

   In RFC3530, the combination of a connected transport endpoint and a
   clientid forms the basis of connection state.  While provably
   workable, there are difficulties in correct and robust
   implementation.  The NFSv4.0 protocol must provide a clientid
   negotiation (SETCLIENTID and SETCLIENTID_CONFIRM), must provide a
   server-initiated connection for the callback channel, and must
   carefully specify the persistence of client state at the server in
   the face of transport interruptions.  In effect, each transport
   connection is used as the server's representation of client state.
   But transport connections are potentially fragile and transitory.

   In this proposal, a session identifier is assigned by the server
   upon initial session negotiation on each connection.  This
   identifier is used to associate additional connections, to
   renegotiate after a reconnect, and to provide an abstraction for
   the various session properties.  The session identifier is unique
   within the server's scope and may be subject to certain server
   policies, such as being bounded in time.  A channel identifier is
   issued for each new connection as it binds to the session.  The
   channel identifier is unique within the session, and may be unique
   within a wider scope, at the server's choosing.

   It is envisioned that the primary transport model will be
   connection oriented.  Connection orientation brings with it certain
   potential optimizations, such as caching of per-connection
   properties, which are easily leveraged through the generality of
   the session.  However, it is possible that in the future other
   transport models could be accommodated below the session and
   channel abstractions.

2.1.2.  Channels

   As mentioned above, different NFSv4 operations can lead to
   different resource needs.  For example, server callback operations
   (CB_RECALL) are specific, small messages which flow from server to
   client at arbitrary times, while data transfers such as read and
   write have very different sizes and asymmetric behaviors.  It is
   impractical for the RDMA peers (NFSv4 client and NFSv4 server) to
   post buffers for these various operations on a single connection.
   Commingling of requests with responses at the client receive queue
   is particularly troublesome, due to the need both to manage
   solicited and unsolicited completions, and to provision buffers for
   both purposes.
   Due to the lack of any ordering of callback requests versus
   response arrivals, without any other mechanisms the client would be
   forced to allocate all buffers sized to the worst case.

   The callback requests are likely to be handled by a different task
   context from that handling the responses.  Significant
   demultiplexing and thread management may be required if both are
   received on the same queue.

   If the client explicitly binds each new connection to an existing
   session, multiple connections may be conveniently used to separate
   traffic by channel identifier within a session.  For example, reads
   and writes may be assigned to specific, optimized channels, or
   sorted and separated by any or all of size, idempotency, etc.

   To address the problems described above, this proposal defines a
   "channel" that is created by the act of binding a connection to a
   session for a specific purpose.  A new connection may be created
   for each channel, or a single connection may be bound to more than
   one channel.  There are at least two types of channels: the
   "operations" channel used for ordinary requests from client to
   server, and the "back" channel, used for callback requests from
   server to client.  The protocol does not permit binding multiple
   duplicate operations channels to a single connection.  There is no
   benefit in doing so; supporting this would require increased
   complexity in the server duplicate request cache.

   Single Connection model:

              NFSv4.1 client instance
                        |
                     Session
                    /       \
      Operations_Channel   [Back_Channel]
                    \       /
                   Connection
                        |

   Multi-connection model (2 operations channels shown):

              NFSv4.1 client instance
                        |
                     Session
                    /       \
       Operations_Channels  [Back_Channel]
          |           |           |
      Connection  Connection  [Connection]
          |           |           |

   In this way, implementation as well as resource management may be
   optimized.  Each channel (operations, back) will have its own
   credits and buffering.  Clients which do not require certain
   behaviors may optimize such resources away completely, by not even
   creating the channels.

2.1.3.  Reconnection, Trunking, Failover

   Reconnection after failure references potentially stored state on
   the server associated with lease recovery during the grace period.
   The session provides a convenient handle for storing and managing
   information regarding the client's previous state on a per-
   connection basis, e.g. to be used upon reconnection.  Reconnection
   and rebinding to a previously existing session, and its stored
   resources, are covered in the "Connection Models" section below.

   For Reliability, Availability and Serviceability (RAS) issues such
   as bandwidth aggregation and multipathing, clients frequently seek
   to make multiple connections through multiple logical or physical
   channels.  The session is a convenient point to aggregate and
   manage these resources.

2.1.4.  Server Duplicate Request Cache

   Server duplicate request caches, while not a part of an NFS
   protocol, have become a standard, even required, part of any NFS
   implementation.  First described in [CJ89], the duplicate request
   cache was initially found to reduce work at the server by avoiding
   duplicate processing for retransmitted requests.  A second, and in
   the long run more important, benefit was improved correctness, as
   the cache prevented certain destructive non-idempotent requests
   from being reinvoked.

   However, such caches do not provide correctness guarantees; they
   cannot be managed in a reliable, persistent fashion.  The reason is
   understandable: their storage requirement is unbounded due to the
   lack of any such bound in the NFS protocol.
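   The credit-based bound on the duplicate request cache, introduced
   in the Session Model section above, can be illustrated with a short
   sketch.  This is not normative and every name in it (BoundedDRC,
   slot, seq, and the dictionary-based cache) is an illustrative
   assumption rather than protocol; it shows only how one cache slot
   per credit, plus a maximum response size, yields a fixed worst-case
   storage requirement while preserving exactly-once behavior for
   retransmitted requests.

```python
# Sketch (not normative): a duplicate request cache whose size is
# bounded by the session's negotiated credit count and maximum
# response size.  All names here are illustrative only.

class BoundedDRC:
    def __init__(self, max_credits, max_response_size):
        self.max_credits = max_credits
        self.max_response_size = max_response_size
        # One cache slot per credit: slot -> (sequence, cached reply)
        self.slots = {i: (0, None) for i in range(max_credits)}

    def storage_bound(self):
        # Worst-case persistent storage the server must reserve:
        # credit count times maximum response size.
        return self.max_credits * self.max_response_size

    def process(self, slot, seq, execute):
        if slot >= self.max_credits:
            raise ValueError("request exceeds credit limit")
        cached_seq, cached_reply = self.slots[slot]
        if seq == cached_seq and cached_reply is not None:
            # Retransmission: replay the cached reply, never
            # re-executing the (possibly non-idempotent) request.
            return cached_reply
        if seq == cached_seq + 1:
            reply = execute()            # new request: execute once
            assert len(reply) <= self.max_response_size
            self.slots[slot] = (seq, reply)
            return reply
        raise ValueError("misordered request on slot")

drc = BoundedDRC(max_credits=8, max_response_size=1024)
first = drc.process(slot=0, seq=1, execute=lambda: b"reply-1")
replay = drc.process(slot=0, seq=1, execute=lambda: b"DUPLICATE!")
assert replay == first               # duplicate was not re-executed
assert drc.storage_bound() == 8 * 1024
```

   Because the credit limit caps the number of in-progress requests,
   the server can size (and, if desired, persist) this storage up
   front, which is precisely what an unbounded traditional cache
   cannot do.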
495 As proposed in this draft, the presence of message flow control 496 credits and negotiated maximum sizes allows the size and duration 497 of the cache to be bounded, and coupled with a persistent session 498 identifier, enables its persistent storage on a per-session basis. 500 This provides a single unified mechanism which provides the 501 following guarantees required in the NFSv4 specification, while 502 extending them to all requests, rather than limiting them only to a 503 subset of state-related requests: 505 "It is critical the server maintain the last response sent to 506 the client to provide a more reliable cache of duplicate non- 507 idempotent requests than that of the traditional cache 508 described in [CJ89]..." [RFC3530] 510 The credit limit is the count of active operations, which bounds 511 the number of entries in the cache. Constraining the size of 512 operations additionally serves to limit the required storage to the 513 product of the current credit count and the maximum response size. 514 This storage requirement enables server-side efficiencies. 516 Session negotiation allows the server to maintain other state. An 517 NFSv4.1 client invoking the session destroy operation will cause 518 the server to denegotiate (close) the session, allowing the server 519 to deallocate cache entries. Clients can potentially specify that 520 such caches not be kept for appropriate types of sessions (for 521 example, read-only sessions). This can enable more efficient 522 server operation resulting in improved response times. 524 Similarly, it is important for the client to explicitly learn 525 whether the server is able to implement these semantics. Knowledge 526 of whether exactly-once semantics are in force is critical for a 527 highly reliable client, one which must provide transactional 528 integrity guarantees. 
When clients request that the semantics be 529 enabled for a given session, the session reply must inform the 530 client if the mode is in fact enabled. In this way the client can 531 confidently proceed with operations without having to implement 532 consistency facilities of its own. 534 2.2. RDMA 536 2.2.1. RDMA Requirements 538 A complete discussion of the operation of RPC-based protocols atop 539 RDMA transports is in [RPCRDMA], and a general discussion of NFS 540 RDMA requirements is in [RDMAREQ]. Where RDMA is considered, this 541 proposal assumes the use of such a layering; it addresses only the 542 upper layer issues relevant to making best use of RPC/RDMA. 544 A connection oriented (reliable sequenced) RDMA transport will be 545 required. There are several reasons for this. First, this model 546 most closely reflects the general NFSv4 requirement of long-lived 547 and congestion-controlled transports. Second, to operate correctly 548 over either an unreliable or unsequenced RDMA transport, or both, 549 would require significant complexity in the implementation and 550 protocol not appropriate for a strict minor version. For example, 551 retransmission on connected endpoints is explicitly disallowed in 552 the current NFSv4 draft; it would again be required with these 553 alternate transport characteristics. Third, the proposal assumes a 554 specific RDMA ordering semantic, which presents the same set of 555 ordering and reliability issues to the RDMA layer over such 556 transports. 558 The RDMA implementation provides for making connections to other 559 RDMA-capable peers. In the case of the current proposals before 560 the RDDP working group, these RDMA connections are preceded by a 561 "streaming" phase, where ordinary TCP (or NFS) traffic might flow. 562 However, this is not assumed here and sizes and other parameters 563 are explicitly exchanges upon entering RDMA mode in all cases. 565 2.2.2. 
RDMA Negotiation 567 It is proposed that session negotiation be the method to enable 568 RDMA mode on an NFSv4 connection. 570 On transport endpoints which support automatic RDMA mode, that is, 571 endpoints which are created in the RDMA-enabled state, a single, 572 preposted buffer must initially be provided by both peers, and the 573 client session negotiation must be the first exchange. 575 On transport endpoints supporting dynamic negotiation, a more 576 sophisticated negotiation is possible. Clients may connect to the 577 server in traditional NFSv4 mode and enter RDMA mode only after a 578 successful NFSv4.1 channel binding negotiation returning the RDMA 579 capability. If RDMA capability is not indicated, the negotiation 580 still completes and the benefits of the session are available on 581 the existing TCP stream connection. 583 Some of the parameters to be exchanged at session binding time are 584 as follows. 586 Maximum Credits 587 The client's desired maximum credits (number of concurrent 588 requests) is passed, in order to allow the server to size its 589 reply cache storage. The server may modify the client's 590 requested limit downward (or upward) to match its local policy 591 and/or resources. The maximum credits available on a single 592 bound channel may also be limited by the maximum credits for 593 the session. Over RDMA-capable RPC transports, the per- 594 request management of message credits is handled within the 595 RPC layer. [RPCRDMA] 597 Maximum Request/Response Sizes 598 The maximum request and response sizes are exchanged in order 599 to permit allocation of appropriately sized buffers and 600 request cache entries. The size must allow for certain 601 protocol minima, allowing the receipt of maximally sized 602 operations (e.g. RENAME requests which contain two name 603 strings). The server may reduce the client's requested sizes.
605 RDMA Read Resources 606 RDMA implementations must explicitly provision resources to 607 support RDMA Read requests from connected peers. These values 608 must be explicitly specified, to provide adequate resources 609 for matching the peer's expected needs and the connection's 610 delay-bandwidth parameters. The values are asymmetric and 611 should be set to zero at the server in order to conserve RDMA 612 resources, since clients do not issue RDMA Read operations in 613 this proposal. The result is communicated in the session 614 response, to permit matching of values across the connection. 615 The value may not be changed in the duration of the 616 connection, although a new value may be requested as part of a 617 reconnection. 619 Inline Padding/Alignment 620 The server can inform the client of any padding which can be 621 used to deliver NFSv4 inline WRITE payloads into aligned 622 buffers. Such alignment can be used to avoid data copy 623 operations at the server, even when direct RDMA is not used. 624 The client informs the server in each operation when padding 625 has been applied [RPCRDMA]. 627 Transport Attributes 628 A placeholder for transport-specific attributes is provided, 629 with a format to be determined. Examples of information to be 630 passed in this parameter include transport security attributes 631 to be used on the connection, RDMA-specific attributes, legacy 632 "private data" as used on existing RDMA fabrics, transport 633 Quality of Service attributes, etc. This information is to be 634 passed to the peer's transport layer by local means which is 635 currently outside the scope of this draft. 637 2.2.3. Connection Resources 639 RDMA imposes several requirements on upper layer consumers. 640 Registration of memory and the need to post buffers of a specific 641 size and number for receive operations are a primary consideration. 
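The bind-time negotiation described in the preceding section can be sketched as follows. This is a minimal illustration, not the draft's XDR: the structure fields, the policy limits, and the function names are all hypothetical, chosen only to show the server clamping the client's requested values and the resulting reply cache bound (credits times maximum response size).

```python
from dataclasses import dataclass

@dataclass
class BindParams:
    max_credits: int          # concurrent requests on the channel
    max_request_size: int     # bytes
    max_response_size: int    # bytes
    rdma_read_resources: int  # incoming RDMA Read depth (zero at the server)

# Hypothetical server policy; real limits are implementation choices.
SERVER_LIMITS = BindParams(128, 32 * 1024, 32 * 1024, 8)

def negotiate(requested: BindParams) -> BindParams:
    """The server may reduce each requested value to match local policy."""
    return BindParams(
        min(requested.max_credits, SERVER_LIMITS.max_credits),
        min(requested.max_request_size, SERVER_LIMITS.max_request_size),
        min(requested.max_response_size, SERVER_LIMITS.max_response_size),
        min(requested.rdma_read_resources, SERVER_LIMITS.rdma_read_resources))

def reply_cache_bound(granted: BindParams) -> int:
    """Reply cache storage is bounded by credits x maximum response size."""
    return granted.max_credits * granted.max_response_size

# A client asking for 1000 credits and 64KB messages is clamped to policy.
granted = negotiate(BindParams(1000, 64 * 1024, 64 * 1024, 0))
```

Note that the client's RDMA Read resource request is zero here, matching the draft's guidance that clients do not issue RDMA Reads in this proposal.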
643 Registration of memory can be a relatively high-overhead operation, 644 since it requires pinning of buffers, assignment of attributes 645 (e.g. readable/writable), and initialization of hardware 646 translation. Preregistration is desirable to reduce overhead. 647 These registrations are specific to hardware interfaces and even to 648 RDMA connection endpoints, therefore negotiation of their limits is 649 desirable to manage resources effectively. 651 Following the basic registration, these buffers must be posted by 652 the RPC layer to handle receives. These buffers remain in use by 653 the RPC/NFSv4 implementation; the size and number of them must be 654 known to the remote peer in order to avoid RDMA errors which would 655 cause a fatal error on the RDMA connection. 657 Each channel within a session will potentially have different 658 requirements, negotiated per-connection but accounted for per- 659 session. The session provides a natural way for the server to 660 manage resource allocation to each client rather than to each 661 transport connection itself. This enables considerable flexibility 662 in the administration of transport endpoints. 664 2.2.4. Inline Transfer Model 666 The RDMA Send transfer model is used for all NFS requests and 667 replies. Use of Sends is required to ensure consistency of data 668 and to deliver completion notifications. 670 Sends may carry data as well as control. When a Send carries data 671 associated with a request type, the data is referred to as 672 "inline". This method is typically used where the data payload is 673 small, or where for whatever reason target memory for RDMA is not 674 available. 
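The inline-versus-direct choice just described might be sketched as below. The function name and the decision policy are illustrative assumptions, not part of the draft; they simply capture the rule that a payload fitting the negotiated Send size may travel inline, while larger payloads require the direct (RDMA) model.

```python
def choose_transfer(payload_len: int, max_message_size: int,
                    rdma_available: bool) -> str:
    """Pick a transfer model for a data payload (illustrative policy)."""
    if rdma_available and payload_len > max_message_size:
        return "direct"   # server moves data via RDMA to advertised buffers
    if payload_len <= max_message_size:
        return "inline"   # payload rides in the Send message itself
    # No RDMA target memory and the payload exceeds the negotiated Send
    # size: this is the costly read-chunk "pull" case discussed later.
    raise ValueError("payload exceeds negotiated maximum message size")
```

A real implementation would also apply a size threshold below which inline is preferred even when RDMA is available, to amortize registration overhead.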
676 Inline message exchange 678 Client Server 679 : Request : 680 Send : ------------------------------> : untagged 681 : : buffer 682 : Response : 683 untagged : <------------------------------ : Send 684 buffer : : 686 Client Server 687 : Read request : 688 Send : ------------------------------> : untagged 689 : : buffer 690 : Read response with data : 691 untagged : <------------------------------ : Send 692 buffer : : 694 Client Server 695 : Write request with data : 696 Send : ------------------------------> : untagged 697 : : buffer 698 : Write response : 699 untagged : <------------------------------ : Send 700 buffer : : 702 Responses must be sent to the client on the same channel on which the 703 request was sent. This is important to preserve ordering of 704 operations, and especially RDMA consistency. Additionally, it 705 ensures that the RPC RDMA layer makes no requirement of the RDMA 706 provider to open its memory registration handles (Steering Tags) 707 beyond the scope of a single RDMA connection. This is an important 708 security consideration. 710 Two values must be known to each peer prior to issuing Sends: the 711 maximum number of sends which may be posted, and their maximum 712 size. These values are referred to, respectively, as the message 713 credits and the maximum message size. While the message credits 714 might vary dynamically over the duration of the session, the 715 maximum message size does not. The server must commit to posting a 716 number of receive buffers equal to or greater than its currently 717 advertised credit value, each of the advertised size. If fewer 718 credits or smaller buffers are provided, the connection may fail 719 with an RDMA transport error. 721 While tempting to consider, it is not possible to use the TCP 722 window as an RDMA operation flow control mechanism. First, to do 723 so would violate layering, requiring both senders to be aware of 724 the existing TCP outbound window at all times.
Second, since 725 requests are of variable size, the TCP window can hold a widely 726 variable number of them, and since it cannot be reduced without 727 actually receiving data, the receiver cannot limit the sender. 728 Third, any middlebox interposing on the connection would wreck any 729 possible scheme. [MIDTAX] In this proposal, credits, in the form of 730 explicit operation counts, are exchanged to allow correct 731 provisioning of receive buffers. 733 When not operating over RDMA, credits and sizes are still employed 734 in NFSv4.1, but instead of being required for correctness, they 735 provide the basis for efficient server implementation of exactly- 736 once semantics. The limits are chosen based upon the expected 737 needs and capabilities of the client and server, and are in fact 738 arbitrary. Sizes may be specified as zero (no specific size limit) 739 and credits may be chosen in proportion to the client's 740 capabilities. For example, a limit of 1000 allows 1000 requests to 741 be in progress, which may generally be far more than adequate to 742 keep local networks and servers fully utilized. 744 Both client and server have independent sizes and buffering, but 745 over RDMA fabrics client credits are easily managed by posting a 746 receive buffer prior to sending each request. Each such buffer, however, may 747 not be completed with the reply to its own request, since responses from 748 NFSv4 servers arrive in arbitrary order. When the operations 749 channel is used for callbacks, the client must account for callback 750 requests by posting additional buffers. Note that implementation- 751 specific facilities such as a "shared receive queue" may allow 752 optimization of these allocations. 754 When a connection is bound to a session (creating a channel), the 755 client requests a preferred buffer size, and the server provides 756 its answer. The server posts all buffers of at least this size.
757 The client must comply by not sending requests greater than this 758 size. It is recommended that server implementations do all they 759 can to accommodate a useful range of possible client requests. 760 There is a provision in [RPCRDMA] to allow the sending of client 761 requests which exceed the server's receive buffer size, but it 762 requires the server to "pull" the client's request as a "read 763 chunk" via RDMA Read. This introduces at least one additional 764 network roundtrip, plus other overhead such as registering memory 765 for RDMA Read at the client and additional RDMA operations at the 766 server, and is to be avoided. 768 An issue therefore arises when considering the NFSv4 COMPOUND 769 procedures. Since an arbitrary number (total size) of operations 770 can be specified in a single COMPOUND procedure, its size is 771 effectively unbounded. This cannot be supported by RDMA Sends, and 772 therefore this size negotiation places a restriction on the 773 construction and maximum size of both COMPOUND requests and 774 responses. If a COMPOUND results in a reply at the server that is 775 larger than can be sent in an RDMA Send to the client, then the 776 COMPOUND must terminate and the operation which causes the overflow 777 will provide a TOOSMALL error status result. A chaining facility 778 is provided to overcome some of the resulting limitations, 779 described later in the draft. 781 2.2.5. Direct Transfer Model 783 Placement of data by explicitly tagged RDMA operations is referred 784 to as "direct" transfer. This method is typically used where the 785 data payload is relatively large, that is, when RDMA setup has been 786 performed prior to the operation, or when any overhead for setting 787 up and performing the transfer is regained by avoiding the overhead 788 of processing an ordinary receive. 790 The client advertises RDMA buffers in this proposed model, and not 791 the server. 
This means the "XDR Decoding with Read Chunks" 792 described in [RPCRDMA] is not employed by NFSv4.1 replies, and 793 instead all results transferred via RDMA to the client employ "XDR 794 Decoding with Write Chunks". There are several reasons for this. 796 First, it allows for a correct and secure mode of transfer. The 797 client may advertise specific memory buffers only during specific 798 times, and may revoke access when it pleases. The server is not 799 required to expose copies of local file buffers for individual 800 clients, or to lock or copy them for each client access. 802 Second, client credits based on fixed-size request buffers are 803 easily managed on the server, but for the server additional 804 management of buffers for client RDMA Reads is not well-bounded. 805 For example, the client may not perform these RDMA Read operations 806 in a timely fashion, therefore the server would have to protect 807 itself against denial-of-service on these resources. 809 Third, it reduces network traffic, since buffer exposure outside 810 the scope and duration of a single request/response exchange 811 necessitates additional memory management exchanges. 813 There are costs associated with this decision. Primary among them 814 is the need for the server to employ RDMA Read for operations such 815 as large WRITE. The RDMA Read operation is a two-way exchange at 816 the RDMA layer, which incurs additional overhead relative to RDMA 817 Write. Additionally, RDMA Read requires resources at the data 818 source (the client in this proposal) to maintain state and to 819 generate replies. These costs are overcome through use of 820 pipelining with credits, with sufficient RDMA Read resources 821 negotiated at session initiation, and appropriate use of RDMA for 822 writes by the client - for example only for transfers above a 823 certain size. 825 A description of which NFSv4 operation results are eligible for 826 data transfer via RDMA Write is in [NFSDDP]. 
There are only two 827 such operations: READ and READLINK. When XDR encoding these 828 requests on an RDMA transport, the NFSv4.1 client must insert the 829 appropriate xdr_write_list entries to indicate to the server 830 whether the results should be transferred via RDMA or inline with a 831 Send. As described in [NFSDDP], a zero-length write chunk is used 832 to indicate an inline result. In this way, it is unnecessary to 833 create new operations for RDMA-mode versions of READ and READLINK. 835 Another tool to avoid creation of new, RDMA-mode operations is the 836 Reply Chunk [RPCRDMA], which is used by RPC in RDMA mode to return 837 large replies via RDMA as if they were inline. Reply chunks are 838 used for operations such as READDIR, which returns large amounts of 839 information, but in many small XDR segments. Reply chunks are 840 offered by the client and the server can use them in preference to 841 inline. Reply chunks are transparent to upper layers such as 842 NFSv4. 844 In any very rare cases where another NFSv4.1 operation requires 845 larger buffers than were negotiated when the channel was bound (for 846 example extraordinarily large RENAMEs), the underlying RPC layer 847 may support the use of "Message as an RDMA Read Chunk" and "RDMA 848 Write of Long Replies" as described in [RPCRDMA]. No additional 849 support is required in the NFSv4.1 client for this. The client 850 should be certain that its requested buffer sizes are not so small 851 as to make this a frequent occurrence, however. 853 All operations are initiated by a Send, and are completed with a 854 Send. This is exactly as in conventional NFSv4, but under RDMA has 855 a significant purpose: RDMA operations are not complete, that is, 856 guaranteed consistent, at the data sink until followed by a 857 successful Send completion (i.e. a receive). 
These events provide 858 a natural opportunity for the initiator (client) to enable and 859 later disable RDMA access to the memory which is the target of each 860 operation, in order to provide for consistent and secure operation. 861 The RDMAP Send with Invalidate operation may be worth employing in 862 this respect, as it relieves the client of certain overhead in this 863 case. 865 A "onetime" boolean advisory to each RDMA region might become a 866 hint to the server that the client will use the three-tuple for 867 only one NFSv4 operation. For a transport such as iWARP, the 868 server can assist the client in invalidating the three-tuple by 869 performing a Send with Solicited Event and Invalidate. The server 870 may ignore this hint, in which case the client must perform a local 871 invalidate after receiving the indication from the server that the 872 NFSv4 operation is complete. This may be considered in a future 873 version of this draft and [NFSDDP]. 875 In a trusted environment, it may be desirable for the client to 876 persistently enable RDMA access by the server. Such a model is 877 desirable for the highest level of efficiency and lowest overhead. 879 RDMA message exchanges 881 Client Server 882 : Direct Read Request : 883 Send : ------------------------------> : untagged 884 : : buffer 885 : Segment : 886 tagged : <------------------------------ : RDMA Write 887 buffer : : : 888 : [Segment] : 889 tagged : <------------------------------ : [RDMA Write] 890 buffer : : 891 : Direct Read Response : 892 untagged : <------------------------------ : Send (w/Inv.) 
893 buffer : : 895 Client Server 896 : Direct Write Request : 897 Send : ------------------------------> : untagged 898 : : buffer 899 : Segment : 900 tagged : v------------------------------ : RDMA Read 901 buffer : +-----------------------------> : 902 : : : 903 : [Segment] : 904 tagged : v------------------------------ : [RDMA Read] 905 buffer : +-----------------------------> : 906 : : 907 : Direct Write Response : 908 untagged : <------------------------------ : Send (w/Inv.) 909 buffer : : 911 2.3. Connection Models 913 There are three scenarios in which to discuss the connection model. 914 Each will be discussed individually, after describing the common 915 case encountered at initial connection establishment. 917 After a successful connection, the first request proceeds, in the 918 case of a new client association, to initial session creation, and 919 then to session binding, prior to regular operation. Session 920 binding, which creates a channel, is a required first step for 921 NFSv4.1 operation on each connection, and there is no change in 922 binding permitted. The client previously asserted that it does or 923 does not wish to negotiate RDMA mode in its session creation 924 request, and the server responded, possibly negatively in which 925 case all connections remain in traditional TCP mode. Special rules 926 apply for the RDMA cases, as described below. 928 In the case of a reconnect, the session creation step is not 929 performed and a session binding is attempted to the previously 930 established session only. If this rebinding is successful at the 931 server, the server will have located the previous session's state, 932 including any surviving locks, delegations, duplicate request cache 933 entries, etc. The previous session will be reestablished with its 934 previous state, ensuring exactly-once semantics of any previously 935 issued NFSv4 requests. If the rebinding fails, then the server has 936 restarted and does not support persistent state. 
This would have 937 been noted in the server's original reply to the session creation, 938 however. 940 Since the session is explicitly created and destroyed by the 941 client, and each client is uniquely identified, the server may be 942 specifically instructed to discard unneeded persistent state. For 943 this reason, it is possible that a server will retain any previous 944 state indefinitely, and place its destruction under administrative 945 control. Or, a server may choose to retain state for some 946 configurable period, provided that the period meets other NFSv4 947 requirements. 949 After successful session establishment, the traditional (TCP 950 stream) connection model used by NFSv4.0 and NFSv4.1 ensures the 951 connection is ready to proceed with issuing requests and returning 952 responses. This mode is arrived at when the client does not 953 request that the connection be placed into RDMA mode. 955 2.3.1. TCP Connection Model 957 The following is a schematic diagram of the NFSv4.1 protocol 958 exchanges leading up to normal operation on a TCP stream. 960 Client Server 961 TCPmode : Session Create(nfs_client_id4, ...) : TCPmode 962 : ------------------------------> : 963 : : 964 : Session reply(sessionid, ...) : 965 : <------------------------------ : 966 : : 967 : Session bind(session id, size S, : 968 : opchan, STREAM, credits N, ...): 969 : ------------------------------> : 970 : : 971 : Bind reply(size S', credits N') : 972 : <------------------------------ : 973 : : 974 : : 975 : ------------------------------> : 976 : <------------------------------ : 977 : : : 979 No net additional exchange is added to the initial negotiation by 980 this proposal. In the NFSv4.1 exchange, the SETCLIENTID and 981 SETCLIENTID_CONFIRM operations are not performed, as described 982 later in the document. 984 2.3.2.
Negotiated RDMA Connection Model 986 The following is a schematic diagram of the NFSv4.1 protocol 987 exchanges negotiating upgrade to RDMA mode on a TCP stream. 989 Client Server 990 TCPmode : Session Create(nfs_client_id4, ...) : TCPmode 991 : ------------------------------> : 992 : : 993 : Session reply(sessionid, ...) : 994 : <------------------------------ : 995 : : 996 : Session bind(session id, size S, : 997 : opchan, RDMA, credits N, ...) : 998 : ------------------------------> : 999 : : Prepost N' receives 1000 : Bind reply(size S', credits N') : of size S' 1001 : <------------------------------ : RDMAmode 1002 RDMAmode : : 1003 : : 1004 : ------------------------------> : 1005 : <------------------------------ : 1006 : : : 1008 In iWARP, the Bind reply and RDMA mode entry are combined into a 1009 single, atomic operation within the Provider, where the Bind reply 1010 is sent in TCP streaming mode and RDMA mode is enabled immediately. 1011 There is no opportunity for a race between the client's first 1012 operation, the preposting of receive descriptors, and RDMA mode 1013 entry at the server. 1015 2.3.3. Automatic RDMA Connection Model 1017 The following is a schematic diagram of the NFSv4.1 protocol 1018 exchanges performed on an RDMA connection. 1020 Client Server 1021 RDMAmode : : : RDMAmode 1022 : : : 1023 Prepost : : : Prepost 1024 receive : : : receive 1025 : : 1026 : Session Create(nfs_client_id4, ...) : 1027 : ------------------------------> : 1028 : : Prepost 1029 : Session reply(sessionid, ...) : receive 1030 : <------------------------------ : 1031 Prepost : : 1032 receive : Session bind(session id, size S, : 1033 : opchan, RDMA, credits N, ...) : 1034 : ------------------------------> : 1035 : : Prepost N' receives 1036 : Bind reply(size S', credits N') : of size S' 1037 : <------------------------------ : 1038 : : 1039 : : 1040 : ------------------------------> : 1041 : <------------------------------ : 1042 : : : 1044 2.4.
Buffer Management, Transfer, Flow Control 1046 Inline operations in NFSv4.1 behave effectively the same as TCP 1047 sends. Procedure results are passed in a single message, and its 1048 completion at the client signals the receiving process to inspect 1049 the message. 1051 RDMA operations are performed solely by the server in this 1052 proposal, as described in the previous "Direct Transfer Model" section. 1053 Since server RDMA operations do not result in a completion at the 1054 client, and due to ordering rules in RDMA transports, after all 1055 required RDMA operations are complete, a Send (Send with Solicited 1056 Event for iWARP) containing the procedure results is performed from 1057 server to client. This Send operation will result in a completion 1058 which will signal the client to inspect the message. 1060 In the case of client read-type NFSv4 operations, the server will 1061 have issued RDMA Writes to transfer the resulting data into client- 1062 advertised buffers. The subsequent Send operation performs two 1063 necessary functions: finalizing any active or pending DMA at the 1064 client, and signaling the client to inspect the message. 1066 In the case of client write-type NFSv4 operations, the server will 1067 have issued RDMA Reads to fetch the data from the client-advertised 1068 buffers. No data consistency issues arise at the client, but the 1069 completion of the transfer must be acknowledged, again by a Send 1070 from server to client. 1072 In either case, the client advertises buffers for direct (RDMA 1073 style) operations. The client may desire certain advertisement 1074 limits, and may wish the server to perform remote invalidation on 1075 its behalf when the server has completed its RDMA. This may be 1076 considered in a future version of this draft. 1078 Credit updates over RDMA transports are supported at the RPC layer 1079 as described in [RPCRDMA].
In each request, the client requests a 1080 desired number of credits to be made available to the channel on 1081 which it sends the request. The client must not send more requests 1082 than the number which the server has previously advertised, or in 1083 the case of the first request, only one. If the client exceeds its 1084 credit limit, the connection may close with a fatal RDMA error. 1086 The server then executes the request, and replies with an updated 1087 credit count accompanying its results. Since replies are sequenced 1088 by their RDMA Send order, the most recent results always reflect 1089 the server's limit. In this way the client will always know the 1090 maximum number of requests it may safely post. 1092 Because the client requests an arbitrary credit count in each 1093 request, it is relatively easy for the client to request more, or 1094 fewer, credits to match its expected need. A client that 1095 discovered itself frequently queuing outgoing requests due to lack 1096 of server credits might increase its requested credits 1097 proportionately in response. Or, a client might have a simple, 1098 configurable number. 1100 Occasionally, a server may wish to reduce the number of credits it 1101 offers a certain client channel. This could be encountered if a 1102 client were found to be consuming its credits slowly, or not at 1103 all. A client might notice this itself, and reduce its requested 1104 credits in advance, for instance requesting only the count of 1105 operations it currently has queued, plus a few as a base for 1106 starting up again. Such mechanisms are, however, potentially 1107 complicated and are implementation-defined. The protocol does not 1108 require them. 1110 Because of the way in which RDMA fabrics function, it is not 1111 possible for the server (or client back channel) to cancel 1112 outstanding receive operations. Therefore, effectively only one 1113 credit can be withdrawn per receive completion.
The server (or 1114 client back channel) would simply not replenish a receive operation 1115 when replying. The server can still reduce the available credit 1116 advertisement in its replies to the target value it desires, as a 1117 hint to the client that its credit target is lower and it should 1118 expect it to be reduced accordingly. Of course, even if the server 1119 could cancel outstanding receives, it could not safely do so, since the 1120 client may have already sent requests in expectation of the 1121 previous limit. 1123 This brings out an interesting scenario similar to the client 1124 reconnect discussed earlier in "Connection Models". How does the 1125 server reduce the credits of an inactive client? 1127 One approach is for the server to simply close such a connection 1128 and require the client to reconnect at a new credit limit. This is 1129 acceptable, if inefficient, when the connection setup time is short 1130 and where the server supports persistent session semantics. 1132 A better approach is to provide a back channel request to return 1133 the operations channel credits. The server may request the client 1134 to return some number of credits; the client must comply by 1135 performing operations on the operations channel, provided of course 1136 that the request does not drop the client's credit count to zero 1137 (in which case the channel would deadlock). If the client finds 1138 that it has no requests with which to consume the credits it was 1139 previously granted, it must send zero-length Send RDMA operations, 1140 or NULL NFSv4 operations in order to return the channel resources 1141 to the server. If the client fails to comply in a timely fashion, 1142 the server can recover the resources by breaking the connection.
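The client's side of the credit-return rule above can be sketched as follows. The function name and the tuple it returns are illustrative assumptions; the sketch only captures the two constraints the draft states: recalled credits are consumed by operations (queued real work first, then NULL operations or zero-length Sends), and the client's credit count must never be driven to zero.

```python
def plan_credit_return(recall_count: int, queued_requests: int,
                       current_credits: int) -> tuple:
    """Return (real_ops, null_ops) the client issues to satisfy a recall."""
    # Never let the recall drop the client's credit count to zero,
    # which would deadlock the channel.
    returnable = min(recall_count, current_credits - 1)
    real_ops = min(queued_requests, returnable)  # consume with real work first
    null_ops = returnable - real_ops             # then NULL / zero-length Send
    return real_ops, null_ops

# A client holding 10 credits with 2 queued requests, asked to return 5,
# sends 2 real requests and 3 NULL operations.
```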
1144 While in principle, the back channel credits could be subject to a 1145 similar resource adjustment, in practice this is not an issue, 1146 since the back channel is used purely for control and is expected 1147 to be statically provisioned. 1149 It is important to note that in addition to credits, the sizes of 1150 buffers are negotiated per-channel. This permits the most 1151 efficient allocation of resources on both peers. There is an 1152 important requirement on reconnection: the sizes offered at 1153 reconnect (session bind) must be at least as large as previously 1154 used, to allow recovery. Any replies that are replayed from the 1155 server's duplicate request cache must be able to be received into 1156 client buffers. In the case where a client has received replies to 1157 all its retried requests (and therefore received all its expected 1158 responses), then the client may disconnect and reconnect with 1159 different buffers at will, since no cache replay will be required. 1161 2.5. Retry and Replay 1163 NFSv4.0 forbids retransmission on active connections over reliable 1164 transports; this includes connected-mode RDMA. This restriction 1165 must be maintained in NFSv4.1. 1167 If one peer were to retransmit a request (or reply), it would 1168 consume an additional credit on the other. If the server 1169 retransmitted a reply, it would certainly result in an RDMA 1170 connection loss, since the client would typically only post a 1171 single receive buffer for each request. If the client 1172 retransmitted a request, the additional credit consumed on the 1173 server might lead to RDMA connection failure unless the client 1174 accounted for it and decreased its available credit, leading to 1175 wasted resources. 1177 Credits present a new issue to the duplicate request cache in 1178 NFSv4.1. The request cache may be used when a connection within a 1179 session is lost, such as after the client reconnects and rebinds. 
1180 Credit information is a dynamic property of the channel, and stale 1181 values must not be replayed from the cache. This may occur on 1182 another existing channel, or a new channel, with potentially new 1183 credits and buffers. This implies that the request cache contents 1184 must not be blindly used when replies are issued from it, and 1185 credit information appropriate to the channel must be refreshed by 1186 the RPC layer. 1188 Finally, RDMA fabrics do not guarantee that the memory handles 1189 (Steering Tags) within each RDMA three-tuple are valid on a scope 1190 outside that of a single connection. Therefore, handles used by 1191 the direct operations become invalid after connection loss. The 1192 server must ensure that any RDMA operations which must be replayed 1193 from the request cache use the newly provided handle(s) from the 1194 most recent request. 1196 2.6. The Back Channel 1198 The NFSv4 callback operations present a significant resource 1199 problem for the RDMA-enabled client. Clearly, their number must be 1200 negotiated in the way credits are for the ordinary operations 1201 channel for requests flowing from client to server. But, for 1202 callbacks to arrive on the same RDMA endpoint as operation replies 1203 would require dedicating additional resources, and specialized 1204 demultiplexing and event handling. Or, callbacks may not require 1205 RDMA service at all (they do not normally carry substantial data 1206 payloads). It is highly desirable to streamline this critical path 1207 via a second communications channel. 1209 The session binding facility is designed for exactly such a 1210 situation, by dynamically associating a new connected endpoint with 1211 the session, and separately negotiating sizes and counts for active 1212 operations. The ChannelType designation in the session bind 1213 operation serves to identify the channel.
The binding operation is 1214 firewall-friendly since it does not require the server to initiate 1215 the connection. 1217 This same method serves equally well for ordinary TCP connection mode. 1218 It is expected that all NFSv4.1 clients may make use of the session 1219 binding facility to streamline their design. 1221 The back channel functions exactly the same as the operations 1222 channel, except that no RDMA operations are required to perform 1223 transfers; instead, the sizes are required to be sufficiently large 1224 to carry all data inline, and of course the client and server 1225 reverse their roles with respect to which is in control of credit 1226 management. The same rules apply for all transfers, with the 1227 server being required to flow control its callback requests. 1229 The back channel is optional. If one is not bound on a given session, the 1230 server must not issue callback operations to the client. This in 1231 turn implies that such a client must never put itself in the 1232 situation where the server will need to do so, lest the client lose 1233 its connection by force, or its operation be incorrect. For the 1234 same reason, if a back channel is bound, the client is subject to 1235 revocation of its delegations if the back channel is lost. Any 1236 connection loss should be corrected by the client as soon as 1237 possible. 1239 This can be convenient for the NFSv4.1 client; if the client 1240 expects to make no use of back channel facilities such as 1241 delegations, then there is no need to create one. This may save 1242 significant resources and complexity at the client. 1244 For these reasons, if the client wishes to use the back channel, 1245 that channel must be bound first, before the operations channel. 1246 In this way, the server will not find itself in a position where it 1247 will send callbacks on the operations channel when the client is 1248 not prepared for them.
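The ordering constraint above can be illustrated with a small sketch. The Session class and channel-type strings here are hypothetical helpers for illustration, not part of the protocol's XDR; only the ordering rule itself comes from this draft.

```python
# Illustrative sketch of the bind-ordering rule: a client that wants
# callbacks binds its back channel before its operations channel.

class Session:
    def __init__(self, wants_callbacks):
        self.wants_callbacks = wants_callbacks
        self.bound = []  # channel types, in bind order

    def bind(self, channel_type):
        # If the client intends to use a back channel, it must be bound
        # before the operations channel, so the server never has a
        # callback to send with no channel prepared to carry it.
        if (channel_type == "OPERATION" and self.wants_callbacks
                and "BACK" not in self.bound):
            raise RuntimeError("bind the back channel first")
        self.bound.append(channel_type)

s = Session(wants_callbacks=True)
s.bind("BACK")       # back channel bound first
s.bind("OPERATION")  # then the operations channel
```

A client that makes no use of delegations simply never binds a back channel, and the server must not issue callbacks to it.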
1250 There is one special case: that in which the back channel is bound 1251 to the operations channel itself. This configuration would 1252 normally be used over a TCP stream connection to implement exactly the 1253 NFSv4.0 behavior, but over RDMA it would require complex resource and 1254 event management at both sides of the connection. The server is 1255 not required to accept such a bind request on an RDMA connection 1256 for this reason, though it is recommended. 1258 2.7. COMPOUND Sizing Issues 1260 Very large responses may pose duplicate request cache issues. 1261 Since servers will want to bound the storage required for such a 1262 cache, the unlimited size of response data in COMPOUND may be 1263 troublesome. If COMPOUND is used in all its generality, then a 1264 non-idempotent request might include operations that return any 1265 amount of data via RDMA. 1267 It is not satisfactory for the server to reject COMPOUNDs at will 1268 with NFS4ERR_RESOURCE when they pose such difficulties for the 1269 server, as this results in serious interoperability problems. 1270 Instead, any such limits must be explicitly exposed as attributes 1271 of the session, ensuring that the server can explicitly support any 1272 duplicate request cache needs at all times. 1274 A need may therefore arise to handle requests of a size 1275 greater than this maximum. When COMPOUNDed requests would exceed 1276 the provided buffer, a chaining facility may be used. 1278 Chaining, when used, provides for executing requests on the channel 1279 in strict sequence at the server. At most a single chain may be in 1280 effect on a channel at any time, and the chain is broken when any 1281 request within the chain is incomplete, for example when an error 1282 is returned or an incomplete result such as a short write occurs. A new 1283 error is provided for flushing subsequent chained requests.
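The chain lifecycle can be modeled as a small per-channel state machine. This hedged sketch uses the ChainFlags values and error names that appear in the OPERATION_CONTROL definition (Section 6.4); the ChainState class itself is illustrative, not part of the protocol, and its handling of a "begin" arriving mid-chain is an assumption.

```python
# Sketch of per-channel chain validation.  submit() models the arrival
# of one request; complete=False models an error or short result.

NOCHAIN, CHAINBEGIN, CHAINCONTINUE, CHAINEND = 0, 1, 2, 3

class ChainState:
    def __init__(self):
        self.active = False   # at most one chain per channel at a time
        self.broken = False   # an earlier chained request was incomplete

    def submit(self, flags, complete=True):
        """Return "OK", "CHAIN_INVALID" or "CHAIN_BROKEN"."""
        if flags == NOCHAIN:
            # Any unchained request implicitly terminates a chain.
            self.active = self.broken = False
            return "OK"
        if flags == CHAINBEGIN:
            if self.active:
                return "CHAIN_INVALID"      # begin while a chain is in effect
            self.active, self.broken = True, False
        elif not self.active:
            return "CHAIN_INVALID"          # continuation/end without a begin
        if self.broken:
            if flags == CHAINEND:
                self.active = self.broken = False
            return "CHAIN_BROKEN"           # flushed, not executed
        if not complete:
            # An error or short result breaks the chain for what follows.
            self.broken = True
        if flags == CHAINEND:
            # Explicit end; a new chain may immediately follow.
            self.active = self.broken = False
        return "OK"
```

Note that a request that itself completes short is still executed; only the requests that follow it in the chain are flushed with CHAIN_BROKEN.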
1285 Chained request sequences are subject to ordinary flow control 1286 since each request is a new, independent request on the channel. 1287 When a chain is in effect, the server executes requests strictly in 1288 the sequence as issued in the chain. When the chain is terminated 1289 by the client, server operation returns to normal, fully parallel 1290 mode. 1292 Chaining is implemented in the OPERATION_CONTROL operation within 1293 each compound. A ChainFlags word indicates the beginning, 1294 continuation and end of each chain. Requests which arrive in an 1295 unexpected state (for example, a "continuation" request without a 1296 "begin") result in a CHAIN_INVALID error. Requests which follow an 1297 incomplete result are not executed and result in a CHAIN_BROKEN 1298 error. The client terminates the chain by explicitly ending the 1299 chain with the "end" flag, or by transmitting any unchained 1300 request. The explicit "end" flag allows a chain to immediately 1301 follow another. 1303 When a chain is in effect, the current filehandle and saved 1304 filehandle are maintained across chained requests as for a single 1305 COMPOUND. This permits passing such results forward in the chain. 1307 The current and saved filehandles are not available outside the 1308 chain. 1310 2.8. Data Alignment 1312 A negotiated data alignment enables certain scatter/gather 1313 optimizations. A facility for this is supported by [RPCRDMA]. 1314 Where NFS file data is the payload, specific optimizations become 1315 highly attractive. 1317 Header padding is requested by each peer at session initiation, and 1318 may be zero (no padding). Padding leverages the useful property 1319 that RDMA receives preserve alignment of data, even when they are 1320 placed into anonymous (untagged) buffers. If requested, client 1321 inline writes will insert appropriate pad bytes within the request 1322 header to align the data payload on the specified boundary. 
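The pad computation itself is simple modular arithmetic. In this sketch, headerpadsize matches the SESSION_BIND argument of that name, while header_len is an illustrative name for the length of the RPC/NFS header preceding the payload:

```python
# Illustrative pad-size computation: the client inserts enough pad
# bytes after the request header that the WRITE payload begins on the
# negotiated alignment boundary.

def pad_bytes(header_len, headerpadsize):
    if headerpadsize == 0:        # zero negotiated means no padding
        return 0
    # Bytes needed to round header_len up to the next multiple of
    # headerpadsize (zero if already aligned).
    return (-header_len) % headerpadsize

# For example, a 148-byte header with a 4096-byte negotiated boundary
# needs 4096 - 148 = 3948 pad bytes before the payload.
```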
The 1323 client is encouraged to be optimistic and simply pad all WRITEs 1324 within the RPC layer to the negotiated size, in the expectation 1325 that the server can use them efficiently. 1327 It is highly recommended that clients offer to pad headers to an 1328 appropriate size. Most servers can make good use of such padding, 1329 which allows them to chain receive buffers in such a way that any 1330 data carried by client requests will be placed into appropriate 1331 buffers at the server, ready for filesystem processing. The 1332 receiver's RPC layer encounters no overhead from skipping over pad 1333 bytes, and the RDMA layer's high performance makes the insertion 1334 and transmission of padding on the sender a significant 1335 optimization. In this way, the need for servers to perform RDMA 1336 Read to satisfy all but the largest client writes is obviated. An 1337 added benefit is the reduction of message roundtrips on the network 1338 - a potentially good trade, where latency is present. 1340 The value to choose for padding is subject to a number of criteria. 1341 A primary source of variable-length data in the RPC header is the 1342 authentication information, the form of which is client-determined, 1343 possibly in response to server specification. The contents of 1344 COMPOUNDs, sizes of strings such as those passed to RENAME, etc. 1345 all go into the determination of a maximal NFSv4 request size and 1346 therefore minimal buffer size. The client must select its offered 1347 value carefully, so as not to overburden the server, and vice- 1348 versa. The payoff of an appropriate padding value is higher 1349 performance.

1351    Sender gather:
1352        |RPC Request|Pad bytes|Length| -> |User data...|
1353        \------+---------------------/      \
1354               \                             \
1355                \    Receiver scatter:        \--------------+- ...
1356                 /-----+----------------\      \              \
1357                |RPC Request|Pad|Length| -> |FS buffer| -> |FS buffer| -> ...
1359 In the above case, the server may recycle unused buffers to the 1360 next posted receive if they are not consumed by the actual received request, or 1361 may pass the now-complete buffers by reference for normal write 1362 processing. For a server which can make use of it, this removes 1363 any need for data copies of incoming data, without resorting to 1364 complicated end-to-end buffer advertisement and management. This 1365 includes most kernel-based and integrated server designs, among 1366 many others. The client may perform similar optimizations, if 1367 desired. 1369 Padding is negotiated by the session binding operation, and 1370 subsequently used by the RPC RDMA layer, as described in [RPCRDMA]. 1372 3. NFSv4 Integration 1374 The following section discusses the integration of the proposed 1375 RDMA extensions with NFSv4.0. 1377 3.1. Minor Versioning 1379 Minor versioning is the existing facility for extending the NFSv4 1380 protocol, and this proposal takes that approach. 1382 Minor versioning of NFSv4 is relatively restrictive, and allows 1383 only tightly limited changes. In particular, it does not permit 1384 adding new "procedures" (it permits adding only new "operations"). 1385 Interoperability concerns make it impossible to consider additional 1386 layering to be a minor revision. This somewhat limits the changes 1387 that can be proposed when considering extensions. 1389 To support exactly-once semantics integrated with sessions and flow 1390 control, it is desirable to tag each request with an identifier to 1391 be called a Streamid. This identifier must be passed by NFSv4 when 1392 running atop any transport, including traditional TCP. Therefore 1393 it is not desirable to add the Streamid to a new RPC transport, 1394 even though such a transport is indicated for support of RDMA. 1395 This draft and [RPCRDMA] do not propose such an approach.
1397 Instead, this proposal follows these requirements faithfully, 1398 through the use of a new operation within NFSv4 COMPOUND procedures 1399 as detailed below. 1401 3.2. Stream Identifiers and Exactly-Once Semantics 1403 The presence of deterministic flow control on a channel enables in- 1404 progress requests to be assigned unique values with useful 1405 properties. 1407 The RPC layer provides a transaction ID (xid), which, while 1408 required to be unique, is not especially convenient for tracking 1409 requests. The transaction ID is only meaningful to the issuer 1410 (client); it cannot be interpreted at the server except to test for 1411 equality with previously issued requests. Because RPC operations 1412 may be completed by the server in any order, many transaction IDs 1413 may be outstanding at any time. The client may therefore perform a 1414 computationally expensive lookup operation in the process of 1415 demultiplexing each reply. 1417 When flow control is in effect, there is a limit to the number of 1418 active requests. This immediately enables a convenient, 1419 computationally efficient index for each request, which is 1420 designated as a Stream Identifier, or streamid. 1422 When the client issues a new request, it selects a streamid in the 1423 range 0..N-1, where N is the server's current "totalrequests" limit 1424 granted to the client on the session over which the request is to be 1425 issued. The streamid must be unused by any of the requests which 1426 the client already has active on the session. "Unused" here means 1427 the client has no outstanding request for that streamid. Because 1428 the streamid is always an integer in the range 0..N-1, client 1429 implementations can use the streamid from a server response to 1430 efficiently match responses with outstanding requests, for 1431 example by using the streamid to index into an outstanding request 1432 array.
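The constant-time matching this enables can be sketched as follows. StreamSlots is a hypothetical client-side helper, not part of the protocol; only the 0..N-1 range and reuse rules come from the draft.

```python
# Streamids 0..N-1 double as direct indexes into the client's
# outstanding-request array, avoiding any xid-keyed search on reply.

class StreamSlots:
    def __init__(self, totalrequests):
        self.pending = [None] * totalrequests    # indexed by streamid
        self.free = list(range(totalrequests))   # currently unused streamids

    def send(self, request):
        if not self.free:
            # All streamids in use: flow control forbids issuing more.
            raise RuntimeError("no streamid free; wait for a reply")
        sid = self.free.pop()
        self.pending[sid] = request
        return sid            # carried to the server in OPERATION_CONTROL

    def reply(self, sid):
        request = self.pending[sid]              # O(1) demultiplex, no search
        self.pending[sid] = None
        self.free.append(sid)
        return request

slots = StreamSlots(totalrequests=3)
sid = slots.send("GETATTR")      # sid is some value in 0..2
assert slots.reply(sid) == "GETATTR"
```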
1434 The server in turn may use this streamid, in conjunction with the 1435 transaction id within the RPC portion of the request, to maintain 1436 its duplicate request cache (DRC) for the session, as opposed to 1437 the traditional approach of ONC RPC applications that use the XID 1438 to index into the DRC. Unlike the XID, the streamid is always 1439 within a specific range; this has two implications. The first 1440 implication is that for a given session, the server need only cache 1441 the results of a limited number of COMPOUND requests. The second 1442 implication derives from the first: unlike XID-indexed 1443 DRCs, the streamid-indexed DRC by its nature cannot be overflowed. This 1444 makes it practical to maintain all the entries required for an 1445 effective, exactly-once-semantics DRC. 1447 The streamid information must be encoded in a way 1448 that does not violate the minor versioning rules of the NFSv4.0 1449 specification. This is accomplished here by encoding it in a 1450 control operation within each NFSv4.1 COMPOUND and CB_COMPOUND 1451 procedure. The operation easily piggybacks within existing 1452 messages. The implementation section of this document describes 1453 the specific proposal. 1455 Exactly-once semantics completely replace the functionality 1456 provided by NFSv4.0 sequence numbers. It is no longer necessary to 1457 employ NFS sequence numbers, and their contents must be ignored by 1458 NFSv4.1 servers when a session is in effect for the connection. As 1459 previously discussed, such a server will never request an open- 1460 confirmation response to OPEN requests, and a client must not issue 1461 an OPEN_CONFIRM operation. 1463 In the case where the server is actively adjusting its granted flow 1464 control credits to the client, it may not be able to use receipt of 1465 the streamid to retire a cache entry.
The streamid used in an 1466 incoming request may not reflect the server's current idea of the 1467 client's credit limit, because the request may have been sent from 1468 the client before the update was received. Therefore, in the 1469 case of a downward credit adjustment, the server may have to retain a 1470 number of duplicate request cache entries at least as large as the 1471 old credit value, until operation sequencing rules allow it to 1472 infer that the client has seen its reply. 1474 Finally, note that the streamid is a guarantee of uniqueness only 1475 in the scope of an unbroken connection. A channel identifier, 1476 assigned at bind time and unique within the session, provides the 1477 means by which a broken connection is detected. If a request carries a 1478 channel identifier that does not match the channel on which it is 1479 received, then the request must be handled as a potential retry on 1480 the previous channel identifier. It is possible to receive 1481 requests up to the credit limit previously in effect for the old 1482 channel, but new requests outside this range should be rejected. 1483 As in the downward flow control adjustment case, the server may 1484 finally retire the old channel's request cache entries based on 1485 operation sequencing rules. 1487 3.3. COMPOUND and CB_COMPOUND 1489 Support for per-operation control can be piggybacked onto NFSv4 1490 COMPOUNDs with full transparency, by placing such facilities into 1491 their own new operation, and placing this operation first in each 1492 COMPOUND under the new NFSv4 minor protocol revision. The contents 1493 of the operation then apply to the entire COMPOUND. 1495 Recall that the NFSv4 minor revision is contained within the 1496 COMPOUND header, encoded prior to the COMPOUNDed operations. By 1497 simply requiring that the new operation always be contained in 1498 NFSv4 minor COMPOUNDs, the control protocol can piggyback perfectly 1499 with each request and response.
1501 In this way, the NFSv4 RDMA Extensions may stay in compliance with 1502 the minor versioning requirements specified in section 10 of 1503 [RFC3530]. 1505 Referring to section 13.1 of the same document, the proposed 1506 session-enabled COMPOUND and CB_COMPOUND have the form:

1508    +-----+--------------+-----------+------------+-----------+----
1509    | tag | minorversion |  numops   | control op | op + args | ...
1510    |     |    (== 1)    | (limited) |   + args   |           |
1511    +-----+--------------+-----------+------------+-----------+----

1513    and the reply's structure is:

1515    +------------+-----+--------+-------------------------------+--//
1516    |last status | tag | numres | status + control op + results | //
1517    +------------+-----+--------+-------------------------------+--//
1518            //-----------------------+----
1519            // status + op + results | ...
1520            //-----------------------+----

1522 The single control operation within each NFSv4.1 COMPOUND defines 1523 the context and operational session parameters which govern that 1524 COMPOUND request and reply. Placing it first in the COMPOUND 1525 encoding is required in order to allow its processing before other 1526 operations in the COMPOUND. This is especially important where 1527 chaining is in effect, as the chain must be checked for correctness 1528 prior to execution. 1530 3.4. eXternal Data Representation Efficiency 1532 RDMA is a copy avoidance technology, and it is important to 1533 maintain this efficiency when decoding received messages. 1534 Traditional XDR implementations frequently use generated 1535 unmarshaling code to convert objects to local form, incurring a 1536 data copy in the process (in addition to subjecting the caller to 1537 recursive calls, etc.). Often, such conversions are carried out 1538 even when no size or byte order conversion is necessary. 1540 It is recommended that implementations pay close attention to the 1541 details of memory referencing in such code.
It is far more 1542 efficient to inspect data in place, using native facilities to deal 1543 with word size and byte order conversion into registers or local 1544 variables, rather than formally (and blindly) performing the 1545 operation via fetch, reallocate and store. 1547 Of particular concern is the result of the READDIR_DIRECT 1548 operation, in which such encoding abounds. 1550 3.5. Effect of Sessions on Existing Operations 1552 The use of a session and associated message credits to provide 1553 exactly-once semantics allows considerable simplification of a 1554 number of mechanisms in the base protocol that are all devoted in 1555 some way to providing replay protection. In particular, the use of 1556 sequence ids on many operations becomes superfluous. Rather than 1557 replace existing operations with variants that delete the sequence 1558 ids, sequence ids will still be present, but their values must not 1559 be checked for correctness, nor used for replay protection. In 1560 addition, when a session is in effect for the connection, OPENs 1561 will never require confirmation, the server must not require 1562 confirmation, and the OPEN_CONFIRM operation must not be issued by 1563 the client. 1565 Since each session will only be used by a single client, the use of 1566 a clientid in many operations will no longer be required. Rather 1567 than remove clientid parameters, the existing operations that use 1568 them will remain unchanged, but a value of zero can be used. The 1569 determination of the client will follow from the session membership 1570 of the connection on which the request arrived. 1572 A situation similar to that of sequence numbers, described earlier, exists 1573 for NFSv4.0 clientid operations. There is no longer a need for 1574 SETCLIENTID and SETCLIENTID_CONFIRM, as clientid uniqueness is 1575 managed by the server through the session, and negotiation is both 1576 unnecessary and redundant.
Additionally, the cb_program and 1577 cb_location which are obtained by the server in SETCLIENTID_CONFIRM 1578 must not be used by the server, because the NFSv4.1 client performs 1579 callback channel designation with SESSION_BIND. A server should 1580 return an error to NFSv4.1 clients which might issue either 1581 operation. 1583 Finally, the RENEW operation is made unnecessary when a session is 1584 present, and the server should return an error to clients which 1585 might issue it. 1587 In summary, the

1589    o OPEN_CONFIRM

1591    o SETCLIENTID

1593    o SETCLIENTID_CONFIRM

1595    o RENEW

1597 operations must not be issued or handled by either client or server when 1598 a session is in effect. 1600 Since the session carries the client indication with it implicitly, 1601 any request on a session associated with a given client will renew 1602 that client's leases. 1604 3.6. Authentication Efficiencies 1606 NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor 1607 [RFC2203] to provide authentication, integrity, and privacy via 1608 cryptography. The server dictates to the client the use of 1609 RPCSEC_GSS, the service (authentication, integrity, or privacy), 1610 and the specific GSS-API security mechanism that each remote 1611 procedure call and result will use. 1613 If the connection's integrity is protected by a means additional 1614 to RPCSEC_GSS, such as IPsec, then the use of RPCSEC_GSS's 1615 integrity service is nearly redundant (see the Security 1616 Considerations section for more explanation of why it is "nearly" 1617 and not completely redundant). Likewise, if the connection's 1618 privacy is protected by additional means, then the use of both 1619 RPCSEC_GSS's integrity and privacy services is nearly redundant. 1621 Connection protection schemes, such as IPsec, are more likely to be 1622 implemented in hardware than upper layer protocols like RPCSEC_GSS.
1623 Hardware-based cryptography at the IPsec layer will be more 1624 efficient than software-based cryptography at the RPCSEC_GSS layer. 1626 When transport integrity can be obtained, it is possible for server 1627 and client to downgrade their per-operation authentication, after 1628 an appropriate exchange. This downgrade can in fact be so complete 1629 as to establish security mechanisms that have zero cryptographic 1630 overhead, effectively using the underlying integrity and privacy 1631 services provided by the transport. 1633 Based on the above observations, a new GSS-API mechanism, called 1634 the Channel Conjunction Mechanism [CCM], is being defined. CCM 1635 works by creating a GSS-API security context using as input a 1636 cookie that the initiator and target have previously agreed to be a 1637 handle for a GSS-API context created previously over another GSS-API 1638 mechanism. 1640 NFSv4.1 clients and servers should support CCM, and they must use as 1641 the cookie the handle from a successful RPCSEC_GSS context creation 1642 over a non-CCM mechanism (such as Kerberos V5). The value of the 1643 cookie will be equal to the handle field of the rpc_gss_init_res 1644 structure from the RPCSEC_GSS specification. 1646 The [CCM] draft provides further discussion and examples. 1648 4. Security Considerations 1650 NFSv4 minor version 1 retains all existing NFSv4 security; 1651 all security considerations present in NFSv4.0 apply to it equally. 1653 Security considerations of any underlying RDMA transport are 1654 additionally important, all the more so due to the emerging nature 1655 of such transports. Examining these issues is outside the scope of 1656 this draft. 1658 When protecting a connection with RPCSEC_GSS, all data in each 1659 request and response (whether transferred inline or via RDMA) 1660 continues to receive this protection over RDMA fabrics [RPCRDMA].
1661 However, when performing data transfers via RDMA, RPCSEC_GSS 1662 protection of the data transfer portion works against the 1663 efficiency which RDMA is typically employed to achieve. This is 1664 because such data is normally managed solely by the RDMA fabric, 1665 and intentionally is not touched by software. Therefore, when 1666 employing RPCSEC_GSS under CCM, and where integrity protection has 1667 been "downgraded", the cooperation of the RDMA transport provider 1668 is critical to maintain any integrity and privacy otherwise in 1669 place for the session. The means by which the local RPCSEC_GSS 1670 implementation is integrated with the RDMA data protection 1671 facilities are outside the scope of this draft. 1673 It is logical to use the same GSS context on a session's callback 1674 channel as that used on its operations channel(s), but the issue 1675 warrants careful analysis. 1677 If the NFS client wishes to maintain full control over RPCSEC_GSS 1678 protection, it may still perform its transfer operations using 1679 either the inline or RDMA transfer model, or of course employ 1680 traditional TCP stream operation. In the RDMA inline case, header 1681 padding is recommended to optimize behavior at the server. At the 1682 client, close attention should be paid to the implementation of 1683 RPCSEC_GSS processing to minimize memory referencing and especially 1684 copying. These are well-advised in any case! 1686 Proper authentication of the session binding operation of the 1687 proposed NFSv4.1 protocol exactly follows the similar requirement on client 1688 identifiers in NFSv4.0. It must not be possible for a client to 1689 bind to an existing session by guessing its session identifier. To 1690 protect against this, NFSv4.0 requires appropriate authentication 1691 and matching of the principal used. This is discussed in Section 1692 16, Security Considerations, of [RFC3530]. The same requirement 1693 applies here before binding to a session identifier.
1695 The proposed session binding improves security over that provided 1696 by NFSv4 for the callback channel. The connection is client- 1697 initiated, and subject to the same firewall and routing checks as 1698 the operations channel. The connection cannot be hijacked by an 1699 attacker who connects to the client port prior to the intended 1700 server. The connection is set up by the client with its desired 1701 attributes, such as optionally securing it with IPsec or similar. The 1702 binding is fully authenticated before being activated. 1704 The server should take care to protect itself against denial of 1705 service attacks in the creation of sessions and clientids. Clients 1706 who connect and create sessions, only to disconnect and never bind 1707 to them, may leave significant state behind. (The same issue 1708 applies to NFSv4.0 with clients who may perform SETCLIENTID, then 1709 never perform SETCLIENTID_CONFIRM.) Careful authentication coupled 1710 with resource checks is highly recommended. 1712 5. IANA Considerations 1714 Because this proposal is based on a minor protocol revision, any new minor 1715 version number might be registered and reserved with the agreed-upon 1716 specification. Assigned operation numbers and any RPC constants 1717 might undergo the same process. 1719 There are no issues stemming from RDMA use itself regarding port 1720 number assignments not already specified by [RFC3530]. Initial 1721 connection is via ordinary TCP stream services, operating on the 1722 same ports and under the same set of naming services. 1724 In the Automatic RDMA connection model described above, it is 1725 possible that a new well-known port, or a new transport type 1726 assignment (netid) as described in [RFC3530], may be desirable. 1728 6. NFSv4 Protocol Extensions 1730 This section specifies details of the five extensions to NFSv4 1731 proposed by this document. Existing NFSv4 operations (under minor 1732 version 0) continue to be fully supported, unmodified. 1734 6.1.
SESSION_CREATE

1736 SYNOPSIS

1738   sessionparams -> sessionresults

1740 ARGUMENT

1742   struct SESSIONCREATE4args {
1743           nfs_client_id4  clientid;
1744           bool            persist;
1745           uint32          totalrequests;
1746   };

1748 RESULT

1750   struct SESSIONCREATE4resok {
1751           uint64          sessionid;
1752           bool            persist;
1753           uint32          totalrequests;
1754   };

1756   union SESSIONCREATE4res switch (nfsstat4 status) {
1757     case NFS4_OK:
1758           SESSIONCREATE4resok  resok4;
1759     default:
1760           void;
1761   };

1763 DESCRIPTION

1765 The SESSION_CREATE operation creates a session to which client 1766 connections may be bound with SESSION_BIND. 1768 The "persist" argument indicates to the server whether the client 1769 requires strict response caching for the session. For example, a 1770 read-only session may set persist to FALSE. The server may choose 1771 to change the returned value of "persist" to match its 1772 implementation choice. 1774 The "totalrequests" argument allows the server to size any 1775 necessary response cache storage. It is the largest number of 1776 outstanding requests to which the client will adhere, session-wide. 1778 Note that the SESSION_CREATE operation never appears with an 1779 associated streamid. Therefore the SESSION_CREATE operation may 1780 not receive the same level of exactly-once replay protection in the 1781 face of transport failure. However, because at most one 1782 SESSION_CREATE operation may be issued on a connection, servers can 1783 provide "special" caching of the result (the sessionid) to 1784 compensate for this. 1786 ... 1788 ERRORS 1790 1792 6.2.
SESSION_BIND

1794 SYNOPSIS

1796   sessionparams -> sessionresults

1798 ARGUMENT

1800   enum ChannelType {
1801           OPERATION = 0,
1802           BACK      = 1
1803   };

1805   enum ConnectionMode {
1806           STREAM = 0,
1807           RDMA   = 1
1808   };

1810   struct SESSIONBIND4args {
1811           uint64          sessionid;
1812           ChannelType     channel;
1813           ConnectionMode  mode;
1814           count4          maxrequestsize;
1815           count4          maxresponsesize;
1816           count4          headerpadsize;
1817           count4          maxrequests;
1818           count4          maxrdmareads;
1819           opaque          transportattrs<>;
1820   };

1822 RESULT

1824   struct SESSIONBIND4resok {
1825           uint32          channelid;
1826           count4          maxrequestsize;
1827           count4          maxresponsesize;
1828           count4          headerpadsize;
1829           count4          maxrequests;
1830           count4          maxrdmareads;
1831           opaque          transportattrs<>;
1832   };

1834   union SESSIONBIND4res switch (nfsstat4 status) {
1835     case NFS4_OK:
1836           SESSIONBIND4resok  resok4;
1837     default:
1838           void;
1839   };

1841 DESCRIPTION

1843 The SESSION_BIND operation causes the connection on which the 1844 operation is issued to be associated with the specified session, 1845 creating a new channel. The channel type may be specified to be 1846 for multiple purposes. Multiple channels may be bound to a single 1847 connection within a session. Normally, only one back channel is 1848 bound. 1850 Credits and sizes are interpreted relative to the initiator of each 1851 channel; that is, the operations channel specifies server credits 1852 and sizes for the operations channel, while the back channel 1853 specifies client credits and sizes for the back channel. Padding 1854 and direct operations are generally not required on the back 1855 channel. 1857 The channelid is a unique session-wide identifier for each newly 1858 bound connection. New requests must be issued on a channel with 1859 the matching identifier, while requests retried after connection 1860 failure must reissue the original identifier. 1862 When ConnectionMode is "RDMA", the channel may be promoted to RDMA 1863 mode by the server before replying, if supported.
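Server-side handling of the size and count arguments might look like the following sketch. The clamping policy and dictionary layout shown are assumptions for illustration; the draft specifies only that the reply carries the values the server grants, and that the per-channel hint is bounded by the session-wide totalrequests.

```python
# Hypothetical server-side SESSION_BIND handling (the clamping policy
# and data layout are assumptions, not from the specification).

def session_bind(session, args):
    channelid = session["next_channelid"]       # unique within the session
    session["next_channelid"] += 1
    return {
        "channelid": channelid,
        # The server may grant less than the client offered...
        "maxrequestsize": min(args["maxrequestsize"], session["server_maxreq"]),
        "maxresponsesize": min(args["maxresponsesize"], session["server_maxresp"]),
        # ...and the per-channel hint may never exceed the session-wide
        # totalrequests negotiated at SESSION_CREATE.
        "maxrequests": min(args["maxrequests"], session["totalrequests"]),
    }

session = {"next_channelid": 0, "server_maxreq": 32768,
           "server_maxresp": 32768, "totalrequests": 64}
reply = session_bind(session, {"maxrequestsize": 65536,
                               "maxresponsesize": 16384,
                               "maxrequests": 128})
```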
1865 The "maxrequests" value is a hint which the client may use to 1866 communicate to the server its expected credit use on the channel. 1867 The client must always adhere to the "totalrequests" value, 1868 aggregated on all channels within the session, which it negotiated 1869 with the server at session creation. 1871 Note that the SESSION_BIND operation never appears with an 1872 associated streamid, but it also never requires replay protection. A 1873 client which has suffered a connection loss must immediately respond 1874 with a new SESSION_BIND, and never a retransmit. Also, for this 1875 reason, it is recommended to use SESSION_BIND alone in its request. 1877 ... 1879 ERRORS 1881 1883 6.3. SESSION_DESTROY

1885 SYNOPSIS

1887   void -> status

1889 ARGUMENT

1891   void;

1893 RESULT

1895   struct SESSION_DESTROYres {
1896           nfsstat  status;
1897   };

1899 DESCRIPTION

1901 The SESSION_DESTROY operation closes the session and discards any 1902 active state such as locks, leases, and server duplicate request 1903 cache entries. Any remaining connections bound to the session are 1904 immediately unbound and may additionally be closed by the server. 1906 This operation must be the final, or only, operation after the 1907 required OPERATION_CONTROL in any request. Because the operation 1908 results in destruction of the session, any duplicate request 1909 caching for this request, as well as for previously completed requests, 1910 will be lost. For this reason, it is advisable to not place this 1911 operation in a request with other state-modifying operations. 1913 Note that because the operation will never be replayed by the 1914 server, a client that retransmits the request may receive an error 1915 in response, even though the session may have been successfully 1916 destroyed. 1918 ... 1920 ERRORS 1922 1924 6.4.
OPERATION_CONTROL 1926 SYNOPSIS 1928 control -> control 1930 ARGUMENT 1932 enum ChainFlags { 1933 NOCHAIN = 0, 1934 CHAINBEGIN = 1, 1935 CHAINCONTINUE = 2, 1936 CHAINEND = 3 1937 }; 1939 struct OPERATIONCONTROL4args { 1940 uint32 channelid; 1941 uint32 streamid; 1942 enum ChainFlags chainflags; 1943 }; 1945 RESULT 1947 union OPERATIONCONTROL4res switch (nfsstat4 status) { 1948 case NFS4_OK: 1949 uint32 streamid; 1950 default: 1951 void; 1952 }; 1954 DESCRIPTION 1956 The OPERATION_CONTROL operation is used to manage operational 1957 accounting for the channel on which the operation is sent. The 1958 contents include the Streamid, used by the server to implement 1959 exactly-once semantics, and chaining flags to implement request 1960 chaining for the operations channel. This operation must appear 1961 once as the first operation in each COMPOUND and CB_COMPOUND sent 1962 after the channel is successfully bound, or a protocol error must 1963 result. 1965 The channelid and streamid are provided in the arguments in order 1966 to permit the server to implement duplicate request cache handling. 1967 The streamid is provided in the results in order to assist the 1968 client in efficiently demultiplexing the reply. 1970 ... 1972 ERRORS 1974 Streamid out of bounds 1975 CHAIN_INVALID and CHAIN_BROKEN 1977 6.5. CB_CREDITRECALL 1979 SYNOPSIS 1981 targetcount -> status 1983 ARGUMENT 1985 count4 target; 1987 RESULT 1989 struct CB_CREDITRECALLres { 1990 nfsstat status; 1991 }; 1993 DESCRIPTION 1995 The CB_CREDITRECALL operation requests the client to return credits 1996 at the server, by zero-length RDMA Sends or NULL NFSv4 operations. 1998 ... 2000 ERRORS 2002 2004 7. Acknowledgements 2006 The authors wish to acknowledge the valuable contributions and 2007 review of Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak, 2008 Dave Noveck and Mark Wittle. 2010 8. References 2012 [CCM] 2013 M. Eisler, N. 
        Williams, "The Channel Conjunction Mechanism (CCM) for GSS",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-ccm-02

   [CJ89]
        C. Juszczak, "Improving the Performance and Correctness of an
        NFS Server," Winter 1989 USENIX Conference Proceedings, USENIX
        Association, Berkeley, CA, February 1989, pages 53-63.

   [DAFS]
        Direct Access File System, available from
        http://www.dafscollaborative.org

   [DCK+03]
        M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck,
        T. Talpey, M. Wittle, "The Direct Access File System", in
        Proceedings of 2nd USENIX Conference on File and Storage
        Technologies (FAST '03), San Francisco, CA, March 31 - April
        2, 2003

   [DDP]
        H. Shah, J. Pinkerton, R. Recio, P. Culley, "Direct Data
        Placement over Reliable Transports",
        http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-01

   [FJDAFS]
        Fujitsu Prime Software Technologies, "Meet the DAFS
        Performance with DAFS/VI Kernel Implementation using cLAN",
        http://www.pst.fujitsu.com/english/dafsdemo/index.html

   [FJNFS]
        Fujitsu Prime Software Technologies, "An Adaptation of VIA to
        NFS on Linux",
        http://www.pst.fujitsu.com/english/nfs/index.html

   [IB]
        InfiniBand Architecture Specification, Volume 1, Release 1.1,
        available from http://www.infinibandta.org

   [KM02]
        K. Magoutis, "Design and Implementation of a Direct Access
        File System (DAFS) Kernel Server for FreeBSD", in Proceedings
        of USENIX BSDCon 2002 Conference, San Francisco, CA, February
        11-14, 2002.

   [MAF+02]
        K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase,
        D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber,
        "Structure and Performance of the Direct Access File System
        (DAFS)", in Proceedings of 2002 USENIX Annual Technical
        Conference, Monterey, CA, June 9-14, 2002.

   [MIDTAX]
        B. Carpenter, S.
        Brim, "Middleboxes: Taxonomy and Issues", Informational RFC,
        http://www.ietf.org/rfc/rfc3234

   [NFSDDP]
        B. Callaghan, T. Talpey, "NFS Direct Data Placement",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-callaghan-nfsdirect-01

   [NFSPS]
        T. Talpey, C. Juszczak, "NFS RDMA Problem Statement",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-talpey-nfs-rdma-problem-statement-01

   [RDMAREQ]
        B. Callaghan, M. Wittle, "NFS RDMA Requirements",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-callaghan-nfs-rdmareq-00

   [RFC3530]
        S. Shepler, et al., "NFS Version 4 Protocol", Standards Track
        RFC, http://www.ietf.org/rfc/rfc3530

   [RDDP]
        Remote Direct Data Placement Working Group charter,
        http://www.ietf.org/html.charters/rddp-charter.html

   [RDDPPS]
        A. Romanow, J. Mogul, T. Talpey, S. Bailey, "Remote Direct
        Data Placement Working Group Problem Statement",
        http://www.ietf.org/internet-drafts/draft-ietf-rddp-problem-statement-03

   [RDMAP]
        R. Recio, P. Culley, D. Garcia, J. Hilland, "An RDMA Protocol
        Specification",
        http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-01

   [RPCRDMA]
        B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-callaghan-rpc-rdma-01

   [RFC2203]
        M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
        Specification", Standards Track RFC,
        http://www.ietf.org/rfc/rfc2203

Authors' Addresses

   Tom Talpey
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Spencer Shepler
   Sun Microsystems, Inc.
   7808 Moonflower Drive
   Austin, TX 78750 USA

   Phone: +1 512 349 9376
   EMail: spencer.shepler@sun.com

Full Copyright Statement

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.