2 INTERNET-DRAFT Tom Talpey 3 Expires: December 2005 Network Appliance, Inc. 5 Spencer Shepler 6 Sun Microsystems, Inc. 8 Jon Bauman 9 University of Michigan 11 July, 2005 13 NFSv4 Session Extensions 14 draft-ietf-nfsv4-sess-02 16 Status of this Memo 18 By submitting this Internet-Draft, each author represents that any 19 applicable patent or other IPR claims of which he or she is aware 20 have been or will be disclosed, and any of which he or she becomes 21 aware will be disclosed, in accordance with Section 6 of BCP 79. 23 Internet-Drafts are working documents of the Internet Engineering 24 Task Force (IETF), its areas, and its working groups. Note that 25 other groups may also distribute working documents as Internet- 26 Drafts. 28 Internet-Drafts are draft documents valid for a maximum of six 29 months and may be updated, replaced, or obsoleted by other 30 documents at any time. It is inappropriate to use Internet-Drafts 31 as reference material or to cite them other than as "work in 32 progress." 34 The list of current Internet-Drafts can be accessed at 35 http://www.ietf.org/ietf/1id-abstracts.txt The list of 36 Internet-Draft Shadow Directories can be accessed at 37 http://www.ietf.org/shadow.html. 39 Copyright Notice 41 Copyright (C) The Internet Society (2005). All Rights Reserved. 43 Abstract 45 Extensions are proposed to NFS version 4 which enable it to support 46 long-lived sessions, endpoint management, and operation atop a 47 variety of RPC transports, including TCP and RDMA. These 48 extensions enable support for reliably implemented client response 49 caching by NFSv4 servers, enhanced security, multipathing and 50 trunking of transport connections. These extensions provide 51 identical benefits over both TCP and RDMA connection types. 53 Table of Contents 55 1. Introduction . . .
. . . . . . . . . . . . . . . . . . . . 3 56 1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . 4 57 1.2. Problem Statement . . . . . . . . . . . . . . . . . . . 5 58 1.3. NFSv4 Session Extension Characteristics . . . . . . . . 7 59 2. Transport Issues . . . . . . . . . . . . . . . . . . . . . 7 60 2.1. Session Model . . . . . . . . . . . . . . . . . . . . . 7 61 2.1.1. Connection State . . . . . . . . . . . . . . . . . . . 9 62 2.1.2. NFSv4 Channels, Sessions and Connections . . . . . . . 9 63 2.1.3. Reconnection, Trunking and Failover . . . . . . . . . 11 64 2.1.4. Server Duplicate Request Cache . . . . . . . . . . . . 12 65 2.2. Session Initialization and Transfer Models . . . . . . . 13 66 2.2.1. Session Negotiation . . . . . . . . . . . . . . . . . 13 67 2.2.2. RDMA Requirements . . . . . . . . . . . . . . . . . . 15 68 2.2.3. RDMA Connection Resources . . . . . . . . . . . . . . 15 69 2.2.4. TCP and RDMA Inline Transfer Model . . . . . . . . . . 16 70 2.2.5. RDMA Direct Transfer Model . . . . . . . . . . . . . . 19 71 2.3. Connection Models . . . . . . . . . . . . . . . . . . . 22 72 2.3.1. TCP Connection Model . . . . . . . . . . . . . . . . . 23 73 2.3.2. Negotiated RDMA Connection Model . . . . . . . . . . . 24 74 2.3.3. Automatic RDMA Connection Model . . . . . . . . . . . 24 75 2.4. Buffer Management, Transfer, Flow Control . . . . . . . 25 76 2.5. Retry and Replay . . . . . . . . . . . . . . . . . . . . 28 77 2.6. The Back Channel . . . . . . . . . . . . . . . . . . . . 28 78 2.7. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 30 79 2.8. Data Alignment . . . . . . . . . . . . . . . . . . . . . 30 80 3. NFSv4 Integration . . . . . . . . . . . . . . . . . . . . 31 81 3.1. Minor Versioning . . . . . . . . . . . . . . . . . . . . 32 82 3.2. Slot Identifiers and Server Duplicate Request Cache . . 32 83 3.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 36 84 3.4. eXternal Data Representation Efficiency . . . . . . . . 
37 85 3.5. Effect of Sessions on Existing Operations . . . . . . . 37 86 3.6. Authentication Efficiencies . . . . . . . . . . . . . . 38 87 4. Security Considerations . . . . . . . . . . . . . . . . . 39 88 4.1. Authentication . . . . . . . . . . . . . . . . . . . . . 40 89 5. IANA Considerations . . . . . . . . . . . . . . . . . . . 41 90 6. NFSv4 Protocol Extensions . . . . . . . . . . . . . . . . 42 91 6.1. Operation: CREATECLIENTID . . . . . . . . . . . . . . . 42 92 6.2. Operation: CREATESESSION . . . . . . . . . . . . . . . . 47 93 6.3. Operation: BIND_BACKCHANNEL . . . . . . . . . . . . . . 52 94 6.4. Operation: DESTROYSESSION . . . . . . . . . . . . . . . 54 95 6.5. Operation: SEQUENCE . . . . . . . . . . . . . . . . . . 55 96 6.6. Callback operation: CB_RECALLCREDIT . . . . . . . . . . 57 97 6.7. Callback operation: CB_SEQUENCE . . . . . . . . . . . . 57 98 7. NFSv4 Session Protocol Description . . . . . . . . . . . . 59 99 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . 65 100 9. References . . . . . . . . . . . . . . . . . . . . . . . . 65 101 9.1. Normative References . . . . . . . . . . . . . . . . . . 65 102 9.2. Informative References . . . . . . . . . . . . . . . . . 66 103 10. Authors' Addresses . . . . . . . . . . . . . . . . . . . . 68 104 11. Full Copyright Statement . . . . . . . . . . . . . . . . . 69 106 1. Introduction 108 This draft proposes extensions to NFS version 4 [RFC3530] enabling 109 it to support sessions and endpoint management, and to support 110 operation atop RDMA-capable RPC transports such as iWARP. 111 [RDMAP, DDP] These extensions enable support for exactly-once 112 semantics by NFSv4 servers, multipathing and trunking of transport 113 connections, and enhanced security. The ability to operate over 114 RDMA enables greatly enhanced performance. Operation over existing 115 TCP is enhanced as well.
117 While discussed here with respect to IETF-chartered transports, the 118 proposed protocol is intended to function over other standards, 119 such as Infiniband. [IB] 121 The following are the major aspects of this proposal: 123 o Changes are proposed within the framework of NFSv4 minor 124 versioning. RPC, XDR, and the NFSv4 procedures and operations 125 are preserved. The proposed extension functions equally well 126 over existing transports and RDMA, and interoperates 127 transparently with existing implementations, both at the local 128 programmatic interface and over the wire. 130 o An explicit session is introduced to NFSv4, and new operations 131 are added to support it. The session allows for enhanced 132 trunking, failover and recovery, and authentication 133 efficiency, along with necessary support for RDMA. The 134 session is implemented as operations within NFSv4 COMPOUND and 135 does not impact layering or interoperability with existing 136 NFSv4 implementations. The NFSv4 callback channel is 137 dynamically associated and is connected by the client and not 138 the server, enhancing security and operation through 139 firewalls. In fact, the callback channel will be enabled to 140 share the same connection as the operations channel. 142 o An enhanced RPC layer enables NFSv4 operation atop RDMA. The 143 session assists RDMA-mode connection, and additional 144 facilities are provided for managing RDMA resources at both 145 NFSv4 server and client. Existing NFSv4 operations continue 146 to function as before, though certain size limits are 147 negotiated. A companion draft to this document, "RDMA 148 Transport for ONC RPC" [RPCRDMA] is to be referenced for 149 details of RPC RDMA support. 151 o Support for exactly-once semantics ("EOS") is enabled by the 152 new session facilities, by providing to the server a way to 153 bound the size of the duplicate request cache for a single 154 client, and to manage its persistent storage. 
156 Block Diagram 158 +-----------------+-------------------------------------+ 159 | NFSv4 | NFSv4 + session extensions | 160 +-----------------+------+----------------+-------------+ 161 | Operations | Session | | 162 +------------------------+----------------+ | 163 | RPC/XDR | | 164 +-------------------------------+---------+ | 165 | Stream Transport | RDMA Transport | 166 +-------------------------------+-----------------------+ 168 1.1. Motivation 170 NFS version 4 [RFC3530] has been granted "Proposed Standard" 171 status. The NFSv4 protocol was developed along several design 172 points, important among them: effective operation over wide-area 173 networks, including the Internet itself; strong security 174 integrated into the protocol; extensive cross-platform 175 interoperability including integrated locking semantics compatible 176 with multiple operating systems; and protocol extensibility. 178 The NFS version 4 protocol, however, does not provide support for 179 certain important transport aspects. For example, the protocol 180 does not address response caching, which is required to provide 181 correctness for retried client requests across a network partition, 182 nor does it provide an interoperable way to support trunking and 183 multipathing of connections. This leads to inefficiencies, 184 especially where trunking and multipathing are concerned, and 185 presents additional difficulties in supporting RDMA fabrics, in 186 which endpoints may require dedicated or specialized resources. 188 Sessions can be employed to unify NFS-level constructs such as the 189 clientid, with transport-level constructs such as transport 190 endpoints. Each transport endpoint draws on resources via its 191 membership in a session. Resource management can be more strictly 192 maintained, leading to greater server efficiency in implementing 193 the protocol. 
The enhanced operation over a session affords an 194 opportunity to the server to implement a highly reliable duplicate 195 request cache, and thereby export exactly-once semantics. 197 NFSv4 advances the state of high-performance local sharing, by 198 virtue of its integrated security, locking, and delegation, and its 199 excellent coverage of the sharing semantics of multiple operating 200 systems. It is precisely this environment where exactly-once 201 semantics become a fundamental requirement. 203 Additionally, efforts to standardize a set of protocols for Remote 204 Direct Memory Access, RDMA, over the Internet Protocol Suite have 205 made significant progress. RDMA is a general solution to the 206 problem of CPU overhead incurred due to data copies, primarily at 207 the receiver. Substantial research has addressed this and has 208 borne out the efficacy of the approach. An overview of this is the 209 RDDP Problem Statement document, [RDDPPS]. 211 Numerous upper layer protocols achieve extremely high bandwidth and 212 low overhead through the use of RDMA. Products from a wide variety 213 of vendors employ RDMA to advantage, and prototypes have 214 demonstrated the effectiveness of many more. Here, we are 215 concerned specifically with NFS and NFS-style upper layer 216 protocols; examples from Network Appliance [DAFS, DCK+03], Fujitsu 217 Prime Software Technologies [FJNFS, FJDAFS] and Harvard University 218 [KM02] are all relevant. 220 By layering a session binding for NFS version 4 directly atop a 221 standard RDMA transport, a greatly enhanced level of performance 222 and transparency can be supported on a wide variety of operating 223 system platforms. These combined capabilities alter the landscape 224 between local filesystems and network attached storage, enable a 225 new level of performance, and lead new classes of application to 226 take advantage of NFS. 228 1.2. 
Problem Statement 230 Two issues drive the current proposal: correctness and 231 performance. Both are instances of "raising the bar" for NFS, 232 whereby the desire to use NFS in new classes of applications can be 233 accommodated by providing the basic features to make such use 234 feasible. Such applications include tightly coupled sharing 235 environments such as cluster computing, high performance computing 236 (HPC) and information processing such as databases. These trends 237 are explored in depth in [NFSPS]. 239 The first issue, correctness, is support for exactly-once 240 semantics, an attribute exemplified by local filesystems. Such 241 semantics have not been reliably available with NFS. Server-based 242 duplicate request caches [CJ89] help, but do not reliably provide 243 strict correctness. For the type of application which is expected 244 to make extensive use of the high-performance RDMA-enabled 245 environment, the reliable provision of such semantics is a 246 fundamental requirement. 248 Introduction of a session to NFSv4 will address these issues. With 249 higher performance and enhanced semantics comes the problem of 250 enabling advanced endpoint management, for example high-speed 251 trunking, multipathing and failover. These characteristics enable 252 availability and performance. RFC3530 presents some issues in 253 permitting a single clientid to access a server over multiple 254 connections. 256 A second issue encountered in common by NFS implementations is the 257 CPU overhead required to implement the protocol. Primary among the 258 sources of this overhead is the movement of data from NFS protocol 259 messages to its eventual destination in user buffers or aligned 260 kernel buffers. The data copies consume system bus bandwidth and 261 CPU time, reducing the available system capacity for applications.
262 [RDDPPS] Achieving zero-copy with NFS has to date required 263 sophisticated, "header cracking" hardware and/or extensive 264 platform-specific virtual memory mapping tricks. 266 Combined in this way, NFSv4, RDMA and the emerging high-speed 267 network fabrics will enable delivery of performance which matches 268 that of the fastest local filesystems, preserving the key existing 269 local filesystem semantics, while enhancing them by providing 270 network filesystem sharing semantics. 272 RDMA implementations generally have other interesting properties, 273 such as hardware assisted protocol access, and support for user 274 space access to I/O. RDMA is compelling here for another reason; 275 hardware offloaded networking support in itself does not avoid data 276 copies, without resorting to implementing part of the NFS protocol 277 in the NIC. Support of RDMA by NFS enables the highest performance 278 at the architecture level rather than by implementation; this 279 enables ubiquitous and interoperable solutions. 281 By providing file access performance equivalent to that of local 282 file systems, NFSv4 over RDMA will enable applications running on a 283 set of client machines to interact through an NFSv4 file system, 284 just as applications running on a single machine might interact 285 through a local file system. 287 This raises the issue of whether additional protocol enhancements 288 to enable such interaction would be desirable and what such 289 enhancements would be. This is a complicated issue which the 290 working group needs to address and will not be further discussed in 291 this document. 293 1.3. NFSv4 Session Extension Characteristics 295 This draft will present a solution based upon minor versioning of 296 NFSv4. It will introduce a session to collect transport endpoints 297 and resources such as reply caching, which in turn enables 298 enhancements such as trunking, failover and recovery. 
It will 299 describe use of RDMA by employing support within an underlying RPC 300 layer [RPCRDMA]. Most importantly, it will focus on making the 301 best possible use of an RDMA transport. 303 These extensions are proposed as elements of a new minor revision 304 of NFS version 4. In this draft, NFS version 4 will be referred to 305 generically as "NFSv4", when describing properties common to all 306 minor versions. When referring specifically to properties of the 307 original, minor version 0 protocol, "NFSv4.0" will be used, and 308 changes proposed here for minor version 1 will be referred to as 309 "NFSv4.1". 311 This draft proposes only changes which are strictly upward- 312 compatible with existing RPC and NFS Application Programming 313 Interfaces (APIs). 315 2. Transport Issues 317 The Transport Issues section of the document explores the details 318 of utilizing the various supported transports. 320 2.1. Session Model 322 The first and most evident issue in supporting diverse transports 323 is how to provide for their differences. This draft proposes 324 introducing an explicit session. 326 A session introduces minimal protocol requirements, and provides 327 for a highly useful and convenient way to manage numerous endpoint- 328 related issues. The session is a local construct; it represents a 329 named, higher-layer object to which connections can refer, and 330 encapsulates properties important to each associated client. 332 A session is a dynamically created, long-lived server object 333 created by a client, used over time from one or more transport 334 connections. Its function is to maintain the server's state 335 relative to the connection(s) belonging to a client instance. This 336 state is entirely independent of the connection itself. The 337 session in effect becomes the object representing an active client 338 on a connection or set of connections. 
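The session object described above can be sketched as a small server-side data structure. The following is an illustrative model only, under assumed names (Session, bind, disconnect) chosen here for exposition; it is not drawn from the protocol definition later in this draft.

```python
import uuid

class Session:
    """Illustrative model of a server-side session: a long-lived object,
    created by a client, whose state is independent of any one
    transport connection."""

    def __init__(self, clientid):
        self.sessionid = uuid.uuid4().hex  # unique within the server's scope
        self.clientid = clientid           # many sessions may share a clientid
        self.connections = set()           # transport endpoints currently bound

    def bind(self, conn):
        self.connections.add(conn)

    def disconnect(self, conn):
        # Losing a connection does not destroy the session's state.
        self.connections.discard(conn)

# A client instance reconnecting after a network failure finds its
# server-side context intact:
s = Session(clientid=77)
s.bind("tcp-conn-1")
s.disconnect("tcp-conn-1")   # involuntary disconnect
s.bind("tcp-conn-2")         # reconnect: no loss of context at the server
assert s.clientid == 77 and "tcp-conn-2" in s.connections
```

In this sketch the session, not the connection, is the object representing the active client, matching the text above; connections come and go beneath it.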
340 Clients may create multiple sessions for a single clientid, and may 341 wish to do so for optimization of transport resources, buffers, or 342 server behavior. A session could be created by the client to 343 represent a single mount point, for separate read and write 344 "channels", or for any number of other client-selected parameters. 346 The session enables several things immediately. Clients may 347 disconnect and reconnect (voluntarily or not) without loss of 348 context at the server. (Of course, locks, delegations and related 349 associations require special handling, and generally expire in the 350 extended absence of an open connection.) Clients may connect 351 multiple transport endpoints to this common state. The endpoints 352 may have all the same attributes, for instance when trunked on 353 multiple physical network links for bandwidth aggregation or path 354 failover. Or, the endpoints can have specific, special purpose 355 attributes such as callback channels. 357 The NFSv4 specification does not provide for any form of flow 358 control; instead it relies on the windowing provided by TCP to 359 throttle requests. This unfortunately does not work with RDMA, 360 which in general provides no operation flow control and will 361 terminate a connection in error when limits are exceeded. Limits 362 are therefore exchanged when a session is created; these limits 363 provide maxima within which each session's connections must 364 operate, and connections are managed within these limits as described in 365 [RPCRDMA]. The limits may also be modified dynamically at the 366 server's choosing by manipulating certain parameters present in 367 each NFSv4.1 request. 369 The presence of a maximum request limit on the session bounds the 370 requirements of the duplicate request cache.
This can be used to 371 advantage by a server, which can accurately determine its storage 372 needs, enabling it to maintain duplicate request cache persistence 373 and to provide reliable exactly-once semantics. 375 Finally, given adequate connection-oriented transport security 376 semantics, authentication and authorization may be cached on a per- 377 session basis, enabling greater efficiency in the issuing and 378 processing of requests on both client and server. A proposal for 379 transparent, server-driven implementation of this in NFSv4 has been 380 made. [CCM] The existence of the session greatly facilitates the 381 implementation of this approach. This is discussed in detail in 382 the Authentication Efficiencies section later in this draft. 384 2.1.1. Connection State 386 In RFC3530, the combination of a connected transport endpoint and a 387 clientid forms the basis of connection state. While this has been made 388 workable with certain limitations, there are difficulties in 389 correct and robust implementation. The NFSv4.0 protocol must 390 provide a server-initiated connection for the callback channel, and 391 must carefully specify the persistence of client state at the 392 server in the face of transport interruptions. The server has only 393 the client's transport address binding (the IP 4-tuple) to identify 394 the client RPC transaction stream and to use as a lookup tag on the 395 duplicate request cache. (A useful overview of this is in [RW96].) 396 If the server listens on multiple addresses, and the client 397 connects to more than one, it must employ different clientids on 398 each, negating its ability to aggregate bandwidth and achieve 399 redundancy. In effect, each transport connection is used as the server's 400 representation of client state. But transport connections are 401 potentially fragile and transitory. 403 In this proposal, a session identifier is assigned by the server 404 upon initial session negotiation on each connection.
This 405 identifier is used to associate additional connections, to 406 renegotiate after a reconnect, to provide an abstraction for the 407 various session properties, and to address the duplicate request 408 cache. No transport-specific information is used in the duplicate 409 request cache implementation of an NFSv4.1 server, nor in fact the 410 RPC XID itself. The session identifier is unique within the 411 server's scope and may be subject to certain server policies such 412 as being bounded in time. 414 It is envisioned that the primary transport model will be 415 connection oriented. Connection orientation brings with it certain 416 potential optimizations, such as caching of per-connection 417 properties, which are easily leveraged through the generality of 418 the session. However, it is possible that in future, other 419 transport models could be accommodated below the session 420 abstraction. 422 2.1.2. NFSv4 Channels, Sessions and Connections 424 There are at least two types of NFSv4 channels: the "operations" 425 channel used for ordinary requests from client to server, and the 426 "back" channel, used for callback requests from server to client. 428 As mentioned above, different NFSv4 operations on these channels 429 can lead to different resource needs. For example, server callback 430 operations (CB_RECALL) are specific, small messages which flow from 431 server to client at arbitrary times, while data transfers such as 432 read and write have very different sizes and asymmetric behaviors. 433 It is sometimes impractical for the RDMA peers (NFSv4 client and 434 NFSv4 server) to post buffers for these various operations on a 435 single connection. Commingling of requests with responses at the 436 client receive queue is particularly troublesome, due both to the 437 need to manage both solicited and unsolicited completions, and to 438 provision buffers for both purposes. 
Due to the lack of any 439 ordering of callback requests versus response arrivals, without any 440 other mechanisms, the client would be forced to allocate all 441 buffers sized to the worst case. 443 The callback requests are likely to be handled by a different task 444 context from that handling the responses. Significant 445 demultiplexing and thread management may be required if both are 446 received on the same queue. However, if callbacks are relatively 447 rare (perhaps due to client access patterns), many of these 448 difficulties can be minimized. 450 Also, the client may wish to perform trunking of operations channel 451 requests for performance reasons, or multipathing for availability. 452 This proposal permits both, as well as many other session and 453 connection possibilities, by permitting each operation to carry 454 session membership information and to share session (and clientid) 455 state in order to draw upon the appropriate resources. For 456 example, reads and writes may be assigned to specific, optimized 457 connections, or sorted and separated by any or all of size, 458 idempotency, etc. 460 To address the problems described above, this proposal allows 461 multiple sessions to share a clientid, as well as for multiple 462 connections to share a session. 464 Single Connection model: 466 NFSv4.1 Session 467 / \ 468 Operations_Channel [Back_Channel] 469 \ / 470 Connection 471 | 473 Multi-connection trunked model (2 operations channels shown): 475 NFSv4.1 Session 476 / \ 477 Operations_Channels [Back_Channel] 478 | | | 479 Connection Connection [Connection] 480 | | | 482 Multi-connection split-use model (2 mounts shown): 484 NFSv4.1 Session 485 / \ 486 (/home) (/usr/local - readonly) 487 / \ | 488 Operations_Channel [Back_Channel] | 489 | | Operations_Channel 490 Connection [Connection] | 491 | | Connection 492 | 494 In this way, implementation as well as resource management may be 495 optimized. 
Each session will have its own response caching and 496 buffering, and each connection or channel will have its own 497 transport resources, as appropriate. Clients which do not require 498 certain behaviors may optimize such resources away completely, by 499 using specific sessions and not even creating the additional 500 channels and connections. 502 2.1.3. Reconnection, Trunking and Failover 504 Reconnection after failure references stored state on the server 505 associated with lease recovery during the grace period. The 506 session provides a convenient handle for storing and managing 507 information regarding the client's previous state on a per- 508 connection basis, e.g. to be used upon reconnection. Reconnection 509 to a previously existing session, and its stored resources, are 510 covered in the "Connection Models" section below. 512 One important aspect of reconnection is that of RPC library 513 support. Traditionally, an Upper Layer RPC-based Protocol such as 514 NFS leaves all transport knowledge to the RPC layer implementation 515 below it. This allows NFS to operate over a wide variety of 516 transports and has proven to be a highly successful approach. The 517 session, however, introduces an abstraction which is, in a way, 518 "between" RPC and NFSv4.1. It is important that the session 519 abstraction not have ramifications within the RPC layer. 521 One such issue arises within the reconnection logic of RPC. 522 Previously, an explicit session binding operation, which 523 established session context for each new connection, was explored. 524 This however required that the session binding also be performed 525 during reconnect, which in turn required an RPC request. This 526 additional request requires new RPC semantics, both in 527 implementation and the fact that a new request is inserted into the 528 RPC stream. 
Also, the binding of a connection to a session 529 required the upper layer to become "aware" of connections, 530 something the RPC layer architecturally abstracts away. 531 Therefore the session binding is not handled in connection scope 532 but instead explicitly carried in each request. 534 For Reliability, Availability and Serviceability (RAS) issues such 535 as bandwidth aggregation and multipathing, clients frequently seek 536 to make multiple connections through multiple logical or physical 537 channels. The session is a convenient point to aggregate and 538 manage these resources. 540 2.1.4. Server Duplicate Request Cache 542 Server duplicate request caches, while not a part of an NFS 543 protocol, have become a standard, even required, part of any NFS 544 implementation. First described in [CJ89], the duplicate request 545 cache was initially found to reduce work at the server by avoiding 546 duplicate processing for retransmitted requests. A second, and in 547 the long run more important, benefit was improved correctness, as 548 the cache prevented certain destructive non-idempotent requests from 549 being reinvoked. 551 However, such caches do not provide correctness guarantees; they 552 cannot be managed in a reliable, persistent fashion. The reason is 553 understandable - their storage requirement is unbounded due to the 554 lack of any such bound in the NFS protocol, and they are dependent 555 on transport addresses for request matching. 557 As proposed in this draft, the presence of maximum request count 558 limits and negotiated maximum sizes allows the size and duration of 559 the cache to be bounded, and coupled with a long-lived session 560 identifier, enables its persistent storage on a per-session basis.
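The bound described above can be made concrete with a small sketch. The class and method names below are illustrative, not from this draft; the point is only that once a maximum request count and maximum response size are negotiated, the cache's worst-case storage is their product, known at session-creation time, and entries are addressed per session rather than by transport address.

```python
class ReplyCache:
    """Sketch of a per-session duplicate request cache bounded by
    negotiated session limits (all names here are illustrative)."""

    def __init__(self, max_requests, max_response_size):
        self.max_requests = max_requests
        self.max_response_size = max_response_size
        # One entry per concurrent request: worst-case storage is
        # max_requests * max_response_size, fixed at negotiation time,
        # which makes persistent allocation practical.
        self.entries = [None] * max_requests

    def worst_case_bytes(self):
        return self.max_requests * self.max_response_size

    def cache(self, index, reply):
        assert len(reply) <= self.max_response_size, "response exceeds limit"
        self.entries[index] = reply      # replaces any earlier cached reply

    def replay(self, index):
        return self.entries[index]       # duplicate request: resend as-is

cache = ReplyCache(max_requests=32, max_response_size=4096)
cache.cache(5, b"NFS4_OK ...")
assert cache.replay(5) == b"NFS4_OK ..."
assert cache.worst_case_bytes() == 32 * 4096   # bounded, hence persistable
```

An unbounded traditional cache offers no such guarantee; the negotiated maxima are what make reliable, persistent storage of replies feasible.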
562 This provides a single unified mechanism which provides the 563 following guarantees required in the NFSv4 specification, while 564 extending them to all requests, rather than limiting them only to a 565 subset of state-related requests: 567 "It is critical the server maintain the last response sent to 568 the client to provide a more reliable cache of duplicate non- 569 idempotent requests than that of the traditional cache 570 described in [CJ89]..." [RFC3530] 572 The maximum request count limit is the count of active operations, 573 which bounds the number of entries in the cache. Constraining the 574 size of operations additionally serves to limit the required 575 storage to the product of the current maximum request count and the 576 maximum response size. This storage requirement enables server- 577 side efficiencies. 579 Session negotiation allows the server to maintain other state. An 580 NFSv4.1 client invoking the session destroy operation will cause 581 the server to denegotiate (close) the session, allowing the server 582 to deallocate cache entries. Clients can potentially specify that 583 such caches not be kept for appropriate types of sessions (for 584 example, read-only sessions). This can enable more efficient 585 server operation resulting in improved response times, and more 586 efficient sizing of buffers and response caches. 588 Similarly, it is important for the client to explicitly learn 589 whether the server is able to implement reliable semantics. 590 Knowledge of whether these semantics are in force is critical for a 591 highly reliable client, one which must provide transactional 592 integrity guarantees. When clients request that the semantics be 593 enabled for a given session, the session reply must inform the 594 client if the mode is in fact enabled. In this way the client can 595 confidently proceed with operations without having to implement 596 consistency facilities of its own. 598 2.2. 
Session Initialization and Transfer Models 600 Session initialization issues, and data transfer models relevant to 601 both TCP and RDMA are discussed in this section. 603 2.2.1. Session Negotiation 605 The following parameters are exchanged between client and server at 606 session creation time. Their values allow the server to properly 607 size resources allocated in order to service the client's requests, 608 and to provide the server with a way to communicate limits to the 609 client for proper and optimal operation. They are exchanged prior 610 to all session-related activity, over any transport type. 611 Discussion of their use is found in their descriptions as well as 612 throughout this section. 614 Maximum Requests 615 The client's desired maximum number of concurrent requests is 616 passed, in order to allow the server to size its reply cache 617 storage. The server may modify the client's requested limit 618 downward (or upward) to match its local policy and/or 619 resources. Over RDMA-capable RPC transports, the per-request 620 management of low-level transport message credits is handled 621 within the RPC layer. [RPCRDMA] 623 Maximum Request/Response Sizes 624 The maximum request and response sizes are exchanged in order 625 to permit allocation of appropriately sized buffers and 626 request cache entries. The sizes must allow for certain 627 protocol minima, allowing the receipt of maximally sized 628 operations (e.g. RENAME requests, which contain two name 629 strings). Note the maximum request/response sizes cover the 630 entire request/response message and not simply the data 631 payload, as with the traditional NFS maximum read or write sizes. Also 632 note the server implementation may not, in fact probably does 633 not, require the reply cache entries to be sized as large as 634 the maximum response. The server may reduce the client's 635 requested sizes.
637 Inline Padding/Alignment 638 The server can inform the client of any padding which can be 639 used to deliver NFSv4 inline WRITE payloads into aligned 640 buffers. Such alignment can be used to avoid data copy 641 operations at the server for both TCP and inline RDMA 642 transfers. For RDMA, the client informs the server in each 643 operation when padding has been applied. [RPCRDMA] 645 Transport Attributes 646 A placeholder for transport-specific attributes is provided, 647 with a format to be determined. Possible examples of 648 information to be passed in this parameter include transport 649 security attributes to be used on the connection, RDMA- 650 specific attributes, legacy "private data" as used on existing 651 RDMA fabrics, transport Quality of Service attributes, etc. 652 This information is to be passed to the peer's transport layer 653 by local means which are currently outside the scope of this 654 draft; however, one attribute is provided in the RDMA case: 656 RDMA Read Resources 657 RDMA implementations must explicitly provision resources 658 to support RDMA Read requests from connected peers. 659 These values must be explicitly specified, to provide 660 adequate resources for matching the peer's expected needs 661 and the connection's delay-bandwidth parameters. The 662 client provides its chosen value to the server in the 663 initial session creation; the value must be provided at 664 each client RDMA endpoint. The values are asymmetric and 665 should be set to zero at the server in order to conserve 666 RDMA resources, since clients do not issue RDMA Read 667 operations in this proposal. The result is communicated 668 in the session response, to permit matching of values 669 across the connection. The value may not be changed for 670 the duration of the session, although a new value may be 671 requested as part of a new session. 673 2.2.2.
RDMA Requirements 675 A complete discussion of the operation of RPC-based protocols atop 676 RDMA transports is in [RPCRDMA]. Where RDMA is considered, this 677 proposal assumes the use of such a layering; it addresses only the 678 upper layer issues relevant to making best use of RPC/RDMA. 680 A connection-oriented (reliable, sequenced) RDMA transport is 681 required. There are several reasons for this. First, this model 682 most closely reflects the general NFSv4 requirement of long-lived 683 and congestion-controlled transports. Second, to operate correctly 684 over either an unreliable or unsequenced RDMA transport, or both, 685 would require significant complexity in the implementation and 686 protocol not appropriate for a strict minor version. For example, 687 retransmission on connected endpoints is explicitly disallowed in 688 the current NFSv4 draft; it would again be required with these 689 alternate transport characteristics. Third, the proposal assumes a 690 specific RDMA ordering semantic, which presents the same set of 691 ordering and reliability issues to the RDMA layer over such 692 transports. 694 The RDMA implementation provides for making connections to other 695 RDMA-capable peers. In the case of the current proposals before 696 the RDDP working group, these RDMA connections are preceded by a 697 "streaming" phase, where ordinary TCP (or NFS) traffic might flow. 698 However, this is not assumed here, and sizes and other parameters 699 are explicitly exchanged upon a session entering RDMA mode. 701 2.2.3. RDMA Connection Resources 703 On transport endpoints which support automatic RDMA mode, that is, 704 endpoints which are created in the RDMA-enabled state, a single, 705 preposted buffer must initially be provided by both peers, and the 706 client session negotiation must be the first exchange.
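The session negotiation which must occur first can be sketched roughly as follows. The zero-means-server-preferred convention for sizes is taken from this section; the concrete policy numbers and names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class SessionParams:
    max_requests: int       # concurrent request (credit) limit
    max_request_size: int   # whole-message bound, not just payload
    max_response_size: int

# Hypothetical server policy; the draft only requires that the server
# may adjust the client's requested values to match its resources.
SERVER_PREFERRED = SessionParams(256, 64 * 1024, 64 * 1024)
PROTOCOL_MIN_SIZE = 1024    # assumed floor for maximally sized ops

def negotiate(requested: SessionParams) -> SessionParams:
    """Server-side clamping of client-requested session parameters."""
    def size(value: int, cap: int) -> int:
        if value == 0:              # zero asks for the server's value
            return cap
        return max(PROTOCOL_MIN_SIZE, min(value, cap))
    return SessionParams(
        max_requests=min(requested.max_requests,
                         SERVER_PREFERRED.max_requests),
        max_request_size=size(requested.max_request_size,
                              SERVER_PREFERRED.max_request_size),
        max_response_size=size(requested.max_response_size,
                               SERVER_PREFERRED.max_response_size),
    )
```

The returned values are what the client must subsequently honor; a client requesting a size of zero simply receives the server's preferred value back.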
708 On transport endpoints supporting dynamic negotiation, a more 709 sophisticated negotiation is possible, but is not discussed in the 710 current draft. 712 RDMA imposes several requirements on upper layer consumers. 713 Registration of memory and the need to post buffers of a specific 714 size and number for receive operations are primary considerations. 716 Registration of memory can be a relatively high-overhead operation, 717 since it requires pinning of buffers, assignment of attributes 718 (e.g. readable/writable), and initialization of hardware 719 translation. Preregistration is desirable to reduce overhead. 720 These registrations are specific to hardware interfaces and even to 721 RDMA connection endpoints, therefore negotiation of their limits is 722 desirable to manage resources effectively. 724 Following the basic registration, these buffers must be posted by 725 the RPC layer to handle receives. These buffers remain in use by 726 the RPC/NFSv4 implementation; their size and number must be 727 known to the remote peer in order to avoid RDMA errors, which 728 would cause a fatal error on the RDMA connection. 730 The session provides a natural way for the server to manage 731 resource allocation to each client rather than to each transport 732 connection itself. This enables considerable flexibility in the 733 administration of transport endpoints. 735 2.2.4. TCP and RDMA Inline Transfer Model 737 The basic transfer model for both TCP and RDMA is referred to as 738 "inline". For TCP, this is the only transfer model supported, 739 since TCP carries both the RPC header and data together in the data 740 stream. 742 For RDMA, the RDMA Send transfer model is used for all NFS requests 743 and replies, but data is optionally carried by RDMA Writes or RDMA 744 Reads. Use of Sends is required to ensure consistency of data and 745 to deliver completion notifications.
The pure-Send method is 746 typically used where the data payload is small, or where for 747 whatever reason target memory for RDMA is not available. 749 Inline message exchange 751 Client Server 752 : Request : 753 Send : ------------------------------> : untagged 754 : : buffer 755 : Response : 756 untagged : <------------------------------ : Send 757 buffer : : 759 Client Server 760 : Read request : 761 Send : ------------------------------> : untagged 762 : : buffer 763 : Read response with data : 764 untagged : <------------------------------ : Send 765 buffer : : 767 Client Server 768 : Write request with data : 769 Send : ------------------------------> : untagged 770 : : buffer 771 : Write response : 772 untagged : <------------------------------ : Send 773 buffer : : 775 Responses must be sent to the client on the same connection on which 776 the request was sent. It is important that the server does not 777 assume any specific client implementation, in particular whether 778 connections within a session share any state at the client. This 779 is also important to preserve ordering of RDMA operations, and 780 especially RDMA consistency. Additionally, it ensures that the RPC 781 RDMA layer makes no requirement of the RDMA provider to open its 782 memory registration handles (Steering Tags) beyond the scope of a 783 single RDMA connection. This is an important security 784 consideration. 786 Two values must be known to each peer prior to issuing Sends: the 787 maximum number of sends which may be posted, and their maximum 788 size. These values are referred to, respectively, as the message 789 credits and the maximum message size. While the message credits 790 might vary dynamically over the duration of the session, the 791 maximum message size does not.
The server must commit to 792 preserving this number of duplicate request cache entries, and 793 preparing a number of receive buffers equal to or greater than its 794 currently advertised credit value, each of the advertised size. 795 This ensures that sufficient transport resources are allocated to 796 receive the full advertised limits. 798 Note that the server must post the maximum number of session 799 requests to each client's operations channel. The client is not 800 required to spread its requests in any particular fashion across 801 connections within a session. If the client wishes, it may create 802 multiple sessions, each with a single or small number of operations 803 channels to provide the server with this resource advantage. Or, 804 over RDMA the server may employ a "shared receive queue". The 805 server can in any case protect its resources by restricting the 806 client's request credits. 808 While tempting to consider, it is not possible to use the TCP 809 window as an RDMA operation flow control mechanism. First, to do 810 so would violate layering, requiring both senders to be aware of 811 the existing TCP outbound window at all times. Second, since 812 requests are of variable size, the TCP window can hold a widely 813 variable number of them, and since it cannot be reduced without 814 actually receiving data, the receiver cannot limit the sender. 815 Third, any middlebox interposing on the connection would wreck any 816 possible scheme. [MIDTAX] In this proposal, maximum request count 817 limits are exchanged at the session level to allow correct 818 provisioning of receive buffers by transports. 820 When operating over TCP or other similar transport, request limits 821 and sizes are still employed in NFSv4.1, but instead of being 822 required for correctness, they provide the basis for efficient 823 server implementation of the duplicate request cache.
The limits 824 are chosen based upon the expected needs and capabilities of the 825 client and server, and are in fact arbitrary. Sizes may be 826 specified by the client as zero (requesting the server's preferred 827 or optimal value), and request limits may be chosen in proportion 828 to the client's capabilities. For example, a limit of 1000 allows 829 1000 requests to be in progress, which may generally be far more 830 than adequate to keep local networks and servers fully utilized. 832 Both client and server have independent sizes and buffering, but 833 over RDMA fabrics client credits are easily managed by posting a 834 receive buffer prior to sending each request. Each such buffer may 835 not be completed with the corresponding reply, since responses from 836 NFSv4 servers arrive in arbitrary order. When an operations 837 channel is also used for callbacks, the client must account for 838 callback requests by posting additional buffers. Note that 839 implementation-specific facilities such as a shared receive queue 840 may also allow optimization of these allocations. 842 When a session is created, the client requests a preferred buffer 843 size, and the server provides its answer. The server posts all 844 buffers of at least this size. The client must comply by not 845 sending requests greater than this size. It is recommended that 846 server implementations do all they can to accommodate a useful 847 range of possible client requests. There is a provision in 848 [RPCRDMA] to allow the sending of client requests which exceed the 849 server's receive buffer size, but it requires the server to "pull" 850 the client's request as a "read chunk" via RDMA Read. This 851 introduces at least one additional network roundtrip, plus other 852 overhead such as registering memory for RDMA Read at the client and 853 additional RDMA operations at the server, and is to be avoided. 855 An issue therefore arises when considering the NFSv4 COMPOUND 856 procedures. 
Since an arbitrary number (total size) of operations 857 can be specified in a single COMPOUND procedure, its size is 858 effectively unbounded. This cannot be supported by RDMA Sends, and 859 therefore this size negotiation places a restriction on the 860 construction and maximum size of both COMPOUND requests and 861 responses. If a COMPOUND results in a reply at the server that is 862 larger than can be sent in an RDMA Send to the client, then the 863 COMPOUND must terminate and the operation which causes the overflow 864 will provide a TOOSMALL error status result. 866 2.2.5. RDMA Direct Transfer Model 868 Placement of data by explicitly tagged RDMA operations is referred 869 to as "direct" transfer. This method is typically used where the 870 data payload is relatively large, that is, when RDMA setup has been 871 performed prior to the operation, or when any overhead for setting 872 up and performing the transfer is regained by avoiding the overhead 873 of processing an ordinary receive. 875 The client advertises RDMA buffers in this proposed model, and not 876 the server. This means the "XDR Decoding with Read Chunks" 877 described in [RPCRDMA] is not employed by NFSv4.1 replies, and 878 instead all results transferred via RDMA to the client employ "XDR 879 Decoding with Write Chunks". There are several reasons for this. 881 First, it allows for a correct and secure mode of transfer. The 882 client may advertise specific memory buffers only during specific 883 times, and may revoke access when it pleases. The server is not 884 required to expose copies of local file buffers for individual 885 clients, or to lock or copy them for each client access. 887 Second, client credits based on fixed-size request buffers are 888 easily managed on the server, but for the server additional 889 management of buffers for client RDMA Reads is not well-bounded. 
891 For example, the client may not perform these RDMA Read operations 892 in a timely fashion, therefore the server would have to protect 893 itself against denial-of-service on these resources. 895 Third, it reduces network traffic, since buffer exposure outside 896 the scope and duration of a single request/response exchange 897 necessitates additional memory management exchanges. 899 There are costs associated with this decision. Primary among them 900 is the need for the server to employ RDMA Read for operations such 901 as large WRITE. The RDMA Read operation is a two-way exchange at 902 the RDMA layer, which incurs additional overhead relative to RDMA 903 Write. Additionally, RDMA Read requires resources at the data 904 source (the client in this proposal) to maintain state and to 905 generate replies. These costs are overcome through use of 906 pipelining with credits, with sufficient RDMA Read resources 907 negotiated at session initiation, and appropriate use of RDMA for 908 writes by the client - for example only for transfers above a 909 certain size. 911 A description of which NFSv4 operation results are eligible for 912 data transfer via RDMA Write is in [NFSDDP]. There are only two 913 such operations: READ and READLINK. When XDR encoding these 914 requests on an RDMA transport, the NFSv4.1 client must insert the 915 appropriate xdr_write_list entries to indicate to the server 916 whether the results should be transferred via RDMA or inline with a 917 Send. As described in [NFSDDP], a zero-length write chunk is used 918 to indicate an inline result. In this way, it is unnecessary to 919 create new operations for RDMA-mode versions of READ and READLINK. 921 Another tool to avoid creation of new, RDMA-mode operations is the 922 Reply Chunk [RPCRDMA], which is used by RPC in RDMA mode to return 923 large replies via RDMA as if they were inline. 
Reply chunks are 924 used for operations such as READDIR, which returns large amounts of 925 information, but in many small XDR segments. Reply chunks are 926 offered by the client and the server can use them in preference to 927 inline. Reply chunks are transparent to upper layers such as 928 NFSv4. 930 In the very rare cases where another NFSv4.1 operation requires 931 larger buffers than were negotiated when the session was created 932 (for example extraordinarily large RENAMEs), the underlying RPC 933 layer may support the use of "Message as an RDMA Read Chunk" and 934 "RDMA Write of Long Replies" as described in [RPCRDMA]. No 935 additional support is required in the NFSv4.1 client for this. The 936 client should be certain that its requested buffer sizes are not so 937 small as to make this a frequent occurrence, however. 939 All operations are initiated by a Send, and are completed with a 940 Send. This is exactly as in conventional NFSv4, but under RDMA has 941 a significant purpose: RDMA operations are not complete, that is, 942 guaranteed consistent, at the data sink until followed by a 943 successful Send completion (i.e. a receive). These events provide 944 a natural opportunity for the initiator (client) to enable and 945 later disable RDMA access to the memory which is the target of each 946 operation, in order to provide for consistent and secure operation. 947 The RDMAP Send with Invalidate operation may be worth employing in 948 this respect, as it relieves the client of certain overhead in this 949 case. 951 A "onetime" boolean advisory attached to each RDMA region might become a 952 hint to the server that the client will use the three-tuple for 953 only one NFSv4 operation. For a transport such as iWARP, the 954 server can assist the client in invalidating the three-tuple by 955 performing a Send with Solicited Event and Invalidate.
The server 956 may ignore this hint, in which case the client must perform a local 957 invalidate after receiving the indication from the server that the 958 NFSv4 operation is complete. This may be considered in a future 959 version of this draft and [NFSDDP]. 961 In a trusted environment, it may be desirable for the client to 962 persistently enable RDMA access by the server. Such a model is 963 desirable for the highest level of efficiency and lowest overhead. 965 RDMA message exchanges 967 Client Server 968 : Direct Read Request : 969 Send : ------------------------------> : untagged 970 : : buffer 971 : Segment : 972 tagged : <------------------------------ : RDMA Write 973 buffer : : : 974 : [Segment] : 975 tagged : <------------------------------ : [RDMA Write] 976 buffer : : 977 : Direct Read Response : 978 untagged : <------------------------------ : Send (w/Inv.) 979 buffer : : 981 Client Server 982 : Direct Write Request : 983 Send : ------------------------------> : untagged 984 : : buffer 985 : Segment : 986 tagged : v------------------------------ : RDMA Read 987 buffer : +-----------------------------> : 988 : : : 989 : [Segment] : 990 tagged : v------------------------------ : [RDMA Read] 991 buffer : +-----------------------------> : 992 : : 993 : Direct Write Response : 994 untagged : <------------------------------ : Send (w/Inv.) 995 buffer : : 997 2.3. Connection Models 999 There are three scenarios in which to discuss the connection model. 1000 Each will be discussed individually, after describing the common 1001 case encountered at initial connection establishment. 1003 After a successful connection, the first request proceeds, in the 1004 case of a new client association, to initial session creation, and 1005 then optionally to session callback channel binding, prior to 1006 regular operation. 1008 Commonly, each new client "mount" will be the action which drives 1009 creation of a new session. 
However, there are any number of other 1010 approaches. Clients may choose to share a single connection and 1011 session among all their mount points. Or, clients may support 1012 trunking, where additional connections are created but all within a 1013 single session. Alternatively, the client may choose to create 1014 multiple sessions, each tuned to the buffering and reliability 1015 needs of the mount point. For example, a readonly mount can 1016 sharply reduce its write buffering and also imposes no requirement 1017 on the server to support reliable duplicate request caching. 1019 Similarly, the client can choose among several strategies for 1020 clientid usage. Sessions can share a single clientid, or create 1021 new clientids as the client deems appropriate. For kernel-based 1022 clients which service multiple authenticated users, a single 1023 clientid shared across all mount points is generally the most 1024 appropriate and flexible approach. For example, all the client's 1025 file operations may wish to share locking state and the local 1026 client kernel takes the responsibility for arbitrating access 1027 locally. For clients choosing to support other authentication 1028 models, for example userspace implementations, a new clientid 1029 is indicated. Through use of session create options, both models 1030 are supported at the client's choice. 1032 Since the session is explicitly created and destroyed by the 1033 client, and each client is uniquely identified, the server may be 1034 specifically instructed to discard unneeded persistent state. For 1035 this reason, it is possible that a server will retain any previous 1036 state indefinitely, and place its destruction under administrative 1037 control. Or, a server may choose to retain state for some 1038 configurable period, provided that the period meets other NFSv4 1039 requirements such as lease reclamation time, etc.
However, since 1040 discarding this state at the server may affect the correctness of 1041 the server as seen by the client across network partitioning, such 1042 discarding of state should be done only in a conservative manner. 1044 Each client request to the server carries a new SEQUENCE operation 1045 within each COMPOUND, which provides the session context. This 1046 session context then governs the request control, duplicate request 1047 caching, and other persistent parameters managed by the server for 1048 a session. 1050 2.3.1. TCP Connection Model 1052 The following is a schematic diagram of the NFSv4.1 protocol 1053 exchanges leading up to normal operation on a TCP stream. 1055 Client Server 1056 TCPmode : Create Clientid(nfs_client_id4) : TCPmode 1057 : ------------------------------> : 1058 : : 1059 : Clientid reply(clientid, ...) : 1060 : <------------------------------ : 1061 : : 1062 : Create Session(clientid, size S, : 1063 : maxreq N, STREAM, ...) : 1064 : ------------------------------> : 1065 : : 1066 : Session reply(sessionid, size S', : 1067 : maxreq N') : 1068 : <------------------------------ : 1069 : : 1070 : : 1071 : ------------------------------> : 1072 : <------------------------------ : 1073 : : : 1075 No net additional exchange is added to the initial negotiation by 1076 this proposal. In the NFSv4.1 exchange, the CREATECLIENTID 1077 replaces SETCLIENTID (eliding the callback "clientaddr4" 1078 addressing) and CREATESESSION subsumes the function of 1079 SETCLIENTID_CONFIRM, as described elsewhere in this document. 1080 Callback channel binding is optional, as in NFSv4.0. Note that the 1081 STREAM transport type is shown above, but since the transport mode 1082 remains unchanged and transport attributes are not necessarily 1083 exchanged, DEFAULT could also be passed. 1085 2.3.2. 
Negotiated RDMA Connection Model 1087 One possible design which has been considered is to have a 1088 "negotiated" RDMA connection model, supported via use of a session 1089 bind operation as a required first step. However due to issues 1090 mentioned earlier, this proved problematic. This section remains 1091 as a reminder of that fact, and it is possible such a mode can be 1092 supported. 1094 It is not considered critical that this be supported for two 1095 reasons. One, the session persistence provides a way for the 1096 server to remember important session parameters, such as sizes and 1097 maximum request counts. These values can be used to restore the 1098 endpoint prior to making the first reply. Two, there are currently 1099 no critical RDMA parameters to set in the endpoint at the server 1100 side of the connection. RDMA Read resources, which are in general 1101 not settable after entering RDMA mode, are set only at the client - 1102 the originator of the connection. Therefore as long as the RDMA 1103 provider supports an automatic RDMA connection mode, no further 1104 support is required from the NFSv4.1 protocol for reconnection. 1106 Note, the client must provide at least as many RDMA Read resources 1107 to its local queue for the benefit of the server when reconnecting, 1108 as it used when negotiating the session. If this value is no 1109 longer appropriate, the client should resynchronize its session 1110 state, destroy the existing session, and start over with the more 1111 appropriate values. 1113 2.3.3. Automatic RDMA Connection Model 1115 The following is a schematic diagram of the NFSv4.1 protocol 1116 exchanges performed on an RDMA connection. 1118 Client Server 1119 RDMAmode : : : RDMAmode 1120 : : : 1121 Prepost : : : Prepost 1122 receive : : : receive 1123 : : 1124 : Create Clientid(nfs_client_id4) : 1125 : ------------------------------> : 1126 : : Prepost 1127 : Clientid reply(clientid, ...) 
: receive 1128 : <------------------------------ : 1129 Prepost : : 1130 receive : Create Session(clientid, size S, : 1131 : maxreq N, RDMA ...) : 1132 : ------------------------------> : 1133 : : Prepost <=N' 1134 : Session reply(sessionid, size S', : receives of 1135 : maxreq N') : size S' 1136 : <------------------------------ : 1137 : : 1138 : : 1139 : ------------------------------> : 1140 : <------------------------------ : 1141 : : : 1143 2.4. Buffer Management, Transfer, Flow Control 1145 Inline operations in NFSv4.1 behave effectively the same as TCP 1146 sends. Procedure results are passed in a single message, and its 1147 completion at the client signals the receiving process to inspect 1148 the message. 1150 RDMA operations are performed solely by the server in this 1151 proposal, as described in the previous "RDMA Direct Model" section. 1152 Since server RDMA operations do not result in a completion at the 1153 client, and due to ordering rules in RDMA transports, after all 1154 required RDMA operations are complete, a Send (Send with Solicited 1155 Event for iWARP) containing the procedure results is performed from 1156 server to client. This Send operation will result in a completion 1157 which will signal the client to inspect the message. 1159 In the case of client read-type NFSv4 operations, the server will 1160 have issued RDMA Writes to transfer the resulting data into client- 1161 advertised buffers. The subsequent Send operation performs two 1162 necessary functions: finalizing any active or pending DMA at the 1163 client, and signaling the client to inspect the message. 1165 In the case of client write-type NFSv4 operations, the server will 1166 have issued RDMA Reads to fetch the data from the client-advertised 1167 buffers. No data consistency issues arise at the client, but the 1168 completion of the transfer must be acknowledged, again by a Send 1169 from server to client.
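The read-type and write-type flows above share one ordering rule: the trailing Send is what finalizes the transfer and signals the client. A toy sketch of the server side of a client READ; the `FakeRdma` recorder is an assumed stand-in for a real RDMA endpoint, which this draft does not define:

```python
from dataclasses import dataclass, field

@dataclass
class FakeRdma:
    """In-order recorder standing in for an RDMA endpoint; real
    consistency semantics come from the transport, this only
    captures operation order."""
    log: list = field(default_factory=list)

    def write(self, remote_buf, data):  # RDMA Write: no client completion
        self.log.append(("write", remote_buf, data))

    def send(self, msg):                # Send: completes at the client
        self.log.append(("send", msg))

def send_read_reply(rdma, chunks, client_bufs, results_msg):
    """Server side of a client READ: every RDMA Write into the
    client-advertised buffers is issued first; only the final Send
    finalizes the DMA at the client and signals it to inspect the
    procedure results."""
    for buf, chunk in zip(client_bufs, chunks):
        rdma.write(buf, chunk)
    rdma.send(results_msg)
```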
1171 In either case, the client advertises buffers for direct (RDMA 1172 style) operations. The client may desire certain advertisement 1173 limits, and may wish the server to perform remote invalidation on 1174 its behalf when the server has completed its RDMA. This may be 1175 considered in a future version of this draft. 1177 In the absence of remote invalidation, the client may perform its 1178 own, local invalidation after the operation completes. This 1179 invalidation should occur prior to any RPCSEC GSS integrity 1180 checking, since a buffer which remains validly remotely accessible can 1181 be modified by the peer. However, once the buffer has been invalidated and its 1182 contents integrity-checked, the contents are locally secure. 1184 Credit updates over RDMA transports are supported at the RPC layer 1185 as described in [RPCRDMA]. In each request, the client requests a 1186 desired number of credits to be made available to the connection on 1187 which it sends the request. The client must not send more requests 1188 than the number which the server has previously advertised, or, in 1189 the case of the first request, more than one. If the client exceeds its 1190 credit limit, the connection may close with a fatal RDMA error. 1192 The server then executes the request, and replies with an updated 1193 credit count accompanying its results. Since replies are sequenced 1194 by their RDMA Send order, the most recent results always reflect 1195 the server's limit. In this way the client will always know the 1196 maximum number of requests it may safely post. 1198 Because the client requests an arbitrary credit count in each 1199 request, it is relatively easy for the client to request more, or 1200 fewer, credits to match its expected need. A client that 1201 discovered itself frequently queuing outgoing requests due to lack 1202 of server credits might increase its requested credits 1203 proportionately in response. Or, a client might have a simple, 1204 configurable number.
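A client-side sketch of this credit exchange follows; the class and method names are illustrative, as the actual credit fields are defined in [RPCRDMA]:

```python
class CreditWindow:
    """Track the server's advertised credit limit on one connection:
    at most one request may be outstanding before the first reply,
    and each reply carries the server's current limit."""
    def __init__(self):
        self.limit = 1        # before any reply, only one request
        self.inflight = 0

    def can_send(self) -> bool:
        return self.inflight < self.limit

    def send_request(self):
        if not self.can_send():
            # Exceeding the advertised limit risks a fatal RDMA error.
            raise RuntimeError("credit limit exceeded")
        self.inflight += 1

    def reply_received(self, advertised_limit: int):
        self.inflight -= 1
        # Replies are sequenced by Send order, so the newest reply
        # always reflects the server's current limit.
        self.limit = advertised_limit
```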
The protocol also provides a per-operation 1205 "maxslot" exchange to assist in dynamic adjustment at the session 1206 level, described in a later section. 1208 Occasionally, a server may wish to reduce the total number of 1209 credits it offers a certain client on a connection. This could be 1210 encountered if a client were found to be consuming its credits 1211 slowly, or not at all. A client might notice this itself, and 1212 reduce its requested credits in advance, for instance requesting 1213 only the count of operations it currently has queued, plus a few as 1214 a base for starting up again. Such mechanisms can, however, be 1215 potentially complicated and are implementation-defined. The 1216 protocol does not require them. 1218 Because of the way in which RDMA fabrics function, it is not 1219 possible for the server (or client back channel) to cancel 1220 outstanding receive operations. Therefore, effectively only one 1221 credit can be withdrawn per receive completion. The server (or 1222 client back channel) would simply not replenish a receive operation 1223 when replying. The server can still reduce the available credit 1224 advertisement in its replies to the target value it desires, as a 1225 hint to the client that its credit target is lower and it should 1226 expect it to be reduced accordingly. Of course, even if it were 1227 possible to cancel outstanding receives, the server could not safely 1228 do so, since the client may have already sent requests in 1229 expectation of the previous limit. 1231 This brings out an interesting scenario similar to the client 1232 reconnect discussed earlier in "Connection Models". How does the 1233 server reduce the credits of an inactive client? 1235 One approach is for the server to simply close such a connection 1236 and require the client to reconnect at a new credit limit. This is 1237 acceptable, if inefficient, when the connection setup time is short 1238 and where the server supports persistent session semantics.
1240 A better approach is to provide a back channel request to return 1241 the operations channel credits. The server may request the client 1242 to return some number of credits; the client must comply by 1243 performing operations on the operations channel, provided of course 1244 that the request does not drop the client's credit count to zero 1245 (in which case the connection would deadlock). If the client finds 1246 that it has no requests with which to consume the credits it was 1247 previously granted, it must send zero-length Send RDMA operations, 1248 or NULL NFSv4 operations, in order to return the resources to the 1249 server. If the client fails to comply in a timely fashion, the 1250 server can recover the resources by breaking the connection. 1252 While in principle the back channel credits could be subject to a 1253 similar resource adjustment, in practice this is not an issue, 1254 since the back channel is used purely for control and is expected 1255 to be statically provisioned. 1257 It is important to note that in addition to maximum request counts, 1258 the sizes of buffers are negotiated per-session. This permits the 1259 most efficient allocation of resources on both peers. There is an 1260 important requirement on reconnection: the sizes posted by the 1261 server at reconnect must be at least as large as previously used, 1262 to allow recovery. Any replies that are replayed from the server's 1263 duplicate request cache must be able to be received into client 1264 buffers. In the case where a client has received replies to all 1265 its retried requests (and therefore received all its expected 1266 responses), then the client may disconnect and reconnect with 1267 different buffers at will, since no cache replay will be required. 1269 2.5. Retry and Replay 1271 NFSv4.0 forbids retransmission on active connections over reliable 1272 transports; this includes connected-mode RDMA. This restriction 1273 must be maintained in NFSv4.1.
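The buffer-size reconnection rule stated above reduces to a simple predicate, sketched here (illustrative only; the function name and parameters are hypothetical, not from the draft): buffer sizes may shrink across a reconnect only once no duplicate request cache replay can be required.

```python
def reconnect_buffer_size_ok(old_size, new_size, replay_possible):
    """Sketch of the reconnection rule: replies replayed from the
    server's duplicate request cache must still fit the negotiated
    buffers, so sizes may only shrink once no replay can occur.

    old_size       -- buffer size negotiated before the connection loss
    new_size       -- buffer size proposed at reconnect
    replay_possible -- True if any retried request may be answered
                       from the duplicate request cache
    """
    if replay_possible:
        return new_size >= old_size
    # All expected responses were received: any size is acceptable.
    return True
```

In this model, a client that has collected every outstanding reply before disconnecting is free to renegotiate smaller buffers, matching the final sentence of the paragraph above.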
1275 If one peer were to retransmit a request (or reply), it would 1276 consume an additional credit on the other. If the server 1277 retransmitted a reply, it would certainly result in an RDMA 1278 connection loss, since the client would typically only post a 1279 single receive buffer for each request. If the client 1280 retransmitted a request, the additional credit consumed on the 1281 server might lead to RDMA connection failure unless the client 1282 accounted for it and decreased its available credit, leading to 1283 wasted resources. 1285 RDMA credits present a new issue to the duplicate request cache in 1286 NFSv4.1. The request cache may be used when a connection within a 1287 session is lost, such as after the client reconnects. Credit 1288 information is a dynamic property of the connection, and stale 1289 values must not be replayed from the cache. This implies that the 1290 request cache contents must not be blindly used when replies are 1291 issued from it, and credit information appropriate to the channel 1292 must be refreshed by the RPC layer. 1294 Finally, RDMA fabrics do not guarantee that the memory handles 1295 (Steering Tags) within each RDMA three-tuple are valid on a scope 1296 outside that of a single connection. Therefore, handles used by 1297 the direct operations become invalid after connection loss. The 1298 server must ensure that any RDMA operations which must be replayed 1299 from the request cache use the newly provided handle(s) from the 1300 most recent request. 1302 2.6. The Back Channel 1304 The NFSv4 callback operations present a significant resource 1305 problem for the RDMA-enabled client. Clearly, callbacks must be 1306 negotiated in the way credits are for the ordinary operations 1307 channel for requests flowing from client to server. But, for 1308 callbacks to arrive on the same RDMA endpoint as operation replies 1309 would require dedicating additional resources, and specialized 1310 demultiplexing and event handling.
Or, callbacks may not require 1311 RDMA service at all (they do not normally carry substantial data 1312 payloads). It is highly desirable to streamline this critical path 1313 via a second communications channel. 1315 The session callback channel binding facility is designed for 1316 exactly such a situation, by dynamically associating a new 1317 connected endpoint with the session, and separately negotiating 1318 sizes and counts for active callback channel operations. The 1319 binding operation is firewall-friendly since it does not require 1320 the server to initiate the connection. 1322 This same method serves as well for ordinary TCP connection mode. 1323 It is expected that all NFSv4.1 clients may make use of the session 1324 facility to streamline their design. 1326 The back channel functions exactly the same as the operations 1327 channel except that no RDMA operations are required to perform 1328 transfers; instead, the sizes are required to be sufficiently large 1329 to carry all data inline, and of course the client and server 1330 reverse their roles with respect to which is in control of credit 1331 management. The same rules apply for all transfers, with the 1332 server being required to flow control its callback requests. 1334 The back channel is optional. If not bound on a given session, the 1335 server must not issue callback operations to the client. This in 1336 turn implies that such a client must never put itself in the 1337 situation where the server will need to do so, lest the client lose 1338 its connection by force, or its operation be incorrect. For the 1339 same reason, if a back channel is bound, the client is subject to 1340 revocation of its delegations if the back channel is lost. Any 1341 connection loss should be corrected by the client as soon as 1342 possible.
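One plausible consequence of these rules (an inference sketched here, not text from the draft; all names are hypothetical) is that a server would decline to grant callback-dependent state, such as delegations, on a session with no bound back channel, and would revoke such state if the channel is lost:

```python
class ToySession:
    """Toy model of the back channel rules above: no callbacks without
    a bound back channel, and delegations subject to revocation on
    back channel loss."""

    def __init__(self, back_channel_bound=False):
        self.back_channel_bound = back_channel_bound
        self.delegations = set()

    def grant_delegation(self, filehandle):
        # Without a back channel the server could never recall the
        # delegation, so it should not grant one.
        if not self.back_channel_bound:
            raise RuntimeError("no back channel bound on session")
        self.delegations.add(filehandle)

    def on_back_channel_loss(self):
        # Delegations are subject to revocation if the channel is lost.
        revoked, self.delegations = self.delegations, set()
        self.back_channel_bound = False
        return revoked
```

This is only a sketch of the policy question; the draft leaves the precise server behavior to the binding and revocation rules stated in the text.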
1344 This can be convenient for the NFSv4.1 client; if the client 1345 expects to make no use of back channel facilities such as 1346 delegations, then there is no need to create it. This may save 1347 significant resources and complexity at the client. 1349 For these reasons, if the client wishes to use the back channel, 1350 that channel must be bound first, before using the operations 1351 channel. In this way, the server will not find itself in a 1352 position where it will send callbacks on the operations channel 1353 when the client is not prepared for them. 1355 There is one special case: that in which the back channel is bound 1356 to the same connection as the operations channel. This configuration 1357 would be used normally over a TCP stream connection to exactly 1358 implement the NFSv4.0 behavior, but over RDMA would require complex 1359 resource and event management at both sides of the connection. The 1360 server is not required to accept such a bind request on an RDMA 1361 connection for this reason, though it is recommended. 1363 2.7. COMPOUND Sizing Issues 1365 Very large responses may pose duplicate request cache issues. 1366 Since servers will want to bound the storage required for such a 1367 cache, the unlimited size of response data in COMPOUND may be 1368 troublesome. If COMPOUND is used in all its generality, then the 1369 inclusion of certain non-idempotent operations within a single 1370 COMPOUND request may render the entire request non-idempotent. 1371 (For example, a single COMPOUND request which read a file or 1372 symbolic link, then removed it, would be obliged to cache the data 1373 in order to allow identical replay.) Moreover, many requests 1374 might include operations that return an arbitrary amount of data. 1376 It is not satisfactory for the server to reject COMPOUNDs at will 1377 with NFS4ERR_RESOURCE when they pose such difficulties for the 1378 server, as this results in serious interoperability problems.
1379 Instead, any such limits must be explicitly exposed as attributes 1380 of the session, ensuring that the server can explicitly support any 1381 duplicate request cache needs at all times. 1383 2.8. Data Alignment 1385 A negotiated data alignment enables certain scatter/gather 1386 optimizations. A facility for this is supported by [RPCRDMA]. 1387 Where NFS file data is the payload, specific optimizations become 1388 highly attractive. 1390 Header padding is requested by each peer at session initiation, and 1391 may be zero (no padding). Padding leverages the useful property 1392 that RDMA receives preserve alignment of data, even when they are 1393 placed into anonymous (untagged) buffers. If requested, client 1394 inline writes will insert appropriate pad bytes within the request 1395 header to align the data payload on the specified boundary. The 1396 client is encouraged to be optimistic and simply pad all WRITEs 1397 within the RPC layer to the negotiated size, in the expectation 1398 that the server can use them efficiently. 1400 It is highly recommended that clients offer to pad headers to an 1401 appropriate size. Most servers can make good use of such padding, 1402 which allows them to chain receive buffers in such a way that any 1403 data carried by client requests will be placed into appropriate 1404 buffers at the server, ready for filesystem processing. The 1405 receiver's RPC layer encounters no overhead from skipping over pad 1406 bytes, and the RDMA layer's high performance makes the insertion 1407 and transmission of padding on the sender a significant 1408 optimization. In this way, the need for servers to perform RDMA 1409 Read to satisfy all but the largest client writes is obviated. An 1410 added benefit is the reduction of message roundtrips on the network 1411 - a potentially good trade, where latency is present. 1413 The value to choose for padding is subject to a number of criteria. 
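The padding arithmetic itself is straightforward; a minimal sketch (illustrative helper, not part of the draft or of [RPCRDMA]) computes how many pad bytes the sender must insert so the data payload starts on the negotiated boundary:

```python
def pad_bytes(header_len, align):
    """Number of pad bytes to insert after an RPC header of
    `header_len` bytes so the data payload begins on an `align`-byte
    boundary.  align == 0 means no padding was negotiated."""
    if align == 0:
        return 0
    return (align - header_len % align) % align
```

For example, a 100-byte header with a negotiated 32-byte alignment needs 28 pad bytes; a header already on the boundary needs none. The hard part, as the text goes on to discuss, is choosing the alignment value itself, since header length varies with authentication data and COMPOUND contents.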
1414 A primary source of variable-length data in the RPC header is the 1415 authentication information, the form of which is client-determined, 1416 possibly in response to server specification. The contents of 1417 COMPOUNDs, sizes of strings such as those passed to RENAME, etc., 1418 all go into the determination of a maximal NFSv4 request size and 1419 therefore minimal buffer size. The client must select its offered 1420 value carefully, so as not to overburden the server, and vice 1421 versa. The payoff of an appropriate padding value is higher 1422 performance. 1424 Sender gather: 1425 |RPC Request|Pad bytes|Length| -> |User data...| 1426 \------+---------------------/ \ 1427 \ \ 1428 \ Receiver scatter: \-----------+- ... 1429 /-----+----------------\ \ \ 1430 |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->... 1432 In the above case, the server may recycle to the 1433 next posted receive any buffers left unused by the actual received request, or 1434 may pass the now-complete buffers by reference for normal write 1435 processing. For a server which can make use of it, this removes 1436 any need for data copies of incoming data, without resorting to 1437 complicated end-to-end buffer advertisement and management. This 1438 includes most kernel-based and integrated server designs, among 1439 many others. The client may perform similar optimizations, if 1440 desired. 1442 Padding is negotiated by the session creation operation, and 1443 subsequently used by the RPC RDMA layer, as described in [RPCRDMA]. 1445 3. NFSv4 Integration 1447 The following section discusses the integration of the proposed 1448 RDMA extensions with NFSv4.0. 1450 3.1. Minor Versioning 1452 Minor versioning is the existing facility to extend the NFSv4 1453 protocol, and this proposal takes that approach. 1455 Minor versioning of NFSv4 is relatively restrictive, and allows for 1456 tightly limited changes only.
In particular, it does not permit 1457 adding new "procedures" (it permits adding only new "operations"). 1458 Interoperability concerns make it impossible to consider additional 1459 layering to be a minor revision. This somewhat limits the changes 1460 that can be proposed when considering extensions. 1462 To support the duplicate request cache integrated with sessions and 1463 request control, it is desirable to tag each request with an 1464 identifier to be called a Slotid. This identifier must be passed 1465 by NFSv4 when running atop any transport, including traditional 1466 TCP. Therefore it is not desirable to add the Slotid to a new RPC 1467 transport, even though such a transport is indicated for support of 1468 RDMA. This draft and [RPCRDMA] do not propose such an approach. 1470 Instead, this proposal conforms to the requirements of NFSv4 minor 1471 versioning, through the use of a new operation within NFSv4 1472 COMPOUND procedures as detailed below. 1474 If sessions are in use for a given clientid, this same clientid 1475 cannot be used for non-session NFSv4 operation, including NFSv4.0. 1476 Because the server will have allocated session-specific state to 1477 the active clientid, it would be an unnecessary burden on the 1478 server implementor to support and account for additional, non- 1479 session traffic, in addition to being of no benefit. Therefore 1480 this proposal prohibits a single clientid from doing this. 1481 Nevertheless, employing a new clientid for such traffic is 1482 supported. 1484 3.2. Slot Identifiers and Server Duplicate Request Cache 1486 The presence of deterministic maximum request limits on a session 1487 enables in-progress requests to be assigned unique values with 1488 useful properties. 1490 The RPC layer provides a transaction ID (xid), which, while 1491 required to be unique, is not especially convenient for tracking 1492 requests. 
The transaction ID is only meaningful to the issuer 1493 (client); it cannot be interpreted at the server except to test for 1494 equality with previously issued requests. Because RPC operations 1495 may be completed by the server in any order, many transaction IDs 1496 may be outstanding at any time. The client may therefore perform a 1497 computationally expensive lookup operation in the process of 1498 demultiplexing each reply. 1500 In the proposal, there is a limit to the number of active requests. 1501 This immediately enables a convenient, computationally efficient 1502 index for each request which is designated as a Slot Identifier, or 1503 slotid. 1505 When the client issues a new request, it selects a slotid in the 1506 range 0..N-1, where N is the server's current "totalrequests" limit 1507 granted the client on the session over which the request is to be 1508 issued. The slotid must be unused by any of the requests which the 1509 client has already active on the session. "Unused" here means the 1510 client has no outstanding request for that slotid. Because the 1511 slotid is always an integer in the range 0..N-1, client 1512 implementations can use the slotid from a server response to 1513 efficiently match responses with outstanding requests, such as, for 1514 example, by using the slotid to index into an outstanding request 1515 array. This can be used to avoid expensive hashing and lookup 1516 functions in the performance-critical receive path. 1518 The sequenceid, which accompanies the slotid in each request, is 1519 essential for a second check at the server: the server must be 1520 able to determine efficiently whether a request using a certain 1521 slotid is a retransmit or a new, never-before-seen request. It is 1522 not feasible for the client to assert that it is retransmitting to 1523 implement this, because for any given request the client cannot 1524 know whether the server has seen it unless the server actually replies.
Of 1525 course, if the client has seen the server's reply, the client would 1526 not retransmit! 1528 The sequenceid must increase monotonically for each new transmit of 1529 a given slotid, and must remain unchanged for any retransmission. 1530 The server must in turn compare each newly received request's 1531 sequenceid with the last one previously received for that slotid, 1532 to see if the new request is: 1534 o A new request, in which the sequenceid is greater than that 1535 previously seen in the slot (accounting for sequence 1536 wraparound). The server proceeds to execute the new request. 1538 o A retransmitted request, in which the sequenceid is equal to 1539 that last seen in the slot. Note that this request may be 1540 either complete, or in progress. The server performs replay 1541 processing in these cases. 1543 o A misordered duplicate, in which the sequenceid is less than 1544 that previously seen in the slot. The server must drop the 1545 incoming request, which may imply dropping the connection if 1546 the transport is reliable, as dictated by section 3.1.1 of 1547 [RFC3530]. 1549 This last condition is possible on any connection, not just 1550 unreliable, unordered transports. Delayed behavior on abandoned 1551 TCP connections which are not yet closed at the server, or 1552 pathological client implementations can cause it, among other 1553 causes. Therefore, the server may wish to harden itself against 1554 certain repeated occurrences of this, as it would for 1555 retransmissions in [RFC3530]. 1557 It is recommended, though not necessary for protocol correctness, 1558 that the client simply increment the sequenceid by one for each new 1559 request on each slotid. This reduces the wraparound window to a 1560 minimum, and is useful for tracing and avoidance of possible 1561 implementation errors. 1563 The client may however, for implementation-specific reasons, choose 1564 a different algorithm. 
For example, it might maintain a single 1565 sequence space for all slots in the session - e.g. employing the 1566 RPC XID itself. The sequenceid, in any case, is never interpreted 1567 by the server for anything but to test by comparison with 1568 previously seen values. 1570 The server may thereby use the slotid, in conjunction with the 1571 sessionid and sequenceid, within the SEQUENCE portion of the 1572 request to maintain its duplicate request cache (DRC) for the 1573 session, as opposed to the traditional approach of ONC RPC 1574 applications that use the XID along with certain transport 1575 information [RW96]. 1577 Unlike the XID, the slotid is always within a specific range; this 1578 has two implications. The first implication is that for a given 1579 session, the server need only cache the results of a limited number 1580 of COMPOUND requests. The second implication derives from the 1581 first: unlike XID-indexed DRCs, the slotid DRC by its 1582 nature cannot be overflowed. Notably, through use of the sequenceid 1583 to identify retransmitted requests, the server 1584 need not actually cache the request itself, further reducing the storage 1585 requirements of the DRC. These new facilities make it 1586 practical to maintain all the required entries for an effective 1587 DRC. 1589 The slotid and sequenceid therefore take over the traditional role 1590 of the port number in the server DRC implementation, and the 1591 session replaces the IP address. This approach is considerably 1592 more portable and completely robust - it is not subject to the 1593 frequent reassignment of ports as clients reconnect over IP 1594 networks. In addition, the RPC XID is not used in the reply cache, 1595 enhancing robustness of the cache in the face of any rapid reuse of 1596 XIDs by the client.
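The server's three-way sequenceid check described above can be sketched as follows (a hypothetical helper, not from the draft; it assumes the sequenceid is an unsigned 32-bit value, so "greater" is judged by the forward distance modulo 2**32):

```python
def classify_request(seqid, last_seqid):
    """Classify an incoming request for a slot whose last recorded
    sequenceid is `last_seqid`, per the three cases above."""
    if seqid == last_seqid:
        # Retransmitted request: reply from the DRC (the original may
        # be complete or still in progress).
        return "RETRANSMIT"
    # "Greater, accounting for wraparound": the forward distance from
    # last_seqid to seqid is less than half the 32-bit sequence space.
    if (seqid - last_seqid) % (1 << 32) < (1 << 31):
        # New request: execute it and cache the reply in this slot.
        return "NEW"
    # Misordered duplicate: drop the request (and possibly the
    # connection, per section 3.1.1 of [RFC3530]).
    return "MISORDERED"
```

With the recommended increment-by-one policy, the wraparound window is as small as possible and almost every incoming request falls trivially into the first or second case.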
1598 It is required to encode the slotid information into each request 1599 in a way that does not violate the minor versioning rules of the 1600 NFSv4.0 specification. This is accomplished here by encoding it in 1601 a control operation within each NFSv4.1 COMPOUND and CB_COMPOUND 1602 procedure. The operation easily piggybacks within existing 1603 messages. The implementation section of this document describes 1604 the specific proposal. 1606 In general, the receipt of a new sequenced request arriving on any 1607 valid slot is an indication that the previous DRC contents of that 1608 slot may be discarded. In order to further assist the server in 1609 slot management, the client is required to use the lowest available 1610 slot when issuing a new request. In this way, the server may be 1611 able to retire additional entries. 1613 However, in the case where the server is actively adjusting its 1614 granted maximum request count to the client, it may not be able to 1615 use receipt of the slotid to retire cache entries. The slotid used 1616 in an incoming request may not reflect the server's current idea of 1617 the client's session limit, because the request may have been sent 1618 from the client before the update was received. Therefore, in the 1619 downward adjustment case, the server may have to retain a number of 1620 duplicate request cache entries at least as large as the old value, 1621 until operation sequencing rules allow it to infer that the client 1622 has seen its reply. 1624 The SEQUENCE (and CB_SEQUENCE) operation also carries a "maxslot" 1625 value which carries additional client slot usage information. The 1626 client must always provide its highest-numbered outstanding slot 1627 value in the maxslot argument, and the server may reply with a new 1628 recognized value. 
The client should in all cases provide the most 1629 conservative value possible, although it can be increased somewhat 1630 above the actual instantaneous usage to maintain some minimum or 1631 optimal level. This provides a way for the client to yield unused 1632 request slots back to the server, which in turn can use the 1633 information to reallocate resources. Obviously, maxslot can never 1634 be zero, or the session would deadlock. 1636 The server also provides a target maxslot value to the client, 1637 which is an indication to the client of the maxslot the server 1638 wishes the client to be using. This permits the server to withdraw 1639 resources from (or add resources to) a client that has been found 1640 not to be using them, in order to more fairly share resources among a varying 1641 level of demand from other clients. The client must always comply 1642 with the server's value updates, since they indicate newly 1643 established hard limits on the client's access to session 1644 resources. However, because of request pipelining, the client may 1645 have active requests in flight reflecting prior values; therefore, 1646 the server must not immediately require the client to comply. 1648 It is worthwhile to note that Sprite RPC [BW87] defined a "channel" 1649 which in some ways is similar to the slotid proposed here. Sprite 1650 RPC used channels to implement parallel request processing and 1651 request/response cache retirement. 1653 3.3. COMPOUND and CB_COMPOUND 1655 Support for per-operation control can be piggybacked onto NFSv4 1656 COMPOUNDs with full transparency, by placing such facilities into 1657 their own, new operation, and placing this operation first in each 1658 COMPOUND under the new NFSv4 minor protocol revision. The contents 1659 of the operation would then apply to the entire COMPOUND. 1661 Recall that the NFSv4 minor revision is contained within the 1662 COMPOUND header, encoded prior to the COMPOUNDed operations.
By 1663 simply requiring that the new operation always be contained in 1664 NFSv4 minor COMPOUNDs, the control protocol can piggyback perfectly 1665 with each request and response. 1667 In this way, the NFSv4 RDMA Extensions may stay in compliance with 1668 the minor versioning requirements specified in section 10 of 1669 [RFC3530]. 1671 Referring to section 13.1 of the same document, the proposed 1672 session-enabled COMPOUND and CB_COMPOUND have the form: 1674 +-----+--------------+-----------+------------+-----------+---- 1675 | tag | minorversion | numops | control op | op + args | ... 1676 | | (== 1) | (limited) | + args | | 1677 +-----+--------------+-----------+------------+-----------+---- 1679 and the reply's structure is: 1681 +------------+-----+--------+-------------------------------+--// 1682 |last status | tag | numres | status + control op + results | // 1683 +------------+-----+--------+-------------------------------+--// 1684 //-----------------------+---- 1685 // status + op + results | ... 1686 //-----------------------+---- 1688 The single control operation within each NFSv4.1 COMPOUND defines 1689 the context and operational session parameters which govern that 1690 COMPOUND request and reply. Placing it first in the COMPOUND 1691 encoding is required in order to allow its processing before other 1692 operations in the COMPOUND. 1694 3.4. eXternal Data Representation Efficiency 1696 RDMA is a copy avoidance technology, and it is important to 1697 maintain this efficiency when decoding received messages. 1698 Traditional XDR implementations frequently use generated 1699 unmarshaling code to convert objects to local form, incurring a 1700 data copy in the process (in addition to subjecting the caller to 1701 recursive calls, etc). Often, such conversions are carried out 1702 even when no size or byte order conversion is necessary. 
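By contrast, a receiver can inspect fixed-size XDR fields in place. A minimal Python sketch (illustrative only; field layout and names are invented for the example) shows the idea: only the word being examined is byte-order converted into a local value, and the received buffer is neither copied nor rewritten.

```python
import struct

def word_at(buffer, offset):
    """Inspect one 32-bit XDR word in place.  XDR encodes integers in
    big-endian order; unpack_from converts just this word into a local
    value without copying or modifying the buffer."""
    (value,) = struct.unpack_from(">I", buffer, offset)
    return value

# A toy received message of three XDR words (field names illustrative):
# e.g. minorversion, status, and a result value.
msg = struct.pack(">III", 1, 0, 42)
```

The same discipline in C amounts to applying `ntohl()` to a word read directly from the receive buffer, rather than unmarshaling the whole structure into a freshly allocated copy.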
1704 It is recommended that implementations pay close attention to the 1705 details of memory referencing in such code. It is far more 1706 efficient to inspect data in place, using native facilities to deal 1707 with word size and byte order conversion into registers or local 1708 variables, rather than formally (and blindly) performing the 1709 operation via fetch, reallocate and store. 1711 Of particular concern is the result of the READDIR operation, in 1712 which such encoding abounds. 1714 3.5. Effect of Sessions on Existing Operations 1716 The use of a session replaces the use of the SETCLIENTID and 1717 SETCLIENTID_CONFIRM operations, and allows certain simplification 1718 of the RENEW and callback addressing mechanisms in the base 1719 protocol. 1721 The cb_program and cb_location which are obtained by the server in 1722 SETCLIENTID_CONFIRM must not be used by the server, because the 1723 NFSv4.1 client performs callback channel designation with 1724 BIND_BACKCHANNEL. Therefore the SETCLIENTID and 1725 SETCLIENTID_CONFIRM operations become obsolete when sessions are 1726 in use, and a server should return an error to NFSv4.1 clients 1727 which might issue either operation. 1729 Another favorable result of the session is that the server is able 1730 to avoid requiring the client to perform OPEN_CONFIRM operations. 1731 The existence of a reliable and effective DRC means that the server 1732 will be able to determine whether an OPEN request carrying a 1733 previously known open_owner from a client is or is not a 1734 retransmission. Because of this, the server no longer requires 1735 OPEN_CONFIRM to verify whether the client is retransmitting an open 1736 request. This in turn eliminates the server's reason for 1737 requesting OPEN_CONFIRM - the server can simply replace any 1738 previous information on this open_owner.
Client OPEN operations 1739 are therefore streamlined, reducing overhead and latency through 1740 avoiding the additional OPEN_CONFIRM exchange. 1742 Since the session carries the client liveness indication with it 1743 implicitly, any request on a session associated with a given client 1744 will renew that client's leases. Therefore the RENEW operation is 1745 made unnecessary when a session is present, as any request 1746 (including a SEQUENCE operation with or without additional NFSv4 1747 operations) performs its function. It is possible (though this 1748 proposal does not make any recommendation) that the RENEW operation 1749 could be made obsolete. 1751 An interesting issue arises however if an error occurs on such a 1752 SEQUENCE operation. If the SEQUENCE operation fails, perhaps due 1753 to an invalid slotid or other non-renewal-based issue, the server 1754 may or may not have performed the RENEW. In this case, the state 1755 of any renewal is undefined, and the client should make no 1756 assumption that it has been performed. In practice, this should 1757 not occur but even if it did, it is expected the client would 1758 perform some sort of recovery which would result in a new, 1759 successful, SEQUENCE operation being run and the client assured 1760 that the renewal took place. 1762 3.6. Authentication Efficiencies 1764 NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor 1765 [RFC2203] to provide authentication, integrity, and privacy via 1766 cryptography. The server dictates to the client the use of 1767 RPCSEC_GSS, the service (authentication, integrity, or privacy), 1768 and the specific GSS-API security mechanism that each remote 1769 procedure call and result will use. 
1771 If the connection's integrity is protected by a means additional 1772 to RPCSEC_GSS, such as via IPsec, then the use of RPCSEC_GSS's 1773 integrity service is nearly redundant (see the Security 1774 Considerations section for more explanation of why it is "nearly" 1775 and not completely redundant). Likewise, if the connection's 1776 privacy is protected by additional means, then the use of both 1777 RPCSEC_GSS's integrity and privacy services is nearly redundant. 1779 Connection protection schemes, such as IPsec, are more likely to be 1780 implemented in hardware than upper layer protocols like RPCSEC_GSS. 1781 Hardware-based cryptography at the IPsec layer will be more 1782 efficient than software-based cryptography at the RPCSEC_GSS layer. 1784 When transport integrity can be obtained, it is possible for server 1785 and client to downgrade their per-operation authentication, after 1786 an appropriate exchange. This downgrade can in fact be as complete 1787 as to establish security mechanisms that have zero cryptographic 1788 overhead, effectively using the underlying integrity and privacy 1789 services provided by the transport. 1791 Based on the above observations, a new GSS-API mechanism, called 1792 the Channel Conjunction Mechanism [CCM], is being defined. The CCM 1793 works by creating a GSS-API security context using as input a 1794 cookie that the initiator and target have previously agreed to be a 1795 handle for a GSS-API context created previously over another GSS-API 1796 mechanism. 1798 NFSv4.1 clients and servers should support CCM, and they must use as 1799 the cookie the handle from a successful RPCSEC_GSS context creation 1800 over a non-CCM mechanism (such as Kerberos V5). The value of the 1801 cookie will be equal to the handle field of the rpc_gss_init_res 1802 structure from the RPCSEC_GSS specification. 1804 The [CCM] Draft provides further discussion and examples. 1806 4.
Security Considerations 1808 The NFSv4 minor version 1 retains all existing NFSv4 security; 1809 all security considerations present in NFSv4.0 apply to it equally. 1811 Security considerations of any underlying RDMA transport are 1812 additionally important, all the more so due to the emerging nature 1813 of such transports. Examining these issues is outside the scope of 1814 this draft. 1816 When protecting a connection with RPCSEC_GSS, all data in each 1817 request and response (whether transferred inline or via RDMA) 1818 continues to receive this protection over RDMA fabrics [RPCRDMA]. 1819 However, when performing data transfers via RDMA, RPCSEC_GSS 1820 protection of the data transfer portion works against the 1821 efficiency which RDMA is typically employed to achieve. This is 1822 because such data is normally managed solely by the RDMA fabric, 1823 and intentionally is not touched by software. Therefore, when 1824 employing RPCSEC_GSS under CCM, and where integrity protection has 1825 been "downgraded", the cooperation of the RDMA transport provider 1826 is critical to maintain any integrity and privacy otherwise in 1827 place for the session. The means by which the local RPCSEC_GSS 1828 implementation is integrated with the RDMA data protection 1829 facilities are outside the scope of this draft. 1831 It is logical to use the same GSS context on a session's callback 1832 channel as that used on its operations channel(s), particularly 1833 when the connection is shared by both. The client must indicate to 1834 the server: 1836 - what security flavor(s) to use in the callback. A special 1837 callback flavor might be defined for this. 1839 - if the flavor is RPCSEC_GSS, then the client must have previously 1840 created an RPCSEC_GSS session with the server. The client offers to 1841 the server the opaque handle<> value from the rpc_gss_init_res 1842 structure, the window size of RPCSEC_GSS sequence numbers, and an 1843 opaque gss_cb_handle.
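As a non-normative illustration, the information the client would offer when binding RPCSEC_GSS security to the callback channel can be sketched as follows; the structure and all field names are hypothetical, not part of any proposed XDR.

```python
from dataclasses import dataclass

# Hypothetical sketch (not proposed XDR): the values a client offers the
# server to bind security to the callback channel, per the list above.
@dataclass
class CallbackSecurityOffer:
    flavor: int               # callback security flavor, e.g. RPCSEC_GSS
    rpcsec_gss_handle: bytes  # opaque handle<> from rpc_gss_init_res
    seq_window: int           # RPCSEC_GSS sequence number window size
    gss_cb_handle: bytes      # opaque callback handle
```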
1845 This exchange can be performed as part of session and clientid 1846 creation, and the issue warrants careful analysis before being 1847 specified. 1849 If the NFS client wishes to maintain full control over RPCSEC_GSS 1850 protection, it may still perform its transfer operations using 1851 either the inline or RDMA transfer model, or of course employ 1852 traditional TCP stream operation. In the RDMA inline case, header 1853 padding is recommended to optimize behavior at the server. At the 1854 client, close attention should be paid to the implementation of 1855 RPCSEC_GSS processing to minimize memory referencing and especially 1856 copying. These are well-advised in any case! 1858 The proposed session callback channel binding improves security 1859 over that provided by NFSv4 for the callback channel. The 1860 connection is client-initiated, and subject to the same firewall 1861 and routing checks as the operations channel. The connection 1862 cannot be hijacked by an attacker who connects to the client port 1863 prior to the intended server. The connection is set up by the 1864 client with its desired attributes, such as optionally securing 1865 with IPsec or similar. The binding is fully authenticated before 1866 being activated. 1868 4.1. Authentication 1870 Proper authentication of the principal which issues any session and 1871 clientid in the proposed NFSv4.1 operations exactly follows the 1872 similar requirement on client identifiers in NFSv4.0. It must not 1873 be possible for a client to impersonate another by guessing its 1874 session identifiers for NFSv4.1 operations, nor to bind a callback 1875 channel to an existing session. To protect against this, NFSv4.0 1876 requires appropriate authentication and matching of the principal 1877 used. This is discussed in Section 16, Security Considerations of 1878 [RFC3530]. The same requirement when using a session identifier 1879 applies to NFSv4.1 here. 
1881 Going beyond NFSv4.0, the presence of a session associated with any 1882 clientid may also be used to enhance NFSv4.1 security with respect 1883 to client impersonation. In NFSv4.0, there are many operations 1884 which carry no clientid, including in particular those which employ 1885 a stateid argument. A rogue client which wished to carry out a 1886 denial of service attack on another client could perform CLOSE, 1887 DELEGRETURN, etc. operations with that client's current filehandle, 1888 sequenceid and stateid, after having obtained them by 1889 eavesdropping or some other means. Locking and open downgrade 1890 operations could be similarly attacked. 1892 When an NFSv4.1 session is in place for any clientid, 1893 countermeasures are easily applied through use of authentication by 1894 the server. Because the clientid and sessionid must be present in 1895 each request within a session, the server may verify that the 1896 clientid is in fact originating from a principal with the 1897 appropriate authenticated credentials, that the sessionid belongs 1898 to the clientid, and that the stateid is valid in these contexts. 1899 This is in general not possible with the affected operations in 1900 NFSv4.0, because the clientid is not present in the 1901 requests. 1903 In the event that authentication information is not available in 1904 the incoming request, for example after a reconnection when the 1905 security was previously downgraded using CCM, the server must 1906 require that the client re-establish authentication so that 1907 the server may validate the other client-provided context, prior to 1908 executing any operation. The sessionid, present in the newly 1909 retransmitted request, combined with the retransmission detection 1910 enabled by the NFSv4.1 duplicate request cache, provides a convenient 1911 and reliable context for the server to use for this contingency.
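A minimal, non-normative sketch of the per-request validation described above follows; the table layouts and the exact errors returned are illustrative assumptions, not protocol requirements.

```python
# Illustrative sketch of the session-enabled checks described above.
# Data layouts and error selections are assumptions for this sketch.
def validate_request(clients, sessions, stateids, req, principal):
    # 1. The clientid must originate from an authenticated principal
    #    matching the one recorded for that client.
    client = clients.get(req["clientid"])
    if client is None or client["principal"] != principal:
        return "NFS4ERR_STALE_CLIENTID"
    # 2. The sessionid must belong to that clientid.
    if sessions.get(req["sessionid"]) != req["clientid"]:
        return "NFS4ERR_BADSESSION"
    # 3. Any stateid must be valid in the context of this client.
    stateid = req.get("stateid")
    if stateid is not None and stateid not in stateids.get(req["clientid"], set()):
        return "NFS4ERR_BAD_STATEID"
    return "NFS4_OK"
```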
1913 The server should take care to protect itself against denial of 1914 service attacks in the creation of sessions and clientids. Clients 1915 who connect and create sessions, only to disconnect and never use 1916 them may leave significant state behind. (The same issue applies 1917 to NFSv4.0 with clients who may perform SETCLIENTID, then never 1918 perform SETCLIENTID_CONFIRM.) Careful authentication coupled with 1919 resource checks is highly recommended. 1921 5. IANA Considerations 1923 As a proposal based on minor protocol revision, any new minor 1924 number might be registered and reserved with the agreed-upon 1925 specification. Assigned operation numbers and any RPC constants 1926 might undergo the same process. 1928 There are no issues stemming from RDMA use itself regarding port 1929 number assignments not already specified by [RFC3530]. Initial 1930 connection is via ordinary TCP stream services, operating on the 1931 same ports and under the same set of naming services. 1933 In the Automatic RDMA connection model described above, it is 1934 possible that a new well-known port, or a new transport type 1935 assignment (netid) as described in [RFC3530], may be desirable. 1937 6. NFSv4 Protocol Extensions 1939 This section specifies details of the extensions to NFSv4 proposed 1940 by this document. Existing NFSv4 operations (under minor version 1941 0) continue to be fully supported, unmodified. 1943 6.1. 
Operation: CREATECLIENTID - Instantiate Clientid 1945 SYNOPSIS 1947 client -> clientid 1949 ARGUMENT 1951 struct CREATECLIENTID4args { 1952 nfs_client_id4 clientdesc; 1953 }; 1955 RESULT 1957 struct CREATECLIENTID4resok { 1958 clientid4 clientid; 1959 verifier4 clientid_confirm; 1960 }; 1962 union CREATECLIENTID4res switch (nfsstat4 status) { 1963 case NFS4_OK: 1964 CREATECLIENTID4resok resok4; 1965 case NFS4ERR_CLID_INUSE: 1966 void; 1967 default: 1968 void; 1969 }; 1971 DESCRIPTION 1972 The client uses the CREATECLIENTID operation to register a 1973 particular client identifier with the server. The clientid 1974 returned from this operation will be necessary for requests that 1975 create state on the server and will serve as a parent object to 1976 sessions created by the client. In order to verify the clientid it 1977 must first be used as an argument to CREATESESSION. 1979 IMPLEMENTATION 1981 A server's client record is a 5-tuple: 1983 1. clientdesc.id: 1984 The long form client identifier, sent via the clientdesc.id 1985 subfield of the CREATECLIENTID4args structure 1987 2. clientdesc.verifier: 1988 A client-specific value used to indicate reboots, sent via the 1989 clientdesc.verifier subfield of the CREATECLIENTID4args 1990 structure 1992 3. principal: 1993 The RPCSEC_GSS principal sent via the RPC headers 1995 4. clientid: 1996 The shorthand client identifier, generated by the server and 1997 returned via the clientid field in the CREATECLIENTID4resok 1998 structure 2000 5. confirmed: 2001 A private field on the server indicating whether or not a 2002 client record has been confirmed. A client record is 2003 confirmed if there has been a successful CREATESESSION 2004 operation to confirm it. Otherwise it is unconfirmed. An 2005 unconfirmed record is established by a CREATECLIENTID call. 2006 Any unconfirmed record that is not confirmed within a lease 2007 period may be removed.
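The 5-tuple above can be modeled directly; the following is a non-normative illustration, with Python field names standing in for the record fields.

```python
from dataclasses import dataclass

# Non-normative model of the server's client record 5-tuple.
@dataclass
class ClientRecord:
    client_id: bytes   # clientdesc.id: long form client identifier
    verifier: bytes    # clientdesc.verifier: changes on client reboot
    principal: str     # RPCSEC_GSS principal from the RPC headers
    clientid: int      # shorthand identifier generated by the server
    confirmed: bool    # TRUE once a CREATESESSION has confirmed it
```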
2009 The following identifiers represent special values for the fields 2010 in the records. 2012 id_arg: 2013 The value of the clientdesc.id subfield of the 2014 CREATECLIENTID4args structure of the current request. 2016 verifier_arg: 2017 The value of the clientdesc.verifier subfield of the 2018 CREATECLIENTID4args structure of the current request. 2020 old_verifier_arg: 2021 A value of the clientdesc.verifier field of a client record 2022 received in a previous request; this is distinct from 2023 verifier_arg. 2025 principal_arg: 2026 The value of the RPCSEC_GSS principal for the current request. 2028 old_principal_arg: 2029 A value of the RPCSEC_GSS principal received for a previous 2030 request. This is distinct from principal_arg. 2032 clientid_ret: 2033 The value of the clientid field the server will return in the 2034 CREATECLIENTID4resok structure for the current request. 2036 old_clientid_ret: 2037 The value of the clientid field the server returned in the 2038 CREATECLIENTID4resok structure for a previous request. This 2039 is distinct from clientid_ret. 2041 Since CREATECLIENTID is a non-idempotent operation, we must 2042 consider the possibility that replays may occur as a result of a 2043 client reboot, network partition, malfunctioning router, etc. 2044 Replays are identified by the value of the clientdesc field of 2045 CREATECLIENTID4args and the method for dealing with them is 2046 outlined in the scenarios below. 2048 The scenarios are described in terms of the client records in the 2049 server's set whose clientdesc.id subfield has value equal to 2050 id_arg. Any case in which there is more than one record with 2051 identical values for id_arg represents a server 2052 implementation error. Operation in the potential valid cases is 2053 summarized as follows.
2055 1) Common case 2056 If no client records with clientdesc.id matching id_arg exist, 2057 a new shorthand client identifier clientid_ret is generated, 2058 and the following unconfirmed record is added to the server's 2059 state. 2061 { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE } 2063 Subsequently, the server returns clientid_ret. 2065 2) Router Replay 2066 If the server has the following confirmed record, then this 2067 request is likely the result of a replayed request due to a 2068 faulty router or lost connection. 2070 { id_arg, verifier_arg, principal_arg, clientid_ret, TRUE } 2072 Since the record has been confirmed, the client must have 2073 received the server's reply from the initial CREATECLIENTID 2074 request. Since this is simply a spurious request, there is no 2075 modification to the server's state, and the server makes no 2076 reply to the client. 2078 3) Client Collision 2079 If the server has the following confirmed record, then this 2080 request is likely the result of a chance collision between the 2081 values of the clientdesc.id subfield of CREATECLIENTID4args 2082 for two different clients. 2084 { id_arg, *, old_principal_arg, clientid_ret, TRUE } 2086 Since the value of the clientdesc.id subfield of each client 2087 record must be unique, there is no modification of the 2088 server's state, and NFS4ERR_CLID_INUSE is returned to indicate 2089 the client should retry with a different value for the 2090 clientdesc.id subfield of CREATECLIENTID4args. 2092 This scenario may also represent a malicious attempt to 2093 destroy a client's state on the server. For security reasons, 2094 the server MUST NOT remove the client's state when there is a 2095 principal mismatch. 2097 4) Replay 2098 If the server has the following unconfirmed record then this 2099 request is likely the result of a client replay due to a 2100 network partition or some other connection failure. 
2102 { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE } 2104 Since the response to the CREATECLIENTID request that created 2105 this record may have been lost, it is not acceptable to drop 2106 this duplicate request. However, rather than processing it 2107 normally, the existing record is left unchanged and 2108 clientid_ret, which was generated for the previous request, is 2109 returned. 2111 5) Change of Principal 2112 If the server has the following unconfirmed record then this 2113 request is likely the result of a client which has for 2114 whatever reasons changed principals (possibly to change 2115 security flavor) after calling CREATECLIENTID, but before 2116 calling CREATESESSION. 2118 { id_arg, verifier_arg, old_principal_arg, clientid_ret, FALSE} 2120 Since the client has not changed, the principal field of the 2121 unconfirmed record is updated to principal_arg and 2122 clientid_ret is again returned. There is a small possibility 2123 that this is merely a collision on the client field of 2124 CREATECLIENTID4args between unrelated clients, but since that 2125 is unlikely, and an unconfirmed record does not generally have 2126 any filesystem pertinent state, we can assume it is the same 2127 client without risking loss of any important state. 2129 After processing, the following record will exist on the 2130 server. 2132 { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE} 2134 6) Client Reboot 2135 If the server has the following confirmed client record, then 2136 this request is likely from a previously confirmed client 2137 which has rebooted. 2139 { id_arg, old_verifier_arg, principal_arg, clientid_ret, TRUE } 2141 Since the previous incarnation of the same client will no 2142 longer be making requests, lock and share reservations should 2143 be released immediately rather than forcing the new 2144 incarnation to wait for the lease time on the previous 2145 incarnation to expire. 
Furthermore, session state should be 2146 removed since if the client had maintained that information 2147 across reboot, this request would not have been issued. If 2148 the server does not support the CLAIM_DELEGATE_PREV claim 2149 type, associated delegations should be purged as well; 2150 otherwise, delegations are retained and recovery proceeds 2151 according to RFC3530. The client record is updated with the 2152 new verifier and its status is changed to unconfirmed. 2154 After processing, clientid_ret is returned to the client and 2155 the following record will exist on the server. 2157 { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE } 2159 7) Reboot before confirmation 2160 If the server has the following unconfirmed record, then this 2161 request is likely from a client which rebooted before sending 2162 a CREATESESSION request. 2164 { id_arg, old_verifier_arg, *, clientid_ret, FALSE } 2166 Since this is believed to be a request from a new incarnation 2167 of the original client, the server updates the value of 2168 clientdesc.verifier and returns the original clientid_ret. 2169 After processing, the following state exists on the server. 2171 { id_arg, verifier_arg, *, clientid_ret, FALSE } 2173 ERRORS 2175 NFS4ERR_BADXDR 2176 NFS4ERR_CLID_INUSE 2177 NFS4ERR_INVAL 2178 NFS4ERR_RESOURCE 2179 NFS4ERR_SERVERFAULT 2181 6.2. 
Operation: CREATESESSION - Create New Session and Confirm Clientid 2183 SYNOPSIS 2185 clientid, session_args -> sessionid, session_args 2187 ARGUMENT 2188 struct CREATESESSION4args { 2189 clientid4 clientid; 2190 bool persist; 2191 count4 maxrequestsize; 2192 count4 maxresponsesize; 2193 count4 maxrequests; 2194 count4 headerpadsize; 2195 switch (bool clientid_confirm) { 2196 case TRUE: 2197 verifier4 setclientid_confirm; 2198 case FALSE: 2199 void; 2200 } 2201 switch (channelmode4 mode) { 2202 case DEFAULT: 2203 void; 2204 case STREAM: 2205 streamchannelattrs4 streamchanattrs; 2206 case RDMA: 2207 rdmachannelattrs4 rdmachanattrs; 2208 }; 2209 }; 2211 RESULT 2212 typedef opaque sessionid4[16]; 2214 struct CREATESESSION4resok { 2215 sessionid4 sessionid; 2216 bool persist; 2217 count4 maxrequestsize; 2218 count4 maxresponsesize; 2219 count4 maxrequests; 2220 count4 headerpadsize; 2221 switch (channelmode4 mode) { 2222 case DEFAULT: 2223 void; 2224 case STREAM: 2225 streamchannelattrs4 streamchanattrs; 2226 case RDMA: 2227 rdmachannelattrs4 rdmachanattrs; 2228 }; 2229 }; 2231 union CREATESESSION4res switch (nfsstat4 status) { 2232 case NFS4_OK: 2233 CREATESESSION4resok resok4; 2234 default: 2235 void; 2236 }; 2238 DESCRIPTION 2240 This operation is used by the client to create new session objects 2241 on the server. Additionally the first session created with a new 2242 shorthand client identifier serves to confirm the creation of that 2243 client's state on the server. The server returns the parameter 2244 values for the new session. 2246 IMPLEMENTATION 2248 To describe the implementation, the same notation for client 2249 records introduced in the description of CREATECLIENTID is used 2250 with the following addition. 2252 clientid_arg: 2253 The value of the clientid field of the CREATESESSION4args 2254 structure of the current request. 
2256 Since CREATESESSION is a non-idempotent operation, we must consider 2257 the possibility that replays may occur as a result of a client 2258 reboot, network partition, malfunctioning router, etc. Replays are 2259 identified by the value of the clientid and sessionid fields of 2260 CREATESESSION4args and the method for dealing with them is outlined 2261 in the scenarios below. 2263 The processing of this operation is divided into two phases: 2264 clientid confirmation and session creation. In case the state for 2265 the provided clientid has not been verified, it is confirmed before 2266 the session is created. Otherwise the clientid confirmation phase 2267 is skipped and only the session creation phase occurs. Note that 2268 since only confirmed clients may create sessions, the clientid 2269 confirmation stage does not depend upon sessionid_arg. 2271 CLIENTID CONFIRMATION 2273 The operational cases are described in terms of what client records 2274 whose clientid field have value equal to clientid_arg exist in the 2275 server's set of client records. Any cases in which there is more 2276 than one record with identical values for clientid represent a 2277 server implementation error. Operation in the potential valid 2278 cases is summarized as follows. 2280 1) Common Case 2281 If the server has the following unconfirmed record, then this 2282 is the expected confirmation of an unconfirmed record. 2284 { *, *, principal_arg, clientid_arg, FALSE } 2286 The confirmed field of the record is set to TRUE and 2287 processing of the operation continues normally. 2289 2) Stale Clientid 2290 If the server contains no records with clientid equal to 2291 clientid_arg, then most likely the client's state has been 2292 purged during a period of inactivity, possibly due to a loss 2293 of connectivity. NFS4ERR_STALE_CLIENTID is returned, and no 2294 changes are made to any client records on the server. 
2296 3) Principal Change or Collision 2297 If the server has the following record, then the client has 2298 changed principals after the previous CREATECLIENTID request, 2299 or there has been a chance collision between shorthand client 2300 identifiers. 2302 { *, *, old_principal_arg, clientid_arg, * } 2304 Neither of these cases is permissible. Processing stops and 2305 NFS4ERR_CLID_INUSE is returned to the client. No changes are 2306 made to any client records on the server. 2308 SESSION CREATION 2310 To determine whether this request is a replay, the server examines 2311 the sessionid argument provided by the client. If the sessionid 2312 matches the identifier of a previously created session, then this 2313 request must be interpreted as a replay. No new state is created 2314 and a reply with the parameters of the existing session is returned 2315 to the client. If a session corresponding to the sessionid does 2316 not already exist, then the request is not a replay and is 2317 processed as follows. 2319 NOTE: It is the responsibility of the client to generate 2320 appropriate values for sessionid. Since the ordering of messages 2321 sent on different transport connections is not guaranteed, 2322 immediately reusing the sessionid of a previously destroyed session 2323 may yield unpredictable results. Client implementations should 2324 avoid recently used sessionids to ensure correct behavior. 2326 The server examines the persist, maxrequestsize, maxresponsesize, 2327 maxrequests and headerpadsize arguments. For each argument, if the 2328 value is acceptable to the server, it is recommended that the 2329 server use the provided value to create the new session. If it is 2330 not acceptable, the server may use a different value, but must 2331 return the value used to the client. These parameters have the 2332 following interpretation. 2334 persist: 2335 True if the client desires server support for "reliable" 2336 semantics.
For sessions in which only idempotent operations 2337 will be used (e.g. a read-only session), clients should set 2338 this value to false. If the server does not or cannot provide 2339 "reliable" semantics, this value must be set to false on 2340 return. 2342 maxrequestsize: 2343 The maximum size of a COMPOUND request that will be sent by 2344 the client including RPC headers. 2346 maxresponsesize: 2347 The maximum size of a COMPOUND reply that the client will 2348 accept from the server including RPC headers. The server must 2349 not increase the value of this parameter. If a client sends a 2350 COMPOUND request for which the size of the reply would exceed 2351 this value, the server will return NFS4ERR_RESOURCE. 2353 maxrequests: 2354 The maximum number of concurrent COMPOUND requests that the 2355 client will issue on the session. Subsequent COMPOUND 2356 requests will each be assigned a slot identifier by the client 2357 in the range 0 to maxrequests - 1, inclusive. A slotid cannot 2358 be reused until the previous request on that slot has 2359 completed. 2361 headerpadsize: 2362 The maximum amount of padding the client is willing to apply 2363 to ensure that write payloads are aligned on some boundary at 2364 the server. The server should reply with its preferred value, 2365 or zero if padding is not in use. The server may decrease 2366 this value but must not increase it. 2368 The server creates the session by recording the parameter values 2369 used and, if the persist parameter is true and has been accepted by 2370 the server, allocating space for the duplicate request cache (DRC). 2372 If the session state is created successfully, the server associates 2373 it with the session identifier provided by the client. This 2374 identifier must be unique among the client's active sessions but 2375 there is no need for it to be globally unique. Finally, the server 2376 returns the negotiated values used to create the session to the 2377 client.
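As a non-normative sketch, one plausible server-side negotiation of these parameters follows; clamping with min() is merely one policy that satisfies the "must not increase" rules, and all names are illustrative.

```python
# Sketch of CREATESESSION parameter negotiation. Clamping with min() is
# one plausible policy; the text only requires that the server return
# the values actually used and never increase maxresponsesize or
# headerpadsize.
def negotiate_session_params(server_caps, args):
    return {
        # persist honored only if the server can offer "reliable" semantics
        "persist": args["persist"] and server_caps["supports_drc"],
        "maxrequestsize": min(args["maxrequestsize"], server_caps["maxrequestsize"]),
        "maxresponsesize": min(args["maxresponsesize"], server_caps["maxresponsesize"]),
        "maxrequests": min(args["maxrequests"], server_caps["maxrequests"]),
        "headerpadsize": min(args["headerpadsize"], server_caps["headerpadsize"]),
    }
```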
2379 ERRORS 2381 NFS4ERR_BADXDR 2382 NFS4ERR_CLID_INUSE 2383 NFS4ERR_RESOURCE 2384 NFS4ERR_SERVERFAULT 2385 NFS4ERR_STALE_CLIENTID 2387 6.3. Operation: BIND_BACKCHANNEL - Create a callback channel binding 2389 SYNOPSIS 2391 Establish a callback channel on the connection. 2393 ARGUMENTS 2395 struct BIND_BACKCHANNEL4args { 2396 clientid4 clientid; 2397 uint32_t callback_program; 2398 uint32_t callback_ident; 2399 count4 maxrequestsize; 2400 count4 maxresponsesize; 2401 count4 maxrequests; 2402 switch (channelmode4 mode) { 2403 case DEFAULT: 2404 void; 2405 case STREAM: 2406 streamchannelattrs4 streamchanattrs; 2407 case RDMA: 2408 rdmachannelattrs4 rdmachanattrs; 2409 }; 2410 }; 2412 RESULTS 2414 struct BIND_BACKCHANNEL4resok { 2415 count4 maxrequestsize; 2416 count4 maxresponsesize; 2417 count4 maxrequests; 2418 switch (channelmode4 mode) { 2419 case DEFAULT: 2420 void; 2421 case STREAM: 2422 streamchannelattrs4 streamchanattrs; 2423 case RDMA: 2424 rdmachannelattrs4 rdmachanattrs; 2425 }; 2426 }; 2428 union BIND_BACKCHANNEL4res switch (nfsstat4 status) { 2429 case NFS4_OK: 2430 BIND_BACKCHANNEL4resok resok4; 2431 default: 2432 void; 2433 }; 2435 DESCRIPTION 2437 The BIND_BACKCHANNEL operation serves to establish the current 2438 connection as a designated callback channel for the specified 2439 session. Normally, only one callback channel is bound; however, if 2440 more than one is established, they are used at the server's 2441 prerogative; no affinity or preference is specified by the client. 2443 The arguments and results of the BIND_BACKCHANNEL call are a subset 2444 of the session parameters, and are used identically to those values on 2445 the callback channel only. However, not all session operation 2446 channel parameters are relevant to the callback channel, for 2447 example header padding (since writes of bulk data are not performed 2448 in callbacks). 2450 ERRORS 2452 ... 2454 6.4.
Operation: DESTROYSESSION - Destroy existing session 2456 SYNOPSIS 2458 void -> status 2460 ARGUMENT 2462 struct DESTROYSESSION4args { 2463 sessionid4 sessionid; }; 2465 RESULT 2467 struct DESTROYSESSION4res { 2468 nfsstat4 status; 2469 }; 2471 DESCRIPTION 2473 The DESTROYSESSION operation closes the session and discards any 2474 active state such as locks, leases, and server duplicate request 2475 cache entries. Any remaining connections bound to the session are 2476 immediately unbound and may additionally be closed by the server. 2478 This operation must be the final, or only, operation in any request. 2479 Because the operation results in destruction of the session, any 2480 duplicate request caching for this request, as well as previously 2481 completed requests, will be lost. For this reason, it is advisable 2482 not to place this operation in a request with other state-modifying 2483 operations. In addition, a SEQUENCE operation is not required in 2484 the request. 2486 Note that because the operation will never be replayed by the 2487 server, a client that retransmits the request may receive an error 2488 in response, even though the session may have been successfully 2489 destroyed. 2491 ... 2493 ERRORS 2495 2497 6.5.
Operation: SEQUENCE - Supply per-procedure sequencing and control 2499 SYNOPSIS 2501 control -> control 2503 ARGUMENT 2505 typedef uint32_t sequenceid4; 2506 typedef uint32_t slotid4; 2508 struct SEQUENCE4args { 2509 clientid4 clientid; 2510 sessionid4 sessionid; 2511 sequenceid4 sequenceid; 2512 slotid4 slotid; 2513 slotid4 maxslot; 2514 }; 2516 RESULT 2518 struct SEQUENCE4resok { 2519 clientid4 clientid; 2520 sessionid4 sessionid; 2521 sequenceid4 sequenceid; 2522 slotid4 slotid; 2523 slotid4 maxslot; 2524 slotid4 target_maxslot; 2525 }; 2527 union SEQUENCE4res switch (nfsstat4 status) { 2528 case NFS4_OK: 2529 SEQUENCE4resok resok4; 2530 default: 2531 void; 2532 }; 2534 DESCRIPTION 2536 The SEQUENCE operation is used to manage operational accounting for 2537 the session on which the operation is sent. The contents include 2538 the client and session to which this request belongs, slotid and 2539 sequenceid, used by the server to implement session request control 2540 and the duplicate reply cache semantics, and exchanged slot counts 2541 which are used to adjust these values. This operation must appear 2542 once as the first operation in each COMPOUND sent after the channel 2543 is successfully bound, or a protocol error must result. 2545 ... 2547 ERRORS 2549 NFS4ERR_BADSESSION 2550 NFS4ERR_BADSLOT 2552 6.6. Callback operation: CB_RECALLCREDIT - change flow control limits 2554 SYNOPSIS 2556 targetcount -> status 2558 ARGUMENTS 2560 struct CB_RECALLCREDIT4args { 2561 sessionid4 sessionid; 2562 uint32_t target; 2563 }; 2565 RESULT 2567 struct CB_RECALLCREDIT4res { 2568 nfsstat4 status; 2569 }; 2571 DESCRIPTION 2573 The CB_RECALLCREDIT operation requests the client to return session 2574 and transport credits to the server, by zero-length RDMA Sends or 2575 NULL NFSv4 operations. 2577 ... 2579 ERRORS 2581 2583 6.7. 
Callback operation: CB_SEQUENCE - Supply callback channel 2584 sequencing and control 2586 SYNOPSIS 2588 control -> control 2590 ARGUMENT 2592 typedef uint32_t sequenceid4; 2593 typedef uint32_t slotid4; 2595 struct CB_SEQUENCE4args { 2596 clientid4 clientid; 2597 sessionid4 sessionid; 2598 sequenceid4 sequenceid; 2599 slotid4 slotid; 2600 slotid4 maxslot; 2601 }; 2603 RESULT 2605 struct CB_SEQUENCE4resok { 2606 clientid4 clientid; 2607 sessionid4 sessionid; 2608 sequenceid4 sequenceid; 2609 slotid4 slotid; 2610 slotid4 maxslot; 2611 slotid4 target_maxslot; 2612 }; 2614 union CB_SEQUENCE4res switch (nfsstat4 status) { 2615 case NFS4_OK: 2616 CB_SEQUENCE4resok resok4; 2617 default: 2618 void; 2619 }; 2621 DESCRIPTION 2623 The CB_SEQUENCE operation is used to manage operational accounting 2624 for the callback channel of the session on which the operation is 2625 sent. The contents include the client and session to which this 2626 request belongs, slotid and sequenceid, used by the server to 2627 implement session request control and the duplicate reply cache 2628 semantics, and exchanged slot counts which are used to adjust these 2629 values. This operation must appear once as the first operation in 2630 each CB_COMPOUND sent after the callback channel is successfully 2631 bound, or a protocol error must result. 2633 ... 2635 ERRORS 2637 NFS4ERR_BADSESSION 2638 NFS4ERR_BADSLOT 2640 7. NFSv4 Session Protocol Description 2642 This section contains the proposed protocol changes in RPC 2643 description language. The constants named in this section are 2644 illustrative. When the working group decides on the full content 2645 of the NFSv4.1 minor revision, they may change in order to avoid 2646 conflict. 
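Before the formal XDR, the slot and duplicate-reply-cache discipline that SEQUENCE (section 6.5) and CB_SEQUENCE (section 6.7) manage can be sketched non-normatively; the class shape and the handling of misordered sequenceids are illustrative assumptions of this sketch, not protocol text.

```python
# Illustrative slot table: one (last sequenceid, cached reply) pair per
# slot. A repeated sequenceid on a slot is a retransmission and is
# answered from the cache; any other out-of-order use is rejected here
# with NFS4ERR_BADSLOT (an assumption for this sketch).
class SlotTable:
    def __init__(self, maxrequests):
        self.slots = [(0, None)] * maxrequests

    def process(self, slotid, sequenceid, handler):
        if not 0 <= slotid < len(self.slots):
            return "NFS4ERR_BADSLOT"
        last_seq, cached = self.slots[slotid]
        if sequenceid == last_seq:
            return cached                 # replay: serve the cached reply
        if sequenceid != last_seq + 1:
            return "NFS4ERR_BADSLOT"      # misuse of the slot
        reply = handler()                 # execute the new request
        self.slots[slotid] = (sequenceid, reply)
        return reply
```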
   NFS4ERR_BADSESSION      = 10049, /* invalid session */
   NFS4ERR_BADSLOT         = 10050  /* invalid slotid */

   /*
    * CREATECLIENTID: v4.1 setclientid for session use
    */

   struct CREATECLIENTID4args {
           nfs_client_id4  clientdesc;
   };

   struct CREATECLIENTID4resok {
           clientid4       clientid;
           verifier4       clientid_confirm;
   };

   union CREATECLIENTID4res switch (nfsstat4 status) {
   case NFS4_OK:
           CREATECLIENTID4resok resok4;
   default:
           void;
   };

   /*
    * Channel attributes - TBD.
    */

   enum channelmode4 {
           DEFAULT = 0,    /* don't change */
           STREAM  = 1,    /* TCP stream */
           RDMA    = 2     /* upshift to RDMA */
   };

   struct streamchannelattrs4 {
           opaque  nothing[0];     /* TBD */
   };

   struct rdmachannelattrs4 {
           count4  maxrdmareads;
           /* plus TBD */
   };

   /*
    * CREATESESSION: v4.1 session creation and optional
    * clientid confirm
    */

   typedef opaque sessionid4[16];

   union optverifier4 switch (bool clientid_confirm) {
   case TRUE:
           verifier4 setclientid_confirm;
   case FALSE:
           void;
   };

   union transportattrs4 switch (channelmode4 mode) {
   case DEFAULT:
           void;
   case STREAM:
           streamchannelattrs4 streamchanattrs;
   case RDMA:
           rdmachannelattrs4 rdmachanattrs;
   };

   struct CREATESESSION4args {
           clientid4       clientid;
           bool            persist;
           count4          maxrequestsize;
           count4          maxresponsesize;
           count4          maxrequests;
           count4          headerpadsize;
           optverifier4    verifier;
           transportattrs4 transportattrs;
   };

   struct CREATESESSION4resok {
           sessionid4      sessionid;
           bool            persist;
           count4          maxrequestsize;
           count4          maxresponsesize;
           count4          maxrequests;
           count4          headerpadsize;
           transportattrs4 transportattrs;
   };

   union CREATESESSION4res switch (nfsstat4 status) {
   case NFS4_OK:
           CREATESESSION4resok resok4;
   default:
           void;
   };

   /*
    * BIND_BACKCHANNEL: v4.1 callback binding
    */

   struct BIND_BACKCHANNEL4args {
           clientid4       clientid;
           uint32_t        callback_program;
           uint32_t        callback_ident;
           count4          maxrequestsize;
           count4          maxresponsesize;
           count4          maxrequests;
           transportattrs4 transportattrs;
   };

   struct BIND_BACKCHANNEL4resok {
           count4          maxrequestsize;
           count4          maxresponsesize;
           count4          maxrequests;
           transportattrs4 transportattrs;
   };

   union BIND_BACKCHANNEL4res switch (nfsstat4 status) {
   case NFS4_OK:
           BIND_BACKCHANNEL4resok resok4;
   default:
           void;
   };

   /*
    * DESTROYSESSION: v4.1 session destruction
    */

   struct DESTROYSESSION4args {
           sessionid4      sessionid;
   };

   struct DESTROYSESSION4res {
           nfsstat4        status;
   };

   /*
    * SEQUENCE: v4.1 operation sequence control
    */

   typedef uint32_t sequenceid4;
   typedef uint32_t slotid4;

   struct SEQUENCE4args {
           clientid4       clientid;
           sessionid4      sessionid;
           sequenceid4     sequenceid;
           slotid4         slotid;
           slotid4         maxslot;
   };

   struct SEQUENCE4resok {
           clientid4       clientid;
           sessionid4      sessionid;
           sequenceid4     sequenceid;
           slotid4         slotid;
           slotid4         maxslot;
           slotid4         target_maxslot;
   };

   union SEQUENCE4res switch (nfsstat4 status) {
   case NFS4_OK:
           SEQUENCE4resok resok4;
   default:
           void;
   };

   /* Operation values */
   OP_CREATECLIENTID   = 40,
   OP_CREATESESSION    = 41,
   OP_BIND_BACKCHANNEL = 42,
   OP_DESTROYSESSION   = 43,
   OP_SEQUENCE         = 44,

   /* Operation arguments */
   case OP_CREATECLIENTID:
           CREATECLIENTID4args opcreateclientid;
   case OP_CREATESESSION:
           CREATESESSION4args opcreatesession;
   case OP_BIND_BACKCHANNEL:
           BIND_BACKCHANNEL4args opbind_backchannel;
   case OP_DESTROYSESSION:
           DESTROYSESSION4args opdestroysession;
   case OP_SEQUENCE:
           SEQUENCE4args opsequence;

   /* Operation results */
   case OP_CREATECLIENTID:
           CREATECLIENTID4res opcreateclientid;
   case OP_CREATESESSION:
           CREATESESSION4res opcreatesession;
   case OP_BIND_BACKCHANNEL:
           BIND_BACKCHANNEL4res opbind_backchannel;
   case OP_DESTROYSESSION:
           DESTROYSESSION4res opdestroysession;
   case OP_SEQUENCE:
           SEQUENCE4res opsequence;

   /*
    * CB_RECALLCREDIT: Recall session credits from
    * operations channel(s)
    */

   struct CB_RECALLCREDIT4args {
           sessionid4      sessionid;
           uint32_t        target;
   };

   struct CB_RECALLCREDIT4res {
           nfsstat4        status;
   };

   /*
    * CB_SEQUENCE: v4.1 operation sequence control
    */

   struct CB_SEQUENCE4args {
           clientid4       clientid;
           sessionid4      sessionid;
           sequenceid4     sequenceid;
           slotid4         slotid;
           slotid4         maxslot;
   };

   struct CB_SEQUENCE4resok {
           clientid4       clientid;
           sessionid4      sessionid;
           sequenceid4     sequenceid;
           slotid4         slotid;
           slotid4         maxslot;
           slotid4         target_maxslot;
   };

   union CB_SEQUENCE4res switch (nfsstat4 status) {
   case NFS4_OK:
           CB_SEQUENCE4resok resok4;
   default:
           void;
   };

   /* Operation values */
   OP_CB_RECALLCREDIT = 5,
   OP_CB_SEQUENCE     = 6

   /* Operation arguments */
   case OP_CB_RECALLCREDIT:
           CB_RECALLCREDIT4args opcbrecallcredit;
   case OP_CB_SEQUENCE:
           CB_SEQUENCE4args opcbsequence;

   /* Operation results */
   case OP_CB_RECALLCREDIT:
           CB_RECALLCREDIT4res opcbrecallcredit;
   case OP_CB_SEQUENCE:
           CB_SEQUENCE4res opcbsequence;

8.  Acknowledgements

   The authors wish to acknowledge the valuable contributions and
   review of Charles Antonelli, Brent Callaghan, Mike Eisler, John
   Howard, Chet Juszczak, Trond Myklebust, Dave Noveck, John Scott,
   Mike Stolarchuk and Mark Wittle.

9.  References

9.1.  Normative References

   [RFC3530]
      S.
      Shepler, et al., "NFS Version 4 Protocol", Standards Track RFC,
      http://www.ietf.org/rfc/rfc3530

9.2.  Informative References

   [BW87]
      B. Welch, "The Sprite Remote Procedure Call System", University
      of California Berkeley Technical Report CSD-87-302,
      ftp://sunsite.berkeley.edu/pub/techreps/CSD-87-302.html

   [CCM]
      M. Eisler, N. Williams, "The Channel Conjunction Mechanism (CCM)
      for GSS", Internet-Draft Work in Progress,
      http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-ccm

   [CJ89]
      C. Juszczak, "Improving the Performance and Correctness of an
      NFS Server", Winter 1989 USENIX Conference Proceedings, USENIX
      Association, Berkeley, CA, February 1989, pages 53-63.

   [DAFS]
      Direct Access File System, available from
      http://www.dafscollaborative.org

   [DCK+03]
      M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck, T.
      Talpey, M. Wittle, "The Direct Access File System", in
      Proceedings of the 2nd USENIX Conference on File and Storage
      Technologies (FAST '03), San Francisco, CA, March 31 - April 2,
      2003

   [DDP]
      H. Shah, J. Pinkerton, R. Recio, P. Culley, "Direct Data
      Placement over Reliable Transports", Internet-Draft Work in
      Progress, http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp

   [FJDAFS]
      Fujitsu Prime Software Technologies, "Meet the DAFS Performance
      with DAFS/VI Kernel Implementation using cLAN",
      http://www.pst.fujitsu.com/english/dafsdemo/index.html

   [FJNFS]
      Fujitsu Prime Software Technologies, "An Adaptation of VIA to
      NFS on Linux",
      http://www.pst.fujitsu.com/english/nfs/index.html

   [IB]
      InfiniBand Architecture Specification, Volume 1, Release 1.1,
      available from http://www.infinibandta.org

   [KM02]
      K. Magoutis, "Design and Implementation of a Direct Access File
      System (DAFS) Kernel Server for FreeBSD", in Proceedings of the
      USENIX BSDCon 2002 Conference, San Francisco, CA, February
      11-14, 2002.

   [MAF+02]
      K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase, D.
      Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure
      and Performance of the Direct Access File System (DAFS)", in
      Proceedings of the 2002 USENIX Annual Technical Conference,
      Monterey, CA, June 9-14, 2002.

   [MIDTAX]
      B. Carpenter, S. Brim, "Middleboxes: Taxonomy and Issues",
      Informational RFC, http://www.ietf.org/rfc/rfc3234

   [NFSDDP]
      B. Callaghan, T. Talpey, "NFS Direct Data Placement",
      Internet-Draft Work in Progress,
      http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-nfsdirect

   [NFSPS]
      T. Talpey, C. Juszczak, "NFS RDMA Problem Statement",
      Internet-Draft Work in Progress,
      http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-nfs-rdma-problem-statement

   [RDDP]
      Remote Direct Data Placement Working Group charter,
      http://www.ietf.org/html.charters/rddp-charter.html

   [RDDPPS]
      A. Romanow, J. Mogul, T. Talpey, S. Bailey, "Remote Direct Data
      Placement Working Group Problem Statement", Internet-Draft Work
      in Progress,
      http://www.ietf.org/internet-drafts/draft-ietf-rddp-problem-statement

   [RDMAP]
      R. Recio, P. Culley, D. Garcia, J. Hilland, "An RDMA Protocol
      Specification", Internet-Draft Work in Progress,
      http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap

   [RPCRDMA]
      B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC",
      Internet-Draft Work in Progress,
      http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-rpcrdma

   [RFC2203]
      M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
      Specification", Standards Track RFC,
      http://www.ietf.org/rfc/rfc2203

   [RW96]
      R.
      Werme, "RPC XID Issues", Connectathon 1996, San Jose, CA,
      http://www.cthon.org/talks96/werme1.pdf

10.  Authors' Addresses

   Comments on this draft may be sent to the NFSv4 Working Group
   (nfsv4@ietf.org) and/or the authors.

   Tom Talpey
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Spencer Shepler
   Sun Microsystems, Inc.
   7808 Moonflower Drive
   Austin, TX 78750 USA

   Phone: +1 512 349 9376
   EMail: spencer.shepler@sun.com

   Jon Bauman
   University of Michigan
   Center for Information Technology Integration
   535 W. William St., Suite 3100
   Ann Arbor, MI 48103 USA

   Phone: +1 734 615 4782
   EMail: baumanj@umich.edu

11.  Full Copyright Statement

   Copyright (C) The Internet Society (2005).  This document is
   subject to the rights, licenses and restrictions contained in BCP
   78, and except as set forth therein, the authors retain all their
   rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
   ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
   PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.