NFSv4                                                         S. Shepler
Internet-Draft                                                    Editor
Expires: June 15, 2006                                 December 12, 2005

                         NFSv4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-01.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on June 15, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).
Abstract

   This Internet-Draft describes the NFSv4 minor version 1 protocol
   extensions.  The most significant of these extensions are commonly
   known as Sessions, Directory Delegations, and parallel NFS (pNFS).

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Security Negotiation
   2.  Clarification of Security Negotiation in NFSv4.1
     2.1  PUTFH + LOOKUP
     2.2  PUTFH + LOOKUPP
     2.3  PUTFH + SECINFO
     2.4  PUTFH + Anything Else
   3.  NFSv4.1 Sessions
     3.1  Sessions Background
       3.1.1  Introduction to Sessions
       3.1.2  Motivation
       3.1.3  Problem Statement
       3.1.4  NFSv4 Session Extension Characteristics
     3.2  Transport Issues
       3.2.1  Session Model
       3.2.2  Connection State
       3.2.3  NFSv4 Channels, Sessions and Connections
       3.2.4  Reconnection, Trunking and Failover
       3.2.5  Server Duplicate Request Cache
     3.3  Session Initialization and Transfer Models
       3.3.1  Session Negotiation
       3.3.2  RDMA Requirements
       3.3.3  RDMA Connection Resources
       3.3.4  TCP and RDMA Inline Transfer Model
       3.3.5  RDMA Direct Transfer Model
     3.4  Connection Models
       3.4.1  TCP Connection Model
       3.4.2  Negotiated RDMA Connection Model
       3.4.3  Automatic RDMA Connection Model
     3.5  Buffer Management, Transfer, Flow Control
     3.6  Retry and Replay
     3.7  The Back Channel
     3.8  COMPOUND Sizing Issues
     3.9  Data Alignment
     3.10  NFSv4 Integration
       3.10.1  Minor Versioning
       3.10.2  Slot Identifiers and Server Duplicate Request Cache
       3.10.3  COMPOUND and CB_COMPOUND
       3.10.4  eXternal Data Representation Efficiency
       3.10.5  Effect of Sessions on Existing Operations
       3.10.6  Authentication Efficiencies
     3.11  Sessions Security Considerations
       3.11.1  Authentication
   4.  Directory Delegations
     4.1  Introduction to Directory Delegations
     4.2  Directory Delegation Design (in brief)
     4.3  Recommended Attributes in support of Directory Delegations
     4.4  Delegation Recall
     4.5  Delegation Recovery
   5.  Introduction
   6.  General Definitions
     6.1  Metadata Server
     6.2  Client
     6.3  Storage Device
     6.4  Storage Protocol
     6.5  Control Protocol
     6.6  Metadata
     6.7  Layout
   7.  pNFS protocol semantics
     7.1  Definitions
       7.1.1  Layout Types
       7.1.2  Layout Iomode
       7.1.3  Layout Segments
       7.1.4  Device IDs
       7.1.5  Aggregation Schemes
     7.2  Guarantees Provided by Layouts
     7.3  Getting a Layout
     7.4  Committing a Layout
       7.4.1  LAYOUTCOMMIT and mtime/atime/change
       7.4.2  LAYOUTCOMMIT and size
       7.4.3  LAYOUTCOMMIT and layoutupdate
     7.5  Recalling a Layout
       7.5.1  Basic Operation
       7.5.2  Recall Callback Robustness
       7.5.3  Recall/Return Sequencing
     7.6  Metadata Server Write Propagation
     7.7  Crash Recovery
       7.7.1  Leases
       7.7.2  Client Recovery
       7.7.3  Metadata Server Recovery
       7.7.4  Storage Device Recovery
   8.  Security Considerations
     8.1  File Layout Security
     8.2  Object Layout Security
     8.3  Block/Volume Layout Security
   9.  The NFSv4 File Layout Type
     9.1  File Striping and Data Access
       9.1.1  Sparse and Dense Storage Device Data Layouts
       9.1.2  Metadata and Storage Device Roles
       9.1.3  Device Multipathing
       9.1.4  Operations Issued to Storage Devices
     9.2  Global Stateid Requirements
     9.3  The Layout Iomode
     9.4  Storage Device State Propagation
       9.4.1  Lock State Propagation
       9.4.2  Open-mode Validation
       9.4.3  File Attributes
     9.5  Storage Device Component File Size
     9.6  Crash Recovery Considerations
     9.7  Security Considerations
     9.8  Alternate Approaches
   10.  pNFS Typed Data Structures
     10.1  pnfs_layouttype4
     10.2  pnfs_deviceid4
     10.3  pnfs_deviceaddr4
     10.4  pnfs_devlist_item4
     10.5  pnfs_layout4
     10.6  pnfs_layoutupdate4
     10.7  pnfs_layouthint4
     10.8  pnfs_layoutiomode4
   11.  pNFS File Attributes
     11.1  pnfs_layouttype4<> FS_LAYOUT_TYPES
     11.2  pnfs_layouttype4<> FILE_LAYOUT_TYPES
     11.3  pnfs_layouthint4 FILE_LAYOUT_HINT
     11.4  uint32_t FS_LAYOUT_PREFERRED_BLOCKSIZE
     11.5  uint32_t FS_LAYOUT_PREFERRED_ALIGNMENT
   12.  pNFS Error Definitions
   13.  Layouts and Aggregation
     13.1  Simple Map
     13.2  Block Extent Map
     13.3  Striped Map (RAID 0)
     13.4  Replicated Map
     13.5  Concatenated Map
     13.6  Nested Map
   14.  NFSv4.1 Operations
     14.1  LOOKUPP - Lookup Parent Directory
     14.2  SECINFO - Obtain Available Security
     14.3  SECINFO_NO_NAME - Get Security on Unnamed Object
     14.4  CREATECLIENTID - Instantiate Clientid
     14.5  CREATESESSION - Create New Session and Confirm Clientid
     14.6  BIND_BACKCHANNEL - Create a callback channel binding
     14.7  DESTROYSESSION - Destroy existing session
     14.8  SEQUENCE - Supply per-procedure sequencing and control
     14.9  CB_RECALLCREDIT - change flow control limits
     14.10  CB_SEQUENCE - Supply callback channel sequencing and control
     14.11  GET_DIR_DELEGATION - Get a directory delegation
     14.12  CB_NOTIFY - Notify directory changes
     14.13  CB_RECALL_ANY - Keep any N delegations
     14.14  LAYOUTGET - Get Layout Information
     14.15  LAYOUTCOMMIT - Commit writes made using a layout
     14.16  LAYOUTRETURN - Release Layout Information
     14.17  GETDEVICEINFO - Get Device Information
     14.18  GETDEVICELIST - Get List of Devices
     14.19  CB_LAYOUTRECALL
     14.20  CB_SIZECHANGED
   15.  References
     15.1  Normative References
     15.2  Informative References
   Author's Address
   A.  Acknowledgments
   Intellectual Property and Copyright Statements

1.  Security Negotiation

   The NFSv4.0 specification contains three oversights and ambiguities
   with respect to the SECINFO operation.

   First, it is impossible for the client to use the SECINFO operation
   to determine the correct security triple for accessing a parent
   directory.  This is because SECINFO takes as arguments the current
   file handle and a component name.  However, NFSv4.0 uses the
   LOOKUPP operation to get the parent directory of the current file
   handle.  If the client uses the wrong security when issuing the
   LOOKUPP, and gets back an NFS4ERR_WRONGSEC error, SECINFO is
   useless to the client.  The client is left to guess which security
   the server will accept.  This defeats the purpose of SECINFO, which
   was to provide an efficient method of negotiating security.

   Second, there is ambiguity as to what the server should do when it
   is passed a LOOKUP operation such that the server restricts access
   to the current file handle with one security triple, and access to
   the component with a different triple, and the remote procedure
   call uses one of the two security triples.  Should the server allow
   the LOOKUP?

   Third, there is a problem as to what the client must do (or can
   do), whenever the server returns NFS4ERR_WRONGSEC in response to a
   PUTFH operation.  The NFSv4.0 specification says that the client
   should issue a SECINFO using the parent filehandle and the
   component name of the filehandle that PUTFH was issued with.  This
   may not be convenient for the client.

   This document resolves the above three issues in the context of
   NFSv4.1.

2.  Clarification of Security Negotiation in NFSv4.1

   This section attempts to clarify NFSv4.1 security negotiation
   issues.  Unless noted otherwise, for any mention of PUTFH in this
   section, the reader should interpret it as applying to PUTROOTFH
   and PUTPUBFH in addition to PUTFH.

2.1  PUTFH + LOOKUP

   The server implementation may decide whether to impose any
   restrictions on export security administration.  There are at least
   three approaches (Sc is the flavor set of the child export, Sp that
   of the parent):

   a) Sc <= Sp (<= for subset)

   b) Sc ^ Sp != {} (^ for intersection, {} for the empty set)

   c) free form

   To support b (when the client chooses a flavor that is not a member
   of Sp) and c, PUTFH must NOT return NFS4ERR_WRONGSEC in case of a
   security mismatch.  Instead, the error should be returned from the
   LOOKUP that follows.

   Since the above guideline does not contradict a, it should be
   followed in general.
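   To make the consequence of this rule concrete, the following C
   fragment sketches the server-side check under stated assumptions:
   the flavor sets are plain arrays, the helper and structure names
   are hypothetical, and only the WRONGSEC decision is modeled.  The
   point is that PUTFH accepts the filehandle unconditionally, and the
   flavor check happens in the LOOKUP that follows.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      #define NFS4_OK           0
      #define NFS4ERR_WRONGSEC  10016   /* value from RFC 3530 */

      struct export_info {
          const uint32_t *flavors;   /* flavors admitted for the export */
          size_t          nflavors;
      };

      static bool flavor_allowed(const struct export_info *exp,
                                 uint32_t flavor)
      {
          for (size_t i = 0; i < exp->nflavors; i++)
              if (exp->flavors[i] == flavor)
                  return true;
          return false;
      }

      /* PUTFH: no flavor check here, so cases b and c stay possible. */
      static int op_putfh(uint32_t rpc_flavor)
      {
          (void)rpc_flavor;          /* deliberately not checked */
          return NFS4_OK;
      }

      /* LOOKUP: the mismatch, if any, is reported here instead. */
      static int op_lookup(const struct export_info *child,
                           uint32_t rpc_flavor)
      {
          return flavor_allowed(child, rpc_flavor) ? NFS4_OK
                                                   : NFS4ERR_WRONGSEC;
      }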
2.2  PUTFH + LOOKUPP

   Since SECINFO only works its way down, there is no way LOOKUPP can
   return NFS4ERR_WRONGSEC without the server implementing
   SECINFO_NO_NAME.  SECINFO_NO_NAME solves this issue because, via
   style "parent", it works in the opposite direction from SECINFO
   (the component name is implicit in this case).

2.3  PUTFH + SECINFO

   This case should be treated specially.

   A security-sensitive client should be allowed to choose a strong
   flavor when querying a server to determine a file object's
   permitted security flavors.  The security flavor chosen by the
   client does not have to be included in the flavor list of the
   export.  Of course, the server has to be configured for whatever
   flavor the client selects; otherwise the request will fail at RPC
   authentication.

   In theory, there is no connection between the security flavor used
   by SECINFO and those supported by the export.  In practice,
   however, the client may start looking for strong flavors among
   those supported by the export, followed by those in the mandatory
   set.
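   A minimal sketch of that client-side heuristic follows; the
   preference table and function name are illustrative assumptions (a
   real client would rank RPCSEC_GSS services such as krb5p and krb5i
   rather than bare flavor numbers):

      #include <stddef.h>
      #include <stdint.h>

      #define AUTH_SYS    1   /* RFC 1831 */
      #define RPCSEC_GSS  6   /* RFC 2203 */

      /* Ordered from strongest to weakest. */
      static const uint32_t preference[] = { RPCSEC_GSS, AUTH_SYS };

      /* Pick the flavor to protect the SECINFO call itself. */
      static uint32_t pick_secinfo_flavor(const uint32_t *exported,
                                          size_t n)
      {
          size_t nprefs = sizeof(preference) / sizeof(preference[0]);

          for (size_t p = 0; p < nprefs; p++)
              for (size_t i = 0; i < n; i++)
                  if (exported[i] == preference[p])
                      return preference[p];
          return RPCSEC_GSS;   /* fall back to the mandatory set */
      }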
2.4  PUTFH + Anything Else

   PUTFH must return NFS4ERR_WRONGSEC in case of a security mismatch.
   This is the most straightforward approach, avoiding the need to add
   NFS4ERR_WRONGSEC to every other operation.

   PUTFH + SECINFO_NO_NAME (style "current_fh") is needed for the
   client to recover from NFS4ERR_WRONGSEC.

3.  NFSv4.1 Sessions

3.1  Sessions Background

3.1.1  Introduction to Sessions

   This draft proposes extensions to NFS version 4 [RFC3530] enabling
   it to support sessions and endpoint management, and to support
   operation atop RDMA-capable RPC over transports such as iWARP
   [RDMAP, DDP].  These extensions enable support for exactly-once
   semantics by NFSv4 servers, multipathing and trunking of transport
   connections, and enhanced security.  The ability to operate over
   RDMA enables greatly enhanced performance.  Operation over existing
   TCP is enhanced as well.

   While discussed here with respect to IETF-chartered transports, the
   proposed protocol is intended to function over other standards,
   such as Infiniband [IB].

   The following are the major aspects of this proposal:

   Changes are proposed within the framework of NFSv4 minor
   versioning.  RPC, XDR, and the NFSv4 procedures and operations are
   preserved.  The proposed extension functions equally well over
   existing transports and RDMA, and interoperates transparently with
   existing implementations, both at the local programmatic interface
   and over the wire.

   An explicit session is introduced to NFSv4, and new operations are
   added to support it.  The session allows for enhanced trunking,
   failover and recovery, and authentication efficiency, along with
   necessary support for RDMA.  The session is implemented as
   operations within NFSv4 COMPOUND and does not impact layering or
   interoperability with existing NFSv4 implementations.  The NFSv4
   callback channel is dynamically associated and is connected by the
   client and not the server, enhancing security and operation through
   firewalls.  In fact, the callback channel will be enabled to share
   the same connection as the operations channel.

   An enhanced RPC layer enables NFSv4 operation atop RDMA.  The
   session assists RDMA-mode connection, and additional facilities are
   provided for managing RDMA resources at both NFSv4 server and
   client.  Existing NFSv4 operations continue to function as before,
   though certain size limits are negotiated.  A companion draft to
   this document, "RDMA Transport for ONC RPC" [RPCRDMA], is to be
   referenced for details of RPC RDMA support.

   Support for exactly-once semantics ("EOS") is enabled by the new
   session facilities, by providing to the server a way to bound the
   size of the duplicate request cache for a single client, and to
   manage its persistent storage.

   Block Diagram

      +-----------------+-------------------------------------+
      | NFSv4           | NFSv4 + session extensions          |
      +-----------------+------+----------------+-------------+
      | Operations             | Session        |             |
      +------------------------+----------------+             |
      | RPC/XDR                                 |             |
      +-------------------------------+---------+             |
      | Stream Transport              | RDMA Transport        |
      +-------------------------------+-----------------------+

3.1.2  Motivation

   NFS version 4 [RFC3530] has been granted "Proposed Standard"
   status.  The NFSv4 protocol was developed along several design
   points, important among them: effective operation over wide-area
   networks, including the Internet itself; strong security integrated
   into the protocol; extensive cross-platform interoperability,
   including integrated locking semantics compatible with multiple
   operating systems; and protocol extensibility.

   The NFS version 4 protocol, however, does not provide support for
   certain important transport aspects.  For example, the protocol
   does not address response caching, which is required to provide
   correctness for retried client requests across a network partition,
   nor does it provide an interoperable way to support trunking and
   multipathing of connections.  This leads to inefficiencies,
   especially where trunking and multipathing are concerned, and
   presents additional difficulties in supporting RDMA fabrics, in
   which endpoints may require dedicated or specialized resources.
   Sessions can be employed to unify NFS-level constructs such as the
   clientid with transport-level constructs such as transport
   endpoints.  Each transport endpoint draws on resources via its
   membership in a session.  Resource management can be more strictly
   maintained, leading to greater server efficiency in implementing
   the protocol.  The enhanced operation over a session affords an
   opportunity to the server to implement a highly reliable duplicate
   request cache, and thereby export exactly-once semantics.

   NFSv4 advances the state of high-performance local sharing, by
   virtue of its integrated security, locking, and delegation, and its
   excellent coverage of the sharing semantics of multiple operating
   systems.  It is precisely this environment where exactly-once
   semantics become a fundamental requirement.

   Additionally, efforts to standardize a set of protocols for Remote
   Direct Memory Access (RDMA) over the Internet Protocol Suite have
   made significant progress.  RDMA is a general solution to the
   problem of CPU overhead incurred due to data copies, primarily at
   the receiver.  Substantial research has addressed this and has
   borne out the efficacy of the approach.  An overview of this is in
   the RDDP Problem Statement document [RDDPPS].

   Numerous upper layer protocols achieve extremely high bandwidth and
   low overhead through the use of RDMA.  Products from a wide variety
   of vendors employ RDMA to advantage, and prototypes have
   demonstrated the effectiveness of many more.  Here, we are
   concerned specifically with NFS and NFS-style upper layer
   protocols; examples from Network Appliance [DAFS, DCK+03], Fujitsu
   Prime Software Technologies [FJNFS, FJDAFS] and Harvard University
   [KM02] are all relevant.
   By layering a session binding for NFS version 4 directly atop a
   standard RDMA transport, a greatly enhanced level of performance
   and transparency can be supported on a wide variety of operating
   system platforms.  These combined capabilities alter the landscape
   between local filesystems and network attached storage, enable a
   new level of performance, and lead new classes of application to
   take advantage of NFS.

3.1.3  Problem Statement

   Two issues drive the current proposal: correctness and performance.
   Both are instances of "raising the bar" for NFS, whereby the desire
   to use NFS in new classes of applications can be accommodated by
   providing the basic features to make such use feasible.  Such
   applications include tightly coupled sharing environments such as
   cluster computing, high performance computing (HPC), and
   information processing such as databases.  These trends are
   explored in depth in [NFSPS].

   The first issue, correctness, is support for exactly-once
   semantics, a property taken for granted in local filesystems.  Such
   semantics have not been reliably available with NFS.  Server-based
   duplicate request caches [CJ89] help, but do not reliably provide
   strict correctness.  For the type of application which is expected
   to make extensive use of the high-performance RDMA-enabled
   environment, the reliable provision of such semantics is a
   fundamental requirement.

   Introduction of a session to NFSv4 will address these issues.  With
   higher performance and enhanced semantics comes the problem of
   enabling advanced endpoint management, for example high-speed
   trunking, multipathing and failover.  These characteristics enable
   availability and performance.  RFC3530 presents some issues in
   permitting a single clientid to access a server over multiple
   connections.

   A second issue encountered in common by NFS implementations is the
   CPU overhead required to implement the protocol.  Primary among the
   sources of this overhead is the movement of data from NFS protocol
   messages to its eventual destination in user buffers or aligned
   kernel buffers.  The data copies consume system bus bandwidth and
   CPU time, reducing the available system capacity for applications
   [RDDPPS].  Achieving zero-copy with NFS has to date required
   sophisticated "header cracking" hardware and/or extensive platform-
   specific virtual memory mapping tricks.

   Combined in this way, NFSv4, RDMA and the emerging high-speed
   network fabrics will enable delivery of performance which matches
   that of the fastest local filesystems, preserving the key existing
   local filesystem semantics, while enhancing them by providing
   network filesystem sharing semantics.

   RDMA implementations generally have other interesting properties,
   such as hardware-assisted protocol access, and support for user
   space access to I/O.  RDMA is compelling here for another reason:
   hardware-offloaded networking support in itself does not avoid data
   copies without resorting to implementing part of the NFS protocol
   in the NIC.  Support of RDMA by NFS enables the highest performance
   at the architecture level rather than by implementation; this
   enables ubiquitous and interoperable solutions.
   By providing file access performance equivalent to that of local
   file systems, NFSv4 over RDMA will enable applications running on a
   set of client machines to interact through an NFSv4 file system,
   just as applications running on a single machine might interact
   through a local file system.

   This raises the issue of whether additional protocol enhancements
   to enable such interaction would be desirable, and what such
   enhancements would be.  This is a complicated issue which the
   working group needs to address and will not be further discussed in
   this document.

3.1.4  NFSv4 Session Extension Characteristics

   This draft will present a solution based upon minor versioning of
   NFSv4.  It will introduce a session to collect transport endpoints
   and resources such as reply caching, which in turn enables
   enhancements such as trunking, failover and recovery.  It will
   describe use of RDMA by employing support within an underlying RPC
   layer [RPCRDMA].  Most importantly, it will focus on making the
   best possible use of an RDMA transport.

   These extensions are proposed as elements of a new minor revision
   of NFS version 4.  In this draft, NFS version 4 will be referred to
   generically as "NFSv4" when describing properties common to all
   minor versions.  When referring specifically to properties of the
   original, minor version 0 protocol, "NFSv4.0" will be used, and
   changes proposed here for minor version 1 will be referred to as
   "NFSv4.1".

   This draft proposes only changes which are strictly upward-
   compatible with existing RPC and NFS Application Programming
   Interfaces (APIs).

3.2  Transport Issues

   The Transport Issues section of the document explores the details
   of utilizing the various supported transports.

3.2.1  Session Model

   The first and most evident issue in supporting diverse transports
   is how to provide for their differences.  This draft proposes
   introducing an explicit session.

   A session introduces minimal protocol requirements, and provides
   for a highly useful and convenient way to manage numerous endpoint-
   related issues.  The session is a local construct; it represents a
   named, higher-layer object to which connections can refer, and
   encapsulates properties important to each associated client.

   A session is a dynamically created, long-lived server object
   created by a client, used over time from one or more transport
   connections.  Its function is to maintain the server's state
   relative to the connection(s) belonging to a client instance.  This
   state is entirely independent of the connection itself.  The
   session in effect becomes the object representing an active client
   on a connection or set of connections.

   Clients may create multiple sessions for a single clientid, and may
   wish to do so for optimization of transport resources, buffers, or
   server behavior.  A session could be created by the client to
   represent a single mount point, for separate read and write
   "channels", or for any number of other client-selected parameters.

   The session enables several things immediately.  Clients may
   disconnect and reconnect (voluntarily or not) without loss of
   context at the server.  (Of course, locks, delegations and related
   associations require special handling, and generally expire in the
   extended absence of an open connection.)  Clients may connect
   multiple transport endpoints to this common state.
   The endpoints may have all the same attributes, for instance when
   trunked on multiple physical network links for bandwidth
   aggregation or path failover.  Or, the endpoints can have specific,
   special-purpose attributes such as callback channels.

   The NFSv4 specification does not provide for any form of flow
   control; instead it relies on the windowing provided by TCP to
   throttle requests.  This unfortunately does not work with RDMA,
   which in general provides no operation flow control and will
   terminate a connection in error when limits are exceeded.  Limits
   are therefore exchanged when a session is created; these limits
   then provide maxima within which each session's connections must
   operate, and they are managed within these limits as described in
   [RPCRDMA].  The limits may also be modified dynamically at the
   server's choosing by manipulating certain parameters present in
   each NFSv4.1 request.

   The presence of a maximum request limit on the session bounds the
   requirements of the duplicate request cache.  This can be used to
   advantage by a server, which can accurately determine any storage
   needs and enable it to maintain duplicate request cache persistence
   and to provide reliable exactly-once semantics.

   Finally, given adequate connection-oriented transport security
   semantics, authentication and authorization may be cached on a per-
   session basis, enabling greater efficiency in the issuing and
   processing of requests on both client and server.  A proposal for
   transparent, server-driven implementation of this in NFSv4 has been
   made [CCM].  The existence of the session greatly facilitates the
   implementation of this approach.  This is discussed in detail in
   the Authentication Efficiencies section later in this draft.

3.2.2  Connection State

   In RFC3530, the combination of a connected transport endpoint and a
   clientid forms the basis of connection state.  While this has been
   made to work with certain limitations, there are difficulties in
   correct and robust implementation.  The NFSv4.0 protocol must
   provide a server-initiated connection for the callback channel, and
   must carefully specify the persistence of client state at the
   server in the face of transport interruptions.  The server has only
   the client's transport address binding (the IP 4-tuple) to identify
   the client RPC transaction stream and to use as a lookup tag on the
   duplicate request cache.  (A useful overview of this is in [RW96].)
   If the server listens on multiple addresses, and the client
   connects to more than one, it must employ different clientids on
   each, negating its ability to aggregate bandwidth and redundancy.
   In effect, each transport connection is used as the server's
   representation of client state.  But transport connections are
   potentially fragile and transitory.

   In this proposal, a session identifier is assigned by the server
   upon initial session negotiation on each connection.  This
   identifier is used to associate additional connections, to
   renegotiate after a reconnect, to provide an abstraction for the
   various session properties, and to address the duplicate request
   cache.  No transport-specific information is used in the duplicate
   request cache implementation of an NFSv4.1 server, nor in fact is
   the RPC XID itself.  The session identifier is unique within the
   server's scope and may be subject to certain server policies, such
   as being bounded in time.
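   The shift away from transport-derived state can be illustrated by
   contrasting the lookup keys a server might use for its duplicate
   request cache.  The structures below are an illustrative sketch,
   not this draft's XDR; the slot and sequence fields anticipate the
   mechanism of Section 3.10.2:

      #include <stdint.h>

      /* NFSv4.0-era cache key: tied to the transport and RPC XID. */
      struct drc_key_v40 {
          uint32_t client_ip;      /* fragment of the IP 4-tuple */
          uint16_t client_port;
          uint32_t rpc_xid;
      };

      /* NFSv4.1 cache key: derived entirely from session state. */
      struct drc_key_v41 {
          uint8_t  sessionid[16];  /* server-assigned, long-lived */
          uint32_t slotid;         /* bounded by the request limit */
          uint32_t sequenceid;     /* detects retransmission */
      };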
   It is envisioned that the primary transport model will be
   connection oriented.  Connection orientation brings with it certain
   potential optimizations, such as caching of per-connection
   properties, which are easily leveraged through the generality of
   the session.  However, it is possible that in the future, other
   transport models could be accommodated below the session
   abstraction.

3.2.3  NFSv4 Channels, Sessions and Connections

   There are at least two types of NFSv4 channels: the "operations"
   channel used for ordinary requests from client to server, and the
   "back" channel, used for callback requests from server to client.

   As mentioned above, different NFSv4 operations on these channels
   can lead to different resource needs.  For example, server callback
   operations (CB_RECALL) are specific, small messages which flow from
   server to client at arbitrary times, while data transfers such as
   read and write have very different sizes and asymmetric behaviors.
   It is sometimes impractical for the RDMA peers (NFSv4 client and
   NFSv4 server) to post buffers for these various operations on a
   single connection.  Commingling of requests with responses at the
   client receive queue is particularly troublesome, due both to the
   need to manage both solicited and unsolicited completions, and to
   provision buffers for both purposes.  Due to the lack of any
   ordering of callback requests versus response arrivals, without any
   other mechanisms, the client would be forced to allocate all
   buffers sized to the worst case.

   The callback requests are likely to be handled by a different task
   context from that handling the responses.  Significant
   demultiplexing and thread management may be required if both are
   received on the same queue.  However, if callbacks are relatively
   rare (perhaps due to client access patterns), many of these
   difficulties can be minimized.

   Also, the client may wish to perform trunking of operations channel
   requests for performance reasons, or multipathing for availability.
   This proposal permits both, as well as many other session and
   connection possibilities, by permitting each operation to carry
   session membership information and to share session (and clientid)
   state in order to draw upon the appropriate resources.  For
   example, reads and writes may be assigned to specific, optimized
   connections, or sorted and separated by any or all of size,
   idempotency, etc.

   To address the problems described above, this proposal allows
   multiple sessions to share a clientid, as well as for multiple
   connections to share a session.

   Single Connection model:

                     NFSv4.1 Session
                    /               \
        Operations_Channel    [Back_Channel]
                    \               /
                      Connection
                          |

   Multi-connection trunked model (2 operations channels shown):

                     NFSv4.1 Session
                    /               \
         Operations_Channels    [Back_Channel]
          |          |               |
      Connection  Connection    [Connection]
          |          |               |

   Multi-connection split-use model (2 mounts shown):

                     NFSv4.1 Session
                    /               \
            (/home)          (/usr/local - readonly)
           /       \                  |
   Operations_Channel  [Back_Channel] |
          |                  |        Operations_Channel
      Connection       [Connection]        |
          |                  |         Connection
                                           |
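   One way to read the preceding models is as a containment hierarchy.
   The following C declarations are a minimal sketch of that
   hierarchy, assuming illustrative names throughout (they correspond
   to no XDR in this draft):

      #include <stdint.h>

      /* One transport endpoint: a TCP connection or RDMA queue pair. */
      struct nfs41_connection {
          int                      fd;     /* or an RDMA endpoint handle */
          struct nfs41_connection *next;   /* trunked peers on the channel */
      };

      enum nfs41_channel_type { CH_OPERATIONS, CH_BACK };

      /* A channel groups the connections used for one purpose. */
      struct nfs41_channel {
          enum nfs41_channel_type  type;
          struct nfs41_connection *conns;
      };

      /* The session holds per-client server state, independent of any
       * particular connection; many sessions may share one clientid. */
      struct nfs41_session {
          uint8_t              sessionid[16];
          uint64_t             clientid;
          struct nfs41_channel operations;
          struct nfs41_channel back;        /* optional */
          void                *reply_cache; /* see Section 3.2.5 */
      };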
   In this way, implementation as well as resource management may be
   optimized.  Each session will have its own response caching and
   buffering, and each connection or channel will have its own
   transport resources, as appropriate.  Clients which do not require
   certain behaviors may optimize such resources away completely, by
   using specific sessions and not even creating the additional
   channels and connections.

3.2.4  Reconnection, Trunking and Failover

   Reconnection after failure references stored state on the server
   associated with lease recovery during the grace period.  The
   session provides a convenient handle for storing and managing
   information regarding the client's previous state on a per-
   connection basis, e.g. to be used upon reconnection.  Reconnection
   to a previously existing session, and its stored resources, are
   covered in the "Connection Models" section below.

   One important aspect of reconnection is that of RPC library
   support.  Traditionally, an Upper Layer RPC-based Protocol such as
   NFS leaves all transport knowledge to the RPC layer implementation
   below it.  This allows NFS to operate over a wide variety of
   transports and has proven to be a highly successful approach.  The
   session, however, introduces an abstraction which is, in a way,
   "between" RPC and NFSv4.1.  It is important that the session
   abstraction not have ramifications within the RPC layer.

   One such issue arises within the reconnection logic of RPC.
   Previously, an explicit session binding operation, which
   established session context for each new connection, was explored.
   This however required that the session binding also be performed
   during reconnect, which in turn required an RPC request.  This
   additional request requires new RPC semantics, both in
   implementation and in the fact that a new request is inserted into
   the RPC stream.  Also, the binding of a connection to a session
   required the upper layer to become "aware" of connections,
   something the RPC layer architecturally abstracts away.  Therefore
   the session binding is not handled in connection scope but is
   instead explicitly carried in each request.

   For Reliability, Availability and Serviceability (RAS) issues such
   as bandwidth aggregation and multipathing, clients frequently seek
   to make multiple connections through multiple logical or physical
   channels.  The session is a convenient point to aggregate and
   manage these resources.
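   Carrying session membership in each request can be pictured as a
   small header at the front of every COMPOUND; in this draft that
   role is played by the SEQUENCE operation (Section 14.8).  The
   argument structure below is an illustrative sketch of what such a
   header conveys, not this draft's XDR:

      #include <stdint.h>

      /* Illustrative arguments carried at the front of each request. */
      struct sequence_args {
          uint8_t  sessionid[16];  /* which session this request uses */
          uint32_t slotid;         /* index into the reply cache */
          uint32_t sequenceid;     /* per-slot retransmission detector */
          uint32_t highest_slot;   /* lets the server trim resources */
      };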
3.2.5  Server Duplicate Request Cache

   Server duplicate request caches, while not a part of an NFS
   protocol, have become a standard, even required, part of any NFS
   implementation.  First described in [CJ89], the duplicate request
   cache was initially found to reduce work at the server by avoiding
   duplicate processing for retransmitted requests.  A second and, in
   the long run, more important benefit was improved correctness, as
   the cache prevented certain destructive non-idempotent requests
   from being reinvoked.

   However, such caches do not provide correctness guarantees; they
   cannot be managed in a reliable, persistent fashion.  The reason is
   understandable: their storage requirement is unbounded due to the
   lack of any such bound in the NFS protocol, and they are dependent
   on transport addresses for request matching.

   As proposed in this draft, the presence of maximum request count
   limits and negotiated maximum sizes allows the size and duration of
   the cache to be bounded and, coupled with a long-lived session
   identifier, enables its persistent storage on a per-session basis.

   This provides a single unified mechanism which provides the
   following guarantees required in the NFSv4 specification, while
   extending them to all requests, rather than limiting them only to a
   subset of state-related requests:

      "It is critical the server maintain the last response sent to
      the client to provide a more reliable cache of duplicate non-
      idempotent requests than that of the traditional cache described
      in [CJ89]..."  [RFC3530]

   The maximum request count limit is the count of active operations,
   which bounds the number of entries in the cache.  Constraining the
   size of operations additionally serves to limit the required
   storage to the product of the current maximum request count and the
   maximum response size.  This storage requirement enables server-
   side efficiencies.

   Session negotiation allows the server to maintain other state.  An
   NFSv4.1 client invoking the session destroy operation will cause
   the server to denegotiate (close) the session, allowing the server
   to deallocate cache entries.  Clients can potentially specify that
   such caches not be kept for appropriate types of sessions (for
   example, read-only sessions).  This can enable more efficient
   server operation resulting in improved response times, and more
   efficient sizing of buffers and response caches.

   Similarly, it is important for the client to explicitly learn
   whether the server is able to implement reliable semantics.
   Knowledge of whether these semantics are in force is critical for a
   highly reliable client, one which must provide transactional
   integrity guarantees.  When clients request that the semantics be
   enabled for a given session, the session reply must inform the
   client if the mode is in fact enabled.  In this way the client can
   confidently proceed with operations without having to implement
   consistency facilities of its own.
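   A minimal sketch of the bounded cache this enables follows, in the
   spirit of the slot mechanism described later in the draft (Section
   3.10.2); the structure layout, fixed reply buffer, and function
   names are assumptions for illustration:

      #include <stdint.h>
      #include <string.h>

      #define MAX_REPLY 8192        /* negotiated max response size */

      struct drc_slot {
          uint32_t seqid;           /* last sequence seen on this slot */
          uint32_t reply_len;
          uint8_t  reply[MAX_REPLY];
      };

      struct reply_cache {
          struct drc_slot *slots;   /* one per negotiated request */
          uint32_t         nslots;  /* == maximum request count */
      };

      /* Return the cached reply for a retransmission, NULL otherwise. */
      static const struct drc_slot *
      drc_lookup(const struct reply_cache *rc, uint32_t slot,
                 uint32_t seqid)
      {
          if (slot >= rc->nslots)
              return NULL;              /* protocol error in a real server */
          if (rc->slots[slot].seqid == seqid)
              return &rc->slots[slot];  /* duplicate: replay the reply */
          return NULL;                  /* new request: execute and cache */
      }

      static void
      drc_store(struct reply_cache *rc, uint32_t slot, uint32_t seqid,
                const void *reply, uint32_t len)
      {
          struct drc_slot *s = &rc->slots[slot];

          s->seqid = seqid;
          s->reply_len = len > MAX_REPLY ? MAX_REPLY : len;
          memcpy(s->reply, reply, s->reply_len);
      }

   The storage bound noted above falls out directly: the cache can
   never exceed nslots * sizeof(struct drc_slot).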
3.3  Session Initialization and Transfer Models

   Session initialization issues, and data transfer models relevant to
   both TCP and RDMA, are discussed in this section.

3.3.1  Session Negotiation

   The following parameters are exchanged between client and server at
   session creation time.  Their values allow the server to properly
   size resources allocated in order to service the client's requests,
   and to provide the server with a way to communicate limits to the
   client for proper and optimal operation.  They are exchanged prior
   to all session-related activity, over any transport type.
   Discussion of their use is found in their descriptions as well as
   throughout this section.

   Maximum Requests

      The client's desired maximum number of concurrent requests is
      passed, in order to allow the server to size its reply cache
      storage.  The server may modify the client's requested limit
      downward (or upward) to match its local policy and/or resources.
      Over RDMA-capable RPC transports, the per-request management of
      low-level transport message credits is handled within the RPC
      layer [RPCRDMA].

   Maximum Request/Response Sizes

      The maximum request and response sizes are exchanged in order to
      permit allocation of appropriately sized buffers and request
      cache entries.  The sizes must allow for certain protocol
      minima, allowing the receipt of maximally sized operations (e.g.
      RENAME requests, which contain two name strings).  Note that the
      maximum request/response sizes cover the entire request/response
      message and not simply the data payload, as with the traditional
      NFS maximum read or write sizes.  Also note that the server
      implementation may not, and in fact probably does not, require
      the reply cache entries to be sized as large as the maximum
      response.  The server may reduce the client's requested sizes.

   Inline Padding/Alignment

      The server can inform the client of any padding which can be
      used to deliver NFSv4 inline WRITE payloads into aligned
      buffers.  Such alignment can be used to avoid data copy
      operations at the server for both TCP and inline RDMA transfers.
      For RDMA, the client informs the server in each operation when
      padding has been applied [RPCRDMA].

   Transport Attributes

      A placeholder for transport-specific attributes is provided,
      with a format to be determined.  Possible examples of
      information to be passed in this parameter include transport
      security attributes to be used on the connection, RDMA-specific
      attributes, legacy "private data" as used on existing RDMA
      fabrics, and transport Quality of Service attributes.  This
      information is to be passed to the peer's transport layer by
      local means which are currently outside the scope of this draft;
      however, one attribute is provided in the RDMA case:

      RDMA Read Resources

         RDMA implementations must explicitly provision resources to
         support RDMA Read requests from connected peers.  These
         values must be explicitly specified, to provide adequate
         resources for matching the peer's expected needs and the
         connection's delay-bandwidth parameters.  The client provides
         its chosen value to the server in the initial session
         creation; the value must be provided in each client RDMA
         endpoint.  The values are asymmetric and should be set to
         zero at the server in order to conserve RDMA resources, since
         clients do not issue RDMA Read operations in this proposal.
         The result is communicated in the session response, to permit
         matching of values across the connection.  The value may not
         be changed for the duration of the session, although a new
         value may be requested as part of a new session.
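   Gathering these parameters into one place, the session-creation
   exchange might carry something like the following; the field and
   structure names are illustrative assumptions, not this draft's XDR
   for CREATESESSION:

      #include <stdint.h>

      struct session_limits {
          uint32_t max_requests;      /* concurrent requests (cache bound) */
          uint32_t max_request_size;  /* whole COMPOUND, not just payload */
          uint32_t max_response_size;
          uint32_t write_padding;     /* inline WRITE alignment hint */
          uint32_t rdma_read_credits; /* asymmetric; zero at the server */
          /* opaque transport attributes would follow */
      };

      /* The client proposes; the server replies with adjusted values
       * that both sides then honor for the life of the session. */
      struct create_session_res {
          uint8_t               sessionid[16];
          struct session_limits granted;
      };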
3.3.2  RDMA Requirements

   A complete discussion of the operation of RPC-based protocols atop
   RDMA transports is in [RPCRDMA].  Where RDMA is considered, this
   proposal assumes the use of such a layering; it addresses only the
   upper layer issues relevant to making best use of RPC/RDMA.

   A connection-oriented (reliable, sequenced) RDMA transport will be
   required.  There are several reasons for this.  First, this model
   most closely reflects the general NFSv4 requirement of long-lived
   and congestion-controlled transports.  Second, to operate correctly
   over either an unreliable or unsequenced RDMA transport, or both,
   would require significant complexity in the implementation and
   protocol not appropriate for a strict minor version.  For example,
   retransmission on connected endpoints is explicitly disallowed in
   the current NFSv4 draft; it would again be required with these
   alternate transport characteristics.  Third, the proposal assumes a
   specific RDMA ordering semantic, which presents the same set of
   ordering and reliability issues to the RDMA layer over such
   transports.

   The RDMA implementation provides for making connections to other
   RDMA-capable peers.  In the case of the current proposals before
   the RDDP working group, these RDMA connections are preceded by a
   "streaming" phase, where ordinary TCP (or NFS) traffic might flow.
   However, this is not assumed here, and sizes and other parameters
   are explicitly exchanged upon a session entering RDMA mode.

3.3.3  RDMA Connection Resources

   On transport endpoints which support automatic RDMA mode, that is,
   endpoints which are created in the RDMA-enabled state, a single,
   preposted buffer must initially be provided by both peers, and the
   client session negotiation must be the first exchange.

   On transport endpoints supporting dynamic negotiation, a more
   sophisticated negotiation is possible, but is not discussed in the
   current draft.

   RDMA imposes several requirements on upper layer consumers.
   Registration of memory and the need to post buffers of a specific
   size and number for receive operations are a primary consideration.

   Registration of memory can be a relatively high-overhead operation,
   since it requires pinning of buffers, assignment of attributes
   (e.g. readable/writable), and initialization of hardware
   translation.  Preregistration is desirable to reduce overhead.
   These registrations are specific to hardware interfaces and even to
   RDMA connection endpoints; therefore, negotiation of their limits
   is desirable to manage resources effectively.

   Following the basic registration, these buffers must be posted by
   the RPC layer to handle receives.  These buffers remain in use by
   the RPC/NFSv4 implementation; the size and number of them must be
   known to the remote peer in order to avoid RDMA errors which would
   cause a fatal error on the RDMA connection.

   The session provides a natural way for the server to manage
   resource allocation to each client rather than to each transport
   connection itself.  This enables considerable flexibility in the
   administration of transport endpoints.
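   The register-then-post discipline can be sketched as follows.  The
   rdma_register() and rdma_post_recv() calls are hypothetical stand-
   ins for a real verbs-style API (e.g. ibv_reg_mr() and
   ibv_post_recv()); everything here is an illustrative assumption:

      #include <stdint.h>
      #include <stdlib.h>

      struct recv_buf {
          void  *base;
          size_t len;
          void  *mr;     /* registration (memory region) handle */
      };

      /* Hypothetical verbs-style wrappers, declared but not defined. */
      void *rdma_register(void *addr, size_t len);
      int   rdma_post_recv(struct recv_buf *b);

      /* Pre-post one receive per advertised credit, each of the
       * negotiated maximum message size, before the peer may send. */
      static int provision_receives(struct recv_buf *bufs,
                                    uint32_t credits, size_t max_msg)
      {
          for (uint32_t i = 0; i < credits; i++) {
              bufs[i].base = malloc(max_msg);
              if (bufs[i].base == NULL)
                  return -1;
              bufs[i].len = max_msg;
              bufs[i].mr  = rdma_register(bufs[i].base, max_msg);
              rdma_post_recv(&bufs[i]);
          }
          return 0;
      }

   Underprovisioning here is fatal: a Send arriving with no posted
   receive terminates the RDMA connection, which is why the credit and
   size values are negotiated up front rather than discovered.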
3.3.4  TCP and RDMA Inline Transfer Model

   The basic transfer model for both TCP and RDMA is referred to as
   "inline".  For TCP, this is the only transfer model supported,
   since TCP carries both the RPC header and data together in the data
   stream.

   For RDMA, the RDMA Send transfer model is used for all NFS requests
   and replies, but data is optionally carried by RDMA Writes or RDMA
   Reads.  Use of Sends is required to ensure consistency of data and
   to deliver completion notifications.  The pure-Send method is
   typically used where the data payload is small, or where for
   whatever reason target memory for RDMA is not available.

   Inline message exchange

       Client                                       Server
              :        Request                    :
        Send  : ------------------------------>   : untagged
              :                                   : buffer
              :        Response                   :
    untagged  : <------------------------------   : Send
      buffer  :                                   :

       Client                                       Server
              :        Read request               :
        Send  : ------------------------------>   : untagged
              :                                   : buffer
              :        Read response with data    :
    untagged  : <------------------------------   : Send
      buffer  :                                   :

       Client                                       Server
              :        Write request with data    :
        Send  : ------------------------------>   : untagged
              :                                   : buffer
              :        Write response             :
    untagged  : <------------------------------   : Send
      buffer  :                                   :

   Responses must be sent to the client on the same connection on
   which the request was sent.  It is important that the server not
   assume any specific client implementation, in particular whether
   connections within a session share any state at the client.  This
   is also important to preserve the ordering of RDMA operations, and
   especially RDMA consistency.  Additionally, it ensures that the RPC
   RDMA layer makes no requirement of the RDMA provider to open its
   memory registration handles (Steering Tags) beyond the scope of a
   single RDMA connection.  This is an important security
   consideration.

   Two values must be known to each peer prior to issuing Sends: the
   maximum number of sends which may be posted, and their maximum
   size.  These values are referred to, respectively, as the message
   credits and the maximum message size.  While the message credits
   might vary dynamically over the duration of the session, the
   maximum message size does not.  The server must commit to
   preserving this number of duplicate request cache entries, and to
   preparing a number of receive buffers equal to or greater than its
   currently advertised credit value, each of the advertised size.
   This ensures that sufficient transport resources are allocated to
   receive the full advertised limits.

   Note that the server must post the maximum number of session
   requests to each client operations channel.  The client is not
   required to spread its requests in any particular fashion across
   connections within a session.  If the client wishes, it may create
   multiple sessions, each with a single or small number of operations
   channels, to provide the server with this resource advantage.  Or,
   over RDMA the server may employ a "shared receive queue".  The
   server can in any case protect its resources by restricting the
   client's request credits.
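   A client-side sketch of this credit discipline follows; the helper
   post_receive_buffer() is a hypothetical stand-in for the transport
   posting described above:

      #include <stdint.h>

      struct channel_credits {
          uint32_t limit;      /* server-granted message credits */
          uint32_t in_flight;  /* sent but not yet answered */
      };

      void post_receive_buffer(void);   /* hypothetical helper */

      /* Returns 0 when the request may be sent, -1 when the client
       * must first wait for a reply to free a credit. */
      static int send_request(struct channel_credits *cc)
      {
          if (cc->in_flight >= cc->limit)
              return -1;             /* would overrun peer's receives */
          post_receive_buffer();     /* reply buffer posted pre-Send */
          cc->in_flight++;
          /* ... RDMA Send (or TCP write) of the request here ... */
          return 0;
      }

      static void on_reply(struct channel_credits *cc)
      {
          cc->in_flight--;           /* the reply returns the credit */
      }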
The limits are chosen
1010 based upon the expected needs and capabilities of the client and
1011 server, and are in fact arbitrary. Sizes may be specified by the
1012 client as zero (requesting the server's preferred or optimal value),
1013 and request limits may be chosen in proportion to the client's
1014 capabilities. For example, a limit of 1000 allows 1000 requests to
1015 be in progress, which may generally be far more than adequate to keep
1016 local networks and servers fully utilized.

1018 Both client and server have independent sizes and buffering, but over
1019 RDMA fabrics client credits are easily managed by posting a receive
1020 buffer prior to sending each request. Each such buffer will not
1021 necessarily be completed with the corresponding reply, since responses
1022 from NFSv4 servers arrive in arbitrary order. When an operations
1023 channel is also used for callbacks, the client must account for
1024 callback requests by posting additional buffers. Note that
1025 implementation-specific facilities such as a shared receive queue may
1026 also allow optimization of these allocations.

1028 When a session is created, the client requests a preferred buffer
1029 size, and the server provides its answer. The server posts all
1030 buffers of at least this size. The client must comply by not sending
1031 requests greater than this size. It is recommended that server
1032 implementations do all they can to accommodate a useful range of
1033 possible client requests. There is a provision in [RPCRDMA] to allow
1034 the sending of client requests which exceed the server's receive
1035 buffer size, but it requires the server to "pull" the client's
1036 request as a "read chunk" via RDMA Read. This introduces at least
1037 one additional network roundtrip, plus other overhead such as
1038 registering memory for RDMA Read at the client and additional RDMA
1039 operations at the server, and is to be avoided.

1041 An issue therefore arises when considering the NFSv4 COMPOUND
1042 procedures. Since an arbitrary number (total size) of operations can
1043 be specified in a single COMPOUND procedure, its size is effectively
1044 unbounded. This cannot be supported by RDMA Sends, and therefore
1045 this size negotiation places a restriction on the construction and
1046 maximum size of both COMPOUND requests and responses. If a COMPOUND
1047 results in a reply at the server that is larger than can be sent in
1048 an RDMA Send to the client, then the COMPOUND must terminate and the
1049 operation that causes the overflow will return a TOOSMALL error
1050 status result.

1052 3.3.5 RDMA Direct Transfer Model

1054 Placement of data by explicitly tagged RDMA operations is referred to
1055 as "direct" transfer. This method is typically used where the data
1056 payload is relatively large, that is, when RDMA setup has been
1057 performed prior to the operation, or when any overhead for setting up
1058 and performing the transfer is regained by avoiding the overhead of
1059 processing an ordinary receive.

1061 The client advertises RDMA buffers in this proposed model, and not
1062 the server. This means the "XDR Decoding with Read Chunks" described
1063 in [RPCRDMA] is not employed by NFSv4.1 replies, and instead all
1064 results transferred via RDMA to the client employ "XDR Decoding with
1065 Write Chunks". There are several reasons for this.

1067 First, it allows for a correct and secure mode of transfer.
The
1068 client may advertise specific memory buffers only during specific
1069 times, and may revoke access when it pleases. The server is not
1070 required to expose copies of local file buffers for individual
1071 clients, or to lock or copy them for each client access.

1073 Second, client credits based on fixed-size request buffers are easily
1074 managed on the server, but for the server additional management of
1075 buffers for client RDMA Reads is not well-bounded. For example, the
1076 client may not perform these RDMA Read operations in a timely
1077 fashion; therefore, the server would have to protect itself against
1078 denial-of-service on these resources.

1080 Third, it reduces network traffic, since buffer exposure outside the
1081 scope and duration of a single request/response exchange necessitates
1082 additional memory management exchanges.

1084 There are costs associated with this decision. Primary among them is
1085 the need for the server to employ RDMA Read for operations such as
1086 large WRITE. The RDMA Read operation is a two-way exchange at the
1087 RDMA layer, which incurs additional overhead relative to RDMA Write.
1088 Additionally, RDMA Read requires resources at the data source (the
1089 client in this proposal) to maintain state and to generate replies.
1090 These costs are overcome through use of pipelining with credits, with
1091 sufficient RDMA Read resources negotiated at session initiation, and
1092 appropriate use of RDMA for writes by the client - for example only
1093 for transfers above a certain size.

1095 A description of which NFSv4 operation results are eligible for data
1096 transfer via RDMA Write is in [NFSDDP]. There are only two such
1097 operations: READ and READLINK. When XDR encoding these requests on
1098 an RDMA transport, the NFSv4.1 client must insert the appropriate
1099 xdr_write_list entries to indicate to the server whether the results
1100 should be transferred via RDMA or inline with a Send. As described
1101 in [NFSDDP], a zero-length write chunk is used to indicate an inline
1102 result. In this way, it is unnecessary to create new operations for
1103 RDMA-mode versions of READ and READLINK.

1105 Another tool to avoid creation of new, RDMA-mode operations is the
1106 Reply Chunk [RPCRDMA], which is used by RPC in RDMA mode to return
1107 large replies via RDMA as if they were inline. Reply chunks are used
1108 for operations such as READDIR, which returns large amounts of
1109 information, but in many small XDR segments. Reply chunks are
1110 offered by the client and the server can use them in preference to
1111 inline. Reply chunks are transparent to upper layers such as NFSv4.

1113 In the very rare cases where another NFSv4.1 operation requires
1114 larger buffers than were negotiated when the session was created (for
1115 example extraordinarily large RENAMEs), the underlying RPC layer may
1116 support the use of "Message as an RDMA Read Chunk" and "RDMA Write of
1117 Long Replies" as described in [RPCRDMA]. No additional support is
1118 required in the NFSv4.1 client for this. The client should be
1119 certain that its requested buffer sizes are not so small as to make
1120 this a frequent occurrence, however.

1122 All operations are initiated by a Send, and are completed with a
1123 Send. This is exactly as in conventional NFSv4, but under RDMA has a
1124 significant purpose: RDMA operations are not complete, that is,
1125 guaranteed consistent, at the data sink until followed by a
1126 successful Send completion (i.e.
a receive). These events provide a
1127 natural opportunity for the initiator (client) to enable and later
1128 disable RDMA access to the memory which is the target of each
1129 operation, in order to provide for consistent and secure operation.
1130 The RDMAP Send with Invalidate operation may be worth employing in
1131 this respect, as it relieves the client of certain overhead in this
1132 case.

1134 A "onetime" boolean advisory attached to each RDMA region might
1135 become a hint to the server that the client will use the three-tuple
1136 for only one NFSv4 operation. For a transport such as iWARP, the
1137 server can assist the client in invalidating the three-tuple by
1138 performing a Send with Solicited Event and Invalidate. The server
1139 may ignore this hint, in which case the client must perform a local
1140 invalidate after receiving the indication from the server that the
1141 NFSv4 operation is complete. This may be considered in a future
1142 version of this draft and [NFSDDP].

1144 In a trusted environment, it may be desirable for the client to
1145 persistently enable RDMA access by the server. Such a model is
1146 desirable for the highest level of efficiency and lowest overhead.

1148 RDMA message exchanges

1150      Client                                Server
1151               : Direct Read Request :
1152      Send     : ------------------------------> : untagged
1153               :                                 : buffer
1154               : Segment :
1155      tagged   : <------------------------------ : RDMA Write
1156      buffer   :                                 : :
1157               : [Segment] :
1158      tagged   : <------------------------------ : [RDMA Write]
1159      buffer   :                                 :
1160               : Direct Read Response :
1161      untagged : <------------------------------ : Send (w/Inv.)
1162      buffer   :                                 :

1164      Client                                Server
1165               : Direct Write Request :
1166      Send     : ------------------------------> : untagged
1167               :                                 : buffer
1168               : Segment :
1169      tagged   : v------------------------------ : RDMA Read
1170      buffer   : +-----------------------------> :
1171               :                                 : :
1172               : [Segment] :
1173      tagged   : v------------------------------ : [RDMA Read]
1174      buffer   : +-----------------------------> :
1175               :                                 :
1176               : Direct Write Response :
1177      untagged : <------------------------------ : Send (w/Inv.)
1178      buffer   :                                 :

1180 3.4 Connection Models

1182 There are three scenarios in which to discuss the connection model.
1183 Each will be discussed individually, after describing the common case
1184 encountered at initial connection establishment.

1186 After a successful connection, the first request proceeds, in the
1187 case of a new client association, to initial session creation, and
1188 then optionally to session callback channel binding, prior to regular
1189 operation.

1191 Commonly, each new client "mount" will be the action which drives
1192 creation of a new session. However, there are any number of other
1193 possible approaches. Clients may choose to share a single connection
1194 and session among all their mount points. Or, clients may support
1195 trunking, where additional connections are created but all within a
1196 single session. Alternatively, the client may choose to create
1197 multiple sessions, each tuned to the buffering and reliability needs
1198 of the mount point. For example, a readonly mount can sharply reduce
1199 its write buffering, and also need not require the server to support
1200 reliable duplicate request caching.

1202 Similarly, the client can choose among several strategies for
1203 clientid usage. Sessions can share a single clientid, or create new
1204 clientids as the client deems appropriate.
For kernel-based clients
1205 which service multiple authenticated users, a single clientid shared
1206 across all mount points is generally the most appropriate and
1207 flexible approach. For example, all the client's file operations may
1208 need to share locking state, with the local client kernel taking
1209 responsibility for arbitrating access locally. For clients choosing
1210 to support other authentication models, for example userspace
1211 implementations, a new clientid is indicated. Through use of session
1212 create options, both models are supported at the client's choice.

1214 Since the session is explicitly created and destroyed by the client,
1215 and each client is uniquely identified, the server may be
1216 specifically instructed to discard unneeded persistent state. For
1217 this reason, it is possible for a server to retain any previous
1218 state indefinitely, and place its destruction under administrative
1219 control. Or, a server may choose to retain state for some
1220 configurable period, provided that the period meets other NFSv4
1221 requirements such as lease reclamation time, etc. However, since
1222 discarding this state at the server may affect the correctness of the
1223 server as seen by the client across network partitioning, such
1224 discarding of state should be done only in a conservative manner.

1226 Each client request to the server carries a new SEQUENCE operation
1227 within each COMPOUND, which provides the session context. This
1228 session context then governs the request control, duplicate request
1229 caching, and other persistent parameters managed by the server for a
1230 session.

1232 3.4.1 TCP Connection Model

1234 The following is a schematic diagram of the NFSv4.1 protocol
1235 exchanges leading up to normal operation on a TCP stream.

1237      Client                                Server
1238      TCPmode  : Create Clientid(nfs_client_id4) : TCPmode
1239               : ------------------------------> :
1240               :                                 :
1241               : Clientid reply(clientid, ...) :
1242               : <------------------------------ :
1243               :                                 :
1244               : Create Session(clientid, size S, :
1245               :   maxreq N, STREAM, ...) :
1246               : ------------------------------> :
1247               :                                 :
1248               : Session reply(sessionid, size S', :
1249               :   maxreq N') :
1250               : <------------------------------ :
1251               :                                 :
1252               : :
1253               : ------------------------------> :
1254               : <------------------------------ :
1255               : : :

1257 No net additional exchange is added to the initial negotiation by
1258 this proposal. In the NFSv4.1 exchange, the CREATECLIENTID replaces
1259 SETCLIENTID (eliding the callback "clientaddr4" addressing) and
1260 CREATESESSION subsumes the function of SETCLIENTID_CONFIRM, as
1261 described elsewhere in this document. Callback channel binding is
1262 optional, as in NFSv4.0. Note that the STREAM transport type is
1263 shown above, but since the transport mode remains unchanged and
1264 transport attributes are not necessarily exchanged, DEFAULT could
1265 also be passed.

1267 3.4.2 Negotiated RDMA Connection Model

1269 One possible design which has been considered is to have a
1270 "negotiated" RDMA connection model, supported via use of a session
1271 bind operation as a required first step. However, due to issues
1272 mentioned earlier, this proved problematic. This section remains as
1273 a reminder of that fact, and it is possible such a mode can be
1274 supported.

1276 It is not considered critical that this be supported for two reasons.
1277 One, the session persistence provides a way for the server to
1278 remember important session parameters, such as sizes and maximum
1279 request counts. These values can be used to restore the endpoint
1280 prior to making the first reply. Two, there are currently no
1281 critical RDMA parameters to set in the endpoint at the server side of
1282 the connection. RDMA Read resources, which are in general not
1283 settable after entering RDMA mode, are set only at the client - the
1284 originator of the connection. Therefore, as long as the RDMA provider
1285 supports an automatic RDMA connection mode, no further support is
1286 required from the NFSv4.1 protocol for reconnection.

1288 Note that when reconnecting, the client must provide at least as many
1289 RDMA Read resources to its local queue for the benefit of the server
1290 as it used when negotiating the session. If this value is no longer
1291 appropriate, the client should resynchronize its session state,
1292 destroy the existing session, and start over with the more
1293 appropriate values.

1295 3.4.3 Automatic RDMA Connection Model

1297 The following is a schematic diagram of the NFSv4.1 protocol
1298 exchanges performed on an RDMA connection.

1300      Client                                Server
1301      RDMAmode : : : RDMAmode
1302               : : :
1303      Prepost  : : : Prepost
1304      receive  : : : receive
1305               : :
1306               : Create Clientid(nfs_client_id4) :
1307               : ------------------------------> :
1308               :                                 : Prepost
1309               : Clientid reply(clientid, ...) : receive
1310               : <------------------------------ :
1311      Prepost  :                                 :
1312      receive  : Create Session(clientid, size S, :
1313               :   maxreq N, RDMA ...) :
1314               : ------------------------------> :
1315               :                                 : Prepost <=N'
1316               : Session reply(sessionid, size S', : receives of
1317               :   maxreq N') : size S'
1318               : <------------------------------ :
1319               :                                 :
1320               : :
1321               : ------------------------------> :
1322               : <------------------------------ :
1323               : : :

1325 3.5 Buffer Management, Transfer, Flow Control

1327 Inline operations in NFSv4.1 behave effectively the same as TCP
1328 sends. Procedure results are passed in a single message, and its
1329 completion at the client signals the receiving process to inspect the
1330 message.

1332 RDMA operations are performed solely by the server in this proposal,
1333 as described in the previous "RDMA Direct Model" section. Since
1334 server RDMA operations do not result in a completion at the client,
1335 and due to ordering rules in RDMA transports, after all required RDMA
1336 operations are complete, a Send (Send with Solicited Event for iWARP)
1337 containing the procedure results is performed from server to client.
1338 This Send operation will result in a completion which will signal the
1339 client to inspect the message.

1341 In the case of client read-type NFSv4 operations, the server will
1342 have issued RDMA Writes to transfer the resulting data into client-
1343 advertised buffers. The subsequent Send operation performs two
1344 necessary functions: finalizing any active or pending DMA at the
1345 client, and signaling the client to inspect the message.

1347 In the case of client write-type NFSv4 operations, the server will
1348 have issued RDMA Reads to fetch the data from the client-advertised
1349 buffers. No data consistency issues arise at the client, but the
1350 completion of the transfer must be acknowledged, again by a Send from
1351 server to client.

1353 In either case, the client advertises buffers for direct (RDMA style)
1354 operations.
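To make the preceding sequence concrete, here is a sketch of the
client side of a direct read in C (hypothetical helper names standing
in for RDMA provider and RPC layer facilities; the rationale for the
invalidation ordering is discussed below):

      #include <stddef.h>
      #include <stdint.h>

      #define REMOTE_WRITE 0x2          /* illustrative access flag */
      typedef uint32_t rdma_stag_t;     /* Steering Tag handle */

      /* Illustrative stand-ins for provider and RPC layer calls. */
      extern rdma_stag_t rdma_register(void *buf, size_t len, int acc);
      extern void rdma_invalidate(rdma_stag_t stag);
      extern void send_read_request(uint64_t off, size_t len,
                                    rdma_stag_t stag);
      extern void wait_reply(void);
      extern void verify_integrity(void *buf, size_t len);

      void direct_read(void *buf, size_t len, uint64_t off)
      {
          /* Advertise the buffer for the duration of one request. */
          rdma_stag_t stag = rdma_register(buf, len, REMOTE_WRITE);

          send_read_request(off, len, stag); /* inline Send */
          wait_reply();            /* server RDMA Writes, then Sends */
          rdma_invalidate(stag);   /* close the RDMA window first... */
          verify_integrity(buf, len); /* ...then check the contents */
      }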
The client may desire certain advertisement limits, and
1355 may wish the server to perform remote invalidation on its behalf when
1356 the server has completed its RDMA. This may be considered in a
1357 future version of this draft.

1359 In the absence of remote invalidation, the client may perform its
1360 own, local invalidation after the operation completes. This
1361 invalidation should occur prior to any RPCSEC_GSS integrity checking,
1362 since a validly remotely accessible buffer can possibly be modified
1363 by the peer. However, once the buffer has been invalidated and its
1364 integrity checked, the contents are locally secure.

1366 Credit updates over RDMA transports are supported at the RPC layer as
1367 described in [RPCRDMA]. In each request, the client requests a
1368 desired number of credits to be made available to the connection on
1369 which it sends the request. The client must not send more requests
1370 than the number which the server has previously advertised, or, in
1371 the case of the first request, more than one. If the client exceeds
1372 its credit limit, the connection may close with a fatal RDMA error.

1374 The server then executes the request, and replies with an updated
1375 credit count accompanying its results. Since replies are sequenced
1376 by their RDMA Send order, the most recent results always reflect the
1377 server's limit. In this way the client will always know the maximum
1378 number of requests it may safely post.

1380 Because the client requests an arbitrary credit count in each
1381 request, it is relatively easy for the client to request more, or
1382 fewer, credits to match its expected need. A client that discovered
1383 itself frequently queuing outgoing requests due to lack of server
1384 credits might increase its requested credits proportionately in
1385 response. Or, a client might have a simple, configurable number.
1386 The protocol also provides a per-operation "maxslot" exchange to
1387 assist in dynamic adjustment at the session level, described in a
1388 later section.

1390 Occasionally, a server may wish to reduce the total number of credits
1391 it offers a certain client on a connection. This could be
1392 encountered if a client were found to be consuming its credits
1393 slowly, or not at all. A client might notice this itself, and reduce
1394 its requested credits in advance, for instance requesting only the
1395 count of operations it currently has queued, plus a few as a base for
1396 starting up again. Such mechanisms can, however, be potentially
1397 complicated and are implementation-defined. The protocol does not
1398 require them.

1400 Because of the way in which RDMA fabrics function, it is not possible
1401 for the server (or client back channel) to cancel outstanding receive
1402 operations. Therefore, effectively only one credit can be withdrawn
1403 per receive completion. The server (or client back channel) would
1404 simply not replenish a receive operation when replying. The server
1405 can still reduce the available credit advertisement in its replies to
1406 the target value it desires, as a hint to the client that its credit
1407 target is lower and it should expect it to be reduced accordingly.
1408 Even if the server could cancel outstanding receives, it could not
1409 safely do so, since the client may have already sent requests in
1410 expectation of the previous limit.

1412 This brings out an interesting scenario similar to the client
1413 reconnect discussed earlier in "Connection Models".
How does the
1414 server reduce the credits of an inactive client?

1416 One approach is for the server to simply close such a connection and
1417 require the client to reconnect at a new credit limit. This is
1418 acceptable, if inefficient, when the connection setup time is short
1419 and where the server supports persistent session semantics.

1421 A better approach is to provide a back channel request to return the
1422 operations channel credits. The server may request the client to
1423 return some number of credits; the client must comply by performing
1424 operations on the operations channel, provided of course that the
1425 request does not drop the client's credit count to zero (in which
1426 case the connection would deadlock). If the client finds that it has
1427 no requests with which to consume the credits it was previously
1428 granted, it must send zero-length Send RDMA operations, or NULL NFSv4
1429 operations, in order to return the resources to the server. If the
1430 client fails to comply in a timely fashion, the server can recover
1431 the resources by breaking the connection.

1433 While in principle the back channel credits could be subject to a
1434 similar resource adjustment, in practice this is not an issue, since
1435 the back channel is used purely for control and is expected to be
1436 statically provisioned.

1438 It is important to note that in addition to maximum request counts,
1439 the sizes of buffers are negotiated per-session. This permits the
1440 most efficient allocation of resources on both peers. There is an
1441 important requirement on reconnection: the sizes posted by the server
1442 at reconnect must be at least as large as previously used, to allow
1443 recovery. Any replies that are replayed from the server's duplicate
1444 request cache must be able to be received into client buffers. In
1445 the case where a client has received replies to all its retried
1446 requests (and therefore received all its expected responses), then
1447 the client may disconnect and reconnect with different buffers at
1448 will, since no cache replay will be required.

1450 3.6 Retry and Replay

1452 NFSv4.0 forbids retransmission on active connections over reliable
1453 transports; this includes connected-mode RDMA. This restriction
1454 must be maintained in NFSv4.1.

1456 If one peer were to retransmit a request (or reply), it would consume
1457 an additional credit on the other. If the server retransmitted a
1458 reply, it would certainly result in an RDMA connection loss, since
1459 the client would typically only post a single receive buffer for each
1460 request. If the client retransmitted a request, the additional
1461 credit consumed on the server might lead to RDMA connection failure
1462 unless the client accounted for it and decreased its available
1463 credit, leading to wasted resources.

1465 RDMA credits present a new issue to the duplicate request cache in
1466 NFSv4.1. The request cache may be used when a connection within a
1467 session is lost, such as after the client reconnects. Credit
1468 information is a dynamic property of the connection, and stale values
1469 must not be replayed from the cache. This implies that the request
1470 cache contents must not be blindly used when replies are issued from
1471 it, and credit information appropriate to the channel must be
1472 refreshed by the RPC layer.
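A sketch of the replay path implied above follows (hypothetical
structures): the cached reply body is reused verbatim, but the credit
advertisement within it is regenerated from the connection's current
state rather than replayed stale from the cache:

      #include <stddef.h>
      #include <stdint.h>

      struct conn;                        /* transport connection */
      extern uint32_t current_credit_limit(struct conn *c);
      extern void put_xdr_uint32(unsigned char *p, uint32_t v);
      extern void post_send(struct conn *c, unsigned char *buf,
                            size_t len);

      struct drc_entry {
          unsigned char *reply;         /* cached XDR-encoded reply */
          size_t         len;
          size_t         credit_offset; /* credit field's position */
      };

      void replay_reply(struct drc_entry *e, struct conn *c)
      {
          /* Refresh the dynamic credit field; never replay it. */
          put_xdr_uint32(e->reply + e->credit_offset,
                         current_credit_limit(c));
          post_send(c, e->reply, e->len);
      }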
1474 Finally, RDMA fabrics do not guarantee that the memory handles
1475 (Steering Tags) within each RDMA three-tuple are valid outside the
1476 scope of a single connection. Therefore, handles used by the
1477 direct operations become invalid after connection loss. The server
1478 must ensure that any RDMA operations which must be replayed from the
1479 request cache use the newly provided handle(s) from the most recent
1480 request.

1482 3.7 The Back Channel

1484 The NFSv4 callback operations present a significant resource problem
1485 for the RDMA enabled client. Clearly, callbacks must be negotiated
1486 in the way credits are for the ordinary operations channel for
1487 requests flowing from client to server. But, for callbacks to arrive
1488 on the same RDMA endpoint as operation replies would require
1489 dedicating additional resources, and specialized demultiplexing and
1490 event handling. Or, callbacks may not require RDMA service at all
1491 (they do not normally carry substantial data payloads). It is highly
1492 desirable to streamline this critical path via a second
1493 communications channel.

1495 The session callback channel binding facility is designed for exactly
1496 such a situation, by dynamically associating a new connected endpoint
1497 with the session, and separately negotiating sizes and counts for
1498 active callback channel operations. The binding operation is
1499 firewall-friendly since it does not require the server to initiate
1500 the connection.

1502 This same method serves as well for ordinary TCP connection mode. It
1503 is expected that all NFSv4.1 clients may make use of the session
1504 facility to streamline their design.

1506 The back channel functions exactly the same as the operations channel
1507 except that no RDMA operations are required to perform transfers;
1508 instead, the sizes are required to be sufficiently large to carry all
1509 data inline, and of course the client and server reverse their roles
1510 with respect to which is in control of credit management. The same
1511 rules apply for all transfers, with the server being required to flow
1512 control its callback requests.

1514 The back channel is optional. If not bound on a given session, the
1515 server must not issue callback operations to the client. This in
1516 turn implies that such a client must never put itself in the
1517 situation where the server will need to do so, lest the client lose
1518 its connection by force, or its operation be incorrect. For the same
1519 reason, if a back channel is bound, the client is subject to
1520 revocation of its delegations if the back channel is lost. Any
1521 connection loss should be corrected by the client as soon as
1522 possible.

1524 This can be convenient for the NFSv4.1 client; if the client expects
1525 to make no use of back channel facilities such as delegations, then
1526 there is no need to create it. This may save significant resources
1527 and complexity at the client.

1529 For these reasons, if the client wishes to use the back channel, that
1530 channel must be bound first, before using the operations channel. In
1531 this way, the server will not find itself in a position where it will
1532 send callbacks on the operations channel when the client is not
1533 prepared for them.

1535 There is one special case, that where the back channel is bound in
1536 fact to the operations channel's connection.
This configuration
1537 would normally be used over a TCP stream connection to implement
1538 exactly the NFSv4.0 behavior, but over RDMA would require complex
1539 resource and event management at both sides of the connection. The
1540 server is not required to accept such a bind request on an RDMA
1541 connection for this reason, though it is recommended.

1543 3.8 COMPOUND Sizing Issues

1545 Very large responses may pose duplicate request cache issues. Since
1546 servers will want to bound the storage required for such a cache, the
1547 unlimited size of response data in COMPOUND may be troublesome. If
1548 COMPOUND is used in all its generality, then the inclusion of certain
1549 non-idempotent operations within a single COMPOUND request may render
1550 the entire request non-idempotent. (For example, a single COMPOUND
1551 request which read a file or symbolic link, then removed it, would be
1552 obliged to cache the data in order to allow identical replay).
1553 Therefore, many requests might include operations that return an
1554 arbitrarily large amount of data.

1556 It is not satisfactory for the server to reject COMPOUNDs at will
1557 with NFS4ERR_RESOURCE when they pose such difficulties for the
1558 server, as this results in serious interoperability problems.
1559 Instead, any such limits must be explicitly exposed as attributes of
1560 the session, ensuring that the server can explicitly support any
1561 duplicate request cache needs at all times.

1563 3.9 Data Alignment

1565 A negotiated data alignment enables certain scatter/gather
1566 optimizations. A facility for this is supported by [RPCRDMA]. Where
1567 NFS file data is the payload, specific optimizations become highly
1568 attractive.

1570 Header padding is requested by each peer at session initiation, and
1571 may be zero (no padding). Padding leverages the useful property that
1572 RDMA receives preserve alignment of data, even when they are placed
1573 into anonymous (untagged) buffers. If requested, client inline
1574 writes will insert appropriate pad bytes within the request header to
1575 align the data payload on the specified boundary. The client is
1576 encouraged to be optimistic and simply pad all WRITEs within the RPC
1577 layer to the negotiated size, in the expectation that the server can
1578 use them efficiently.

1580 It is highly recommended that clients offer to pad headers to an
1581 appropriate size. Most servers can make good use of such padding,
1582 which allows them to chain receive buffers in such a way that any
1583 data carried by client requests will be placed into appropriate
1584 buffers at the server, ready for filesystem processing. The
1585 receiver's RPC layer encounters no overhead from skipping over pad
1586 bytes, and the RDMA layer's high performance makes the insertion and
1587 transmission of padding on the sender a significant optimization. In
1588 this way, the need for servers to perform RDMA Read to satisfy all
1589 but the largest client writes is obviated. An added benefit is the
1590 reduction of message roundtrips on the network - a potentially good
1591 trade, where latency is present.

1593 The value to choose for padding is subject to a number of criteria.
1594 A primary source of variable-length data in the RPC header is the
1595 authentication information, the form of which is client-determined,
1596 possibly in response to server specification. The contents of
1597 COMPOUNDs, sizes of strings such as those passed to RENAME, etc.
all
1598 go into the determination of a maximal NFSv4 request size and
1599 therefore minimal buffer size. The client must select its offered
1600 value carefully, so as not to overburden the server, and vice versa.
1601 The payoff of an appropriate padding value is higher performance.

1603 Sender gather:
1604 |RPC Request|Pad bytes|Length| -> |User data...|
1605 \------+---------------------/ \
1606 \ \
1607 \ Receiver scatter: \-----------+- ...
1608 /-----+----------------\ \ \
1609 |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->...

1611 In the above case, the server may recycle unused buffers to the next
1612 posted receive if unused by the actual received request, or may pass
1613 the now-complete buffers by reference for normal write processing.
1614 For a server which can make use of it, this removes any need for data
1615 copies of incoming data, without resorting to complicated end-to-end
1616 buffer advertisement and management. This includes most kernel-based
1617 and integrated server designs, among many others. The client may
1618 perform similar optimizations, if desired.

1620 Padding is negotiated by the session creation operation, and
1621 subsequently used by the RPC RDMA layer, as described in [RPCRDMA].

1623 3.10 NFSv4 Integration

1625 The following section discusses the integration of the proposed RDMA
1626 extensions with NFSv4.0.

1628 3.10.1 Minor Versioning

1630 Minor versioning is the existing facility to extend the NFSv4
1631 protocol, and this proposal takes that approach.

1633 Minor versioning of NFSv4 is relatively restrictive, and allows for
1634 tightly limited changes only. In particular, it does not permit
1635 adding new "procedures" (it permits adding only new "operations").
1636 Interoperability concerns make it impossible to consider additional
1637 layering to be a minor revision. This somewhat limits the changes
1638 that can be proposed when considering extensions.

1640 To support the duplicate request cache integrated with sessions and
1641 request control, it is desirable to tag each request with an
1642 identifier to be called a Slotid. This identifier must be passed by
1643 NFSv4 when running atop any transport, including traditional TCP.
1644 Therefore it is not desirable to add the Slotid to a new RPC
1645 transport, even though such a transport is indicated for support of
1646 RDMA. This draft and [RPCRDMA] do not propose such an approach.

1648 Instead, this proposal conforms to the requirements of NFSv4 minor
1649 versioning, through the use of a new operation within NFSv4 COMPOUND
1650 procedures as detailed below.

1652 If sessions are in use for a given clientid, this same clientid
1653 cannot be used for non-session NFSv4 operation, including NFSv4.0.
1654 Because the server will have allocated session-specific state to the
1655 active clientid, it would be an unnecessary burden on the server
1656 implementor to support and account for additional, non-session
1657 traffic, in addition to being of no benefit. Therefore this proposal
1658 prohibits a single clientid from doing this. Nevertheless, employing
1659 a new clientid for such traffic is supported.

1661 3.10.2 Slot Identifiers and Server Duplicate Request Cache

1663 The presence of deterministic maximum request limits on a session
1664 enables in-progress requests to be assigned unique values with useful
1665 properties.

1667 The RPC layer provides a transaction ID (xid), which, while required
1668 to be unique, is not especially convenient for tracking requests.
The transaction ID is only meaningful to the issuer (client); it
1670 cannot be interpreted at the server except to test for equality with
1671 previously issued requests. Because RPC operations may be completed
1672 by the server in any order, many transaction IDs may be outstanding
1673 at any time. The client may therefore perform a computationally
1674 expensive lookup operation in the process of demultiplexing each
1675 reply.

1677 In the proposal, there is a limit to the number of active requests.
1678 This immediately enables a convenient, computationally efficient
1679 index for each request which is designated as a Slot Identifier, or
1680 slotid.

1682 When the client issues a new request, it selects a slotid in the
1683 range 0..N-1, where N is the server's current "totalrequests" limit
1684 granted the client on the session over which the request is to be
1685 issued. The slotid must be unused by any of the requests which the
1686 client has already active on the session. "Unused" here means the
1687 client has no outstanding request for that slotid. Because the
1688 slotid is always an integer in the range 0..N-1, client implementations
1689 can use the slotid from a server response to efficiently match
1690 responses with outstanding requests, such as, for example, by using
1691 the slotid to index into an outstanding request array. This can be
1692 used to avoid expensive hashing and lookup functions in the
1693 performance-critical receive path.

1695 The sequenceid, which accompanies the slotid in each request, is
1696 required for a second check at the server: it must be possible to
1697 determine efficiently whether a request using a certain
1698 slotid is a retransmit or a new, never-before-seen request. It is
1699 not feasible for the client to assert that it is retransmitting to
1700 implement this, because for any given request the client cannot know
1701 whether the server has seen it unless the server actually replies. Of
1702 course, if the client has seen the server's reply, the client would
1703 not retransmit!

1705 The sequenceid must increase monotonically for each new transmit of a
1706 given slotid, and must remain unchanged for any retransmission. The
1707 server must in turn compare each newly received request's sequenceid
1708 with the last one previously received for that slotid, to see if the
1709 new request is:

1711 A new request, in which the sequenceid is greater than that
1712 previously seen in the slot (accounting for sequence wraparound).
1713 The server proceeds to execute the new request.

1715 A retransmitted request, in which the sequenceid is equal to that
1716 last seen in the slot. Note that this request may be either
1717 complete, or in progress. The server performs replay processing
1718 in these cases.

1720 A misordered duplicate, in which the sequenceid is less than that
1721 previously seen in the slot. The server must drop the incoming
1722 request, which may imply dropping the connection if the transport
1723 is reliable, as dictated by section 3.1.1 of [RFC3530].

1725 This last condition is possible on any connection, not just
1726 unreliable, unordered transports. Delayed behavior on abandoned TCP
1727 connections which are not yet closed at the server, or pathological
1728 client implementations, can cause it, among other causes. Therefore,
1729 the server may wish to harden itself against certain repeated
1730 occurrences of this, as it would for retransmissions in [RFC3530].
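The three-way comparison above reduces to a few lines of code. A
sketch in C (hypothetical names), using a signed difference so that
sequence wraparound is accounted for naturally:

      #include <stdint.h>

      enum seq_check { SEQ_NEW, SEQ_RETRANSMIT, SEQ_MISORDERED };

      /* Compare an arriving sequenceid with the last one seen on the
         slot.  Casting the difference to a signed type means a value
         that has wrapped past zero still compares as "greater". */
      enum seq_check classify(uint32_t last_seen, uint32_t incoming)
      {
          int32_t delta = (int32_t)(incoming - last_seen);

          if (delta == 0)
              return SEQ_RETRANSMIT; /* replay from the cache */
          if (delta > 0)
              return SEQ_NEW;        /* execute, then cache result */
          return SEQ_MISORDERED;     /* drop request, and possibly
                                        the connection */
      }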
1732 It is recommended, though not necessary for protocol correctness,
1733 that the client simply increment the sequenceid by one for each new
1734 request on each slotid. This reduces the wraparound window to a
1735 minimum, and is useful for tracing and avoidance of possible
1736 implementation errors.

1738 The client may, however, for implementation-specific reasons, choose
1739 a different algorithm. For example, it might maintain a single sequence
1740 space for all slots in the session - e.g. employing the RPC XID
1741 itself. The sequenceid, in any case, is never interpreted by the
1742 server for anything but to test by comparison with previously seen
1743 values.

1745 The server may thereby use the slotid, in conjunction with the
1746 sessionid and sequenceid, within the SEQUENCE portion of the request
1747 to maintain its duplicate request cache (DRC) for the session, as
1748 opposed to the traditional approach of ONC RPC applications that use
1749 the XID along with certain transport information [RW96].

1751 Unlike the XID, the slotid is always within a specific range; this
1752 has two implications. The first implication is that for a given
1753 session, the server need only cache the results of a limited number
1754 of COMPOUND requests. The second implication derives from the first:
1755 unlike XID-indexed DRCs, the slotid DRC by its nature cannot be
1756 overflowed. Also, through use of the sequenceid to identify
1757 retransmitted requests, the server does not need to cache the
1758 request itself, further reducing the storage requirements of the
1759 DRC. These new facilities make it practical to maintain all the
1760 required entries for an effective DRC.

1762 The slotid and sequenceid therefore take over the traditional role of
1763 the port number in the server DRC implementation, and the session
1764 replaces the IP address. This approach is considerably more portable
1765 and completely robust - it is not subject to the frequent
1766 reassignment of ports as clients reconnect over IP networks. In
1767 addition, the RPC XID is not used in the reply cache, enhancing
1768 robustness of the cache in the face of any rapid reuse of XIDs by the
1769 client.

1771 The slotid information must be encoded into each request in a way
1772 that does not violate the minor versioning rules of the NFSv4.0
1773 specification. This is accomplished here by encoding it in a control
1774 operation within each NFSv4.1 COMPOUND and CB_COMPOUND procedure.
1775 The operation easily piggybacks within existing messages. The
1776 implementation section of this document describes the specific
1777 proposal.

1779 In general, the receipt of a new sequenced request arriving on any
1780 valid slot is an indication that the previous DRC contents of that
1781 slot may be discarded. In order to further assist the server in slot
1782 management, the client is required to use the lowest available slot
1783 when issuing a new request. In this way, the server may be able to
1784 retire additional entries.

1786 However, in the case where the server is actively adjusting its
1787 granted maximum request count to the client, it may not be able to
1788 use receipt of the slotid to retire cache entries. The slotid used
1789 in an incoming request may not reflect the server's current idea of
1790 the client's session limit, because the request may have been sent
1791 from the client before the update was received.
Therefore, in the
1792 downward adjustment case, the server may have to retain a number of
1793 duplicate request cache entries at least as large as the old value,
1794 until operation sequencing rules allow it to infer that the client
1795 has seen its reply.

1797 The SEQUENCE (and CB_SEQUENCE) operation also carries a "maxslot"
1798 value which carries additional client slot usage information. The
1799 client must always provide its highest-numbered outstanding slot
1800 value in the maxslot argument, and the server may reply with a new
1801 recognized value. The client should in all cases provide the most
1802 conservative value possible, although it can be increased somewhat
1803 above the actual instantaneous usage to maintain some minimum or
1804 optimal level. This provides a way for the client to yield unused
1805 request slots back to the server, which in turn can use the
1806 information to reallocate resources. Obviously, maxslot can never be
1807 zero, or the session would deadlock.

1809 The server also provides a target maxslot value to the client, which
1810 is an indication to the client of the maxslot the server wishes the
1811 client to be using. This permits the server to withdraw resources
1812 from (or add resources to) a client that has been found not to be
1813 using them, in order to more fairly share resources among a varying
1814 level of demand from other clients. The client must always comply
1815 with the server's value updates, since they indicate newly established
1816 hard limits on the client's access to session resources. However,
1817 because of request pipelining, the client may have active requests in
1818 flight reflecting prior values; therefore, the server must not
1819 immediately require the client to comply.

1821 It is worthwhile to note that Sprite RPC [BW87] defined a "channel"
1822 which in some ways is similar to the slotid proposed here. Sprite
1823 RPC used channels to implement parallel request processing and
1824 request/response cache retirement.

1826 3.10.3 COMPOUND and CB_COMPOUND

1828 Support for per-operation control can be piggybacked onto NFSv4
1829 COMPOUNDs with full transparency, by placing such facilities into
1830 their own, new operation, and placing this operation first in each
1831 COMPOUND under the new NFSv4 minor protocol revision. The contents
1832 of the operation would then apply to the entire COMPOUND.

1834 Recall that the NFSv4 minor revision is contained within the COMPOUND
1835 header, encoded prior to the COMPOUNDed operations. By simply
1836 requiring that the new operation always be contained in NFSv4 minor
1837 COMPOUNDs, the control protocol can piggyback perfectly with each
1838 request and response.

1840 In this way, the NFSv4 RDMA Extensions may stay in compliance with
1841 the minor versioning requirements specified in section 10 of
1842 [RFC3530].

1844 Referring to section 13.1 of the same document, the proposed session-
1845 enabled COMPOUND and CB_COMPOUND have the form:

1847 +-----+--------------+-----------+------------+-----------+----
1848 | tag | minorversion | numops    | control op | op + args | ...
1849 |     | (== 1)       | (limited) | + args     |           |
1850 +-----+--------------+-----------+------------+-----------+----

1852 and the reply's structure is:

1854 +------------+-----+--------+-------------------------------+--//
1855 |last status | tag | numres | status + control op + results | //
1856 +------------+-----+--------+-------------------------------+--//
1857 //-----------------------+----
1858 // status + op + results | ...
1860 //-----------------------+----

1862 The single control operation within each NFSv4.1 COMPOUND defines the
1863 context and operational session parameters which govern that COMPOUND
1864 request and reply. Placing it first in the COMPOUND encoding is
1865 required in order to allow its processing before other operations in
1866 the COMPOUND.

1868 3.10.4 eXternal Data Representation Efficiency

1870 RDMA is a copy avoidance technology, and it is important to maintain
1871 this efficiency when decoding received messages. Traditional XDR
1872 implementations frequently use generated unmarshaling code to convert
1873 objects to local form, incurring a data copy in the process (in
1874 addition to subjecting the caller to recursive calls, etc). Often,
1875 such conversions are carried out even when no size or byte order
1876 conversion is necessary.

1878 It is recommended that implementations pay close attention to the
1879 details of memory referencing in such code. It is far more efficient
1880 to inspect data in place, using native facilities to deal with word
1881 size and byte order conversion into registers or local variables,
1882 rather than formally (and blindly) performing the operation via
1883 fetch, reallocate and store.

1885 Of particular concern is the result of the READDIR operation, in
1886 which such encoding abounds.

1888 3.10.5 Effect of Sessions on Existing Operations

1890 The use of a session replaces the use of the SETCLIENTID and
1891 SETCLIENTID_CONFIRM operations, and allows certain simplification of
1892 the RENEW and callback addressing mechanisms in the base protocol.

1894 The cb_program and cb_location which are obtained by the server in
1895 SETCLIENTID_CONFIRM must not be used by the server, because the
1896 NFSv4.1 client performs callback channel designation with
1897 BIND_BACKCHANNEL. Therefore the SETCLIENTID and SETCLIENTID_CONFIRM
1898 operations become obsolete when sessions are in use, and a server
1899 should return an error to NFSv4.1 clients which issue either
1900 operation.

1902 Another favorable result of the session is that the server is able to
1903 avoid requiring the client to perform OPEN_CONFIRM operations. The
1904 existence of a reliable and effective DRC means that the server will
1905 be able to determine whether an OPEN request carrying a previously
1906 known open_owner from a client is or is not a retransmission.
1907 Because of this, the server no longer requires OPEN_CONFIRM to verify
1908 whether the client is retransmitting an open request. This in turn
1909 eliminates the server's reason for requesting OPEN_CONFIRM - the
1910 server can simply replace any previous information on this
1911 open_owner. Client OPEN operations are therefore streamlined,
1912 reducing overhead and latency through avoiding the additional
1913 OPEN_CONFIRM exchange.

1915 Since the session carries the client liveness indication with it
1916 implicitly, any request on a session associated with a given client
1917 will renew that client's leases.
Therefore the RENEW operation is
1918 made unnecessary when a session is present, as any request (including
1919 a SEQUENCE operation with or without additional NFSv4 operations)
1920 performs its function. It is possible (though this proposal does not
1921 make any recommendation) that the RENEW operation could be made
1922 obsolete.

1924 An interesting issue arises, however, if an error occurs on such a
1925 SEQUENCE operation. If the SEQUENCE operation fails, perhaps due to
1926 an invalid slotid or other non-renewal-based issue, the server may or
1927 may not have performed the RENEW. In this case, the state of any
1928 renewal is undefined, and the client should make no assumption that
1929 it has been performed. In practice, this should not occur, but even
1930 if it did, it is expected the client would perform some sort of
1931 recovery which would result in a new, successful, SEQUENCE operation
1932 being run and the client assured that the renewal took place.

1934 3.10.6 Authentication Efficiencies

1936 NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor
1937 [RFC2203] to provide authentication, integrity, and privacy via
1938 cryptography. The server dictates to the client the use of
1939 RPCSEC_GSS, the service (authentication, integrity, or privacy), and
1940 the specific GSS-API security mechanism that each remote procedure
1941 call and result will use.

1943 If the connection's integrity is protected by additional means
1944 beyond RPCSEC_GSS, such as via IPsec, then the use of RPCSEC_GSS's
1945 integrity service is nearly redundant (See the Security
1946 Considerations section for more explanation of why it is "nearly" and
1947 not completely redundant). Likewise, if the connection's privacy is
1948 protected by additional means, then the use of both RPCSEC_GSS's
1949 integrity and privacy services is nearly redundant.

1951 Connection protection schemes, such as IPsec, are more likely to be
1952 implemented in hardware than upper layer protocols like RPCSEC_GSS.
1953 Hardware-based cryptography at the IPsec layer will be more efficient
1954 than software-based cryptography at the RPCSEC_GSS layer.

1956 When transport integrity can be obtained, it is possible for server
1957 and client to downgrade their per-operation authentication, after an
1958 appropriate exchange. This downgrade can in fact be as complete as
1959 to establish security mechanisms that have zero cryptographic
1960 overhead, effectively using the underlying integrity and privacy
1961 services provided by the transport.

1963 Based on the above observations, a new GSS-API mechanism, called the
1964 Channel Conjunction Mechanism [CCM], is being defined. The CCM works
1965 by creating a GSS-API security context using as input a cookie that
1966 the initiator and target have previously agreed to be a handle for a
1967 GSS-API context created previously over another GSS-API mechanism.

1969 NFSv4.1 clients and servers should support CCM, and they must use as
1970 the cookie the handle from a successful RPCSEC_GSS context creation
1971 over a non-CCM mechanism (such as Kerberos V5). The value of the
1972 cookie will be equal to the handle field of the rpc_gss_init_res
1973 structure from the RPCSEC_GSS specification.

1975 The [CCM] Draft provides further discussion and examples.

1977 3.11 Sessions Security Considerations

1979 NFSv4 minor version 1 retains all of the existing NFSv4 security; all
1980 security considerations present in NFSv4.0 apply to it equally.
1982 Security considerations of any underlying RDMA transport are
1983 additionally important, all the more so due to the emerging nature of
1984 such transports. Examining these issues is outside the scope of this
1985 draft.

1987 When protecting a connection with RPCSEC_GSS, all data in each
1988 request and response (whether transferred inline or via RDMA)
1989 continues to receive this protection over RDMA fabrics [RPCRDMA].
1990 However, when performing data transfers via RDMA, RPCSEC_GSS
1991 protection of the data transfer portion works against the efficiency
1992 which RDMA is typically employed to achieve. This is because such
1993 data is normally managed solely by the RDMA fabric, and intentionally
1994 is not touched by software. Therefore, when employing RPCSEC_GSS
1995 under CCM, and where integrity protection has been "downgraded", the
1996 cooperation of the RDMA transport provider is critical to maintain
1997 any integrity and privacy otherwise in place for the session. The
1998 means by which the local RPCSEC_GSS implementation is integrated with
1999 the RDMA data protection facilities are outside the scope of this
2000 draft.

2002 It is logical to use the same GSS context on a session's callback
2003 channel as that used on its operations channel(s), particularly when
2004 the connection is shared by both. The client must indicate to the
2005 server:

2007 - what security flavor(s) to use in the callback. A special
2008 callback flavor might be defined for this.

2010 - if the flavor is RPCSEC_GSS, then the client must have previously
2011 created an RPCSEC_GSS session with the server. The client offers to
2012 the server the opaque handle<> value from the rpc_gss_init_res
2013 structure, the window size of RPCSEC_GSS sequence numbers, and an
2014 opaque gss_cb_handle.

2016 This exchange can be performed as part of session and clientid
2017 creation, and the issue warrants careful analysis before being
2018 specified.

2020 If the NFS client wishes to maintain full control over RPCSEC_GSS
2021 protection, it may still perform its transfer operations using either
2022 the inline or RDMA transfer model, or of course employ traditional
2023 TCP stream operation. In the RDMA inline case, header padding is
2024 recommended to optimize behavior at the server. At the client, close
2025 attention should be paid to the implementation of RPCSEC_GSS
2026 processing to minimize memory referencing and especially copying.
2027 These are well-advised in any case!

2029 The proposed session callback channel binding improves security over
2030 that provided by NFSv4 for the callback channel. The connection is
2031 client-initiated, and subject to the same firewall and routing checks
2032 as the operations channel. The connection cannot be hijacked by an
2033 attacker who connects to the client port prior to the intended
2034 server. The connection is set up by the client with its desired
2035 attributes, such as optional protection with IPsec or similar. The
2036 binding is fully authenticated before being activated.

2038 3.11.1 Authentication

2040 Proper authentication of the principal which issues any session and
2041 clientid in the proposed NFSv4.1 operations exactly follows the
2042 similar requirement on client identifiers in NFSv4.0. It must not be
2043 possible for a client to impersonate another by guessing its session
2044 identifiers for NFSv4.1 operations, nor to bind a callback channel to
2045 an existing session.
To protect against this, NFSv4.0 requires
2046 appropriate authentication and matching of the principal used. This
2047 is discussed in Section 16, Security Considerations of [RFC3530].
2048 The same requirement when using a session identifier applies to
2049 NFSv4.1 here.

2051 Going beyond NFSv4.0, the presence of a session associated with any
2052 clientid may also be used to enhance NFSv4.1 security with respect to
2053 client impersonation. In NFSv4.0, there are many operations which
2054 carry no clientid, including in particular those which employ a
2055 stateid argument. A rogue client which wished to carry out a denial
2056 of service attack on another client could perform CLOSE, DELEGRETURN,
2057 etc. operations with that client's current filehandle, sequenceid and
2058 stateid, after having obtained them by eavesdropping or other
2059 means. Locking and open downgrade operations could be similarly
2060 attacked.

2062 When an NFSv4.1 session is in place for any clientid, countermeasures
2063 are easily applied through use of authentication by the server.
2064 Because the clientid and sessionid must be present in each request
2065 within a session, the server may verify that the clientid is in fact
2066 originating from a principal with the appropriate authenticated
2067 credentials, that the sessionid belongs to the clientid, and that the
2068 stateid is valid in these contexts. This is in general not possible
2069 with the affected operations in NFSv4.0 due to the fact that the
2070 clientid is not present in the requests.

2072 In the event that authentication information is not available in the
2073 incoming request, for example after a reconnection when the security
2074 was previously downgraded using CCM, the server must require that the
2075 client re-establish the authentication in order that the server may
2076 validate the other client-provided context, prior to executing any
2077 operation. The sessionid, present in the newly retransmitted
2078 request, combined with the retransmission detection enabled by the
2079 NFSv4.1 duplicate request cache, are a convenient and reliable
2080 context for the server to use for this contingency.

2082 The server should take care to protect itself against denial of
2083 service attacks in the creation of sessions and clientids. Clients
2084 which connect and create sessions, only to disconnect and never use
2085 them, may leave significant state behind. (The same issue applies to
2086 NFSv4.0 with clients who may perform SETCLIENTID, then never perform
2087 SETCLIENTID_CONFIRM.) Careful authentication coupled with resource
2088 checks is highly recommended.

2090 4. Directory Delegations

2092 4.1 Introduction to Directory Delegations

2094 The major addition to NFS version 4 in the area of caching is the
2095 ability of the server to delegate certain responsibilities to the
2096 client. When the server grants a delegation for a file to a client,
2097 the client receives certain semantics with respect to the sharing of
2098 that file with other clients. At OPEN, the server may provide the
2099 client either a read or write delegation for the file. If the client
2100 is granted a read delegation, it is assured that no other client has
2101 the ability to write to the file for the duration of the delegation.
2102 If the client is granted a write delegation, the client is assured
2103 that no other client has read or write access to the file.
4. Directory Delegations

4.1 Introduction to Directory Delegations

The major addition to NFS version 4 in the area of caching is the ability of the server to delegate certain responsibilities to the client. When the server grants a delegation for a file to a client, the client receives certain semantics with respect to the sharing of that file with other clients. At OPEN, the server may provide the client either a read or write delegation for the file. If the client is granted a read delegation, it is assured that no other client has the ability to write to the file for the duration of the delegation. If the client is granted a write delegation, the client is assured that no other client has read or write access to the file. This reduces network traffic and server load by allowing the client to perform certain operations on local file data, and can also provide stronger consistency for the local data.

Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in a DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed.

This caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. An example of high miss activity is compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments. Other distributed stateful filesystem architectures such as AFS and DFS have proven that adding state around directory contents can greatly reduce network traffic in high miss environments.

Delegation of directory contents is proposed as an extension for NFSv4. Such an extension would provide traffic reduction benefits similar to those of file delegations. By allowing clients to cache directory contents (in a read-only fashion) while being notified of changes, the client can avoid making frequent requests to interrogate the contents of slowly-changing directories, reducing network traffic and improving client performance.

These extensions allow improved namespace cache consistency to be achieved through delegations and synchronous recalls alone, without asking for notifications. In addition, if time-based consistency is sufficient, asynchronous notifications can provide performance benefits for the client, and possibly the server, under some common operating conditions such as slowly-changing and/or very large directories.

4.2 Directory Delegation Design (in brief)

A new operation GET_DIR_DELEGATION is used by the client to ask for a directory delegation. The delegation covers directory attributes and all entries in the directory. If either of these changes, the delegation will be recalled synchronously. The operation causing the recall will have to wait until the recall is complete. Changes to the attributes of individual directory entries will not cause the delegation to be recalled.

In addition to asking for delegations, a client can also ask for notifications for certain events. These events include changes to directory attributes and/or its contents.
If a client asks for notification for a certain event, the server will notify the client when that event occurs. This will not result in the delegation being recalled for that client. The notifications are asynchronous and provide a way of avoiding recalls in situations where a directory is changing enough that the pure recall model may not be effective, while still allowing the client to get substantial benefit. In the absence of notifications, once the delegation is recalled the client has to refresh its directory cache, which might not be very efficient for very large directories.

The delegation is read only, and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes, so that the server has knowledge of these changes. In order to keep the client namespace in sync with the server, the server will notify the delegation-holding client of the changes made as a result. This avoids any subsequent GETATTR or READDIR calls to the server. If a client holding the delegation makes any changes to the directory, the delegation will not be recalled.

Delegations can be recalled by the server at any time. Normally, the server will recall the delegation when the directory changes in a way that is not covered by the notification, or when the directory changes and notifications have not been requested.

Also, if the server notices that handing out a delegation for a directory is causing too many notifications to be sent out, it may decide not to hand out delegations for that directory or to recall existing delegations. If another client removes the directory for which a delegation has been granted, the server will recall the delegation.

Both the notification and recall operations need a callback path to exist between the client and server. If the callback path does not exist, then a delegation cannot be granted. Note that with the session extensions [talpey] this should not be an issue. In the absence of sessions, the server will have to establish a callback path to the client to send callbacks.

4.3 Recommended Attributes in support of Directory Delegations

supp_dir_attr_notice - notification delays on directory attributes

supp_child_attr_notice - notification delays on child attributes

These attributes allow the client and server to negotiate the frequency of notifications sent due to changes in attributes. These attributes are returned as part of a GETATTR call on the directory. The supp_dir_attr_notice value covers all attribute changes to the directory, and supp_child_attr_notice covers all attribute changes to any child in the directory.

These attributes are per directory. The client needs to get these values by doing a GETATTR on the directory for which it wants notifications. However, these attributes are only required when the client is interested in getting attribute notifications. For all other types of notifications, and for delegation requests without notifications, these attributes are not required.

When the client calls the GET_DIR_DELEGATION operation and asks for attribute change notifications, it will request a notification delay that is within the server's supported range.
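As an illustration, a client might clamp its desired delay to the advertised value before issuing GET_DIR_DELEGATION. This sketch assumes the advertised attribute acts as a lower bound on the delay a server will commit to; the names are hypothetical, and nfstime4 is simplified to a scalar.

   #include <stdint.h>

   typedef int64_t nfstime4_ns;   /* simplified stand-in for nfstime4 */

   /* Choose the notification delay to request in GET_DIR_DELEGATION,
      given supp_dir_attr_notice (or supp_child_attr_notice) as fetched
      by a GETATTR on the directory. */
   nfstime4_ns choose_notification_delay(nfstime4_ns desired,
                                         nfstime4_ns supported)
   {
       if (desired < 0)
           desired = 0;         /* negative nfstime4 values are illegal */
       if (desired < supported)
           return supported;    /* stay within the server's range */
       return desired;
   }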
If the requested delay falls outside the values advertised by supp_dir_attr_notice or supp_child_attr_notice, the server should not commit to sending notifications for that change event.

A value of zero for these attributes means the server will send the notification as soon as the change occurs. Setting this value to zero is not recommended, since it can put a significant burden on the server. A value of N means that the server will make a best effort guarantee that attribute notifications are not delayed by more than that. nfstime4 values that compute to negative values are illegal.

4.4 Delegation Recall

The server will recall the directory delegation by sending a callback to the client. It will use the same callback procedure as used for recalling file delegations. The server will recall the delegation when the directory changes in a way that is not covered by the notification. However, the server will not recall the delegation if attributes of an entry within the directory change. Also, if the server notices that handing out a delegation for a directory is causing too many notifications to be sent out, it may decide not to hand out a delegation for that directory. If another client tries to remove the directory for which a delegation has been granted, the server will recall the delegation.

The server will recall the delegation by sending a CB_RECALL callback to the client. If the recall is done because of a directory changing event, the request making that change will need to wait while the client returns the delegation.

4.5 Delegation Recovery

Crash recovery has two main goals: avoiding the necessity of breaking application guarantees with respect to locked files, and delivering updates cached at the client. Neither of these applies to directories protected by read delegations and notifications. Thus, the client is required to establish a new delegation on a server or client reboot.

5. Introduction

The NFSv4 protocol [2] specifies the interaction between a client that accesses files and a server that provides access to files and is responsible for coordinating access by multiple clients. As described in the pNFS problem statement, this requires that all access to a set of files exported by a single NFSv4 server be performed by that server; at high data rates the server may become a bottleneck.

The parallel NFS (pNFS) extensions to NFSv4 allow data accesses to bypass this bottleneck by permitting direct client access to the storage devices containing the file data. When file data for a single NFSv4 server is stored on multiple and/or higher throughput storage devices (by comparison to the server's throughput capability), the result can be significantly better file access performance.
The relationship among multiple clients, a single server, and multiple storage devices for pNFS (server and clients have access to all storage devices) is shown in this diagram:

   +-----------+
   |+-----------+                                +-----------+
   ||+-----------+                               |           |
   |||           |         NFSv4 + pNFS          |           |
   +||  Clients  |<----------------------------->|  Server   |
    +|           |                               |           |
     +-----------+                               |           |
        |||                                      +-----------+
        |||                                           |
        |||                                           |
        ||| Storage         +-----------+             |
        ||| Protocol        |+-----------+            |
        ||+-----------------||+-----------+  Control  |
        |+------------------|||           |  Protocol |
        +-------------------+||  Storage  |-----------+
                             +|  Devices  |
                              +-----------+

                          Figure 9

In this structure, the responsibility for coordination of file access by multiple clients is shared among the server, clients, and storage devices. This is in contrast to NFSv4 without pNFS extensions, in which this is primarily the server's responsibility, some of which can be delegated to clients under strictly specified conditions.

The pNFS extension to NFSv4 takes the form of new operations that manage data location information called a "layout". Layouts are managed in a similar fashion to NFSv4 data delegations (e.g., they are recallable and revocable). However, they are distinct abstractions and are manipulated with new operations. When a client holds a layout, it has rights to access the data directly using the location information in the layout.

There are new attributes that describe general layout characteristics. However, much of the required information cannot be managed solely within the attribute framework, because it will need to have a strictly limited term of validity, subject to invalidation by the server. This requires the use of new operations to obtain, return, recall, and modify layouts, in addition to new attributes.

This document specifies both the NFSv4 extensions required to distribute file access coordination between the server and its clients and an NFSv4 file storage protocol that may be used to access data stored on NFSv4 storage devices.

Storage protocols used to access a variety of other storage devices are deliberately not specified here. These might include:

o Block/volume protocols such as iSCSI ([3]) and FCP ([4]). The block/volume protocol support can be independent of the addressing structure of the block/volume protocol used, allowing more than one protocol to access the same file data and enabling extensibility to other block/volume protocols.

o Object protocols such as OSD over iSCSI or Fibre Channel [5].

o Other storage protocols, including PVFS and other file systems that are in use in HPC environments.

pNFS is designed to accommodate these protocols and be extensible to new classes of storage protocols that may be of interest.

The distribution of file access coordination between the server and its clients increases the level of responsibility placed on clients. Clients are already responsible for ensuring that suitable access checks are made to cached data and that attributes are suitably propagated to the server. Generally, a misbehaving client that hosts only a single user can only impact files accessible to that single user. Misbehavior by a client hosting multiple users may impact files accessible to all of its users.
NFSv4 delegations increase the level of client responsibility, as a client that carries out actions requiring a delegation without obtaining that delegation will cause its user(s) to see unexpected and/or incorrect behavior.

Some uses of pNFS extend the responsibility of clients beyond delegations. In some configurations, the storage devices cannot perform fine-grained access checks to ensure that clients are only performing accesses within the bounds permitted to them by the pNFS operations with the server (e.g., the checks may only be possible at file system granularity rather than file granularity). In situations where this added responsibility placed on clients creates unacceptable security risks, pNFS configurations in which storage devices cannot perform fine-grained access checks SHOULD NOT be used. All pNFS server implementations MUST support NFSv4 access to any file accessible via pNFS in order to provide an interoperable means of file access in such situations. See Section 8 on Security for further discussion.

Finally, there are issues about how layouts interact with the existing NFSv4 abstractions of data delegations and byte range locking. These issues, and others, are also discussed here.

6. General Definitions

This protocol extension partitions the NFSv4 file system protocol into two parts, the control path and the data path. The control path is implemented by the extended (p)NFSv4 server. When the file system being exported by (p)NFSv4 uses storage devices that are visible to clients over the network, the data path may be implemented by direct communication between the extended (p)NFSv4 file system client and the storage devices. This leads to a few new terms used to describe the protocol extension and some clarifications of existing terms.

6.1 Metadata Server

A pNFS "server" or "metadata server" is a server as defined by RFC3530 [2], which additionally provides support of the pNFS minor extension. When using the pNFS NFSv4 minor extension, the metadata server may hold only the metadata associated with a file, while the data can be stored on the storage devices. However, as in NFSv4, data may also be written through the metadata server. Note: directory data is always accessed through the metadata server.

6.2 Client

A pNFS "client" is a client as defined by RFC3530 [2], with the addition of supporting the pNFS minor extension server protocol and with the addition of supporting at least one storage protocol for performing I/O directly to storage devices.

6.3 Storage Device

This is a device, or server, that controls the file's data, but leaves other metadata management up to the metadata server. A storage device could be another NFS server, an Object Storage Device (OSD), or a block device accessed over a SAN (e.g., either a Fibre Channel or an iSCSI SAN). The goal of this extension is to allow direct communication between clients and storage devices.

6.4 Storage Protocol

This is the protocol between the pNFS client and the storage device used to access the file data. The following three types have been described: file protocols (e.g., NFSv4), object protocols (e.g., OSD), and block/volume protocols (e.g., based on SCSI block commands). These protocols are in turn realizable over a variety of transport stacks.
We anticipate there will be variations on these storage protocols, including new protocols that are unknown at this time or experimental in nature. The details of the storage protocols will be described in other documents so that pNFS clients can be written to use these storage protocols. Use of NFSv4 itself as a file-based storage protocol is described in Section 9.

6.5 Control Protocol

This is a protocol used by the exported file system between the server and storage devices. Specification of such protocols is outside the scope of this draft. Such control protocols would be used to control such activities as the allocation and deallocation of storage and the management of state required by the storage devices to perform client access control. The control protocol should not be confused with protocols used to manage LUNs in a SAN and other sysadmin kinds of tasks.

While the pNFS protocol allows for any control protocol, in practice the control protocol is closely related to the storage protocol. For example, if the storage devices are NFS servers, then the protocol between the pNFS metadata server and the storage devices is likely to involve NFS operations. Similarly, when object storage devices are used, the pNFS metadata server will likely use iSCSI/OSD commands to manipulate storage.

However, this document does not mandate any particular control protocol. Instead, it just describes the requirements on the control protocol for maintaining attributes like modify time, the change attribute, and the end-of-file position.

6.6 Metadata

This is information about a file, like its name, owner, where it is stored, and so forth. The information is managed by the exported file system server (metadata server). Metadata also includes lower-level information like block addresses and indirect block pointers. Depending on the storage protocol, block-level metadata may not be managed by the metadata server; it may instead be managed by Object Storage Devices or other servers acting as a storage device.

6.7 Layout

A layout defines how a file's data is organized on one or more storage devices. There are many possible layout types. They vary in the storage protocol used to access the data, and in the aggregation scheme that lays out the file data on the underlying storage devices. Layouts are described in more detail below.

7. pNFS protocol semantics

This section describes the semantics of the pNFS protocol extension to NFSv4; this is the protocol between the client and the metadata server.

7.1 Definitions

This sub-section defines a number of terms necessary for describing layouts and their semantics. In addition, it more precisely defines how layouts are identified and how they can be composed of smaller granularity layout segments.

7.1.1 Layout Types

A layout describes the mapping of a file's data to the storage devices that hold the data. A layout is said to belong to a specific "layout type" (see Section 10.1 for its RPC definition). The layout type allows for variants to handle different storage protocols (e.g., block/volume [6], object [7], and file [Section 9] layout types). A metadata server, along with its control protocol, must support at least one layout type.
A private sub-range of the layout type name space is also defined. Values from the private layout type range can be used for internal testing or experimentation.

As an example, a file layout type could be an array of tuples (e.g., <deviceID, file_handle>), along with a definition of how the data is stored across the devices (e.g., striping). A block/volume layout might be an array of tuples that store <deviceID, block_number, block_count>, along with information about block size and the file offset of the first block. An object layout might be an array of tuples <deviceID, objectID> and an additional structure (i.e., the aggregation map) that defines how the logical byte sequence of the file data is serialized into the different objects. Note, the actual layouts are more complex than these simple expository examples.

This document defines an NFSv4 file layout type using a stripe-based aggregation scheme (see Section 9). Adjunct specifications are being drafted that precisely define other layout formats (e.g., block/volume [6] and object [7] layouts) to allow interoperability among clients and metadata servers.

7.1.2 Layout Iomode

The iomode indicates to the metadata server the client's intent to perform either READs (only) or a mixture of I/O possibly containing WRITEs as well as READs (i.e., READ/WRITE). For certain layout types, it is useful for a client to specify this intent at LAYOUTGET time. E.g., for block/volume based protocols, block allocation could occur when a READ/WRITE iomode is specified. A special LAYOUTIOMODE_ANY iomode is defined and can only be used for LAYOUTRETURN and LAYOUTRECALL, not for LAYOUTGET. It specifies that layouts pertaining to both READ and RW iomodes are being returned or recalled, respectively.

A storage device may validate I/O with regard to the iomode; this is dependent upon storage device implementation. Thus, if the client's layout iomode differs from the I/O being performed, the storage device may reject the client's I/O with an error indicating that a new layout with the correct I/O mode should be fetched. E.g., if a client gets a layout with a READ iomode and performs a WRITE to a storage device, the storage device is allowed to reject that WRITE.

The iomode does not conflict with OPEN share modes or lock requests; open mode checks and lock enforcement are always performed, and are logically separate from the pNFS layout level. As well, open modes and locks are the preferred method for restricting user access to data files. E.g., an OPEN of read, deny-write does not conflict with a LAYOUTGET containing an iomode of READ/WRITE performed by another client. Applications that depend on writing into the same file concurrently may use byte range locking to serialize their accesses.

7.1.3 Layout Segments

Until this point, layouts have been defined in a fairly vague manner. A layout is more precisely identified by the following tuple: <clientID, FH, layout type>; the FH refers to the FH of the file on the metadata server. Note, layouts describe a file, not a byte-range of a file.

Since a layout that describes an entire file may be very large, there is a desire to manage layouts in smaller chunks that correspond to byte-ranges of the file. For example, the entire layout need not be returned, recalled, or committed.
These chunks are called "layout segments" and are further identified by the byte-range they represent. Layout operations require the identification of the layout segment (i.e., clientID, FH, layout type, and byte-range), as well as the iomode. This structure allows clients and metadata servers to aggregate the results of layout operations into a single, maintained layout.

It is important to define when layout segments overlap and/or conflict with each other. For a layout segment to overlap another layout segment, both segments must be of the same layout type, correspond to the same filehandle, and have the same iomode; in addition, the byte-ranges of the segments must overlap. Layout segments conflict when they overlap and differ in the content of the layout (i.e., the storage device/file mapping parameters differ). Note, differing iomodes do not lead to conflicting layouts. It is permissible for layout segments with different iomodes, pertaining to the same byte range, to be held by the same client.

7.1.4 Device IDs

The "deviceID" is a short name for a storage device. In practice, a significant amount of information may be required to fully identify a storage device. Instead of embedding all that information in a layout, a level of indirection is used. Layouts embed device IDs, and a new operation (GETDEVICEINFO) is used to retrieve the complete identity information about the storage device according to its layout type. For example, the identity of a file server or object server could be an IP address and port. The identity of a block device could be a volume label. Due to multipath connectivity in a SAN environment, agreement on a volume label is considered the reliable way to locate a particular storage device.

The device ID is qualified by the layout type and unique per file system (FSID). This allows different layout drivers to generate device IDs without the need for coordination. In addition to GETDEVICEINFO, another operation, GETDEVICELIST, has been added to allow clients to fetch the mappings of multiple storage devices attached to a metadata server.

Clients cannot expect the mapping between device ID and storage device address to persist across server reboots; hence, a client MUST fetch new mappings on startup or upon detection of a metadata server reboot, unless it can revalidate its existing mappings. Not all layout types support such revalidation, and the means of doing so is layout specific. If data are reorganized from a storage device with a given device ID to a different storage device (i.e., if the mapping between storage device and data changes), the layout describing the data MUST be recalled rather than assigning the new storage device to the old device ID.

7.1.5 Aggregation Schemes

Aggregation schemes can describe layouts like simple one-to-one mapping, concatenation, and striping. A general aggregation scheme allows nested maps so that more complex layouts can be compactly described. The canonical aggregation type for this extension is striping, which allows a client to access storage devices in parallel. Even a one-to-one mapping is useful for a file server that wishes to distribute its load among a set of other file servers.
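The identification rules of this sub-section can be summarized in a short sketch. The following illustrative predicates implement the overlap and conflict definitions of Section 7.1.3; the types are hypothetical simplifications (e.g., the filehandle is reduced to a comparable scalar and the layout content to an opaque byte string).

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   struct layout_segment {
       uint32_t    layout_type;
       uint32_t    iomode;          /* READ or READ/WRITE */
       uint64_t    fh;              /* stand-in for the metadata server FH */
       uint64_t    offset, length;  /* byte-range of the segment */
       const void *content;         /* layout type specific mapping */
       size_t      content_len;
   };

   /* Overlap: same layout type, same filehandle, same iomode, and
      intersecting byte-ranges (Section 7.1.3). */
   bool segments_overlap(const struct layout_segment *a,
                         const struct layout_segment *b)
   {
       return a->layout_type == b->layout_type &&
              a->fh == b->fh &&
              a->iomode == b->iomode &&
              a->offset < b->offset + b->length &&
              b->offset < a->offset + a->length;
   }

   /* Conflict: overlapping segments whose mapping content differs.
      Differing iomodes never conflict. */
   bool segments_conflict(const struct layout_segment *a,
                          const struct layout_segment *b)
   {
       return segments_overlap(a, b) &&
              (a->content_len != b->content_len ||
               memcmp(a->content, b->content, a->content_len) != 0);
   }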
7.2 Guarantees Provided by Layouts

Layouts delegate to the client the ability to access data out of band. The layout guarantees the holder that the layout will be recalled when the state encapsulated by the layout becomes invalid (e.g., through some operation that directly or indirectly modifies the layout) or, possibly, when a conflicting layout is requested, as determined by the layout's iomode. When a layout is recalled, and then returned by the client, the client retains the ability to access file data with normal NFSv4 I/O operations through the metadata server. Only the right to do I/O out-of-band is affected.

Holding a layout does not guarantee that a user of the layout has the rights to access the data represented by the layout. All user access rights MUST be obtained through the appropriate open, lock, and access operations (i.e., those that would be used in the absence of pNFS). However, if a valid layout for a file is not held by the client, the storage device should reject all I/Os to that file's byte range that originate from that client. In summary, layouts and ordinary file access controls are independent. The act of modifying a file for which a layout is held does not necessarily conflict with the holding of the layout that describes the file being modified. However, with certain layout types (e.g., block/volume layouts), the layout's iomode must agree with the type of I/O being performed.

Depending upon the layout type and storage protocol in use, storage device access permissions may be granted by LAYOUTGET and may be encoded within the type specific layout. If access permissions are encoded within the layout, the metadata server must recall the layout when those permissions become invalid for any reason; for example, when a file becomes unwritable or inaccessible to a client. Note, clients are still required to perform the appropriate access operations as described above (e.g., open and lock ops). The degree to which it is possible for the client to circumvent these access operations must be clearly addressed by the individual layout type documents, as well as the consequences of doing so. In addition, these documents must be clear about the requirements and non-requirements for the checking performed by the server.

If the pNFS metadata server supports mandatory byte range locks, then byte range locks must behave as specified by the NFSv4 protocol, as observed by users of files. If a storage device is unable to restrict access by a pNFS client that does not hold a required mandatory byte range lock, then the metadata server must not grant that client layouts, for that storage device, that permit any access conflicting with a mandatory byte range lock held by another client. In this scenario, it is also necessary for the metadata server to ensure that byte range locks are not granted to a client if any other client holds a conflicting layout; in this case all conflicting layouts must be recalled and returned before the lock request can be granted. This requires the pNFS server to understand the capabilities of its storage devices.
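The requirement in the preceding paragraph might be realized by a check of the following form on the metadata server. This is a sketch under the assumption that the server knows, per layout type, whether its storage devices can enforce mandatory locks; all names are hypothetical.

   #include <stdbool.h>
   #include <stdint.h>

   struct byte_range { uint64_t offset, length; };

   /* Hypothetical server-side helpers. */
   bool storage_enforces_mandatory_locks(uint32_t layout_type);
   bool conflicts_with_mandatory_lock(uint64_t fileid,
                                      const struct byte_range *range,
                                      uint32_t iomode);

   /* Grant a layout only if it cannot be used to bypass a mandatory
      byte range lock held by another client. */
   bool may_grant_layout(uint64_t fileid, uint32_t layout_type,
                         uint32_t iomode, const struct byte_range *range)
   {
       if (storage_enforces_mandatory_locks(layout_type))
           return true;    /* the storage device will police access */
       return !conflicts_with_mandatory_lock(fileid, range, iomode);
   }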
7.3 Getting a Layout

A client obtains a layout through a new operation, LAYOUTGET. The metadata server will give out layouts of a particular type (e.g., block/volume, object, or file) and aggregation as requested by the client. The client selects an appropriate layout type which the server supports and the client is prepared to use. The layout returned to the client may not line up exactly with the requested byte range. A field within the LAYOUTGET request, "minlength", specifies the minimum overlap that MUST exist between the requested layout and the layout returned by the metadata server. The "minlength" field should specify a size of at least one. A metadata server may give out multiple overlapping, non-conflicting layout segments to the same client in response to a LAYOUTGET.

There is no implied ordering between getting a layout and performing a file OPEN. For example, a layout may first be retrieved by placing a LAYOUTGET operation in the same compound as the initial file OPEN. Once the layout has been retrieved, it can be held across multiple OPEN and CLOSE sequences.

The storage protocol used by the client to access the data on the storage device is determined by the layout's type. The client needs to select a "layout driver" that understands how to interpret and use that layout. The API used by the client to talk to its drivers is outside the scope of the pNFS extension. The storage protocol between the client's layout driver and the actual storage is covered by other protocol specifications such as iSCSI (block storage), OSD (object storage), or NFS (file storage).

Although the metadata server is in control of the layout for a file, the pNFS client can provide hints to the server, when a file is opened or created, about the preferred layout type and aggregation scheme. The pNFS extension introduces a LAYOUT_HINT attribute that the client can set at creation time to provide a hint to the server for new files. It is suggested that this attribute be set as one of the initial attributes to OPEN when creating a new file. Setting this attribute separately, after the file has been created, could make it difficult, or impossible, for the server implementation to comply.

7.4 Committing a Layout

Due to the nature of the protocol, the file attributes and data location mapping (e.g., which offsets store data vs. store holes) that exist on the metadata storage device may become inconsistent in relation to the data stored on the storage devices; e.g., when WRITEs occur before a layout has been committed (e.g., between a LAYOUTGET and a LAYOUTCOMMIT). Thus, it is necessary to occasionally re-sync this state and make it visible to other clients through the metadata server.

The LAYOUTCOMMIT operation is responsible for committing a modified layout segment to the metadata server. Note: the data should be written and committed to the appropriate storage devices before the LAYOUTCOMMIT occurs. If the data is being written asynchronously through the metadata server, a COMMIT to the metadata server is required to sync the data and make it visible on the storage devices (see Section 7.6 for more details). The scope of this operation depends on the storage protocol in use. For block/volume-based layouts, it may require updating the block list that comprises the file and committing this layout to stable storage. For file-based layouts, it requires some synchronization of attributes between the metadata and storage devices (i.e., mainly the size attribute; EOF).
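The ordering described in the paragraph above is sketched below for the case where the client writes through the data path; the function names are hypothetical.

   struct pnfs_file;   /* opaque per-file client state */

   /* Hypothetical client-side operations. */
   int write_dirty_data_to_storage_devices(struct pnfs_file *f);
   int commit_on_storage_devices(struct pnfs_file *f);
   int layoutcommit(struct pnfs_file *f);

   /* Write and commit data on the storage devices first, then make
      the result visible through the metadata server. */
   int flush_then_layoutcommit(struct pnfs_file *f)
   {
       int err = write_dirty_data_to_storage_devices(f);
       if (err == 0)
           err = commit_on_storage_devices(f);
       if (err == 0)
           err = layoutcommit(f);
       return err;
   }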
It is important to note that the level of synchronization is from the point of view of the client which issued the LAYOUTCOMMIT. The updated state on the metadata server need only reflect the state as of the client's last operation previous to the LAYOUTCOMMIT; it need not reflect a globally synchronized state (e.g., other clients may be performing, or may have performed, I/O since the client's last operation and the LAYOUTCOMMIT).

The control protocol is free to synchronize the attributes before it receives a LAYOUTCOMMIT; however, upon successful completion of a LAYOUTCOMMIT, state that exists on the metadata server that describes the file MUST be in sync with the state existing on the storage devices that comprise that file as of the issuing client's last operation. Thus, a client that queries the size of a file between a WRITE to a storage device and the LAYOUTCOMMIT may observe a size that does not reflect the actual data written.

7.4.1 LAYOUTCOMMIT and mtime/atime/change

The change attribute and the modify/access times may be updated by the server at LAYOUTCOMMIT time, since for some layout types the change attribute and atime/mtime cannot be updated by the appropriate I/O operation performed at a storage device. The arguments to LAYOUTCOMMIT allow the client to provide suggested access and modify time values to the server. Again, depending upon the layout type, these client provided values may or may not be used. The server should sanity check the client provided values before they are used. For example, the server should ensure that time does not flow backwards. According to the NFSv4 specification, the client always has the option to set these attributes through an explicit SETATTR operation.

As mentioned, for some layout protocols the change attribute and mtime/atime may be updated at or after the time the I/O occurred (e.g., if the storage device is able to communicate these attributes to the metadata server). If, upon receiving a LAYOUTCOMMIT, the server implementation is able to determine that the file did not change since the last time the change attribute was updated (e.g., no WRITEs or over-writes occurred), the implementation need not update the change attribute; file-based protocols may have enough state to make this determination or may update the change attribute upon each file modification. This also applies for mtime and atime; if the server implementation is able to determine that the file has not been modified since the last mtime update, the server need not update mtime at LAYOUTCOMMIT time. Once LAYOUTCOMMIT completes, the new change attribute and mtime/atime should be visible if that file was modified since the latest previous LAYOUTCOMMIT or LAYOUTGET.

7.4.2 LAYOUTCOMMIT and size

The file's size may be updated at LAYOUTCOMMIT time as well. The LAYOUTCOMMIT operation contains an argument that indicates the last byte offset to which the client wrote ("last_write_offset"). Note: for this offset to be viewed as a file size it must be incremented by one byte (e.g., a write to offset 0 would map into a file size of 1, but the last write offset is 0).
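The offset-to-size conversion, together with the sanity check of option 1 in the list that follows, might look like this sketch (hypothetical names):

   #include <stdint.h>

   /* A last write at offset N implies a size of at least N + 1 bytes;
      never truncate based on a smaller last write offset. */
   uint64_t size_from_last_write_offset(uint64_t last_write_offset,
                                        uint64_t current_size)
   {
       uint64_t implied_size = last_write_offset + 1;
       return implied_size > current_size ? implied_size : current_size;
   }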
The metadata server may do one of the following:

1. It may update the file's size based on the last write offset. However, to the extent possible, the metadata server should sanity check any value to which the file's size is going to be set. E.g., it must not truncate the file based on the client presenting a smaller last write offset than the file's current size.

2. If it has sufficient other knowledge of the file size (e.g., by querying the storage devices through the control protocol), it may ignore the client provided argument and use the query-derived value.

3. It may use the last write offset as a hint, subject to correction when other information is available as above.

The method chosen to update the file's size will depend on the storage device's and/or the control protocol's implementation. For example, if the storage devices are block devices with no knowledge of file size, the metadata server must rely on the client to set the size appropriately. A new size flag and length are also returned in the results of a LAYOUTCOMMIT. This union indicates whether a new size was set, and to what length it was set. If a new size is set as a result of LAYOUTCOMMIT, then the metadata server must reply with the new size. As well, if the size is updated, the metadata server, in conjunction with the control protocol, SHOULD ensure that the new size is reflected by the storage devices immediately upon return of the LAYOUTCOMMIT operation; e.g., a READ up to the new file size should succeed on the storage devices (assuming no intervening truncations). Again, if the client wants to explicitly zero-extend or truncate a file, SETATTR must be used; it need not be used when simply writing past EOF.

Since client layout holders may be unaware of changes made to the file's size by other clients (through LAYOUTCOMMIT or SETATTR), an additional callback/notification has been added for pNFS. CB_SIZECHANGED is a notification that the metadata server sends to layout holders to notify them of a change in file size. This is preferred over issuing CB_LAYOUTRECALL to each of the layout holders.

7.4.3 LAYOUTCOMMIT and layoutupdate

The LAYOUTCOMMIT operation contains a "layoutupdate" argument. This argument is a layout type specific structure. The structure can be used to pass arbitrary layout type specific information from the client to the metadata server at LAYOUTCOMMIT time. For example, if using a block/volume layout, the client can indicate to the metadata server which reserved or allocated blocks it used and which it did not. The "layoutupdate" structure need not be the same structure as the layout returned by LAYOUTGET. The structure is defined by the layout type and is opaque to LAYOUTCOMMIT.

7.5 Recalling a Layout

7.5.1 Basic Operation

Since a layout protects a client's access to a file via a direct client-storage-device path, a layout need only be recalled when it is semantically unable to serve this function. Typically, this occurs when the layout no longer encapsulates the true location of the file over the byte range it represents. Any operation or action (e.g., server driven restriping or load balancing) that changes the layout will result in a recall of the layout.
A layout is recalled by the CB_LAYOUTRECALL callback operation (see Section 14.19). This callback can either recall a layout segment identified by a byte range, or all the layouts associated with a file system (FSID). However, there is no single operation to return all layouts associated with an FSID; multiple layout segments may be returned in a single compound operation. Section 7.5.3 discusses sequencing issues surrounding the getting, returning, and recalling of layouts.

The iomode is also specified when recalling a layout or layout segment. Generally, the iomode in the recall request must match the layout, or segment, being returned; e.g., a recall with an iomode of RW should cause the client to only return RW layout segments (not R segments). However, a special LAYOUTIOMODE_ANY enumeration is defined to enable recalling a layout of any type (i.e., the client must return both read-only and read/write layouts).

A REMOVE operation may cause the metadata server to recall the layout to prevent the client from accessing a non-existent file and to reclaim state stored on the client. Since a REMOVE may be delayed until the last close of the file has occurred, the recall may also be delayed until this time. As well, once the file has been removed, after the last reference, the client SHOULD no longer be able to perform I/O using the layout (e.g., with file-based layouts an error such as ESTALE could be returned).

Although the pNFS extension does not alter the caching capabilities of clients, or their semantics, it recognizes that some clients may perform more aggressive write-behind caching to optimize the benefits provided by pNFS. However, write-behind caching may impact the latency in returning a layout in response to a CB_LAYOUTRECALL, just as caching impacts DELEGRETURN with regard to data delegations. Client implementations should limit the amount of dirty data they have outstanding at any one time. Server implementations may fence clients from performing direct I/O to the storage devices if they perceive that the client is taking too long to return a layout once recalled. A server may be able to monitor client progress by watching client I/Os or by observing LAYOUTRETURNs of sub-portions of the recalled layout. The server can also limit the amount of dirty data to be flushed to storage devices by limiting the byte ranges covered in the layouts it gives out.

Once a layout has been returned, the client MUST NOT issue I/Os to the storage devices for the file, byte range, and iomode represented by the returned layout. If a client does issue an I/O to a storage device for which it does not hold a layout, the storage device SHOULD reject the I/O.

7.5.2 Recall Callback Robustness

For simplicity, the discussion thus far has assumed that pNFS client state for a file exactly matches the pNFS server state for that file and client regarding layout ranges and permissions. This assumption leads to the implicit assumption that any callback results in a LAYOUTRETURN or set of LAYOUTRETURNs that exactly match the range in the callback, since both client and server agree about the state being maintained. However, it can be useful if this assumption does not always hold.
For example:

o It may be useful for clients to be able to discard layout information without calling LAYOUTRETURN. If conflicts that require callbacks are very rare, and a server can use a multi-file callback to recover per-client resources (e.g., via an FSID recall, or a multi-file recall within a single compound), the result may be significantly less client-server pNFS traffic.

o It may be similarly useful for servers to enhance the information they maintain about what layout ranges are held by a client beyond what a client actually holds. In the extreme, a server could manage conflicts on a per-file basis, only issuing whole-file callbacks even though clients may request and be granted sub-file ranges.

o As well, the synchronized state assumption is not robust to minor errors. A more robust design would allow for divergence between client and server and the ability to recover. It is vital that a client not assign itself layout permissions beyond what the server has granted, and that the server not forget layout permissions that have been granted, in order to avoid errors. On the other hand, if a server believes that a client holds a layout segment that the client does not know about, it is useful for the client to be able to issue the LAYOUTRETURN that the server is expecting in response to a recall.

Thus, in light of the above, it is useful for a server to be able to issue callbacks for layout ranges it has not granted to a client, and for a client to return ranges it does not hold. A pNFS client must always return layout segments that comprise the full range specified by the recall. Note, the full recalled layout range need not be returned as part of a single operation, but may be returned in segments. This allows the client to stage the flushing of dirty data, layout commits, and returns. Also, it indicates to the metadata server that the client is making progress.

In order to ensure client/server convergence on the layout state, the final LAYOUTRETURN operation in a sequence of returns for a particular recall SHOULD specify the entire range being recalled, even if layout segments pertaining to partial ranges were previously returned. In addition, if the client holds no layout segment that overlaps the range being recalled, the client should return the NFS4ERR_NOMATCHING_LAYOUT error code. This allows the server to update its view of the client's layout state.
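A client's handling of a recall, staged returns included, might follow this outline. It is a sketch only: the error code value and all helper names are hypothetical, and error handling is omitted for brevity.

   #include <stddef.h>
   #include <stdint.h>

   enum { NFS4ERR_NOMATCHING_LAYOUT = 10060 };   /* value illustrative */

   struct byte_range { uint64_t offset, length; };
   struct held_segment { uint64_t offset, length; };
   struct layout_state;    /* opaque list of held segments for the file */

   /* Hypothetical client-side helpers. */
   int  client_holds_overlap(struct layout_state *st,
                             const struct byte_range *r);
   struct held_segment *next_overlapping_segment(struct layout_state *st,
                                                 const struct byte_range *r);
   void flush_dirty_data_and_commit(struct held_segment *seg);
   int  layoutreturn(uint64_t offset, uint64_t length);

   int respond_to_layoutrecall(struct layout_state *st,
                               const struct byte_range *recalled)
   {
       struct held_segment *seg;

       if (!client_holds_overlap(st, recalled))
           return NFS4ERR_NOMATCHING_LAYOUT;   /* nothing overlaps the recall */

       /* Stage flushing, commits, and partial returns... */
       while ((seg = next_overlapping_segment(st, recalled)) != NULL) {
           flush_dirty_data_and_commit(seg);
           layoutreturn(seg->offset, seg->length);
       }
       /* ...then converge: the final return SHOULD cover the entire
          recalled range. */
       return layoutreturn(recalled->offset, recalled->length);
   }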
7.5.3 Recall/Return Sequencing

As with other stateful operations, pNFS requires the correct sequencing of layout operations. This proposal assumes that sessions will precede or accompany pNFS into NFSv4.x, and thus that pNFS will require the use of sessions. If the sessions proposal does not precede pNFS, then this proposal needs to be modified to provide for the correct sequencing of pNFS layout operations. Also, this specification relies on the sessions protocol to provide the correct sequencing between regular operations and callbacks. It is the server's responsibility to avoid inconsistencies regarding the layouts it hands out, and the client's responsibility to properly serialize its layout requests.

One critical issue with operation sequencing concerns callbacks. The protocol must defend against races between the reply to a LAYOUTGET operation and a subsequent CB_LAYOUTRECALL. It MUST NOT be possible for a client to process the CB_LAYOUTRECALL for a layout that it has not received in a reply message to a LAYOUTGET.

7.5.3.1 Client Side Considerations

Consider a pNFS client that has issued a LAYOUTGET and then receives an overlapping recall callback for the same file. There are two possibilities, which the client cannot distinguish when the callback arrives:

1. The server processed the LAYOUTGET before issuing the recall, so the LAYOUTGET response is in flight, and must be waited for because it may be carrying layout info that will need to be returned to deal with the recall callback.

2. The server issued the callback before receiving the LAYOUTGET. The server will not respond to the LAYOUTGET until the recall callback is processed.

This can cause deadlock, as the client must wait for the LAYOUTGET response before processing the recall in the first case, but that response will not arrive until after the recall is processed in the second case. This deadlock can be avoided by adhering to the following requirements:

o A LAYOUTGET MUST be rejected with an error (i.e., NFS4ERR_RECALLCONFLICT) if there is an overlapping outstanding recall callback to the same client.

o When processing a recall, the client MUST wait for a response to all conflicting outstanding LAYOUTGETs before performing any RETURN that could be affected by any such response.

o The client SHOULD wait for responses to all operations required to complete a recall before sending any LAYOUTGETs that would conflict with the recall, because the server is likely to return errors for them.

Now the client can wait for the LAYOUTGET response, as it will be received in both cases.

7.5.3.2 Server Side Considerations

Consider a related situation from the pNFS server's point of view. The server has issued a recall callback and receives an overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s) that respond to the recall callback. Again, there are two cases:

1. The client issued the LAYOUTGET before processing the recall callback.

2. The client issued the LAYOUTGET after processing the recall callback, but it arrived before the LAYOUTRETURN that completed that processing.

The simplest approach is to always reject the overlapping LAYOUTGET. The client has two ways to avoid this result: it can issue the LAYOUTGET as a subsequent element of a COMPOUND containing the LAYOUTRETURN that completes the recall callback, or it can wait for the response to that LAYOUTRETURN.

This leads to a more general problem; in the absence of a callback, if a client issues concurrent overlapping LAYOUTGET and LAYOUTRETURN operations, it is possible for the server to process them in either order. Again, a client must take the appropriate precautions in serializing its actions.

[ASIDE: HighRoad forbids a client from doing this, as the per-file layout stateid will cause one of the two operations to be rejected with a stale layout stateid.
This approach is simpler and produces better results by comparison to allowing concurrent operations, at least for this sort of conflict case, because server execution of operations in an order not anticipated by the client may produce results that are not useful to the client (e.g., if a LAYOUTRETURN is followed by a concurrent overlapping LAYOUTGET, but they are executed in the other order, the client will not retain layout extents for the overlapping range).]

7.6 Metadata Server Write Propagation

Asynchronous writes written through the metadata server may be propagated lazily to the storage devices. For data written asynchronously through the metadata server, a client performing a read at the appropriate storage device is not guaranteed to see the newly written data until a COMMIT occurs at the metadata server. While the write is pending, reads to the storage device can return either the old data, the new data, or a mixture thereof. After either a synchronous write completes, or a COMMIT is received (for asynchronously written data), the metadata server must ensure that storage devices return the new data and that the data has been written to stable storage. If the server implements its storage in any way such that it cannot obey these constraints, then it must recall the layouts to prevent reads being done that cannot be handled correctly.

7.7 Crash Recovery

Crash recovery is complicated due to the distributed nature of the pNFS protocol. In general, crash recovery for layouts is similar to crash recovery for delegations in the base NFSv4 protocol. However, the client's ability to perform I/O without contacting the metadata server introduces subtleties that must be handled correctly if file system corruption is to be avoided.

7.7.1 Leases

The layout lease period plays a critical role in crash recovery. Depending on the capabilities of the storage protocol, it is crucial that the client be able to maintain an accurate layout lease timer to ensure that I/Os are not issued to storage devices after expiration of the layout lease period. In order for the client to do so, it must know which operations renew a lease.

7.7.1.1 Lease Renewal

The current NFSv4 specification allows for implicit lease renewals to occur upon receiving an I/O. However, due to the distributed pNFS architecture, implicit lease renewals are limited to operations performed at the metadata server; this includes I/O performed through the metadata server. So, a client must not assume that READ and WRITE I/O to storage devices implicitly renew lease state.

If sessions are required for pNFS, as has been suggested, then the SEQUENCE operation is to be used to explicitly renew leases. It is proposed that the SEQUENCE operation be extended to return all the specific information that RENEW does, but as part of a normal reply rather than as an error. When using sessions, beginning each compound with the SEQUENCE operation allows renewals to be performed without an additional operation and without an additional request. Again, the client must not rely on any operation to the storage devices to renew a lease. Using the SEQUENCE operation for renewals simplifies the client's perception of lease renewal.
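Under the sessions assumption, a client's renewal logic reduces to making sure some compound, even one containing only SEQUENCE, is sent often enough. A sketch, with hypothetical names and a deliberately conservative threshold:

   #include <time.h>

   struct session;   /* opaque */

   struct lease_state {
       struct session *session;
       time_t          last_renewing_send;   /* send time of last compound */
       time_t          lease_period;
   };

   void send_sequence_only_compound(struct session *s);

   /* Called periodically; renews via SEQUENCE if the client has been
      idle for half the lease period. */
   void lease_renewal_tick(struct lease_state *ls, time_t now)
   {
       if (now - ls->last_renewing_send >= ls->lease_period / 2) {
           send_sequence_only_compound(ls->session);   /* renews the lease */
           ls->last_renewing_send = now;
       }
   }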
7.7.1.2 Client Lease Timer

Depending on the storage protocol and layout type in use, it may be crucial that the client not issue I/Os to storage devices if the corresponding layout's lease has expired. Doing so may lead to file system corruption if the layout has been given out and used by another client. In order to prevent this, the client must maintain an accurate lease timer for all layouts held. RFC3530 has the following to say regarding the maintenance of a client lease timer:

   ...the client must track operations which will renew the lease
   period. Using the time that each such request was sent and the
   time that the corresponding reply was received, the client should
   bound the time that the corresponding renewal could have occurred
   on the server and thus determine if it is possible that a lease
   period expiration could have occurred.

To be conservative, the client should start its lease timer based on the time that it issued the operation to the metadata server, rather than based on the time of the response.

It is also necessary to take propagation delay into account when requesting a renewal of the lease:

   ...the client should subtract it from lease times (e.g., if the
   client estimates the one-way propagation delay as 200 msec, then
   it can assume that the lease is already 200 msec old when it gets
   it). In addition, it will take another 200 msec to get a response
   back to the server. So the client must send a lock renewal or
   write data back to the server 400 msec before the lease would
   expire.

Thus, the client must be aware of the one-way propagation delay and should issue renewals well in advance of lease expiration. Clients, to the extent possible, should try not to issue I/Os that may extend past the lease expiration time period. However, since this is not always possible, the storage protocol must be able to protect against the effects of in-flight I/Os, as is discussed later.

7.7.2 Client Recovery

Client recovery for layouts works in much the same way as NFSv4 client recovery works for other lock/delegation state. When an NFSv4 client reboots, it will lose all information about the layouts that it previously owned. There are two methods by which the server can reclaim these resources and allow otherwise conflicting layouts to be provided to other clients.

The first is through the expiry of the client's lease. If the client recovery time is longer than the lease period, the client's lease will expire and the server will know that state may be released. For layouts, the server may release the state immediately upon lease expiry, or it may allow the layout to persist, awaiting possible lease revival, as long as there are no conflicting requests.

On the other hand, the client may recover in less time than it takes for the lease period to expire. In such a case, the client will contact the server through the standard SETCLIENTID protocol. The server will find that the client's id matches the id of the previous client invocation, but that the verifier is different. The server uses this as a signal to release all the state associated with the client's previous invocation.

7.7.3 Metadata Server Recovery

The server recovery case is slightly more complex.
In general, the recovery process again follows the standard NFSv4 recovery model: the client will discover that the metadata server has rebooted when it receives an unexpected STALE_STATEID or STALE_CLIENTID reply from the server; it will then proceed to try to reclaim its previous delegations during the server's recovery grace period. However, layouts are not reclaimable in the same sense as data delegations; there is no reclaim bit, and thus no guarantee of continuity between the previous and new layout. This is not necessarily required, since a layout is not required in order to perform I/O; I/O can always be performed through the metadata server.

[NOTE: there is no reclaim bit for getting a layout. Thus, in the case of reclaiming an old layout obtained through LAYOUTGET, there is no guarantee of continuity. If a reclaim bit existed, a block/volume layout type might be happier knowing it got the layout back with the assurance of continuity. However, this would require the metadata server to trust the client's description of the exact layout it had (i.e., the full block-list); divergence is avoided by having the server tell the client what is contained within the layout.]

If the client has dirty data that it needs to write out, or an outstanding LAYOUTCOMMIT, the client should try to obtain a new layout segment covering the byte range covered by the previous layout segment. However, the client might not get the same layout segment it had. The range might be different, or it might get the same range but with different layout contents. For example, if using a block/volume-based layout, the blocks provisionally assigned by the layout might be different, in which case the client will have to write the corresponding blocks again; in the interest of simplicity, the client might decide to always write them again. Alternatively, the client might be unable to obtain a new layout, and thus must write the data through the metadata server using normal NFSv4.

There is an important safety concern associated with layouts that does not come into play in the standard NFSv4 case. If a standard NFSv4 client makes use of a stale delegation while reading, the consequence could be the delivery of stale data to an application. If writing, the use of a stale delegation or a stale stateid for an open or lock would result in the rejection of the client's write with the appropriate stale stateid error.

However, the pNFS layout enables the client to access the file system storage directly---if this access is not properly managed by the NFSv4 server, the client can potentially corrupt the file system data or metadata. Thus, it is vitally important that the client discover that the metadata server has rebooted, and that the client stop using stale layouts before the metadata server gives them away to other clients. To ensure this, the client must be implemented so that layouts are never used to access the storage after the client's lease timer has expired. It is crucial that clients have precise knowledge of the lease periods of their layouts. For specific details on lease renewal and client lease timers, see Section 7.7.1.
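Building on the renewal sketch in Section 7.7.1.1, a client might guard every storage device I/O with a conservative expiry check of the following form. This is a hypothetical illustration of the rules in Section 7.7.1.2, not required behavior; the rtt_estimate_ms value is assumed to come from the client's RPC layer.

   /* Illustrative guard, reusing the lease_state sketch from
    * Section 7.7.1.1.  Returns true only if the layout's lease is
    * safely unexpired. */
   #include <stdbool.h>
   #include <stdint.h>

   static bool layout_io_permitted(const struct lease_state *ls,
                                   uint64_t rtt_estimate_ms)
   {
       /* Treat the lease as already one-way-delay old when granted,
        * and leave room to get a renewal back to the server (see the
        * 400 msec example in Section 7.7.1.2). */
       uint64_t margin = rtt_estimate_ms;   /* both one-way delays */
       uint64_t expiry = ls->last_renewal_ms + ls->lease_period_ms;

       return now_ms() + margin < expiry;
   }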
The prohibition on using stale layouts applies to all layout-related accesses, especially the flushing of dirty data to the storage devices. If the client's lease timer expires because the client could not contact the server for any reason, the client MUST immediately stop using the layout until the server can be contacted and the layout can be officially recovered or reclaimed. However, this is only part of the solution. It is also necessary to deal with the consequences of I/Os already in flight.

The issue of the effects of I/Os started before lease expiration and possibly continuing through lease expiration is the responsibility of the data storage protocol and as such is layout type specific. There are two approaches the data storage protocol can take. The protocol may adopt a global solution which prevents all I/Os from being executed after the lease expiration, and thus is safe against a client that issues I/Os after lease expiration. This is the preferred solution, and it is the solution used by NFSv4 file-based layouts (see Section 9.6); likewise, the object storage device protocol allows storage to fence clients after lease expiration. Alternatively, the storage protocol may rely on proper client operation and only deal with the effects of lingering I/Os. These solutions may impact the client layout-driver, the metadata server layout-driver, and the control protocol.

7.7.4 Storage Device Recovery

Storage device crash recovery is mostly dependent upon the layout type in use. However, there are a few general techniques a client can use if it discovers a storage device has crashed while the client holds asynchronously written, non-committed data. First and foremost, it is important to realize that the client is the only one that has the information necessary to recover asynchronously written data, since it holds the dirty data and most probably nobody else does. Second, the best solution is for the client to err on the side of caution and attempt to re-write the dirty data through another path.

Rather than hold the asynchronously written data indefinitely, the client is encouraged to make sure that the data is written by using other paths to that data. The client may write the data to the metadata server, either synchronously or asynchronously with a subsequent COMMIT. Once it does this, there is no need to wait for the original storage device. In the event that the data range to be committed is transferred to a different storage device, as indicated in a new layout, the client may write to that storage device. Once the data has been committed at that storage device, either through a synchronous write or through a commit to that storage device (e.g., through the NFSv4 COMMIT operation for the NFSv4 file layout), the client should consider the transfer of responsibility for the data to the new server as strong evidence that this is the intended and most effective method for the client to get the data written. In either case, once the write is on stable storage (through either the storage device or the metadata server), there is no need to continue attempting to commit or synchronously write the data to the original storage device, or to wait for that storage device to become available.
That storage device may never be visible to the client again.

This approach does have a "lingering write" problem, similar to regular NFSv4. Suppose a WRITE is issued to a storage device but no response is received. The client breaks the connection, tries to re-establish a new one, and receives a recall of the layout. The client issues the I/O for the dirty data through an alternative path, for example through the metadata server, and it succeeds. The client then goes on to perform additional writes that all succeed. If the original write to the storage device succeeds at some later time, data inconsistency could result. The same problem can occur in regular NFSv4: for example, if a WRITE is held in a switch for some period of time while other writes are issued and replied to, and the original WRITE finally succeeds, the same issues can occur. However, this is solved by sessions in NFSv4.x.

8. Security Considerations

The pNFS extension partitions the NFSv4 file system protocol into two parts, the control path and the data path (i.e., the storage protocol). The control path contains all the new operations described by this extension; all existing NFSv4 security mechanisms and features apply to the control path. The combination of components in a pNFS system (see Figure 9) is required to preserve the security properties of NFSv4 with respect to an entity accessing data via a client, including security countermeasures to defend against threats for which NFSv4 provides defenses in environments where those threats are considered significant.

In some cases, the security countermeasures for connections to storage devices may take the form of physical isolation or a recommendation not to use pNFS in an environment. For example, it is currently infeasible to provide confidentiality protection for some storage device access protocols to protect against eavesdropping; in environments where eavesdropping on such protocols is of sufficient concern to require countermeasures, physical isolation of the communication channel (e.g., via direct connection from client(s) to storage device(s)) and/or a decision to forego use of pNFS (e.g., and fall back to NFSv4) may be appropriate courses of action.

In full generality, where communication with storage devices is subject to the same threats as client-server communication, the protocols used for that communication need to provide security mechanisms comparable to those available via RPCSEC_GSS for NFSv4. Many situations in which pNFS is likely to be used will not be subject to the overall threat profile for which NFSv4 is required to provide countermeasures.

pNFS implementations MUST NOT remove NFSv4's access controls. The combination of clients, storage devices, and the server is responsible for ensuring that all client-to-storage-device file data access respects NFSv4 ACLs and file open modes. This entails performing both of these checks on every access, in the client, the storage device, or both. If a pNFS configuration performs these checks only in the client, the risk of a misbehaving client obtaining unauthorized access is an important consideration in determining when it is appropriate to use such a pNFS configuration.
Such configurations SHOULD NOT be used when client-only access checks do not provide sufficient assurance that NFSv4 access control is being applied correctly.

The following subsections describe security considerations specifically applicable to each of the three major storage device protocol types supported for pNFS.

[Requiring strict equivalence to NFSv4 security mechanisms is the wrong approach. Will need to lay down a set of statements that each protocol has to make, starting with access check location/properties.]

8.1 File Layout Security

An NFSv4 file layout type is defined in Section 9; see Section 9.7 for additional security considerations and details. In summary, the NFSv4 file layout type requires that all I/O access checks MUST be performed by the storage devices, as defined by the NFSv4 specification. If another file layout type is being used, additional access checks may be required. But in all cases, the access control performed by the storage devices must be at least as strict as that specified by the NFSv4 protocol.

8.2 Object Layout Security

The object storage protocol MUST implement the security aspects described in version 1 of the T10 OSD protocol definition [5]. The remainder of this section gives an overview of the security mechanism described in that standard. The goal is to give the reader a basic understanding of the object security model. Any discrepancies between this text and the actual standard are obviously to be resolved in favor of the OSD standard.

The object storage protocol relies on a cryptographically secure capability to control accesses at the object storage devices. Capabilities are generated by the metadata server, returned to the client, and used by the client as described below to authenticate its requests to the Object Storage Device (OSD). Capabilities thereby achieve the required access and open mode checking. They allow the file server to define and check a policy (e.g., open mode) and the OSD to check and enforce that policy without knowing the details (e.g., user IDs and ACLs). Since capabilities are tied to layouts, and since they are used to enforce access control, the server should recall layouts and revoke capabilities when the file ACL or mode changes, in order to signal the clients.

Each capability is specific to a particular object, an operation on that object, and a byte range within the object, and has an explicit expiration time. The capabilities are signed with a secret key that is shared by the object storage devices (OSDs) and the metadata managers. Clients do not have device keys, so they are unable to forge capabilities. The following sketch of the algorithm should help the reader understand the basic model.

LAYOUTGET returns:

   {CapKey = MAC<SecretKey>(CapArgs), CapArgs}

The client uses CapKey to sign all the requests it issues for that object using the respective CapArgs. In other words, the CapArgs appears in the request to the storage device, and that request is signed with the CapKey as follows:

   ReqMAC = MAC<CapKey>(Req, Nonce)

The following is sent to the OSD: {CapArgs, Req, Nonce, ReqMAC}. The OSD uses the SecretKey it shares with the metadata server to compare the ReqMAC the client sent with a locally computed

   MAC<MAC<SecretKey>(CapArgs)>(Req, Nonce)

and if they match, the OSD assumes that the capability came from an authentic metadata server and allows access to the object, as permitted by the CapArgs. Therefore, if the server's LAYOUTGET reply, holding CapKey and CapArgs, is snooped by another client, it can be used to generate valid OSD requests (within the CapArgs access restriction).
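The scheme can be restated in C as follows. This is a sketch only: it assumes an HMAC-style MAC (HMAC-SHA1 via OpenSSL is used purely for illustration), and the actual MAC algorithm, field encodings, and wire format are those defined by the T10 OSD standard [5], not the ones shown here.

   /* Illustrative only: capability generation and request signing.
    * The real algorithm and encodings are defined by T10 OSD [5]. */
   #include <openssl/crypto.h>
   #include <openssl/evp.h>
   #include <openssl/hmac.h>

   #define MAC_LEN 20                      /* SHA-1 output size */

   /* Metadata server: CapKey = MAC<SecretKey>(CapArgs); CapKey and
    * CapArgs are returned to the client by LAYOUTGET. */
   void make_capkey(const unsigned char *secret_key, int secret_len,
                    const unsigned char *capargs, size_t capargs_len,
                    unsigned char capkey[MAC_LEN])
   {
       HMAC(EVP_sha1(), secret_key, secret_len,
            capargs, capargs_len, capkey, NULL);
   }

   /* Client: ReqMAC = MAC<CapKey>(Req, Nonce); {CapArgs, Req, Nonce,
    * ReqMAC} is sent to the OSD.  Req and Nonce are assumed to be
    * concatenated into one buffer by the caller. */
   void sign_request(const unsigned char capkey[MAC_LEN],
                     const unsigned char *req_and_nonce, size_t len,
                     unsigned char reqmac[MAC_LEN])
   {
       HMAC(EVP_sha1(), capkey, MAC_LEN, req_and_nonce, len, reqmac, NULL);
   }

   /* OSD: recompute CapKey from the shared SecretKey and the CapArgs
    * presented in the request, recompute the MAC over (Req, Nonce),
    * and compare it with the ReqMAC the client sent. */
   int verify_request(const unsigned char *secret_key, int secret_len,
                      const unsigned char *capargs, size_t capargs_len,
                      const unsigned char *req_and_nonce, size_t len,
                      const unsigned char reqmac[MAC_LEN])
   {
       unsigned char capkey[MAC_LEN], expected[MAC_LEN];

       make_capkey(secret_key, secret_len, capargs, capargs_len, capkey);
       HMAC(EVP_sha1(), capkey, MAC_LEN, req_and_nonce, len, expected, NULL);
       return CRYPTO_memcmp(expected, reqmac, MAC_LEN) == 0;
   }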
To provide the privacy required for the capabilities returned by LAYOUTGET, the GSS-API can be used, e.g., by using a session key known to the file server and to the client to encrypt the whole layout or parts of it. Two general ways to provide privacy in the absence of GSS-API that are independent of NFSv4 are an isolated network, such as a VLAN, or a secure channel provided by IPsec.

8.3 Block/Volume Layout Security

As typically used, block/volume protocols rely on clients to enforce file access checks, since the storage devices are generally unaware of the files they are storing and, in particular, are unaware of which blocks belong to which file. In such environments, the physical addresses of blocks are exported to pNFS clients via layouts. An alternative method of block/volume protocol use is for the storage devices to export virtualized block addresses, which do reflect the files to which blocks belong. These virtual block addresses are exported to pNFS clients via layouts. This allows the storage device to make appropriate access checks while mapping virtual block addresses to physical block addresses.

In environments where access control is important and client-only access checks provide insufficient assurance of access control enforcement (e.g., there is concern about a malicious or malfunctioning client skipping the access checks), and where physical block addresses are exported to clients, the storage devices will generally be unable to compensate for these client deficiencies.

In such threat environments, block/volume protocols SHOULD NOT be used with pNFS, unless the storage device is able to implement the appropriate access checks, via use of virtualized block addresses or other means. NFSv4 without pNFS, or pNFS with a different type of storage protocol, would be a more suitable means to access files in such environments. Storage-device/protocol-specific methods (e.g., LUN masking/mapping) may be available to prevent malicious or high-risk clients from directly accessing storage devices.

9. The NFSv4 File Layout Type

This section describes the semantics and format of NFSv4 file-based layouts.

9.1 File Striping and Data Access

The file layout type describes a method for striping data across multiple devices. The data for each stripe unit is stored within an NFSv4 file located on a particular storage device.
The structures used to describe the stripe layout are as follows:

   enum stripetype4 {
           STRIPE_SPARSE = 1,
           STRIPE_DENSE  = 2
   };

   struct nfsv4_file_layouthint {
           stripetype4     stripe_type;
           length4         stripe_unit;
           uint32_t        stripe_width;
   };

   struct nfsv4_file_layout {             /* Per data stripe */
           pnfs_deviceid4  dev_id<>;
           nfs_fh4         fh;
   };

   struct nfsv4_file_layouttype4 {        /* Per file */
           stripetype4        stripe_type;
           length4            stripe_unit;
           length4            file_size;
           nfsv4_file_layout  dev_list<>;
   };

The file layout specifies an ordered array of <dev_id, fh> tuples, as well as the stripe size, the type of stripe layout (discussed a little later), and the file's current size as of LAYOUTGET time. The filehandle, "fh", identifies the file on the storage device identified by "dev_id" that holds a particular stripe of the file. The "dev_id" array can be used for multipathing and is discussed further in Section 9.1.3. The stripe width is determined by the stripe unit size multiplied by the number of devices in the dev_list. The stripe held by <dev_id, fh> is determined by that tuple's position within the device list, "dev_list". For example, consider a dev_list consisting of the following pairs:

   <(1,0x12), (2,0x13), (1,0x15)> and stripe_unit = 32KB

The stripe width is 32KB * 3 devices = 96KB. The first entry specifies that the data file with filehandle 0x12 on device 1 holds the first 32KB of data (and every 32KB stripe beginning where the file's offset % 96KB == 0).

Devices may be repeated multiple times within the device list array; this is shown above, where storage device 1 holds both the first and third stripes of data. Filehandles can only be repeated if a sparse stripe type is used. Data is striped across the devices in the order listed in the device list array, in increments of the stripe size. A data file stored on a storage device MUST map to a single file as defined by the metadata server; i.e., data from two files as viewed by the metadata server MUST NOT be stored within the same data file on any storage device.

The "stripe_type" field specifies how the data is laid out within the data file on a storage device. It allows for two different data layouts: sparse, and dense or packed. The stripe type determines the calculation that must be made to map the client-visible file offset to the offset within the data file located on the storage device.

The layout hint structure is described in more detail in Section 10.7. It is used by the client, via the FILE_LAYOUT_HINT attribute, to specify the type of layout to be used for a newly created file.

9.1.1 Sparse and Dense Storage Device Data Layouts

The stripe_type field allows for two storage device data file representations.
Example sparse and dense storage device data layouts are illustrated below:

   Sparse file-layout (stripe_unit = 4KB)
   ------------------

This is represented by the following file layout on the storage devices:

   Offset  ID:0      ID:1      ID:2
   0       +--+      +--+      +--+      +--+  indicates a
           |//|      |  |      |  |      |//|  stripe that
   4KB     +--+      +--+      +--+      +--+  contains data
           |  |      |//|      |  |
   8KB     +--+      +--+      +--+
           |  |      |  |      |//|
   12KB    +--+      +--+      +--+
           |//|      |  |      |  |
   16KB    +--+      +--+      +--+
           |  |      |//|      |  |
           +--+      +--+      +--+

The sparse file-layout has holes for the byte ranges not exported by that storage device. This allows clients to access data using the real offset into the file, regardless of the storage device's position within the stripe. However, if a client writes to one of the holes (e.g., offset 4-12KB on device 1), then an error MUST be returned by the storage device. This requires that the storage device have knowledge of the layout for each file.

When using a sparse layout, the offset into the storage device data file is the same as the offset into the main file.

   Dense/packed file-layout (stripe_unit = 4KB)
   ------------------------

This is represented by the following file layout on the storage devices:

   Offset  ID:0      ID:1      ID:2
   0       +--+      +--+      +--+
           |//|      |//|      |//|
   4KB     +--+      +--+      +--+
           |//|      |//|      |//|
   8KB     +--+      +--+      +--+
           |//|      |//|      |//|
   12KB    +--+      +--+      +--+
           |//|      |//|      |//|
   16KB    +--+      +--+      +--+
           |//|      |//|      |//|
           +--+      +--+      +--+

The dense or packed file-layout does not leave holes on the storage devices. Each stripe unit is spread across the storage devices. As such, the storage devices need not know the file's layout, since the client is allowed to write to any offset.

The calculation to determine the byte offset within the data file for dense storage device layouts is:

   stripe_width = stripe_unit * N; where N = |dev_list|
   dev_offset = floor(file_offset / stripe_width) * stripe_unit +
                file_offset % stripe_unit

Regardless of the storage device data file layout, the calculation to determine the index into the device array is the same:

   dev_idx = floor(file_offset / stripe_unit) mod N

Section 9.5 describes the semantics for dealing with reads to holes within the striped file. This is of particular concern, since each individual component stripe file (i.e., the component of the striped file that lives on a particular storage device) may be of a different length. Thus, clients may experience 'short' reads when reading off the end of one of these component files.
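The calculations above can be written out directly; the following is a minimal C sketch, with the geometry of the example in Section 9.1 (three devices, stripe_unit = 32KB) noted in the comments.

   #include <stdint.h>

   /* Index into the per-file device list for a given file offset.
    * With stripe_unit = 32KB and N = 3, offset 0 maps to index 0
    * (device 1, fh 0x12), offset 32KB to index 1, offset 64KB to
    * index 2, and offset 96KB wraps back to index 0. */
   uint32_t dev_index(uint64_t file_offset, uint64_t stripe_unit,
                      uint32_t n)
   {
       return (uint32_t)((file_offset / stripe_unit) % n);
   }

   /* Offset within the data file on the storage device. */
   uint64_t dev_offset(uint64_t file_offset, uint64_t stripe_unit,
                       uint32_t n, int sparse)
   {
       if (sparse)          /* STRIPE_SPARSE: same as the file offset */
           return file_offset;

       /* STRIPE_DENSE: dev_offset = floor(file_offset / stripe_width)
        * * stripe_unit + file_offset % stripe_unit */
       uint64_t stripe_width = stripe_unit * n;
       return (file_offset / stripe_width) * stripe_unit
            + file_offset % stripe_unit;
   }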
9.1.2 Metadata and Storage Device Roles

In many cases, the metadata server and the storage device will be separate pieces of physical hardware. The specification text is written as if that were always the case. However, the same physical hardware may be used to implement both a metadata server and a storage device. In that case, the specification text's references to these two entities are to be understood as referring to the same physical hardware implementing two distinct roles, and it is important that it be clearly understood on behalf of which role the hardware is executing at any given time.

Two sub-cases can be distinguished. In the first sub-case, the same physical hardware is used to implement both a metadata server and a data server, with each role addressed through a distinct network interface (e.g., the IP addresses for the metadata server and storage device are distinct). As long as the storage device address is obtained from the layout (using the device ID therein to obtain the appropriate storage device address) and is distinct from the metadata server's address, it is always clear, for any given request, to which role the request is directed, based on the destination IP address.

However, it may also be the case that even though the metadata server and storage device are distinct from one client's point of view, the roles may be reversed according to another client's point of view. For example, in the cluster file system model, a metadata server to one client may be a storage device to another client. Thus, it is safer to always mark the filehandle so that operations addressed to storage devices can be distinguished.

The second sub-case is where the metadata server and the storage device have the same network address. This requires that the distinction as to which role each request is directed be made on another basis. Since the network address is the same, the request is understood as being directed at one or the other based on the first current filehandle value of the request. If the first current filehandle is one derived from a layout (i.e., it is specified within the layout), and it is recommended that these be distinguishable, then the request is to be considered as executed by a storage device. Otherwise, the operation is to be understood as executed by the metadata server.

If a current filehandle is set that is inconsistent with the role to which the request is directed, then the error NFS4ERR_BADHANDLE should result. For example, if a request is directed at the storage device because the first current filehandle is from a layout, any attempt to set the current filehandle to a value not from a layout should be rejected. Similarly, if the first current filehandle was a value not from a layout, a subsequent attempt to set the current filehandle to a value obtained from a layout should be rejected.

9.1.3 Device Multipathing

The NFSv4 file layout supports multipathing to 'equivalent' devices. Device-level multipathing is primarily of use in the case of a data server failure: it allows the client to switch to another storage device that is exporting the same data stripe, without having to contact the metadata server for a new layout.

To support device multipathing, an array of device IDs is encoded within the data stripe portion of the file's layout. This array represents an ordered list of devices where the first element has the highest priority. Each device in the list MUST be 'equivalent' to every other device in the list, and each device must be attempted in the order specified.

Equivalent devices MUST export the same system image (e.g., the stateids and filehandles that they use are the same) and must provide the same consistency guarantees. Two equivalent storage devices must also have sufficient connections to the storage, such that writing to one storage device is equivalent to writing to another; this also applies to reading.
Also, if multiple copies of the same data exist, reading from one must provide access to all existing copies. As such, it is unlikely that multipathing will provide additional benefit in the case of an I/O error.

[NOTE: the error cases in which a client is expected to attempt an equivalent storage device should be specified.]

9.1.4 Operations Issued to Storage Devices

Clients MUST use the filehandle described within the layout when accessing data on the storage devices. When using the layout's filehandle, the client MUST only issue READ, WRITE, PUTFH, COMMIT, and NULL operations to the storage device associated with that filehandle. If a client issues an operation other than those specified above, using the filehandle and storage device listed in the client's layout, that storage device SHOULD return an error to the client. The client MUST follow the instructions implied by the layout (i.e., which filehandles to use on which devices). As described in Section 7.2, a client MUST NOT issue I/Os to storage devices for which it does not hold a valid layout. The storage devices may reject such requests.

GETATTR and SETATTR MUST be directed to the metadata server. In the case of a SETATTR of the size attribute, the control protocol is responsible for propagating size updates/truncations to the storage devices. In the case of extending WRITEs to the storage devices, the new size must be visible on the metadata server once a LAYOUTCOMMIT has completed (see Section 7.4.2). Section 9.5 describes the mechanism by which the client is to handle storage device files that do not reflect the metadata server's size.

9.2 Global Stateid Requirements

Note that there are no stateids embedded within the layout. The client MUST use the stateid representing open or lock state as returned by an earlier metadata operation (e.g., OPEN, LOCK), or a special stateid, to perform I/O on the storage devices, as in regular NFSv4. Special stateid usage for I/O is subject to the NFSv4 protocol specification. The stateid used for I/O MUST have the same effect and be subject to the same validation on a storage device as it would if the I/O were being performed on the metadata server itself in the absence of pNFS. This has the implication that stateids are globally valid on both the metadata and storage devices. This requires the metadata server to propagate changes in lock and open state to the storage devices, so that the storage devices can validate I/O accesses. This is discussed further in Section 9.4. Depending on when stateids are propagated, the existence of a valid stateid on the storage device may act as proof of a valid layout.

[NOTE: a number of proposals have been made that have the possibility of limiting the amount of validation performed by the storage device; if any of these proposals are accepted or obtain consensus, the global stateid requirement can be revisited.]

9.3 The Layout Iomode

The layout iomode need not be used by the metadata server when servicing NFSv4 file-based layouts, although in some circumstances it may be useful. For example, if the server implementation supports reading from read-only replicas or mirrors, it would be useful for the server to return a layout enabling the client to do so.
As such, the client should set the iomode based on its intent to read or write the data. The client may default to an iomode of READ/WRITE (LAYOUTIOMODE_RW). The iomode need not be checked by the storage devices when clients perform I/O. However, the storage devices SHOULD still validate that the client holds a valid layout and return an error if the client does not.

9.4 Storage Device State Propagation

Since the metadata server, which handles lock and open-mode state changes as well as ACLs, may not be collocated with the storage devices where I/O accesses are validated, the server implementation MUST take care of propagating changes of this state to the storage devices. Once the propagation to the storage devices is complete, the full effect of those changes must be in effect at the storage devices. However, some state changes need not be propagated immediately, although all changes SHOULD be propagated promptly. These state propagations have an impact on the design of the control protocol, even though the control protocol is outside the scope of this specification. Immediate propagation refers to the synchronous propagation of state from the metadata server to the storage device(s); the propagation must be complete before returning to the client.

9.4.1 Lock State Propagation

Mandatory locks MUST be made effective at the storage devices before the request that establishes them returns to the caller. Thus, mandatory lock state MUST be synchronously propagated to the storage devices. On the other hand, since advisory lock state is not used for checking I/O accesses at the storage devices, there is no semantic reason for propagating advisory lock state to them. However, since every lock, unlock, open downgrade, and upgrade affects the sequence ID stored within the stateid, the stateid changes, which may cause difficulty if this state is not propagated. Thus, when a client uses a stateid on a storage device for I/O with a newer sequence number than the one the storage device has, the storage device should query the metadata server and get any pending updates to that stateid. This allows stateid sequence number changes to be propagated lazily, on demand.

[NOTE: With the reliance on the sessions protocol, there is no real need for the sequence ID portion of the stateid to be validated on I/O accesses. It is proposed that the sequence ID checking be obsoleted.]

Since updates to advisory locks neither confer nor remove privileges, these changes need not be propagated immediately, and may not need to be propagated promptly. The updates to advisory locks need only be propagated when the storage device needs to resolve a question about a stateid. In fact, if byte-range locking is not mandatory (i.e., is advisory), clients are advised not to use lock-based stateids for I/O at all. The stateids returned by open are sufficient and eliminate overhead for this kind of state propagation.

9.4.2 Open-mode Validation

Open-mode validation MUST be performed against the open mode(s) held by the storage devices. However, the server implementation may not always require the immediate propagation of changes.
Reductions in access because of CLOSEs or DOWNGRADEs do not have to be propagated immediately, but SHOULD be propagated promptly; whereas changes due to revocation MUST be propagated immediately. On the other hand, changes that expand access (e.g., new OPENs and upgrades) do not have to be propagated immediately, but the storage device SHOULD NOT reject a request because of mode issues without making sure that an upgrade is not in flight.

9.4.3 File Attributes

Since the SETATTR operation has the ability to modify state that is visible on both the metadata and storage devices (e.g., the size), care must be taken to ensure that the resultant state across the set of storage devices is consistent, especially when truncating or growing the file.

As described earlier, the LAYOUTCOMMIT operation is used to ensure that the metadata is synced with changes made to the storage devices. For the file-based protocol, it is necessary to re-sync state such as the size attribute and the setting of mtime/atime. See Section 7.4 for a full description of the semantics regarding LAYOUTCOMMIT and attribute synchronization. It should be noted that, by using a file-based layout type, it is possible to synchronize this state before LAYOUTCOMMIT occurs. For example, the control protocol can be used to query the attributes present on the storage devices.

Any changes to file attributes that control authorization or access, as reflected by ACCESS calls or READs and WRITEs on the metadata server, MUST be propagated to the storage devices for enforcement on READ and WRITE I/O calls. If the changes made on the metadata server result in more restrictive access permissions for any user, those changes MUST be propagated to the storage devices synchronously.

Recall that the NFSv4 protocol [2] specifies that:

   ...since the NFS version 4 protocol does not impose any
   requirement that READs and WRITEs issued for an open file have
   the same credentials as the OPEN itself, the server still must do
   appropriate access checking on the READs and WRITEs themselves.

This also includes changes to ACLs. The propagation of access right changes due to changes in ACLs may be asynchronous only if the server implementation is able to determine that the updated ACL is not more restrictive for any user specified in the old ACL. Due to the relative infrequency of ACL updates, it is suggested that all changes be propagated synchronously.

[NOTE: it has been suggested that the NFSv4 specification is in error with regard to allowing principals other than those used for OPEN to be used for file I/O. If changes within a minor version alter the behavior of NFSv4 with regard to OPEN principals and stateids, some access control checking at the storage device can be made less expensive. pNFS should be altered to take full advantage of these changes.]

9.5 Storage Device Component File Size

A potential problem exists when a component data file on a particular storage device is grown past EOF; the problem exists for both dense and sparse layouts. Imagine the following scenario: a client creates a new file (size == 0) and writes to byte 128KB; the client then seeks to the beginning of the file and reads byte 100. The client should receive zeros back as a result of the read.
However, if the read falls on a different storage device than the one the client originally wrote to, the storage device servicing the READ may still believe that the file's size is 0 and return no data with the EOF flag set. The storage device can only return zeros if it knows that the file's size has been extended. This would require the immediate propagation of the file's size to all storage devices, which is potentially very costly. Instead, another approach is outlined below.

First, the file's size is returned within the layout by LAYOUTGET. This size must reflect the latest size at the metadata server as set by the most recent of either the last LAYOUTCOMMIT or SETATTR; however, it may be more recent. Second, if a client performs a read that is returned short (i.e., the read is fully within the file's size, but the storage device indicates EOF and returns partial or no data), the client must assume that it is a hole and substitute zeros for the data not read, up until its known local file size. If a client extends the file, it must update its local file size. Third, if the metadata server receives a SETATTR of the size, or a LAYOUTCOMMIT that alters the file's size, the metadata server must send CB_SIZECHANGED messages with the new size to clients holding layouts (it need not send a notification to the client that performed the operation that resulted in the size changing). Upon receipt of the CB_SIZECHANGED notification, clients must update their local size for that file. As well, if a new file size is returned as a result of LAYOUTCOMMIT, the client must update its local file size.

9.6 Crash Recovery Considerations

As described in Section 7.7, the layout type specific storage protocol is responsible for handling the effects of I/Os started before lease expiration and extending through lease expiration. The NFSv4 file layout type prevents all I/Os from being executed after lease expiration, without relying on a precise client lease timer and without requiring storage devices to maintain lease timers.

It works as follows. In the presence of sessions, each compound begins with a SEQUENCE operation that contains the "clientID". On the storage device, the clientID can be used to validate that the client has a valid layout for the I/O being performed; if it does not, the I/O is rejected. Before the metadata server takes any action to invalidate a layout given out by a previous instance, it must make sure that all layouts from that previous instance are invalidated at the storage devices. Note: it is sufficient to invalidate the stateids associated with the layout only if special stateids are not being used for I/O at the storage devices; otherwise, the layout itself must be invalidated.

This means that a metadata server may not restripe a file until it has contacted all of the storage devices to invalidate the layouts from the previous instance, nor may it give out locks that conflict with locks embodied by the stateids associated with any layout from the previous instance, without either doing a specific invalidation (as it would have to do anyway) or doing a global storage device invalidation.

9.7 Security Considerations

The NFSv4 file layout type MUST adhere to the security considerations outlined in Section 8.
More specifically, storage devices must make all of the required access checks on each READ or WRITE I/O, as determined by the NFSv4 protocol [2]. This impacts the control protocol and the propagation of state from the metadata server to the storage devices; see Section 9.4 for more details.

9.8 Alternate Approaches

Two alternate approaches exist for file-based layouts and the method used by clients to obtain stateids used for I/O. Both approaches embed stateids within the layout.

However, before examining these approaches it is important to understand the distinction between clients and owners. Delegations belong to clients, while locks (e.g., record and share reservations) are held by owners, which in turn belong to a specific client. As such, delegations can only protect against inter-client conflicts, not intra-client conflicts. Layouts are held by clients and SHOULD NOT be associated with state held by owners. Therefore, if stateids used for data access are embedded within a layout, these stateids can only act as delegation stateids, protecting against inter-client conflicts; stateids pertaining to an owner cannot be embedded within the layout. This has the implication that the client MUST arbitrate among all intra-client conflicts (e.g., arbitrating among lock requests by different processes) before issuing pNFS operations. Using the stateids stored within the layout, storage devices can only arbitrate between clients (not owners).

The first alternate approach is to do away with global stateids (stateids returned by OPEN/LOCK that are valid on the metadata server and storage devices) and use only stateids embedded within the layout. This approach has the drawback that the stateids used for I/O access cannot be validated against per-owner state, since they are only associated with the client holding the layout. It breaks the semantics of tying a stateid used for I/O to an open instance. This has the implication that clients must handle per-owner lock and open requests internally, rather than push the work onto the storage devices. The storage devices can still arbitrate and enforce inter-client lock and open state.

The second approach is a hybrid approach. This approach allows for stateids to be embedded within the layout, but also allows for the possibility of global stateids. If the stateid embedded within the layout is a special stateid of all zeros, then the stateid referring to the last successful OPEN/LOCK should be used. This approach is recommended if it is decided that using NFSv4 as a control protocol is required.

This proposal suggests the global stateid approach, due to the cleaner semantics it provides regarding the relationship between stateids used for I/O and their corresponding open instance or lock state. However, it does have a profound impact on the control protocol's implementation and the state propagation that is required (as described in Section 9.4).

10. pNFS Typed Data Structures

10.1 pnfs_layouttype4

   enum pnfs_layouttype4 {
           LAYOUT_NFSV4_FILES  = 1,
           LAYOUT_OSD2_OBJECTS = 2,
           LAYOUT_BLOCK_VOLUME = 3
   };

A layout type specifies the layout being used. The implication is that clients have "layout drivers" that support one or more layout types.
The file server advertises the layout types it supports through the LAYOUT_TYPES file system attribute. A client asks for layouts of a particular type in LAYOUTGET and passes those layouts to its layout driver. The set of well-known layout types must be defined. As well, a private range of layout types is to be defined by this document; this would allow custom installations to introduce new layout types.

[OPEN ISSUE: Determine private range of layout types]

New layout types must be specified in RFCs approved by the IESG before becoming part of the pNFS specification.

The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration specifies that the object layout, as defined in [7], is to be used. Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the block/volume layout, as defined in [6], is to be used.

10.2 pnfs_deviceid4

   typedef uint32_t pnfs_deviceid4;    /* 32-bit device ID */

Layout information includes device IDs that specify a storage device through a compact handle. Addressing and type information is obtained with the GETDEVICEINFO operation. A client must not assume that device IDs are valid across metadata server reboots. The device ID is qualified by the layout type and is unique per file system (FSID). This allows different layout drivers to generate device IDs without the need for coordination. See Section 7.1.4 for more details.

10.3 pnfs_deviceaddr4

   struct pnfs_netaddr4 {
           string   r_netid<>;   /* network ID */
           string   r_addr<>;    /* universal address */
   };

   struct pnfs_deviceaddr4 {
           pnfs_layouttype4   type;
           opaque             device_addr<>;
   };

The device address is used to set up a communication channel with the storage device. Different layout types will require different types of structures to define how they communicate with storage devices. The opaque device_addr field must be interpreted based on the specified layout type.

Currently, the only defined device address is that for the NFSv4 file layout (struct pnfs_netaddr4), which identifies a storage device by network IP address and port number. This is sufficient for the clients to communicate with the NFSv4 storage devices, and may also be sufficient for object-based storage drivers to communicate with OSDs. The other device address we expect to support is a SCSI volume identifier. The final protocol specification will detail the allowed device address types and the format of their associated location information.

[NOTE: other device addresses will be added as the respective specifications mature. It has been suggested that a separate device_type enumeration be used as a switch to the pnfs_deviceaddr4 structure (e.g., if multiple types of addresses exist for the same layout type). Until such a time as a real case is made and the respective layout types have matured, the device address structure will be left as is.]

10.4 pnfs_devlist_item4

   struct pnfs_devlist_item4 {
           pnfs_deviceid4     id;
           pnfs_deviceaddr4   addr;
   };

An array of these values is returned by the GETDEVICELIST operation. They define the set of devices associated with a file system.
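As an illustration of how a client might organize this information, the following hypothetical cache keys each device by FSID, layout type, and device ID, reflecting the uniqueness rules stated in Section 10.2; none of these structures are part of the protocol.

   /* Hypothetical client-side device cache.  Keys follow the text:
    * a device ID is unique per FSID and qualified by layout type,
    * and is not guaranteed valid across metadata server reboots. */
   #include <stddef.h>
   #include <stdint.h>

   typedef uint32_t pnfs_deviceid4;

   struct device_entry {
       uint64_t        fsid_major, fsid_minor;  /* owning file system */
       uint32_t        layout_type;             /* pnfs_layouttype4 */
       pnfs_deviceid4  dev_id;
       void           *addr;   /* decoded pnfs_deviceaddr4 (e.g.,
                                  pnfs_netaddr4 for file layouts) */
   };

   /* Linear lookup over entries from GETDEVICELIST/GETDEVICEINFO. */
   struct device_entry *find_device(struct device_entry *tbl, size_t n,
                                    uint64_t fsid_major,
                                    uint64_t fsid_minor,
                                    uint32_t layout_type,
                                    pnfs_deviceid4 id)
   {
       for (size_t i = 0; i < n; i++) {
           if (tbl[i].fsid_major == fsid_major &&
               tbl[i].fsid_minor == fsid_minor &&
               tbl[i].layout_type == layout_type &&
               tbl[i].dev_id == id)
               return &tbl[i];
       }
       return NULL;   /* unknown: fetch via GETDEVICEINFO; a client
                         would flush the cache on server reboot */
   }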
10.5 pnfs_layout4

   struct pnfs_layout4 {
           offset4              offset;
           length4              length;
           pnfs_layoutiomode4   iomode;
           pnfs_layouttype4     type;
           opaque               layout<>;
   };

The pnfs_layout4 structure defines a layout for a file. The layout type specific data is opaque within this structure and must be interpreted based on the layout type. Currently, only the NFSv4 file layout type is defined; see Section 9.1 for its definition. Since layouts are sub-dividable, the offset and length, together with the file's filehandle, the clientid, the iomode, and the layout type, identify the layout.

[OPEN ISSUE: there is a discussion of moving the striping information, or more generally the "aggregation scheme", up to the generic layout level. This creates a two-layer system where the top level is a switch on different data placement layouts, and the next level down is a switch on different data storage types. This lets different layouts (e.g., striping, mirroring, or redundant servers) be layered over different storage devices. This would move geometry information out of nfsv4_file_layouttype4 and up into a generic pnfs_striped_layout type that would specify a set of pnfs_deviceid4 and pnfs_devicetype4 to use for storage. Instead of nfsv4_file_layouttype4, there would be pnfs_nfsv4_devicetype4.]

10.6 pnfs_layoutupdate4

   struct pnfs_layoutupdate4 {
           pnfs_layouttype4   type;
           opaque             layoutupdate_data<>;
   };

The pnfs_layoutupdate4 structure is used by the client to return 'updated' layout information to the metadata server at LAYOUTCOMMIT time. This structure provides a channel to pass layout type specific information back to the metadata server. For example, for block/volume layout types this could include the list of reserved blocks that were written. The contents of the opaque layoutupdate_data argument are determined by the layout type and are defined in their context. The NFSv4 file-based layout does not use this structure, so the layoutupdate_data field should have a zero length.

10.7 pnfs_layouthint4

   struct pnfs_layouthint4 {
           pnfs_layouttype4   type;
           opaque             layouthint_data<>;
   };

The pnfs_layouthint4 structure is used by the client to pass in a hint about the type of layout it would like created for a particular file. It is the structure specified by the FILE_LAYOUT_HINT attribute described below. The metadata server may ignore the hint, or may selectively ignore fields within the hint. This hint should be provided at create time, as part of the initial attributes within OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint" structure as defined in Section 9.1.

10.8 pnfs_layoutiomode4

   enum pnfs_layoutiomode4 {
           LAYOUTIOMODE_READ = 1,
           LAYOUTIOMODE_RW   = 2,
           LAYOUTIOMODE_ANY  = 3
   };

The iomode specifies whether the client intends to read or write (with the possibility of reading) the data represented by the layout. The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies that layouts pertaining to both READ and RW iomodes are being returned or recalled, respectively. The metadata server's use of the iomode may depend on the layout type being used. The storage devices may validate I/O accesses against the iomode and reject invalid accesses.
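The iomode rules above can be captured in a small validity check; the following is a sketch, with the layout_op enumeration invented here purely for illustration.

   #include <stdbool.h>

   enum pnfs_layoutiomode4 {
       LAYOUTIOMODE_READ = 1,
       LAYOUTIOMODE_RW   = 2,
       LAYOUTIOMODE_ANY  = 3
   };

   /* Hypothetical tag for the operation being validated. */
   enum layout_op { OP_LAYOUTGET, OP_LAYOUTRETURN, OP_LAYOUTRECALL };

   /* ANY is only meaningful when returning or recalling layouts: it
    * selects layouts of both READ and RW iomodes.  It MUST NOT be
    * used when requesting a layout with LAYOUTGET. */
   bool iomode_valid(enum layout_op op, enum pnfs_layoutiomode4 iomode)
   {
       if (iomode == LAYOUTIOMODE_ANY)
           return op == OP_LAYOUTRETURN || op == OP_LAYOUTRECALL;
       return iomode == LAYOUTIOMODE_READ || iomode == LAYOUTIOMODE_RW;
   }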
11. pNFS File Attributes

11.1 pnfs_layouttype4<> FS_LAYOUT_TYPES

This attribute applies to a file system and indicates what layout types are supported by the file system. We expect this attribute to be queried when a client encounters a new fsid. This attribute is used by the client to determine if it has applicable layout drivers.

11.2 pnfs_layouttype4<> FILE_LAYOUT_TYPES

This attribute indicates the particular layout type(s) used for a file. This is for informational purposes only. The client needs to use the LAYOUTGET operation in order to get enough information (e.g., specific device information) to perform I/O.

11.3 pnfs_layouthint4 FILE_LAYOUT_HINT

This attribute may be set on newly created files to influence the metadata server's choice for the file's layout. It is suggested that this attribute be set as one of the initial attributes within the OPEN call. The metadata server may ignore this attribute. This attribute is a subset of the layout structure returned by LAYOUTGET. For example, instead of specifying particular devices, this would be used to suggest the stripe width of a file. It is up to the server implementation to determine which fields within the layout it uses.

[OPEN ISSUE: it has been suggested that the HINT be a well-defined type other than pnfs_layoutdata4, similar to pnfs_layoutupdate4.]

11.4 uint32_t FS_LAYOUT_PREFERRED_BLOCKSIZE

This attribute is a file system wide attribute and indicates the preferred block size for direct storage device access.

11.5 uint32_t FS_LAYOUT_PREFERRED_ALIGNMENT

This attribute is a file system wide attribute and indicates the preferred alignment for direct storage device access.

12. pNFS Error Definitions

NFS4ERR_BADLAYOUT           Layout specified is invalid.

NFS4ERR_BADIOMODE           Layout iomode is invalid.

NFS4ERR_LAYOUTUNAVAILABLE   Layouts are not available for the file or
                            its containing file system.

NFS4ERR_LAYOUTTRYLATER      Layouts are temporarily unavailable for
                            the file; the client should retry later.

NFS4ERR_NOMATCHING_LAYOUT   The client has no matching layout
                            (segment) to return.

NFS4ERR_RECALLCONFLICT      The layout is unavailable due to a
                            conflicting LAYOUTRECALL that is in
                            progress.

NFS4ERR_UNKNOWN_LAYOUTTYPE  Layout type is unknown.

13. Layouts and Aggregation

This section describes several aggregation schemes in a semi-formal way to provide context for layout formats. These definitions will be formalized in other protocols. However, the set of understood types is part of this protocol in order to provide for basic interoperability.

The layout descriptions include (deviceID, objectID) tuples that identify some storage object on some storage device. The addressing information associated with the deviceID is obtained with GETDEVICEINFO. The interpretation of the objectID depends on the storage protocol. The objectID could be a filehandle for an NFSv4 storage device. It could be an OSD object ID for an object server. The layout for a block device generally includes additional block map information to enumerate blocks or extents that are part of the layout.

13.1 Simple Map

The data is located on a single storage device. In this case the file server can act as the front end for several storage devices and distribute files among them.
Each file is limited in its size and performance characteristics by a single storage device. The simple map consists of (deviceID, objectID).

13.2 Block Extent Map

The data is located on a LUN in the SAN. The layout consists of an array of (deviceID, blockID, offset, length) tuples. Each entry describes a block extent.

13.3 Striped Map (RAID 0)

The data is striped across storage devices. The parameters of the stripe include the number of storage devices (N) and the size of each stripe unit (U). A full stripe of data is N * U bytes. The stripe map consists of an ordered list of (deviceID, objectID) tuples and the parameter value for U. The first stripe unit (the first U bytes) is stored on the first (deviceID, objectID), the second stripe unit on the second (deviceID, objectID), and so forth until the first complete stripe. The data layout then wraps around, so that byte (N*U) of the file is stored on the first (deviceID, objectID) in the list, but starting at offset U within that object. The striped layout allows a client to read or write to the component objects in parallel to achieve high bandwidth.

The striped map for a block device would be slightly different. The map is an ordered list of (deviceID, blockID, blocksize), where the deviceID is rotated among a set of devices to achieve striping.

13.4 Replicated Map

The file data is replicated on N storage devices. The map consists of N (deviceID, objectID) tuples. When data is written using this map, it should be written to N objects in parallel. When data is read, any component object can be used.

This map type is controversial because it highlights the issues with error recovery. Those issues get interesting with any scheme that employs redundancy. The handling of errors (e.g., only a subset of replicas gets updated) is outside the scope of this protocol extension. Instead, it is a function of the storage protocol and the metadata control protocol.

13.5 Concatenated Map

The map consists of an ordered set of N (deviceID, objectID, size) tuples. Each successive tuple describes the next segment of the file.

13.6 Nested Map

The nested map is used to compose more complex maps out of simpler ones. The map format is an ordered set of M sub-maps; each sub-map applies to a byte range within the file and has its own type, such as the ones introduced above. Any level of nesting is allowed in order to build up complex aggregation schemes.

14. NFSv4.1 Operations

14.1 LOOKUPP - Lookup Parent Directory

If the NFSv4 minor version is 1, then the following replaces Section 14.2.14 of the NFSv4.0 specification. The LOOKUPP operation's "over the wire" format is not altered, but the semantics are slightly modified to account for the addition of SECINFO_NO_NAME.

SYNOPSIS

   (cfh) -> (cfh)

ARGUMENT

   /* CURRENT_FH: object */
   void;

RESULT

   struct LOOKUPP4res {
           /* CURRENT_FH: directory */
           nfsstat4        status;
   };

DESCRIPTION

   The current filehandle is assumed to refer to a regular directory
   or a named attribute directory. LOOKUPP assigns the filehandle
   for its parent directory to be the current filehandle. If there
   is no parent directory, an NFS4ERR_NOENT error must be returned.
13.4 Replicated Map

The file data is replicated on N storage devices. The map consists of N (deviceID, objectID) tuples. When data is written using this map, it should be written to N objects in parallel. When data is read, any component object can be used.

This map type is controversial because it highlights the issues with error recovery. Those issues get interesting with any scheme that employs redundancy. The handling of errors (e.g., only a subset of replicas get updated) is outside the scope of this protocol extension. Instead, it is a function of the storage protocol and the metadata control protocol.

13.5 Concatenated Map

The map consists of an ordered set of N (deviceID, objectID, size) tuples. Each successive tuple describes the next segment of the file.

13.6 Nested Map

The nested map is used to compose more complex maps out of simpler ones. The map format is an ordered set of M sub-maps; each sub-map applies to a byte range within the file and has its own type, such as the ones introduced above. Any level of nesting is allowed in order to build up complex aggregation schemes.

14. NFSv4.1 Operations

14.1 LOOKUPP - Lookup Parent Directory

If the NFSv4 minor version is 1, then the following replaces section 14.2.14 of the NFSv4.0 specification. The LOOKUPP operation's "over the wire" format is not altered, but the semantics are slightly modified to account for the addition of SECINFO_NO_NAME.

SYNOPSIS

(cfh) -> (cfh)

ARGUMENT

/* CURRENT_FH: object */
void;

RESULT

struct LOOKUPP4res {
        /* CURRENT_FH: directory */
        nfsstat4        status;
};

DESCRIPTION

The current filehandle is assumed to refer to a regular directory or a named attribute directory. LOOKUPP assigns the filehandle for its parent directory to be the current filehandle. If there is no parent directory, an NFS4ERR_NOENT error must be returned. Therefore, NFS4ERR_NOENT will be returned by the server when the current filehandle is at the root or top of the server's file tree.

As for LOOKUP, LOOKUPP will also cross mountpoints.

If the current filehandle is not a directory or named attribute directory, the error NFS4ERR_NOTDIR is returned.

If the requester's security flavor does not match that configured for the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC (a future minor revision of NFSv4 may upgrade this to MUST) in the LOOKUPP response. However, if the server does so, it MUST support the new SECINFO_NO_NAME operation, so that the client can gracefully determine the correct security flavor. See the discussion of the SECINFO_NO_NAME operation for a description.

ERRORS

NFS4ERR_ACCESS NFS4ERR_BADHANDLE NFS4ERR_FHEXPIRED NFS4ERR_IO NFS4ERR_MOVED NFS4ERR_NOENT NFS4ERR_NOFILEHANDLE NFS4ERR_NOTDIR NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT NFS4ERR_STALE NFS4ERR_WRONGSEC

14.2 SECINFO -- Obtain Available Security

If the NFSv4 minor version is 1, then the following replaces section 14.2.31 of the NFSv4.0 specification. The SECINFO operation's "over the wire" format is not altered, but the semantics are slightly modified to account for the addition of SECINFO_NO_NAME.

SYNOPSIS

(cfh), name -> { secinfo }

ARGUMENT

struct SECINFO4args {
        /* CURRENT_FH: directory */
        component4      name;
};

RESULT

enum rpc_gss_svc_t { /* From RFC 2203 */
        RPC_GSS_SVC_NONE        = 1,
        RPC_GSS_SVC_INTEGRITY   = 2,
        RPC_GSS_SVC_PRIVACY     = 3
};

struct rpcsec_gss_info {
        sec_oid4        oid;
        qop4            qop;
        rpc_gss_svc_t   service;
};

union secinfo4 switch (uint32_t flavor) {
case RPCSEC_GSS:
        rpcsec_gss_info flavor_info;
default:
        void;
};

typedef secinfo4 SECINFO4resok<>;

union SECINFO4res switch (nfsstat4 status) {
case NFS4_OK:
        SECINFO4resok resok4;
default:
        void;
};

DESCRIPTION

The SECINFO operation is used by the client to obtain a list of valid RPC authentication flavors for a specific directory filehandle, file name pair. SECINFO should apply the same access methodology used for LOOKUP when evaluating the name. Therefore, if the requester does not have the appropriate access to LOOKUP the name, then SECINFO must behave the same way and return NFS4ERR_ACCESS.

The result will contain an array which represents the security mechanisms available, with an order corresponding to the server's preferences, the most preferred being first in the array. The client is free to pick whatever security mechanism it both desires and supports, or to pick, in the server's preference order, the first one it supports. The array entries are represented by the secinfo4 structure. The field 'flavor' will contain a value of AUTH_NONE, AUTH_SYS (as defined in [RFC1831]), or RPCSEC_GSS (as defined in [RFC2203]). The flavor field can also be any other security flavor registered with IANA.

For the flavors AUTH_NONE and AUTH_SYS, no additional security information is returned. The same is true of many (if not most) other security flavors, including AUTH_DH.
For a return value of RPCSEC_GSS, a security triple is returned that contains the mechanism object id (as defined in [RFC2743]), the quality of protection (as defined in [RFC2743]), and the service type (as defined in [RFC2203]). It is possible for SECINFO to return multiple entries with flavor equal to RPCSEC_GSS with different security triple values.

On success, the current filehandle retains its value.

If the name has a length of 0 (zero), or if name does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.

IMPLEMENTATION

The SECINFO operation is expected to be used by the NFS client when the error value of NFS4ERR_WRONGSEC is returned from another NFS operation. This signifies to the client that the server's security policy is different from what the client is currently using. At this point, the client is expected to obtain a list of possible security flavors and choose what best suits its policies.

As mentioned, the server's security policies will determine when a client request receives NFS4ERR_WRONGSEC. The operations which may receive this error are: LINK, LOOKUP, LOOKUPP, OPEN, PUTFH, PUTPUBFH, PUTROOTFH, RESTOREFH, RENAME, and indirectly READDIR. LINK and RENAME will only receive this error if the security used for the operation is inappropriate for the saved filehandle. With the exception of READDIR, these operations represent the point at which the client can instantiate a filehandle into the "current filehandle" at the server. The filehandle is either provided by the client (PUTFH, PUTPUBFH, PUTROOTFH) or generated as a result of a name to filehandle translation (LOOKUP and OPEN). RESTOREFH is different because the filehandle is a result of a previous SAVEFH. Even though the filehandle, for RESTOREFH, might have previously passed the server's inspection for a security match, the server will check it again on RESTOREFH to ensure that the security policy has not changed.

If the client wants to resolve an error return of NFS4ERR_WRONGSEC, the following will occur:

* For LOOKUP and OPEN, the client will use SECINFO with the same current filehandle and name as provided in the original LOOKUP or OPEN to enumerate the available security triples.

* For LINK, PUTFH, PUTROOTFH, PUTPUBFH, RENAME, and RESTOREFH, the client will use SECINFO_NO_NAME { style = current_fh }. The client will prefix the SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or PUTROOTFH operation that provides the filehandle originally provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH, or, for the failed LINK or RENAME, the SAVEFH.

  NOTE: In NFSv4.0, the client was required to use SECINFO, and had to reconstruct the parent of the original filehandle and the component name of the original filehandle.

* For LOOKUPP, the client will use SECINFO_NO_NAME { style = parent } and provide the filehandle which equals the filehandle originally provided to LOOKUPP.

The READDIR operation will not directly return the NFS4ERR_WRONGSEC error. However, if the READDIR request included a request for attributes, it is possible that the READDIR request's security triple did not match that of a directory entry. If this is the case and the client has requested the rdattr_error attribute, the server will return the NFS4ERR_WRONGSEC error in rdattr_error for the entry.
See the section "Security Considerations" for a discussion of the recommendations for the security flavor used by SECINFO and SECINFO_NO_NAME.

ERRORS

14.3 SECINFO_NO_NAME - Get Security on Unnamed Object

Obtain available security mechanisms using either the parent of an object or the current filehandle.

SYNOPSIS

(cfh), secinfo_style -> { secinfo }

ARGUMENT

enum secinfo_style_4 {
        current_fh = 0,
        parent = 1
};

typedef secinfo_style_4 SECINFO_NO_NAME4args;

RESULT

typedef SECINFO4res SECINFO_NO_NAME4res;

DESCRIPTION

Like the SECINFO operation, SECINFO_NO_NAME is used by the client to obtain a list of valid RPC authentication flavors for a specific file object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that are accessed by filehandle.

There are two styles of SECINFO_NO_NAME, as determined by the value of the secinfo_style_4 enumeration. If "current_fh" is passed, then SECINFO_NO_NAME is querying for the required security for the current filehandle. If "parent" is passed, then SECINFO_NO_NAME is querying for the required security of the current filehandle's parent. If the style selected is "parent", then SECINFO should apply the same access methodology used for LOOKUPP when evaluating the traversal to the parent directory. Therefore, if the requester does not have the appropriate access to LOOKUPP the parent, then SECINFO_NO_NAME must behave the same way and return NFS4ERR_ACCESS.

Note that if PUTFH, PUTPUBFH, or PUTROOTFH return NFS4ERR_WRONGSEC, this is tantamount to the server asserting that the client will have to guess what the required security is, because there is no way to query. Therefore, the client must iterate through the security triples available at the client and reattempt the PUTFH, PUTROOTFH, or PUTPUBFH operation. In the unfortunate event none of the MANDATORY security triples are supported by the client and server, the client SHOULD try using others that support integrity. Failing that, the client can try using other forms (e.g., AUTH_SYS and AUTH_NONE), but because such forms lack integrity checks, this puts the client at risk.

The server implementor should pay particular attention to the section "Clarification of Security Negotiation in NFSv4.1" for implementation suggestions for avoiding NFS4ERR_WRONGSEC error returns from PUTFH, PUTROOTFH, or PUTPUBFH.

Everything else about SECINFO_NO_NAME is the same as SECINFO. See the previous discussion on SECINFO.

IMPLEMENTATION

See the previous discussion on SECINFO.
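Taken together, the recovery bullets in the SECINFO discussion and the two styles above reduce to a small decision procedure. The following C sketch is illustrative only: the enum, the stub functions, and all names are hypothetical stand-ins for a real client's COMPOUND machinery, and it omits the triple-iteration fallback needed when PUTFH-class operations themselves return NFS4ERR_WRONGSEC (where, as noted above, no query is possible).

   #include <stdio.h>

   /* Hypothetical classification of the operation that failed. */
   enum failed_op {
       FAILED_LOOKUP_OR_OPEN,  /* LOOKUP, OPEN */
       FAILED_LOOKUPP,         /* LOOKUPP */
       FAILED_PUTFH_CLASS      /* LINK, PUTFH, PUTROOTFH, PUTPUBFH,
                                  RENAME, RESTOREFH */
   };

   enum secinfo_style { STYLE_CURRENT_FH = 0, STYLE_PARENT = 1 };

   /* Stubs standing in for the client's COMPOUND machinery. */
   static void send_secinfo(const char *name)
   {
       printf("SECINFO on (cfh, \"%s\")\n", name);
   }

   static void send_secinfo_no_name(enum secinfo_style style)
   {
       printf("SECINFO_NO_NAME { style = %s }\n",
              style == STYLE_PARENT ? "parent" : "current_fh");
   }

   static void recover_wrongsec(enum failed_op op, const char *name)
   {
       switch (op) {
       case FAILED_LOOKUP_OR_OPEN:
           /* Reuse the same current filehandle and name. */
           send_secinfo(name);
           break;
       case FAILED_LOOKUPP:
           /* Query the parent of the original filehandle. */
           send_secinfo_no_name(STYLE_PARENT);
           break;
       case FAILED_PUTFH_CLASS:
           /* Re-establish the filehandle, then query it directly. */
           send_secinfo_no_name(STYLE_CURRENT_FH);
           break;
       }
   }

   int main(void)
   {
       recover_wrongsec(FAILED_LOOKUP_OR_OPEN, "projects");
       recover_wrongsec(FAILED_PUTFH_CLASS, "");
       return 0;
   }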
ERRORS

NFS4ERR_ACCESS NFS4ERR_BADCHAR NFS4ERR_BADHANDLE NFS4ERR_BADNAME NFS4ERR_BADXDR NFS4ERR_FHEXPIRED NFS4ERR_INVAL NFS4ERR_MOVED NFS4ERR_NAMETOOLONG NFS4ERR_NOENT NFS4ERR_NOFILEHANDLE NFS4ERR_NOTDIR NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT NFS4ERR_STALE

14.4 CREATECLIENTID - Instantiate Clientid

Create a clientid.

SYNOPSIS

client -> clientid

ARGUMENT

struct CREATECLIENTID4args {
        nfs_client_id4  clientdesc;
};

RESULT

struct CREATECLIENTID4resok {
        clientid4       clientid;
        verifier4       clientid_confirm;
};

union CREATECLIENTID4res switch (nfsstat4 status) {
case NFS4_OK:
        CREATECLIENTID4resok resok4;
case NFS4ERR_CLID_INUSE:
        void;
default:
        void;
};

DESCRIPTION

The client uses the CREATECLIENTID operation to register a particular client identifier with the server. The clientid returned from this operation will be necessary for requests that create state on the server and will serve as a parent object to sessions created by the client. In order to verify the clientid, it must first be used as an argument to CREATESESSION.

IMPLEMENTATION

A server's client record is a 5-tuple:

1. clientdesc.id:

   The long form client identifier, sent via the clientdesc.id subfield of the CREATECLIENTID4args structure

2. clientdesc.verifier:

   A client-specific value used to indicate reboots, sent via the clientdesc.verifier subfield of the CREATECLIENTID4args structure

3. principal:

   The RPCSEC_GSS principal sent via the RPC headers

4. clientid:

   The shorthand client identifier, generated by the server and returned via the clientid field in the CREATECLIENTID4resok structure

5. confirmed:

   A private field on the server indicating whether or not a client record has been confirmed. A client record is confirmed if there has been a successful CREATESESSION operation to confirm it. Otherwise it is unconfirmed. An unconfirmed record is established by a CREATECLIENTID call. Any unconfirmed record that is not confirmed within a lease period may be removed.
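Expressed as a data structure, such a record might look as follows. This is a non-normative C sketch; the field sizes and types are implementation choices, not protocol definitions.

   #include <stdbool.h>
   #include <stdint.h>

   #define NFS4_VERIFIER_SIZE 8

   /*
    * Illustrative server-side client record (the 5-tuple above).
    * Field names mirror the prose; the representations of id and
    * principal are implementation choices.
    */
   struct client_record {
       char     *id;                            /* clientdesc.id: long form identifier */
       uint8_t   verifier[NFS4_VERIFIER_SIZE];  /* clientdesc.verifier: reboot detector */
       char     *principal;                     /* RPCSEC_GSS principal from RPC headers */
       uint64_t  clientid;                      /* shorthand id generated by the server */
       bool      confirmed;                     /* TRUE once a CREATESESSION succeeds */
   };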
The following identifiers represent special values for the fields in the records.

id_arg:

   The value of the clientdesc.id subfield of the CREATECLIENTID4args structure of the current request.

verifier_arg:

   The value of the clientdesc.verifier subfield of the CREATECLIENTID4args structure of the current request.

old_verifier_arg:

   A value of the clientdesc.verifier field of a client record received in a previous request; this is distinct from verifier_arg.

principal_arg:

   The value of the RPCSEC_GSS principal for the current request.

old_principal_arg:

   A value of the RPCSEC_GSS principal received for a previous request. This is distinct from principal_arg.

clientid_ret:

   The value of the clientid field the server will return in the CREATECLIENTID4resok structure for the current request.

old_clientid_ret:

   The value of the clientid field the server returned in the CREATECLIENTID4resok structure for a previous request. This is distinct from clientid_ret.

Since CREATECLIENTID is a non-idempotent operation, we must consider the possibility that replays may occur as a result of a client reboot, network partition, malfunctioning router, etc. Replays are identified by the value of the clientdesc field of CREATECLIENTID4args, and the method for dealing with them is outlined in the scenarios below.

The scenarios are described in terms of which client records, whose clientdesc.id subfield equals id_arg, exist in the server's set of client records. Any case in which there is more than one record with identical values for id_arg represents a server implementation error. Operation in the potentially valid cases is summarized as follows.

1. Common case

   If no client record with clientdesc.id matching id_arg exists, a new shorthand client identifier clientid_ret is generated, and the following unconfirmed record is added to the server's state.

   { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE }

   Subsequently, the server returns clientid_ret.

2. Router Replay

   If the server has the following confirmed record, then this request is likely the result of a replayed request due to a faulty router or lost connection.

   { id_arg, verifier_arg, principal_arg, clientid_ret, TRUE }

   Since the record has been confirmed, the client must have received the server's reply from the initial CREATECLIENTID request. Since this is simply a spurious request, there is no modification to the server's state, and the server makes no reply to the client.

3. Client Collision

   If the server has the following confirmed record, then this request is likely the result of a chance collision between the values of the clientdesc.id subfield of CREATECLIENTID4args for two different clients.

   { id_arg, *, old_principal_arg, clientid_ret, TRUE }

   Since the value of the clientdesc.id subfield of each client record must be unique, there is no modification of the server's state, and NFS4ERR_CLID_INUSE is returned to indicate that the client should retry with a different value for the clientdesc.id subfield of CREATECLIENTID4args.

   This scenario may also represent a malicious attempt to destroy a client's state on the server. For security reasons, the server MUST NOT remove the client's state when there is a principal mismatch.

4. Replay

   If the server has the following unconfirmed record, then this request is likely the result of a client replay due to a network partition or some other connection failure.

   { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE }

   Since the response to the CREATECLIENTID request that created this record may have been lost, it is not acceptable to drop this duplicate request. However, rather than processing it normally, the existing record is left unchanged and clientid_ret, which was generated for the previous request, is returned.

5. Change of Principal

   If the server has the following unconfirmed record, then this request is likely the result of a client which has for whatever reason changed principals (possibly to change security flavor) after calling CREATECLIENTID, but before calling CREATESESSION.

   { id_arg, verifier_arg, old_principal_arg, clientid_ret, FALSE }

   Since the client has not changed, the principal field of the unconfirmed record is updated to principal_arg and clientid_ret is again returned.
   There is a small possibility that this is merely a collision on the clientdesc.id field of CREATECLIENTID4args between unrelated clients, but since that is unlikely, and an unconfirmed record does not generally have any filesystem pertinent state, we can assume it is the same client without risking the loss of any important state.

   After processing, the following record will exist on the server.

   { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE }

6. Client Reboot

   If the server has the following confirmed client record, then this request is likely from a previously confirmed client which has rebooted.

   { id_arg, old_verifier_arg, principal_arg, clientid_ret, TRUE }

   Since the previous incarnation of the same client will no longer be making requests, lock and share reservations should be released immediately rather than forcing the new incarnation to wait for the lease time on the previous incarnation to expire. Furthermore, session state should be removed, since if the client had maintained that information across reboot, this request would not have been issued. If the server does not support the CLAIM_DELEGATE_PREV claim type, associated delegations should be purged as well; otherwise, delegations are retained and recovery proceeds according to RFC3530. The client record is updated with the new verifier and its status is changed to unconfirmed.

   After processing, clientid_ret is returned to the client and the following record will exist on the server.

   { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE }

7. Reboot before confirmation

   If the server has the following unconfirmed record, then this request is likely from a client which rebooted before sending a CREATESESSION request.

   { id_arg, old_verifier_arg, *, clientid_ret, FALSE }

   Since this is believed to be a request from a new incarnation of the original client, the server updates the value of clientdesc.verifier and returns the original clientid_ret. After processing, the following state exists on the server.

   { id_arg, verifier_arg, *, clientid_ret, FALSE }
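Viewed procedurally, the seven scenarios amount to a lookup on clientdesc.id followed by a comparison of the verifier, principal, and confirmed fields. The following condensed C sketch is non-normative; the record type and all names are invented for illustration.

   #include <stdbool.h>
   #include <stdio.h>
   #include <string.h>

   /* Actions corresponding to the numbered scenarios above. */
   enum cc_action {
       CC_NEW_RECORD,        /* 1: add unconfirmed record, fresh clientid     */
       CC_IGNORE,            /* 2: confirmed replay (faulty router); no reply */
       CC_CLID_INUSE,        /* 3: collision with another client's id         */
       CC_RETURN_EXISTING,   /* 4: unconfirmed replay; return old clientid    */
       CC_UPDATE_PRINCIPAL,  /* 5: principal changed before confirmation      */
       CC_CLIENT_REBOOT,     /* 6: confirmed record, new verifier; purge      */
       CC_UPDATE_VERIFIER    /* 7: reboot before confirmation                 */
   };

   /* rec is the record whose clientdesc.id equals id_arg, or NULL. */
   struct rec { const char *verifier, *principal; bool confirmed; };

   static enum cc_action classify(const struct rec *rec,
                                  const char *verifier_arg,
                                  const char *principal_arg)
   {
       if (rec == NULL)
           return CC_NEW_RECORD;

       bool same_verifier  = strcmp(rec->verifier,  verifier_arg)  == 0;
       bool same_principal = strcmp(rec->principal, principal_arg) == 0;

       if (rec->confirmed) {
           if (same_verifier && same_principal) return CC_IGNORE;
           if (!same_principal)                 return CC_CLID_INUSE;
           return CC_CLIENT_REBOOT;             /* old verifier, same principal */
       }
       if (same_verifier)
           return same_principal ? CC_RETURN_EXISTING : CC_UPDATE_PRINCIPAL;
       return CC_UPDATE_VERIFIER;               /* old verifier, any principal */
   }

   int main(void)
   {
       struct rec confirmed = { "boot-1", "alice@realm", true };
       /* Prints 5 (CC_CLIENT_REBOOT): scenario 6, client reboot. */
       printf("%d\n", classify(&confirmed, "boot-2", "alice@realm"));
       return 0;
   }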
ERRORS

NFS4ERR_BADXDR NFS4ERR_CLID_INUSE NFS4ERR_INVAL NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT

14.5 CREATESESSION - Create New Session and Confirm Clientid

Start up a session and confirm the clientid.

SYNOPSIS

clientid, session_args -> sessionid, session_args

ARGUMENT

struct CREATESESSION4args {
        clientid4       clientid;
        bool            persist;
        count4          maxrequestsize;
        count4          maxresponsesize;
        count4          maxrequests;
        count4          headerpadsize;
        switch (bool clientid_confirm) {
        case TRUE:
                verifier4       setclientid_confirm;
        case FALSE:
                void;
        }
        switch (channelmode4 mode) {
        case DEFAULT:
                void;
        case STREAM:
                streamchannelattrs4 streamchanattrs;
        case RDMA:
                rdmachannelattrs4 rdmachanattrs;
        };
};

RESULT

typedef opaque sessionid4[16];

struct CREATESESSION4resok {
        sessionid4      sessionid;
        bool            persist;
        count4          maxrequestsize;
        count4          maxresponsesize;
        count4          maxrequests;
        count4          headerpadsize;
        switch (channelmode4 mode) {
        case DEFAULT:
                void;
        case STREAM:
                streamchannelattrs4 streamchanattrs;
        case RDMA:
                rdmachannelattrs4 rdmachanattrs;
        };
};

union CREATESESSION4res switch (nfsstat4 status) {
case NFS4_OK:
        CREATESESSION4resok resok4;
default:
        void;
};

DESCRIPTION

This operation is used by the client to create new session objects on the server. Additionally, the first session created with a new shorthand client identifier serves to confirm the creation of that client's state on the server. The server returns the parameter values for the new session.

IMPLEMENTATION

To describe the implementation, the same notation for client records introduced in the description of CREATECLIENTID is used, with the following addition.

clientid_arg: The value of the clientid field of the CREATESESSION4args structure of the current request.

Since CREATESESSION is a non-idempotent operation, we must consider the possibility that replays may occur as a result of a client reboot, network partition, malfunctioning router, etc. Replays are identified by the value of the clientid and sessionid fields of CREATESESSION4args, and the method for dealing with them is outlined in the scenarios below.

The processing of this operation is divided into two phases: clientid confirmation and session creation. In case the state for the provided clientid has not been verified, it is confirmed before the session is created. Otherwise the clientid confirmation phase is skipped and only the session creation phase occurs. Note that since only confirmed clients may create sessions, the clientid confirmation stage does not depend upon sessionid_arg.

CLIENTID CONFIRMATION

The operational cases are described in terms of which client records, whose clientid field equals clientid_arg, exist in the server's set of client records. Any case in which there is more than one record with identical values for clientid represents a server implementation error. Operation in the potentially valid cases is summarized as follows.

1. Common Case

   If the server has the following unconfirmed record, then this is the expected confirmation of an unconfirmed record.

   { *, *, principal_arg, clientid_arg, FALSE }

   The confirmed field of the record is set to TRUE and processing of the operation continues normally.
2. Stale Clientid

   If the server contains no records with clientid equal to clientid_arg, then most likely the client's state has been purged during a period of inactivity, possibly due to a loss of connectivity. NFS4ERR_STALE_CLIENTID is returned, and no changes are made to any client records on the server.

3. Principal Change or Collision

   If the server has the following record, then the client has changed principals after the previous CREATECLIENTID request, or there has been a chance collision between shorthand client identifiers.

   { *, *, old_principal_arg, clientid_arg, * }

   Neither of these cases is permissible. Processing stops and NFS4ERR_CLID_INUSE is returned to the client. No changes are made to any client records on the server.

SESSION CREATION

To determine whether this request is a replay, the server examines the sessionid argument provided by the client. If the sessionid matches the identifier of a previously created session, then this request must be interpreted as a replay. No new state is created, and a reply with the parameters of the existing session is returned to the client. If a session corresponding to the sessionid does not already exist, then the request is not a replay and is processed as follows.

NOTE: It is the responsibility of the client to generate appropriate values for sessionid. Since the ordering of messages sent on different transport connections is not guaranteed, immediately reusing the sessionid of a previously destroyed session may yield unpredictable results. Client implementations should avoid recently used sessionids to ensure correct behavior.

The server examines the persist, maxrequestsize, maxresponsesize, maxrequests, and headerpadsize arguments. For each argument, if the value is acceptable to the server, it is recommended that the server use the provided value to create the new session. If it is not acceptable, the server may use a different value, but it must return the value used to the client. These parameters have the following interpretation.

persist:

   True if the client desires server support for "reliable" semantics. For sessions in which only idempotent operations will be used (e.g., a read-only session), clients should set this value to false. If the server does not or cannot provide "reliable" semantics, this value must be set to false on return.

maxrequestsize:

   The maximum size of a COMPOUND request that will be sent by the client, including RPC headers.

maxresponsesize:

   The maximum size of a COMPOUND reply that the client will accept from the server, including RPC headers. The server must not increase the value of this parameter. If a client sends a COMPOUND request for which the size of the reply would exceed this value, the server will return NFS4ERR_RESOURCE.

maxrequests:

   The maximum number of concurrent COMPOUND requests that the client will issue on the session. Subsequent COMPOUND requests will each be assigned a slot identifier by the client in the range 0 to maxrequests - 1 inclusive. A slot id cannot be reused until the previous request on that slot has completed.

headerpadsize:

   The maximum amount of padding the client is willing to apply to ensure that write payloads are aligned on some boundary at the server.
   The server should reply with its preferred value, or zero if padding is not in use. The server may decrease this value but must not increase it.

The server creates the session by recording the parameter values used and, if the persist parameter is true and has been accepted by the server, allocating space for the duplicate request cache (DRC).

If the session state is created successfully, the server associates it with the session identifier provided by the client. This identifier must be unique among the client's active sessions, but there is no need for it to be globally unique. Finally, the server returns the negotiated values used to create the session to the client.
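The per-parameter rule above (use the client's value when acceptable, otherwise substitute a different value and report the value actually used, and never increase maxresponsesize or headerpadsize) can be sketched as follows. This is a non-normative illustration; the server limits are invented.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Non-normative sketch of CREATESESSION parameter negotiation. */
   struct session_params {
       bool     persist;
       uint32_t maxrequestsize, maxresponsesize, maxrequests, headerpadsize;
   };

   /* Invented server limits, for illustration only. */
   #define SRV_DRC_SUPPORTED true
   #define SRV_MAX_REQSIZE   1048576u
   #define SRV_MAX_RSPSIZE   1048576u
   #define SRV_MAX_REQUESTS  64u
   #define SRV_MAX_HDRPAD    512u

   static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

   static struct session_params negotiate(struct session_params req)
   {
       struct session_params res;
       /* persist must come back false if the server cannot honor it. */
       res.persist         = req.persist && SRV_DRC_SUPPORTED;
       res.maxrequestsize  = min_u32(req.maxrequestsize, SRV_MAX_REQSIZE);
       /* The server must not increase maxresponsesize. */
       res.maxresponsesize = min_u32(req.maxresponsesize, SRV_MAX_RSPSIZE);
       res.maxrequests     = min_u32(req.maxrequests, SRV_MAX_REQUESTS);
       /* Padding may be decreased but not increased; zero if unused. */
       res.headerpadsize   = min_u32(req.headerpadsize, SRV_MAX_HDRPAD);
       return res;
   }

   int main(void)
   {
       struct session_params req = { true, 4u << 20, 4u << 20, 128, 4096 };
       struct session_params got = negotiate(req);
       printf("maxrequests negotiated down to %u\n", got.maxrequests);
       return 0;
   }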
ERRORS

NFS4ERR_BADXDR NFS4ERR_CLID_INUSE NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT NFS4ERR_STALE_CLIENTID

14.6 BIND_BACKCHANNEL - Create a callback channel binding

Establish a callback channel on the connection.

SYNOPSIS

ARGUMENT

struct BIND_BACKCHANNEL4args {
        clientid4       clientid;
        uint32_t        callback_program;
        uint32_t        callback_ident;
        count4          maxrequestsize;
        count4          maxresponsesize;
        count4          maxrequests;
        switch (channelmode4 mode) {
        case DEFAULT:
                void;
        case STREAM:
                streamchannelattrs4 streamchanattrs;
        case RDMA:
                rdmachannelattrs4 rdmachanattrs;
        };
};

RESULT

struct BIND_BACKCHANNEL4resok {
        count4          maxrequestsize;
        count4          maxresponsesize;
        count4          maxrequests;
        switch (channelmode4 mode) {
        case DEFAULT:
                void;
        case STREAM:
                streamchannelattrs4 streamchanattrs;
        case RDMA:
                rdmachannelattrs4 rdmachanattrs;
        };
};

union BIND_BACKCHANNEL4res switch (nfsstat4 status) {
case NFS4_OK:
        BIND_BACKCHANNEL4resok resok4;
default:
        void;
};

DESCRIPTION

The BIND_BACKCHANNEL operation serves to establish the current connection as a designated callback channel for the specified session. Normally, only one callback channel is bound; however, if more than one is established, they are used at the server's prerogative, and no affinity or preference is specified by the client.

The arguments and results of the BIND_BACKCHANNEL call are a subset of the session parameters and are used identically to those values, applied to the callback channel only. However, not all session operation channel parameters are relevant to the callback channel, for example header padding (since writes of bulk data are not performed in callbacks).

IMPLEMENTATION

No discussion at this time.

ERRORS

TBD

14.7 DESTROYSESSION - Destroy existing session

Destroy an existing session.

SYNOPSIS

void -> status

ARGUMENT

struct DESTROYSESSION4args {
        sessionid4      sessionid;
};

RESULT

struct DESTROYSESSION4res {
        nfsstat4        status;
};

DESCRIPTION

The DESTROYSESSION operation closes the session and discards any active state such as locks, leases, and server duplicate request cache entries. Any remaining connections bound to the session are immediately unbound and may additionally be closed by the server.

This operation must be the final, or only, operation in any request. Because the operation results in destruction of the session, any duplicate request caching for this request, as well as for previously completed requests, will be lost. For this reason, it is advisable not to place this operation in a request with other state-modifying operations. In addition, a SEQUENCE operation is not required in the request.

Note that because the operation will never be replayed by the server, a client that retransmits the request may receive an error in response, even though the session may have been successfully destroyed.

IMPLEMENTATION

No discussion at this time.

ERRORS

TBD

14.8 SEQUENCE - Supply per-procedure sequencing and control

Supply per-procedure sequencing and control.

SYNOPSIS

control -> control

ARGUMENT

typedef uint32_t sequenceid4;
typedef uint32_t slotid4;

struct SEQUENCE4args {
        clientid4       clientid;
        sessionid4      sessionid;
        sequenceid4     sequenceid;
        slotid4         slotid;
        slotid4         maxslot;
};

RESULT

struct SEQUENCE4resok {
        clientid4       clientid;
        sessionid4      sessionid;
        sequenceid4     sequenceid;
        slotid4         slotid;
        slotid4         maxslot;
        slotid4         target_maxslot;
};

union SEQUENCE4res switch (nfsstat4 status) {
case NFS4_OK:
        SEQUENCE4resok resok4;
default:
        void;
};

DESCRIPTION

The SEQUENCE operation is used to manage operational accounting for the session on which the operation is sent. The contents include the client and session to which this request belongs; the slotid and sequenceid, used by the server to implement session request control and the duplicate reply cache semantics; and exchanged slot counts, which are used to adjust these values. This operation must appear once as the first operation in each COMPOUND sent after the channel is successfully bound, or a protocol error must result.

IMPLEMENTATION

No discussion at this time.

ERRORS

NFS4ERR_BADSESSION NFS4ERR_BADSLOT
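A client-side view of the slot rule described above (slot ids range from 0 to maxrequests - 1, and a slot may not be reused until the previous request on it completes) might be organized as in the following non-normative C sketch; all names are invented.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>

   /* Non-normative client slot table for SEQUENCE.  One sequenceid per
    * slot; a slot becomes reusable only after the reply to its previous
    * request has been processed. */
   struct slot {
       uint32_t sequenceid;   /* bumped for each new request on this slot */
       bool     in_use;
   };

   struct slot_table {
       struct slot *slots;
       uint32_t     maxrequests;
   };

   /* Acquire a free slot, bumping its sequenceid; -1 if all are busy. */
   static int slot_acquire(struct slot_table *t, uint32_t *seq_out)
   {
       for (uint32_t i = 0; i < t->maxrequests; i++) {
           if (!t->slots[i].in_use) {
               t->slots[i].in_use = true;
               t->slots[i].sequenceid++;
               *seq_out = t->slots[i].sequenceid;
               return (int)i;
           }
       }
       return -1;  /* caller must wait for an outstanding reply */
   }

   static void slot_release(struct slot_table *t, uint32_t slotid)
   {
       t->slots[slotid].in_use = false;
   }

   int main(void)
   {
       struct slot_table t = { calloc(8, sizeof(struct slot)), 8 };
       if (t.slots == NULL)
           return 1;
       uint32_t seq;
       int slotid = slot_acquire(&t, &seq);
       printf("slot %d, sequenceid %u\n", slotid, seq);  /* slot 0, sequenceid 1 */
       slot_release(&t, (uint32_t)slotid);
       free(t.slots);
       return 0;
   }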
14.9 CB_RECALLCREDIT - Change flow control limits

Change flow control limits.

SYNOPSIS

targetcount -> status

ARGUMENT

struct CB_RECALLCREDIT4args {
        sessionid4      sessionid;
        uint32_t        target;
};

RESULT

struct CB_RECALLCREDIT4res {
        nfsstat4        status;
};

DESCRIPTION

The CB_RECALLCREDIT operation requests that the client return session and transport credits to the server, by zero-length RDMA Sends or NULL NFSv4 operations.

IMPLEMENTATION

No discussion at this time.

ERRORS

NONE

14.10 CB_SEQUENCE - Supply callback channel sequencing and control

Sequence and control.

SYNOPSIS

control -> control

ARGUMENT

typedef uint32_t sequenceid4;
typedef uint32_t slotid4;

struct CB_SEQUENCE4args {
        clientid4       clientid;
        sessionid4      sessionid;
        sequenceid4     sequenceid;
        slotid4         slotid;
        slotid4         maxslot;
};

RESULT

struct CB_SEQUENCE4resok {
        clientid4       clientid;
        sessionid4      sessionid;
        sequenceid4     sequenceid;
        slotid4         slotid;
        slotid4         maxslot;
        slotid4         target_maxslot;
};

union CB_SEQUENCE4res switch (nfsstat4 status) {
case NFS4_OK:
        CB_SEQUENCE4resok resok4;
default:
        void;
};

DESCRIPTION

The CB_SEQUENCE operation is used to manage operational accounting for the callback channel of the session on which the operation is sent. The contents include the client and session to which this request belongs; the slotid and sequenceid, used by the server to implement session request control and the duplicate reply cache semantics; and exchanged slot counts, which are used to adjust these values. This operation must appear once as the first operation in each CB_COMPOUND sent after the callback channel is successfully bound, or a protocol error must result.

IMPLEMENTATION

No discussion at this time.

ERRORS

NFS4ERR_BADSESSION NFS4ERR_BADSLOT

14.11 GET_DIR_DELEGATION - Get a directory delegation

Obtain a directory delegation.

SYNOPSIS

(cfh), requested notification -> (cfh), cookieverf, stateid, supported notification

ARGUMENT

struct GET_DIR_DELEGATION4args {
        dir_notification_type4  notification_type;
        attr_notice4            child_attr_delay;
        attr_notice4            dir_attr_delay;
};

/*
 * Notification types.
 */
const DIR_NOTIFICATION_NONE                    = 0x00000000;
const DIR_NOTIFICATION_CHANGE_CHILD_ATTRIBUTES = 0x00000001;
const DIR_NOTIFICATION_CHANGE_DIR_ATTRIBUTES   = 0x00000002;
const DIR_NOTIFICATION_REMOVE_ENTRY            = 0x00000004;
const DIR_NOTIFICATION_ADD_ENTRY               = 0x00000008;
const DIR_NOTIFICATION_RENAME_ENTRY            = 0x00000010;
const DIR_NOTIFICATION_CHANGE_COOKIE_VERIFIER  = 0x00000020;

typedef uint32_t dir_notification_type4;

typedef nfstime4 attr_notice4;

RESULT

struct GET_DIR_DELEGATION4resok {
        verifier4       cookieverf;
        /* Stateid for get_dir_delegation */
        stateid4        stateid;
        /* Which notifications can the server support */
        dir_notification_type4  supp_notification;
        bitmap4         child_attributes;
        bitmap4         dir_attributes;
};

union GET_DIR_DELEGATION4res switch (nfsstat4 status) {
case NFS4_OK:
        /* CURRENT_FH: delegated dir */
        GET_DIR_DELEGATION4resok resok4;
default:
        void;
};

DESCRIPTION

The GET_DIR_DELEGATION operation is used by a client to request a directory delegation. The directory is represented by the current filehandle. The client also specifies whether it wants the server to notify it when the directory changes in certain ways by setting one or more bits in a bitmap. The server may choose not to grant the delegation; in that case the server will return NFS4ERR_DIRDELEG_UNAVAIL. If the server decides to hand out the delegation, it will return a cookie verifier for that directory. If the cookie verifier changes while the client is holding the delegation, the delegation will be recalled unless the client has asked for notification of this event, in which case a notification will be sent to the client.

The server will also return a directory delegation stateid, in addition to the cookie verifier, as a result of the GET_DIR_DELEGATION operation. This stateid will appear in callback messages related to the delegation, such as notifications and delegation recalls. The client will use this stateid to return the delegation voluntarily or upon recall. A delegation is returned by calling the DELEGRETURN operation.

The server may not be able to support notifications of certain events.
If the client asks for such notifications, the server must inform the client of its inability to do so as part of the GET_DIR_DELEGATION reply, by not setting the appropriate bits in the supported notifications bitmask contained in the reply.

The GET_DIR_DELEGATION operation can be used for both normal and named attribute directories. It covers all the entries in the directory except the ".." entry. That means that if a directory and its parent both hold directory delegations, any changes to the parent will not cause a notification to be sent for the child, even though the child's ".." entry points to the parent.

IMPLEMENTATION

Directory delegation provides the benefit of improving cache consistency of namespace information. This is done through synchronous callbacks. A server must support synchronous callbacks in order to support directory delegations. In addition, asynchronous notifications provide a way to reduce network traffic as well as to improve client performance in certain conditions. Notifications would not be requested when the goal is just cache consistency.

Notifications are specified in terms of potential changes to the directory. A client can ask to be notified whenever an entry is added to a directory by setting notification_type to DIR_NOTIFICATION_ADD_ENTRY. It can also ask for notifications on entry removal, renames, directory attribute changes, and cookie verifier changes by setting the notification_type flags appropriately. In addition, the client can ask for notifications upon attribute changes to children in the directory, to keep its attribute cache up to date. However, any changes made to child attributes do not cause the delegation to be recalled. If a client is interested in directory entry caching, or negative name caching, it can set the notification_type appropriately and the server will notify it of all changes that would otherwise invalidate its name cache. The kind of notification a client asks for may depend on the directory size, its rate of change, and the applications being used to access that directory. However, the conditions under which a client might ask for a notification are outside the scope of this specification.

The client will set one or more bits in a bitmap (notification_type) to let the server know what kind of notification(s) it is interested in. For attribute notifications it will set bits in another bitmap to indicate which attributes it wants to be notified of. If the server does not support notifications for changes to a certain attribute, it should not set that attribute in the supported attribute bitmap (supp_notification) specified in the reply.

In addition, the client will also let the server know whether it wants to get the notification as soon as the attribute change occurs or after a certain delay, by setting a delay factor: child_attr_delay for attribute changes to children and dir_attr_delay for attribute changes to the directory. If this delay factor is set to zero, that indicates to the server that the client wants to be notified of any attribute changes as soon as they occur. If the delay factor is set to N, the server will make a best effort guarantee that attribute updates are not out of sync by more than that.
One value covers all attribute changes for the directory and another value covers all attribute changes for all children in the directory. If the client asks for a delay factor that the server does not support, or one that may cause significant resource consumption on the server by causing the server to send a lot of notifications, the server should not commit to sending out notifications for that attribute and therefore must not set the appropriate bit in the child_attributes and dir_attributes bitmaps in the response.

The server will let the client know which notifications it can support by setting the appropriate bits in a bitmap. If it agrees to send attribute notifications, it will also set two attribute masks indicating which attributes it will send change notifications for. One of the masks covers changes in directory attributes and the other covers attribute changes to any files in the directory.

The client should use a security flavor that the filesystem is exported with. If it uses a different flavor, the server should return NFS4ERR_WRONGSEC.
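Putting the request-side pieces together, a client asking for name-cache maintenance plus batched child attribute updates might fill in the arguments roughly as follows. This non-normative C sketch flattens the delay factors to seconds for brevity; the real arguments use nfstime4, and the supp_notification value shown is an invented server reply.

   #include <stdint.h>
   #include <stdio.h>

   /* Notification bits from the ARGUMENT section above. */
   #define DIR_NOTIFICATION_CHANGE_CHILD_ATTRIBUTES 0x00000001
   #define DIR_NOTIFICATION_REMOVE_ENTRY            0x00000004
   #define DIR_NOTIFICATION_ADD_ENTRY               0x00000008

   /* Simplified stand-in for GET_DIR_DELEGATION4args. */
   struct get_dir_delegation_args {
       uint32_t notification_type;
       uint32_t child_attr_delay_secs;
       uint32_t dir_attr_delay_secs;
   };

   int main(void)
   {
       struct get_dir_delegation_args args = {
           .notification_type = DIR_NOTIFICATION_ADD_ENTRY
                              | DIR_NOTIFICATION_REMOVE_ENTRY
                              | DIR_NOTIFICATION_CHANGE_CHILD_ATTRIBUTES,
           .child_attr_delay_secs = 5,  /* tolerate 5 s of attribute skew */
           .dir_attr_delay_secs   = 0,  /* unused: no dir-attribute bit set */
       };

       /* After the reply, the client must check supp_notification: any
        * bit it set that the server cleared is a notification it will
        * not receive. */
       uint32_t supp_notification = 0x0000000C;  /* e.g., add/remove only */
       uint32_t granted = args.notification_type & supp_notification;
       printf("granted notification mask: 0x%08x\n", granted);
       return 0;
   }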
ERRORS

NFS4ERR_ACCESS NFS4ERR_BADHANDLE NFS4ERR_BADXDR NFS4ERR_FHEXPIRED NFS4ERR_INVAL NFS4ERR_MOVED NFS4ERR_NOFILEHANDLE NFS4ERR_NOTDIR NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT NFS4ERR_STALE NFS4ERR_DIRDELEG_UNAVAIL NFS4ERR_WRONGSEC NFS4ERR_EIO NFS4ERR_NOTSUPP

14.12 CB_NOTIFY - Notify directory changes

Tell the client of directory changes.

SYNOPSIS

stateid, notification -> {}

ARGUMENT

struct CB_NOTIFY4args {
        stateid4        stateid;
        dir_notification4 changes<>;
};

/*
 * Notification information sent to the client.
 */
union dir_notification4
switch (dir_notification_type4 notification_type) {
case DIR_NOTIFICATION_CHANGE_CHILD_ATTRIBUTES:
        dir_notification_attribute4 change_child_attributes;
case DIR_NOTIFICATION_CHANGE_DIR_ATTRIBUTES:
        fattr4 change_dir_attributes;
case DIR_NOTIFICATION_REMOVE_ENTRY:
        dir_notification_remove4 remove_notification;
case DIR_NOTIFICATION_ADD_ENTRY:
        dir_notification_add4 add_notification;
case DIR_NOTIFICATION_RENAME_ENTRY:
        dir_notification_rename4 rename_notification;
case DIR_NOTIFICATION_CHANGE_COOKIE_VERIFIER:
        dir_notification_verifier4 verf_notification;
};

/*
 * Changed entry information.
 */
struct dir_entry {
        component4      file;
        fattr4          attrs;
};

struct dir_notification_attribute4 {
        dir_entry       changed_entry;
};

struct dir_notification_remove4 {
        dir_entry       old_entry;
        nfs_cookie4     old_entry_cookie;
};

struct dir_notification_rename4 {
        dir_entry       old_entry;
        dir_notification_add4 new_entry;
};

struct dir_notification_verifier4 {
        verifier4       old_cookieverf;
        verifier4       new_cookieverf;
};

struct dir_notification_add4 {
        dir_entry       new_entry;
        /* what READDIR would have returned for this entry */
        nfs_cookie4     new_entry_cookie;
        bool            last_entry;
        prev_entry_info4 prev_info;
};

union prev_entry_info4 switch (bool isprev) {
case TRUE:      /* A previous entry exists */
        prev_entry4 prev_entry_info;
case FALSE:     /* we are adding to an empty directory */
        void;
};

/*
 * Previous entry information
 */
struct prev_entry4 {
        dir_entry       prev_entry;
        /* what READDIR returned for this entry */
        nfs_cookie4     prev_entry_cookie;
};

RESULT

struct CB_NOTIFY4res {
        nfsstat4        status;
};

DESCRIPTION

The CB_NOTIFY operation is used by the server to send notifications to clients about changes in a delegated directory. These notifications are sent over the callback path. The notification is sent once the original request has been processed on the server. The server will send an array of notifications for all changes that might have occurred in the directory. The dir_notification_type4 can only have one bit set for each notification in the array. If the client holding the delegation makes any changes in the directory that cause files or subdirectories to be added or removed, the server will notify that client of the resulting change(s). If the client holding the delegation is making attribute or cookie verifier changes only, the server does not need to send notifications to that client. The server will send the following information for each operation:

* ADDING A FILE: The server will send information about the new entry being created, along with the cookie for that entry. The entry information contains the NFS name of the entry and its attributes. If this entry is added to the end of the directory, the server will set the last_entry flag to true. If the file is added such that there is at least one entry before it, the server will also return the previous entry information, along with its cookie. This is to help clients find the right location in their DNLC or directory caches where this entry should be cached.

* REMOVING A FILE: The server will send information about the directory entry being deleted. The server will also send the cookie value for the deleted entry so that clients can get to the cached information for this entry.

* RENAMING A FILE: The server will send information about both the old entry and the new entry. This includes the name and attributes for each entry. This notification is only sent if both entries are in the same directory. If the rename is across directories, the server will send a remove notification to one directory and an add notification to the other directory, assuming both have a directory delegation.
* FILE/DIR ATTRIBUTE CHANGE: The client will use the attribute mask to inform the server of the attributes for which it wants to receive notifications. This change notification can be requested both for changes to the attributes of the directory and for changes to any file attributes in the directory, by using two separate attribute masks. The client cannot ask for change attribute notification per file; one attribute mask covers all the files in the directory. Upon any attribute change, the server will send back the values of the changed attributes. Notifications might not make sense for some filesystem wide attributes, and it is up to the server to decide which subset it wants to support. The client can negotiate the frequency of attribute notifications by letting the server know how often it wants to be notified of an attribute change. The server will return the supported notification frequencies, or an indication that no notification is permitted for directory or child attributes, by setting the supp_dir_attr_notice and supp_child_attr_notice attributes respectively.

* COOKIE VERIFIER CHANGE: If the cookie verifier changes while a client is holding a delegation, the server will notify the client so that it can invalidate its cookies and reissue a READDIR to get the new set of cookies.

IMPLEMENTATION

ERRORS

NFS4ERR_BAD_STATEID NFS4ERR_INVAL NFS4ERR_BADXDR NFS4ERR_SERVERFAULT

14.13 CB_RECALL_ANY - Keep any N delegations

Notify the client to return delegations, keeping N of them.

SYNOPSIS

N -> {}

ARGUMENT

struct CB_RECALLANY4args {
        uint32_t        dlgs_to_keep;
};

RESULT

struct CB_RECALLANY4res {
        nfsstat4        status;
};

DESCRIPTION

The server may decide that it cannot hold all the delegation state without running out of resources. Since the server has no knowledge of which delegations are being used more than others, it cannot implement an effective reclaim scheme that avoids reclaiming frequently used delegations. In that case the server may issue a CB_RECALL_ANY callback to the client, asking it to keep N delegations and return the rest. The reason why CB_RECALL_ANY specifies a count of delegations the client may keep, as opposed to a count of delegations the client must yield, is as follows. Were it otherwise, there would be a potential for a race between a CB_RECALL_ANY that had a count of delegations to free and a set of client-originated operations to return delegations. As a result of the race, the client and server would have differing ideas as to how many delegations to return, and hence the client could mistakenly free too many delegations. This operation applies to delegations for a regular file (read or write) as well as for a directory.

The client can choose to return any type of delegation as a result of this callback, i.e., a read, write, or directory delegation. The client can also choose to keep more delegations than the server asked for, and it is up to the server to handle this situation. The server must give the client enough time to return the delegations. This time should not be less than the lease period.
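On the client, honoring the keep-N semantics might reduce to ordering delegations by recent use and returning the tail. The following C sketch is non-normative, with invented names standing in for real stateids and the DELEGRETURN machinery.

   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>

   /* Non-normative sketch of client handling for CB_RECALL_ANY: keep
    * the dlgs_to_keep most recently used delegations (of any type:
    * read, write, or directory) and return the rest. */
   struct delegation {
       uint64_t stateid_hint;  /* stand-in for the real stateid */
       uint64_t last_used;     /* client-local LRU clock */
   };

   /* Sort most recently used first. */
   static int by_recency_desc(const void *a, const void *b)
   {
       const struct delegation *x = a, *y = b;
       return (x->last_used < y->last_used) - (x->last_used > y->last_used);
   }

   static void cb_recall_any(struct delegation *dlgs, size_t n,
                             uint32_t dlgs_to_keep)
   {
       qsort(dlgs, n, sizeof(*dlgs), by_recency_desc);
       for (size_t i = dlgs_to_keep; i < n; i++) {
           /* DELEGRETURN (or a directory delegation return) goes here. */
           printf("returning delegation %llu\n",
                  (unsigned long long)dlgs[i].stateid_hint);
       }
   }

   int main(void)
   {
       struct delegation d[] = { {1, 10}, {2, 30}, {3, 20} };
       cb_recall_any(d, 3, 1);  /* keep the most recently used (stateid 2) */
       return 0;
   }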
IMPLEMENTATION

ERRORS

NFS4ERR_RESOURCE

14.14 LAYOUTGET - Get Layout Information

SYNOPSIS

(cfh), clientid, layout_type, iomode, offset, length, minlength, maxcount -> layout

ARGUMENT

struct LAYOUTGET4args {
        /* CURRENT_FH: file */
        clientid4               clientid;
        pnfs_layouttype4        layout_type;
        pnfs_layoutiomode4      iomode;
        offset4                 offset;
        length4                 length;
        length4                 minlength;
        count4                  maxcount;
};

RESULT

struct LAYOUTGET4resok {
        pnfs_layout4 layout;
};

union LAYOUTGET4res switch (nfsstat4 status) {
case NFS4_OK:
        LAYOUTGET4resok resok4;
default:
        void;
};

DESCRIPTION

Requests a layout for reading or writing (and reading) the file given by the filehandle at the byte range specified by offset and length. Layouts are identified by the clientid, filehandle, and layout type. The use of the iomode depends upon the layout type, but it should reflect the client's data access intent.

The LAYOUTGET operation returns layout information for the specified byte range: a layout segment. To get a layout segment from a specific offset through the end-of-file, regardless of the file's length, a length field with all bits set to 1 (one) should be used. If the length is zero, or if a length which is not all bits set to one is specified and, when added to the offset, exceeds the maximum 64-bit unsigned integer value, the error NFS4ERR_INVAL will result.

The "minlength" field specifies the minimum size of the overlap with the requested offset and length that is to be returned. If this requirement cannot be met, no layout must be returned; the error NFS4ERR_LAYOUTTRYLATER can be returned instead.

The "maxcount" field specifies the maximum layout size (in bytes) that the client can handle. If the size of the layout structure exceeds the size specified by maxcount, the metadata server will return the NFS4ERR_TOOSMALL error.

As well, the metadata server may adjust the range of the returned layout segment based on striping patterns and usage implied by the iomode. The client must be prepared to get a layout that does not line up exactly with its request; there MUST be at least an overlap of "minlength" between the layout returned by the server and the client's request, or the server SHOULD reject the request. See Section 7.3 for more details.

The metadata server may also return a layout segment with an iomode other than that requested by the client. If it does so, it must ensure that the iomode is more permissive than the iomode requested. E.g., this allows an implementation to upgrade read-only requests to read/write requests at its discretion, within the limits of the layout type specific protocol. An iomode of either LAYOUTIOMODE_READ or LAYOUTIOMODE_RW must be returned.

The format of the returned layout is specific to the underlying file system. Layout types other than the NFSv4 file layout type should be specified outside of this document.

If layouts are not supported for the requested file or its containing file system, the server SHOULD return NFS4ERR_LAYOUTUNAVAILABLE. If the layout type is not supported, the metadata server should return NFS4ERR_UNKNOWN_LAYOUTTYPE.
If layouts are supported but no layout matches the client provided layout identification, the server should return NFS4ERR_BADLAYOUT. If an invalid iomode is specified, or an iomode of LAYOUTIOMODE_ANY is specified, the server should return NFS4ERR_BADIOMODE.

If the layout for the file is unavailable due to transient conditions, e.g., file sharing prohibits layouts, the server must return NFS4ERR_LAYOUTTRYLATER.

If the layout request is rejected due to an overlapping layout recall, the server must return NFS4ERR_RECALLCONFLICT. See Section 7.5.3 for details.

If the layout conflicts with a mandatory byte range lock held on the file, and if the storage devices have no method of enforcing mandatory locks other than through the restriction of layouts, the metadata server should return NFS4ERR_LOCKED.

On success, the current filehandle retains its value.

IMPLEMENTATION

Typically, LAYOUTGET will be called as part of a compound RPC after an OPEN operation and results in the client having location information for the file; a client may also hold a layout across multiple OPENs. The client specifies a layout type that limits what kind of layout the server will return. This prevents servers from issuing layouts that are unusable by the client.
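As an illustration of the conventions above (an all-ones length meaning "through EOF", minlength as a floor on the returned overlap, maxcount as a cap on the reply size, and a retry on the transient NFS4ERR_LAYOUTTRYLATER), a client request loop might look as follows. This is a non-normative C sketch; the stub, the parameter values, and the numeric error value are all invented placeholders.

   #include <stdint.h>
   #include <unistd.h>

   /* Placeholder status values; only the control flow is the point. */
   #define NFS4_OK                    0
   #define NFS4ERR_LAYOUTTRYLATER 10010   /* invented numeric value */

   #define LENGTH_THROUGH_EOF  UINT64_MAX /* all bits set to 1 */

   static int layoutget(uint64_t offset, uint64_t length,
                        uint64_t minlength, uint32_t maxcount)
   {
       /* A real implementation issues the COMPOUND here. */
       (void)offset; (void)length; (void)minlength; (void)maxcount;
       return NFS4_OK;
   }

   /* Request a layout covering the whole file, retrying on the
    * transient NFS4ERR_LAYOUTTRYLATER error. */
   static int get_whole_file_layout(void)
   {
       for (int tries = 0; tries < 5; tries++) {
           int status = layoutget(0, LENGTH_THROUGH_EOF,
                                  4096,    /* minimum usable overlap */
                                  65536);  /* max layout size we can parse */
           if (status != NFS4ERR_LAYOUTTRYLATER)
               return status;
           sleep(1);  /* back off before retrying */
       }
       return NFS4ERR_LAYOUTTRYLATER;
   }

   int main(void) { return get_whole_file_layout() == NFS4_OK ? 0 : 1; }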
ERRORS

NFS4ERR_BADLAYOUT
NFS4ERR_BADIOMODE
NFS4ERR_FHEXPIRED
NFS4ERR_INVAL
NFS4ERR_LAYOUTUNAVAILABLE
NFS4ERR_LAYOUTTRYLATER
NFS4ERR_LOCKED
NFS4ERR_NOFILEHANDLE
NFS4ERR_NOTSUPP
NFS4ERR_RECALLCONFLICT
NFS4ERR_STALE
NFS4ERR_STALE_CLIENTID
NFS4ERR_TOOSMALL
NFS4ERR_UNKNOWN_LAYOUTTYPE

14.15 LAYOUTCOMMIT - Commit writes made using a layout

SYNOPSIS

(cfh), clientid, offset, length, last_write_offset, time_modify, time_access, layoutupdate -> newsize

ARGUMENT

union newtime4 switch (bool timechanged) {
case TRUE:
        nfstime4        time;
case FALSE:
        void;
};

union newsize4 switch (bool sizechanged) {
case TRUE:
        length4         size;
case FALSE:
        void;
};

struct LAYOUTCOMMIT4args {
        /* CURRENT_FH: file */
        clientid4       clientid;
        offset4         offset;
        length4         length;
        length4         last_write_offset;
        newtime4        time_modify;
        newtime4        time_access;
        pnfs_layoutupdate4 layoutupdate;
};

RESULT

struct LAYOUTCOMMIT4resok {
        newsize4        newsize;
};

union LAYOUTCOMMIT4res switch (nfsstat4 status) {
case NFS4_OK:
        LAYOUTCOMMIT4resok resok4;
default:
        void;
};

DESCRIPTION

Commits changes in the layout segment represented by the current filehandle, clientid, and byte range. Since layouts are sub-dividable, a smaller portion of a layout, retrieved via LAYOUTGET, may be committed. The region being committed is specified through the byte range (length and offset). Note: the "layoutupdate" structure does not include the length and offset, as they are already specified in the arguments.

The LAYOUTCOMMIT operation indicates that the client has completed writes using a layout obtained by a previous LAYOUTGET. The client may have written only a subset of the data range it previously requested. LAYOUTCOMMIT allows it to commit or discard provisionally allocated space and to update the server with a new end of file. The layout referenced by LAYOUTCOMMIT is still valid after the operation completes and can continue to be referenced by the clientid, filehandle, byte range, and layout type.

The "last_write_offset" field specifies the offset of the last byte written by the client previous to the LAYOUTCOMMIT. Note: this value is never equal to the file's size (at most it is one byte less than the file's size). The metadata server may use this information to determine whether the file's size needs to be updated. If the metadata server updates the file's size as the result of the LAYOUTCOMMIT operation, it must return the new size as part of the results.

The "time_modify" and "time_access" fields allow the client to suggest times it would like the metadata server to set. The metadata server may use these time values or it may use the time of the LAYOUTCOMMIT operation to set these time values. If the metadata server uses the client provided times, it should sanity check the values (e.g., to ensure time does not flow backwards). If the client wants to force the metadata server to set an exact time, the client should use a SETATTR operation in a compound right after LAYOUTCOMMIT. See Section 7.4 for more details. If the client desires the resultant mtime or atime, it should issue a GETATTR following the LAYOUTCOMMIT, e.g., later in the same compound.

The "layoutupdate" argument to LAYOUTCOMMIT provides a mechanism for a client to provide layout specific updates to the metadata server. For example, the layout update can describe what regions of the original layout have been used and what regions can be deallocated. There is no NFSv4 file layout specific layoutupdate structure.

The layout information is more verbose for block devices than for objects and files, because the latter hide the details of block allocation behind their storage protocols. At a minimum, the client needs to communicate changes to the end-of-file location back to the server and, if desired, its view of the file modify and access times. For block/volume layouts, it needs to specify precisely which blocks have been used.

If the layout identified in the arguments does not exist, the error NFS4ERR_BADLAYOUT is returned. The layout being committed may also be rejected if it does not correspond to an existing layout with an iomode of RW.

On success, the current filehandle retains its value.
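Because last_write_offset names the last byte written rather than a byte count, the server's size check is sensitive to an off-by-one: the implied size is last_write_offset plus one. A non-normative one-function sketch:

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Non-normative sketch: decide whether LAYOUTCOMMIT's
    * last_write_offset implies a new file size.  last_write_offset is
    * the offset of the last byte written, so the implied size is that
    * offset plus one. */
   static bool maybe_grow(uint64_t last_write_offset, uint64_t *size)
   {
       uint64_t implied_size = last_write_offset + 1;
       if (implied_size > *size) {
           *size = implied_size;   /* reported to the client via newsize4 */
           return true;            /* sizechanged = TRUE */
       }
       return false;               /* sizechanged = FALSE */
   }

   int main(void)
   {
       uint64_t size = 8192;
       /* Writing the last byte of an 8 KiB file does not grow it:
        * last_write_offset 8191 implies size 8192, which is not larger. */
       bool changed = maybe_grow(8191, &size);
       printf("changed=%d size=%llu\n", changed, (unsigned long long)size);
       return 0;
   }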
5968 Layouts may be returned when recalled or voluntarily (i.e., before
5969 the server has recalled them).  In either case, the client must
5970 properly propagate any state changed under the context of the layout
5971 to storage or to the server before returning the layout.

5973 If a client fails to return a layout in a timely manner, then the
5974 metadata server should use its control protocol with the storage
5975 devices to fence the client from accessing the data referenced by the
5976 layout.  See Section 7.5 for more details.

5978 If the layout identified in the arguments does not exist, the error
5979 NFS4ERR_BADLAYOUT is returned.  If a layout exists, but the iomode
5980 does not match, NFS4ERR_BADIOMODE is returned.

5982 On success, the current filehandle retains its value.

5984 [OPEN ISSUE: Should LAYOUTRETURN be modified to handle FSID
5985 callbacks?]

5987 ERRORS

5989    NFS4ERR_BADLAYOUT
5990    NFS4ERR_BADIOMODE
5991    NFS4ERR_FHEXPIRED
5992    NFS4ERR_INVAL
5993    NFS4ERR_NOFILEHANDLE
5994    NFS4ERR_STALE
5995    NFS4ERR_STALE_CLIENTID
5996    NFS4ERR_UNKNOWN_LAYOUTTYPE

5998 14.17 GETDEVICEINFO - Get Device Information

6000 SYNOPSIS

6002    (cfh), device_id, layout_type, maxcount -> device_addr

6004 ARGUMENT

6006    struct GETDEVICEINFO4args {
6007       /* CURRENT_FH: file */
6008       pnfs_deviceid4     device_id;
6009       pnfs_layouttype4   layout_type;
6010       count4             maxcount;
6011    };

6013 RESULT

6015    struct GETDEVICEINFO4resok {
6016       pnfs_deviceaddr4   device_addr;
6017    };

6019    union GETDEVICEINFO4res switch (nfsstat4 status) {
6020    case NFS4_OK:
6021       GETDEVICEINFO4resok   resok4;
6022    default:
6023       void;
6024    };

6026 DESCRIPTION

6028 Returns device type and device address information for a specified
6029 device.  The returned device_addr includes a type that indicates how
6030 to interpret the addressing information for that device.  The current
6031 filehandle (cfh) is used to identify the file system; device IDs are
6032 unique per file system (FSID) and are qualified by the layout type.

6034 See Section 7.1.4 for more details on device ID assignment.

6036 If the size of the device address exceeds maxcount bytes, the
6037 metadata server will return the error NFS4ERR_TOOSMALL.  If an
6038 invalid device ID is given, the metadata server will respond with
6039 NFS4ERR_INVAL.
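Since a client generally cannot know the encoded size of a device
address in advance, a natural (non-normative) strategy is to retry
GETDEVICEINFO with a larger maxcount whenever NFS4ERR_TOOSMALL is
returned.  In the C sketch below, getdeviceinfo() is a hypothetical
stub standing in for a real COMPOUND round trip; only the retry logic
is the point.

   #include <stdint.h>
   #include <stdio.h>

   #define NFS4_OK          0       /* nfsstat4 values as in [2] */
   #define NFS4ERR_TOOSMALL 10005

   /* Hypothetical stub: pretend the encoded device address needs
    * 3000 bytes and fail with NFS4ERR_TOOSMALL until maxcount fits. */
   static int getdeviceinfo(uint32_t device_id, uint32_t maxcount)
   {
       const uint32_t addr_size = 3000;
       (void)device_id;
       return maxcount < addr_size ? NFS4ERR_TOOSMALL : NFS4_OK;
   }

   int main(void)
   {
       uint32_t maxcount = 1024;
       int status;

       /* Double the reply budget until the device address fits. */
       while ((status = getdeviceinfo(7, maxcount)) == NFS4ERR_TOOSMALL) {
           maxcount *= 2;
           printf("NFS4ERR_TOOSMALL, retrying with maxcount=%u\n",
                  maxcount);
       }
       printf("final status=%d, maxcount=%u\n", status, maxcount);
       return 0;
   }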
6041 ERRORS

6043    NFS4ERR_FHEXPIRED
6044    NFS4ERR_INVAL
6045    NFS4ERR_TOOSMALL
6046    NFS4ERR_UNKNOWN_LAYOUTTYPE

6048 14.18 GETDEVICELIST - Get List of Devices

6050 SYNOPSIS

6052    (cfh), layout_type, maxcount, cookie, cookieverf ->
6053    cookie, cookieverf, device_addrs<>

6055 ARGUMENT

6057    struct GETDEVICELIST4args {
6058       /* CURRENT_FH: file */
6059       pnfs_layouttype4   layout_type;
6060       count4             maxcount;
6061       nfs_cookie4        cookie;
6062       verifier4          cookieverf;
6063    };

6065 RESULT

6067    struct GETDEVICELIST4resok {
6068       nfs_cookie4          cookie;
6069       verifier4            cookieverf;
6070       pnfs_devlist_item4   device_addrs<>;
6071    };

6073    union GETDEVICELIST4res switch (nfsstat4 status) {
6074    case NFS4_OK:
6075       GETDEVICELIST4resok   resok4;
6076    default:
6077       void;
6078    };

6080 DESCRIPTION

6082 In some applications, especially SAN environments, it is convenient
6083 to find out about all the devices associated with a file system.
6084 This lets a client determine if it has access to these devices, e.g.,
6085 at mount time.

6087 This operation returns an array of items (pnfs_devlist_item4) that
6088 establish the association between the short pnfs_deviceid4 and the
6089 addressing information for that device, for a particular layout type.
6090 This operation may not be able to fetch all device information at
6091 once; thus it uses a cookie-based approach, similar to READDIR, to
6092 fetch additional device information (see [2], section 14.2.24).  As
6093 in GETDEVICEINFO, the current filehandle (cfh) is used to identify
6094 the file system.

6096 As in GETDEVICEINFO, maxcount specifies the maximum number of bytes
6097 to return.  If the metadata server is unable to return a single
6098 device address, it will return the error NFS4ERR_TOOSMALL.  If an
6099 invalid device ID is given, the metadata server will respond with
6100 NFS4ERR_INVAL.

6102 ERRORS

6104    NFS4ERR_BAD_COOKIE
6105    NFS4ERR_FHEXPIRED
6106    NFS4ERR_INVAL
6107    NFS4ERR_TOOSMALL
6108    NFS4ERR_UNKNOWN_LAYOUTTYPE

6110 14.19 CB_LAYOUTRECALL

6112 SYNOPSIS

6114    layout_type, iomode, layoutchanged, layoutrecall -> -

6116 ARGUMENT

6118    enum layoutrecall_type4 {
6119       RECALL_FILE = 1,
6120       RECALL_FSID = 2
6121    };

6123    struct layoutrecall_file4 {
6124       nfs_fh4   fh;
6125       offset4   offset;
6126       length4   length;
6127    };

6129    union layoutrecall4 switch (layoutrecall_type4 recalltype) {
6130    case RECALL_FILE:
6131       layoutrecall_file4   layout;
6132    case RECALL_FSID:
6133       fsid4                fsid;
6134    };

6136    struct CB_LAYOUTRECALLargs {
6137       pnfs_layouttype4     layout_type;
6138       pnfs_layoutiomode4   iomode;
6139       bool                 layoutchanged;
6140       layoutrecall4        layoutrecall;
6141    };

6143 RESULT

6145    struct CB_LAYOUTRECALLres {
6146       nfsstat4   status;
6147    };

6149 DESCRIPTION

6151 The CB_LAYOUTRECALL operation is used to begin the process of
6152 recalling a layout, a portion thereof, or all layouts pertaining to a
6153 particular file system (FSID).  If RECALL_FILE is specified, the
6154 offset and length fields specify the portion of the layout to be
6155 returned.  The iomode specifies the set of layouts to be returned.
6156 An iomode of ANY specifies that all matching layouts, regardless of
6157 iomode, must be returned; otherwise, only layouts that exactly match
6158 the iomode must be returned.

6160 If the "layoutchanged" field is TRUE, then the client SHOULD NOT
6161 flush its dirty data to the devices specified by the layout being
6162 recalled.  Instead, it is preferable for the client to flush the
6163 dirty data through the metadata server.  Alternatively, the client
6164 may attempt to obtain a new layout.  Note: in order to obtain a new
6165 layout, the client must first return the old layout.  Since obtaining
6166 a new layout is not guaranteed to succeed, the client must be ready
6167 to flush its dirty data through the metadata server.
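The client's choices in the preceding paragraph reduce to a small
decision tree, sketched non-normatively in C below.  The helpers
flush_to_storage(), flush_through_mds(), layoutcommit(),
layoutreturn(), and try_layoutget() are hypothetical stand-ins for
the corresponding protocol actions, not real APIs.

   #include <stdbool.h>
   #include <stdio.h>

   /* Hypothetical helpers standing in for protocol actions. */
   static void flush_to_storage(void)  { puts("flush dirty data via layout"); }
   static void flush_through_mds(void) { puts("flush via NFSv4 WRITEs to MDS"); }
   static void layoutcommit(void)      { puts("LAYOUTCOMMIT"); }
   static void layoutreturn(void)      { puts("LAYOUTRETURN"); }
   static bool try_layoutget(void)     { return false; /* pretend refused */ }

   static void handle_recall(bool layoutchanged)
   {
       if (!layoutchanged) {
           /* Layout contents still valid: dirty data may go straight
            * to the storage devices before the layout is returned. */
           flush_to_storage();
           layoutcommit();
           layoutreturn();
           return;
       }
       /* layoutchanged == TRUE: the client SHOULD NOT flush to the
        * recalled devices.  Return the old layout first, then try to
        * obtain a new one; fall back to the metadata server. */
       layoutreturn();
       if (try_layoutget()) {
           flush_to_storage();
           layoutcommit();
           layoutreturn();
       } else {
           flush_through_mds();
       }
   }

   int main(void)
   {
       handle_recall(true);
       return 0;
   }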
6169 If RECALL_FSID is specified, the fsid specifies the file system for
6170 which any outstanding layouts must be returned.  Layouts are returned
6171 through the LAYOUTRETURN operation.

6173 If the client does not hold any layout segment either matching or
6174 overlapping with the requested layout, it returns
6175 NFS4ERR_NOMATCHING_LAYOUT.  If a length of all 1s is specified, then
6176 the layout corresponding to the byte range from "offset" to the end-
6177 of-file MUST be returned.

6179 IMPLEMENTATION

6181 The client should reply to the callback immediately.  Replying does
6182 not complete the recall except when an error is returned.  The recall
6183 is not complete until the layout(s) are returned using a
6184 LAYOUTRETURN.

6186 The client should complete any in-flight I/O operations using the
6187 recalled layout(s) before returning them via LAYOUTRETURN.  If the
6188 client has buffered dirty data, there are a number of options for
6189 flushing that data.  If "layoutchanged" is false, the client may
6190 choose to write dirty data directly to storage before calling
6191 LAYOUTRETURN.  However, if "layoutchanged" is true, the client may
6192 either choose to write it later using normal NFSv4 WRITE operations
6193 to the metadata server or it may attempt to obtain a new layout,
6194 after first returning the recalled layout, using the new layout to
6195 flush the dirty data.  Regardless of whether the client is holding a
6196 layout, it may always write data through the metadata server.

6198 If dirty data is flushed while the layout is held, the client must
6199 still issue LAYOUTCOMMIT operations at the appropriate time,
6200 especially before issuing the LAYOUTRETURN.  If a large amount of
6201 dirty data is outstanding, the client may issue LAYOUTRETURNs for
6202 portions of the layout being recalled; this allows the server to
6203 monitor the client's progress and adherence to the callback.
6204 However, the last LAYOUTRETURN in a sequence of returns SHOULD
6205 specify the full range being recalled (see Section 7.5.2 for
6206 details).

6208 ERRORS

6210    NFS4ERR_NOMATCHING_LAYOUT

6212 14.20 CB_SIZECHANGED

6214 SYNOPSIS

6216    fh, size -> -

6218 ARGUMENT

6220    struct CB_SIZECHANGEDargs {
6221       nfs_fh4   fh;
6222       length4   size;
6223    };

6225 RESULT

6227    struct CB_SIZECHANGEDres {
6228       nfsstat4   status;
6229    };

6231 DESCRIPTION

6233 The CB_SIZECHANGED operation is used to notify the client that the
6234 size pertaining to the filehandle associated with "fh" has changed.
6235 The new size is specified.  Upon reception of this notification
6236 callback, the client should update its internal size for the file.
6237 If the layout being held for the file is of the NFSv4 file layout
6238 type, then the size field within that layout should be updated (see
6239 Section 9.5).  For other layout types, see Section 7.4.2 for more
6240 details.

6242 If the handle specified is not one for which the client holds a
6243 layout, an NFS4ERR_BADHANDLE error is returned.

6245 ERRORS

6247    NFS4ERR_BADHANDLE
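To make the callback's effect concrete, the following non-normative C
sketch updates a client's cached file size and, when an NFSv4 file
layout is held, the size field inside that layout.  lookup_by_fh()
and the structures are illustrative only, not part of the protocol.

   #include <stddef.h>
   #include <stdint.h>
   #include <stdio.h>

   #define NFS4_OK           0       /* nfsstat4 values as in [2] */
   #define NFS4ERR_BADHANDLE 10001

   struct file_layout { uint64_t size; };            /* NFSv4 file layout */
   struct cached_file { uint64_t size; struct file_layout *layout; };

   static struct file_layout the_layout = { 0 };
   static struct cached_file the_file   = { 0, &the_layout };

   /* Hypothetical: map a filehandle to a cached file for which the
    * client holds a layout; NULL means no layout is held. */
   static struct cached_file *lookup_by_fh(const char *fh)
   {
       (void)fh;                      /* toy single-file cache */
       return &the_file;
   }

   static int cb_sizechanged(const char *fh, uint64_t new_size)
   {
       struct cached_file *f = lookup_by_fh(fh);
       if (f == NULL)
           return NFS4ERR_BADHANDLE;  /* no layout held for this handle */
       f->size = new_size;            /* client's internal size */
       if (f->layout != NULL)
           f->layout->size = new_size; /* size field inside the layout */
       return NFS4_OK;
   }

   int main(void)
   {
       cb_sizechanged("fh0", 123456);
       printf("cached size now %llu\n",
              (unsigned long long)the_file.size);
       return 0;
   }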
6249 15. References

6250 15.1 Normative References

6252    [1]  Bradner, S., "Key words for use in RFCs to Indicate
6253         Requirement Levels", BCP 14, RFC 2119, March 1997.

6255    [2]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
6256         C., Eisler, M., and D. Noveck, "Network File System (NFS)
6257         version 4 Protocol", RFC 3530, April 2003.

6259 15.2 Informative References

6261    [3]  Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E.
6262         Zeidner, "Internet Small Computer Systems Interface (iSCSI)",
6263         RFC 3720, April 2004.

6265    [4]  Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version
6266         (FCP-2)", ANSI/INCITS 350-2003, Oct 2003.

6268    [5]  Weber, R., "Object-Based Storage Device Commands (OSD)",
6269         ANSI/INCITS 400-2004, July 2004.

6272    [6]  Black, D., "pNFS Block/Volume Layout", July 2005.

6275    [7]  Zelenka, J., Welch, B., and B. Halevy, "Object-based pNFS
6276         Operations", draft-zelenka-pnfs-obj-01 (work in progress),
              July 2005.

6279 Author's Address

6281    Spencer Shepler
6282    Sun Microsystems, Inc.
6283    7808 Moonflower Drive
6284    Austin, TX 78750
6285    USA

6287    Phone: +1-512-349-9376
6288    Email: spencer.shepler@sun.com

6290 Appendix A.  Acknowledgments

6292 The initial drafts for the SECINFO extensions were edited by Mike
6293 Eisler with contributions from Tom Talpey, Saadia Khan, and Jon
6294 Bauman.

6296 The initial drafts for the SESSIONS extensions were edited by Tom
6297 Talpey, Spencer Shepler, and Jon Bauman, with contributions from Charles
6298 Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak,
6299 Trond Myklebust, Dave Noveck, John Scott, Mike Stolarchuk, and Mark
6300 Wittle.

6302 The initial drafts for the Directory Delegations support were
6303 contributed by Saadia Khan with input from Dave Noveck, Mike Eisler,
6304 Carl Burnett, Ted Anderson, and Tom Talpey.

6306 The initial drafts for the parallel NFS support were edited by Brent
6307 Welch and Garth Goodson.  Additional authors for those documents were
6308 Benny Halevy, David Black, and Andy Adamson.  Additional input came
6309 from the informal group which contributed to the construction of the
6310 initial pNFS drafts; specific acknowledgement goes to Gary Grider,
6311 Peter Corbett, Dave Noveck, and Peter Honeyman.  The pNFS work was
6312 inspired by the NASD and OSD work done by Garth Gibson.  Gary Grider
6313 of the national labs (LANL) has also been a champion of high-
6314 performance parallel I/O.

6316 Intellectual Property Statement

6318 The IETF takes no position regarding the validity or scope of any
6319 Intellectual Property Rights or other rights that might be claimed to
6320 pertain to the implementation or use of the technology described in
6321 this document or the extent to which any license under such rights
6322 might or might not be available; nor does it represent that it has
6323 made any independent effort to identify any such rights.  Information
6324 on the procedures with respect to rights in RFC documents can be
6325 found in BCP 78 and BCP 79.

6327 Copies of IPR disclosures made to the IETF Secretariat and any
6328 assurances of licenses to be made available, or the result of an
6329 attempt made to obtain a general license or permission for the use of
6330 such proprietary rights by implementers or users of this
6331 specification can be obtained from the IETF on-line IPR repository at
6332 http://www.ietf.org/ipr.

6334 The IETF invites any interested party to bring to its attention any
6335 copyrights, patents or patent applications, or other proprietary
6336 rights that may cover technology that may be required to implement
6337 this standard.  Please address the information to the IETF at
6338 ietf-ipr@ietf.org.
6340 Disclaimer of Validity

6342 This document and the information contained herein are provided on an
6343 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
6344 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
6345 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
6346 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
6347 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
6348 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

6350 Copyright Statement

6352 Copyright (C) The Internet Society (2005).  This document is subject
6353 to the rights, licenses and restrictions contained in BCP 78, and
6354 except as set forth therein, the authors retain all their rights.

6356 Acknowledgment

6358 Funding for the RFC Editor function is currently provided by the
6359 Internet Society.