idnits 2.17.1 draft-ietf-nfsv4-minorversion2-21.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- -- The document has examples using IPv4 documentation addresses according to RFC6890, but does not use any IPv6 documentation addresses. Maybe there should be IPv6 examples, too? Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (February 03, 2014) is 3734 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 3927 == Missing Reference: '32K' is mentioned on line 3927, but not defined -- Possible downref: Non-RFC (?) normative reference: ref. 'NFSv42xdr' ** Obsolete normative reference: RFC 5661 (Obsoleted by RFC 8881) == Outdated reference: A later version (-05) exists of draft-ietf-nfsv4-labreqs-04 == Outdated reference: A later version (-35) exists of draft-ietf-nfsv4-rfc3530bis-25 -- Obsolete informational reference (is this intentional?): RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) -- Obsolete informational reference (is this intentional?): RFC 5226 (Obsoleted by RFC 8126) Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 NFSv4 T. Haynes, Ed. 3 Internet-Draft NetApp 4 Intended status: Standards Track February 03, 2014 5 Expires: August 7, 2014 7 NFS Version 4 Minor Version 2 8 draft-ietf-nfsv4-minorversion2-21.txt 10 Abstract 12 This Internet-Draft describes NFS version 4 minor version two, 13 focusing mainly on the protocol extensions made from NFS version 4 14 minor version 0 and NFS version 4 minor version 1. Major extensions 15 introduced in NFS version 4 minor version two include: Server-side 16 Copy, Application I/O Advise, Space Reservations, Sparse Files, 17 Application Data Blocks, and Labeled NFS. 19 Requirements Language 21 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 22 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 23 document are to be interpreted as described in RFC 2119 [RFC2119]. 25 Status of this Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. 
The list of current Internet- 33 Drafts is at http://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on August 7, 2014. 42 Copyright Notice 44 Copyright (c) 2014 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (http://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with respect 52 to this document. Code Components extracted from this document must 53 include Simplified BSD License text as described in Section 4.e of 54 the Trust Legal Provisions and are provided without warranty as 55 described in the Simplified BSD License. 57 Table of Contents 59 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 60 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 5 61 1.2. Scope of This Document . . . . . . . . . . . . . . . . . 5 62 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5 63 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 6 64 1.4.1. Server-side Copy . . . . . . . . . . . . . . . . . . . 6 65 1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 6 66 1.4.3. Sparse Files . . . . . . . . . . . . . . . . . . . . . 6 67 1.4.4. Space Reservation . . . . . . . . . . . . . . . . . . 6 68 1.4.5. Application Data Hole (ADH) Support . . . . . . . . . 6 69 1.4.6. Labeled NFS . . . . . . . . . . . . . . . . . . . . . 6 70 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7 71 2. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . . 7 72 3. Server-side Copy . . . . . . . . . . . . . . . . . . . . . . . 10 73 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 10 74 3.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 11 75 3.2.1. Overview of Copy Operations . . . . . . . . . . . . . 11 76 3.2.2. Locking the Files . . . . . . . . . . . . . . . . . . 12 77 3.2.3. Intra-Server Copy . . . . . . . . . . . . . . . . . . 12 78 3.2.4. Inter-Server Copy . . . . . . . . . . . . . . . . . . 14 79 3.2.5. Server-to-Server Copy Protocol . . . . . . . . . . . . 18 80 3.3. Requirements for Operations . . . . . . . . . . . . . . . 19 81 3.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 20 82 3.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 20 83 3.4. Security Considerations . . . . . . . . . . . . . . . . . 21 84 3.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 21 85 4. Support for Application IO Hints . . . . . . . . . . . . . . . 31 86 5. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 31 87 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 31 88 5.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 32 89 5.3. New Operations . . . . . . . . . . . . . . . . . . . . . 33 90 5.3.1. READ_PLUS . . . . . . . . . . . . . . . . . . . . . . 33 91 5.3.2. WRITE_PLUS . . . . . . . . . . . . . . . . . . . . . . 33 92 6. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 33 93 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 34 94 7. Application Data Hole Support . . . 
. . . . . . . . . . . . . 36 95 7.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 36 96 7.1.1. Data Hole Representation . . . . . . . . . . . . . . . 37 97 7.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 38 98 7.2. An Example of Detecting Corruption . . . . . . . . . . . 38 99 7.3. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 39 100 8. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 40 101 8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 40 102 8.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 41 103 8.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 41 104 8.3.1. Delegations . . . . . . . . . . . . . . . . . . . . . 42 105 8.3.2. Permission Checking . . . . . . . . . . . . . . . . . 42 106 8.3.3. Object Creation . . . . . . . . . . . . . . . . . . . 43 107 8.3.4. Existing Objects . . . . . . . . . . . . . . . . . . . 43 108 8.3.5. Label Changes . . . . . . . . . . . . . . . . . . . . 43 109 8.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 44 110 8.5. Discovery of Server Labeled NFS Support . . . . . . . . . 44 111 8.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 44 112 8.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 44 113 8.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . . 46 114 8.7. Security Considerations . . . . . . . . . . . . . . . . . 46 115 9. Sharing change attribute implementation details with NFSv4 116 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 117 9.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 47 118 10. Security Considerations . . . . . . . . . . . . . . . . . . . 47 119 11. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 47 120 11.1. Error Definitions . . . . . . . . . . . . . . . . . . . . 48 121 11.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 48 122 11.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 48 123 11.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 49 124 11.2. New Operations and Their Valid Errors . . . . . . . . . . 49 125 11.3. New Callback Operations and Their Valid Errors . . . . . 52 126 12. New File Attributes . . . . . . . . . . . . . . . . . . . . . 53 127 12.1. New RECOMMENDED Attributes - List and Definition 128 References . . . . . . . . . . . . . . . . . . . . . . . 53 129 12.2. Attribute Definitions . . . . . . . . . . . . . . . . . . 53 130 13. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 57 131 14. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 60 132 14.1. Operation 59: COPY - Initiate a server-side copy . . . . 60 133 14.2. Operation 60: OFFLOAD_ABORT - Cancel a server-side 134 copy . . . . . . . . . . . . . . . . . . . . . . . . . . 64 135 14.3. Operation 61: COPY_NOTIFY - Notify a source server of 136 a future copy . . . . . . . . . . . . . . . . . . . . . . 65 137 14.4. Operation 62: OFFLOAD_REVOKE - Revoke a destination 138 server's copy privileges . . . . . . . . . . . . . . . . 66 139 14.5. Operation 63: OFFLOAD_STATUS - Poll for status of a 140 server-side copy . . . . . . . . . . . . . . . . . . . . 67 141 14.6. Modification to Operation 42: EXCHANGE_ID - 142 Instantiate Client ID . . . . . . . . . . . . . . . . . . 68 143 14.7. Operation 64: WRITE_PLUS . . . . . . . . . . . . . . . . 69 144 14.8. Operation 67: IO_ADVISE - Application I/O access 145 pattern hints . . . . . . . . . . . . . . . . . . . . . . 75 146 14.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 
80 147 14.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 83 148 14.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 88 149 15. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 89 150 15.1. Operation 15: CB_OFFLOAD - Report results of an 151 asynchronous operation . . . . . . . . . . . . . . . . . 89 152 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 90 153 17. References . . . . . . . . . . . . . . . . . . . . . . . . . . 90 154 17.1. Normative References . . . . . . . . . . . . . . . . . . 90 155 17.2. Informative References . . . . . . . . . . . . . . . . . 91 156 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 92 157 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 93 158 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 94 160 1. Introduction 162 1.1. The NFS Version 4 Minor Version 2 Protocol 164 The NFS version 4 minor version 2 (NFSv4.2) protocol is the third 165 minor version of the NFS version 4 (NFSv4) protocol. The first minor 166 version, NFSv4.0, is described in [I-D.ietf-nfsv4-rfc3530bis] and the 167 second minor version, NFSv4.1, is described in [RFC5661]. It follows 168 the guidelines for minor versioning that are listed in Section 11 of 169 [I-D.ietf-nfsv4-rfc3530bis]. 171 As a minor version, NFSv4.2 is consistent with the overall goals for 172 NFSv4, but extends the protocol so as to better meet those goals, 173 based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted 174 some additional goals, which motivate some of the major extensions in 175 NFSv4.2. 177 1.2. Scope of This Document 179 This document describes the NFSv4.2 protocol. With respect to 180 NFSv4.0 and NFSv4.1, this document does not: 182 o describe the NFSv4.0 or NFSv4.1 protocols, except where needed to 183 contrast with NFSv4.2 185 o modify the specification of the NFSv4.0 or NFSv4.1 protocols 187 o clarify the NFSv4.0 or NFSv4.1 protocols. I.e., any 188 clarifications made here apply to NFSv4.2 and neither of the prior 189 protocols 191 The full XDR for NFSv4.2 is presented in [NFSv42xdr]. 193 1.3. NFSv4.2 Goals 195 The goal of the design of NFSv4.2 is to take common local file system 196 features and offer them remotely. These features might 198 o already be available on the servers, e.g., sparse files 200 o be under development as a new standard, e.g., SEEK_HOLE and 201 SEEK_DATA 203 o be used by clients with the servers via some proprietary means, 204 e.g., Labeled NFS 206 but the clients are not able to leverage them on the server within 207 the confines of the NFS protocol. 209 1.4. Overview of NFSv4.2 Features 211 1.4.1. Server-side Copy 213 A traditional file copy from one server to another results in the 214 data being put on the network twice - source to client and then 215 client to destination. New operations are introduced to allow the 216 client to authorize the two servers to interact directly. As this 217 copy can be lengthy, asynchronous support is also provided. 219 1.4.2. Application I/O Advise 221 Applications and clients want to advise the server as to expected I/O 222 behavior. Using IO_ADVISE (see Section 14.8) to communicate future 223 I/O behavior such as whether a file will be accessed sequentially or 224 randomly, and whether a file will or will not be accessed in the near 225 future, allows servers to optimize future I/O requests for a file by, 226 for example, prefetching or evicting data. 
This operation can be 227 used to support the posix_fadvise function as well as other 228 applications such as databases and video editors. 230 1.4.3. Sparse Files 232 Sparse files are ones which have unallocated data blocks as holes in 233 the file. Such holes are typically transferred as 0s during I/O. 234 READ_PLUS (see Section 14.10) allows a server to send back to the 235 client metadata describing the hole and WRITE_PLUS (see Section 14.7) 236 allows the client to punch holes into a file. In addition, SEEK (see 237 Section 14.11) is provided to scan for the next hole or data from a 238 given location. 240 1.4.4. Space Reservation 242 When a file is sparse, one concern applications have is ensuring that 243 there will always be enough data blocks available for the file during 244 future writes. A new attribute, space_reserved (see Section 12.2.4) 245 provides the client a guarantee that space will be available. 247 1.4.5. Application Data Hole (ADH) Support 249 Some applications treat a file as if it were a disk and as such want 250 to initialize (or format) the file image. We extend both READ_PLUS 251 and WRITE_PLUS to understand this metadata as a new form of a hole. 253 1.4.6. Labeled NFS 255 While both clients and servers can employ Mandatory Access Control 256 (MAC) security models to enforce data access, there has been no 257 protocol support to allow full interoperability. A new file object 258 attribute, sec_label (see Section 12.2.2) allows for the server to 259 store and enforce MAC labels. The format of the sec_label 260 accommodates any MAC security system. 262 1.5. Differences from NFSv4.1 264 In NFSv4.1, the only way to introduce new variants of an operation 265 was to introduce a new operation. I.e., READ becomes either READ2 or 266 READ_PLUS. With the use of discriminated unions as parameters to 267 such functions in NFSv4.2, it is possible to add a new arm in a 268 subsequent minor version. And it is also possible to move such an 269 operation from OPTIONAL/RECOMMENDED to REQUIRED. Forcing an 270 implementation to adopt each arm of a discriminated union at such a 271 time does not meet the spirit of the minor versioning rules. As 272 such, new arms of a discriminated union MUST follow the same 273 guidelines for minor versioning as operations in NFSv4.1 - i.e., they 274 may not be made REQUIRED. To support this, a new error code, 275 NFS4ERR_UNION_NOTSUPP, is introduced which allows the server to 276 communicate to the client that the operation is supported, but the 277 specific arm of the discriminated union is not. 279 2. Minor Versioning 281 To address the requirement of an NFS protocol that can evolve as the 282 need arises, the NFSv4 protocol contains the rules and framework to 283 allow for future minor changes or versioning. 285 The base assumption with respect to minor versioning is that any 286 future accepted minor version will be documented in one or more 287 Standards Track RFCs. Minor version 0 of the NFSv4 protocol is 288 represented by [I-D.ietf-nfsv4-rfc3530bis], minor version 1 by 289 [RFC5661], and minor version 2 by this document. The COMPOUND and 290 CB_COMPOUND procedures support the encoding of the minor version 291 being requested by the client. 293 The following items represent the basic rules for the development of 294 minor versions. Note that a future minor version may modify or add 295 to the following rules as part of the minor version definition. 297 1. Procedures are not added or deleted. 
299 To maintain the general RPC model, NFSv4 minor versions will not 300 add to or delete procedures from the NFS program. 302 2. Minor versions may add operations to the COMPOUND and 303 CB_COMPOUND procedures. 305 The addition of operations to the COMPOUND and CB_COMPOUND 306 procedures does not affect the RPC model. 308 * Minor versions may append attributes to the bitmap4 that 309 represents sets of attributes and to the fattr4 that 310 represents sets of attribute values. 312 This allows for the expansion of the attribute model to allow 313 for future growth or adaptation. 315 * Minor version X must append any new attributes after the last 316 documented attribute. 318 Since attribute results are specified as an opaque array of 319 per-attribute, XDR-encoded results, the complexity of adding 320 new attributes in the midst of the current definitions would 321 be too burdensome. 323 3. Minor versions must not modify the structure of an existing 324 operation's arguments or results. 326 Again, the complexity of handling multiple structure definitions 327 for a single operation is too burdensome. New operations should 328 be added instead of modifying existing structures for a minor 329 version. 331 This rule does not preclude the following adaptations in a minor 332 version: 334 * adding bits to flag fields, such as new attributes to 335 GETATTR's bitmap4 data type, and providing corresponding 336 variants of opaque arrays, such as a notify4 used together 337 with such bitmaps 339 * adding bits to existing attributes like ACLs that have flag 340 words 342 * extending enumerated types (including NFS4ERR_*) with new 343 values 345 * adding cases to a switched union 347 4. Note that when adding new cases to a switched union, a minor 348 version must not make new cases be REQUIRED. While the 349 encapsulating operation may be REQUIRED, the new cases (the 350 specific arm of the discriminated union) is not. The error code 351 NFS4ERR_UNION_NOTSUPP is used to notify the client when the 352 server does not support such a case. 354 5. Minor versions must not modify the structure of existing 355 attributes. 357 6. Minor versions must not delete operations. 359 This prevents the potential reuse of a particular operation 360 "slot" in a future minor version. 362 7. Minor versions must not delete attributes. 364 8. Minor versions must not delete flag bits or enumeration values. 366 9. Minor versions may declare an operation MUST NOT be implemented. 368 Specifying that an operation MUST NOT be implemented is 369 equivalent to obsoleting an operation. For the client, it means 370 that the operation MUST NOT be sent to the server. For the 371 server, an NFS error can be returned as opposed to "dropping" 372 the request as an XDR decode error. This approach allows for 373 the obsolescence of an operation while maintaining its structure 374 so that a future minor version can reintroduce the operation. 376 1. Minor versions may declare that an attribute MUST NOT be 377 implemented. 379 2. Minor versions may declare that a flag bit or enumeration 380 value MUST NOT be implemented. 382 10. Minor versions may declare an operation to be OBSOLESCENT, which 383 indicates an intention to remove the operation (i.e., make it 384 MANDATORY TO NOT implement) in a subsequent minor version. Such 385 labeling is separate from the question of whether the operation 386 is REQUIRED or RECOMMENDED or OPTIONAL in the current minor 387 version. 
An operation may be both REQUIRED for the given minor 388 version and marked OBSOLESCENT, with the expectation that it 389 will be MANDATORY TO NOT implement in the next (or other 390 subsequent) minor version.

392 11. Note that the early notification of operation obsolescence is 393 put in place to mitigate the effects of design and 394 implementation mistakes, and to allow protocol development to 395 adapt to unexpected changes in the pace of implementation. Even 396 if an operation is marked OBSOLESCENT in a given minor version, 397 it may end up not being marked MANDATORY TO NOT implement in the 398 next minor version. In unusual circumstances, it might not be 399 marked OBSOLESCENT in a subsequent minor version, and never 400 become MANDATORY TO NOT implement.

402 12. Minor versions may downgrade features from REQUIRED to 403 RECOMMENDED, from RECOMMENDED to OPTIONAL, or from OPTIONAL to 404 MANDATORY TO NOT implement. Also, if a feature was marked as 405 OBSOLESCENT in the prior minor version, it may be downgraded 406 from REQUIRED to OPTIONAL, from RECOMMENDED to MANDATORY TO NOT 407 implement, or from REQUIRED to MANDATORY TO NOT implement.

409 13. Minor versions may upgrade features from OPTIONAL to 410 RECOMMENDED, or RECOMMENDED to REQUIRED. Also, if a feature was 411 marked as OBSOLESCENT in the prior minor version, it may be 412 upgraded to not be OBSOLESCENT.

414 14. A client and server that support minor version X SHOULD support 415 minor versions 0 through X-1 as well.

417 15. Except for infrastructural changes, a minor version must not 418 introduce REQUIRED new features.

420 This rule allows for the introduction of new functionality and 421 forces the use of implementation experience before designating a 422 feature as REQUIRED. On the other hand, some classes of 423 features are infrastructural and have broad effects. Allowing 424 infrastructural features to be RECOMMENDED or OPTIONAL 425 complicates implementation of the minor version.

427 16. A client MUST NOT attempt to use a stateid, filehandle, or 428 similar returned object from the COMPOUND procedure with minor 429 version X for another COMPOUND procedure with minor version Y, 430 where X != Y.

432 3. Server-side Copy

434 3.1. Introduction

436 The server-side copy feature provides a mechanism for the NFS client 437 to perform a file copy on a server or between two servers without the 438 data being transmitted back and forth over the network through the 439 NFS client. Without this feature, an NFS client copies data from one 440 location to another by reading the data from the source server over 441 the network, and then writing the data back over the network to the 442 destination server.

444 If the source object and destination object are on different file 445 servers, the file servers will communicate with one another to 446 perform the copy operation. The server-to-server protocol by which 447 this is accomplished is not defined in this document.

449 3.2. Protocol Overview

451 The server-side copy offload operations support both intra-server and 452 inter-server file copies. An intra-server copy is a copy in which 453 the source file and destination file reside on the same server. In 454 an inter-server copy, the source file and destination file are on 455 different servers. In both cases, the copy may be performed 456 synchronously or asynchronously.
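As a rough, non-normative illustration of how a client might drive both the synchronous and the asynchronous case, consider the sketch below. The "client" object and its methods are hypothetical stand-ins for an NFSv4.2 client implementation and are not defined by this document; the operations they correspond to are described in the sections that follow.

   import time

   # Non-normative sketch: "client" and its methods are hypothetical
   # stand-ins for an NFSv4.2 client implementation.
   def copy_file(client, src_fh, dst_fh, src_offset, dst_offset, count):
       reply = client.copy(src_fh, dst_fh, src_offset, dst_offset, count)
       if reply.copy_stateid is None:
           # Synchronous copy: the COPY reply carries the final result.
           return reply.bytes_copied
       # Asynchronous copy: the reply carries a copy (offload) stateid.
       # The client may poll with OFFLOAD_STATUS or cancel with
       # OFFLOAD_ABORT; the final result is reported via CB_OFFLOAD.
       while not client.cb_offload_received(reply.copy_stateid):
           client.offload_status(reply.copy_stateid)
           time.sleep(1)
       return client.cb_offload_result(reply.copy_stateid)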
458 Throughout the rest of this document, we refer to the NFS server 459 containing the source file as the "source server" and the NFS server 460 to which the file is transferred as the "destination server". In the 461 case of an intra-server copy, the source server and destination 462 server are the same server. Therefore in the context of an intra- 463 server copy, the terms source server and destination server refer to 464 the single server performing the copy. 466 The operations described below are designed to copy files. Other 467 file system objects can be copied by building on these operations or 468 using other techniques. For example if the user wishes to copy a 469 directory, the client can synthesize a directory copy by first 470 creating the destination directory and then copying the source 471 directory's files to the new destination directory. If the user 472 wishes to copy a namespace junction [FEDFS-NSDB] [FEDFS-ADMIN], the 473 client can use the ONC RPC Federated Filesystem protocol 474 [FEDFS-ADMIN] to perform the copy. Specifically the client can 475 determine the source junction's attributes using the FEDFS_LOOKUP_FSN 476 procedure and create a duplicate junction using the 477 FEDFS_CREATE_JUNCTION procedure. 479 For the inter-server copy, the operations are defined to be 480 compatible with the traditional copy authentication approach. The 481 client and user are authorized at the source for reading. Then they 482 are authorized at the destination for writing. 484 3.2.1. Overview of Copy Operations 486 COPY_NOTIFY: For inter-server copies, the client sends this 487 operation to the source server to notify it of a future file copy 488 from a given destination server for the given user. 489 (Section 14.3) 491 OFFLOAD_REVOKE: Also for inter-server copies, the client sends this 492 operation to the source server to revoke permission to copy a file 493 for the given user. (Section 14.4) 495 COPY: Used by the client to request a file copy. (Section 14.1) 497 OFFLOAD_ABORT: Used by the client to abort an asynchronous file 498 copy. (Section 14.2) 500 OFFLOAD_STATUS: Used by the client to poll the status of an 501 asynchronous file copy. (Section 14.5) 503 CB_OFFLOAD: Used by the destination server to report the results of 504 an asynchronous file copy to the client. (Section 15.1) 506 3.2.2. Locking the Files 508 Both the source and destination file may need to be locked to protect 509 the content during the copy operations. A client can achieve this by 510 a combination of OPEN and LOCK operations. I.e., either share or 511 byte range locks might be desired. 513 3.2.3. Intra-Server Copy 515 To copy a file on a single server, the client uses a COPY operation. 516 The server may respond to the copy operation with the final results 517 of the copy or it may perform the copy asynchronously and deliver the 518 results using a CB_OFFLOAD operation callback. If the copy is 519 performed asynchronously, the client may poll the status of the copy 520 using OFFLOAD_STATUS or cancel the copy using OFFLOAD_ABORT. 522 A synchronous intra-server copy is shown in Figure 1. In this 523 example, the NFS server chooses to perform the copy synchronously. 524 The copy operation is completed, either successfully or 525 unsuccessfully, before the server replies to the client's request. 526 The server's reply contains the final result of the operation. 
528 Client Server 529 + + 530 | | 531 |--- OPEN ---------------------------->| Client opens 532 |<------------------------------------/| the source file 533 | | 534 |--- OPEN ---------------------------->| Client opens 535 |<------------------------------------/| the destination file 536 | | 537 |--- COPY ---------------------------->| Client requests 538 |<------------------------------------/| a file copy 539 | | 540 |--- CLOSE --------------------------->| Client closes 541 |<------------------------------------/| the destination file 542 | | 543 |--- CLOSE --------------------------->| Client closes 544 |<------------------------------------/| the source file 545 | | 546 | | 548 Figure 1: A synchronous intra-server copy. 550 An asynchronous intra-server copy is shown in Figure 2. In this 551 example, the NFS server performs the copy asynchronously. The 552 server's reply to the copy request indicates that the copy operation 553 was initiated and the final result will be delivered at a later time. 554 The server's reply also contains a copy stateid. The client may use 555 this copy stateid to poll for status information (as shown) or to 556 cancel the copy using a OFFLOAD_ABORT. When the server completes the 557 copy, the server performs a callback to the client and reports the 558 results. 560 Client Server 561 + + 562 | | 563 |--- OPEN ---------------------------->| Client opens 564 |<------------------------------------/| the source file 565 | | 566 |--- OPEN ---------------------------->| Client opens 567 |<------------------------------------/| the destination file 568 | | 569 |--- COPY ---------------------------->| Client requests 570 |<------------------------------------/| a file copy 571 | | 572 | | 573 |--- OFFLOAD_STATUS ------------------>| Client may poll 574 |<------------------------------------/| for status 575 | | 576 | . | Multiple OFFLOAD_STATUS 577 | . | operations may be sent. 578 | . | 579 | | 580 |<-- CB_OFFLOAD -----------------------| Server reports results 581 |\------------------------------------>| 582 | | 583 |--- CLOSE --------------------------->| Client closes 584 |<------------------------------------/| the destination file 585 | | 586 |--- CLOSE --------------------------->| Client closes 587 |<------------------------------------/| the source file 588 | | 589 | | 591 Figure 2: An asynchronous intra-server copy. 593 3.2.4. Inter-Server Copy 595 A copy may also be performed between two servers. The copy protocol 596 is designed to accommodate a variety of network topologies. As shown 597 in Figure 3, the client and servers may be connected by multiple 598 networks. In particular, the servers may be connected by a 599 specialized, high speed network (network 192.0.2.0/24 in the diagram) 600 that does not include the client. The protocol allows the client to 601 setup the copy between the servers (over network 203.0.113.0/24 in 602 the diagram) and for the servers to communicate on the high speed 603 network if they choose to do so. 605 192.0.2.0/24 606 +-------------------------------------+ 607 | | 608 | | 609 | 192.0.2.18 | 192.0.2.56 610 +-------+------+ +------+------+ 611 | Source | | Destination | 612 +-------+------+ +------+------+ 613 | 203.0.113.18 | 203.0.113.56 614 | | 615 | | 616 | 203.0.113.0/24 | 617 +------------------+------------------+ 618 | 619 | 620 | 203.0.113.243 621 +-----+-----+ 622 | Client | 623 +-----------+ 625 Figure 3: An example inter-server network topology. 
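As a purely illustrative aside (not part of the protocol), the values below show how the client in Figure 3 might name the two servers to each other so that the data transfer itself can use the high-speed 192.0.2.0/24 network. The tuples are plain Python stand-ins for the netloc4 type of Section 3.3.1, the dictionary keys mirror the COPY_NOTIFY and COPY arguments, and the NL4_NAME entry is a hypothetical host name used only for this example.

   # Illustrative only: addresses are taken from Figure 3.
   copy_notify_args = {
       # Told to the source server: where the destination server can
       # be reached on the inter-server network.
       "cna_destination_server": ("NL4_NETADDR", "192.0.2.56"),
   }
   copy_args = {
       # Told to the destination server: where the source server can
       # be reached, preferring the high-speed network.
       "ca_source_server": [("NL4_NETADDR", "192.0.2.18"),
                            ("NL4_NAME", "source.example.com")],
   }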
627 For an inter-server copy, the client notifies the source server that 628 a file will be copied by the destination server using a COPY_NOTIFY 629 operation. The client then initiates the copy by sending the COPY 630 operation to the destination server. The destination server may 631 perform the copy synchronously or asynchronously. 633 A synchronous inter-server copy is shown in Figure 4. In this case, 634 the destination server chooses to perform the copy before responding 635 to the client's COPY request. 637 An asynchronous copy is shown in Figure 5. In this case, the 638 destination server chooses to respond to the client's COPY request 639 immediately and then perform the copy asynchronously. 641 Client Source Destination 642 + + + 643 | | | 644 |--- OPEN --->| | Returns os1 645 |<------------------/| | 646 | | | 647 |--- COPY_NOTIFY --->| | 648 |<------------------/| | 649 | | | 650 |--- OPEN ---------------------------->| Returns os2 651 |<------------------------------------/| 652 | | | 653 |--- COPY ---------------------------->| 654 | | | 655 | | | 656 | |<----- read -----| 657 | |\--------------->| 658 | | | 659 | | . | Multiple reads may 660 | | . | be necessary 661 | | . | 662 | | | 663 | | | 664 |<------------------------------------/| Destination replies 665 | | | to COPY 666 | | | 667 |--- CLOSE --------------------------->| Release open state 668 |<------------------------------------/| 669 | | | 670 |--- CLOSE --->| | Release open state 671 |<------------------/| | 673 Figure 4: A synchronous inter-server copy. 675 Client Source Destination 676 + + + 677 | | | 678 |--- OPEN --->| | Returns os1 679 |<------------------/| | 680 | | | 681 |--- LOCK --->| | Optional, could be done 682 |<------------------/| | with a share lock 683 | | | 684 |--- COPY_NOTIFY --->| | Need to pass in 685 |<------------------/| | os1 or lock state 686 | | | 687 | | | 688 | | | 689 |--- OPEN ---------------------------->| Returns os2 690 |<------------------------------------/| 691 | | | 692 |--- LOCK ---------------------------->| Optional ... 693 |<------------------------------------/| 694 | | | 695 |--- COPY ---------------------------->| Need to pass in 696 |<------------------------------------/| os2 or lock state 697 | | | 698 | | | 699 | |<----- read -----| 700 | |\--------------->| 701 | | | 702 | | . | Multiple reads may 703 | | . | be necessary 704 | | . | 705 | | | 706 | | | 707 |--- OFFLOAD_STATUS ------------------>| Client may poll 708 |<------------------------------------/| for status 709 | | | 710 | | . | Multiple OFFLOAD_STATUS 711 | | . | operations may be sent 712 | | . | 713 | | | 714 | | | 715 | | | 716 |<-- CB_OFFLOAD -----------------------| Destination reports 717 |\------------------------------------>| results 718 | | | 719 |--- LOCKU --------------------------->| Only if LOCK was done 720 |<------------------------------------/| 721 | | | 722 |--- CLOSE --------------------------->| Release open state 723 |<------------------------------------/| 724 | | | 725 |--- LOCKU --->| | Only if LOCK was done 726 |<------------------/| | 727 | | | 728 |--- CLOSE --->| | Release open state 729 |<------------------/| | 730 | | | 732 Figure 5: An asynchronous inter-server copy. 734 3.2.5. Server-to-Server Copy Protocol 736 The source server and destination server are not required to use a 737 specific protocol to transfer the file data. The choice of what 738 protocol to use is ultimately the destination server's decision. 740 3.2.5.1. 
Using NFSv4.x as a Server-to-Server Copy Protocol

742 The destination server MAY use standard NFSv4.x (where x >= 1) 743 operations to read the data from the source server. If NFSv4.x is 744 used for the server-to-server copy protocol, the destination server 745 can use the source filehandle and ca_src_stateid provided in the COPY 746 request with standard NFSv4.x operations to read data from the source 747 server.

749 3.2.5.2. Using an Alternative Server-to-Server Copy Protocol

751 In a homogeneous environment, the source and destination servers 752 might be able to perform the file copy extremely efficiently using 753 specialized protocols. For example, the source and destination 754 servers might be two nodes sharing a common file system format for 755 the source and destination file systems. Thus the source and 756 destination are in an ideal position to efficiently render the image 757 of the source file to the destination file by replicating the file 758 system formats at the block level. Another possibility is that the 759 source and destination might be two nodes sharing a common storage 760 area network, and thus there is no need to copy any data at all, and 761 instead ownership of the file and its contents might simply be re- 762 assigned to the destination. To allow for these possibilities, the 763 destination server is allowed to use a server-to-server copy protocol 764 of its choice.

766 In a heterogeneous environment, using a protocol other than NFSv4.x 767 (e.g., HTTP [RFC2616] or FTP [RFC0959]) presents some challenges. In 768 particular, the destination server is presented with the challenge of 769 accessing the source file given only an NFSv4.x filehandle.

771 One option for protocols that identify source files with path names 772 is to use an ASCII hexadecimal representation of the source 773 filehandle as the file name.

775 Another option for the source server is to use URLs to direct the 776 destination server to a specialized service. For example, the 777 response to COPY_NOTIFY could include the URL 778 ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII 779 hexadecimal representation of the source filehandle. When the 780 destination server receives the source server's URL, it would use 781 "_FH/0x12345" as the file name to pass to the FTP server listening on 782 port 9999 of s1.example.com. On port 9999 there would be a special 783 instance of the FTP service that understands how to convert NFS 784 filehandles to an open file descriptor (in many operating systems, 785 this would require a new system call, one which is the inverse of the 786 makefh() function that the pre-NFSv4 MOUNT service needs).

788 Authenticating and identifying the destination server to the source 789 server is also a challenge. Recommendations for how to accomplish 790 this are given in Section 3.4.1.4.

792 3.3. Requirements for Operations

794 The implementation of server-side copy is OPTIONAL for both the client 795 and the server. However, in order to successfully copy a file, some 796 operations MUST be supported by the client and/or server.

798 If a client desires an intra-server file copy, then it MUST support 799 the COPY and CB_OFFLOAD operations. If COPY returns a stateid, then 800 the client MAY use the OFFLOAD_ABORT and OFFLOAD_STATUS operations.

802 If a client desires an inter-server file copy, then it MUST support 803 the COPY, COPY_NOTIFY, and CB_OFFLOAD operations, and MAY use the 804 OFFLOAD_REVOKE operation.
If COPY returns a stateid, then the client 805 MAY use the OFFLOAD_ABORT and OFFLOAD_STATUS operations. 807 If a server supports intra-server copy, then the server MUST support 808 the COPY operation. If a server's COPY operation returns a stateid, 809 then the server MUST also support these operations: CB_OFFLOAD, 810 OFFLOAD_ABORT, and OFFLOAD_STATUS. 812 If a source server supports inter-server copy, then the source server 813 MUST support all these operations: COPY_NOTIFY and OFFLOAD_REVOKE. 814 If a destination server supports inter-server copy, then the 815 destination server MUST support the COPY operation. If a destination 816 server's COPY operation returns a stateid, then the destination 817 server MUST also support these operations: CB_OFFLOAD, OFFLOAD_ABORT, 818 COPY_NOTIFY, OFFLOAD_REVOKE, and OFFLOAD_STATUS. 820 Each operation is performed in the context of the user identified by 821 the ONC RPC credential of its containing COMPOUND or CB_COMPOUND 822 request. For example, a OFFLOAD_ABORT operation issued by a given 823 user indicates that a specified COPY operation initiated by the same 824 user be canceled. Therefore a OFFLOAD_ABORT MUST NOT interfere with 825 a copy of the same file initiated by another user. 827 An NFS server MAY allow an administrative user to monitor or cancel 828 copy operations using an implementation specific interface. 830 3.3.1. netloc4 - Network Locations 832 The server-side copy operations specify network locations using the 833 netloc4 data type shown below: 835 enum netloc_type4 { 836 NL4_NAME = 0, 837 NL4_URL = 1, 838 NL4_NETADDR = 2 839 }; 840 union netloc4 switch (netloc_type4 nl_type) { 841 case NL4_NAME: utf8str_cis nl_name; 842 case NL4_URL: utf8str_cis nl_url; 843 case NL4_NETADDR: netaddr4 nl_addr; 844 }; 846 If the netloc4 is of type NL4_NAME, the nl_name field MUST be 847 specified as a UTF-8 string. The nl_name is expected to be resolved 848 to a network address via DNS, LDAP, NIS, /etc/hosts, or some other 849 means. If the netloc4 is of type NL4_URL, a server URL [RFC3986] 850 appropriate for the server-to-server copy operation is specified as a 851 UTF-8 string. If the netloc4 is of type NL4_NETADDR, the nl_addr 852 field MUST contain a valid netaddr4 as defined in Section 3.3.9 of 853 [RFC5661]. 855 When netloc4 values are used for an inter-server copy as shown in 856 Figure 3, their values may be evaluated on the source server, 857 destination server, and client. The network environment in which 858 these systems operate should be configured so that the netloc4 values 859 are interpreted as intended on each system. 861 3.3.2. Copy Offload Stateids 863 A server may perform a copy offload operation asynchronously. An 864 asynchronous copy is tracked using a copy offload stateid. Copy 865 offload stateids are included in the COPY, OFFLOAD_ABORT, 866 OFFLOAD_STATUS, and CB_OFFLOAD operations. 868 Section 8.2.4 of [RFC5661] specifies that stateids are valid until 869 either (A) the client or server restart or (B) the client returns the 870 resource. 872 A copy offload stateid will be valid until either (A) the client or 873 server restarts or (B) the client returns the resource by issuing a 874 OFFLOAD_ABORT operation or the client replies to a CB_OFFLOAD 875 operation. 877 A copy offload stateid's seqid MUST NOT be 0. In the context of a 878 copy offload operation, it is ambiguous to indicate the most recent 879 copy offload operation using a stateid with seqid of 0. 
Therefore a 880 copy offload stateid with seqid of 0 MUST be considered invalid.

882 3.4. Security Considerations

884 The security considerations pertaining to NFSv4 885 [I-D.ietf-nfsv4-rfc3530bis] apply to this chapter.

887 The standard security mechanisms provided by NFSv4 888 [I-D.ietf-nfsv4-rfc3530bis] may be used to secure the protocol 889 described in this chapter.

891 NFSv4 clients and servers supporting the inter-server copy operations 892 described in this chapter are REQUIRED to implement the mechanism 893 described in Section 3.4.1.2, and to support rejecting COPY_NOTIFY 894 requests that do not use RPCSEC_GSS with privacy. If the server-to- 895 server copy protocol is ONC RPC based, the servers are also REQUIRED 896 to implement [rpcsec_gssv3] including the RPCSEC_GSSv3 copy_to_auth, 897 copy_from_auth, and copy_confirm_auth structured privileges. This 898 requirement to implement is not a requirement to use; for example, a 899 server may, depending on configuration, also allow COPY_NOTIFY requests 900 that use only AUTH_SYS.

902 3.4.1. Inter-Server Copy Security

904 3.4.1.1. Requirements for Secure Inter-Server Copy

906 Inter-server copy is driven by several requirements:

908 o The specification must not mandate an inter-server copy protocol. 909 There are many ways to copy data. Some will be more optimal than 910 others depending on the identities of the source server and 911 destination server. For example, the source and destination 912 servers might be two nodes sharing a common file system format for 913 the source and destination file systems. Thus the source and 914 destination are in an ideal position to efficiently render the 915 image of the source file to the destination file by replicating 916 the file system formats at the block level. In other cases, the 917 source and destination might be two nodes sharing a common storage 918 area network, and thus there is no need to copy any data at all, 919 and instead ownership of the file and its contents simply gets re- 920 assigned to the destination.

922 o The specification must provide guidance for using NFSv4.x as a 923 copy protocol. For those source and destination servers willing 924 to use NFSv4.x there are specific security considerations that 925 this specification can and does address.

927 o The specification must not mandate pre-configuration between the 928 source and destination server. Requiring that the source and 929 destination first have a "copying relationship" increases the 930 administrative burden. However, the specification MUST NOT 931 preclude implementations that require pre-configuration.

933 o The specification must not mandate a trust relationship between 934 the source and destination server. The NFSv4 security model 935 requires mutual authentication between a principal on an NFS 936 client and a principal on an NFS server. This model MUST continue 937 with the introduction of COPY.

939 3.4.1.2. Inter-Server Copy via ONC RPC with RPCSEC_GSSv3

941 When the client sends a COPY_NOTIFY to the source server to inform it 942 that the destination will attempt to copy data from the source server, 943 it is expected that this copy is being done on behalf of the principal 944 (called the "user principal") that sent the RPC request that encloses 945 the COMPOUND procedure that contains the COPY_NOTIFY operation. The 946 user principal is identified by the RPC credentials.
A mechanism 947 that allows the user principal to authorize the destination server to 948 perform the copy, that lets the source server properly authenticate 949 the destination's copy, and does not allow the destination server to 950 exceed this authorization, is necessary.

952 An approach that sends delegated credentials of the client's user 953 principal to the destination server is not used for the following 954 reason. If the client's user delegated its credentials, the 955 destination would authenticate as the user principal. If the 956 destination were using the NFSv4 protocol to perform the copy, then 957 the source server would authenticate the destination server as the 958 user principal, and the file copy would securely proceed. However, 959 this approach would allow the destination server to copy other files. 960 The user principal would have to trust the destination server to not 961 do so. This is counter to the requirements, and therefore is not 962 considered.

964 Instead, we employ a combination of two features of the RPCSEC_GSSv3 965 [rpcsec_gssv3] protocol: compound authentication and RPC application 966 defined structured privilege assertions. The combination of these 967 features allows the destination server to authenticate to the source 968 server as acting on behalf of the user principal, and to authorize 969 the destination server to perform READs of the file to be copied from 970 the source on behalf of the user principal. Once the copy is 971 complete, the client can destroy the RPCSEC_GSSv3 handles to end the 972 source and destination servers' authorization to copy.

974 RPCSEC_GSSv3 introduces the notion of RPC application defined 975 structured privileges. We define three structured privileges that 976 work in tandem to authorize the copy:

978 copy_from_auth: A user principal is authorizing a source principal 979 ("nfs@<source>") to allow a destination principal ("nfs@ 980 <destination>") to setup the copy_confirm_auth privilege required 981 to copy a file from the source to the destination on behalf of the 982 user principal. This privilege is established on the source 983 server before the user principal sends a COPY_NOTIFY operation to 984 the source server, and the resultant RPCSEC_GSSv3 context is used 985 to secure the COPY_NOTIFY operation.

987 struct copy_from_auth_priv { 988 secret4 cfap_shared_secret; 989 netloc4 cfap_destination; 990 /* the NFSv4 user name that the user principal maps to */ 991 utf8str_mixed cfap_username; 992 };

994 cfap_shared_secret is an automatically generated random number 995 secret value.

997 copy_to_auth: A user principal is authorizing a destination 998 principal ("nfs@<destination>") to setup a copy_confirm_auth 999 privilege with a source principal ("nfs@<source>") to allow it to 1000 copy a file from the source to the destination on behalf of the 1001 user principal. This privilege is established on the destination 1002 server before the user principal sends a COPY operation to the 1003 destination server, and the resultant RPCSEC_GSSv3 context is used to secure the COPY operation.
1006 struct copy_to_auth_priv { 1007 /* equal to cfap_shared_secret */ 1008 secret4 ctap_shared_secret; 1009 netloc4 ctap_source; 1010 /* the NFSv4 user name that the user principal maps to */ 1011 utf8str_mixed ctap_username; 1012 /* 1013 * user principal RPCSEC_GSSv1 (or v2) handle shared 1014 * with the source server 1015 */ 1016 opaque ctap_handle; 1017 int ctap_handle_vers; 1018 /* A nonce and a MIC of the nonce using ctap_handle */ 1019 opaque ctap_nounce; 1020 opaque ctap_nounce_mic; 1021 };

1023 ctap_shared_secret is the automatically generated secret value 1024 used to establish the copy_from_auth privilege with the source 1025 principal. ctap_handle, ctap_handle_vers, ctap_nounce and 1026 ctap_nounce_mic are used to construct the compound authentication 1027 portion of the copy_confirm_auth RPCSEC_GSSv3 context between the 1028 destination server and the source server. See Section 3.4.1.2.1.

1030 copy_confirm_auth: A destination principal ("nfs@<destination>") is 1031 confirming with the source principal ("nfs@<source>") that it is 1032 authorized to copy data from the source. Note that besides the 1033 rpc_gss3_privs payload (struct copy_confirm_auth_priv), the 1034 copy_confirm_auth RPCSEC_GSS3_CREATE message also contains an 1035 rpc_gss3_gss_binding payload so that the copy is done on behalf of 1036 the user principal. This privilege is established on the 1037 destination server before the file is copied from the source to 1038 the destination. The resultant RPCSEC_GSSv3 context is used to 1039 secure the READ operations from the source to the destination 1040 server.

1042 struct copy_confirm_auth_priv { 1043 /* equal to GSS_GetMIC() of cfap_shared_secret */ 1044 opaque ccap_shared_secret_mic<>; 1045 /* the NFSv4 user name that the user principal maps to */ 1046 utf8str_mixed ccap_username; 1047 };

1049 3.4.1.2.1. Establishing a Security Context

1051 The RPCSEC_GSSv3 compound authentication feature allows a server to 1052 act on behalf of a user if the server identifies the user and trusts 1053 the client. In the inter-server server-side copy case, the server is 1054 the source server, and the client is the destination server acting as 1055 a client when performing the copy.

1057 The user principal is not required (nor expected) to have an 1058 RPCSEC_GSS secured connection and context between the destination 1059 server (acting as a client) and the source server. The user 1060 principal does have an RPCSEC_GSS secured connection and context 1061 between the client and the source server established for the OPEN of 1062 the file to be copied.

1064 We use the RPCSEC_GSS context established between the user principal 1065 and the source server to OPEN the file to be copied to provide the 1066 necessary user principal identification to the source server from 1067 the destination server (acting as a client). This is accomplished by 1068 sending the user principal identification information, e.g., the 1069 rpc_gss3_gss_binding fields, in the copy_to_auth privilege 1070 established between the client and the destination server. This same 1071 information is then placed in the rpc_gss3_gss_binding fields of the 1072 copy_confirm_auth RPCSEC_GSS3_CREATE message sent from the 1073 destination server (acting as a client) to the source server.
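The fragment below is a non-normative sketch of the client-side values prepared in the steps that follow: it generates the shared secret and the nonce. The lengths are arbitrary choices for the sketch, and the MIC computation is shown only as a placeholder for GSS_GetMIC() over the RPCSEC_GSSv1 (or v2) context that the user principal already shares with the source server.

   import secrets

   # Illustrative only; the lengths are arbitrary choices.
   shared_secret = secrets.token_bytes(32)   # cfap_/ctap_shared_secret
   nonce = secrets.token_bytes(16)           # carried as ctap_nounce

   def mic_over_user_context(data):
       # Placeholder: a real client computes this with GSS_GetMIC()
       # using the user principal's GSS context with the source server.
       raise NotImplementedError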
1075 When the user principal wants to COPY a file between two servers, if 1076 it has not established copy_from_auth and copy_to_auth privileges on 1077 the servers, it establishes them:

1079 o As noted in [rpcsec_gssv3], the client uses an existing 1080 RPCSEC_GSSv1 (or v2) context termed the "parent" handle to 1081 establish and protect RPCSEC_GSSv3 exchanges. The copy_from_auth 1082 privilege will use the context established between the user 1083 principal and the source server used to OPEN the source file as 1084 the RPCSEC_GSSv3 parent handle. The copy_to_auth privilege will 1085 use the context established between the user principal and the 1086 destination server used to OPEN the destination file as the 1087 RPCSEC_GSSv3 parent handle.

1089 o A random number is generated to use as a secret to be shared 1090 between the two servers. This shared secret will be placed in the 1091 cfap_shared_secret and ctap_shared_secret fields of the 1092 appropriate privilege data types, copy_from_auth_priv and 1093 copy_to_auth_priv. Because of this shared secret, the 1094 RPCSEC_GSS3_CREATE control messages for copy_from_auth and 1095 copy_to_auth MUST use a QOP of rpc_gss_svc_privacy.

1097 o An instance of copy_from_auth_priv is filled in with the shared 1098 secret, the destination server, and the NFSv4 user id of the user 1099 principal, and is placed in rpc_gss3_create_args 1100 assertions[0].assertion.privs.privilege. The string 1101 "copy_from_auth" is placed in assertions[0].assertion.privs.name. 1102 The field assertions[0].critical is set to TRUE. The source 1103 server unwraps the rpc_gss_svc_privacy RPCSEC_GSS3_CREATE payload 1104 and verifies that the NFSv4 user id being asserted matches the 1105 source server's mapping of the user principal. If it does, the 1106 privilege is established on the source server as: 1107 <"copy_from_auth", user id, destination>. The field "handle" in a 1108 successful reply is the RPCSEC_GSSv3 "child" handle that the 1109 client will use on COPY_NOTIFY requests to the source server 1110 involving the destination server. 1111 granted_assertions[0].assertion.privs.name will be equal to 1112 "copy_from_auth".

1114 o An instance of copy_to_auth_priv is filled in with the shared 1115 secret, the cnr_source_server list returned by COPY_NOTIFY, and 1116 the NFSv4 user id of the user principal. The next four fields are 1117 passed in the copy_to_auth privilege to be used by the 1118 copy_confirm_auth rpc_gss3_gss_binding fields as explained above. 1119 A nonce is created, and GSS_GetMIC() is invoked on the nonce using 1120 the RPCSEC_GSSv1 (or v2) context shared between the user principal and 1121 the source server. The nonce, nonce MIC, context handle used to 1122 create the nonce MIC, and the context handle version are added to 1123 the copy_to_auth_priv instance, which is placed in 1124 rpc_gss3_create_args assertions[0].assertion.privs.privilege. The 1125 string "copy_to_auth" is placed in 1126 assertions[0].assertion.privs.name. The field 1127 assertions[0].critical is set to TRUE. The destination server 1128 unwraps the rpc_gss_svc_privacy RPCSEC_GSS3_CREATE payload and 1129 verifies that the NFSv4 user id being asserted matches the 1130 destination server's mapping of the user principal. If it does, 1131 the privilege is established on the destination server as: 1132 <"copy_to_auth", user id, source list, nonce, nonce MIC, context 1133 handle, handle version>.
The field "handle" in a successful reply 1134 is the RPCSEC_GSSv3 "child" handle that the 1135 client will use on COPY requests to the destination server involving the source 1136 server. granted_assertions[0].assertion.privs.name will be equal 1137 to "copy_to_auth".

1139 As noted in [rpcsec_gssv3] section 2.3.1 "Create Request", both the 1140 client and the source server should associate the RPCSEC_GSSv3 1141 "child" handle with the parent RPCSEC_GSSv1 (or v2) handle used to 1142 create the RPCSEC_GSSv3 child handle.

1144 3.4.1.2.2. Starting a Secure Inter-Server Copy

1146 When the client sends a COPY_NOTIFY request to the source server, it 1147 uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle. 1148 cna_destination_server in COPY_NOTIFY MUST be the same as 1149 cfap_destination specified in copy_from_auth_priv. Otherwise, 1150 COPY_NOTIFY will fail with NFS4ERR_ACCESS. The source server 1151 verifies that the privilege <"copy_from_auth", user id, destination> 1152 exists, and annotates it with the source filehandle, if the user 1153 principal has read access to the source file, and if administrative 1154 policies give the user principal and the NFS client read access to 1155 the source file (i.e., if the ACCESS operation would grant read 1156 access). Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS.

1158 When the client sends a COPY request to the destination server, it 1159 uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle. 1160 ca_source_server list in COPY MUST be the same as ctap_source list 1161 specified in copy_to_auth_priv. Otherwise, COPY will fail with 1162 NFS4ERR_ACCESS. The destination server verifies that the privilege 1163 <"copy_to_auth", user id, source list, nonce, nonce MIC, context 1164 handle, handle version> exists, and annotates it with the source and 1165 destination filehandles. If the COPY returns a wr_callback_id, then 1166 this is an asynchronous copy and the wr_callback_id must also be 1167 annotated to the copy_to_auth privilege. If the client has failed to 1168 establish the "copy_to_auth" privilege, the destination server will reject the request 1169 with NFS4ERR_PARTNER_NO_AUTH.

1171 If either the COPY_NOTIFY or the COPY operation fails, the 1172 associated "copy_from_auth" and "copy_to_auth" RPCSEC_GSSv3 handles 1173 MUST be destroyed.

1175 3.4.1.2.3. Securing ONC RPC Server-to-Server Copy Protocols

1177 After a destination server has a "copy_to_auth" privilege established 1178 on it, and it receives a COPY request, if it knows it will use an ONC 1179 RPC protocol to copy data, it will establish a "copy_confirm_auth" 1180 privilege on the source server prior to responding to the COPY 1181 operation as follows:

1183 o Before establishing an RPCSEC_GSSv3 context, a parent context 1184 needs to exist between nfs@<destination> as the initiator 1185 principal, and nfs@<source> as the target principal. If NFS is to 1186 be used as the copy protocol, this means that the destination 1187 server must mount the source server using RPCSEC_GSS.

1189 o An instance of copy_confirm_auth_priv is filled in with 1190 information from the established "copy_to_auth" privilege. The 1191 value of the field ccap_shared_secret_mic is a GSS_GetMIC() of the 1192 ctap_shared_secret in the copy_to_auth privilege using the parent 1193 handle context. The field ccap_username is the mapping of the 1194 user principal to an NFSv4 user name ("user"@"domain" form), and 1195 MUST be the same as the ctap_username in the copy_to_auth privilege.
The copy_confirm_auth_priv instance is placed in 1197 rpc_gss3_create_args assertions[0].assertion.privs.privilege. The 1198 string "copy_confirm_auth" is placed in 1199 assertions[0].assertion.privs.name. The field 1200 assertions[0].critical is set to TRUE. 1202 o The copy_confirm_auth RPCSEC_GSS3_CREATE call also includes a 1203 compound authentication component. The rpc_gss3_gss_binding 1204 fields are filled in with information from the established 1205 "copy_to_auth" privilege (see Section 3.4.1.2.1). The 1206 ctap_handle_vers, ctap_handle, ctap_nounce, and ctap_nounce_mic 1207 are assigned to the vers, handle, nonce, and mic fields of an 1208 rpc_gss3_gss_binding instance respectively. 1210 o The RPCSEC_GSS3_CREATE copy_confirm_auth message is sent to the 1211 source server with a QOP of rpc_gss_svc_privacy. The source 1212 server unwraps the rpc_gss_svc_privacy RPCSEC_GSS3_CREATE payload 1213 and verifies the ccap_shared_secret_mic by calling GSS_VerifyMIC() 1214 using the parent context on the cfap_shared_secret from the 1215 established "copy_from_auth" privilege, and verifies that the 1216 ccap_username equals the cfap_username. The source server then 1217 locates the ctap_handle in its GSS context cache and verifies 1218 that the handle belongs to the user principal that maps to the 1219 ccap_username and that the cached handle version equals 1220 ctap_handle_vers. The ctap_nounce_mic is verified by calling 1221 GSS_VerifyMIC() on the ctap_nounce using the cached handle 1222 context. If all verification succeeds, the "copy_confirm_auth" 1223 privilege is established on the source server as < 1224 "copy_confirm_auth", shared_secret_mic, user id, nonce, nonce 1225 MIC, context handle, context handle version>, and the resultant 1226 child handle is noted to be acting on behalf of the user 1227 principal. If the source server fails to verify either the 1228 privilege or the compound_binding, the COPY operation will be 1229 rejected with NFS4ERR_PARTNER_NO_AUTH. 1231 o All subsequent ONC RPC requests sent from the destination to copy 1232 data from the source to the destination will use the RPCSEC_GSSv3 1233 handle returned by the source's RPCSEC_GSS3_CREATE response. Note 1234 that as per the Compound Authentication section of [rpcsec_gssv3] 1235 the resultant RPCSEC_GSSv3 context handle is bound to the user 1236 principal RPCSEC_GSS context and so it MUST be treated by servers 1237 as authenticating the user principal. 1239 Note that the use of the "copy_confirm_auth" privilege accomplishes 1240 the following: 1242 o If a protocol like NFS is being used with export policies, the export 1243 policies can be overridden in case the destination server, acting as an 1244 NFS client, is not authorized. 1246 o Manual configuration to allow a copy relationship between the 1247 source and destination is not needed. 1249 3.4.1.2.4. Finishing or Stopping a Secure Inter-Server Copy 1251 Under normal operation, the client MUST destroy the copy_from_auth 1252 and the copy_to_auth RPCSEC_GSSv3 handles once the COPY operation 1253 returns for a synchronous inter-server copy or a CB_OFFLOAD reports 1254 the result of an asynchronous copy. 1256 The copy_confirm_auth privilege and compound authentication 1257 RPCSEC_GSSv3 handle are constructed from information held by the 1258 copy_to_auth privilege, and MUST be destroyed by the destination 1259 server (via an RPCSEC_GSS3_DESTROY call) when the copy_to_auth 1260 RPCSEC_GSSv3 handle is destroyed.
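The teardown ordering described above can be summarized with a short, non-normative sketch. The Python below is not part of the protocol: rpcsec_gss3_destroy() is a hypothetical stand-in for whatever interface an implementation uses to issue RPCSEC_GSS3_DESTROY for a child handle.

   # Illustrative sketch of handle teardown for an inter-server copy.
   # rpcsec_gss3_destroy() is a hypothetical helper; here it only logs.
   def rpcsec_gss3_destroy(handle):
       print("RPCSEC_GSS3_DESTROY", handle)

   def client_finish_copy(copy_from_auth_handle, copy_to_auth_handle):
       # Once the COPY returns (synchronous copy) or CB_OFFLOAD reports
       # the result (asynchronous copy), the client destroys both of the
       # child handles it established.
       rpcsec_gss3_destroy(copy_from_auth_handle)  # held with the source
       rpcsec_gss3_destroy(copy_to_auth_handle)    # held with the destination

   def destination_on_copy_to_auth_destroy(copy_confirm_auth_handle):
       # When its copy_to_auth handle is destroyed, the destination in
       # turn destroys the copy_confirm_auth handle it holds with the
       # source server.
       rpcsec_gss3_destroy(copy_confirm_auth_handle)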
1262 If the client sends an OFFLOAD_REVOKE to the source server to rescind 1263 the destination server's synchronous copy privilege, it uses the 1264 privileged "copy_from_auth" RPCSEC_GSSv3 handle and the 1265 cra_destination_server in OFFLOAD_REVOKE MUST be the same as the name 1266 of the destination server specified in copy_from_auth_priv. The 1267 source server will then delete the <"copy_from_auth", user id, 1268 destination> privilege and fail any subsequent copy requests sent 1269 under the auspices of this privilege from the destination server. 1270 The client MUST destroy both the "copy_from_auth" and the 1271 "copy_to_auth" RPCSEC_GSSv3 handles. 1273 If the client sends an OFFLOAD_STATUS to the destination server to 1274 check on the status of an asynchronous copy, it uses the privileged 1275 "copy_to_auth" RPCSEC_GSSv3 handle and the osa_stateid in 1276 OFFLOAD_STATUS MUST be the same as the wr_callback_id specified in 1277 the "copy_to_auth" privilege stored on the destination server. 1279 If the client sends an OFFLOAD_ABORT to the destination server to 1280 cancel an asynchronous copy, it uses the privileged "copy_to_auth" 1281 RPCSEC_GSSv3 handle and the oaa_stateid in OFFLOAD_ABORT MUST be the 1282 same as the wr_callback_id specified in the "copy_to_auth" privilege 1283 stored on the destination server. The destination server will then 1284 delete the <"copy_to_auth", user id, source list, nonce, nonce MIC, 1285 context handle, handle version> privilege and the associated 1286 "copy_confirm_auth" RPCSEC_GSSv3 handle. The client MUST destroy 1287 both the copy_to_auth and copy_from_auth RPCSEC_GSSv3 handles. 1289 3.4.1.3. Inter-Server Copy via ONC RPC without RPCSEC_GSS 1291 ONC RPC security flavors other than RPCSEC_GSS MAY be used with the 1292 server-side copy offload operations described in this chapter. In 1293 particular, host-based ONC RPC security flavors such as AUTH_NONE and 1294 AUTH_SYS MAY be used. If a host-based security flavor is used, a 1295 minimal level of protection for the server-to-server copy protocol is 1296 possible. 1298 In the absence of a strong security mechanism designed for the 1299 purpose, the challenge is how the source server and destination 1300 server identify themselves to each other, especially in the presence 1301 of multi-homed source and destination servers. In a multi-homed 1302 environment, the destination server might not contact the source 1303 server from the same network address specified by the client in the 1304 COPY_NOTIFY. This can be overcome using the procedure described 1305 below. 1307 When the client sends the source server the COPY_NOTIFY operation, 1308 the source server may reply to the client with a list of target 1309 addresses, names, and/or URLs and assign them to the unique 1310 quadruple: <source fh, user ID, destination address, random number>. If the destination uses one of these target netlocs to contact 1312 the source server, the source server will be able to uniquely 1313 identify the destination server, even if the destination server does 1314 not connect from the address specified by the client in COPY_NOTIFY. 1315 The level of assurance in this identification depends on the 1316 unpredictability, strength and secrecy of the random number. 1318 For example, suppose the network topology is as shown in Figure 3.
1319 If the source filehandle is 0x12345, the source server may respond to 1320 a COPY_NOTIFY for destination 203.0.113.56 with the URLs: 1322 nfs://203.0.113.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/203.0.113.56/ 1323 _FH/0x12345 1325 nfs://192.0.2.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/203.0.113.56/_FH/ 1326 0x12345 1328 The name component after _COPY is 24 characters of base 64, more than 1329 enough to encode a 128-bit random number. 1331 The client will then send these URLs to the destination server in the 1332 COPY operation. Suppose that the 192.0.2.0/24 network is a high 1333 speed network and the destination server decides to transfer the file 1334 over this network. If the destination contacts the source server 1335 from 192.0.2.56 over this network using NFSv4.1, it does the 1336 following: 1338 COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP 1339 "FvhH1OKbu8VrxvV1erdjvR7N" ; LOOKUP "203.0.113.56"; LOOKUP "_FH" ; 1340 OPEN "0x12345" ; GETFH } 1342 Provided that the random number is unpredictable and has been kept 1343 secret by the parties involved, the source server will therefore know 1344 that these NFSv4.x operations are being issued by the destination 1345 server identified in the COPY_NOTIFY. This random number technique 1346 only provides initial authentication of the destination server, and 1347 cannot defend against man-in-the-middle attacks after authentication 1348 or an eavesdropper that observes the random number on the wire. 1349 Other secure communication techniques (e.g., IPsec) are necessary to 1350 block these attacks. 1352 Servers SHOULD reject COPY_NOTIFY requests that do not use RPCSEC_GSS 1353 with privacy, thus ensuring the URL in the COPY_NOTIFY reply is 1354 encrypted. For the same reason, clients SHOULD send COPY requests to 1355 the destination using RPCSEC_GSS with privacy. 1357 3.4.1.4. Inter-Server Copy without ONC RPC 1359 The same techniques as in Section 3.4.1.3, using unique URLs for each 1360 destination server, can be used for other protocols (e.g., HTTP 1361 [RFC2616] and FTP [RFC0959]) as well. 1363 4. Support for Application IO Hints 1365 Applications can issue client I/O hints via posix_fadvise() 1366 [posix_fadvise] to the NFS client. While this can help the NFS 1367 client optimize I/O and caching for a file, it does not allow the NFS 1368 server and its exported file system to do likewise. We add an 1369 IO_ADVISE procedure (Section 14.8) to communicate the client file 1370 access patterns to the NFS server. The NFS server, upon receiving an 1371 IO_ADVISE operation, MAY choose to alter its I/O and caching behavior, 1372 but is under no obligation to do so. 1374 Application-specific NFS clients such as those used by hypervisors 1375 and databases can also leverage application hints to communicate 1376 their specialized requirements. 1378 5. Sparse Files 1380 5.1. Introduction 1382 A sparse file is a common way of representing a large file without 1383 having to utilize all of the disk space for it. Consequently, a 1384 sparse file uses less physical space than its size indicates. This 1385 means the file contains 'holes', byte ranges within the file that 1386 contain no data. Most modern file systems support sparse files, 1387 including most UNIX file systems and NTFS, but notably not Apple's 1388 HFS+. Common examples of sparse files include Virtual Machine (VM) 1389 OS/disk images, database files, log files, and even checkpoint 1390 recovery files most commonly used by the HPC community.
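The space savings can be observed directly on most local file systems. The following short Python sketch is purely illustrative and not part of the protocol: it creates a file whose logical size is about 8G while its allocated space remains a few kilobytes, assuming the underlying file system supports sparse files and that st_blocks is reported in 512-byte units, as is common on UNIX systems.

   import os, tempfile

   # Create an apparently large file that allocates almost no disk space.
   fd, path = tempfile.mkstemp()
   try:
       os.lseek(fd, 8 * 1024 * 1024 * 1024 - 1, os.SEEK_SET)  # seek to ~8G
       os.write(fd, b"\0")      # one byte at the end; the rest is a hole
       st = os.stat(path)
       print("logical size (st_size):", st.st_size)
       print("allocated (st_blocks * 512):", st.st_blocks * 512)
   finally:
       os.close(fd)
       os.unlink(path)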
1392 If an application reads a hole in a sparse file, the file system must 1393 return all zeros to the application. For local data access there is 1394 little penalty, but with NFS these zeroes must be transferred back to 1395 the client. If an application uses the NFS client to read data into 1396 memory, this wastes time and bandwidth as the application waits for 1397 the zeroes to be transferred. 1399 A sparse file is typically created by initializing the file to be all 1400 zeros - nothing is written to the data in the file; instead, the hole 1401 is recorded in the metadata for the file. So an 8G disk image might 1402 be represented initially by a couple hundred bits in the inode and 1403 nothing on the disk. If the VM then writes 100M of data in the 1404 middle of the image, there would now be two holes represented in the 1405 metadata and 100M in the data. 1407 Two new operations, WRITE_PLUS (Section 14.7) and READ_PLUS 1408 (Section 14.10), are introduced. WRITE_PLUS allows for the creation 1409 of a sparse file and for hole punching; for example, an application might want to 1410 zero out a range of the file. READ_PLUS supports all the features of 1411 READ but includes an extension to support sparse pattern files 1412 (Section 7.1.2). READ_PLUS is guaranteed to perform no worse than 1413 READ, and can dramatically improve performance with sparse files. 1414 READ_PLUS does not depend on pNFS protocol features, but can be used 1415 by pNFS to support sparse files. 1417 5.2. Terminology 1419 Regular file: An object of file type NF4REG or NF4NAMEDATTR. 1421 Sparse file: A Regular file that contains one or more Holes. 1423 Hole: A byte range within a Sparse file that contains regions of all 1424 zeroes. For block-based file systems, this could also be an 1425 unallocated region of the file. 1427 Hole Threshold: The minimum length of a Hole as determined by the 1428 server. If a server chooses to define a Hole Threshold, then it 1429 would not return hole information about holes with a length 1430 shorter than the Hole Threshold. 1432 5.3. New Operations 1434 READ_PLUS and WRITE_PLUS are new variants of the NFSv4.1 READ and 1435 WRITE operations [RFC5661]. Besides being able to support all of the 1436 data semantics of those operations, they can also be used by the 1437 client and server to efficiently transfer both holes and ADHs (see 1438 Section 7.1.1). As both READ and WRITE are inefficient for transfer 1439 of sparse sections of the file, they are marked as OBSOLESCENT in 1440 NFSv4.2. Instead, a client should utilize READ_PLUS and WRITE_PLUS. 1441 Note that as the client has no a priori knowledge of whether either 1442 an ADH or a hole is present or not, if it supports these operations 1443 and so does the server, then it should always use these operations. 1445 5.3.1. READ_PLUS 1447 For holes, READ_PLUS extends the response to avoid returning data for 1448 portions of the file which are initialized and contain no backing 1449 store. Additionally, it will do so if the result would appear to be a 1450 hole; i.e., if the result would be a data block composed entirely of 1451 zeros, it is simpler to return a hole. Returning data blocks of 1452 uninitialized data wastes computational and network resources, thus 1453 reducing performance. For ADHs, READ_PLUS is used to return the 1454 metadata describing the portions of the file which are initialized 1455 and contain no backing store.
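The response can therefore be thought of as an array of segments, each marked as data or as a hole (compare the data_content4 enumeration in Section 7.1.2). The following Python sketch is a simplified, non-normative illustration of how a server might split a requested byte range into such segments, given a sorted list of hole ranges; it is not the protocol XDR, and the segment tuples are purely illustrative. In this sketch a hole is reported in full even when it starts before the requested offset or extends past it, which READ_PLUS allows (see below).

   # Illustrative only: split a requested byte range into hole and data
   # segments. read_data(off, length) is a stand-in for however the
   # server fetches file data; the ("HOLE"/"DATA", ...) tuples are not
   # the protocol XDR.
   def read_plus_segments(offset, count, holes, read_data):
       segments, pos, end = [], offset, offset + count
       for h_off, h_len in holes:       # sorted, non-overlapping ranges
           h_end = h_off + h_len
           if h_end <= pos or h_off >= end:
               continue                 # hole is outside the requested range
           if h_off > pos:              # data between pos and the hole
               segments.append(("DATA", pos, read_data(pos, h_off - pos)))
           # report the whole hole, even where it overlaps a range boundary
           segments.append(("HOLE", h_off, h_len))
           pos = h_end
       if pos < end:                    # trailing data after the last hole
           segments.append(("DATA", pos, read_data(pos, end - pos)))
       return segments

   # Example: with holes covering [0, 16k) and [20k, 400k), a 64k read
   # from offset 0 yields a hole, 4k of data, and a final hole.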
1457 If the client sends a READ operation, it is explicitly stating that 1458 it is neither supporting sparse files nor ADHs. So if a READ occurs 1459 on a sparse ADH or file, then the server must expand such data to be 1460 raw bytes. If a READ occurs in the middle of a hole or ADH, the 1461 server can only send back bytes starting from that offset. In 1462 contrast, if a READ_PLUS occurs in the middle of a hole or ADH, the 1463 server can send back a range which starts before the offset and 1464 extends past the range. 1466 5.3.2. WRITE_PLUS 1468 WRITE_PLUS can be used to either hole punch or initialize ADHs. For 1469 either purpose, the client can avoid the transfer of a repetitive 1470 pattern across the network. If the filesystem on the server does not 1471 support sparse files, the WRITE_PLUS operation may return the result 1472 asynchronously via the CB_OFFLOAD operation. As a hole punch may 1473 entail deallocating data blocks, even if the filesystem supports 1474 sparse files, it may still have to return the result via CB_OFFLOAD. 1476 6. Space Reservation 1477 6.1. Introduction 1479 Applications such as hypervisors want to be able to reserve space for 1480 a file, report the amount of actual disk space a file occupies, and 1481 free-up the backing space of a file when it is not required. In 1482 virtualized environments, virtual disk files are often stored on NFS 1483 mounted volumes. Since virtual disk files represent the hard disks 1484 of virtual machines, hypervisors often have to guarantee certain 1485 properties for the file. 1487 One such example is space reservation. When a hypervisor creates a 1488 virtual disk file, it often tries to preallocate the space for the 1489 file so that there are no future allocation related errors during the 1490 operation of the virtual machine. Such errors prevent a virtual 1491 machine from continuing execution and result in downtime. 1493 Currently, in order to achieve such a guarantee, applications zero 1494 the entire file. The initial zeroing allocates the backing blocks 1495 and all subsequent writes are overwrites of already allocated blocks. 1496 This approach is not only inefficient in terms of the amount of I/O 1497 done, it is also not guaranteed to work on file systems that are log 1498 structured or deduplicated. An efficient way of guaranteeing space 1499 reservation would be beneficial to such applications. 1501 We define a "reservation" as being the combination of the 1502 space_reserved attribute (see Section 12.2.4) and the size attribute 1503 (see Section 5.8.1.5 of [RFC5661]). If space_reserved attribute is 1504 set on a file, it is guaranteed that writes that do not grow the file 1505 past the size will not fail with NFS4ERR_NOSPC. Once the size is 1506 changed, then the reservation is changed to that new size. 1508 Another useful feature is the ability to report the number of blocks 1509 that would be freed when a file is deleted. Currently, NFS reports 1510 two size attributes: 1512 size The logical file size of the file. 1514 space_used The size in bytes that the file occupies on disk 1516 While these attributes are sufficient for space accounting in 1517 traditional file systems, they prove to be inadequate in modern file 1518 systems that support block sharing. In such file systems, multiple 1519 inodes can point to a single block with a block reference count to 1520 guard against premature freeing. 
Having a way to tell the number of 1521 blocks that would be freed if the file was deleted would be useful to 1522 applications that wish to migrate files when a volume is low on 1523 space. 1525 Since virtual disks represent a hard drive in a virtual machine, a 1526 virtual disk can be viewed as a file system within a file. Since not 1527 all blocks within a file system are in use, there is an opportunity 1528 to reclaim blocks that are no longer in use. A call to deallocate 1529 blocks could result in better space efficiency. Less space MAY be 1530 consumed for backups after block deallocation. 1532 The following operations and attributes can be used to resolve these 1533 issues: 1535 space_reserved This attribute specifies that writes to the reserved 1536 area of the file will not fail with NFS4ERR_NOSPC. 1538 space_freed This attribute specifies the space freed when a file is 1539 deleted, taking block sharing into consideration. 1541 WRITE_PLUS This operation zeroes and/or deallocates the blocks 1542 backing a region of the file. 1544 If space_used of a file is interpreted to mean the size in bytes of 1545 all disk blocks pointed to by the inode of the file, then shared 1546 blocks get double counted, over-reporting the space utilization. 1547 This also has the adverse effect that the deletion of a file with 1548 shared blocks frees up less than space_used bytes. 1550 On the other hand, if space_used is interpreted to mean the size in 1551 bytes of those disk blocks unique to the inode of the file, then 1552 shared blocks are not counted in any file, resulting in under- 1553 reporting of the space utilization. 1555 For example, two files A and B have 10 blocks each. Let 6 of these 1556 blocks be shared between them. Thus, the combined space utilized by 1557 the two files is 14 * BLOCK_SIZE bytes. In the former case, the 1558 combined space utilization of the two files would be reported as 20 * 1559 BLOCK_SIZE. However, deleting either would only result in 4 * 1560 BLOCK_SIZE being freed. Conversely, the latter interpretation would 1561 report that the space utilization is only 8 * BLOCK_SIZE. 1563 Adding another size attribute, space_freed (see Section 12.2.5), is 1564 helpful in solving this problem. space_freed is the number of blocks 1565 that are allocated to the given file that would be freed on its 1566 deletion. In the example, both A and B would report space_freed as 4 1567 * BLOCK_SIZE and space_used as 10 * BLOCK_SIZE. If A is deleted, B 1568 will report space_freed as 10 * BLOCK_SIZE as the deletion of B would 1569 result in the deallocation of all 10 blocks. 1571 The addition of this attribute does not solve the problem of space 1572 being over-reported. However, over-reporting is better than under- 1573 reporting. 1575 7. Application Data Hole Support 1577 At the OS level, files are stored in disk blocks. Applications 1578 are also free to impose structure on the data contained in a file and 1579 we can define an Application Data Block (ADB) to be such a structure. 1580 From the application's viewpoint, it only wants to handle ADBs and 1581 not raw bytes (see [Strohm11]). An ADB is typically composed of two 1582 sections: a header and data. The header describes the 1583 characteristics of the block and can provide a means to detect 1584 corruption in the data payload. The data section is typically 1585 initialized to all zeros. 1587 The format of the header is application specific, but there are two 1588 main components typically encountered: 1590 1.
A logical block number which allows the application to determine 1591 which data block is being referenced. This is useful when the 1592 client is not storing the blocks in contiguous memory. 1594 2. Fields to describe the state of the ADB and a means to detect 1595 block corruption. For both pieces of data, a useful property is 1596 that allowed values be unique in that if passed across the 1597 network, corruption due to translation between big and little 1598 endian architectures are detectable. For example, 0xF0DEDEF0 has 1599 the same bit pattern in both architectures. 1601 Applications already impose structures on files [Strohm11] and detect 1602 corruption in data blocks [Ashdown08]. What they are not able to do 1603 is efficiently transfer and store ADBs. To initialize a file with 1604 ADBs, the client must send the full ADB to the server and that must 1605 be stored on the server. 1607 In this section, we are going to define an Application Data Hole 1608 (ADH), which is a generic framework for transferring the ADB, present 1609 one approach to detecting corruption in a given ADH implementation, 1610 and describe the model for how the client and server can support 1611 efficient initialization of ADHs, reading of ADH holes, punching ADH 1612 holes in a file, and space reservation. We define the ADHN to be the 1613 Application Data Hole Number, which is the logical block number 1614 discussed earlier. 1616 7.1. Generic Framework 1618 We want the representation of the ADH to be flexible enough to 1619 support many different applications. The most basic approach is no 1620 imposition of a block at all, which means we are working with the raw 1621 bytes. Such an approach would be useful for storing holes, punching 1622 holes, etc. In more complex deployments, a server might be 1623 supporting multiple applications, each with their own definition of 1624 the ADH. One might store the ADHN at the start of the block and then 1625 have a guard pattern to detect corruption [McDougall07]. The next 1626 might store the ADHN at an offset of 100 bytes within the block and 1627 have no guard pattern at all, i.e., existing applications might 1628 already have well defined formats for their data blocks. 1630 The guard pattern can be used to represent the state of the block, to 1631 protect against corruption, or both. Again, it needs to be able to 1632 be placed anywhere within the ADH. 1634 We need to be able to represent the starting offset of the block and 1635 the size of the block. Note that nothing prevents the application 1636 from defining different sized blocks in a file. 1638 7.1.1. Data Hole Representation 1640 struct app_data_hole4 { 1641 offset4 adh_offset; 1642 length4 adh_block_size; 1643 length4 adh_block_count; 1644 length4 adh_reloff_blocknum; 1645 count4 adh_block_num; 1646 length4 adh_reloff_pattern; 1647 opaque adh_pattern<>; 1648 }; 1650 The app_data_hole4 structure captures the abstraction presented for 1651 the ADH. The additional fields present are to allow the transmission 1652 of adh_block_count ADHs at one time. We also use adh_block_num to 1653 convey the ADHN of the first block in the sequence. Each ADH will 1654 contain the same adh_pattern string. 1656 As both adh_block_num and adh_pattern are optional, if either 1657 adh_reloff_pattern or adh_reloff_blocknum is set to NFS4_UINT64_MAX, 1658 then the corresponding field is not set in any of the ADH. 1660 7.1.2. Data Content 1662 /* 1663 * Use an enum such that we can extend new types. 
1664 */ 1665 enum data_content4 { 1666 NFS4_CONTENT_DATA = 0, 1667 NFS4_CONTENT_APP_DATA_HOLE = 1, 1668 NFS4_CONTENT_HOLE = 2 1669 }; 1671 New operations might need to differentiate between wanting to access 1672 data versus an ADH. Also, future minor versions might want to 1673 introduce new data formats. This enumeration allows that to occur. 1675 7.2. An Example of Detecting Corruption 1677 In this section, we define an ADH format in which corruption can be 1678 detected. Note that this is just one possible format and means to 1679 detect corruption. 1681 Consider a very basic implementation of an operating system's disk 1682 blocks. A block is either data or it is an indirect block which 1683 allows for files to be larger than one block. It is desired to be 1684 able to initialize a block. Lastly, to quickly unlink a file, a 1685 block can be marked invalid. The contents remain intact - which 1686 would enable this OS application to undelete a file. 1688 The application defines 4k sized data blocks, with an 8 byte block 1689 counter occurring at offset 0 in the block, and with the guard 1690 pattern occurring at offset 8 inside the block. Furthermore, the 1691 guard pattern can take one of four states: 1693 0xfeedface - This is the FREE state and indicates that the ADH 1694 format has been applied. 1696 0xcafedead - This is the DATA state and indicates that real data 1697 has been written to this block. 1699 0xe4e5c001 - This is the INDIRECT state and indicates that the 1700 block contains block counter numbers that are chained off of this 1701 block. 1703 0xba1ed4a3 - This is the INVALID state and indicates that the block 1704 contains data whose contents are garbage. 1706 Finally, it also defines an 8 byte checksum [Baira08] starting at 1707 byte 16 which applies to the remaining contents of the block. If the 1708 state is FREE, then that checksum is trivially zero. As such, the 1709 application has no need to transfer the checksum implicitly inside 1710 the ADH - it need not make the transfer layer aware of the fact that 1711 there is a checksum (see [Ashdown08] for an example of checksums used 1712 to detect corruption in application data blocks). 1714 Corruption in each ADH can thus be detected: 1716 o If the guard pattern is anything other than one of the allowed 1717 values, including all zeros. 1719 o If the guard pattern is FREE and any other byte in the remainder 1720 of the ADH is anything other than zero. 1722 o If the guard pattern is anything other than FREE, then if the 1723 stored checksum does not match the computed checksum. 1725 o If the guard pattern is INDIRECT and one of the stored indirect 1726 block numbers has a value greater than the number of ADHs in the 1727 file. 1729 o If the guard pattern is INDIRECT and one of the stored indirect 1730 block numbers is a duplicate of another stored indirect block 1731 number. 1733 As can be seen, the application can detect errors based on the 1734 combination of the guard pattern state and the checksum. But also, 1735 the application can detect corruption based on the state and the 1736 contents of the ADH. This last point is important in validating the 1737 minimum amount of data we incorporated into our generic framework. 1738 I.e., the guard pattern is sufficient in allowing applications to 1739 design their own corruption detection. 1741 Finally, it is important to note that none of these corruption checks 1742 occur in the transport layer. 
The server and client components are 1743 totally unaware of the file format and might report everything as 1744 being transferred correctly even in the case where the application detects 1745 corruption. 1747 7.3. Example of READ_PLUS 1749 The hypothetical application presented in Section 7.2 can be used to 1750 illustrate how READ_PLUS would return an array of results. A file is 1751 created and initialized with 100 4k ADHs in the FREE state: 1753 WRITE_PLUS {0, 4k, 100, 0, 0, 8, 0xfeedface} 1755 Further, assume the application writes a single ADH at 16k, changing 1756 the guard pattern to 0xcafedead; we would then have in memory: 1758 0 -> (16k - 1) : 4k, 4, 0, 0, 8, 0xfeedface 1759 16k -> (20k - 1) : 00 00 00 05 ca fe de ad XX XX ... XX XX 1760 20k -> 400k : 4k, 95, 0, 6, 0xfeedface 1762 And when the client did a READ_PLUS of 64k at the start of the file, 1763 it would get back a result of an ADH, some data, and a final ADH: 1765 ADH {0, 4, 0, 0, 8, 0xfeedface} 1766 data 4k 1767 ADH {20k, 4k, 59, 0, 6, 0xfeedface} 1769 8. Labeled NFS 1771 8.1. Introduction 1773 Access control models such as Unix permissions or Access Control 1774 Lists are commonly referred to as Discretionary Access Control (DAC) 1775 models. These systems base their access decisions on user identity 1776 and resource ownership. In contrast, Mandatory Access Control (MAC) 1777 models base their access control decisions on the label on the 1778 subject (usually a process) and the object it wishes to access 1779 [Haynes13]. These labels may contain user identity information but 1780 usually contain additional information. In DAC systems, users are 1781 free to specify the access rules for resources that they own. MAC 1782 models base their security decisions on a system-wide policy 1783 established by an administrator or organization which the users do 1784 not have the ability to override. In this section, we add a MAC 1785 model to NFSv4.2. 1787 The first change necessary is to devise a method for transporting and 1788 storing security label data on NFSv4 file objects. Security labels 1789 have several semantics that are met by NFSv4 recommended attributes 1790 such as the ability to set the label value upon object creation. 1791 Access control on these attributes is done through a combination of 1792 two mechanisms. As with other recommended attributes on file objects, 1793 the usual DAC checks (ACLs and permission bits) will be performed to 1794 ensure that proper file ownership is enforced. In addition, a MAC 1795 system MAY be employed on the client, server, or both to enforce 1796 additional policy on what subjects may modify security label 1797 information. 1799 The second change is to provide methods for the client to determine 1800 if the security label has changed. A client which needs to know if a 1801 label is going to change SHOULD request a delegation on that file. 1802 In order to change the security label, the server will have to recall 1803 all delegations. This will inform the client of the change. If a 1804 client wants to detect if the label has changed, it MAY use VERIFY 1805 and NVERIFY on FATTR4_CHANGE_SEC_LABEL to detect that the 1806 FATTR4_SEC_LABEL has been modified. 1808 An additional useful change would be modification to the RPC layer 1809 used in NFSv4 to allow RPC calls to carry security labels. Such 1810 modifications are outside the scope of this document. 1812 8.2.
Definitions 1814 Label Format Specifier (LFS): is an identifier used by the client to 1815 establish the syntactic format of the security label and the 1816 semantic meaning of its components. These specifiers exist in a 1817 registry associated with documents describing the format and 1818 semantics of the label. 1820 Label Format Registry: is the IANA registry containing all 1821 registered LFS along with references to the documents that 1822 describe the syntactic format and semantics of the security label. 1824 Policy Identifier (PI): is an optional part of the definition of a 1825 Label Format Specifier which allows for clients and server to 1826 identify specific security policies. 1828 Object: is a passive resource within the system that we wish to be 1829 protected. Objects can be entities such as files, directories, 1830 pipes, sockets, and many other system resources relevant to the 1831 protection of the system state. 1833 Subject: is an active entity usually a process which is requesting 1834 access to an object. 1836 MAC-Aware: is a server which can transmit and store object labels. 1838 MAC-Functional: is a client or server which is Labeled NFS enabled. 1839 Such a system can interpret labels and apply policies based on the 1840 security system. 1842 Multi-Level Security (MLS): is a traditional model where objects are 1843 given a sensitivity level (Unclassified, Secret, Top Secret, etc) 1844 and a category set [MLS]. 1846 8.3. MAC Security Attribute 1848 MAC models base access decisions on security attributes bound to 1849 subjects and objects. This information can range from a user 1850 identity for an identity based MAC model, sensitivity levels for 1851 Multi-level security, or a type for Type Enforcement. These models 1852 base their decisions on different criteria but the semantics of the 1853 security attribute remain the same. The semantics required by the 1854 security attributes are listed below: 1856 o MUST provide flexibility with respect to the MAC model. 1858 o MUST provide the ability to atomically set security information 1859 upon object creation. 1861 o MUST provide the ability to enforce access control decisions both 1862 on the client and the server. 1864 o MUST NOT expose an object to either the client or server name 1865 space before its security information has been bound to it. 1867 NFSv4 implements the security attribute as a recommended attribute. 1868 These attributes have a fixed format and semantics, which conflicts 1869 with the flexible nature of the security attribute. To resolve this 1870 the security attribute consists of two components. The first 1871 component is a LFS as defined in [Quigley11] to allow for 1872 interoperability between MAC mechanisms. The second component is an 1873 opaque field which is the actual security attribute data. To allow 1874 for various MAC models, NFSv4 should be used solely as a transport 1875 mechanism for the security attribute. It is the responsibility of 1876 the endpoints to consume the security attribute and make access 1877 decisions based on their respective models. In addition, creation of 1878 objects through OPEN and CREATE allows for the security attribute to 1879 be specified upon creation. By providing an atomic create and set 1880 operation for the security attribute it is possible to enforce the 1881 second and fourth requirements. The recommended attribute 1882 FATTR4_SEC_LABEL (see Section 12.2.2) will be used to satisfy this 1883 requirement. 1885 8.3.1. 
Delegations 1887 In the event that a security attribute is changed on the server while 1888 a client holds a delegation on the file, both the server and the 1889 client MUST follow the NFSv4.1 protocol (see Chapter 10 of [RFC5661]) 1890 with respect to attribute changes. The client SHOULD flush all changes back 1891 to the server and relinquish the delegation. 1893 8.3.2. Permission Checking 1895 It is not feasible to enumerate all possible MAC models and even 1896 levels of protection within a subset of these models. This means 1897 that the NFSv4 clients and servers cannot be expected to directly make 1898 access control decisions based on the security attribute. Instead, 1899 NFSv4 should defer permission checking on this attribute to the host 1900 system. These checks are performed in addition to existing DAC and 1901 ACL checks outlined in the NFSv4 protocol. Section 8.6 gives a 1902 specific example of how the security attribute is handled under a 1903 particular MAC model. 1905 8.3.3. Object Creation 1907 When creating files in NFSv4, the OPEN and CREATE operations are used. 1908 One of the parameters to these operations is an fattr4 structure 1909 containing the attributes the file is to be created with. This 1910 allows NFSv4 to atomically set the security attribute of files upon 1911 creation. When a client is MAC-Functional, it must always provide the 1912 initial security attribute upon file creation. In the event that the 1913 server is MAC-Functional as well, it should determine by policy 1914 whether it will accept the attribute from the client or instead make 1915 the determination itself. If the client is not MAC-Functional, then 1916 the MAC-Functional server must decide on a default label. A more in- 1917 depth explanation can be found in Section 8.6. 1919 8.3.4. Existing Objects 1921 Note that under the MAC model, all objects must have labels. 1922 Therefore, if an existing server is upgraded to include Labeled NFS 1923 support, then it is the responsibility of the security system to 1924 define the behavior for existing objects. 1926 8.3.5. Label Changes 1928 If there are open delegations on the file belonging to a client other 1929 than the one making the label change, then the process described in 1930 Section 8.3.1 must be followed. In short, the delegation will be 1931 recalled, which effectively notifies the client of the change. 1933 Consider a system in which the clients enforce MAC checks and the 1934 server has a very simple security system which just stores the 1935 labels. In this system, the MAC label check always allows access, 1936 regardless of the subject label. 1938 In such a system, MAC labels are enforced by the client. The 1939 security policies on the client can be such that the client does not 1940 have access to the file unless it has a delegation. The recall of 1941 the delegation will force the client to flush any cached content of 1942 the file. The clients could also be configured to periodically 1943 VERIFY/NVERIFY the FATTR4_CHANGE_SEC_LABEL attribute to determine 1944 when the label has changed. When a change is detected, then the 1945 client could take the costlier action of retrieving the 1946 FATTR4_SEC_LABEL. 1948 8.4. pNFS Considerations 1950 The new FATTR4_SEC_LABEL attribute is metadata information and as 1951 such the DS is not aware of the value contained on the MDS. 1952 Fortunately, the NFSv4.1 protocol [RFC5661] already has provisions 1953 for doing access level checks from the DS to the MDS.
In order for 1954 the DS to validate the subject label presented by the client, it 1955 SHOULD utilize this mechanism. 1957 8.5. Discovery of Server Labeled NFS Support 1959 The server can easily determine that a client supports Labeled NFS 1960 when it queries for the FATTR4_SEC_LABEL label for an object. The 1961 client might need to discover which LFS the server supports. 1963 The following compound MUST NOT be denied by any MAC label check: 1965 PUTROOTFH, GETATTR {FATTR4_SEC_LABEL} 1967 Note that the server might have imposed a security flavor on the root 1968 that precludes such access. I.e., if the server requires kerberized 1969 access and the client presents a compound with AUTH_SYS, then the 1970 server is allowed to return NFS4ERR_WRONGSEC in this case. But if 1971 the client presents a correct security flavor, then the server MUST 1972 return the FATTR4_SEC_LABEL attribute with the supported LFS filled 1973 in. 1975 8.6. MAC Security NFS Modes of Operation 1977 A system using Labeled NFS may operate in two modes. The first mode 1978 provides the most protection and is called "full mode". In this mode 1979 both the client and server implement a MAC model allowing each end to 1980 make an access control decision. The remaining mode is called the 1981 "guest mode" and in this mode one end of the connection is not 1982 implementing a MAC model and thus offers less protection than full 1983 mode. 1985 8.6.1. Full Mode 1987 Full mode environments consist of MAC-Functional NFSv4 servers and 1988 clients and may be composed of mixed MAC models and policies. The 1989 system requires that both the client and server have an opportunity 1990 to perform an access control check based on all relevant information 1991 within the network. The file object security attribute is provided 1992 using the mechanism described in Section 8.3. 1994 Fully MAC-Functional NFSv4 servers are not possible in the absence of 1995 RPC layer modifications to support subject label transport. However, 1996 servers may make decisions based on the RPC credential information 1997 available and future specifications may provide subject label 1998 transport. 2000 8.6.1.1. Initial Labeling and Translation 2002 The ability to create a file is an action that a MAC model may wish 2003 to mediate. The client is given the responsibility to determine the 2004 initial security attribute to be placed on a file. This allows the 2005 client to make a decision as to the acceptable security attributes to 2006 create a file with before sending the request to the server. Once 2007 the server receives the creation request from the client it may 2008 choose to evaluate if the security attribute is acceptable. 2010 Security attributes on the client and server may vary based on MAC 2011 model and policy. To handle this the security attribute field has an 2012 LFS component. This component is a mechanism for the host to 2013 identify the format and meaning of the opaque portion of the security 2014 attribute. A full mode environment may contain hosts operating in 2015 several different LFSs. In this case a mechanism for translating the 2016 opaque portion of the security attribute is needed. The actual 2017 translation function will vary based on MAC model and policy and is 2018 out of the scope of this document. If a translation is unavailable 2019 for a given LFS then the request MUST be denied. Another recourse is 2020 to allow the host to provide a fallback mapping for unknown security 2021 attributes. 2023 8.6.1.2. 
Policy Enforcement 2025 In full mode, access control decisions are made by both the clients 2026 and servers. When a client makes a request, it takes the security 2027 attribute from the requesting process and makes an access control 2028 decision based on that attribute and the security attribute of the 2029 object it is trying to access. If the client denies that access, an 2030 RPC call to the server is never made. If, however, the access is 2031 allowed, the client will make a call to the NFS server. 2033 When the server receives the request from the client, it uses any 2034 credential information conveyed in the RPC request and the attributes 2035 of the object the client is trying to access to make an access 2036 control decision. If the server's policy allows this access, it will 2037 fulfill the client's request; otherwise, it will return 2038 NFS4ERR_ACCESS. 2040 Future protocol extensions may also allow the server to factor into 2041 the decision a security label extracted from the RPC request. 2043 Implementations MAY validate security attributes supplied over the 2044 network to ensure that they are within a set of attributes permitted 2045 from a specific peer, and if not, reject them. Note that a system 2046 may permit a different set of attributes to be accepted from each 2047 peer. 2049 8.6.1.3. Limited Server 2051 A Limited Server mode (see Section 3.5.2 of [Haynes13]) consists of a 2052 server which is label-aware, but does not enforce policies. Such a 2053 server will store and retrieve all object labels presented by 2054 clients, utilize the methods described in Section 8.3.5 to allow the 2055 clients to detect changing labels, but may not factor the label into 2056 access decisions. Instead, it will expect the clients to enforce all 2057 such access locally. 2059 8.6.2. Guest Mode 2061 Guest mode implies that either the client or the server does not 2062 handle labels. If the client is not Labeled NFS aware, then it will 2063 not offer subject labels to the server. The server is the only 2064 entity enforcing policy, and may selectively provide standard NFS 2065 services to clients based on their authentication credentials and/or 2066 associated network attributes (e.g., IP address, network interface). 2067 The level of trust and access extended to a client in this mode is 2068 configuration-specific. If the server is not Labeled NFS aware, then 2069 it will not return object labels to the client. Clients in this 2070 environment may consist of groups implementing different MAC 2071 model policies. The system requires that all clients in the 2072 environment be responsible for access control checks. 2074 8.7. Security Considerations 2076 This entire chapter deals with security issues. 2078 Depending on the level of protection the MAC system offers, there may 2079 be a requirement to tightly bind the security attribute to the data. 2081 When only one of the client or server enforces labels, it is 2082 important to realize that the other side is not enforcing MAC 2083 protections. Alternate methods might be in use to handle the lack of 2084 MAC support, and care should be taken to identify and mitigate threats 2085 from possible tampering outside of these methods. 2087 An example of this is that a server that modifies READDIR or LOOKUP 2088 results based on the client's subject label might want to always 2089 construct the same subject label for a client which does not present 2090 one. This will prevent a non-Labeled NFS client from mixing entries 2091 in the directory cache.
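A minimal, non-normative sketch of that mitigation follows. The label values and the dominates() check are hypothetical placeholders for a real MAC policy module; the point is only that clients which present no subject label are always evaluated under the same default label.

   # Illustrative server-side filtering of READDIR results by MAC policy.
   # DEFAULT_SUBJECT_LABEL and dominates() are hypothetical placeholders.
   DEFAULT_SUBJECT_LABEL = ("lfs0", "unclassified")

   def dominates(subject_label, object_label):
       # Placeholder policy: allow access only when the labels match.
       return subject_label == object_label

   def readdir_entries(entries, client_subject_label=None):
       # entries: list of (name, object_label) pairs.
       # Unlabeled clients always get the same default subject label so
       # that their directory caches never mix differently filtered
       # results.
       subject = client_subject_label or DEFAULT_SUBJECT_LABEL
       return [name for name, obj_label in entries
               if dominates(subject, obj_label)]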
2093 9. Sharing change attribute implementation details with NFSv4 clients 2095 9.1. Introduction 2097 Although both the NFSv4 [I-D.ietf-nfsv4-rfc3530bis] and NFSv4.1 2098 [RFC5661] protocols define the change attribute as being mandatory to 2099 implement, there is little in the way of guidance. The only mandated 2100 feature is that the value must change whenever the file data or 2101 metadata change. 2103 While this allows for a wide range of implementations, it also leaves 2104 the client with a conundrum: how does it determine which is the most 2105 recent value for the change attribute in a case where several RPC 2106 calls have been issued in parallel? In other words, if two COMPOUNDs, 2107 both containing WRITE and GETATTR requests for the same file, have 2108 been issued in parallel, how does the client determine which of the 2109 two change attribute values returned in the replies to the GETATTR 2110 requests corresponds to the most recent state of the file? In some 2111 cases, the only recourse may be to send another COMPOUND containing a 2112 third GETATTR that is fully serialized with the first two. 2114 NFSv4.2 avoids this kind of inefficiency by allowing the server to 2115 share details about how the change attribute is expected to evolve, 2116 so that the client may immediately determine which, out of the 2117 several change attribute values returned by the server, is the most 2118 recent. change_attr_type is defined as a new recommended attribute 2119 (see Section 12.2.1), and is per file system. 2121 10. Security Considerations 2123 NFSv4.2 has all of the security concerns present in NFSv4.1 (see 2124 Section 21 of [RFC5661]) and those present in the Server-side Copy 2125 (see Section 3.4) and in Labeled NFS (see Section 8.7). 2127 11. Error Values 2129 NFS error numbers are assigned to failed operations within a Compound 2130 (COMPOUND or CB_COMPOUND) request. A Compound request contains a 2131 number of NFS operations that have their results encoded in sequence 2132 in a Compound reply. The results of successful operations will 2133 consist of an NFS4_OK status followed by the encoded results of the 2134 operation. If an NFS operation fails, an error status will be 2135 entered in the reply and the Compound request will be terminated. 2137 11.1. Error Definitions 2139 Protocol Error Definitions 2141 +-------------------------+--------+------------------+ 2142 | Error | Number | Description | 2143 +-------------------------+--------+------------------+ 2144 | NFS4ERR_BADLABEL | 10093 | Section 11.1.3.1 | 2145 | NFS4ERR_OFFLOAD_DENIED | 10091 | Section 11.1.2.1 | 2146 | NFS4ERR_PARTNER_NO_AUTH | 10089 | Section 11.1.2.2 | 2147 | NFS4ERR_PARTNER_NOTSUPP | 10088 | Section 11.1.2.3 | 2148 | NFS4ERR_UNION_NOTSUPP | 10090 | Section 11.1.1.1 | 2149 | NFS4ERR_WRONG_LFS | 10092 | Section 11.1.3.2 | 2150 +-------------------------+--------+------------------+ 2152 Table 1 2154 11.1.1. General Errors 2156 This section deals with errors that are applicable to a broad set of 2157 different purposes. 2159 11.1.1.1. NFS4ERR_UNION_NOTSUPP (Error Code 10090) 2161 One of the arguments to the operation is a discriminated union and 2162 while the server supports the given operation, it does not support 2163 the selected arm of the discriminated union. For an example, see 2164 WRITE_PLUS (Section 14.7). 2166 11.1.2. Server to Server Copy Errors 2168 These errors deal with the interaction between servers in server-to-server 2169 copies. 2171 11.1.2.1.
NFS4ERR_OFFLOAD_DENIED (Error Code 10091) 2173 The copy offload operation is supported by both the source and the 2174 destination, but the destination is not allowing it for this file. 2175 If the client sees this error, it should fall back to the normal copy 2176 semantics. 2178 11.1.2.2. NFS4ERR_PARTNER_NO_AUTH (Error Code 10089) 2180 The source server does not authorize a server-to-server copy offload 2181 operation. This may be due to the client's failure to send the 2182 COPY_NOTIFY operation to the source server, the source server 2183 receiving a server-to-server copy offload request after the copy 2184 lease time expired, or for some other permission problem. 2186 11.1.2.3. NFS4ERR_PARTNER_NOTSUPP (Error Code 10088) 2188 The remote server does not support the server-to-server copy offload 2189 protocol. 2191 11.1.3. Labeled NFS Errors 2193 These errors are used in Labeled NFS. 2195 11.1.3.1. NFS4ERR_BADLABEL (Error Code 10093) 2197 The label specified is invalid in some manner. 2199 11.1.3.2. NFS4ERR_WRONG_LFS (Error Code 10092) 2201 The LFS specified in the subject label is not compatible with the LFS 2202 in the object label. 2204 11.2. New Operations and Their Valid Errors 2206 This section contains a table that gives the valid error returns for 2207 each new NFSv4.2 protocol operation. The error code NFS4_OK 2208 (indicating no error) is not listed but should be understood to be 2209 returnable by all new operations. The error values for all other 2210 operations are defined in Section 15.2 of [RFC5661]. 2212 Valid Error Returns for Each New Protocol Operation 2214 +----------------+--------------------------------------------------+ 2215 | Operation | Errors | 2216 +----------------+--------------------------------------------------+ 2217 | COPY | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 2218 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 2219 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2220 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 2221 | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | 2222 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, | 2223 | | NFS4ERR_IO, NFS4ERR_ISDIR, NFS4ERR_LOCKED, | 2224 | | NFS4ERR_METADATA_NOTSUPP, NFS4ERR_MOVED, | 2225 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 2226 | | NFS4ERR_OFFLOAD_DENIED, NFS4ERR_OLD_STATEID, | 2227 | | NFS4ERR_OPENMODE, NFS4ERR_OP_NOT_IN_SESSION, | 2228 | | NFS4ERR_PARTNER_NO_AUTH, | 2229 | | NFS4ERR_PARTNER_NOTSUPP, NFS4ERR_PNFS_IO_HOLE, | 2230 | | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG, | 2231 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2232 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2233 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 2234 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 2235 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE | 2236 | COPY_NOTIFY | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 2237 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 2238 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2239 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 2240 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, | 2241 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED, | 2242 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 2243 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 2244 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, | 2245 | | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG, | 2246 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2247 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2248 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 2249 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 2250 | | NFS4ERR_WRONG_TYPE | 2251 | OFFLOAD_ABORT | NFS4ERR_ADMIN_REVOKED, 
NFS4ERR_BADXDR, | 2252 | | NFS4ERR_BAD_STATEID, NFS4ERR_COMPLETE_ALREADY, | 2253 | | NFS4ERR_DEADSESSION, NFS4ERR_EXPIRED, | 2254 | | NFS4ERR_DELAY, NFS4ERR_GRACE, NFS4ERR_NOTSUPP, | 2255 | | NFS4ERR_OLD_STATEID, NFS4ERR_OP_NOT_IN_SESSION, | 2256 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 2257 | OFFLOAD_REVOKE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 2258 | | NFS4ERR_COMPLETE_ALREADY, NFS4ERR_DELAY, | 2259 | | NFS4ERR_GRACE, NFS4ERR_INVALID, NFS4ERR_MOVED, | 2260 | | NFS4ERR_NOTSUPP, NFS4ERR_OP_NOT_IN_SESSION, | 2261 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 2262 | OFFLOAD_STATUS | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 2263 | | NFS4ERR_BAD_STATEID, NFS4ERR_COMPLETE_ALREADY, | 2264 | | NFS4ERR_DEADSESSION, NFS4ERR_EXPIRED, | 2265 | | NFS4ERR_DELAY, NFS4ERR_GRACE, NFS4ERR_NOTSUPP, | 2266 | | NFS4ERR_OLD_STATEID, NFS4ERR_OP_NOT_IN_SESSION, | 2267 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 2268 | READ_PLUS | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 2269 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 2270 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2271 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 2272 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, | 2273 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED, | 2274 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 2275 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 2276 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, | 2277 | | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG, | 2278 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2279 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2280 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 2281 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 2282 | | NFS4ERR_WRONG_TYPE | 2283 | SEEK | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 2284 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 2285 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2286 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 2287 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, | 2288 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED, | 2289 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 2290 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 2291 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, | 2292 | | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG, | 2293 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2294 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2295 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 2296 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 2297 | | NFS4ERR_UNION_NOTSUPP, NFS4ERR_WRONG_TYPE | 2298 | SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | 2299 | | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT, | 2300 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 2301 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2302 | | NFS4ERR_REP_TOO_BIG, | 2303 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2304 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2305 | | NFS4ERR_SEQUENCE_POS, NFS4ERR_SEQ_FALSE_RETRY, | 2306 | | NFS4ERR_SEQ_MISORDERED, NFS4ERR_TOO_MANY_OPS | 2307 | WRITE_PLUS | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 2308 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 2309 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2310 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 2311 | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | 2312 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, | 2313 | | NFS4ERR_IO, NFS4ERR_ISDIR, NFS4ERR_LOCKED, | 2314 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 2315 | | NFS4ERR_NOSPC, NFS4ERR_OLD_STATEID, | 2316 | | NFS4ERR_OPENMODE, NFS4ERR_OP_NOT_IN_SESSION, | 2317 | | NFS4ERR_PNFS_IO_HOLE, NFS4ERR_PNFS_NO_LAYOUT, | 2318 | | NFS4ERR_REP_TOO_BIG, | 2319 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 
2320 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2321 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 2322 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 2323 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_UNION_NOTSUPP, | 2324 | | NFS4ERR_WRONG_TYPE | 2325 +----------------+--------------------------------------------------+ 2327 Table 2 2329 11.3. New Callback Operations and Their Valid Errors 2331 This section contains a table that gives the valid error returns for 2332 each new NFSv4.2 callback operation. The error code NFS4_OK 2333 (indicating no error) is not listed but should be understood to be 2334 returnable by all new callback operations. The error values for all 2335 other callback operations are defined in Section 15.3 of [RFC5661]. 2337 Valid Error Returns for Each New Protocol Callback Operation 2339 +------------+------------------------------------------------------+ 2340 | Callback | Errors | 2341 | Operation | | 2342 +------------+------------------------------------------------------+ 2343 | CB_OFFLOAD | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 2344 | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | 2345 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_REP_TOO_BIG, | 2346 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_REQ_TOO_BIG, | 2347 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_SERVERFAULT, | 2348 | | NFS4ERR_TOO_MANY_OPS | 2349 +------------+------------------------------------------------------+ 2351 Table 3 2353 12. New File Attributes 2355 12.1. New RECOMMENDED Attributes - List and Definition References 2357 The list of new RECOMMENDED attributes appears in Table 4. The 2358 meaning of the columns of the table are: 2360 Name: The name of the attribute. 2362 Id: The number assigned to the attribute. In the event of conflicts 2363 between the assigned number and [NFSv42xdr], the latter is likely 2364 authoritative, but should be resolved with Errata to this document 2365 and/or [NFSv42xdr]. See [IESG08] for the Errata process. 2367 Data Type: The XDR data type of the attribute. 2369 Acc: Access allowed to the attribute. 2371 R means read-only (GETATTR may retrieve, SETATTR may not set). 2373 W means write-only (SETATTR may set, GETATTR may not retrieve). 2375 R W means read/write (GETATTR may retrieve, SETATTR may set). 2377 Defined in: The section of this specification that describes the 2378 attribute. 2380 +------------------+----+-------------------+-----+----------------+ 2381 | Name | Id | Data Type | Acc | Defined in | 2382 +------------------+----+-------------------+-----+----------------+ 2383 | change_attr_type | 79 | change_attr_type4 | R | Section 12.2.1 | 2384 | sec_label | 80 | sec_label4 | R W | Section 12.2.2 | 2385 | change_sec_label | 81 | change_sec_label4 | R | Section 12.2.3 | 2386 | space_reserved | 77 | boolean | R W | Section 12.2.4 | 2387 | space_freed | 78 | length4 | R | Section 12.2.5 | 2388 +------------------+----+-------------------+-----+----------------+ 2390 Table 4 2392 12.2. Attribute Definitions 2393 12.2.1. Attribute 79: change_attr_type 2395 enum change_attr_type4 { 2396 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, 2397 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, 2398 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, 2399 NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, 2400 NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 2401 }; 2403 change_attr_type is a per file system attribute which enables the 2404 NFSv4.2 server to provide additional information about how it expects 2405 the change attribute value to evolve after the file data, or metadata 2406 has changed. 
While Section 5.4 of [RFC5661] discusses per file 2407 system attributes, it is expected that the value of change_attr_type 2408 not depend on the value of "homogeneous" and only changes in the 2409 event of a migration. 2411 NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take 2412 values that fit into any of these categories. 2414 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST 2415 monotonically increase for every atomic change to the file 2416 attributes, data, or directory contents. 2418 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER: The change attribute value MUST 2419 be incremented by one unit for every atomic change to the file 2420 attributes, data, or directory contents. This property is 2421 preserved when writing to pNFS data servers. 2423 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute 2424 value MUST be incremented by one unit for every atomic change to 2425 the file attributes, data, or directory contents. In the case 2426 where the client is writing to pNFS data servers, the number of 2427 increments is not guaranteed to exactly match the number of 2428 writes. 2430 NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is 2431 implemented as suggested in the NFSv4 spec 2432 [I-D.ietf-nfsv4-rfc3530bis] in terms of the time_metadata 2433 attribute. 2435 If either NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, 2436 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or 2437 NFS4_CHANGE_TYPE_IS_TIME_METADATA are set, then the client knows at 2438 the very least that the change attribute is monotonically increasing, 2439 which is sufficient to resolve the question of which value is the 2440 most recent. 2442 If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then 2443 by inspecting the value of the 'time_delta' attribute it additionally 2444 has the option of detecting rogue server implementations that use 2445 time_metadata in violation of the spec. 2447 If the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it has the 2448 ability to predict what the resulting change attribute value should 2449 be after a COMPOUND containing a SETATTR, WRITE, or CREATE. This 2450 again allows it to detect changes made in parallel by another client. 2451 The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits the 2452 same, but only if the client is not doing pNFS WRITEs. 2454 Finally, if the server does not support change_attr_type or if 2455 NFS4_CHANGE_TYPE_IS_UNDEFINED is set, then the server SHOULD make an 2456 effort to implement the change attribute in terms of the 2457 time_metadata attribute. 2459 12.2.2. Attribute 80: sec_label 2461 typedef uint32_t policy4; 2463 struct labelformat_spec4 { 2464 policy4 lfs_lfs; 2465 policy4 lfs_pi; 2466 }; 2468 struct sec_label4 { 2469 labelformat_spec4 slai_lfs; 2470 opaque slai_data<>; 2471 }; 2473 The FATTR4_SEC_LABEL contains an array of two components with the 2474 first component being an LFS. It serves to provide the receiving end 2475 with the information necessary to translate the security attribute 2476 into a form that is usable by the endpoint. Label Formats assigned 2477 an LFS may optionally choose to include a Policy Identifier field to 2478 allow for complex policy deployments. The LFS and Label Format 2479 Registry are described in detail in [Quigley11]. The translation 2480 used to interpret the security attribute is not specified as part of 2481 the protocol as it may depend on various factors. The second 2482 component is an opaque section which contains the data of the 2483 attribute. 
This component is dependent on the MAC model to interpret 2484 and enforce. 2486 In particular, it is the responsibility of the LFS specification to 2487 define a maximum size for the opaque section, slai_data<>. When 2488 creating or modifying a label for an object, the client needs to be 2489 guaranteed that the server will accept a label that is sized 2490 correctly. Because both the client and the server are part of a specific MAC 2491 model, the client will be aware of the size. 2493 If a server supports sec_label, then it MUST also support 2494 change_sec_label. Any modification to sec_label MUST modify the 2495 value for change_sec_label. 2497 12.2.3. Attribute 81: change_sec_label 2499 The change_sec_label attribute is a read-only attribute per file. If 2500 the value of sec_label for a file is not the same at two disparate 2501 times, then the values of change_sec_label at those times MUST be 2502 different as well. The value of change_sec_label MAY change at other 2503 times as well, but this should be rare, as that will require the 2504 client to abort any operation in progress, re-read the label, and 2505 retry the operation. As the sec_label is not bounded in size, this 2506 attribute allows VERIFY and NVERIFY to quickly determine if the 2507 sec_label has been modified. 2509 12.2.4. Attribute 77: space_reserved 2511 The space_reserved attribute is a read/write attribute of type 2512 boolean. It is a per file attribute and applies during the lifetime 2513 of the file or until it is turned off. When the space_reserved 2514 attribute is set via SETATTR, the server must ensure that there is 2515 disk space to accommodate every byte in the file before it can return 2516 success. If the server cannot guarantee this, it must return 2517 NFS4ERR_NOSPC. 2519 If the client tries to grow a file which has the space_reserved 2520 attribute set, the server must guarantee that there is disk space to 2521 accommodate every byte in the file with the new size before it can 2522 return success. If the server cannot guarantee this, it must return 2523 NFS4ERR_NOSPC. 2525 It is not required that the server allocate the space to the file 2526 before returning success. The allocation can be deferred; however, 2527 it must be guaranteed that it will not fail for lack of space. 2529 The value of space_reserved can be obtained at any time through 2530 GETATTR. If the size is retrieved at the same time, the client can 2531 determine the size of the reservation. 2533 In order to avoid ambiguity, the space_reserved bit cannot be set 2534 along with the size bit in SETATTR. Increasing the size of a file 2535 with space_reserved set will fail if space reservation cannot be 2536 guaranteed for the new size. If the file size is decreased, space 2537 reservation is only guaranteed for the new size. If a hole is 2538 punched into the file, then the reservation is not changed. 2540 12.2.5. Attribute 78: space_freed 2542 space_freed gives the number of bytes freed if the file is deleted. 2543 This attribute is read only and is of type length4. It is a per file 2544 attribute. 2546 13. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 2548 The following tables summarize the operations of the NFSv4.2 protocol 2549 and the corresponding designation of REQUIRED, RECOMMENDED, or 2550 OPTIONAL to implement, or either OBSOLESCENT or MUST NOT implement.
2551 The designation of OBSOLESCENT is reserved for those operations which 2552 are defined in either NFSv4.0 or NFSv4.1 and are intended to be 2553 classified as MUST NOT be implemented in NFSv4.3. The designation of 2554 MUST NOT implement is reserved for those operations that were defined 2555 in either NFSv4.0 or NFSV4.1 and MUST NOT be implemented in NFSv4.2. 2557 For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation 2558 for operations sent by the client is for the server implementation. 2559 The client is generally required to implement the operations needed 2560 for the operating environment for which it serves. For example, a 2561 read-only NFSv4.2 client would have no need to implement the WRITE 2562 operation and is not required to do so. 2564 The REQUIRED or OPTIONAL designation for callback operations sent by 2565 the server is for both the client and server. Generally, the client 2566 has the option of creating the backchannel and sending the operations 2567 on the fore channel that will be a catalyst for the server sending 2568 callback operations. A partial exception is CB_RECALL_SLOT; the only 2569 way the client can avoid supporting this operation is by not creating 2570 a backchannel. 2572 Since this is a summary of the operations and their designation, 2573 there are subtleties that are not presented here. Therefore, if 2574 there is a question of the requirements of implementation, the 2575 operation descriptions themselves must be consulted along with other 2576 relevant explanatory text within this either specification or that of 2577 NFSv4.1 [RFC5661]. 2579 The abbreviations used in the second and third columns of the table 2580 are defined as follows. 2582 REQ REQUIRED to implement 2584 REC RECOMMENDED to implement 2586 OPT OPTIONAL to implement 2588 MNI MUST NOT implement 2590 OBS Also OBSOLESCENT for future versions. 2592 For the NFSv4.2 features that are OPTIONAL, the operations that 2593 support those features are OPTIONAL, and the server would return 2594 NFS4ERR_NOTSUPP in response to the client's use of those operations. 2595 If an OPTIONAL feature is supported, it is possible that a set of 2596 operations related to the feature become REQUIRED to implement. The 2597 third column of the table designates the feature(s) and if the 2598 operation is REQUIRED or OPTIONAL in the presence of support for the 2599 feature. 
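The following non-normative sketch (Python; not part of the protocol or of [NFSv42xdr]) illustrates the behavior described above for OPTIONAL operations: if the server returns NFS4ERR_NOTSUPP for an OPTIONAL operation such as COPY, the client simply falls back to a supported path. The send_copy and read_write_loop callables are hypothetical placeholders for whatever request machinery a client actually uses.

   NFS4ERR_NOTSUPP = 10004    # error value defined by NFSv4.1

   def copy_with_fallback(send_copy, read_write_loop):
       # Try the OPTIONAL COPY operation first.
       status = send_copy()
       if status == NFS4ERR_NOTSUPP:
           # The server does not implement the Server Side Copy feature;
           # move the data through the client with ordinary READ and
           # WRITE operations instead.
           return read_write_loop()
       return status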
2601 The OPTIONAL features identified and their abbreviations are as 2602 follows: 2604 pNFS Parallel NFS 2606 FDELG File Delegations 2608 DDELG Directory Delegations 2610 COPY Server Side Copy 2612 ADH Application Data Holes 2614 Operations 2616 +----------------------+---------------------+----------------------+ 2617 | Operation | EOL, REQ, REC, OPT, | Feature (REQ, REC, | 2618 | | or MNI | or OPT) | 2619 +----------------------+---------------------+----------------------+ 2620 | ACCESS | REQ | | 2621 | BACKCHANNEL_CTL | REQ | | 2622 | BIND_CONN_TO_SESSION | REQ | | 2623 | CLOSE | REQ | | 2624 | COMMIT | REQ | | 2625 | COPY | OPT | COPY (REQ) | 2626 | OFFLOAD_ABORT | OPT | COPY (REQ) | 2627 | COPY_NOTIFY | OPT | COPY (REQ) | 2628 | OFFLOAD_REVOKE | OPT | COPY (REQ) | 2629 | OFFLOAD_STATUS | OPT | COPY (REQ) | 2630 | CREATE | REQ | | 2631 | CREATE_SESSION | REQ | | 2632 | DELEGPURGE | OPT | FDELG (REQ) | 2633 | DELEGRETURN | OPT | FDELG, DDELG, pNFS | 2634 | | | (REQ) | 2635 | DESTROY_CLIENTID | REQ | | 2636 | DESTROY_SESSION | REQ | | 2637 | EXCHANGE_ID | REQ | | 2638 | FREE_STATEID | REQ | | 2639 | GETATTR | REQ | | 2640 | GETDEVICEINFO | OPT | pNFS (REQ) | 2641 | GETDEVICELIST | OPT | pNFS (OPT) | 2642 | GETFH | REQ | | 2643 | WRITE_PLUS | OPT | ADH (REQ) | 2644 | GET_DIR_DELEGATION | OPT | DDELG (REQ) | 2645 | LAYOUTCOMMIT | OPT | pNFS (REQ) | 2646 | LAYOUTGET | OPT | pNFS (REQ) | 2647 | LAYOUTRETURN | OPT | pNFS (REQ) | 2648 | LINK | OPT | | 2649 | LOCK | REQ | | 2650 | LOCKT | REQ | | 2651 | LOCKU | REQ | | 2652 | LOOKUP | REQ | | 2653 | LOOKUPP | REQ | | 2654 | NVERIFY | REQ | | 2655 | OPEN | REQ | | 2656 | OPENATTR | OPT | | 2657 | OPEN_CONFIRM | MNI | | 2658 | OPEN_DOWNGRADE | REQ | | 2659 | PUTFH | REQ | | 2660 | PUTPUBFH | REQ | | 2661 | PUTROOTFH | REQ | | 2662 | READ | REQ (OBS) | | 2663 | READDIR | REQ | | 2664 | READLINK | OPT | | 2665 | READ_PLUS | OPT | ADH (REQ) | 2666 | RECLAIM_COMPLETE | REQ | | 2667 | RELEASE_LOCKOWNER | MNI | | 2668 | REMOVE | REQ | | 2669 | RENAME | REQ | | 2670 | RENEW | MNI | | 2671 | RESTOREFH | REQ | | 2672 | SAVEFH | REQ | | 2673 | SECINFO | REQ | | 2674 | SECINFO_NO_NAME | REC | pNFS file layout | 2675 | | | (REQ) | 2676 | SEQUENCE | REQ | | 2677 | SETATTR | REQ | | 2678 | SETCLIENTID | MNI | | 2679 | SETCLIENTID_CONFIRM | MNI | | 2680 | SET_SSV | REQ | | 2681 | TEST_STATEID | REQ | | 2682 | VERIFY | REQ | | 2683 | WANT_DELEGATION | OPT | FDELG (OPT) | 2684 | WRITE | REQ (OBS) | | 2685 +----------------------+---------------------+----------------------+ 2687 Callback Operations 2689 +-------------------------+-------------------+---------------------+ 2690 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, | 2691 | | MNI | or OPT) | 2692 +-------------------------+-------------------+---------------------+ 2693 | CB_OFFLOAD | OPT | COPY (REQ) | 2694 | CB_GETATTR | OPT | FDELG (REQ) | 2695 | CB_LAYOUTRECALL | OPT | pNFS (REQ) | 2696 | CB_NOTIFY | OPT | DDELG (REQ) | 2697 | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | 2698 | CB_NOTIFY_LOCK | OPT | | 2699 | CB_PUSH_DELEG | OPT | FDELG (OPT) | 2700 | CB_RECALL | OPT | FDELG, DDELG, pNFS | 2701 | | | (REQ) | 2702 | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | 2703 | | | (REQ) | 2704 | CB_RECALL_SLOT | REQ | | 2705 | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | 2706 | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | 2707 | | | (REQ) | 2708 | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | 2709 | | | (REQ) | 2710 +-------------------------+-------------------+---------------------+ 2712 14. 
NFSv4.2 Operations 2714 14.1. Operation 59: COPY - Initiate a server-side copy 2715 14.1.1. ARGUMENT 2717 struct COPY4args { 2718 /* SAVED_FH: source file */ 2719 /* CURRENT_FH: destination file */ 2720 stateid4 ca_src_stateid; 2721 stateid4 ca_dst_stateid; 2722 offset4 ca_src_offset; 2723 offset4 ca_dst_offset; 2724 length4 ca_count; 2725 netloc4 ca_source_server<>; 2726 }; 2728 14.1.2. RESULT 2730 union COPY4res switch (nfsstat4 cr_status) { 2731 case NFS4_OK: 2732 write_response4 resok4; 2733 default: 2734 length4 cr_bytes_copied; 2735 }; 2737 14.1.3. DESCRIPTION 2739 The COPY operation is used for both intra-server and inter-server 2740 copies. In both cases, the COPY is always sent from the client to 2741 the destination server of the file copy. The COPY operation requests 2742 that a file be copied from the location specified by the SAVED_FH 2743 value to the location specified by the CURRENT_FH. 2745 The SAVED_FH must be a regular file. If SAVED_FH is not a regular 2746 file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. 2748 In order to set SAVED_FH to the source file handle, the compound 2749 procedure requesting the COPY will include a sub-sequence of 2750 operations such as 2752 PUTFH source-fh 2753 SAVEFH 2755 If the request is for a server-to-server copy, the source-fh is a 2756 filehandle from the source server and the compound procedure is being 2757 executed on the destination server. In this case, the source-fh is a 2758 foreign filehandle on the server receiving the COPY request. If 2759 either PUTFH or SAVEFH checked the validity of the filehandle, the 2760 operation would likely fail and return NFS4ERR_STALE. 2762 If a server supports the server-to-server COPY feature, a PUTFH 2763 followed by a SAVEFH MUST NOT return NFS4ERR_STALE for either 2764 operation. These restrictions do not pose substantial difficulties 2765 for servers. The CURRENT_FH and SAVED_FH may be validated in the 2766 context of the operation referencing them and an NFS4ERR_STALE error 2767 returned for an invalid file handle at that point. 2769 For an intra-server copy, both the ca_src_stateid and ca_dst_stateid 2770 MUST refer to either open or locking states provided earlier by the 2771 server. If either stateid is invalid, then the operation MUST fail. 2772 If the request is for a inter-server copy, then the ca_src_stateid 2773 can be ignored. If ca_dst_stateid is invalid, then the operation 2774 MUST fail. 2776 The CURRENT_FH specifies the destination of the copy operation. The 2777 CURRENT_FH MUST be a regular file and not a directory. Note, the 2778 file MUST exist before the COPY operation begins. It is the 2779 responsibility of the client to create the file if necessary, 2780 regardless of the actual copy protocol used. If the file cannot be 2781 created in the destination file system (due to file name 2782 restrictions, such as case or length), the COPY operation MUST NOT be 2783 called. 2785 The ca_src_offset is the offset within the source file from which the 2786 data will be read, the ca_dst_offset is the offset within the 2787 destination file to which the data will be written, and the ca_count 2788 is the number of bytes that will be copied. An offset of 0 (zero) 2789 specifies the start of the file. A count of 0 (zero) requests that 2790 all bytes from ca_src_offset through EOF be copied to the 2791 destination. 
If concurrent modifications to the source file overlap 2792 with the source file region being copied, the data copied may include 2793 all, some, or none of the modifications. The client can use standard 2794 NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory 2795 byte range locks) to protect against concurrent modifications if the 2796 client is concerned about this. If the source file's end of file is 2797 being modified in parallel with a copy that specifies a count of 0 2798 (zero) bytes, the amount of data copied is implementation dependent 2799 (clients may guard against this case by specifying a non-zero count 2800 value or preventing modification of the source file as mentioned 2801 above). 2803 If the source offset or the source offset plus count is greater than 2804 or equal to the size of the source file, the operation will fail with 2805 NFS4ERR_INVAL. The destination offset or destination offset plus 2806 count may be greater than the size of the destination file. This 2807 allows for the client to issue parallel copies to implement 2808 operations such as "cat file1 file2 file3 file4 > dest". 2810 If the ca_source_server list is specified, then this is an inter- 2811 server copy operation and the source file is on a remote server. The 2812 client is expected to have previously issued a successful COPY_NOTIFY 2813 request to the remote source server. The ca_source_server list MUST 2814 be the same as the COPY_NOTIFY response's cnr_source_server list. If 2815 the client includes the entries from the COPY_NOTIFY response's 2816 cnr_source_server list in the ca_source_server list, the source 2817 server can indicate a specific copy protocol for the destination 2818 server to use by returning a URL, which specifies both a protocol 2819 service and server name. Server-to-server copy protocol 2820 considerations are described in Section 3.2.5 and Section 3.4.1. 2822 The copying of any and all attributes on the source file is the 2823 responsibility of both the client and the copy protocol. Any 2824 attribute which is both exposed via the NFS protocol on the source 2825 file and set SHOULD be copied to the destination file. Any attribute 2826 supported by the destination server that is not set on the source 2827 file SHOULD be left unset. If the client cannot copy an attribute 2828 from the source to destination, it MAY fail the copy transaction. 2830 Metadata attributes not exposed via the NFS protocol SHOULD be copied 2831 to the destination file where appropriate via the copy protocol. 2832 Note that if the copy protocol is NFSv4.x, then these attributes will 2833 be lost. 2835 The destination file's named attributes are not duplicated from the 2836 source file. After the copy process completes, the client MAY 2837 attempt to duplicate named attributes using standard NFSv4 2838 operations. However, the destination file's named attribute 2839 capabilities MAY be different from the source file's named attribute 2840 capabilities. 2842 If the operation does not result in an immediate failure, the server 2843 will return NFS4_OK, and the CURRENT_FH will remain the destination's 2844 filehandle. 2846 If an immediate failure does occur, cr_bytes_copied will be set to 2847 the number of bytes copied to the destination file before the error 2848 occurred. The cr_bytes_copied value indicates the number of bytes 2849 copied but not which specific bytes have been copied. 
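The following non-normative sketch (Python; the list-of-tuples encoding is illustrative only and is not an XDR encoding) shows one way a client might order the operations of a COMPOUND that performs an intra-server copy, following the SAVED_FH and CURRENT_FH conventions described above. The operation and argument names mirror this section; the filehandle and stateid strings are placeholders.

   def build_copy_compound(src_fh, dst_fh, src_stateid, dst_stateid,
                           src_offset, dst_offset, count):
       return [
           ("SEQUENCE", {}),                      # NFSv4.2 requests run in a session
           ("PUTFH",    {"object": src_fh}),      # CURRENT_FH := source file
           ("SAVEFH",   {}),                      # SAVED_FH   := source file
           ("PUTFH",    {"object": dst_fh}),      # CURRENT_FH := destination file
           ("COPY",     {"ca_src_stateid":   src_stateid,
                         "ca_dst_stateid":   dst_stateid,
                         "ca_src_offset":    src_offset,
                         "ca_dst_offset":    dst_offset,
                         "ca_count":         count,  # 0 means copy through EOF
                         "ca_source_server": []}),   # empty list: intra-server copy
       ]

   ops = build_copy_compound("fh-src", "fh-dst", "stateid-s", "stateid-d",
                             src_offset=0, dst_offset=0, count=0)

For an inter-server copy the same ordering applies on the destination server, except that ca_source_server carries the cnr_source_server list obtained from COPY_NOTIFY and the source filehandle is foreign to the destination server.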
2851 A return of NFS4_OK indicates that either the operation is complete 2852 or the operation was initiated and a callback will be used to deliver 2853 the final status of the operation. 2855 If the wr_callback_id is returned, this indicates that the operation 2856 was initiated and a CB_OFFLOAD callback will deliver the final 2857 results of the operation. The wr_callback_id stateid is termed a 2858 copy stateid in this context. The server is given the option of 2859 returning the results in a callback because the data may require a 2860 relatively long period of time to copy. 2862 If no wr_callback_id is returned, the operation completed 2863 synchronously and no callback will be issued by the server. The 2864 completion status of the operation is indicated by cr_status. 2866 If the copy completes successfully, either synchronously or 2867 asynchronously, the data copied from the source file to the 2868 destination file MUST appear identical to the NFS client. However, 2869 the NFS server's on disk representation of the data in the source 2870 file and destination file MAY differ. For example, the NFS server 2871 might encrypt, compress, deduplicate, or otherwise represent the on 2872 disk data in the source and destination file differently. 2874 14.2. Operation 60: OFFLOAD_ABORT - Cancel a server-side copy 2876 14.2.1. ARGUMENT 2878 struct OFFLOAD_ABORT4args { 2879 /* CURRENT_FH: destination file */ 2880 stateid4 oaa_stateid; 2881 }; 2883 14.2.2. RESULT 2885 struct OFFLOAD_ABORT4res { 2886 nfsstat4 oar_status; 2887 }; 2889 14.2.3. DESCRIPTION 2891 OFFLOAD_ABORT is used for both intra- and inter-server asynchronous 2892 copies. The OFFLOAD_ABORT operation allows the client to cancel a 2893 server-side copy operation that it initiated. This operation is sent 2894 in a COMPOUND request from the client to the destination server. 2895 This operation may be used to cancel a copy when the application that 2896 requested the copy exits before the operation is completed or for 2897 some other reason. 2899 The request contains the filehandle and copy stateid cookies that act 2900 as the context for the previously initiated copy operation. 2902 The result's oar_status field indicates whether the cancel was 2903 successful or not. A value of NFS4_OK indicates that the copy 2904 operation was canceled and no callback will be issued by the server. 2906 A copy operation that is successfully canceled may result in none, 2907 some, or all of the data and/or metadata copied. 2909 If the server supports asynchronous copies, the server is REQUIRED to 2910 support the OFFLOAD_ABORT operation. 2912 14.3. Operation 61: COPY_NOTIFY - Notify a source server of a future 2913 copy 2915 14.3.1. ARGUMENT 2917 struct COPY_NOTIFY4args { 2918 /* CURRENT_FH: source file */ 2919 stateid4 cna_src_stateid; 2920 netloc4 cna_destination_server; 2921 }; 2923 14.3.2. RESULT 2925 struct COPY_NOTIFY4resok { 2926 nfstime4 cnr_lease_time; 2927 netloc4 cnr_source_server<>; 2928 }; 2930 union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { 2931 case NFS4_OK: 2932 COPY_NOTIFY4resok resok4; 2933 default: 2934 void; 2935 }; 2937 14.3.3. DESCRIPTION 2939 This operation is used for an inter-server copy. A client sends this 2940 operation in a COMPOUND request to the source server to authorize a 2941 destination server identified by cna_destination_server to read the 2942 file specified by CURRENT_FH on behalf of the given user. 2944 The cna_src_stateid MUST refer to either open or locking states 2945 provided earlier by the server. 
If it is invalid, then the operation 2946 MUST fail. 2948 The cna_destination_server MUST be specified using the netloc4 2949 network location format. The server is not required to resolve the 2950 cna_destination_server address before completing this operation. 2952 If this operation succeeds, the source server will allow the 2953 cna_destination_server to copy the specified file on behalf of the 2954 given user as long as both of the following conditions are met: 2956 o The destination server begins reading the source file before the 2957 cnr_lease_time expires. If the cnr_lease_time expires while the 2958 destination server is still reading the source file, the 2959 destination server is allowed to finish reading the file. 2961 o The client has not issued a COPY_REVOKE for the same combination 2962 of user, filehandle, and destination server. 2964 The cnr_lease_time is chosen by the source server. A cnr_lease_time 2965 of 0 (zero) indicates an infinite lease. To avoid the need for 2966 synchronized clocks, copy lease times are granted by the server as a 2967 time delta. To renew the copy lease time the client should resend 2968 the same copy notification request to the source server. 2970 A successful response will also contain a list of netloc4 network 2971 location formats called cnr_source_server, on which the source is 2972 willing to accept connections from the destination. These might not 2973 be reachable from the client and might be located on networks to 2974 which the client has no connection. 2976 If the client wishes to perform an inter-server copy, the client MUST 2977 send a COPY_NOTIFY to the source server. Therefore, the source 2978 server MUST support COPY_NOTIFY. 2980 For a copy only involving one server (the source and destination are 2981 on the same server), this operation is unnecessary. 2983 14.4. Operation 62: OFFLOAD_REVOKE - Revoke a destination server's copy 2984 privileges 2986 14.4.1. ARGUMENT 2988 struct OFFLOAD_REVOKE4args { 2989 /* CURRENT_FH: source file */ 2990 netloc4 ora_destination_server; 2991 }; 2993 14.4.2. RESULT 2995 struct OFFLOAD_REVOKE4res { 2996 nfsstat4 orr_status; 2997 }; 2999 14.4.3. DESCRIPTION 3001 This operation is used for an inter-server copy. A client sends this 3002 operation in a COMPOUND request to the source server to revoke the 3003 authorization of a destination server identified by 3004 ora_destination_server from reading the file specified by CURRENT_FH 3005 on behalf of given user. If the ora_destination_server has already 3006 begun copying the file, a successful return from this operation 3007 indicates that further access will be prevented. 3009 The ora_destination_server MUST be specified using the netloc4 3010 network location format. The server is not required to resolve the 3011 ora_destination_server address before completing this operation. 3013 The client uses OFFLOAD_ABORT to inform the destination to stop the 3014 active transfer and OFFLOAD_REVOKE to inform the source to not allow 3015 any more copy requests from the destination. The OFFLOAD_REVOKE 3016 operation is also useful in situations in which the source server 3017 granted a very long or infinite lease on the destination server's 3018 ability to read the source file and all copy operations on the source 3019 file have been completed. 3021 For a copy only involving one server (the source and destination are 3022 on the same server), this operation is unnecessary. 
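As a non-normative summary of the control flow described in these sections, the sketch below (Python; purely illustrative and not a client API) lists which server receives which operation during an inter-server copy, including the eventual revocation of the destination's read authorization. The netloc strings and the plan representation are assumptions made for the example.

   def inter_server_copy_plan(src_netloc, dst_netloc, cnr_source_server):
       # Ordered (target server, operation) pairs for one inter-server
       # copy that the client later revokes.
       return [
           (src_netloc, "COPY_NOTIFY(cna_destination_server=" + dst_netloc + ")"),
           (dst_netloc, "COPY(ca_source_server=" + ",".join(cnr_source_server) + ")"),
           # While an asynchronous copy runs, OFFLOAD_STATUS and, if the
           # copy must be cancelled, OFFLOAD_ABORT go to the destination.
           (dst_netloc, "OFFLOAD_STATUS(osa_stateid=<copy stateid>)"),
           # Once no further copies are wanted, the source server's
           # authorization of the destination is withdrawn.
           (src_netloc, "OFFLOAD_REVOKE(ora_destination_server=" + dst_netloc + ")"),
       ]

   for target, op in inter_server_copy_plan("source.example.com",
                                            "dest.example.com",
                                            ["source.example.com"]):
       print(target, "<--", op)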
3024 If the server supports COPY_NOTIFY, the server is REQUIRED to support 3025 the OFFLOAD_REVOKE operation. 3027 14.5. Operation 63: OFFLOAD_STATUS - Poll for status of a server-side 3028 copy 3030 14.5.1. ARGUMENT 3032 struct OFFLOAD_STATUS4args { 3033 /* CURRENT_FH: destination file */ 3034 stateid4 osa_stateid; 3035 }; 3037 14.5.2. RESULT 3039 struct OFFLOAD_STATUS4resok { 3040 length4 osr_bytes_copied; 3041 nfsstat4 osr_complete<1>; 3042 }; 3044 union OFFLOAD_STATUS4res switch (nfsstat4 osr_status) { 3045 case NFS4_OK: 3046 OFFLOAD_STATUS4resok osr_resok4; 3047 default: 3048 void; 3049 }; 3051 14.5.3. DESCRIPTION 3053 OFFLOAD_STATUS is used for both intra- and inter-server asynchronous 3054 copies. The OFFLOAD_STATUS operation allows the client to poll the 3055 destination server to determine the status of an asynchronous copy 3056 operation. 3058 If this operation is successful, the number of bytes copied are 3059 returned to the client in the osr_bytes_copied field. The 3060 osr_bytes_copied value indicates the number of bytes copied but not 3061 which specific bytes have been copied. 3063 If the optional osr_complete field is present, the copy has 3064 completed. In this case the status value indicates the result of the 3065 asynchronous copy operation. In all cases, the server will also 3066 deliver the final results of the asynchronous copy in a CB_OFFLOAD 3067 operation. 3069 The failure of this operation does not indicate the result of the 3070 asynchronous copy in any way. 3072 If the server supports asynchronous copies, the server is REQUIRED to 3073 support the OFFLOAD_STATUS operation. 3075 14.6. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID 3077 14.6.1. ARGUMENT 3079 /* new */ 3080 const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004; 3082 14.6.2. RESULT 3084 Unchanged 3086 14.6.3. MOTIVATION 3088 Enterprise applications require guarantees that an operation has 3089 either aborted or completed. NFSv4.1 provides this guarantee as long 3090 as the session is alive: simply send a SEQUENCE operation on the same 3091 slot with a new sequence number, and the successful return of 3092 SEQUENCE indicates the previous operation has completed. However, if 3093 the session is lost, there is no way to know when any in progress 3094 operations have aborted or completed. In hindsight, the NFSv4.1 3095 specification should have mandated that DESTROY_SESSION either abort 3096 or complete all outstanding operations. 3098 14.6.4. DESCRIPTION 3100 A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability 3101 when it sends an EXCHANGE_ID operation. The server SHOULD set this 3102 capability in the EXCHANGE_ID reply whether the client requests it or 3103 not. It is the server's return that determines whether this 3104 capability is in effect. When it is in effect, the following will 3105 occur: 3107 o The server will not reply to any DESTROY_SESSION invoked with the 3108 client ID until all operations in progress are completed or 3109 aborted. 3111 o The server will not reply to subsequent EXCHANGE_ID invoked on the 3112 same client owner with a new verifier until all operations in 3113 progress on the client ID's session are completed or aborted. 3115 o The NFS server SHOULD support client ID trunking, and if it does 3116 and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a 3117 session ID created on one node of the storage cluster MUST be 3118 destroyable via DESTROY_SESSION. 
In addition, DESTROY_CLIENTID 3119 and an EXCHANGE_ID with a new verifier affects all sessions 3120 regardless what node the sessions were created on. 3122 14.7. Operation 64: WRITE_PLUS 3123 14.7.1. ARGUMENT 3125 struct data_info4 { 3126 offset4 di_offset; 3127 length4 di_length; 3128 bool di_allocated; 3129 }; 3131 struct data4 { 3132 offset4 d_offset; 3133 bool d_allocated; 3134 opaque d_data<>; 3135 }; 3137 union write_plus_arg4 switch (data_content4 wpa_content) { 3138 case NFS4_CONTENT_DATA: 3139 data4 wpa_data; 3140 case NFS4_CONTENT_APP_DATA_HOLE: 3141 app_data_hole4 wpa_adh; 3142 case NFS4_CONTENT_HOLE: 3143 data_info4 wpa_hole; 3144 default: 3145 void; 3146 }; 3148 struct WRITE_PLUS4args { 3149 /* CURRENT_FH: file */ 3150 stateid4 wp_stateid; 3151 stable_how4 wp_stable; 3152 write_plus_arg4 wp_data; 3153 }; 3155 14.7.2. RESULT 3157 struct write_response4 { 3158 stateid4 wr_callback_id<1>; 3159 count4 wr_count; 3160 stable_how4 wr_committed; 3161 verifier4 wr_writeverf; 3162 }; 3163 union WRITE_PLUS4res switch (nfsstat4 wp_status) { 3164 case NFS4_OK: 3165 write_response4 wp_resok4; 3166 default: 3167 void; 3168 }; 3170 14.7.3. DESCRIPTION 3172 The WRITE_PLUS operation is an extension of the NFSv4.1 WRITE 3173 operation (see Section 18.2 of [RFC5661]) and writes data to the 3174 regular file identified by the current filehandle. The server MAY 3175 write fewer bytes than requested by the client. 3177 The WRITE_PLUS argument is comprised of an array of rpr_contents, 3178 each of which describe a data_content4 type of data (Section 7.1.2). 3179 For NFSv4.2, the allowed values are data, ADH, and hole. The array 3180 contents MUST be contiguous in the file. A successful WRITE_PLUS 3181 will construct a reply for wr_count, wr_committed, and wr_writeverf 3182 as per the NFSv4.1 WRITE operation results. If wr_callback_id is 3183 set, it indicates an asynchronous reply (see Section 14.7.3.4). 3185 WRITE_PLUS has to support all of the errors which are returned by 3186 WRITE plus NFS4ERR_UNION_NOTSUPP. If the client asks for a hole and 3187 the server does not support that arm of the discriminated union, but 3188 does support one or more additional arms, it can signal to the client 3189 that it supports the operation, but not the arm with 3190 NFS4ERR_UNION_NOTSUPP. 3192 If the client supports WRITE_PLUS and any arm of the discriminated 3193 union other than NFS4_CONTENT_DATA, it MUST support CB_OFFLOAD. 3195 14.7.3.1. Data 3197 The d_offset specifies the offset where the data should be written. 3198 An d_offset of zero specifies that the write should start at the 3199 beginning of the file. The d_count, as encoded as part of the opaque 3200 data parameter, represents the number of bytes of data that are to be 3201 written. If the d_count is zero, the WRITE_PLUS will succeed and 3202 return a d_count of zero subject to permissions checking. 3204 Note that d_allocated has no meaning for WRITE_PLUS. 3206 The data MUST be written synchronously and MUST follow the same 3207 semantics of COMMIT as does the WRITE operation. 3209 14.7.3.2. Hole punching 3211 Whenever a client wishes to zero the blocks backing a particular 3212 region in the file, it calls the WRITE_PLUS operation with the 3213 current filehandle set to the filehandle of the file in question, and 3214 the equivalent of start offset and length in bytes of the region set 3215 in wpa_hole.di_offset and wpa_hole.di_length respectively. 
If the 3216 wpa_hole.di_allocated is set to TRUE, then the blocks will be zeroed 3217 and if it is set to FALSE, then they will be deallocated. All 3218 further reads to this region MUST return zeros until overwritten. 3219 The filehandle specified must be that of a regular file. 3221 Situations may arise where di_offset and/or di_offset + di_length 3222 will not be aligned to a boundary that the server does allocations/ 3223 deallocations in. For most file systems, this is the block size of 3224 the file system. In such a case, the server can deallocate as many 3225 bytes as it can in the region. The blocks that cannot be deallocated 3226 MUST be zeroed. Except for the block deallocation and maximum hole 3227 punching capability, a WRITE_PLUS operation is to be treated similar 3228 to a write of zeroes. 3230 The server is not required to complete deallocating the blocks 3231 specified in the operation before returning. The server SHOULD 3232 return an asynchronous result if it can determine the operation will 3233 be long running (see Section 14.7.3.4). 3235 If used to hole punch, WRITE_PLUS will result in the space_used 3236 attribute being decreased by the number of bytes that were 3237 deallocated. The space_freed attribute may or may not decrease, 3238 depending on the support and whether the blocks backing the specified 3239 range were shared or not. The size attribute will remain unchanged. 3241 The WRITE_PLUS operation MUST NOT change the space reservation 3242 guarantee of the file. While the server can deallocate the blocks 3243 specified by di_offset and di_length, future writes to this region 3244 MUST NOT fail with NFSERR_NOSPC. 3246 14.7.3.3. ADHs 3248 If the server supports ADHs, then it MUST support the 3249 NFS4_CONTENT_APP_DATA_HOLE arm of the WRITE_PLUS operation. The 3250 server has no concept of the structure imposed by the application. 3251 It is only when the application writes to a section of the file does 3252 order get imposed. In order to detect corruption even before the 3253 application utilizes the file, the application will want to 3254 initialize a range of ADHs using WRITE_PLUS. 3256 For ADHs, when the client invokes the WRITE_PLUS operation, it has 3257 two desired results: 3259 1. The structure described by the app_data_block4 be imposed on the 3260 file. 3262 2. The contents described by the app_data_block4 be sparse. 3264 If the server supports the WRITE_PLUS operation, it still might not 3265 support sparse files. So if it receives the WRITE_PLUS operation, 3266 then it MUST populate the contents of the file with the initialized 3267 ADHs. The server SHOULD return an asynchronous result if it can 3268 determine the operation will be long running (see Section 14.7.3.4). 3270 If the data was already initialized, there are two interesting 3271 scenarios: 3273 1. The data blocks are allocated. 3275 2. Initializing in the middle of an existing ADH. 3277 If the data blocks were already allocated, then the WRITE_PLUS is a 3278 hole punch operation. If WRITE_PLUS supports sparse files, then the 3279 data blocks are to be deallocated. If not, then the data blocks are 3280 to be rewritten in the indicated ADH format. 3282 Since the server has no knowledge of ADHs, it should not report 3283 misaligned creation of ADHs. Even while it can detect them, it 3284 cannot disallow them, as the application might be in the process of 3285 changing the size of the ADHs. Thus the server must be prepared to 3286 handle an WRITE_PLUS into an existing ADH. 
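The following non-normative sketch (Python; the dictionary encoding is illustrative, an implementation would XDR-encode the same fields) shows the NFS4_CONTENT_HOLE arguments a client might build to punch a hole, as described in Section 14.7.3.2. The field names mirror WRITE_PLUS4args, write_plus_arg4, and data_info4 above; FILE_SYNC4 is one of the NFSv4.1 stable_how4 values.

   def build_hole_punch(stateid, offset, length, zero_only=False):
       return {
           "wp_stateid": stateid,
           "wp_stable":  "FILE_SYNC4",
           "wp_data": {
               "wpa_content": "NFS4_CONTENT_HOLE",
               "wpa_hole": {
                   "di_offset":    offset,
                   "di_length":    length,
                   # TRUE  -> the blocks are zeroed but remain allocated
                   # FALSE -> the blocks are deallocated; reads of the
                   #          region still return zeros
                   "di_allocated": zero_only,
               },
           },
       }

   # Deallocate 64K starting at offset 4096; space_used decreases, the
   # size attribute is unchanged, and the space reservation guarantee
   # (if any) is preserved.
   args = build_hole_punch("stateid-w", offset=4096, length=65536)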
3288 This document does not mandate the manner in which the server stores 3289 ADHs sparsely for a file. However, if an WRITE_PLUS arrives that 3290 will force a new ADH to start inside an existing ADH then the server 3291 will have three ADHs instead of two. It will have one up to the new 3292 one for the WRITE_PLUS, one for the WRITE_PLUS, and one for after the 3293 WRITE_PLUS. Note that depending on server specific policies for 3294 block allocation, there may also be some physical blocks allocated to 3295 align the boundaries. 3297 14.7.3.4. Asynchronous Transactions 3299 Both hole punching and ADH initialization may lead to server 3300 determining to service the operation asynchronously. If it decides 3301 to do so, it sets the stateid in wr_callback_id to be that of the 3302 wp_stateid. If it does not set the wr_callback_id, then the result 3303 is synchronous. 3305 When the client determines that the reply will be given 3306 asynchronously, it should not assume anything about the contents of 3307 what it wrote until it is informed by the server that the operation 3308 is complete. It can use OFFLOAD_STATUS (Section 14.5) to monitor the 3309 operation and OFFLOAD_ABORT (Section 14.2) to cancel the operation. 3310 An example of a asynchronous WRITE_PLUS is shown in Figure 6. Note 3311 that as with the COPY operation, WRITE_PLUS must provide a stateid 3312 for tracking the asynchronous operation. 3314 Client Server 3315 + + 3316 | | 3317 |--- OPEN ---------------------------->| Client opens 3318 |<------------------------------------/| the file 3319 | | 3320 |--- WRITE_PLUS ---------------------->| Client punches 3321 |<------------------------------------/| a hole 3322 | | 3323 | | 3324 |--- OFFLOAD_STATUS ------------------>| Client may poll 3325 |<------------------------------------/| for status 3326 | | 3327 | . | Multiple OFFLOAD_STATUS 3328 | . | operations may be sent. 3329 | . | 3330 | | 3331 |<-- CB_OFFLOAD -----------------------| Server reports results 3332 |\------------------------------------>| 3333 | | 3334 |--- CLOSE --------------------------->| Client closes 3335 |<------------------------------------/| the file 3336 | | 3337 | | 3339 Figure 6: An asynchronous WRITE_PLUS. 3341 When CB_OFFLOAD informs the client of the successful WRITE_PLUS, the 3342 write_response4 embedded in the operation will provide the necessary 3343 information that a synchronous WRITE_PLUS would have provided. 3345 Regardless of whether the operation is asynchronous or synchronous, 3346 it MUST still support the COMMIT operation semantics as outlined in 3347 Section 18.3 of [RFC5661]. I.e., COMMIT works on one or more WRITE 3348 operations and the WRITE_PLUS operation can appear as several WRITE 3349 operations to the server. The client can use locking operations to 3350 control the behavior on the server with respect to long running 3351 asynchronous write operations. 3353 14.8. Operation 67: IO_ADVISE - Application I/O access pattern hints 3355 14.8.1. 
ARGUMENT 3357 enum IO_ADVISE_type4 { 3358 IO_ADVISE4_NORMAL = 0, 3359 IO_ADVISE4_SEQUENTIAL = 1, 3360 IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2, 3361 IO_ADVISE4_RANDOM = 3, 3362 IO_ADVISE4_WILLNEED = 4, 3363 IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5, 3364 IO_ADVISE4_DONTNEED = 6, 3365 IO_ADVISE4_NOREUSE = 7, 3366 IO_ADVISE4_READ = 8, 3367 IO_ADVISE4_WRITE = 9, 3368 IO_ADVISE4_INIT_PROXIMITY = 10 3369 }; 3371 struct IO_ADVISE4args { 3372 /* CURRENT_FH: file */ 3373 stateid4 iar_stateid; 3374 offset4 iar_offset; 3375 length4 iar_count; 3376 bitmap4 iar_hints; 3377 }; 3379 14.8.2. RESULT 3381 struct IO_ADVISE4resok { 3382 bitmap4 ior_hints; 3383 }; 3385 union IO_ADVISE4res switch (nfsstat4 _status) { 3386 case NFS4_OK: 3387 IO_ADVISE4resok resok4; 3388 default: 3389 void; 3390 }; 3392 14.8.3. DESCRIPTION 3394 The IO_ADVISE operation sends an I/O access pattern hint to the 3395 server for the owner of the stateid for a given byte range specified 3396 by iar_offset and iar_count. The byte range specified by iar_offset 3397 and iar_count need not currently exist in the file, but the iar_hints 3398 will apply to the byte range when it does exist. If iar_count is 0, 3399 all data following iar_offset is specified. The server MAY ignore 3400 the advice. 3402 The following are the allowed hints for a stateid holder: 3404 IO_ADVISE4_NORMAL There is no advice to give, this is the default 3405 behavior. 3407 IO_ADVISE4_SEQUENTIAL Expects to access the specified data 3408 sequentially from lower offsets to higher offsets. 3410 IO_ADVISE4_SEQUENTIAL_BACKWARDS Expects to access the specified data 3411 sequentially from higher offsets to lower offsets. 3413 IO_ADVISE4_RANDOM Expects to access the specified data in a random 3414 order. 3416 IO_ADVISE4_WILLNEED Expects to access the specified data in the near 3417 future. 3419 IO_ADVISE4_WILLNEED_OPPORTUNISTIC Expects to possibly access the 3420 data in the near future. This is a speculative hint, and 3421 therefore the server should prefetch data or indirect blocks only 3422 if it can be done at a marginal cost. 3424 IO_ADVISE_DONTNEED Expects that it will not access the specified 3425 data in the near future. 3427 IO_ADVISE_NOREUSE Expects to access the specified data once and then 3428 not reuse it thereafter. 3430 IO_ADVISE4_READ Expects to read the specified data in the near 3431 future. 3433 IO_ADVISE4_WRITE Expects to write the specified data in the near 3434 future. 3436 IO_ADVISE4_INIT_PROXIMITY Informs the server that the data in the 3437 byte range remains important to the client. 3439 Since IO_ADVISE is a hint, a server SHOULD NOT return an error and 3440 invalidate a entire Compound request if one of the sent hints in 3441 iar_hints is not supported by the server. Also, the server MUST NOT 3442 return an error if the client sends contradictory hints to the 3443 server, e.g., IO_ADVISE4_SEQUENTIAL and IO_ADVISE4_RANDOM in a single 3444 IO_ADVISE operation. In these cases, the server MUST return success 3445 and a ior_hints value that indicates the hint it intends to 3446 implement. This may mean simply returning IO_ADVISE4_NORMAL. 3448 The ior_hints returned by the server is primarily for debugging 3449 purposes since the server is under no obligation to carry out the 3450 hints that it describes in the ior_hints result. 
In addition, while 3451 the server may have intended to implement the hints returned in 3452 ior_hints, as time progresses, the server may need to change its 3453 handling of a given file due to several reasons including, but not 3454 limited to, memory pressure, additional IO_ADVISE hints sent by other 3455 clients, and heuristically detected file access patterns. 3457 The server MAY return different advice than what the client 3458 requested. If it does, then this might be due to one of several 3459 conditions, including, but not limited to another client advising of 3460 a different I/O access pattern; a different I/O access pattern from 3461 another client that that the server has heuristically detected; or 3462 the server is not able to support the requested I/O access pattern, 3463 perhaps due to a temporary resource limitation. 3465 Each issuance of the IO_ADVISE operation overrides all previous 3466 issuances of IO_ADVISE for a given byte range. This effectively 3467 follows a strategy of last hint wins for a given stateid and byte 3468 range. 3470 Clients should assume that hints included in an IO_ADVISE operation 3471 will be forgotten once the file is closed. 3473 14.8.4. IMPLEMENTATION 3475 The NFS client may choose to issue an IO_ADVISE operation to the 3476 server in several different instances. 3478 The most obvious is in direct response to an application's execution 3479 of posix_fadvise(). In this case, IO_ADVISE4_WRITE and 3480 IO_ADVISE4_READ may be set based upon the type of file access 3481 specified when the file was opened. 3483 14.8.5. IO_ADVISE4_INIT_PROXIMITY 3485 The IO_ADVISE4_INIT_PROXIMITY hint is non-posix in origin and conveys 3486 that the client has recently accessed the byte range in its own 3487 cache. I.e., it has not accessed it on the server, but it has 3488 locally. When the server reaches resource exhaustion, knowing which 3489 data is more important allows the server to make better choices about 3490 which data to, for example purge from a cache, or move to secondary 3491 storage. It also informs the server which delegations are more 3492 important, since if delegations are working correctly, once delegated 3493 to a client and the client has read the content for that byte range, 3494 a server might never receive another read request for that byte 3495 range. 3497 This hint is also useful in the case of NFS clients which are network 3498 booting from a server. If the first client to be booted sends this 3499 hint, then it keeps the cache warm for the remaining clients. 3501 14.8.6. pNFS File Layout Data Type Considerations 3503 The IO_ADVISE considerations for pNFS are very similar to the COMMIT 3504 considerations for pNFS. That is, as with COMMIT, some NFS server 3505 implementations prefer IO_ADVISE be done on the DS, and some prefer 3506 it be done on the MDS. 3508 So for the file's layout type, it is proposed that NFSv4.2 include an 3509 additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on 3510 NFSv4.2 or higher. Any file's layout obtained with NFSv4.1 MUST NOT 3511 have NFL42_UFLG_IO_ADVISE_THRU_MDS set. Any file's layout obtained 3512 with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set. If the 3513 client does not implement IO_ADVISE, then it MUST ignore 3514 NFL42_UFLG_IO_ADVISE_THRU_MDS. 3516 If NFL42_UFLG_IO_ADVISE_THRU_MDS is set, the client MUST send the 3517 IO_ADVISE operation to the MDS in order for it to be honored by the 3518 DS. 
Once the MDS receives the IO_ADVISE operation, it will 3519 communicate the advice to each DS. 3521 If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then the client SHOULD 3522 send an IO_ADVISE operation to the appropriate DS for the specified 3523 byte range. While the client MAY always send IO_ADVISE to the MDS, 3524 if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the client 3525 should expect that such an IO_ADVISE is futile. Note that a client 3526 SHOULD use the same set of arguments on each IO_ADVISE sent to a DS 3527 for the same open file reference. 3529 The server is not required to support different advice for different 3530 DS's with the same open file reference. 3532 14.8.6.1. Dense and Sparse Packing Considerations 3534 The IO_ADVISE operation MUST use the iar_offset and byte range as 3535 dictated by the presence or absence of NFL4_UFLG_DENSE. 3537 E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS 3538 for iar_offset 0 really means iar_offset 10000 in the logical file, 3539 then an IO_ADVISE for iar_offset 0 means iar_offset 10000. 3541 E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS 3542 for iar_offset 0 really means iar_offset 0 in the logical file, then 3543 an IO_ADVISE for iar_offset 0 means iar_offset 0 in the logical file. 3545 E.g., if NFL4_UFLG_DENSE is present, the stripe unit is 1000 bytes 3546 and the stripe count is 10, and the dense DS file is serving 3547 iar_offset 0. A READ or WRITE to the DS for iar_offsets 0, 1000, 3548 2000, and 3000, really mean iar_offsets 10000, 20000, 30000, and 3549 40000 (implying a stripe count of 10 and a stripe unit of 1000), then 3550 an IO_ADVISE sent to the same DS with an iar_offset of 500, and a 3551 iar_count of 3000 means that the IO_ADVISE applies to these byte 3552 ranges of the dense DS file: 3554 - 500 to 999 3555 - 1000 to 1999 3556 - 2000 to 2999 3557 - 3000 to 3499 3559 I.e., the contiguous range 500 to 3499 as specified in IO_ADVISE. 3561 It also applies to these byte ranges of the logical file: 3563 - 10500 to 10999 (500 bytes) 3564 - 20000 to 20999 (1000 bytes) 3565 - 30000 to 30999 (1000 bytes) 3566 - 40000 to 40499 (500 bytes) 3567 (total 3000 bytes) 3569 E.g., if NFL4_UFLG_DENSE is absent, the stripe unit is 250 bytes, the 3570 stripe count is 4, and the sparse DS file is serving iar_offset 0. 3571 Then a READ or WRITE to the DS for iar_offsets 0, 1000, 2000, and 3572 3000, really mean iar_offsets 0, 1000, 2000, and 3000 in the logical 3573 file, keeping in mind that on the DS file,. byte ranges 250 to 999, 3574 1250 to 1999, 2250 to 2999, and 3250 to 3999 are not accessible. 
3575 Then an IO_ADVISE sent to the same DS with an iar_offset of 500 and 3576 an iar_count of 3000 means that the IO_ADVISE applies to these byte 3577 ranges of the logical file and the sparse DS file: 3579 - 500 to 999 (500 bytes) - no effect 3580 - 1000 to 1249 (250 bytes) - effective 3581 - 1250 to 1999 (750 bytes) - no effect 3582 - 2000 to 2249 (250 bytes) - effective 3583 - 2250 to 2999 (750 bytes) - no effect 3584 - 3000 to 3249 (250 bytes) - effective 3585 - 3250 to 3499 (250 bytes) - no effect 3586 (subtotal 2250 bytes) - no effect 3587 (subtotal 750 bytes) - effective 3588 (grand total 3000 bytes) - no effect + effective 3590 If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and 3591 NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request 3592 sent to the data server with a byte range that overlaps a stripe unit 3593 that the data server does not serve MUST NOT result in the status 3594 NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful, and 3595 if the server applies IO_ADVISE hints on any stripe units that 3596 overlap with the specified range, those hints SHOULD be indicated in 3597 the response. 3599 14.9. Changes to Operation 51: LAYOUTRETURN 3601 14.9.1. Introduction 3603 In the pNFS description provided in [RFC5661], the client is not 3604 able to relay an error code from the DS to the MDS. In the 3605 specification of the Objects-Based Layout protocol [RFC5664], use is 3606 made of the opaque lrf_body field of the LAYOUTRETURN argument to do 3607 such a relaying of error codes. In this section, we define a new 3608 data structure to enable the passing of error codes back to the MDS 3609 and provide some guidelines on what both the client and MDS should 3610 expect in such circumstances. 3612 There are two broad classes of errors: transient and persistent. The 3613 client SHOULD strive to only use this new mechanism to report 3614 persistent errors. It MUST be able to deal with transient issues by 3615 itself. Also, while the client might consider an issue to be 3616 persistent, it MUST be prepared for the MDS to consider such issues 3617 to be transient. A prime example of this is if the MDS fences off a 3618 client from either a stateid or a filehandle. The client will get an 3619 error from the DS and might relay either NFS4ERR_ACCESS or 3620 NFS4ERR_BAD_STATEID back to the MDS, with the belief that this is a 3621 hard error. If the MDS is informed by the client that there is an 3622 error, it can safely ignore that. For it, the mission is 3623 accomplished in that the client has returned a layout that the MDS 3624 had most likely recalled. 3626 The client might also need to inform the MDS that it cannot reach one 3627 or more of the DSes. While the MDS can detect the connectivity of 3628 both of these paths: 3630 o MDS to DS 3632 o MDS to client 3634 it cannot determine if the client and DS path is working. As with 3635 the case of the DS passing errors to the client, it must be prepared 3636 for the MDS to consider such outages as being transitory. 3638 The existing LAYOUTRETURN operation is extended by introducing a new 3639 data structure to report errors, layoutreturn_device_error4. Also, 3640 layoutreturn_error_report4 is introduced to enable an array of such 3641 errors to be reported. 3643 14.9.2.
ARGUMENT 3645 The ARGUMENT specification of the LAYOUTRETURN operation in section 3646 18.44.1 of [RFC5661] is augmented by the following XDR code 3647 [RFC4506]: 3649 struct layoutreturn_device_error4 { 3650 deviceid4 lrde_deviceid; 3651 nfsstat4 lrde_status; 3652 nfs_opnum4 lrde_opnum; 3653 }; 3655 struct layoutreturn_error_report4 { 3656 layoutreturn_device_error4 lrer_errors<>; 3657 }; 3659 14.9.3. RESULT 3661 The RESULT of the LAYOUTRETURN operation is unchanged; see section 3662 18.44.2 of [RFC5661]. 3664 14.9.4. DESCRIPTION 3666 The following text is added to the end of the LAYOUTRETURN operation 3667 DESCRIPTION in section 18.44.3 of [RFC5661]. 3669 When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, 3670 then if the lrf_body field is NULL, it indicates to the MDS that the 3671 client experienced no errors. If lrf_body is non-NULL, then the 3672 field references error information which is layout type specific. 3673 I.e., the Objects-Based Layout protocol can continue to utilize 3674 lrf_body as specified in [RFC5664]. For both Files-Based and Block- 3675 Based Layouts, the field references a layoutreturn_error_report4, 3676 which contains an array of layoutreturn_device_error4. 3678 Each individual layoutreturn_device_error4 describes a single error 3679 associated with a DS, which is identified via lrde_deviceid. The 3680 operation which returned the error is identified via lrde_opnum. 3681 Finally, the NFS error value (nfsstat4) encountered is provided via 3682 lrde_status and may consist of the following error codes: 3684 NFS4ERR_NXIO: The client was unable to establish any communication 3685 with the DS. 3687 NFS4ERR_*: The client was able to establish communication with the 3688 DS and is returning one of the allowed error codes for the 3689 operation denoted by lrde_opnum. 3691 14.9.5. IMPLEMENTATION 3693 The following text is added to the end of the LAYOUTRETURN operation 3694 IMPLEMENTATION in section 18.44.4 of [RFC5661]. 3696 Clients are expected to tolerate transient storage device errors, and 3697 hence clients SHOULD NOT use the LAYOUTRETURN error handling for 3698 device access problems that may be transient. The methods by which a 3699 client decides whether a device access problem is transient or 3700 persistent are implementation-specific, but may include retrying I/Os 3701 to a data server under appropriate conditions. 3703 When an I/O fails to a storage device, the client SHOULD retry the 3704 failed I/O via the MDS. In this situation, before retrying the I/O, 3705 the client SHOULD return the layout, or the affected portion thereof, 3706 and SHOULD indicate which storage device or devices were problematic. 3707 The client needs to do this when the DS is being unresponsive in 3708 order to fence off any failed write attempts and ensure that they do 3709 not end up overwriting any later data being written through the MDS. 3710 If the client does not do this, the MDS MAY issue a layout recall 3711 callback in order to perform the retried I/O. 3713 The client needs to be cognizant that since this error handling is 3714 optional in the MDS, the MDS may silently ignore this functionality. 3715 Also, as the MDS may consider some issues the client reports to be 3716 expected (see Section 14.9.1), the client might find it difficult to 3717 detect an MDS which has not implemented error handling via 3718 LAYOUTRETURN.
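As a non-normative illustration of the reporting path described above, the sketch below (Python; the dictionaries are illustrative, not an XDR encoding) accumulates persistent per-device errors and shapes them as a layoutreturn_error_report4 suitable for the lrf_body of a LAYOUTRETURN. The field names follow the XDR above; the device identifier, status, and opnum strings are placeholders for the corresponding deviceid4, nfsstat4, and nfs_opnum4 values.

   def build_error_report(failed_ios):
       # failed_ios: iterable of (deviceid, status, opnum) tuples, e.g.
       # ("dev-1", "NFS4ERR_NXIO", "OP_WRITE") when a DS is unreachable.
       return {
           "lrer_errors": [
               {"lrde_deviceid": dev, "lrde_status": status, "lrde_opnum": op}
               for dev, status, op in failed_ios
           ]
       }

   # Report that WRITEs to one data server failed persistently; per the
   # guidance above, transient problems should be retried by the client
   # rather than reported through LAYOUTRETURN.
   report = build_error_report([("dev-1", "NFS4ERR_NXIO", "OP_WRITE")])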
3720 If an MDS is aware that a storage device is proving problematic to a 3721 client, the MDS SHOULD NOT include that storage device in any pNFS 3722 layouts sent to that client. If the MDS is aware that a storage 3723 device is affecting many clients, then the MDS SHOULD NOT include 3724 that storage device in any pNFS layouts sent out. If a client asks 3725 for a new layout for the file from the MDS, it MUST be prepared for 3726 the MDS to return that storage device in the layout. The MDS might 3727 not have any choice in using the storage device, i.e., there might 3728 only be one possible layout for the system. Also, in the case of 3729 existing files, the MDS might have no choice in which storage devices 3730 to hand out to clients. 3732 The MDS is not required to indefinitely retain per-client storage 3733 device error information. An MDS is also not required to 3734 automatically reinstate use of a previously problematic storage 3735 device; administrative intervention may be required instead. 3737 14.10. Operation 65: READ_PLUS 3739 14.10.1. ARGUMENT 3741 struct READ_PLUS4args { 3742 /* CURRENT_FH: file */ 3743 stateid4 rpa_stateid; 3744 offset4 rpa_offset; 3745 count4 rpa_count; 3746 }; 3748 14.10.2. RESULT 3750 struct data_info4 { 3751 offset4 di_offset; 3752 length4 di_length; 3753 bool di_allocated; 3754 }; 3756 struct data4 { 3757 offset4 d_offset; 3758 bool d_allocated; 3759 opaque d_data<>; 3760 }; 3761 union read_plus_content switch (data_content4 rpc_content) { 3762 case NFS4_CONTENT_DATA: 3763 data4 rpc_data; 3764 case NFS4_CONTENT_APP_DATA_HOLE: 3765 app_data_hole4 rpc_adh; 3766 case NFS4_CONTENT_HOLE: 3767 data_info4 rpc_hole; 3768 default: 3769 void; 3770 }; 3772 /* 3773 * Allow a return of an array of contents. 3774 */ 3775 struct read_plus_res4 { 3776 bool rpr_eof; 3777 read_plus_content rpr_contents<>; 3778 }; 3780 union READ_PLUS4res switch (nfsstat4 rp_status) { 3781 case NFS4_OK: 3782 read_plus_res4 rp_resok4; 3783 default: 3784 void; 3785 }; 3787 14.10.3. DESCRIPTION 3789 The READ_PLUS operation is based upon the NFSv4.1 READ operation (see 3790 Section 18.22 of [RFC5661]) and similarly reads data from the regular 3791 file identified by the current filehandle. 3793 The client provides a rpa_offset of where the READ_PLUS is to start 3794 and a rpa_count of how many bytes are to be read. A rpa_offset of 3795 zero means to read data starting at the beginning of the file. If 3796 rpa_offset is greater than or equal to the size of the file, the 3797 status NFS4_OK is returned with di_length (the data length) set to 3798 zero and eof set to TRUE. 3800 The READ_PLUS result is comprised of an array of rpr_contents, each 3801 of which describe a data_content4 type of data (Section 7.1.2). For 3802 NFSv4.2, the allowed values are data, ADH, and hole. A server is 3803 required to support the data type, but neither ADH nor hole. Both an 3804 ADH and a hole must be returned in its entirety - clients must be 3805 prepared to get more information than they requested. Both the start 3806 and the end of the hole may exceed what was requested. The array 3807 contents MUST be contiguous in the file. 3809 If the data to be returned is comprised entirely of zeros, then the 3810 server may elect to return that data as a hole. The server 3811 differentiates this to the client by setting di_allocated to TRUE in 3812 this case. 
   Note that in such a scenario, the server is not required to
   determine the full extent of the "hole" - it does not need to
   determine where the zeros start and end.  If the server elects to
   return the hole as data, then it can set d_allocated to FALSE in the
   rpc_data to indicate that the underlying range is a hole.

   The server may elect to return adjacent elements of the same type.
   For example, the guard pattern or block size of an ADH might change,
   which would require adjacent elements of type ADH.  Likewise, if the
   server has a range of data comprised entirely of zeros and then a
   hole, it might want to return two adjacent holes to the client.

   If the client specifies an rpa_count value of zero, the READ_PLUS
   succeeds and returns zero bytes of data.  In all situations, the
   server may choose to return fewer bytes than specified by the
   client.  The client needs to check for this condition and handle it
   appropriately.

   If the client specifies an rpa_offset and rpa_count value that is
   entirely contained within a hole of the file, then the di_offset and
   di_length returned must be for the entire hole.  This result is
   considered valid until the file is changed (detected via the change
   attribute).  The server MUST provide the same semantics for the hole
   as if the client read the region and received zeroes; the lifetime
   of the implied hole's contents MUST be exactly the same as that of
   any other read data.

   If the client specifies an rpa_offset and rpa_count value that
   begins in a non-hole of the file but extends into a hole, the server
   should return an array comprised of both data and a hole.  The
   client MUST be prepared for the server to return a short read
   describing just the data.  The client will then issue another
   READ_PLUS for the remaining bytes, to which the server will respond
   with information about the hole in the file.

   Except when special stateids are used, the stateid value for a
   READ_PLUS request represents a value returned from a previous byte-
   range lock or share reservation request or the stateid associated
   with a delegation.  The stateid identifies the associated owners if
   any and is used by the server to verify that the associated locks
   are still valid (e.g., have not been revoked).

   If the read ended at the end-of-file (formally, in a correctly
   formed READ_PLUS operation, if rpa_offset + rpa_count is equal to
   the size of the file), or the READ_PLUS operation extends beyond the
   size of the file (if rpa_offset + rpa_count is greater than the size
   of the file), eof is returned as TRUE; otherwise, it is FALSE.  A
   successful READ_PLUS of an empty file will always return eof as
   TRUE.

   If the current filehandle is not an ordinary file, an error will be
   returned to the client.  In the case that the current filehandle
   represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
   the current filehandle designates a symbolic link, NFS4ERR_SYMLINK
   is returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.

   For a READ_PLUS with a stateid value of all bits equal to zero, the
   server MAY allow the READ_PLUS to be serviced subject to mandatory
   byte-range locks or the current share deny modes for the file.  For
   a READ_PLUS with a stateid value of all bits equal to one, the
   server MAY allow READ_PLUS operations to bypass locking checks at
   the server.

   On success, the current filehandle retains its value.
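
   As a non-normative illustration of the result handling described
   above, the sketch below shows how a client might walk the
   rpr_contents array of a successful reply, copying returned data into
   its buffer and zero-filling any portion covered by a hole.  The C
   representations of the XDR results are simplified stand-ins (the
   on-the-wire encoding is defined by the XDR in Section 14.10.2), and
   ADH handling is omitted.

   #include <stdbool.h>
   #include <stdint.h>
   #include <string.h>

   typedef uint64_t offset4;
   typedef uint64_t length4;

   /* Wire values for data_content4 are defined in Section 7.1.2. */
   enum data_content4 { NFS4_CONTENT_DATA, NFS4_CONTENT_APP_DATA_HOLE,
                        NFS4_CONTENT_HOLE };

   struct data4      { offset4 d_offset; bool d_allocated;
                       uint32_t d_len; const uint8_t *d_val; };
   struct data_info4 { offset4 di_offset; length4 di_length;
                       bool di_allocated; };

   struct read_plus_content {
           enum data_content4 rpc_content;
           union { struct data4 rpc_data; struct data_info4 rpc_hole; } u;
   };

   struct read_plus_res4 {
           bool rpr_eof;
           uint32_t rpr_len;
           const struct read_plus_content *rpr_val;
   };

   /* Copy the portion of one reply that overlaps [off, off+cnt)
    * into buf; holes become runs of zero bytes. */
   static void consume_read_plus(const struct read_plus_res4 *res,
                                 offset4 off, uint64_t cnt, uint8_t *buf)
   {
           for (uint32_t i = 0; i < res->rpr_len; i++) {
                   const struct read_plus_content *c = &res->rpr_val[i];
                   offset4 start; uint64_t len;

                   if (c->rpc_content == NFS4_CONTENT_DATA) {
                           start = c->u.rpc_data.d_offset;
                           len   = c->u.rpc_data.d_len;
                   } else if (c->rpc_content == NFS4_CONTENT_HOLE) {
                           start = c->u.rpc_hole.di_offset;
                           len   = c->u.rpc_hole.di_length;
                   } else {
                           continue;     /* ADH handling not shown */
                   }

                   /* Clamp the element to the range the caller asked
                    * for; holes may extend well beyond the request. */
                   offset4 lo = start > off ? start : off;
                   offset4 hi = start + len < off + cnt ? start + len
                                                        : off + cnt;
                   if (hi <= lo)
                           continue;

                   if (c->rpc_content == NFS4_CONTENT_HOLE)
                           memset(buf + (lo - off), 0, hi - lo);
                   else
                           memcpy(buf + (lo - off),
                                  c->u.rpc_data.d_val + (lo - start),
                                  hi - lo);
           }
   }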
14.10.4.  IMPLEMENTATION

   In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of
   [RFC5661] also apply to READ_PLUS.  One delta is that when the owner
   has a locked byte range, the server MUST return an array of
   rpr_contents with values inside that range.

14.10.4.1.  Additional pNFS Implementation Information

   With pNFS, the semantics of using READ_PLUS remain the same.  Any
   data server MAY return a hole or ADH result for a READ_PLUS request
   that it receives.  When a data server chooses to return such a
   result, it has the option of returning information for the data
   stored on that data server (as defined by the data layout), but it
   MUST NOT return results for a byte range that includes data managed
   by another data server.

   A data server should do its best to return as much information about
   an ADH as is feasible without having to contact the metadata server.
   If communication with the metadata server is required, then every
   attempt should be made to minimize the number of requests.

   If mandatory locking is enforced, then the data server must also
   ensure that it returns only information that is within the owner's
   locked byte range.

14.10.5.  READ_PLUS with Sparse Files Example

   The following table describes a sparse file.  For each byte range,
   the file contains either non-zero data or a hole.  In addition, the
   server in this example uses a Hole Threshold of 32K.

                        +-------------+----------+
                        | Byte-Range  | Contents |
                        +-------------+----------+
                        | 0-15999     | Hole     |
                        | 16K-31999   | Non-Zero |
                        | 32K-255999  | Hole     |
                        | 256K-287999 | Non-Zero |
                        | 288K-353999 | Hole     |
                        | 354K-417999 | Non-Zero |
                        +-------------+----------+

                                  Table 5

   Under the given circumstances, if a client were to read from the
   file with a maximum read size of 64K, the following would be the
   results for the given READ_PLUS calls.  This assumes the client has
   already opened the file, acquired a valid stateid ('s' in the
   example), and just needs to issue READ_PLUS requests.

   1.  READ_PLUS(s, 0, 64K) --> NFS_OK, eof = false, <data[0, 32K],
       hole[32K, 224K]>.  Since the first hole is less than the
       server's Hole Threshold, the first 32K of the file is returned
       as data and the remaining 32K is returned as a hole which
       actually extends to 256K.

   2.  READ_PLUS(s, 32K, 64K) --> NFS_OK, eof = false,
       <hole[32K, 224K]>.  The requested range was all zeros, and the
       current hole begins at offset 32K and is 224K in length.  Note
       that the client should not have followed up the previous
       READ_PLUS request with this one, as the hole information from
       the previous call extended past what the client was requesting.

   3.  READ_PLUS(s, 256K, 64K) --> NFS_OK, eof = false,
       <data[256K, 32K], hole[288K, 66K]>.  Returns an array of the 32K
       data and the hole which extends to 354K.

   4.  READ_PLUS(s, 354K, 64K) --> NFS_OK, eof = true,
       <data[354K, 64K]>.  Returns the final 64K of data and informs
       the client there is no more data in the file.
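
   The example above also suggests a simple client strategy: instead of
   advancing by a fixed read size, advance past whatever extent the
   server just described, so that a hole larger than the read size is
   skipped in a single round trip.  The sketch below is purely
   illustrative and not part of the protocol; the read_plus() wrapper
   and the extent summary structure are hypothetical stand-ins for a
   real client's RPC machinery.

   #include <stdbool.h>
   #include <stdint.h>

   struct extent {
           bool     is_hole;      /* NFS4_CONTENT_HOLE vs. _DATA   */
           uint64_t offset;       /* di_offset or d_offset         */
           uint64_t length;       /* di_length or length of d_data */
   };

   /* Hypothetical wrapper: issues READ_PLUS(s, off, cnt) and returns
    * the last element of rpr_contents in *last and rpr_eof in *eof. */
   extern bool read_plus(uint64_t off, uint64_t cnt,
                         struct extent *last, bool *eof);

   static void scan_file(uint64_t max_read)
   {
           uint64_t off = 0;
           bool eof = false;

           while (!eof) {
                   struct extent last;

                   if (!read_plus(off, max_read, &last, &eof))
                           break;          /* error handling elided */

                   /* For the file in Table 5, the reply to (0, 64K)
                    * already describes bytes 0-256K, so the next
                    * request starts at 256K rather than at 64K. */
                   off = last.offset + last.length;
           }
   }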
14.11.  Operation 66: SEEK

   SEEK is an operation that allows a client to determine the location
   of the next data_content4 in a file.  It allows a client to
   implement the emerging SEEK_HOLE and SEEK_DATA extensions to
   lseek(2).

14.11.1.  ARGUMENT

   struct SEEK4args {
           /* CURRENT_FH: file */
           stateid4        sa_stateid;
           offset4         sa_offset;
           data_content4   sa_what;
   };

14.11.2.  RESULT

   union seek_content switch (data_content4 content) {
   case NFS4_CONTENT_DATA:
           data_info4      sc_data;
   case NFS4_CONTENT_APP_DATA_HOLE:
           app_data_hole4  sc_adh;
   case NFS4_CONTENT_HOLE:
           data_info4      sc_hole;
   default:
           void;
   };

   struct seek_res4 {
           bool            sr_eof;
           seek_content    sr_contents;
   };

   union SEEK4res switch (nfsstat4 status) {
   case NFS4_OK:
           seek_res4       resok4;
   default:
           void;
   };

14.11.3.  DESCRIPTION

   From the given sa_offset, find the next data_content4 of type
   sa_what in the file.  For either a hole or an ADH, this must return
   the data_content4 in its entirety.  For data, it must not return the
   actual data.

   SEEK must follow the same rules for stateids as READ_PLUS
   (Section 14.10.3).

   If the server could not find a corresponding sa_what, then the
   status would still be NFS4_OK, but sr_eof would be TRUE.  The
   sr_contents would contain a zeroed-out content of the appropriate
   type.

15.  NFSv4.2 Callback Operations

15.1.  Operation 15: CB_OFFLOAD - Report results of an asynchronous
       operation

15.1.1.  ARGUMENT

   struct write_response4 {
           stateid4        wr_callback_id<1>;
           count4          wr_count;
           stable_how4     wr_committed;
           verifier4       wr_writeverf;
   };

   union offload_info4 switch (nfsstat4 coa_status) {
   case NFS4_OK:
           write_response4 coa_resok4;
   default:
           length4         coa_bytes_copied;
   };

   struct CB_OFFLOAD4args {
           nfs_fh4         coa_fh;
           stateid4        coa_stateid;
           offload_info4   coa_offload_info;
   };

15.1.2.  RESULT

   struct CB_OFFLOAD4res {
           nfsstat4        cor_status;
   };

15.1.3.  DESCRIPTION

   CB_OFFLOAD is used to report to the client the results of an
   asynchronous operation, e.g., Server-side Copy or a hole punch.  The
   coa_fh and coa_stateid identify the transaction, and the coa_status
   indicates success or failure.  The coa_resok4.wr_callback_id MUST
   NOT be set.  If the transaction failed, then coa_bytes_copied
   contains the number of bytes copied before the failure occurred.
   The coa_bytes_copied value indicates the number of bytes copied but
   not which specific bytes have been copied.

   If the client supports either

   1.  the COPY operation

   2.  the WRITE_PLUS operation and any arm of the discriminated union
       other than NFS4_CONTENT_DATA

   then the client is REQUIRED to support the CB_OFFLOAD operation.

   There is a potential race between the reply to the original
   transaction on the forechannel and the CB_OFFLOAD callback on the
   backchannel.  Sections 2.10.6.3 and 20.9.3 of [RFC5661] describe how
   to handle this type of issue.

15.1.3.1.  Server-side Copy

   CB_OFFLOAD is used for both intra- and inter-server asynchronous
   copies.  This operation is sent by the destination server to the
   client in a CB_COMPOUND request.  Upon success, the
   coa_resok4.wr_count presents the total number of bytes copied.

15.1.3.2.  WRITE_PLUS

   CB_OFFLOAD is used to report the completion of either a hole punch
   or an ADH initialization.  Upon success, the coa_resok4 will contain
   the same information that a synchronous WRITE_PLUS would have
   returned.

16.  IANA Considerations

   This section uses terms that are defined in [RFC5226].
17.  References

17.1.  Normative References

   [NFSv42xdr]
              Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 External Data Representation Standard (XDR)
              Description", March 2013.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC5661]  Shepler, S., Eisler, M., and D. Noveck, "Network File
              System (NFS) Version 4 Minor Version 1 Protocol",
              RFC 5661, January 2010.

   [RFC5664]  Halevy, B., Welch, B., and J. Zelenka, "Object-Based
              Parallel NFS (pNFS) Operations", RFC 5664, January 2010.

   [posix_fadvise]
              The Open Group, "Section 'posix_fadvise()' of System
              Interfaces of The Open Group Base Specifications Issue 6,
              IEEE Std 1003.1, 2004 Edition", 2004.

   [rpcsec_gssv3]
              Adamson, W. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", October 2013.

17.2.  Informative References

   [Ashdown08]
              Ashdown, L., "Chapter 15, Validating Database Files and
              Backups, of Oracle Database Backup and Recovery User's
              Guide 11g Release 1 (11.1)", August 2008.

   [Baira08]  Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
              Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
              Corruption in the Storage Stack", Proceedings of the 6th
              USENIX Symposium on File and Storage Technologies (FAST
              '08), 2008.

   [FEDFS-ADMIN]
              Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M.
              Naik, "Administration Protocol for Federated
              Filesystems", draft-ietf-nfsv4-federated-fs-admin (Work
              In Progress), 2010.

   [FEDFS-NSDB]
              Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M.
              Naik, "NSDB Protocol for Federated Filesystems",
              draft-ietf-nfsv4-federated-fs-protocol (Work In
              Progress), 2010.

   [Haynes13]
              Haynes, T., "Requirements for Labeled NFS",
              draft-ietf-nfsv4-labreqs-04 (Work In Progress), 2013.

   [I-D.ietf-nfsv4-rfc3530bis]
              Haynes, T. and D. Noveck, "Network File System (NFS)
              version 4 Protocol", draft-ietf-nfsv4-rfc3530bis-25 (Work
              In Progress), February 2013.

   [IESG08]   IESG, "IESG Processing of RFC Errata for the IETF
              Stream", 2008.

   [MLS]      "Section 46.6. Multi-Level Security (MLS) of Deployment
              Guide: Deployment, configuration and administration of
              Red Hat Enterprise Linux 5, Edition 6", 2011.

   [McDougall07]
              McDougall, R. and J. Mauro, "Section 11.4.3, Detecting
              Memory Corruption of Solaris Internals", 2007.

   [Quigley11]
              Quigley, D. and J. Lu, "Registry Specification for MAC
              Security Label Formats",
              draft-quigley-label-format-registry (Work In Progress),
              2011.

   [RFC0959]  Postel, J. and J. Reynolds, "File Transfer Protocol",
              STD 9, RFC 959, October 1985.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC4506]  Eisler, M., "XDR: External Data Representation Standard",
              STD 67, RFC 4506, May 2006.

   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", BCP 26, RFC 5226,
              May 2008.
   [Strohm11]
              Strohm, R., "Chapter 2, Data Blocks, Extents, and
              Segments, of Oracle Database Concepts 11g Release 1
              (11.1)", January 2011.

Appendix A.  Acknowledgments

   For the pNFS Access Permissions Check, the original draft was by
   Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow.  The
   work was influenced by discussions with Benny Halevy and Bruce
   Fields.  A review was done by Tom Haynes.

   For the Sharing change attribute implementation details with NFSv4
   clients, the original draft was by Trond Myklebust.

   For the NFS Server-side Copy, the original draft was by James
   Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul
   Iyer.  Tom Talpey co-authored an unpublished version of that
   document.  It was also reviewed by a number of individuals: Pranoop
   Erasani, Tom Haynes, Arthur Lent, Trond Myklebust, Dave Noveck,
   Theresa Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani, and
   Nico Williams.  Anna Schumaker's early prototyping experience helped
   us avoid some traps.

   For the NFS space reservation operations, the original draft was by
   Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer.

   For the sparse file support, the original draft was by Dean
   Hildebrand and Marc Eshel.  Valuable input and advice were received
   from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and
   Richard Scheffenegger.

   For the Application IO Hints, the original draft was by Dean
   Hildebrand, Mike Eisler, Trond Myklebust, and Sam Falkner.  Some
   early reviewers included Benny Halevy and Pranoop Erasani.

   For Labeled NFS, the original draft was by David Quigley, James
   Morris, Jarret Lu, and Tom Haynes.  Peter Staubach, Trond Myklebust,
   Stephen Smalley, Sorin Faibish, Nico Williams, and David Black also
   contributed in the final push to get this accepted.

   During the review process, Talia Reyes-Ortiz helped the sessions run
   smoothly.  While many people contributed here and there, the core
   reviewers were Andy Adamson, Pranoop Erasani, Bruce Fields, Chuck
   Lever, Trond Myklebust, David Noveck, Peter Staubach, and Mike
   Kupfer.

Appendix B.  RFC Editor Notes

   [RFC Editor: please remove this section prior to publishing this
   document as an RFC]

   [RFC Editor: prior to publishing this document as an RFC, please
   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
   RFC number of this document]

Author's Address

   Thomas Haynes (editor)
   NetApp
   495 E Java Dr
   Sunnyvale, CA 95054
   USA

   Phone: +1 408 419 3018
   Email: thomas@netapp.com