idnits 2.17.1

draft-ietf-nfsv4-minorversion2-16.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The document has examples using IPv4 documentation addresses according
     to RFC6890, but does not use any IPv6 documentation addresses.  Maybe
     there should be IPv6 examples, too?

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does
     not match the current year

  -- The document date (October 18, 2012) is 4201 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative
     references to lower-maturity documents in RFCs)

  == Missing Reference: '0' is mentioned on line 3857, but not defined

  -- Looks like a reference, but probably isn't: '32K' on line 3857

  ** Obsolete normative reference: RFC 5661 (ref. '1') (Obsoleted by
     RFC 8881)

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  -- Possible downref: Non-RFC (?) normative reference: ref. '5'

  == Outdated reference: A later version (-35) exists of
     draft-ietf-nfsv4-rfc3530bis-20

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (ref. '12') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233,
     RFC 7234, RFC 7235)

  == Outdated reference: A later version (-05) exists of
     draft-ietf-nfsv4-labreqs-03

  -- Obsolete informational reference (is this intentional?): RFC 5226
     (ref. '24') (Obsoleted by RFC 8126)

     Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information
     about the items above.

--------------------------------------------------------------------------------

2    NFSv4                                                         T. Haynes
3    Internet-Draft                                                    Editor
4    Intended status: Standards Track                        October 18, 2012
5    Expires: April 21, 2013

7                        NFS Version 4 Minor Version 2
8                    draft-ietf-nfsv4-minorversion2-16.txt

10   Abstract

12      This Internet-Draft describes NFS version 4 minor version two,
13      focusing mainly on the protocol extensions made from NFS version 4
14      minor version 0 and NFS version 4 minor version 1.  Major extensions
15      introduced in NFS version 4 minor version two include: Server-side
16      Copy, Application I/O Advise, Space Reservations, Sparse Files,
17      Application Data Blocks, and Labeled NFS.

19   Requirements Language

21      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
22      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
23      document are to be interpreted as described in RFC 2119 [8].

25   Status of this Memo

27      This Internet-Draft is submitted in full conformance with the
28      provisions of BCP 78 and BCP 79.

30      Internet-Drafts are working documents of the Internet Engineering
31      Task Force (IETF).  Note that other groups may also distribute
32      working documents as Internet-Drafts.  The list of current Internet-
33      Drafts is at http://datatracker.ietf.org/drafts/current/.

35      Internet-Drafts are draft documents valid for a maximum of six months
36      and may be updated, replaced, or obsoleted by other documents at any
37      time.  It is inappropriate to use Internet-Drafts as reference
38      material or to cite them other than as "work in progress."
40 This Internet-Draft will expire on April 21, 2013. 42 Copyright Notice 44 Copyright (c) 2012 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (http://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with respect 52 to this document. Code Components extracted from this document must 53 include Simplified BSD License text as described in Section 4.e of 54 the Trust Legal Provisions and are provided without warranty as 55 described in the Simplified BSD License. 57 Table of Contents 59 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 60 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 5 61 1.2. Scope of This Document . . . . . . . . . . . . . . . . . 5 62 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5 63 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 6 64 1.4.1. Server-side Copy . . . . . . . . . . . . . . . . . . . 6 65 1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 6 66 1.4.3. Sparse Files . . . . . . . . . . . . . . . . . . . . . 6 67 1.4.4. Space Reservation . . . . . . . . . . . . . . . . . . 6 68 1.4.5. Application Data Hole (ADH) Support . . . . . . . . . 6 69 1.4.6. Labeled NFS . . . . . . . . . . . . . . . . . . . . . 6 70 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7 71 2. Server-side Copy . . . . . . . . . . . . . . . . . . . . . . . 7 72 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 7 73 2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 7 74 2.2.1. Overview of Copy Operations . . . . . . . . . . . . . 8 75 2.2.2. Locking the Files . . . . . . . . . . . . . . . . . . 9 76 2.2.3. Intra-Server Copy . . . . . . . . . . . . . . . . . . 9 77 2.2.4. 
Inter-Server Copy . . . . . . . . . . . . . . . . . . 10 78 2.2.5. Server-to-Server Copy Protocol . . . . . . . . . . . . 14 79 2.3. Requirements for Operations . . . . . . . . . . . . . . . 15 80 2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 16 81 2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 16 82 2.4. Security Considerations . . . . . . . . . . . . . . . . . 17 83 2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 17 84 3. Support for Application IO Hints . . . . . . . . . . . . . . . 25 85 4. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 25 86 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 25 87 4.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 26 88 4.3. New Operations . . . . . . . . . . . . . . . . . . . . . 26 89 4.3.1. READ_PLUS . . . . . . . . . . . . . . . . . . . . . . 27 90 4.3.2. WRITE_PLUS . . . . . . . . . . . . . . . . . . . . . . 27 91 5. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 27 92 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 27 93 6. Application Data Hole Support . . . . . . . . . . . . . . . . 29 94 6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 30 95 6.1.1. Data Hole Representation . . . . . . . . . . . . . . . 31 96 6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 31 97 6.2. An Example of Detecting Corruption . . . . . . . . . . . 32 98 6.3. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 33 99 7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 34 100 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 34 101 7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 35 102 7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 35 103 7.3.1. Delegations . . . . . . . . . . . . . . . . . . . . . 36 104 7.3.2. Permission Checking . . . . . . . . . . . . . . . . . 36 105 7.3.3. Object Creation . . . . . . . . . . . . . . . . . . . 36 106 7.3.4. 
Existing Objects . . . . . . . . . . . . . . . . . . . 37 107 7.3.5. Label Changes . . . . . . . . . . . . . . . . . . . . 37 108 7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 37 109 7.5. Discovery of Server Labeled NFS Support . . . . . . . . . 38 110 7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 38 111 7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 38 112 7.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . . 40 113 7.7. Security Considerations . . . . . . . . . . . . . . . . . 40 114 8. Sharing change attribute implementation details with NFSv4 115 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 116 8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 41 117 9. Security Considerations . . . . . . . . . . . . . . . . . . . 41 118 10. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 41 119 10.1. Error Definitions . . . . . . . . . . . . . . . . . . . . 42 120 10.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 42 121 10.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 42 122 10.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 43 123 10.2. New Operations and Their Valid Errors . . . . . . . . . . 43 124 10.3. New Callback Operations and Their Valid Errors . . . . . 46 125 11. New File Attributes . . . . . . . . . . . . . . . . . . . . . 47 126 11.1. New RECOMMENDED Attributes - List and Definition 127 References . . . . . . . . . . . . . . . . . . . . . . . 47 128 11.2. Attribute Definitions . . . . . . . . . . . . . . . . . . 48 129 12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 51 130 13. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 55 131 13.1. Operation 59: COPY - Initiate a server-side copy . . . . 55 132 13.2. Operation 60: OFFLOAD_ABORT - Cancel a server-side 133 copy . . . . . . . . . . . . . . . . . . . . . . . . . . 62 134 13.3. 
Operation 61: COPY_NOTIFY - Notify a source server of 135 a future copy . . . . . . . . . . . . . . . . . . . . . . 63 136 13.4. Operation 62: OFFLOAD_REVOKE - Revoke a destination 137 server's copy privileges . . . . . . . . . . . . . . . . 64 138 13.5. Operation 63: OFFLOAD_STATUS - Poll for status of a 139 server-side copy . . . . . . . . . . . . . . . . . . . . 65 140 13.6. Modification to Operation 42: EXCHANGE_ID - 141 Instantiate Client ID . . . . . . . . . . . . . . . . . . 66 142 13.7. Operation 64: WRITE_PLUS . . . . . . . . . . . . . . . . 67 143 13.8. Operation 67: IO_ADVISE - Application I/O access 144 pattern hints . . . . . . . . . . . . . . . . . . . . . . 72 145 13.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 78 146 13.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 81 147 13.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 86 148 14. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 87 149 14.1. Operation 15: CB_OFFLOAD - Report results of an 150 asynchronous operation . . . . . . . . . . . . . . . . . 87 151 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 88 152 16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 89 153 16.1. Normative References . . . . . . . . . . . . . . . . . . 89 154 16.2. Informative References . . . . . . . . . . . . . . . . . 89 155 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 90 156 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 91 157 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 92 159 1. Introduction 161 1.1. The NFS Version 4 Minor Version 2 Protocol 163 The NFS version 4 minor version 2 (NFSv4.2) protocol is the third 164 minor version of the NFS version 4 (NFSv4) protocol. The first minor 165 version, NFSv4.0, is described in [9] and the second minor version, 166 NFSv4.1, is described in [1]. 
It follows the guidelines for minor 167 versioning that are listed in Section 11 of [9]. 169 As a minor version, NFSv4.2 is consistent with the overall goals for 170 NFSv4, but extends the protocol so as to better meet those goals, 171 based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted 172 some additional goals, which motivate some of the major extensions in 173 NFSv4.2. 175 1.2. Scope of This Document 177 This document describes the NFSv4.2 protocol. With respect to 178 NFSv4.0 and NFSv4.1, this document does not: 180 o describe the NFSv4.0 or NFSv4.1 protocols, except where needed to 181 contrast with NFSv4.2 183 o modify the specification of the NFSv4.0 or NFSv4.1 protocols 185 o clarify the NFSv4.0 or NFSv4.1 protocols. I.e., any 186 clarifications made here apply to NFSv4.2 and neither of the prior 187 protocols 189 The full XDR for NFSv4.2 is presented in [2]. 191 1.3. NFSv4.2 Goals 193 The goal of the design of NFSv4.2 is to take common local file system 194 features and offer them remotely. These features might 196 o already be available on the servers, e.g., sparse files 198 o be under development as a new standard, e.g., SEEK_HOLE and 199 SEEK_DATA 201 o be used by clients with the servers via some proprietary means, 202 e.g., Labeled NFS 204 but the clients are not able to leverage them on the server within 205 the confines of the NFS protocol. 207 1.4. Overview of NFSv4.2 Features 209 1.4.1. Server-side Copy 211 A traditional file copy from one server to another results in the 212 data being put on the network twice - source to client and then 213 client to destination. New operations are introduced to allow the 214 client to authorize the two servers to interact directly. As this 215 copy can be lengthy, asynchronous support is also provided. 217 1.4.2. Application I/O Advise 219 Applications and clients want to advise the server as to expected I/O 220 behavior. 
Using IO_ADVISE (see Section 13.8) to communicate future 221 I/O behavior such as whether a file will be accessed sequentially or 222 randomly, and whether a file will or will not be accessed in the near 223 future, allows servers to optimize future I/O requests for a file by, 224 for example, prefetching or evicting data. This operation can be 225 used to support the posix_fadvise function as well as other 226 applications such as databases and video editors. 228 1.4.3. Sparse Files 230 Sparse files are ones which have unallocated data blocks as holes in 231 the file. Such holes are typically transferred as 0s during I/O. 232 READ_PLUS (see Section 13.10) allows a server to send back to the 233 client metadata describing the hole and WRITE_PLUS (see Section 13.7) 234 allows the client to punch holes into a file. In addition, SEEK (see 235 Section 13.11) is provided to scan for the next hole or data from a 236 given location. 238 1.4.4. Space Reservation 240 When a file is sparse, one concern applications have is ensuring that 241 there will always be enough data blocks available for the file during 242 future writes. A new attribute, space_reserved (see Section 11.2.4) 243 provides the client a guarantee that space will be available. 245 1.4.5. Application Data Hole (ADH) Support 247 Some applications treat a file as if it were a disk and as such want 248 to initialize (or format) the file image. We extend both READ_PLUS 249 and WRITE_PLUS to understand this metadata as a new form of a hole. 251 1.4.6. Labeled NFS 253 While both clients and servers can employ Mandatory Access Control 254 (MAC) security models to enforce data access, there has been no 255 protocol support to allow full interoperability. A new file object 256 attribute, sec_label (see Section 11.2.2) allows for the server to 257 store and enforce MAC labels. The format of the sec_label 258 accommodates any MAC security system. 260 1.5. 
Differences from NFSv4.1 262 In NFSv4.1, the only way to introduce new variants of an operation 263 was to introduce a new operation. I.e., READ becomes either READ2 or 264 READ_PLUS. With the use of discriminated unions as parameters to 265 such functions in NFSv4.2, it is possible to add a new arm in a 266 subsequent minor version. And it is also possible to move such an 267 operation from OPTIONAL/RECOMMENDED to REQUIRED. Forcing an 268 implementation to adopt each arm of a discriminated union at such a 269 time does not meet the spirit of the minor versioning rules. As 270 such, new arms of a discriminated union MUST follow the same 271 guidelines for minor versioning as operations in NFSv4.1 - i.e., they 272 may not be made REQUIRED. To support this, a new error code, 273 NFS4ERR_UNION_NOTSUPP, is introduced which allows the server to 274 communicate to the client that the operation is supported, but the 275 specific arm of the discriminated union is not. 277 2. Server-side Copy 279 2.1. Introduction 281 The server-side copy feature provides a mechanism for the NFS client 282 to perform a file copy on the server without the data being 283 transmitted back and forth over the network. Without this feature, 284 an NFS client copies data from one location to another by reading the 285 data from the server over the network, and then writing the data back 286 over the network to the server. Using this server-side copy 287 operation, the client is able to instruct the server to copy the data 288 locally without the data being sent back and forth over the network 289 unnecessarily. 291 If the source object and destination object are on different file 292 servers, the file servers will communicate with one another to 293 perform the copy operation. The server-to-server protocol by which 294 this is accomplished is not defined in this document. 296 2.2. 
Protocol Overview 298 The server-side copy offload operations support both intra-server and 299 inter-server file copies. An intra-server copy is a copy in which 300 the source file and destination file reside on the same server. In 301 an inter-server copy, the source file and destination file are on 302 different servers. In both cases, the copy may be performed 303 synchronously or asynchronously. 305 Throughout the rest of this document, we refer to the NFS server 306 containing the source file as the "source server" and the NFS server 307 to which the file is transferred as the "destination server". In the 308 case of an intra-server copy, the source server and destination 309 server are the same server. Therefore in the context of an intra- 310 server copy, the terms source server and destination server refer to 311 the single server performing the copy. 313 The operations described below are designed to copy files. Other 314 file system objects can be copied by building on these operations or 315 using other techniques. For example if the user wishes to copy a 316 directory, the client can synthesize a directory copy by first 317 creating the destination directory and then copying the source 318 directory's files to the new destination directory. If the user 319 wishes to copy a namespace junction [10] [11], the client can use the 320 ONC RPC Federated Filesystem protocol [11] to perform the copy. 321 Specifically the client can determine the source junction's 322 attributes using the FEDFS_LOOKUP_FSN procedure and create a 323 duplicate junction using the FEDFS_CREATE_JUNCTION procedure. 325 For the inter-server copy, the operations are defined to be 326 compatible with the traditional copy authentication approach. The 327 client and user are authorized at the source for reading. Then they 328 are authorized at the destination for writing. 330 2.2.1. 
Overview of Copy Operations 332 COPY_NOTIFY: For inter-server copies, the client sends this 333 operation to the source server to notify it of a future file copy 334 from a given destination server for the given user. 335 (Section 13.3) 337 OFFLOAD_REVOKE: Also for inter-server copies, the client sends this 338 operation to the source server to revoke permission to copy a file 339 for the given user. (Section 13.4) 341 COPY: Used by the client to request a file copy. (Section 13.1) 343 OFFLOAD_ABORT: Used by the client to abort an asynchronous file 344 copy. (Section 13.2) 346 OFFLOAD_STATUS: Used by the client to poll the status of an 347 asynchronous file copy. (Section 13.5) 349 CB_OFFLOAD: Used by the destination server to report the results of 350 an asynchronous file copy to the client. (Section 14.1) 352 2.2.2. Locking the Files 354 Both the source and destination file may need to be locked to protect 355 the content during the copy operations. A client can achieve this by 356 a combination of OPEN and LOCK operations. I.e., either share or 357 byte range locks might be desired. 359 2.2.3. Intra-Server Copy 361 To copy a file on a single server, the client uses a COPY operation. 362 The server may respond to the copy operation with the final results 363 of the copy or it may perform the copy asynchronously and deliver the 364 results using a CB_OFFLOAD operation callback. If the copy is 365 performed asynchronously, the client may poll the status of the copy 366 using OFFLOAD_STATUS or cancel the copy using OFFLOAD_ABORT. 368 A synchronous intra-server copy is shown in Figure 1. In this 369 example, the NFS server chooses to perform the copy synchronously. 370 The copy operation is completed, either successfully or 371 unsuccessfully, before the server replies to the client's request. 372 The server's reply contains the final result of the operation. 
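The intra-server flow just described (a synchronous final result in the COPY reply, or an asynchronous reply followed by OFFLOAD_STATUS polling until the copy completes) can be sketched as follows. This is a non-normative toy model for illustration only: the `Server` class, the `client_copy` helper, and all of their fields are invented here; only the operation names (COPY, OFFLOAD_STATUS, CB_OFFLOAD, OFFLOAD_ABORT) come from the protocol.

```python
# Toy model of a client driving an intra-server copy.  Everything here is
# hypothetical illustration; only the operation names mirror the protocol.
import time

class Server:
    """Pretend server that performs a COPY synchronously or asynchronously."""
    def __init__(self, asynchronous):
        self.asynchronous = asynchronous

    def copy(self, src, dst):
        if not self.asynchronous:
            # Synchronous case: the reply carries the final result.
            return {"complete": True, "bytes_copied": 4096}
        # Asynchronous case: the reply carries a copy stateid instead.
        return {"complete": False, "stateid": 7}

    def offload_status(self, stateid):
        # Pretend the background copy has finished by the first poll.
        return {"complete": True, "bytes_copied": 4096}

def client_copy(server, src, dst):
    reply = server.copy(src, dst)
    if reply["complete"]:
        return reply["bytes_copied"]          # synchronous: done already
    stateid = reply["stateid"]
    while True:
        # Poll with OFFLOAD_STATUS; a real client could instead wait for the
        # CB_OFFLOAD callback, or cancel with OFFLOAD_ABORT using the stateid.
        status = server.offload_status(stateid)
        if status["complete"]:
            return status["bytes_copied"]
        time.sleep(0.01)
```

In either mode the client sees the same end result; only the delivery of that result differs, which is exactly the distinction Figures 1 and 2 illustrate.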
374        Client                                  Server
375           +                                      +
376           |                                      |
377           |--- OPEN ---------------------------->| Client opens
378           |<------------------------------------/| the source file
379           |                                      |
380           |--- OPEN ---------------------------->| Client opens
381           |<------------------------------------/| the destination file
382           |                                      |
383           |--- COPY ---------------------------->| Client requests
384           |<------------------------------------/| a file copy
385           |                                      |
386           |--- CLOSE --------------------------->| Client closes
387           |<------------------------------------/| the destination file
388           |                                      |
389           |--- CLOSE --------------------------->| Client closes
390           |<------------------------------------/| the source file
391           |                                      |
392           |                                      |

394                Figure 1: A synchronous intra-server copy.

396     An asynchronous intra-server copy is shown in Figure 2.  In this
397     example, the NFS server performs the copy asynchronously.  The
398     server's reply to the copy request indicates that the copy operation
399     was initiated and the final result will be delivered at a later time.
400     The server's reply also contains a copy stateid.  The client may use
401     this copy stateid to poll for status information (as shown) or to
402     cancel the copy using an OFFLOAD_ABORT.  When the server completes
403     the copy, the server performs a callback to the client and reports
404     the results.

406        Client                                  Server
407           +                                      +
408           |                                      |
409           |--- OPEN ---------------------------->| Client opens
410           |<------------------------------------/| the source file
411           |                                      |
412           |--- OPEN ---------------------------->| Client opens
413           |<------------------------------------/| the destination file
414           |                                      |
415           |--- COPY ---------------------------->| Client requests
416           |<------------------------------------/| a file copy
417           |                                      |
418           |                                      |
419           |--- OFFLOAD_STATUS ------------------>| Client may poll
420           |<------------------------------------/| for status
421           |                                      |
422           |                  .                   | Multiple OFFLOAD_STATUS
423           |                  .                   | operations may be sent.
424           |                  .                   |
425           |                                      |
426           |<-- CB_OFFLOAD -----------------------| Server reports results
427           |\------------------------------------>|
428           |                                      |
429           |--- CLOSE --------------------------->| Client closes
430           |<------------------------------------/| the destination file
431           |                                      |
432           |--- CLOSE --------------------------->| Client closes
433           |<------------------------------------/| the source file
434           |                                      |
435           |                                      |

437               Figure 2: An asynchronous intra-server copy.

439  2.2.4.  Inter-Server Copy

441     A copy may also be performed between two servers.  The copy protocol
442     is designed to accommodate a variety of network topologies.  As shown
443     in Figure 3, the client and servers may be connected by multiple
444     networks.  In particular, the servers may be connected by a
445     specialized, high speed network (network 192.0.2.0/24 in the diagram)
446     that does not include the client.  The protocol allows the client to
447     setup the copy between the servers (over network 203.0.113.0/24 in
448     the diagram) and for the servers to communicate on the high speed
449     network if they choose to do so.

451                           192.0.2.0/24
452              +-------------------------------------+
453              |                                     |
454              |                                     |
455              | 192.0.2.18                          | 192.0.2.56
456      +-------+------+                       +------+------+
457      |    Source    |                       | Destination |
458      +-------+------+                       +------+------+
459              | 203.0.113.18                        | 203.0.113.56
460              |                                     |
461              |                                     |
462              |            203.0.113.0/24           |
463              +------------------+------------------+
464                                 |
465                                 |
466                                 | 203.0.113.243
467                           +-----+-----+
468                           |  Client   |
469                           +-----------+

471           Figure 3: An example inter-server network topology.

473     For an inter-server copy, the client notifies the source server that
474     a file will be copied by the destination server using a COPY_NOTIFY
475     operation.  The client then initiates the copy by sending the COPY
476     operation to the destination server.  The destination server may
477     perform the copy synchronously or asynchronously.

479     A synchronous inter-server copy is shown in Figure 4.
In this case,
480     the destination server chooses to perform the copy before responding
481     to the client's COPY request.

483     An asynchronous copy is shown in Figure 5.  In this case, the
484     destination server chooses to respond to the client's COPY request
485     immediately and then perform the copy asynchronously.

487        Client                Source          Destination
488           +                    +                  +
489           |                    |                  |
490           |--- OPEN --->|                         | Returns os1
491           |<------------------/|                  |
492           |                    |                  |
493           |--- COPY_NOTIFY --->|                  |
494           |<------------------/|                  |
495           |                    |                  |
496           |--- OPEN ---------------------------->| Returns os2
497           |<------------------------------------/|
498           |                    |                  |
499           |--- COPY ---------------------------->|
500           |                    |                  |
501           |                    |                  |
502           |                    |<----- read -----|
503           |                    |\--------------->|
504           |                    |                  |
505           |                    |        .         | Multiple reads may
506           |                    |        .         | be necessary
507           |                    |        .         |
508           |                    |                  |
509           |                    |                  |
510           |<------------------------------------/| Destination replies
511           |                    |                  | to COPY
512           |                    |                  |
513           |--- CLOSE --------------------------->| Release open state
514           |<------------------------------------/|
515           |                    |                  |
516           |--- CLOSE --->|                        | Release open state
517           |<------------------/|                  |

519              Figure 4: A synchronous inter-server copy.

521        Client                Source          Destination
522           +                    +                  +
523           |                    |                  |
524           |--- OPEN --->|                         | Returns os1
525           |<------------------/|                  |
526           |                    |                  |
527           |--- LOCK --->|                         | Optional, could be done
528           |<------------------/|                  | with a share lock
529           |                    |                  |
530           |--- COPY_NOTIFY --->|                  | Need to pass in
531           |<------------------/|                  | os1 or lock state
532           |                    |                  |
533           |                    |                  |
534           |                    |                  |
535           |--- OPEN ---------------------------->| Returns os2
536           |<------------------------------------/|
537           |                    |                  |
538           |--- LOCK ---------------------------->| Optional ...
539           |<------------------------------------/|
540           |                    |                  |
541           |--- COPY ---------------------------->| Need to pass in
542           |<------------------------------------/| os2 or lock state
543           |                    |                  |
544           |                    |                  |
545           |                    |<----- read -----|
546           |                    |\--------------->|
547           |                    |                  |
548           |                    |        .         | Multiple reads may
549           |                    |        .         | be necessary
550           |                    |        .         |
551           |                    |                  |
552           |                    |                  |
553           |--- OFFLOAD_STATUS ------------------>| Client may poll
554           |<------------------------------------/| for status
555           |                    |                  |
556           |                    |        .         | Multiple OFFLOAD_STATUS
557           |                    |        .         | operations may be sent
558           |                    |        .         |
559           |                    |                  |
560           |                    |                  |
561           |                    |                  |
562           |<-- CB_OFFLOAD -----------------------| Destination reports
563           |\------------------------------------>| results
564           |                    |                  |
565           |--- LOCKU --------------------------->| Only if LOCK was done
566           |<------------------------------------/|
567           |                    |                  |
568           |--- CLOSE --------------------------->| Release open state
569           |<------------------------------------/|
570           |                    |                  |
571           |--- LOCKU --->|                        | Only if LOCK was done
572           |<------------------/|                  |
573           |                    |                  |
574           |--- CLOSE --->|                        | Release open state
575           |<------------------/|                  |
576           |                    |                  |

578             Figure 5: An asynchronous inter-server copy.

580  2.2.5.  Server-to-Server Copy Protocol

582     The source server and destination server are not required to use a
583     specific protocol to transfer the file data.  The choice of what
584     protocol to use is ultimately the destination server's decision.

586  2.2.5.1.  Using NFSv4.x as a Server-to-Server Copy Protocol

588     The destination server MAY use standard NFSv4.x (where x >= 1) to
589     read the data from the source server.  If NFSv4.x is used for the
590     server-to-server copy protocol, the destination server can use the
591     filehandle contained in the COPY request with standard NFSv4.x
592     operations to read data from the source server.  Specifically, the
593     destination server may use the NFSv4.x OPEN operation's CLAIM_FH
594     facility to open the file being copied and obtain an open stateid.
595     Using the stateid, the destination server may then use NFSv4.x READ
596     operations to read the file.

598  2.2.5.2.
Using an alternative Server-to-Server Copy Protocol 600 In a homogeneous environment, the source and destination servers 601 might be able to perform the file copy extremely efficiently using 602 specialized protocols. For example the source and destination 603 servers might be two nodes sharing a common file system format for 604 the source and destination file systems. Thus the source and 605 destination are in an ideal position to efficiently render the image 606 of the source file to the destination file by replicating the file 607 system formats at the block level. Another possibility is that the 608 source and destination might be two nodes sharing a common storage 609 area network, and thus there is no need to copy any data at all, and 610 instead ownership of the file and its contents might simply be re- 611 assigned to the destination. To allow for these possibilities, the 612 destination server is allowed to use a server-to-server copy protocol 613 of its choice. 615 In a heterogeneous environment, using a protocol other than NFSv4.x 616 (e.g., HTTP [12] or FTP [13]) presents some challenges. In 617 particular, the destination server is presented with the challenge of 618 accessing the source file given only an NFSv4.x filehandle. 620 One option for protocols that identify source files with path names 621 is to use an ASCII hexadecimal representation of the source 622 filehandle as the file name. 624 Another option for the source server is to use URLs to direct the 625 destination server to a specialized service. For example, the 626 response to COPY_NOTIFY could include the URL 627 ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII 628 hexadecimal representation of the source filehandle. When the 629 destination server receives the source server's URL, it would use 630 "_FH/0x12345" as the file name to pass to the FTP server listening on 631 port 9999 of s1.example.com. 
On port 9999 there would be a special
632     instance of the FTP service that understands how to convert NFS
633     filehandles to an open file descriptor (in many operating systems,
634     this would require a new system call, one which is the inverse of the
635     makefh() function that the pre-NFSv4 MOUNT service needs).

637     Authenticating and identifying the destination server to the source
638     server is also a challenge.  Recommendations for how to accomplish
639     this are given in Section 2.4.1.2.4 and Section 2.4.1.4.

641  2.3.  Requirements for Operations

643     The implementation of server-side copy is OPTIONAL for both the
644     client and the server.  However, in order to successfully copy a
645     file, some operations MUST be supported by the client and/or server.

647     If a client desires an intra-server file copy, then it MUST support
648     the COPY and CB_OFFLOAD operations.  If COPY returns a stateid, then
649     the client MAY use the OFFLOAD_ABORT and OFFLOAD_STATUS operations.

651     If a client desires an inter-server file copy, then it MUST support
652     the COPY, COPY_NOTIFY, and CB_OFFLOAD operations, and MAY use the
653     OFFLOAD_REVOKE operation.  If COPY returns a stateid, then the client
654     MAY use the OFFLOAD_ABORT and OFFLOAD_STATUS operations.

656     If a server supports intra-server copy, then the server MUST support
657     the COPY operation.  If a server's COPY operation returns a stateid,
658     then the server MUST also support these operations: CB_OFFLOAD,
659     OFFLOAD_ABORT, and OFFLOAD_STATUS.

661     If a source server supports inter-server copy, then the source server
662     MUST support both of these operations: COPY_NOTIFY and
663     OFFLOAD_REVOKE.  If a destination server supports inter-server copy,
664     then the destination server MUST support the COPY operation.  If a
665     destination server's COPY operation returns a stateid, then the
666     destination server MUST also support these operations: CB_OFFLOAD,
667     OFFLOAD_ABORT, COPY_NOTIFY, OFFLOAD_REVOKE, and OFFLOAD_STATUS.
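The per-role support requirements above can be restated as a simple checklist. The sketch below is non-normative and purely illustrative: the role labels and the `required_ops` helper are invented here, while the operation sets come directly from the preceding paragraphs (note that OFFLOAD_ABORT and OFFLOAD_STATUS are MAY for clients, so they are omitted from the client sets).

```python
# Non-normative sketch: the MUST-support rules of Section 2.3 as a lookup.
# Role names and this helper are hypothetical; the operation sets are the
# ones required by the text above.
def required_ops(role, copy_returns_stateid=False):
    """Return the set of operations an implementation MUST support."""
    ops = set()
    if role == "client-intra":
        ops |= {"COPY", "CB_OFFLOAD"}
    elif role == "client-inter":
        ops |= {"COPY", "COPY_NOTIFY", "CB_OFFLOAD"}
    elif role == "server-intra":
        ops |= {"COPY"}
        if copy_returns_stateid:
            ops |= {"CB_OFFLOAD", "OFFLOAD_ABORT", "OFFLOAD_STATUS"}
    elif role == "source-server-inter":
        ops |= {"COPY_NOTIFY", "OFFLOAD_REVOKE"}
    elif role == "destination-server-inter":
        ops |= {"COPY"}
        if copy_returns_stateid:
            ops |= {"CB_OFFLOAD", "OFFLOAD_ABORT", "COPY_NOTIFY",
                    "OFFLOAD_REVOKE", "OFFLOAD_STATUS"}
    return ops
```

For example, `required_ops("server-intra")` is just `{"COPY"}`, while a destination server whose COPY returns a stateid must additionally support the callback and offload-management operations.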
Each operation is performed in the context of the user identified by the ONC RPC credential of its containing COMPOUND or CB_COMPOUND request. For example, an OFFLOAD_ABORT operation issued by a given user indicates that a specified COPY operation initiated by the same user be canceled. Therefore an OFFLOAD_ABORT MUST NOT interfere with a copy of the same file initiated by another user.

An NFS server MAY allow an administrative user to monitor or cancel copy operations using an implementation-specific interface.

2.3.1. netloc4 - Network Locations

The server-side copy operations specify network locations using the netloc4 data type shown below:

   enum netloc_type4 {
           NL4_NAME    = 0,
           NL4_URL     = 1,
           NL4_NETADDR = 2
   };

   union netloc4 switch (netloc_type4 nl_type) {
           case NL4_NAME:    utf8str_cis nl_name;
           case NL4_URL:     utf8str_cis nl_url;
           case NL4_NETADDR: netaddr4    nl_addr;
   };

If the netloc4 is of type NL4_NAME, the nl_name field MUST be specified as a UTF-8 string. The nl_name is expected to be resolved to a network address via DNS, LDAP, NIS, /etc/hosts, or some other means. If the netloc4 is of type NL4_URL, a server URL [3] appropriate for the server-to-server copy operation is specified as a UTF-8 string. If the netloc4 is of type NL4_NETADDR, the nl_addr field MUST contain a valid netaddr4 as defined in Section 3.3.9 of [1].

When netloc4 values are used for an inter-server copy as shown in Figure 3, their values may be evaluated on the source server, destination server, and client. The network environment in which these systems operate should be configured so that the netloc4 values are interpreted as intended on each system.

2.3.2. Copy Offload Stateids

A server may perform a copy offload operation asynchronously. An asynchronous copy is tracked using a copy offload stateid.
Copy offload stateids are included in the COPY, OFFLOAD_ABORT, OFFLOAD_STATUS, and CB_OFFLOAD operations.

Section 8.2.4 of [1] specifies that stateids are valid until either (A) the client or server restarts or (B) the client returns the resource.

A copy offload stateid will be valid until either (A) the client or server restarts or (B) the client returns the resource by issuing an OFFLOAD_ABORT operation or the client replies to a CB_OFFLOAD operation.

A copy offload stateid's seqid MUST NOT be 0. In the context of a copy offload operation, it is ambiguous to indicate the most recent copy offload operation using a stateid with a seqid of 0. Therefore a copy offload stateid with a seqid of 0 MUST be considered invalid.

2.4. Security Considerations

The security considerations pertaining to NFSv4 [9] apply to this chapter.

The standard security mechanisms provided by NFSv4 [9] may be used to secure the protocol described in this chapter.

NFSv4 clients and servers supporting the inter-server copy operations described in this chapter are REQUIRED to implement [4], including the RPCSEC_GSSv3 privileges copy_from_auth and copy_to_auth. If the server-to-server copy protocol is ONC RPC based, the servers are also REQUIRED to implement the RPCSEC_GSSv3 privilege copy_confirm_auth. These requirements to implement are not requirements to use; NFSv4 clients and servers are RECOMMENDED to use [4] to secure server-side copy operations.

2.4.1. Inter-Server Copy Security

2.4.1.1. Requirements for Secure Inter-Server Copy

Inter-server copy is driven by several requirements:

o  The specification MUST NOT mandate an inter-server copy protocol. There are many ways to copy data, and some will be more efficient than others depending on the identities of the source server and destination server.
For example, the source and destination servers might be two nodes sharing a common file system format for the source and destination file systems; the source and destination are then in an ideal position to efficiently render the image of the source file to the destination file by replicating the file system formats at the block level. In other cases, the source and destination might be two nodes sharing a common storage area network, in which case there is no need to copy any data at all; instead, ownership of the file and its contents simply gets reassigned to the destination.

o  The specification MUST provide guidance for using NFSv4.x as a copy protocol. For those source and destination servers willing to use NFSv4.x, there are specific security considerations that this specification can and does address.

o  The specification MUST NOT mandate pre-configuration between the source and destination server. Requiring that the source and destination first have a "copying relationship" increases the administrative burden. However, the specification MUST NOT preclude implementations that require pre-configuration.

o  The specification MUST NOT mandate a trust relationship between the source and destination server. The NFSv4 security model requires mutual authentication between a principal on an NFS client and a principal on an NFS server. This model MUST continue with the introduction of COPY.

2.4.1.2. Inter-Server Copy with RPCSEC_GSSv3

When the client sends a COPY_NOTIFY to the source server to indicate that the destination server will attempt to copy data from the source server, it is expected that this copy is being done on behalf of the principal (called the "user principal") that sent the RPC request that encloses the COMPOUND procedure that contains the COPY_NOTIFY operation. The user principal is identified by the RPC credentials.
A mechanism is necessary that allows the user principal to authorize the destination server to perform the copy, that lets the source server properly authenticate the destination's copy, and that does not allow the destination to exceed its authorization.

An approach that sends delegated credentials of the client's user principal to the destination server is not used, for the following reasons. If the client's user delegated its credentials, the destination would authenticate as the user principal. If the destination were using the NFSv4 protocol to perform the copy, then the source server would authenticate the destination server as the user principal, and the file copy would securely proceed. However, this approach would allow the destination server to copy other files. The user principal would have to trust the destination server not to do so. This is counter to the requirements, and therefore is not considered. Instead, an approach using RPCSEC_GSSv3 [4] privileges is proposed.

One of the stated applications of the proposed RPCSEC_GSSv3 protocol is compound client host and user authentication [+ privilege assertion]. For inter-server file copy, we require compound NFS server host and user authentication [+ privilege assertion]. The distinction between the two is one without meaning.

RPCSEC_GSSv3 introduces the notion of privileges. We define three privileges:

copy_from_auth:  A user principal is authorizing a source principal ("nfs@<source>") to allow a destination principal ("nfs@<destination>") to copy a file from the source to the destination. This privilege is established on the source server before the user principal sends a COPY_NOTIFY operation to the source server.
   struct copy_from_auth_priv {
           secret4             cfap_shared_secret;
           netloc4             cfap_destination;
           /* the NFSv4 user name that the user principal maps to */
           utf8str_mixed       cfap_username;
           /* equal to seq_num of rpc_gss_cred_vers_3_t */
           unsigned int        cfap_seq_num;
   };

cfap_shared_secret is a secret value the user principal generates.

copy_to_auth:  A user principal is authorizing a destination principal ("nfs@<destination>") to allow it to copy a file from the source to the destination. This privilege is established on the destination server before the user principal sends a COPY operation to the destination server.

   struct copy_to_auth_priv {
           /* equal to cfap_shared_secret */
           secret4             ctap_shared_secret;
           netloc4             ctap_source;
           /* the NFSv4 user name that the user principal maps to */
           utf8str_mixed       ctap_username;
           /* equal to seq_num of rpc_gss_cred_vers_3_t */
           unsigned int        ctap_seq_num;
   };

ctap_shared_secret is a secret value that the user principal generated and that was used to establish the copy_from_auth privilege with the source principal.

copy_confirm_auth:  A destination principal is confirming with the source principal that it is authorized to copy data from the source on behalf of the user principal. When the inter-server copy protocol is NFSv4, or for that matter, any protocol capable of being secured via RPCSEC_GSSv3 (i.e., any ONC RPC protocol), this privilege is established before the file is copied from the source to the destination.

   struct copy_confirm_auth_priv {
           /* equal to GSS_GetMIC() of cfap_shared_secret */
           opaque              ccap_shared_secret_mic<>;
           /* the NFSv4 user name that the user principal maps to */
           utf8str_mixed       ccap_username;
           /* equal to seq_num of rpc_gss_cred_vers_3_t */
           unsigned int        ccap_seq_num;
   };

2.4.1.2.1.
Establishing a Security Context

When the user principal wants to COPY a file between two servers, if it has not established copy_from_auth and copy_to_auth privileges on the servers, it establishes them:

o  The user principal generates a secret it will share with the two servers. This shared secret will be placed in the cfap_shared_secret and ctap_shared_secret fields of the appropriate privilege data types, copy_from_auth_priv and copy_to_auth_priv.

o  An instance of copy_from_auth_priv is filled in with the shared secret, the destination server, and the NFSv4 user id of the user principal. It will be sent with an RPCSEC_GSS3_CREATE procedure, and so cfap_seq_num is set to the seq_num of the credential of the RPCSEC_GSS3_CREATE procedure. Because cfap_shared_secret is a secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with privacy) is invoked on copy_from_auth_priv. The RPCSEC_GSS3_CREATE procedure's arguments are:

   struct {
           rpc_gss3_gss_binding    *compound_binding;
           rpc_gss3_chan_binding   *chan_binding_mic;
           rpc_gss3_assertion      assertions<>;
           rpc_gss3_extension      extensions<>;
   } rpc_gss3_create_args;

The string "copy_from_auth" is placed in assertions[0].privs. The output of GSS_Wrap() is placed in extensions[0].data. The field extensions[0].critical is set to TRUE. The source server calls GSS_Unwrap() on the privilege and verifies that the seq_num matches the credential. It then verifies that the NFSv4 user id being asserted matches the source server's mapping of the user principal. If it does, the privilege is established on the source server as: <"copy_from_auth", user id, destination>.
The successful reply to RPCSEC_GSS3_CREATE has:

   struct {
           opaque                  handle<>;
           rpc_gss3_chan_binding   *chan_binding_mic;
           rpc_gss3_assertion      granted_assertions<>;
           rpc_gss3_assertion      server_assertions<>;
           rpc_gss3_extension      extensions<>;
   } rpc_gss3_create_res;

The field "handle" is the RPCSEC_GSSv3 handle that the client will use on COPY_NOTIFY requests involving the source and destination server. granted_assertions[0].privs will be equal to "copy_from_auth". The server will return a GSS_Wrap() of copy_to_auth_priv.

o  An instance of copy_to_auth_priv is filled in with the shared secret, the source server, and the NFSv4 user id. It will be sent with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set to the seq_num of the credential of the RPCSEC_GSS3_CREATE procedure. Because ctap_shared_secret is a secret, after XDR encoding copy_to_auth_priv, GSS_Wrap() is invoked on copy_to_auth_priv. The RPCSEC_GSS3_CREATE procedure's arguments are:

   struct {
           rpc_gss3_gss_binding    *compound_binding;
           rpc_gss3_chan_binding   *chan_binding_mic;
           rpc_gss3_assertion      assertions<>;
           rpc_gss3_extension      extensions<>;
   } rpc_gss3_create_args;

The string "copy_to_auth" is placed in assertions[0].privs. The output of GSS_Wrap() is placed in extensions[0].data. The field extensions[0].critical is set to TRUE. After unwrapping, verifying the seq_num, and verifying the user principal to NFSv4 user ID mapping, the destination establishes a privilege of <"copy_to_auth", user id, source>.
The successful reply to RPCSEC_GSS3_CREATE has:

   struct {
           opaque                  handle<>;
           rpc_gss3_chan_binding   *chan_binding_mic;
           rpc_gss3_assertion      granted_assertions<>;
           rpc_gss3_assertion      server_assertions<>;
           rpc_gss3_extension      extensions<>;
   } rpc_gss3_create_res;

The field "handle" is the RPCSEC_GSSv3 handle that the client will use on COPY requests involving the source and destination server. The field granted_assertions[0].privs will be equal to "copy_to_auth". The server will return a GSS_Wrap() of copy_to_auth_priv.

2.4.1.2.2. Starting a Secure Inter-Server Copy

When the client sends a COPY_NOTIFY request to the source server, it uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle. cna_destination_server in COPY_NOTIFY MUST be the same as the name of the destination server specified in copy_from_auth_priv. Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS. The source server verifies that the privilege <"copy_from_auth", user id, destination> exists and annotates it with the source filehandle, if the user principal has read access to the source file and if administrative policies give the user principal and the NFS client read access to the source file (i.e., if the ACCESS operation would grant read access). Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS.

When the client sends a COPY request to the destination server, it uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle. ca_source_server in COPY MUST be the same as the name of the source server specified in copy_to_auth_priv. Otherwise, COPY will fail with NFS4ERR_ACCESS. The destination server verifies that the privilege <"copy_to_auth", user id, source> exists and annotates it with the source and destination filehandles. If the client has failed to establish the "copy_to_auth" policy, the destination server will reject the request with NFS4ERR_PARTNER_NO_AUTH.
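The two checks just described can be sketched as follows. This is a non-normative illustration: the privilege table, function names, and return conventions are invented; only the matching rules and error codes come from the text above.

```python
# Illustrative sketch (not normative) of the checks described above: each
# server compares the netloc named in the request against the one recorded
# in the corresponding privilege. Data structures here are hypothetical.
NFS4_OK = "NFS4_OK"
NFS4ERR_ACCESS = "NFS4ERR_ACCESS"
NFS4ERR_PARTNER_NO_AUTH = "NFS4ERR_PARTNER_NO_AUTH"

def check_copy_notify(cna_destination_server, privileges, user_id):
    """Source-server check for COPY_NOTIFY (read-access checks omitted)."""
    priv = privileges.get(("copy_from_auth", user_id))
    # cna_destination_server MUST match the destination recorded in
    # copy_from_auth_priv, and the privilege must exist.
    if priv is None or priv["destination"] != cna_destination_server:
        return NFS4ERR_ACCESS
    return NFS4_OK

def check_copy(ca_source_server, privileges, user_id):
    """Destination-server check for COPY."""
    priv = privileges.get(("copy_to_auth", user_id))
    if priv is None:
        return NFS4ERR_PARTNER_NO_AUTH
    if priv["source"] != ca_source_server:
        return NFS4ERR_ACCESS
    return NFS4_OK
```

A real server would additionally annotate the matched privilege with the filehandles, as the text requires.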
If the client sends an OFFLOAD_REVOKE to the source server to rescind the destination server's copy privilege, it uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle, and the cra_destination_server in OFFLOAD_REVOKE MUST be the same as the name of the destination server specified in copy_from_auth_priv. The source server will then delete the <"copy_from_auth", user id, destination> privilege and fail any subsequent copy requests sent under the auspices of this privilege from the destination server.

2.4.1.2.3. Securing ONC RPC Server-to-Server Copy Protocols

After a destination server has a "copy_to_auth" privilege established on it and it receives a COPY request, if it knows it will use an ONC RPC protocol to copy data, it will establish a "copy_confirm_auth" privilege on the source server, using nfs@<destination> as the initiator principal and nfs@<source> as the target principal.

The value of the field ccap_shared_secret_mic is a GSS_GetMIC() of the shared secret passed in the copy_to_auth privilege. The field ccap_username is the mapping of the user principal to an NFSv4 user name ("user"@"domain" form), and MUST be the same as ctap_username and cfap_username. The field ccap_seq_num is the seq_num of the RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure that the destination will send to the source server to establish the privilege.

The source server verifies the privilege and establishes a <"copy_confirm_auth", user id, destination> privilege. If the source server fails to verify the privilege, the COPY operation will be rejected with NFS4ERR_PARTNER_NO_AUTH. All subsequent ONC RPC requests sent from the destination to copy data from the source to the destination will use the RPCSEC_GSSv3 handle returned by the source's RPCSEC_GSS3_CREATE response.
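The destination proves knowledge of the shared secret by presenting a MIC of it. GSS_GetMIC() and GSS_VerifyMIC() are GSS-API calls provided by the negotiated mechanism; purely as an illustration outside any GSS mechanism, an HMAC can play the same role. The sketch below is therefore an assumption-laden stand-in, not the protocol's actual token format.

```python
import hmac
import hashlib

# Hypothetical stand-in for GSS_GetMIC()/GSS_VerifyMIC(): an HMAC over the
# shared secret under a session key. A real implementation would use the
# GSS-API calls of the negotiated mechanism (e.g., Kerberos V5), not HMAC.
def make_secret_mic(shared_secret: bytes, session_key: bytes) -> bytes:
    """Destination side: compute the MIC placed in ccap_shared_secret_mic."""
    return hmac.new(session_key, shared_secret, hashlib.sha256).digest()

def verify_secret_mic(shared_secret: bytes, session_key: bytes,
                      mic: bytes) -> bool:
    """Source side: check the MIC before granting copy_confirm_auth."""
    expected = hmac.new(session_key, shared_secret, hashlib.sha256).digest()
    return hmac.compare_digest(expected, mic)
```

Because only the holders of the shared secret (and the key) can produce a valid MIC, a verifying source server can conclude the user principal authorized the destination's copy.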
Note that the use of the "copy_confirm_auth" privilege accomplishes the following:

o  if a protocol like NFS is being used with export policies, the export policies can be overridden in case the destination server as-an-NFS-client is not authorized

o  manual configuration to allow a copy relationship between the source and destination is not needed.

If the attempt to establish a "copy_confirm_auth" privilege fails, then when the user principal sends a COPY request to the destination, the destination server will reject it with NFS4ERR_PARTNER_NO_AUTH.

2.4.1.2.4. Securing Non ONC RPC Server-to-Server Copy Protocols

If the destination will not be using ONC RPC to copy the data, then the source and destination are using an unspecified copy protocol. The destination could use the shared secret and the NFSv4 user id to prove to the source server that the user principal has authorized the copy.

For protocols that authenticate user names with passwords (e.g., HTTP [12] and FTP [13]), the NFSv4 user id could be used as the user name, and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared secret could be used as the user password or as input into non-password authentication methods like CHAP [14].

2.4.1.3. Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3

ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the server-side copy offload operations described in this chapter. In particular, host-based ONC RPC security flavors such as AUTH_NONE and AUTH_SYS MAY be used. If a host-based security flavor is used, a minimal level of protection for the server-to-server copy protocol is possible.
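The password-based option just described can be sketched as follows; the function name and sample values are invented, and the draft does not mandate this exact construction.

```python
import binascii

# Illustrative only: derive credentials for a password-authenticated copy
# protocol (e.g., FTP or HTTP) from the NFSv4 user id and the RPCSEC_GSSv3
# shared secret, as suggested above.
def copy_protocol_credentials(nfsv4_user_id: str, shared_secret: bytes):
    """Return (username, password): the user id as the name, and the
    ASCII hexadecimal form of the shared secret as the password."""
    password = binascii.hexlify(shared_secret).decode("ascii")
    return nfsv4_user_id, password
```

The same hexadecimal string could instead feed a challenge-response scheme such as CHAP, avoiding transmission of the secret itself.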
In the absence of strong security mechanisms such as RPCSEC_GSSv3, the challenge is how the source server and destination server identify themselves to each other, especially in the presence of multi-homed source and destination servers. In a multi-homed environment, the destination server might not contact the source server from the same network address specified by the client in the COPY_NOTIFY. This can be overcome using the procedure described below.

When the client sends the source server the COPY_NOTIFY operation, the source server may reply to the client with a list of target addresses, names, and/or URLs and assign them to the unique quadruple: <random number, source fh, user ID, destination address>. If the destination uses one of these target netlocs to contact the source server, the source server will be able to uniquely identify the destination server, even if the destination server does not connect from the address specified by the client in COPY_NOTIFY. The level of assurance in this identification depends on the unpredictability, strength, and secrecy of the random number.

For example, suppose the network topology is as shown in Figure 3. If the source filehandle is 0x12345, the source server may respond to a COPY_NOTIFY for destination 203.0.113.56 with the URLs:

   nfs://203.0.113.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/203.0.113.56/_FH/0x12345

   nfs://192.0.2.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/203.0.113.56/_FH/0x12345

The name component after _COPY is 24 characters of base 64, more than enough to encode a 128 bit random number.

The client will then send these URLs to the destination server in the COPY operation. Suppose that the 192.0.2.0/24 network is a high speed network and the destination server decides to transfer the file over this network.
If the destination contacts the source server from 192.0.2.56 over this network using NFSv4.1, it does the following:

   COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP
   "FvhH1OKbu8VrxvV1erdjvR7N" ; LOOKUP "203.0.113.56"; LOOKUP "_FH" ;
   OPEN "0x12345" ; GETFH }

Provided that the random number is unpredictable and has been kept secret by the parties involved, the source server will therefore know that these NFSv4.x operations are being issued by the destination server identified in the COPY_NOTIFY. This random number technique only provides initial authentication of the destination server; it cannot defend against man-in-the-middle attacks after authentication or against an eavesdropper that observes the random number on the wire. Other secure communication techniques (e.g., IPsec) are necessary to block these attacks.

2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3

The same techniques as in Section 2.4.1.3, using unique URLs for each destination server, can be used for other protocols (e.g., HTTP [12] and FTP [13]) as well.

3. Support for Application IO Hints

Applications can issue client I/O hints via posix_fadvise() [5] to the NFS client. While this can help the NFS client optimize I/O and caching for a file, it does not allow the NFS server and its exported file system to do likewise. We add an IO_ADVISE procedure (Section 13.8) to communicate the client file access patterns to the NFS server. The NFS server, upon receiving an IO_ADVISE operation, MAY choose to alter its I/O and caching behavior, but is under no obligation to do so.

Application specific NFS clients such as those used by hypervisors and databases can also leverage application hints to communicate their specialized requirements.

4. Sparse Files

4.1.
Introduction

A sparse file is a common way of representing a large file without having to utilize all of the disk space for it. Consequently, a sparse file uses less physical space than its size indicates. This means the file contains 'holes', byte ranges within the file that contain no data. Most modern file systems support sparse files, including most UNIX file systems and NTFS, but notably not Apple's HFS+. Common examples of sparse files include Virtual Machine (VM) OS/disk images, database files, log files, and even checkpoint recovery files most commonly used by the HPC community.

If an application reads a hole in a sparse file, the file system must return all zeros to the application. For local data access there is little penalty, but with NFS these zeroes must be transferred back to the client. If an application uses the NFS client to read data into memory, this wastes time and bandwidth as the application waits for the zeroes to be transferred.

A sparse file is typically created by initializing the file to be all zeros: nothing is written to the data in the file; instead, the hole is recorded in the metadata for the file. So an 8G disk image might be represented initially by a couple hundred bits in the inode and nothing on the disk. If the VM then writes 100M to a file in the middle of the image, there would now be two holes represented in the metadata and 100M in the data.

Two new operations, WRITE_PLUS (Section 13.7) and READ_PLUS (Section 13.10), are introduced. WRITE_PLUS allows for the creation of a sparse file and for hole punching; for example, an application might want to zero out a range of the file. READ_PLUS supports all the features of READ but includes an extension to support sparse pattern files (Section 6.1.2).
READ_PLUS is guaranteed to perform no worse than READ, and can dramatically improve performance with sparse files. READ_PLUS does not depend on pNFS protocol features, but can be used by pNFS to support sparse files.

4.2. Terminology

Regular file:  An object of file type NF4REG or NF4NAMEDATTR.

Sparse file:  A Regular file that contains one or more Holes.

Hole:  A byte range within a Sparse file that contains regions of all zeroes. For block-based file systems, this could also be an unallocated region of the file.

Hole Threshold:  The minimum length of a Hole as determined by the server. If a server chooses to define a Hole Threshold, then it would not return hole information about holes with a length shorter than the Hole Threshold.

4.3. New Operations

READ_PLUS and WRITE_PLUS are new variants of the NFSv4.1 READ and WRITE operations [1]. Besides being able to support all of the data semantics of those operations, they can also be used by the client and server to efficiently transfer both holes and ADHs (see Section 6.1.1). As both READ and WRITE are inefficient for the transfer of sparse sections of the file, they are marked as OBSOLETE in NFSv4.2. Instead, a client should utilize READ_PLUS and WRITE_PLUS. Note that as the client has no a priori knowledge of whether an ADH or a hole is present, if both it and the server support these operations, then it should always use them.

4.3.1. READ_PLUS

For holes, READ_PLUS extends the response to avoid returning data for portions of the file which are initialized and contain no backing store, or for portions whose result would appear to be so; i.e., if the result would be a data block composed entirely of zeros, then it is easier to return a hole.
Returning data blocks of uninitialized data wastes computational and network resources, thus reducing performance. For ADHs, READ_PLUS is used to return the metadata describing the portions of the file which are initialized and contain no backing store.

If the client sends a READ operation, it is explicitly stating that it supports neither sparse files nor ADHs. So if a READ occurs on a sparse file or ADH, then the server must expand such data to be raw bytes. If a READ occurs in the middle of a hole or ADH, the server can only send back bytes starting from that offset. In contrast, if a READ_PLUS occurs in the middle of a hole or ADH, the server can send back a range which starts before the offset and extends past the range.

4.3.2. WRITE_PLUS

WRITE_PLUS can be used to either hole punch or initialize ADHs. For either purpose, the client can avoid the transfer of a repetitive pattern across the network. If the file system on the server does not support sparse files, the WRITE_PLUS operation may return the result asynchronously via the CB_OFFLOAD operation. As a hole punch may entail deallocating data blocks, even if the file system supports sparse files, it may still have to return the result via CB_OFFLOAD.

5. Space Reservation

5.1. Introduction

Applications such as hypervisors want to be able to reserve space for a file, report the amount of actual disk space a file occupies, and free up the backing space of a file when it is not required. In virtualized environments, virtual disk files are often stored on NFS mounted volumes. Since virtual disk files represent the hard disks of virtual machines, hypervisors often have to guarantee certain properties for the file.

One such example is space reservation.
When a hypervisor creates a virtual disk file, it often tries to preallocate the space for the file so that there are no future allocation related errors during the operation of the virtual machine. Such errors prevent a virtual machine from continuing execution and result in downtime.

Currently, in order to achieve such a guarantee, applications zero the entire file. The initial zeroing allocates the backing blocks, and all subsequent writes are overwrites of already allocated blocks. This approach is not only inefficient in terms of the amount of I/O done, it is also not guaranteed to work on file systems that are log structured or deduplicated. An efficient way of guaranteeing space reservation would be beneficial to such applications.

We define a "reservation" as being the combination of the space_reserved attribute (see Section 11.2.4) and the size attribute (see Section 5.8.1.5 of [1]). If the space_reserved attribute is set on a file, it is guaranteed that writes that do not grow the file past the size will not fail with NFS4ERR_NOSPC. Once the size is changed, the reservation is changed to that new size.

Another useful feature is the ability to report the number of blocks that would be freed when a file is deleted. Currently, NFS reports two size attributes:

size  The logical file size of the file.

space_used  The size in bytes that the file occupies on disk.

While these attributes are sufficient for space accounting in traditional file systems, they prove to be inadequate in modern file systems that support block sharing. In such file systems, multiple inodes can point to a single block with a block reference count to guard against premature freeing.
Having a way to tell the number of blocks that would be freed if the file was deleted would be useful to applications that wish to migrate files when a volume is low on space.

Since virtual disks represent a hard drive in a virtual machine, a virtual disk can be viewed as a file system within a file. Since not all blocks within a file system are in use, there is an opportunity to reclaim blocks that are no longer in use. A call to deallocate blocks could result in better space efficiency; less space MAY be consumed for backups after block deallocation.

The following operations and attributes can be used to resolve these issues:

space_reserved  This attribute specifies that writes to the reserved area of the file will not fail with NFS4ERR_NOSPC.

space_freed  This attribute specifies the space freed when a file is deleted, taking block sharing into consideration.

WRITE_PLUS  This operation zeroes and/or deallocates the blocks backing a region of the file.

If space_used of a file is interpreted to mean the size in bytes of all disk blocks pointed to by the inode of the file, then shared blocks get double counted, over-reporting the space utilization. This also has the adverse effect that the deletion of a file with shared blocks frees up less than space_used bytes.

On the other hand, if space_used is interpreted to mean the size in bytes of those disk blocks unique to the inode of the file, then shared blocks are not counted in any file, resulting in under-reporting of the space utilization.

For example, two files A and B have 10 blocks each. Let 6 of these blocks be shared between them. Thus, the combined space utilized by the two files is 14 * BLOCK_SIZE bytes. In the former case, the combined space utilization of the two files would be reported as 20 * BLOCK_SIZE.
However, deleting either would only result in 4 * BLOCK_SIZE being freed. Conversely, the latter interpretation would report that the space utilization is only 8 * BLOCK_SIZE.

Adding another size attribute, space_freed (see Section 11.2.5), is helpful in solving this problem. space_freed is the number of blocks that are allocated to the given file that would be freed on its deletion. In the example, both A and B would report space_freed as 4 * BLOCK_SIZE and space_used as 10 * BLOCK_SIZE. If A is deleted, B will report space_freed as 10 * BLOCK_SIZE, as the deletion of B would then result in the deallocation of all 10 blocks.

The addition of this attribute does not solve the problem of space being over-reported. However, over-reporting is better than under-reporting.

6. Application Data Hole Support

At the OS level, files are stored in disk blocks. Applications are also free to impose structure on the data contained in a file, and we can define an Application Data Block (ADB) to be such a structure. From the application's viewpoint, it only wants to handle ADBs and not raw bytes (see [15]). An ADB is typically composed of two sections: a header and data. The header describes the characteristics of the block and can provide a means to detect corruption in the data payload. The data section is typically initialized to all zeros.

The format of the header is application specific, but there are two main components typically encountered:

1.  A logical block number which allows the application to determine which data block is being referenced. This is useful when the client is not storing the blocks in contiguous memory.

2.  Fields to describe the state of the ADB and a means to detect block corruption.
For both pieces of data, a useful property is for the allowed values to be chosen such that, if the data is passed across the network, corruption due to translation between big-endian and little-endian architectures is detectable.  For example, 0xF0DEDEF0 has the same bit pattern in both architectures.

Applications already impose structures on files [15] and detect corruption in data blocks [16].  What they are not able to do is efficiently transfer and store ADBs.  To initialize a file with ADBs, the client must send the full ADBs to the server, and they must be stored on the server.

In this section, we are going to define an Application Data Hole (ADH), which is a generic framework for transferring the ADB, present one approach to detecting corruption in a given ADH implementation, and describe the model for how the client and server can support efficient initialization of ADHs, reading of ADH holes, punching ADH holes in a file, and space reservation.  We define the ADHN to be the Application Data Hole Number, which is the logical block number discussed earlier.

6.1.  Generic Framework

We want the representation of the ADH to be flexible enough to support many different applications.  The most basic approach is no imposition of a block at all, which means we are working with the raw bytes.  Such an approach would be useful for storing holes, punching holes, etc.  In more complex deployments, a server might be supporting multiple applications, each with its own definition of the ADH.  One might store the ADHN at the start of the block and then have a guard pattern to detect corruption [17].  The next might store the ADHN at an offset of 100 bytes within the block and have no guard pattern at all; i.e., existing applications might already have well-defined formats for their data blocks.
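As a concrete illustration of the first layout above, the sketch below lays out a block with the ADHN at the start and a guard pattern immediately after it, and checks a block for the corruption cases such a layout can catch.  The 4096-byte block size, the 8-byte ADHN, and the particular offsets are illustrative assumptions, not protocol requirements:

```python
import struct

BLOCK_SIZE = 4096       # hypothetical ADH block size
GUARD = 0xF0DEDEF0      # byte-order-symmetric pattern, as noted in the text

def make_adh(adhn: int) -> bytes:
    """Lay out one ADH: 8-byte big-endian ADHN at offset 0
    (adh_reloff_blocknum = 0), 4-byte guard at offset 8
    (adh_reloff_pattern = 8); the rest is zero-initialized data."""
    block = bytearray(BLOCK_SIZE)
    struct.pack_into(">Q", block, 0, adhn)
    struct.pack_into(">I", block, 8, GUARD)
    return bytes(block)

def check_adh(block: bytes, expected_adhn: int) -> bool:
    """Detect corruption: a block number or guard pattern mismatch."""
    (adhn,) = struct.unpack_from(">Q", block, 0)
    (guard,) = struct.unpack_from(">I", block, 8)
    return adhn == expected_adhn and guard == GUARD
```

The second layout (ADHN at offset 100, no guard) would simply use a different adh_reloff_blocknum and omit the guard check; the framework itself imposes neither choice.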
The guard pattern can be used to represent the state of the block, to protect against corruption, or both.  Again, it needs to be able to be placed anywhere within the ADH.

We need to be able to represent the starting offset of the block and the size of the block.  Note that nothing prevents the application from defining different-sized blocks in a file.

6.1.1.  Data Hole Representation

   struct app_data_hole4 {
           offset4         adh_offset;
           length4         adh_block_size;
           length4         adh_block_count;
           length4         adh_reloff_blocknum;
           count4          adh_block_num;
           length4         adh_reloff_pattern;
           opaque          adh_pattern<>;
   };

The app_data_hole4 structure captures the abstraction presented for the ADH.  The additional fields present are to allow the transmission of adh_block_count ADHs at one time.  We also use adh_block_num to convey the ADHN of the first block in the sequence.  Each ADH will contain the same adh_pattern string.

As both adh_block_num and adh_pattern are optional, if either adh_reloff_pattern or adh_reloff_blocknum is set to NFS4_UINT64_MAX, then the corresponding field is not set in any of the ADHs.

6.1.2.  Data Content

   /*
    * Use an enum such that we can extend new types.
    */
   enum data_content4 {
           NFS4_CONTENT_DATA          = 0,
           NFS4_CONTENT_APP_DATA_HOLE = 1,
           NFS4_CONTENT_HOLE          = 2
   };

New operations might need to differentiate between wanting to access data versus an ADH.  Also, future minor versions might want to introduce new data formats.  This enumeration allows that to occur.

6.2.  An Example of Detecting Corruption

In this section, we define an ADH format in which corruption can be detected.  Note that this is just one possible format and means to detect corruption.

Consider a very basic implementation of an operating system's disk blocks.
A block is either data or it is an indirect block which allows for files to be larger than one block.  It is desired to be able to initialize a block.  Lastly, to quickly unlink a file, a block can be marked invalid.  The contents remain intact - which would enable this OS application to undelete a file.

The application defines 4k sized data blocks, with an 8-byte block counter occurring at offset 0 in the block, and with the guard pattern occurring at offset 8 inside the block.  Furthermore, the guard pattern can take one of four states:

0xfeedface -  This is the FREE state and indicates that the ADH format has been applied.

0xcafedead -  This is the DATA state and indicates that real data has been written to this block.

0xe4e5c001 -  This is the INDIRECT state and indicates that the block contains block counter numbers that are chained off of this block.

0xba1ed4a3 -  This is the INVALID state and indicates that the block contains data whose contents are garbage.

Finally, it also defines an 8-byte checksum [18] starting at byte 16 which applies to the remaining contents of the block.  If the state is FREE, then that checksum is trivially zero.  As such, the application has no need to describe the checksum inside the ADH - it need not make the transfer layer aware of the fact that there is a checksum (see [16] for an example of checksums used to detect corruption in application data blocks).

Corruption in each ADH can thus be detected:

o  If the guard pattern is anything other than one of the allowed values, including all zeros.

o  If the guard pattern is FREE and any other byte in the remainder of the ADH is anything other than zero.

o  If the guard pattern is anything other than FREE, then if the stored checksum does not match the computed checksum.
o  If the guard pattern is INDIRECT and one of the stored indirect block numbers has a value greater than the number of ADHs in the file.

o  If the guard pattern is INDIRECT and one of the stored indirect block numbers is a duplicate of another stored indirect block number.

As can be seen, the application can detect errors based on the combination of the guard pattern state and the checksum.  But also, the application can detect corruption based on the state and the contents of the ADH.  This last point is important in validating the minimum amount of data we incorporated into our generic framework.  That is, the guard pattern alone is sufficient to allow applications to design their own corruption detection.

Finally, it is important to note that none of these corruption checks occur in the transport layer.  The server and client components are totally unaware of the file format and might report everything as being transferred correctly even in the case where the application detects corruption.

6.3.  Example of READ_PLUS

The hypothetical application presented in Section 6.2 can be used to illustrate how READ_PLUS would return an array of results.  A file is created and initialized with 100 4k ADHs in the FREE state:

   WRITE_PLUS {0, 4k, 100, 0, 0, 8, 0xfeedface}

Further, assume the application writes a single ADH at 16k, changing the guard pattern to 0xcafedead.  We would then have in memory:

   0 -> (16k - 1)    : 4k, 4, 0, 0, 8, 0xfeedface
   16k -> (20k - 1)  : 00 00 00 00 00 00 00 04 ca fe de ad XX XX ... XX XX
   20k -> (400k - 1) : 4k, 95, 0, 5, 8, 0xfeedface

And when the client did a READ_PLUS of 64k at the start of the file, it would get back a result of an ADH, some data, and a final ADH:

   ADH {0, 4k, 4, 0, 0, 8, 0xfeedface}
   data 4k
   ADH {20k, 4k, 11, 0, 5, 8, 0xfeedface}

7.  Labeled NFS

7.1.
Introduction

Access control models such as Unix permissions or Access Control Lists are commonly referred to as Discretionary Access Control (DAC) models.  These systems base their access decisions on user identity and resource ownership.  In contrast, Mandatory Access Control (MAC) models base their access control decisions on the label on the subject (usually a process) and the object it wishes to access [19].  These labels may contain user identity information but usually contain additional information.  In DAC systems, users are free to specify the access rules for resources that they own.  MAC models base their security decisions on a system-wide policy established by an administrator or organization, which the users do not have the ability to override.  In this section, we add a MAC model to NFSv4.2.

The first change necessary is to devise a method for transporting and storing security label data on NFSv4 file objects.  Security labels have several semantics that are met by NFSv4 recommended attributes, such as the ability to set the label value upon object creation.  Access control on these attributes is done through a combination of two mechanisms.  As with other recommended attributes on file objects, the usual DAC checks (ACLs and permission bits) will be performed to ensure that proper file ownership is enforced.  In addition, a MAC system MAY be employed on the client, server, or both to enforce additional policy on what subjects may modify security label information.

The second change is to provide methods for the client to determine if the security label has changed.  A client which needs to know if a label is going to change SHOULD request a delegation on that file.  In order to change the security label, the server will have to recall all delegations.  This will inform the client of the change.
If a client wants to detect if the label has changed, it MAY use VERIFY and NVERIFY on FATTR4_CHANGE_SEC_LABEL to detect that the FATTR4_SEC_LABEL has been modified.

The final change necessary is a modification to the RPC layer used in NFSv4 in the form of a new version of the RPCSEC_GSS [6] framework.  In order for an NFSv4 server to apply MAC checks, it must obtain additional information from the client.  Several methods were explored for performing this, and it was decided that the best approach was to incorporate the ability to make security attribute assertions through the RPC mechanism.  RPCSECGSSv3 [4] outlines a method to assert additional security information such as security labels on GSS context creation and have that data bound to all RPC requests that make use of that context.

7.2.  Definitions

Label Format Specifier (LFS):  is an identifier used by the client to establish the syntactic format of the security label and the semantic meaning of its components.  These specifiers exist in a registry associated with documents describing the format and semantics of the label.

Label Format Registry:  is the IANA registry containing all registered LFSs along with references to the documents that describe the syntactic format and semantics of the security label.

Policy Identifier (PI):  is an optional part of the definition of a Label Format Specifier which allows for clients and servers to identify specific security policies.

Object:  is a passive resource within the system that we wish to be protected.  Objects can be entities such as files, directories, pipes, sockets, and many other system resources relevant to the protection of the system state.

Subject:  is an active entity, usually a process, which is requesting access to an object.

MAC-Aware:  is a server which can transmit and store object labels.
MAC-Functional:  is a client or server which is Labeled NFS enabled.  Such a system can interpret labels and apply policies based on the security system.

Multi-Level Security (MLS):  is a traditional model where objects are given a sensitivity level (Unclassified, Secret, Top Secret, etc.) and a category set [20].

7.3.  MAC Security Attribute

MAC models base access decisions on security attributes bound to subjects and objects.  This information can be a user identity for an identity-based MAC model, a sensitivity level for Multi-Level Security, or a type for Type Enforcement.  These models base their decisions on different criteria, but the semantics of the security attribute remain the same.  The semantics required by the security attributes are listed below:

o  MUST provide flexibility with respect to the MAC model.

o  MUST provide the ability to atomically set security information upon object creation.

o  MUST provide the ability to enforce access control decisions both on the client and the server.

o  MUST NOT expose an object to either the client or server name space before its security information has been bound to it.

NFSv4 implements the security attribute as a recommended attribute.  These attributes have a fixed format and semantics, which conflicts with the flexible nature of the security attribute.  To resolve this, the security attribute consists of two components.  The first component is an LFS as defined in [21] to allow for interoperability between MAC mechanisms.  The second component is an opaque field which is the actual security attribute data.  To allow for various MAC models, NFSv4 should be used solely as a transport mechanism for the security attribute.  It is the responsibility of the endpoints to consume the security attribute and make access decisions based on their respective models.
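The division of labor described above can be sketched as follows: the transport carries an (LFS, opaque data) pair, and the endpoint dispatches to whatever MAC model the LFS identifies.  The handler registry, the LFS numbers, and the toy dominance rule below are purely illustrative assumptions, not part of the protocol:

```python
from typing import Callable, Dict, Tuple

# What NFSv4 transports: an LFS identifier plus opaque attribute data.
SecLabel = Tuple[int, bytes]

# Endpoint-local registry mapping an LFS number to a decision function;
# the entries are supplied by the host's MAC model, not by NFSv4.
_handlers: Dict[int, Callable[[bytes, bytes], bool]] = {}

def register_lfs(lfs: int, decide: Callable[[bytes, bytes], bool]) -> None:
    _handlers[lfs] = decide

def access_allowed(subject: SecLabel, obj: SecLabel) -> bool:
    """Defer the decision to the MAC model identified by the LFS."""
    subj_lfs, subj_data = subject
    obj_lfs, obj_data = obj
    if subj_lfs != obj_lfs:
        # Analogous to NFS4ERR_WRONG_LFS when no translation exists.
        return False
    decide = _handlers.get(obj_lfs)
    if decide is None:
        return False  # unknown model: deny rather than guess
    return decide(subj_data, obj_data)
```

A host could, for instance, register a handler for a hypothetical LFS 99 with `register_lfs(99, lambda s, o: s >= o)`; the point is that the comparison logic lives entirely in the endpoint, never in the transport.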
In addition, creation of objects through OPEN and CREATE allows for the security attribute to be specified upon creation.  By providing an atomic create and set operation for the security attribute, it is possible to enforce the second and fourth requirements.  The recommended attribute FATTR4_SEC_LABEL (see Section 11.2.2) will be used to satisfy this requirement.

7.3.1.  Delegations

In the event that a security attribute is changed on the server while a client holds a delegation on the file, both the server and the client MUST follow the NFSv4.1 protocol (see Chapter 10 of [1]) with respect to attribute changes.  The client SHOULD flush all changes back to the server and relinquish the delegation.

7.3.2.  Permission Checking

It is not feasible to enumerate all possible MAC models and even levels of protection within a subset of these models.  This means that NFSv4 clients and servers cannot be expected to directly make access control decisions based on the security attribute.  Instead, NFSv4 should defer permission checking on this attribute to the host system.  These checks are performed in addition to the existing DAC and ACL checks outlined in the NFSv4 protocol.  Section 7.6 gives a specific example of how the security attribute is handled under a particular MAC model.

7.3.3.  Object Creation

When creating files in NFSv4, the OPEN and CREATE operations are used.  One of the parameters to these operations is an fattr4 structure containing the attributes the file is to be created with.  This allows NFSv4 to atomically set the security attribute of files upon creation.  When a client is MAC-Functional, it must always provide the initial security attribute upon file creation.
In the event that the server is MAC-Functional as well, it should determine by policy whether it will accept the attribute from the client or instead make the determination itself.  If the client is not MAC-Functional, then the MAC-Functional server must decide on a default label.  A more in-depth explanation can be found in Section 7.6.

7.3.4.  Existing Objects

Note that under the MAC model, all objects must have labels.  Therefore, if an existing server is upgraded to include Labeled NFS support, then it is the responsibility of the security system to define the behavior for existing objects.

7.3.5.  Label Changes

If there are open delegations on the file belonging to clients other than the one making the label change, then the process described in Section 7.3.1 must be followed.  In short, the delegation will be recalled, which effectively notifies the client of the change.

As the server is always presented with the subject label from the client, it does not necessarily need to communicate the fact that the label has changed to the client.  In the cases where the change outright denies the client access, the client will be able to quickly determine that there is a new label in effect.

Consider a system in which the clients enforce MAC checks and the server has a very simple security system which just stores the labels.  In this system, the MAC label check always allows access, regardless of the subject label.

In such a system, MAC labels are enforced solely by the client.  The security policies on the client can be such that the client does not have access to the file unless it has a delegation.  The recall of the delegation will force the client to flush any cached content of the file.
The clients could also be configured to periodically VERIFY/NVERIFY the FATTR4_CHANGE_SEC_LABEL attribute to determine when the label has changed.  When a change is detected, the client could take the costlier action of retrieving the FATTR4_SEC_LABEL.

7.4.  pNFS Considerations

This section examines the issues in deploying Labeled NFS in a pNFS community of servers.

7.4.1.  MAC Label Checks

The new FATTR4_SEC_LABEL attribute is metadata information, and as such the DS is not aware of the value contained on the MDS.  Fortunately, the NFSv4.1 protocol [1] already has provisions for doing access level checks from the DS to the MDS.  In order for the DS to validate the subject label presented by the client, it SHOULD utilize this mechanism.

7.5.  Discovery of Server Labeled NFS Support

The server can easily determine that a client supports Labeled NFS when it queries for the FATTR4_SEC_LABEL label for an object.  Note that it cannot assume that the presence of RPCSEC_GSSv3 indicates Labeled NFS support.  The client might need to discover which LFSs the server supports.

A server which supports Labeled NFS MUST allow a client with any subject label to retrieve the FATTR4_SEC_LABEL attribute for the root filehandle, ROOTFH.  The following compound must always succeed as far as a MAC label check is concerned:

   PUTROOTFH, GETATTR {FATTR4_SEC_LABEL}

Note that the server might have imposed a security flavor on the root that precludes such access.  For example, if the server requires Kerberized access and the client presents a compound with AUTH_SYS, then the server is allowed to return NFS4ERR_WRONGSEC.  But if the client presents a correct security flavor, then the server MUST return the FATTR4_SEC_LABEL attribute with the supported LFS filled in.

7.6.
MAC Security NFS Modes of Operation

A system using Labeled NFS may operate in two modes.  The first mode provides the most protection and is called "full mode".  In this mode, both the client and server implement a MAC model, allowing each end to make an access control decision.  The remaining mode is called the "guest mode", and in this mode one end of the connection is not implementing a MAC model and thus offers less protection than full mode.

7.6.1.  Full Mode

Full mode environments consist of MAC-Functional NFSv4 servers and clients and may be composed of mixed MAC models and policies.  The system requires that both the client and server have an opportunity to perform an access control check based on all relevant information within the network.  The file object security attribute is provided using the mechanism described in Section 7.3.  The security attribute of the subject making the request is transported at the RPC layer using the mechanism described in RPCSECGSSv3 [4].

7.6.1.1.  Initial Labeling and Translation

The ability to create a file is an action that a MAC model may wish to mediate.  The client is given the responsibility to determine the initial security attribute to be placed on a file.  This allows the client to make a decision as to the acceptable security attributes to create a file with before sending the request to the server.  Once the server receives the creation request from the client, it may choose to evaluate if the security attribute is acceptable.

Security attributes on the client and server may vary based on MAC model and policy.  To handle this, the security attribute field has an LFS component.  This component is a mechanism for the host to identify the format and meaning of the opaque portion of the security attribute.  A full mode environment may contain hosts operating in several different LFSs.
In this case, a mechanism for translating the opaque portion of the security attribute is needed.  The actual translation function will vary based on MAC model and policy and is out of the scope of this document.  If a translation is unavailable for a given LFS, then the request MUST be denied.  Another recourse is to allow the host to provide a fallback mapping for unknown security attributes.

7.6.1.2.  Policy Enforcement

In full mode, access control decisions are made by both the clients and servers.  When a client makes a request, it takes the security attribute from the requesting process and makes an access control decision based on that attribute and the security attribute of the object it is trying to access.  If the client denies that access, an RPC call to the server is never made.  If, however, the access is allowed, the client will make a call to the NFS server.

When the server receives the request from the client, it extracts the security attribute conveyed in the RPC request.  The server then uses this security attribute and the attribute of the object the client is trying to access to make an access control decision.  If the server's policy allows this access, it will fulfill the client's request; otherwise, it will return NFS4ERR_ACCESS.

Implementations MAY validate security attributes supplied over the network to ensure that they are within a set of attributes permitted from a specific peer, and if not, reject them.  Note that a system may permit a different set of attributes to be accepted from each peer.

7.6.1.3.  Limited Server

A Limited Server mode (see Section 3.5.2 of [19]) consists of a server which is label aware but does not enforce policies.
Such a server will store and retrieve all object labels presented by clients and utilize the methods described in Section 7.3.5 to allow the clients to detect changing labels, but it will not restrict access via the subject label.  Instead, it will expect the clients to enforce all such access locally.

7.6.2.  Guest Mode

Guest mode implies that either the client or the server does not handle labels.  If the client is not Labeled NFS aware, then it will not offer subject labels to the server.  The server is the only entity enforcing policy, and may selectively provide standard NFS services to clients based on their authentication credentials and/or associated network attributes (e.g., IP address, network interface).  The level of trust and access extended to a client in this mode is configuration-specific.  If the server is not Labeled NFS aware, then it will not return object labels to the client.  Clients in this environment may consist of groups implementing different MAC models and policies.  The system requires that all clients in the environment be responsible for access control checks.

7.7.  Security Considerations

This entire chapter deals with security issues.

Depending on the level of protection the MAC system offers, there may be a requirement to tightly bind the security attribute to the data.

When only one of the client or server enforces labels, it is important to realize that the other side is not enforcing MAC protections.  Alternate methods might be in use to handle the lack of MAC support, and care should be taken to identify and mitigate threats from possible tampering outside of these methods.

An example of this is that a server that modifies READDIR or LOOKUP results based on the client's subject label might want to always construct the same subject label for a client which does not present one.
This will prevent a non-Labeled NFS client from mixing entries in the directory cache.

8.  Sharing change attribute implementation details with NFSv4 clients

8.1.  Introduction

Although both the NFSv4 [9] and NFSv4.1 [1] protocols define the change attribute as being mandatory to implement, there is little in the way of guidance.  The only mandated feature is that the value must change whenever the file data or metadata change.

While this allows for a wide range of implementations, it also leaves the client with a conundrum: how does it determine which is the most recent value for the change attribute in a case where several RPC calls have been issued in parallel?  In other words, if two COMPOUNDs, both containing WRITE and GETATTR requests for the same file, have been issued in parallel, how does the client determine which of the two change attribute values returned in the replies to the GETATTR requests corresponds to the most recent state of the file?  In some cases, the only recourse may be to send another COMPOUND containing a third GETATTR that is fully serialized with the first two.

NFSv4.2 avoids this kind of inefficiency by allowing the server to share details about how the change attribute is expected to evolve, so that the client may immediately determine which, out of the several change attribute values returned by the server, is the most recent.  change_attr_type is defined as a new recommended attribute (see Section 11.2.1), and is per file system.

9.  Security Considerations

NFSv4.2 has all of the security concerns present in NFSv4.1 (see Section 21 of [1]) and those present in the Server-side Copy (see Section 2.4) and in Labeled NFS (see Section 7.7).

10.  Error Values

NFS error numbers are assigned to failed operations within a Compound (COMPOUND or CB_COMPOUND) request.  A Compound request contains a
A Compound request contains a 1902 number of NFS operations that have their results encoded in sequence 1903 in a Compound reply. The results of successful operations will 1904 consist of an NFS4_OK status followed by the encoded results of the 1905 operation. If an NFS operation fails, an error status will be 1906 entered in the reply and the Compound request will be terminated. 1908 10.1. Error Definitions 1910 Protocol Error Definitions 1912 +--------------------------+--------+------------------+ 1913 | Error | Number | Description | 1914 +--------------------------+--------+------------------+ 1915 | NFS4ERR_BADLABEL | 10093 | Section 10.1.3.1 | 1916 | NFS4ERR_METADATA_NOTSUPP | 10090 | Section 10.1.2.1 | 1917 | NFS4ERR_OFFLOAD_DENIED | 10091 | Section 10.1.2.2 | 1918 | NFS4ERR_PARTNER_NO_AUTH | 10089 | Section 10.1.2.3 | 1919 | NFS4ERR_PARTNER_NOTSUPP | 10088 | Section 10.1.2.4 | 1920 | NFS4ERR_UNION_NOTSUPP | 10094 | Section 10.1.1.1 | 1921 | NFS4ERR_WRONG_LFS | 10092 | Section 10.1.3.2 | 1922 +--------------------------+--------+------------------+ 1924 Table 1 1926 10.1.1. General Errors 1928 This section deals with errors that are applicable to a broad set of 1929 different purposes. 1931 10.1.1.1. NFS4ERR_UNION_NOTSUPP (Error Code 10094) 1933 One of the arguments to the operation is a discriminated union and 1934 while the server supports the given operation, it does not support 1935 the selected arm of the discriminated union. For an example, see 1936 READ_PLUS (Section 13.10). 1938 10.1.2. Server to Server Copy Errors 1940 These errors deal with the interaction between server to server 1941 copies. 1943 10.1.2.1. NFS4ERR_METADATA_NOTSUPP (Error Code 10090) 1945 The destination file cannot support the same metadata as the source 1946 file. 1948 10.1.2.2. NFS4ERR_OFFLOAD_DENIED (Error Code 10091) 1950 The copy offload operation is supported by both the source and the 1951 destination, but the destination is not allowing it for this file. 
If the client sees this error, it should fall back to the normal copy semantics.

10.1.2.3.  NFS4ERR_PARTNER_NO_AUTH (Error Code 10089)

The source server does not authorize a server-to-server copy offload operation.  This may be due to the client's failure to send the COPY_NOTIFY operation to the source server, the source server receiving a server-to-server copy offload request after the copy lease time expired, or some other permission problem.

10.1.2.4.  NFS4ERR_PARTNER_NOTSUPP (Error Code 10088)

The remote server does not support the server-to-server copy offload protocol.

10.1.3.  Labeled NFS Errors

These errors are used in Labeled NFS.

10.1.3.1.  NFS4ERR_BADLABEL (Error Code 10093)

The label specified is invalid in some manner.

10.1.3.2.  NFS4ERR_WRONG_LFS (Error Code 10092)

The LFS specified in the subject label is not compatible with the LFS in the object label.

10.2.  New Operations and Their Valid Errors

This section contains a table that gives the valid error returns for each new NFSv4.2 protocol operation.  The error code NFS4_OK (indicating no error) is not listed but should be understood to be returnable by all new operations.  The error values for all other operations are defined in Section 15.2 of [1].
          Valid Error Returns for Each New Protocol Operation

   +----------------+--------------------------------------------------+
   | Operation      | Errors                                           |
   +----------------+--------------------------------------------------+
   | COPY           | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,           |
   |                | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,             |
   |                | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,              |
   |                | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT,            |
   |                | NFS4ERR_EXPIRED, NFS4ERR_FBIG,                   |
   |                | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, |
   |                | NFS4ERR_IO, NFS4ERR_ISDIR, NFS4ERR_LOCKED,       |
   |                | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,             |
   |                | NFS4ERR_NOSPC, NFS4ERR_OFFLOAD_DENIED,           |
   |                | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,           |
   |                | NFS4ERR_OP_NOT_IN_SESSION,                       |
   |                | NFS4ERR_PARTNER_NO_AUTH,                         |
   |                | NFS4ERR_PARTNER_NOTSUPP, NFS4ERR_PNFS_IO_HOLE,   |
   |                | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG,     |
   |                | NFS4ERR_REP_TOO_BIG_TO_CACHE,                    |
   |                | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
   |                | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT,               |
   |                | NFS4ERR_STALE, NFS4ERR_SYMLINK,                  |
   |                | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE         |
   | COPY_NOTIFY    | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,           |
   |                | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,             |
   |                | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,              |
   |                | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED,          |
   |                | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, |
   |                | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED,       |
   |                | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,             |
   |                | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,           |
   |                | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, |
   |                | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG,     |
   |                | NFS4ERR_REP_TOO_BIG_TO_CACHE,                    |
   |                | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
   |                | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,              |
   |                | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,           |
   |                | NFS4ERR_WRONG_TYPE                               |
   | OFFLOAD_ABORT  | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR,           |
2029 | | NFS4ERR_BAD_STATEID, NFS4ERR_COMPLETE_ALREADY, | 2030 | | NFS4ERR_DEADSESSION, NFS4ERR_EXPIRED, | 2031 | | NFS4ERR_DELAY, NFS4ERR_GRACE, NFS4ERR_NOTSUPP, | 2032 | | NFS4ERR_OLD_STATEID, NFS4ERR_OP_NOT_IN_SESSION, | 2033 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 2034 | OFFLOAD_REVOKE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 2035 | | NFS4ERR_COMPLETE_ALREADY, NFS4ERR_DELAY, | 2036 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_MOVED, | 2037 | | NFS4ERR_NOTSUPP, NFS4ERR_OP_NOT_IN_SESSION, | 2038 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 2039 | OFFLOAD_STATUS | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 2040 | | NFS4ERR_BAD_STATEID, NFS4ERR_COMPLETE_ALREADY, | 2041 | | NFS4ERR_DEADSESSION, NFS4ERR_EXPIRED, | 2042 | | NFS4ERR_DELAY, NFS4ERR_GRACE, NFS4ERR_NOTSUPP, | 2043 | | NFS4ERR_OLD_STATEID, NFS4ERR_OP_NOT_IN_SESSION, | 2044 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 2045 | READ_PLUS | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 2046 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 2047 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2048 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 2049 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, | 2050 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED, | 2051 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 2052 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 2053 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, | 2054 | | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG, | 2055 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2056 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2057 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 2058 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 2059 | | NFS4ERR_UNION_NOTSUPP, NFS4ERR_WRONG_TYPE | 2060 | SEEK | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 2061 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 2062 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2063 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 2064 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, | 2065 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED,
| 2066 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 2067 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 2068 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, | 2069 | | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG, | 2070 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2071 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2072 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 2073 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 2074 | | NFS4ERR_UNION_NOTSUPP, NFS4ERR_WRONG_TYPE | 2075 | SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | 2076 | | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT, | 2077 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 2078 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2079 | | NFS4ERR_REP_TOO_BIG, | 2080 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2081 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2082 | | NFS4ERR_SEQUENCE_POS, NFS4ERR_SEQ_FALSE_RETRY, | 2083 | | NFS4ERR_SEQ_MISORDERED, NFS4ERR_TOO_MANY_OPS | 2084 | WRITE_PLUS | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 2085 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 2086 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 2087 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 2088 | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | 2089 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, | 2090 | | NFS4ERR_IO, NFS4ERR_ISDIR, NFS4ERR_LOCKED, | 2091 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 2092 | | NFS4ERR_NOSPC, NFS4ERR_OLD_STATEID, | 2093 | | NFS4ERR_OPENMODE, NFS4ERR_OP_NOT_IN_SESSION, | 2094 | | NFS4ERR_PNFS_IO_HOLE, NFS4ERR_PNFS_NO_LAYOUT, | 2095 | | NFS4ERR_REP_TOO_BIG, | 2096 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 2097 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | 2098 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 2099 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 2100 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_UNION_NOTSUPP, | 2101 | | NFS4ERR_WRONG_TYPE | 2102 +----------------+--------------------------------------------------+ 2104 Table 2 2106 10.3. 
New Callback Operations and Their Valid Errors 2108 This section contains a table that gives the valid error returns for 2109 each new NFSv4.2 callback operation. The error code NFS4_OK 2110 (indicating no error) is not listed but should be understood to be 2111 returnable by all new callback operations. The error values for all 2112 other callback operations are defined in Section 15.3 of [1]. 2114 Valid Error Returns for Each New Protocol Callback Operation 2116 +------------+------------------------------------------------------+ 2117 | Callback | Errors | 2118 | Operation | | 2119 +------------+------------------------------------------------------+ 2120 | CB_OFFLOAD | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 2121 | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | 2122 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_REP_TOO_BIG, | 2123 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_REQ_TOO_BIG, | 2124 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_SERVERFAULT, | 2125 | | NFS4ERR_TOO_MANY_OPS | 2126 +------------+------------------------------------------------------+ 2128 Table 3 2130 11. New File Attributes 2132 11.1. New RECOMMENDED Attributes - List and Definition References 2134 The list of new RECOMMENDED attributes appears in Table 4. The 2135 meanings of the columns of the table are: 2137 Name: The name of the attribute. 2139 Id: The number assigned to the attribute. In the event of a conflict 2140 between the assigned number and [2], the latter is likely 2141 authoritative; such a conflict should be resolved with Errata to 2142 this document and/or [2]. See [22] for the Errata process. 2144 Data Type: The XDR data type of the attribute. 2146 Acc: Access allowed to the attribute. 2148 R means read-only (GETATTR may retrieve, SETATTR may not set). 2150 W means write-only (SETATTR may set, GETATTR may not retrieve). 2152 R W means read/write (GETATTR may retrieve, SETATTR may set). 2154 Defined in: The section of this specification that describes the 2155 attribute.
2157 +------------------+----+-------------------+-----+----------------+ 2158 | Name | Id | Data Type | Acc | Defined in | 2159 +------------------+----+-------------------+-----+----------------+ 2160 | change_attr_type | 79 | change_attr_type4 | R | Section 11.2.1 | 2161 | sec_label | 80 | sec_label4 | R W | Section 11.2.2 | 2162 | change_sec_label | 81 | change_sec_label4 | R | Section 11.2.3 | 2163 | space_reserved | 77 | boolean | R W | Section 11.2.4 | 2164 | space_freed | 78 | length4 | R | Section 11.2.5 | 2165 +------------------+----+-------------------+-----+----------------+ 2167 Table 4 2169 11.2. Attribute Definitions 2171 11.2.1. Attribute 79: change_attr_type 2173 enum change_attr_type4 { 2174 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, 2175 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, 2176 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, 2177 NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, 2178 NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 2179 }; 2181 change_attr_type is a per file system attribute which enables the 2182 NFSv4.2 server to provide additional information about how it expects 2183 the change attribute value to evolve after the file data or metadata 2184 has changed. While Section 5.4 of [1] discusses per file system 2185 attributes, it is expected that the value of change_attr_type not 2186 depend on the value of "homogeneous" and change only in the event of 2187 a migration. 2189 NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take 2190 values that fit into any of these categories. 2192 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST 2193 monotonically increase for every atomic change to the file 2194 attributes, data, or directory contents. 2196 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER: The change attribute value MUST 2197 be incremented by one unit for every atomic change to the file 2198 attributes, data, or directory contents. This property is 2199 preserved when writing to pNFS data servers.
2201 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute 2202 value MUST be incremented by one unit for every atomic change to 2203 the file attributes, data, or directory contents. In the case 2204 where the client is writing to pNFS data servers, the number of 2205 increments is not guaranteed to exactly match the number of 2206 writes. 2208 NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is 2209 implemented as suggested in the NFSv4 spec [9] in terms of the 2210 time_metadata attribute. 2212 If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, 2213 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or 2214 NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at 2215 the very least that the change attribute is monotonically increasing, 2216 which is sufficient to resolve the question of which value is the 2217 most recent. 2219 If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then 2220 by inspecting the value of the 'time_delta' attribute it additionally 2221 has the option of detecting rogue server implementations that use 2222 time_metadata in violation of the spec. 2224 If the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it has the 2225 ability to predict what the resulting change attribute value should 2226 be after a COMPOUND containing a SETATTR, WRITE, or CREATE. This 2227 again allows it to detect changes made in parallel by another client. 2228 The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits the 2229 same, but only if the client is not doing pNFS WRITEs. 2231 Finally, if the server does not support change_attr_type or if 2232 NFS4_CHANGE_TYPE_IS_UNDEFINED is set, then the server SHOULD make an 2233 effort to implement the change attribute in terms of the 2234 time_metadata attribute. 2236 11.2.2.
Attribute 80: sec_label 2238 typedef uint32_t policy4; 2240 struct labelformat_spec4 { 2241 policy4 lfs_lfs; 2242 policy4 lfs_pi; 2243 }; 2245 struct sec_label4 { 2246 labelformat_spec4 slai_lfs; 2247 opaque slai_data<>; 2248 }; 2250 The FATTR4_SEC_LABEL attribute consists of two components, with the 2251 first component being an LFS. It serves to provide the receiving end 2252 with the information necessary to translate the security attribute 2253 into a form that is usable by the endpoint. Label Formats assigned 2254 an LFS may optionally include a Policy Identifier field to 2255 allow for complex policy deployments. The LFS and Label Format 2256 Registry are described in detail in [21]. The translation used to 2257 interpret the security attribute is not specified as part of the 2258 protocol as it may depend on various factors. The second component 2259 is an opaque section which contains the data of the attribute. This 2260 component is dependent on the MAC model to interpret and enforce. 2262 In particular, it is the responsibility of the LFS specification to 2263 define a maximum size for the opaque section, slai_data<>. When 2264 creating or modifying a label for an object, the client needs to be 2265 guaranteed that the server will accept a label that is sized 2266 correctly. Because both client and server are part of a specific MAC 2267 model, the client will be aware of the size. 2269 If a server supports sec_label, then it MUST also support 2270 change_sec_label. Any modification to sec_label MUST modify the 2271 value for change_sec_label. 2273 11.2.3. Attribute 81: change_sec_label 2275 struct change_sec_label4 { 2276 uint64_t csl_major; 2277 uint64_t csl_minor; 2278 }; 2280 The change_sec_label attribute is a read-only attribute per file. If 2281 the value of sec_label for a file is not the same at two disparate 2282 times then the values of change_sec_label at those times MUST be 2283 different as well.
The value of change_sec_label MAY change at other 2284 times as well, but this should be rare, as that will require the 2285 client to abort any operation in progress, re-read the label, and 2286 retry the operation. As the sec_label is not bounded by size, this 2287 attribute allows for VERIFY and NVERIFY to quickly determine if the 2288 sec_label has been modified. 2290 11.2.4. Attribute 77: space_reserved 2292 The space_reserved attribute is a read/write attribute of type 2293 boolean. It is a per file attribute and applies during the lifetime 2294 of the file or until it is turned off. When the space_reserved 2295 attribute is set via SETATTR, the server must ensure that there is 2296 disk space to accommodate every byte in the file before it can return 2297 success. If the server cannot guarantee this, it must return 2298 NFS4ERR_NOSPC. 2300 If the client tries to grow a file which has the space_reserved 2301 attribute set, the server must guarantee that there is disk space to 2302 accommodate every byte in the file with the new size before it can 2303 return success. If the server cannot guarantee this, it must return 2304 NFS4ERR_NOSPC. 2306 It is not required that the server allocate the space to the file 2307 before returning success. The allocation can be deferred; however, 2308 it must be guaranteed that it will not fail for lack of space. 2310 The value of space_reserved can be obtained at any time through 2311 GETATTR. If the size is retrieved at the same time, the client can 2312 determine the size of the reservation. 2314 In order to avoid ambiguity, the space_reserved bit cannot be set 2315 along with the size bit in SETATTR. Increasing the size of a file 2316 with space_reserved set will fail if space reservation cannot be 2317 guaranteed for the new size. If the file size is decreased, space 2318 reservation is only guaranteed for the new size. If a hole is 2319 punched into the file, then the reservation is not changed. 2321 11.2.5.
Attribute 78: space_freed 2323 space_freed gives the number of bytes freed if the file is deleted. 2324 This attribute is read only and is of type length4. It is a per file 2325 attribute. 2327 12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 2329 The following tables summarize the operations of the NFSv4.2 protocol 2330 and the corresponding designation of REQUIRED, RECOMMENDED, or 2331 OPTIONAL to implement, or of either OBSOLETE if implemented or MUST 2332 NOT implement. The designation of OBSOLETE if implemented is reserved 2333 for those operations which are defined in either NFSv4.0 or NFSv4.1, 2334 can be implemented in NFSv4.2, and are intended to become MUST NOT 2335 implement in NFSv4.3. The designation of MUST NOT implement is 2336 reserved for those operations that were defined in either NFSv4.0 or 2337 NFSv4.1 and MUST NOT be implemented in NFSv4.2. 2339 For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation 2340 for operations sent by the client is for the server implementation. 2341 The client is generally required to implement the operations needed 2342 for the operating environment in which it operates. For example, a 2343 read-only NFSv4.2 client would have no need to implement the WRITE 2344 operation and is not required to do so. 2346 The REQUIRED or OPTIONAL designation for callback operations sent by 2347 the server is for both the client and server. Generally, the client 2348 has the option of creating the backchannel and sending the operations 2349 on the fore channel that will be a catalyst for the server sending 2350 callback operations. A partial exception is CB_RECALL_SLOT; the only 2351 way the client can avoid supporting this operation is by not creating 2352 a backchannel. 2354 Since this is a summary of the operations and their designation, 2355 there are subtleties that are not presented here.
Therefore, if 2356 there is a question of the requirements of implementation, the 2357 operation descriptions themselves must be consulted along with other 2358 relevant explanatory text within either this specification or that of 2359 NFSv4.1 [1]. 2361 The abbreviations used in the second and third columns of the table 2362 are defined as follows. 2364 REQ REQUIRED to implement 2366 REC RECOMMENDED to implement 2368 OPT OPTIONAL to implement 2370 OBS OBSOLETE, might be required to implement 2372 MNI MUST NOT implement 2374 For the NFSv4.2 features that are OPTIONAL, the operations that 2375 support those features are OPTIONAL, and a server that does not 2376 support a feature would return NFS4ERR_NOTSUPP in response to the 2377 client's use of the corresponding operations. If an OPTIONAL feature 2378 is supported, it is possible that a set of operations related to the 2379 feature become REQUIRED to implement. The third column of the table 2380 designates the feature(s) and whether the operation is REQUIRED or 2381 OPTIONAL in the presence of support for the feature.
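As a rough illustration of the NFS4ERR_NOTSUPP behavior for OPTIONAL features, a server might gate operations behind a feature-support check. The sketch below is hypothetical, not normative; the operation-to-feature map is a small excerpt consistent with the table that follows, and the function name is invented:

```python
# Illustrative sketch only: gate OPTIONAL NFSv4.2 operations on
# feature support, returning NFS4ERR_NOTSUPP when the feature is
# absent, as described above.
NFS4_OK = 0
NFS4ERR_NOTSUPP = 10004  # error code value inherited from NFSv4.1

# Excerpt of the operation-to-feature mapping (abbreviations match
# the table: COPY = Server Side Copy, ADH = Application Data Holes,
# pNFS = Parallel NFS).
OP_FEATURE = {
    "COPY": "COPY",
    "COPY_NOTIFY": "COPY",
    "OFFLOAD_ABORT": "COPY",
    "READ_PLUS": "ADH",
    "WRITE_PLUS": "ADH",
    "LAYOUTGET": "pNFS",
}

def check_optional_op(op, supported_features):
    """Return NFS4ERR_NOTSUPP if op requires an unsupported OPTIONAL
    feature; NFS4_OK means processing may continue."""
    feature = OP_FEATURE.get(op)
    if feature is not None and feature not in supported_features:
        return NFS4ERR_NOTSUPP
    return NFS4_OK
```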
2383 The OPTIONAL features identified and their abbreviations are as 2384 follows: 2386 pNFS Parallel NFS 2388 FDELG File Delegations 2390 DDELG Directory Delegations 2391 COPY Server Side Copy 2393 ADH Application Data Holes 2395 Operations 2397 +----------------------+--------------------+-----------------------+ 2398 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, or | 2399 | | MNI | OPT) | 2400 +----------------------+--------------------+-----------------------+ 2401 | ACCESS | REQ | | 2402 | BACKCHANNEL_CTL | REQ | | 2403 | BIND_CONN_TO_SESSION | REQ | | 2404 | CLOSE | REQ | | 2405 | COMMIT | REQ | | 2406 | COPY | OPT | COPY (REQ) | 2407 | OFFLOAD_ABORT | OPT | COPY (REQ) | 2408 | COPY_NOTIFY | OPT | COPY (REQ) | 2409 | OFFLOAD_REVOKE | OPT | COPY (REQ) | 2410 | OFFLOAD_STATUS | OPT | COPY (REQ) | 2411 | CREATE | REQ | | 2412 | CREATE_SESSION | REQ | | 2413 | DELEGPURGE | OPT | FDELG (REQ) | 2414 | DELEGRETURN | OPT | FDELG, DDELG, pNFS | 2415 | | | (REQ) | 2416 | DESTROY_CLIENTID | REQ | | 2417 | DESTROY_SESSION | REQ | | 2418 | EXCHANGE_ID | REQ | | 2419 | FREE_STATEID | REQ | | 2420 | GETATTR | REQ | | 2421 | GETDEVICEINFO | OPT | pNFS (REQ) | 2422 | GETDEVICELIST | OPT | pNFS (OPT) | 2423 | GETFH | REQ | | 2424 | WRITE_PLUS | OPT | ADH (REQ) | 2425 | GET_DIR_DELEGATION | OPT | DDELG (REQ) | 2426 | LAYOUTCOMMIT | OPT | pNFS (REQ) | 2427 | LAYOUTGET | OPT | pNFS (REQ) | 2428 | LAYOUTRETURN | OPT | pNFS (REQ) | 2429 | LINK | OPT | | 2430 | LOCK | REQ | | 2431 | LOCKT | REQ | | 2432 | LOCKU | REQ | | 2433 | LOOKUP | REQ | | 2434 | LOOKUPP | REQ | | 2435 | NVERIFY | REQ | | 2436 | OPEN | REQ | | 2437 | OPENATTR | OPT | | 2438 | OPEN_CONFIRM | MNI | | 2439 | OPEN_DOWNGRADE | REQ | | 2440 | PUTFH | REQ | | 2441 | PUTPUBFH | REQ | | 2442 | PUTROOTFH | REQ | | 2443 | READ | OBS | | 2444 | READDIR | REQ | | 2445 | READLINK | OPT | | 2446 | READ_PLUS | OPT | ADH (REQ) | 2447 | RECLAIM_COMPLETE | REQ | | 2448 | RELEASE_LOCKOWNER | MNI | | 2449 | REMOVE | REQ | | 
2450 | RENAME | REQ | | 2451 | RENEW | MNI | | 2452 | RESTOREFH | REQ | | 2453 | SAVEFH | REQ | | 2454 | SECINFO | REQ | | 2455 | SECINFO_NO_NAME | REC | pNFS file layout | 2456 | | | (REQ) | 2457 | SEQUENCE | REQ | | 2458 | SETATTR | REQ | | 2459 | SETCLIENTID | MNI | | 2460 | SETCLIENTID_CONFIRM | MNI | | 2461 | SET_SSV | REQ | | 2462 | TEST_STATEID | REQ | | 2463 | VERIFY | REQ | | 2464 | WANT_DELEGATION | OPT | FDELG (OPT) | 2465 | WRITE | OBS | | 2466 +----------------------+--------------------+-----------------------+ 2467 Callback Operations 2469 +-------------------------+-------------------+---------------------+ 2470 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, | 2471 | | MNI | or OPT) | 2472 +-------------------------+-------------------+---------------------+ 2473 | CB_OFFLOAD | OPT | COPY (REQ) | 2474 | CB_GETATTR | OPT | FDELG (REQ) | 2475 | CB_LAYOUTRECALL | OPT | pNFS (REQ) | 2476 | CB_NOTIFY | OPT | DDELG (REQ) | 2477 | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | 2478 | CB_NOTIFY_LOCK | OPT | | 2479 | CB_PUSH_DELEG | OPT | FDELG (OPT) | 2480 | CB_RECALL | OPT | FDELG, DDELG, pNFS | 2481 | | | (REQ) | 2482 | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | 2483 | | | (REQ) | 2484 | CB_RECALL_SLOT | REQ | | 2485 | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | 2486 | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | 2487 | | | (REQ) | 2488 | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | 2489 | | | (REQ) | 2490 +-------------------------+-------------------+---------------------+ 2492 13. NFSv4.2 Operations 2494 13.1. Operation 59: COPY - Initiate a server-side copy 2496 13.1.1. 
ARGUMENT 2498 const COPY4_GUARDED = 0x00000001; 2499 const COPY4_METADATA = 0x00000002; 2501 struct COPY4args { 2502 /* SAVED_FH: source file */ 2503 /* CURRENT_FH: destination file or */ 2504 /* directory */ 2505 stateid4 ca_src_stateid; 2506 stateid4 ca_dst_stateid; 2507 offset4 ca_src_offset; 2508 offset4 ca_dst_offset; 2509 length4 ca_count; 2510 uint32_t ca_flags; 2511 component4 ca_destination; 2512 netloc4 ca_source_server<>; 2514 }; 2516 13.1.2. RESULT 2518 union COPY4res switch (nfsstat4 cr_status) { 2519 case NFS4_OK: 2520 write_response4 resok4; 2521 default: 2522 length4 cr_bytes_copied; 2523 }; 2525 13.1.3. DESCRIPTION 2527 The COPY operation is used for both intra-server and inter-server 2528 copies. In both cases, the COPY is always sent from the client to 2529 the destination server of the file copy. The COPY operation requests 2530 that a file be copied from the location specified by the SAVED_FH 2531 value to the location specified by the combination of CURRENT_FH and 2532 ca_destination. 2534 The SAVED_FH must be a regular file. If SAVED_FH is not a regular 2535 file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. 2537 In order to set SAVED_FH to the source file handle, the compound 2538 procedure requesting the COPY will include a sub-sequence of 2539 operations such as 2541 PUTFH source-fh 2542 SAVEFH 2544 If the request is for a server-to-server copy, the source-fh is a 2545 filehandle from the source server and the compound procedure is being 2546 executed on the destination server. In this case, the source-fh is a 2547 foreign filehandle on the server receiving the COPY request. If 2548 either PUTFH or SAVEFH checked the validity of the filehandle, the 2549 operation would likely fail and return NFS4ERR_STALE. 2551 If a server supports the server-to-server COPY feature, a PUTFH 2552 followed by a SAVEFH MUST NOT return NFS4ERR_STALE for either 2553 operation. These restrictions do not pose substantial difficulties 2554 for servers. 
The CURRENT_FH and SAVED_FH may be validated in the 2555 context of the operation referencing them and an NFS4ERR_STALE error 2556 returned for an invalid file handle at that point. 2558 For an intra-server copy, both the ca_src_stateid and ca_dst_stateid 2559 MUST refer to either open or locking states provided earlier by the 2560 server. If either stateid is invalid, then the operation MUST fail. 2561 If the request is for an inter-server copy, then the ca_src_stateid 2562 can be ignored. If ca_dst_stateid is invalid, then the operation 2563 MUST fail. 2565 The CURRENT_FH and ca_destination together specify the destination of 2566 the copy operation. If ca_destination is of 0 (zero) length, then 2567 CURRENT_FH specifies the target file. In this case, CURRENT_FH MUST 2568 be a regular file and not a directory. If ca_destination is not of 0 2569 (zero) length, the ca_destination argument specifies the file name to 2570 which the data will be copied within the directory identified by 2571 CURRENT_FH. In this case, CURRENT_FH MUST be a directory and not a 2572 regular file. 2574 If the file named by ca_destination does not exist and the operation 2575 completes successfully, the file will be visible in the file system 2576 namespace. If the file does not exist and the operation fails, the 2577 file MAY be visible in the file system namespace depending on when 2578 the failure occurs and on the implementation of the NFS server 2579 receiving the COPY operation. If the ca_destination name cannot be 2580 created in the destination file system (due to file name 2581 restrictions, such as case or length), the operation MUST fail. 2583 The ca_src_offset is the offset within the source file from which the 2584 data will be read, the ca_dst_offset is the offset within the 2585 destination file to which the data will be written, and the ca_count 2586 is the number of bytes that will be copied. An offset of 0 (zero) 2587 specifies the start of the file.
A count of 0 (zero) requests that 2588 all bytes from ca_src_offset through EOF be copied to the 2589 destination. If concurrent modifications to the source file overlap 2590 with the source file region being copied, the data copied may include 2591 all, some, or none of the modifications. The client can use standard 2592 NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory 2593 byte range locks) to protect against concurrent modifications if the 2594 client is concerned about this. If the source file's end of file is 2595 being modified in parallel with a copy that specifies a count of 0 2596 (zero) bytes, the amount of data copied is implementation dependent 2597 (clients may guard against this case by specifying a non-zero count 2598 value or preventing modification of the source file as mentioned 2599 above). 2601 If the source offset or the source offset plus count is greater than 2602 or equal to the size of the source file, the operation will fail with 2603 NFS4ERR_INVAL. The destination offset or destination offset plus 2604 count may be greater than the size of the destination file. This 2605 allows for the client to issue parallel copies to implement 2606 operations such as "cat file1 file2 file3 file4 > dest". 2608 If the destination file is created as a result of this command, the 2609 destination file's size will be equal to the number of bytes 2610 successfully copied. If the destination file already existed, the 2611 destination file's size may increase as a result of this operation 2612 (e.g. if ca_dst_offset plus ca_count is greater than the 2613 destination's initial size). 2615 If the ca_source_server list is specified, then this is an inter- 2616 server copy operation and the source file is on a remote server. The 2617 client is expected to have previously issued a successful COPY_NOTIFY 2618 request to the remote source server. The ca_source_server list MUST 2619 be the same as the COPY_NOTIFY response's cnr_source_server list. 
If 2620 the client includes the entries from the COPY_NOTIFY response's 2621 cnr_source_server list in the ca_source_server list, the source 2622 server can indicate a specific copy protocol for the destination 2623 server to use by returning a URL, which specifies both a protocol 2624 service and server name. Server-to-server copy protocol 2625 considerations are described in Section 2.2.5 and Section 2.4.1. 2627 The ca_flags argument allows the copy operation to be customized in 2628 the following ways using the guarded flag (COPY4_GUARDED) and the 2629 metadata flag (COPY4_METADATA). 2631 If the guarded flag is set and the destination exists on the server, 2632 this operation will fail with NFS4ERR_EXIST. 2634 If the guarded flag is not set and the destination exists on the 2635 server, the behavior is implementation dependent. 2637 If the metadata flag is set and the client is requesting a whole file 2638 copy (i.e., ca_count is 0 (zero)), a subset of the destination file's 2639 attributes MUST be the same as the source file's corresponding 2640 attributes and a subset of the destination file's attributes SHOULD 2641 be the same as the source file's corresponding attributes. The 2642 attributes in the MUST and SHOULD copy subsets will be defined for 2643 each NFS version. 2645 For NFSv4.2, Table 5 and Table 6 list the REQUIRED and RECOMMENDED 2646 attributes respectively. In the "Copy to destination file?" column, 2647 a "MUST" indicates that the attribute is part of the MUST copy set. 2648 A "SHOULD" indicates that the attribute is part of the SHOULD copy 2649 set. A "no" indicates that the attribute MUST NOT be copied. 2651 REQUIRED attributes 2653 +--------------------+----+---------------------------+ 2654 | Name | Id | Copy to destination file? 
| 2655 +--------------------+----+---------------------------+ 2656 | supported_attrs | 0 | no | 2657 | type | 1 | MUST | 2658 | fh_expire_type | 2 | no | 2659 | change | 3 | SHOULD | 2660 | size | 4 | MUST | 2661 | link_support | 5 | no | 2662 | symlink_support | 6 | no | 2663 | named_attr | 7 | no | 2664 | fsid | 8 | no | 2665 | unique_handles | 9 | no | 2666 | lease_time | 10 | no | 2667 | rdattr_error | 11 | no | 2668 | filehandle | 19 | no | 2669 | suppattr_exclcreat | 75 | no | 2670 +--------------------+----+---------------------------+ 2672 Table 5 2674 RECOMMENDED attributes 2676 +--------------------+----+---------------------------+ 2677 | Name | Id | Copy to destination file? | 2678 +--------------------+----+---------------------------+ 2679 | acl | 12 | MUST | 2680 | aclsupport | 13 | no | 2681 | archive | 14 | no | 2682 | cansettime | 15 | no | 2683 | case_insensitive | 16 | no | 2684 | case_preserving | 17 | no | 2685 | change_attr_type | 79 | no | 2686 | change_policy | 60 | no | 2687 | chown_restricted | 18 | MUST | 2688 | dacl | 58 | MUST | 2689 | dir_notif_delay | 56 | no | 2690 | dirent_notif_delay | 57 | no | 2691 | fileid | 20 | no | 2692 | files_avail | 21 | no | 2693 | files_free | 22 | no | 2694 | files_total | 23 | no | 2695 | fs_charset_cap | 76 | no | 2696 | fs_layout_type | 62 | no | 2697 | fs_locations | 24 | no | 2698 | fs_locations_info | 67 | no | 2699 | fs_status | 61 | no | 2700 | hidden | 25 | MUST | 2701 | homogeneous | 26 | no | 2702 | layout_alignment | 66 | no | 2703 | layout_blksize | 65 | no | 2704 | layout_hint | 63 | no | 2705 | layout_type | 64 | no | 2706 | maxfilesize | 27 | no | 2707 | maxlink | 28 | no | 2708 | maxname | 29 | no | 2709 | maxread | 30 | no | 2710 | maxwrite | 31 | no | 2711 | mdsthreshold | 68 | no | 2712 | mimetype | 32 | MUST | 2713 | mode | 33 | MUST | 2714 | mode_set_masked | 74 | no | 2715 | mounted_on_fileid | 55 | no | 2716 | no_trunc | 34 | no | 2717 | numlinks | 35 | no | 2718 | owner | 36 | 
MUST | 2719 | owner_group | 37 | MUST | 2720 | quota_avail_hard | 38 | no | 2721 | quota_avail_soft | 39 | no | 2722 | quota_used | 40 | no | 2723 | rawdev | 41 | no | 2724 | retentevt_get | 71 | MUST | 2725 | retentevt_set | 72 | no | 2726 | retention_get | 69 | MUST | 2727 | retention_hold | 73 | MUST | 2728 | retention_set | 70 | no | 2729 | sacl | 59 | MUST | 2730 | sec_label | 80 | MUST | 2731 | space_avail | 42 | no | 2732 | space_free | 43 | no | 2733 | space_freed | 78 | no | 2734 | space_reserved | 77 | MUST | 2735 | space_total | 44 | no | 2736 | space_used | 45 | no | 2737 | system | 46 | MUST | 2738 | time_access | 47 | MUST | 2739 | time_access_set | 48 | no | 2740 | time_backup | 49 | no | 2741 | time_create | 50 | MUST | 2742 | time_delta | 51 | no | 2743 | time_metadata | 52 | SHOULD | 2744 | time_modify | 53 | MUST | 2745 | time_modify_set | 54 | no | 2746 +--------------------+----+---------------------------+ 2747 Table 6 2749 [NOTE: The source file's attribute values will take precedence over 2750 any attribute values inherited by the destination file.] 2752 In the case of an inter-server copy or an intra-server copy between 2753 file systems, the attributes supported for the source file and 2754 destination file could be different. By definition, the REQUIRED 2755 attributes will be supported in all cases. If the metadata flag is 2756 set and the source file has a RECOMMENDED attribute that is not 2757 supported for the destination file, the copy MUST fail with 2758 NFS4ERR_ATTRNOTSUPP. 2760 Any attribute supported by the destination server that is not set on 2761 the source file SHOULD be left unset. 2763 Metadata attributes not exposed via the NFS protocol SHOULD be copied 2764 to the destination file where appropriate. 2766 The destination file's named attributes are not duplicated from the 2767 source file. After the copy process completes, the client MAY 2768 attempt to duplicate named attributes using standard NFSv4 2769 operations.
However, the destination file's named attribute 2770 capabilities MAY be different from the source file's named attribute 2771 capabilities. 2773 If the metadata flag is not set and the client is requesting a whole 2774 file copy (i.e., ca_count is 0 (zero)), the destination file's 2775 metadata is implementation dependent. 2777 If the client is requesting a partial file copy (i.e., ca_count is 2778 not 0 (zero)), the client SHOULD NOT set the metadata flag and the 2779 server MUST ignore the metadata flag. 2781 If the operation does not result in an immediate failure, the server 2782 will return NFS4_OK, and the CURRENT_FH will remain the destination's 2783 filehandle. 2785 If an immediate failure does occur, cr_bytes_copied will be set to 2786 the number of bytes copied to the destination file before the error 2787 occurred. The cr_bytes_copied value indicates the number of bytes 2788 copied but not which specific bytes have been copied. 2790 A return of NFS4_OK indicates that either the operation is complete 2791 or the operation was initiated and a callback will be used to deliver 2792 the final status of the operation. 2794 If the cr_callback_id is returned, this indicates that the operation 2795 was initiated and a CB_OFFLOAD callback will deliver the final 2796 results of the operation. The cr_callback_id stateid is termed a 2797 copy stateid in this context. The server is given the option of 2798 returning the results in a callback because the data may require a 2799 relatively long period of time to copy. 2801 If no cr_callback_id is returned, the operation completed 2802 synchronously and no callback will be issued by the server. The 2803 completion status of the operation is indicated by cr_status. 2805 If the copy completes successfully, either synchronously or 2806 asynchronously, the data copied from the source file to the 2807 destination file MUST appear identical to the NFS client. 
However,
the NFS server's on-disk representation of the data in the source
file and destination file MAY differ.  For example, the NFS server
might encrypt, compress, deduplicate, or otherwise represent the
on-disk data in the source and destination file differently.

13.2.  Operation 60: OFFLOAD_ABORT - Cancel a server-side copy

13.2.1.  ARGUMENT

   struct OFFLOAD_ABORT4args {
           /* CURRENT_FH: destination file */
           stateid4        oaa_stateid;
   };

13.2.2.  RESULT

   struct OFFLOAD_ABORT4res {
           nfsstat4        oar_status;
   };

13.2.3.  DESCRIPTION

OFFLOAD_ABORT is used for both intra- and inter-server asynchronous
copies.  The OFFLOAD_ABORT operation allows the client to cancel a
server-side copy operation that it initiated.  This operation is sent
in a COMPOUND request from the client to the destination server.
This operation may be used to cancel a copy when the application that
requested the copy exits before the operation is completed, or for
some other reason.

The request contains the filehandle and copy stateid cookies that act
as the context for the previously initiated copy operation.

The result's oar_status field indicates whether the cancel was
successful or not.  A value of NFS4_OK indicates that the copy
operation was canceled and no callback will be issued by the server.
A copy operation that is successfully canceled may result in none,
some, or all of the data and/or metadata copied.

If the server supports asynchronous copies, the server is REQUIRED to
support the OFFLOAD_ABORT operation.

13.3.  Operation 61: COPY_NOTIFY - Notify a source server of a future
       copy

13.3.1.  ARGUMENT

   struct COPY_NOTIFY4args {
           /* CURRENT_FH: source file */
           stateid4        cna_src_stateid;
           netloc4         cna_destination_server;
   };

13.3.2.
RESULT

   struct COPY_NOTIFY4resok {
           nfstime4        cnr_lease_time;
           netloc4         cnr_source_server<>;
   };

   union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
   case NFS4_OK:
           COPY_NOTIFY4resok       resok4;
   default:
           void;
   };

13.3.3.  DESCRIPTION

This operation is used for an inter-server copy.  A client sends this
operation in a COMPOUND request to the source server to authorize a
destination server identified by cna_destination_server to read the
file specified by CURRENT_FH on behalf of the given user.

The cna_src_stateid MUST refer to either open or locking states
provided earlier by the server.  If it is invalid, then the operation
MUST fail.

The cna_destination_server MUST be specified using the netloc4
network location format.  The server is not required to resolve the
cna_destination_server address before completing this operation.

If this operation succeeds, the source server will allow the
cna_destination_server to copy the specified file on behalf of the
given user as long as both of the following conditions are met:

o  The destination server begins reading the source file before the
   cnr_lease_time expires.  If the cnr_lease_time expires while the
   destination server is still reading the source file, the
   destination server is allowed to finish reading the file.

o  The client has not issued an OFFLOAD_REVOKE for the same
   combination of user, filehandle, and destination server.

The cnr_lease_time is chosen by the source server.  A cnr_lease_time
of 0 (zero) indicates an infinite lease.  To avoid the need for
synchronized clocks, copy lease times are granted by the server as a
time delta.  To renew the copy lease time, the client should resend
the same copy notification request to the source server.
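The delta-based lease and its renewal-by-resend behavior can be sketched as a client-side bookkeeping helper.  This is an illustrative model only; the class and its policy are hypothetical and not part of the protocol, which conveys the lease solely as the cnr_lease_time delta.

```python
import time

class CopyNotifyLease:
    """Track a COPY_NOTIFY lease granted as a time delta (cnr_lease_time).

    Hypothetical client-side helper: the names and renewal policy here are
    illustrative.  A cnr_lease_time of 0 (zero) indicates an infinite lease.
    """

    def __init__(self, lease_delta_seconds, now=time.monotonic):
        self.now = now
        self.lease_delta = lease_delta_seconds
        # The lease is a delta, so no synchronized clocks are needed;
        # the client anchors expiry to its own monotonic clock.
        self.expiry = None if lease_delta_seconds == 0 else now() + lease_delta_seconds

    def is_infinite(self):
        return self.lease_delta == 0

    def expired(self):
        return self.expiry is not None and self.now() >= self.expiry

    def renew(self):
        # Renewal is performed by resending the same COPY_NOTIFY request
        # to the source server; here we only model the lease extension.
        if not self.is_infinite():
            self.expiry = self.now() + self.lease_delta
```

A client would call renew() (i.e., resend COPY_NOTIFY) before expiry whenever the destination has not yet begun reading the source file.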
A successful response will also contain a list of netloc4 network
location formats called cnr_source_server, on which the source is
willing to accept connections from the destination.  These might not
be reachable from the client and might be located on networks to
which the client has no connection.

If the client wishes to perform an inter-server copy, the client MUST
send a COPY_NOTIFY to the source server.  Therefore, the source
server MUST support COPY_NOTIFY.

For a copy only involving one server (the source and destination are
on the same server), this operation is unnecessary.

13.4.  Operation 62: OFFLOAD_REVOKE - Revoke a destination server's
       copy privileges

13.4.1.  ARGUMENT

   struct OFFLOAD_REVOKE4args {
           /* CURRENT_FH: source file */
           netloc4         ora_destination_server;
   };

13.4.2.  RESULT

   struct OFFLOAD_REVOKE4res {
           nfsstat4        orr_status;
   };

13.4.3.  DESCRIPTION

This operation is used for an inter-server copy.  A client sends this
operation in a COMPOUND request to the source server to revoke the
authorization of a destination server identified by
ora_destination_server from reading the file specified by CURRENT_FH
on behalf of the given user.  If the ora_destination_server has
already begun copying the file, a successful return from this
operation indicates that further access will be prevented.

The ora_destination_server MUST be specified using the netloc4
network location format.  The server is not required to resolve the
ora_destination_server address before completing this operation.

The client uses OFFLOAD_ABORT to inform the destination to stop the
active transfer and OFFLOAD_REVOKE to inform the source to not allow
any more copy requests from the destination.
The OFFLOAD_REVOKE
operation is also useful in situations in which the source server
granted a very long or infinite lease on the destination server's
ability to read the source file and all copy operations on the source
file have been completed.

For a copy only involving one server (the source and destination are
on the same server), this operation is unnecessary.

If the server supports COPY_NOTIFY, the server is REQUIRED to support
the OFFLOAD_REVOKE operation.

13.5.  Operation 63: OFFLOAD_STATUS - Poll for status of a server-side
       copy

13.5.1.  ARGUMENT

   struct OFFLOAD_STATUS4args {
           /* CURRENT_FH: destination file */
           stateid4        osa_stateid;
   };

13.5.2.  RESULT

   struct OFFLOAD_STATUS4resok {
           length4         osr_bytes_copied;
           nfsstat4        osr_complete<1>;
   };

   union OFFLOAD_STATUS4res switch (nfsstat4 osr_status) {
   case NFS4_OK:
           OFFLOAD_STATUS4resok    osr_resok4;
   default:
           void;
   };

13.5.3.  DESCRIPTION

OFFLOAD_STATUS is used for both intra- and inter-server asynchronous
copies.  The OFFLOAD_STATUS operation allows the client to poll the
destination server to determine the status of an asynchronous copy
operation.

If this operation is successful, the number of bytes copied is
returned to the client in the osr_bytes_copied field.  The
osr_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied.

If the optional osr_complete field is present, the copy has
completed.  In this case the status value indicates the result of the
asynchronous copy operation.  In all cases, the server will also
deliver the final results of the asynchronous copy in a CB_OFFLOAD
operation.

The failure of this operation does not indicate the result of the
asynchronous copy in any way.
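The polling behavior above can be sketched as follows.  The helper and its callback are hypothetical stand-ins for issuing the COMPOUND; the optional osr_complete field (XDR `nfsstat4 osr_complete<1>`) is modeled as a list of zero or one status values.

```python
def poll_offload_status(send_offload_status, max_polls=10):
    """Poll an asynchronous copy with OFFLOAD_STATUS until osr_complete
    appears.  `send_offload_status` returns a dict mirroring
    OFFLOAD_STATUS4resok.  Illustrative sketch, not a client implementation.
    """
    res = {"osr_complete": [], "osr_bytes_copied": 0}
    for _ in range(max_polls):
        res = send_offload_status()
        if res["osr_complete"]:
            # Optional field present: the copy has completed, and the
            # contained status value is the result of the copy.
            return res["osr_complete"][0], res["osr_bytes_copied"]
        # Otherwise osr_bytes_copied only reports progress, not which
        # specific bytes were copied.
    # Still in progress; CB_OFFLOAD will deliver the final result anyway.
    return None, res["osr_bytes_copied"]
```

Note that even a client that polls will still receive the final result via CB_OFFLOAD, so polling is purely advisory progress monitoring.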
If the server supports asynchronous copies, the server is REQUIRED to
support the OFFLOAD_STATUS operation.

13.6.  Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID

13.6.1.  ARGUMENT

   /* new */
   const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004;

13.6.2.  RESULT

   Unchanged

13.6.3.  MOTIVATION

Enterprise applications require guarantees that an operation has
either aborted or completed.  NFSv4.1 provides this guarantee as long
as the session is alive: simply send a SEQUENCE operation on the same
slot with a new sequence number, and the successful return of
SEQUENCE indicates the previous operation has completed.  However, if
the session is lost, there is no way to know when any in-progress
operations have aborted or completed.  In hindsight, the NFSv4.1
specification should have mandated that DESTROY_SESSION either abort
or complete all outstanding operations.

13.6.4.  DESCRIPTION

A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
when it sends an EXCHANGE_ID operation.  The server SHOULD set this
capability in the EXCHANGE_ID reply whether the client requests it or
not.  It is the server's return that determines whether this
capability is in effect.  When it is in effect, the following will
occur:

o  The server will not reply to any DESTROY_SESSION invoked with the
   client ID until all operations in progress are completed or
   aborted.

o  The server will not reply to subsequent EXCHANGE_ID invoked on the
   same client owner with a new verifier until all operations in
   progress on the client ID's session are completed or aborted.

o  The NFS server SHOULD support client ID trunking, and if it does
   and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
   session ID created on one node of the storage cluster MUST be
   destroyable via DESTROY_SESSION.
In addition, DESTROY_CLIENTID
and an EXCHANGE_ID with a new verifier affect all sessions
regardless of which node the sessions were created on.

13.7.  Operation 64: WRITE_PLUS

13.7.1.  ARGUMENT

   struct data_info4 {
           offset4         di_offset;
           length4         di_length;
           bool            di_allocated;
   };

   struct data4 {
           offset4         d_offset;
           bool            d_allocated;
           opaque          d_data<>;
   };

   union write_plus_arg4 switch (data_content4 wpa_content) {
   case NFS4_CONTENT_DATA:
           data4           wpa_data;
   case NFS4_CONTENT_APP_DATA_HOLE:
           app_data_hole4  wpa_adh;
   case NFS4_CONTENT_HOLE:
           data_info4      wpa_hole;
   default:
           void;
   };

   struct WRITE_PLUS4args {
           /* CURRENT_FH: file */
           stateid4        wp_stateid;
           stable_how4     wp_stable;
           write_plus_arg4 wp_data<>;
   };

13.7.2.  RESULT

   struct write_response4 {
           stateid4        wr_callback_id<1>;
           count4          wr_count;
           stable_how4     wr_committed;
           verifier4      wr_writeverf;
   };

   union WRITE_PLUS4res switch (nfsstat4 wp_status) {
   case NFS4_OK:
           write_response4 wp_resok4;
   default:
           void;
   };

13.7.3.  DESCRIPTION

The WRITE_PLUS operation is an extension of the NFSv4.1 WRITE
operation (see Section 18.2 of [1]) and writes data to the regular
file identified by the current filehandle.  The server MAY write
fewer bytes than requested by the client.

The WRITE_PLUS argument is composed of the wp_data array, each
element of which describes a data_content4 type of data
(Section 6.1.2).  For NFSv4.2, the allowed values are data, ADH, and
hole.  The array contents MUST be contiguous in the file.  A
successful WRITE_PLUS will construct a reply for wr_count,
wr_committed, and wr_writeverf as per the NFSv4.1 WRITE operation
results.  If wr_callback_id is set, it indicates an asynchronous
reply (see Section 13.7.3.4).
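The contiguity requirement on the wp_data array can be sketched as a simple client-side check.  The (offset, length) tuples are an illustrative model covering both the data4 and data_info4 (hole) arms; this helper is not part of the protocol.

```python
def wp_data_is_contiguous(segments):
    """Check that WRITE_PLUS wp_data segments describe one contiguous
    file region.  Each segment is modeled as an (offset, length) tuple.
    Illustrative sketch only.
    """
    for (off, length), (next_off, _) in zip(segments, segments[1:]):
        if off + length != next_off:   # a gap or overlap breaks contiguity
            return False
    return True
```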
WRITE_PLUS has to support all of the errors which are returned by
WRITE plus NFS4ERR_UNION_NOTSUPP.  If the client asks for a hole and
the server does not support that arm of the discriminated union, but
does support one or more additional arms, it can signal to the
client, via NFS4ERR_UNION_NOTSUPP, that it supports the operation but
not that arm.

If the client supports WRITE_PLUS, it MUST support CB_OFFLOAD.

13.7.3.1.  Data

The d_offset specifies the offset where the data should be written.
A d_offset of zero specifies that the write should start at the
beginning of the file.  The d_count, as encoded as part of the opaque
data parameter, represents the number of bytes of data that are to be
written.  If the d_count is zero, the WRITE_PLUS will succeed and
return a d_count of zero subject to permissions checking.

Note that d_allocated has no meaning for WRITE_PLUS.

13.7.3.2.  Hole punching

Whenever a client wishes to zero the blocks backing a particular
region in the file, it calls the WRITE_PLUS operation with the
current filehandle set to the filehandle of the file in question, and
the equivalent of start offset and length in bytes of the region set
in wpa_hole.di_offset and wpa_hole.di_length respectively.  If
wpa_hole.di_allocated is set to TRUE, then the blocks will be zeroed,
and if it is set to FALSE, then they will be deallocated.  All
further reads to this region MUST return zeros until overwritten.
The filehandle specified must be that of a regular file.

Situations may arise where di_offset and/or di_offset + di_length
will not be aligned to a boundary at which the server does
allocations/deallocations; for most file systems, this is the block
size of the file system.  In such a case, the server can deallocate
as many bytes as it can in the region.  The blocks that cannot be
deallocated MUST be zeroed.
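The alignment behavior described above can be sketched as follows, assuming a uniform file system block size.  This is an illustrative model of one possible server policy, not a normative algorithm.

```python
def plan_hole_punch(di_offset, di_length, block_size):
    """Split a hole-punch region into block-aligned bytes that can be
    deallocated and unaligned edges that must be zeroed instead.
    Returns (zero_ranges, dealloc_range), each range as (start, end);
    dealloc_range is None when the region spans no whole block.
    Illustrative sketch only.
    """
    end = di_offset + di_length
    first_block = -(-di_offset // block_size) * block_size  # round up
    last_block = (end // block_size) * block_size           # round down
    if first_block >= last_block:
        return [(di_offset, end)], None        # no whole block: zero it all
    zero = []
    if di_offset < first_block:
        zero.append((di_offset, first_block))  # leading partial block
    if last_block < end:
        zero.append((last_block, end))         # trailing partial block
    return zero, (first_block, last_block)
```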
Except for the block deallocation and maximum hole
punching capability, a WRITE_PLUS operation is to be treated
similarly to a write of zeroes.

The server is not required to complete deallocating the blocks
specified in the operation before returning.  The server SHOULD
return an asynchronous result if it can determine the operation will
be long running (see Section 13.7.3.4).

If used to hole punch, WRITE_PLUS will result in the space_used
attribute being decreased by the number of bytes that were
deallocated.  The space_freed attribute may or may not decrease,
depending on the support and whether the blocks backing the specified
range were shared or not.  The size attribute will remain unchanged.

The WRITE_PLUS operation MUST NOT change the space reservation
guarantee of the file.  While the server can deallocate the blocks
specified by di_offset and di_length, future writes to this region
MUST NOT fail with NFSERR_NOSPC.

13.7.3.3.  ADHs

If the server supports ADHs, then it MUST support the
NFS4_CONTENT_APP_DATA_HOLE arm of the WRITE_PLUS operation.  The
server has no concept of the structure imposed by the application;
order is imposed only when the application writes to a section of the
file.  In order to detect corruption even before the application
utilizes the file, the application will want to initialize a range of
ADHs using WRITE_PLUS.

For ADHs, when the client invokes the WRITE_PLUS operation, it has
two desired results:

1.  The structure described by the app_data_block4 be imposed on the
    file.

2.  The contents described by the app_data_block4 be sparse.

Even if the server supports the WRITE_PLUS operation, it might not
support sparse files.  In that case, when it receives the WRITE_PLUS
operation, it MUST populate the contents of the file with the
initialized ADHs.
The server SHOULD return an asynchronous result if it can
determine the operation will be long running (see Section 13.7.3.4).

If the data was already initialized, there are two interesting
scenarios:

1.  The data blocks are allocated.

2.  Initializing in the middle of an existing ADH.

If the data blocks were already allocated, then the WRITE_PLUS is a
hole punch operation.  If WRITE_PLUS supports sparse files, then the
data blocks are to be deallocated.  If not, then the data blocks are
to be rewritten in the indicated ADH format.

Since the server has no knowledge of ADHs, it should not report
misaligned creation of ADHs.  Even though it can detect them, it
cannot disallow them, as the application might be in the process of
changing the size of the ADHs.  Thus the server must be prepared to
handle a WRITE_PLUS into an existing ADH.

This document does not mandate the manner in which the server stores
ADHs sparsely for a file.  However, if a WRITE_PLUS arrives that will
force a new ADH to start inside an existing ADH, then the server will
have three ADHs instead of two: one for the range before the
WRITE_PLUS, one for the WRITE_PLUS itself, and one for the range
after it.  Note that depending on server-specific policies for block
allocation, there may also be some physical blocks allocated to align
the boundaries.

13.7.3.4.  Asynchronous Transactions

Both hole punching and ADH initialization may lead to the server
deciding to service the operation asynchronously.  If it decides to
do so, it sets the stateid in wr_callback_id to be that of the
wp_stateid.  If it does not set the wr_callback_id, then the result
is synchronous.
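The synchronous/asynchronous signaling above can be sketched as the server forming its write_response4.  The XDR optional `stateid4 wr_callback_id<1>` is modeled as a list of zero or one stateids; the dict keys mirror the XDR field names, and the function itself is an illustrative sketch, not a server implementation.

```python
def write_plus_response(wp_stateid, count, committed, writeverf, async_op):
    """Model a server forming write_response4.  wr_callback_id is present
    (and set to wp_stateid) only when the server chooses to service the
    operation asynchronously.  Illustrative sketch only.
    """
    return {
        "wr_callback_id": [wp_stateid] if async_op else [],
        "wr_count": count,
        "wr_committed": committed,
        "wr_writeverf": writeverf,
    }
```

A client inspects wr_callback_id: if non-empty, it should not assume anything about the written region until CB_OFFLOAD (or OFFLOAD_STATUS polling) reports completion.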
When the client determines that the reply will be given
asynchronously, it should not assume anything about the contents of
what it wrote until it is informed by the server that the operation
is complete.  It can use OFFLOAD_STATUS (Section 13.5) to monitor the
operation and OFFLOAD_ABORT (Section 13.2) to cancel the operation.

An example of an asynchronous WRITE_PLUS is shown in Figure 6.

    Client                                  Server
       +                                      +
       |                                      |
       |--- OPEN ---------------------------->| Client opens
       |<------------------------------------/| the file
       |                                      |
       |--- WRITE_PLUS ---------------------->| Client punches
       |<------------------------------------/| a hole
       |                                      |
       |                                      |
       |--- OFFLOAD_STATUS ------------------>| Client may poll
       |<------------------------------------/| for status
       |                                      |
       |                  .                   | Multiple OFFLOAD_STATUS
       |                  .                   | operations may be sent.
       |                  .                   |
       |                                      |
       |<-- CB_OFFLOAD -----------------------| Server reports results
       |\------------------------------------>|
       |                                      |
       |--- CLOSE --------------------------->| Client closes
       |<------------------------------------/| the file
       |                                      |
       |                                      |

              Figure 6: An asynchronous WRITE_PLUS.

When CB_OFFLOAD informs the client of the successful WRITE_PLUS, the
write_response4 embedded in the operation will provide the necessary
information that a synchronous WRITE_PLUS would have provided.

13.8.  Operation 67: IO_ADVISE - Application I/O access pattern hints

13.8.1.
ARGUMENT

   enum IO_ADVISE_type4 {
           IO_ADVISE4_NORMAL                 = 0,
           IO_ADVISE4_SEQUENTIAL             = 1,
           IO_ADVISE4_SEQUENTIAL_BACKWARDS   = 2,
           IO_ADVISE4_RANDOM                 = 3,
           IO_ADVISE4_WILLNEED               = 4,
           IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5,
           IO_ADVISE4_DONTNEED               = 6,
           IO_ADVISE4_NOREUSE                = 7,
           IO_ADVISE4_READ                   = 8,
           IO_ADVISE4_WRITE                  = 9,
           IO_ADVISE4_INIT_PROXIMITY         = 10
   };

   struct IO_ADVISE4args {
           /* CURRENT_FH: file */
           stateid4        iar_stateid;
           offset4         iar_offset;
           length4         iar_count;
           bitmap4         iar_hints;
   };

13.8.2.  RESULT

   struct IO_ADVISE4resok {
           bitmap4 ior_hints;
   };

   union IO_ADVISE4res switch (nfsstat4 ior_status) {
   case NFS4_OK:
           IO_ADVISE4resok resok4;
   default:
           void;
   };

13.8.3.  DESCRIPTION

The IO_ADVISE operation sends an I/O access pattern hint to the
server for the owner of the stateid for a given byte range specified
by iar_offset and iar_count.  The byte range specified by iar_offset
and iar_count need not currently exist in the file, but the iar_hints
will apply to the byte range when it does exist.  If iar_count is 0,
all data following iar_offset is specified.  The server MAY ignore
the advice.

The following are the allowed hints for a stateid holder:

IO_ADVISE4_NORMAL  There is no advice to give; this is the default
   behavior.

IO_ADVISE4_SEQUENTIAL  Expects to access the specified data
   sequentially from lower offsets to higher offsets.

IO_ADVISE4_SEQUENTIAL_BACKWARDS  Expects to access the specified data
   sequentially from higher offsets to lower offsets.

IO_ADVISE4_RANDOM  Expects to access the specified data in a random
   order.

IO_ADVISE4_WILLNEED  Expects to access the specified data in the near
   future.

IO_ADVISE4_WILLNEED_OPPORTUNISTIC  Expects to possibly access the
   data in the near future.
This is a speculative hint, and
   therefore the server should prefetch data or indirect blocks only
   if it can be done at a marginal cost.

IO_ADVISE4_DONTNEED  Expects that it will not access the specified
   data in the near future.

IO_ADVISE4_NOREUSE  Expects to access the specified data once and
   then not reuse it thereafter.

IO_ADVISE4_READ  Expects to read the specified data in the near
   future.

IO_ADVISE4_WRITE  Expects to write the specified data in the near
   future.

IO_ADVISE4_INIT_PROXIMITY  Informs the server that the data in the
   byte range remains important to the client.

Since IO_ADVISE is a hint, a server SHOULD NOT return an error and
invalidate an entire COMPOUND request if one of the sent hints in
iar_hints is not supported by the server.  Also, the server MUST NOT
return an error if the client sends contradictory hints to the
server, e.g., IO_ADVISE4_SEQUENTIAL and IO_ADVISE4_RANDOM in a single
IO_ADVISE operation.  In these cases, the server MUST return success
and an ior_hints value that indicates the hints it intends to
implement.  This may mean simply returning IO_ADVISE4_NORMAL.

The ior_hints value returned by the server is primarily for debugging
purposes, since the server is under no obligation to carry out the
hints that it describes in the ior_hints result.  In addition, while
the server may have intended to implement the hints returned in
ior_hints, as time progresses, the server may need to change its
handling of a given file due to several reasons including, but not
limited to, memory pressure, additional IO_ADVISE hints sent by other
clients, and heuristically detected file access patterns.

The server MAY return different advice than what the client
requested.
If it does, then this might be due to one of several
conditions, including, but not limited to: another client advising of
a different I/O access pattern; a different I/O access pattern from
another client that the server has heuristically detected; or the
server not being able to support the requested I/O access pattern,
perhaps due to a temporary resource limitation.

Each issuance of the IO_ADVISE operation overrides all previous
issuances of IO_ADVISE for a given byte range.  This effectively
follows a strategy of last hint wins for a given stateid and byte
range.

Clients should assume that hints included in an IO_ADVISE operation
will be forgotten once the file is closed.

13.8.4.  IMPLEMENTATION

The NFS client may choose to issue an IO_ADVISE operation to the
server in several different instances.

The most obvious is in direct response to an application's execution
of posix_fadvise().  In this case, IO_ADVISE4_WRITE and
IO_ADVISE4_READ may be set based upon the type of file access
specified when the file was opened.

13.8.5.  IO_ADVISE4_INIT_PROXIMITY

The IO_ADVISE4_INIT_PROXIMITY hint is non-POSIX in origin and conveys
that the client has recently accessed the byte range in its own
cache; i.e., it has not accessed the data on the server, but it does
hold it locally.  When the server reaches resource exhaustion,
knowing which data is more important allows the server to make better
choices about which data to, for example, purge from a cache or move
to secondary storage.  It also informs the server which delegations
are more important: if delegations are working correctly, once a byte
range has been delegated to a client and the client has read the
content for that byte range, a server might never receive another
read request for that byte range.

This hint is also useful in the case of NFS clients which are network
booting from a server.
If the first client to be booted sends this
hint, then it keeps the cache warm for the remaining clients.

13.8.6.  pNFS File Layout Data Type Considerations

The IO_ADVISE considerations for pNFS are very similar to the COMMIT
considerations for pNFS.  That is, as with COMMIT, some NFS server
implementations prefer IO_ADVISE be done on the DS, and some prefer
it be done on the MDS.

So for the file's layout type, it is proposed that NFSv4.2 include an
additional hint NFL42_UFLG_IO_ADVISE_THRU_MDS, which is valid only on
NFSv4.2 or higher.  Any file's layout obtained with NFSv4.1 MUST NOT
have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  Any file's layout obtained
with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  If the
client does not implement IO_ADVISE, then it MUST ignore
NFL42_UFLG_IO_ADVISE_THRU_MDS.

If NFL42_UFLG_IO_ADVISE_THRU_MDS is set, the client MUST send the
IO_ADVISE operation to the MDS in order for it to be honored by the
DS.  Once the MDS receives the IO_ADVISE operation, it will
communicate the advice to each DS.

If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then the client SHOULD
send an IO_ADVISE operation to the appropriate DS for the specified
byte range.  While the client MAY always send IO_ADVISE to the MDS,
if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the client
should expect that such an IO_ADVISE is futile.  Note that a client
SHOULD use the same set of arguments on each IO_ADVISE sent to a DS
for the same open file reference.

The server is not required to support different advice for different
DSes with the same open file reference.

13.8.6.1.  Dense and Sparse Packing Considerations

The IO_ADVISE operation MUST use the iar_offset and byte range as
dictated by the presence or absence of NFL4_UFLG_DENSE.
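As a rough sketch of the dense-packing offset translation involved, assuming the standard NFSv4.1 file layout with stripe unit su, stripe count sc, and this DS holding stripe index i (the function name and parameterization are illustrative, not defined by this document):

```python
def dense_to_logical(ds_offset, su, sc, i):
    """Map an offset in a densely packed DS file to the corresponding
    offset in the logical file.  Assumes the standard NFSv4.1 file layout
    dense packing, where the DS at stripe index i stores logical stripe
    units i, i+sc, i+2*sc, ... back to back.  Illustrative sketch only.
    """
    unit, within = divmod(ds_offset, su)  # local stripe unit and offset in it
    logical_unit = unit * sc + i          # that unit's position in the
                                          # logical file's stripe sequence
    return logical_unit * su + within
```

With NFL4_UFLG_DENSE absent (sparse packing), no translation is needed: DS offsets equal logical offsets, though only this DS's stripe units are accessible.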
E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 10000 in the logical file,
then an IO_ADVISE for iar_offset 0 means iar_offset 10000.

E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 0 in the logical file, and
an IO_ADVISE for iar_offset 0 likewise means iar_offset 0 in the
logical file.

E.g., if NFL4_UFLG_DENSE is present, the stripe unit is 1000 bytes
and the stripe count is 10, and the dense DS file is serving
iar_offset 0.  A READ or WRITE to the DS for iar_offsets 0, 1000,
2000, and 3000 really means iar_offsets 10000, 20000, 30000, and
40000 (implying a stripe count of 10 and a stripe unit of 1000), so
an IO_ADVISE sent to the same DS with an iar_offset of 500 and an
iar_count of 3000 means that the IO_ADVISE applies to these byte
ranges of the dense DS file:

   - 500 to 999
   - 1000 to 1999
   - 2000 to 2999
   - 3000 to 3499

I.e., the contiguous range 500 to 3499 as specified in IO_ADVISE.

It also applies to these byte ranges of the logical file:

   - 10500 to 10999 (500 bytes)
   - 20000 to 20999 (1000 bytes)
   - 30000 to 30999 (1000 bytes)
   - 40000 to 40499 (500 bytes)
   (total 3000 bytes)

E.g., if NFL4_UFLG_DENSE is absent, the stripe unit is 250 bytes, the
stripe count is 4, and the sparse DS file is serving iar_offset 0.
Then a READ or WRITE to the DS for iar_offsets 0, 1000, 2000, and
3000 really means iar_offsets 0, 1000, 2000, and 3000 in the logical
file, keeping in mind that on the DS file, byte ranges 250 to 999,
1250 to 1999, 2250 to 2999, and 3250 to 3999 are not accessible.
Then an IO_ADVISE sent to the same DS with an iar_offset of 500 and
an iar_count of 3000 means that the IO_ADVISE applies to these byte
ranges of the logical file and the sparse DS file:

   - 500 to 999 (500 bytes) - no effect
   - 1000 to 1249 (250 bytes) - effective
   - 1250 to 1999 (750 bytes) - no effect
   - 2000 to 2249 (250 bytes) - effective
   - 2250 to 2999 (750 bytes) - no effect
   - 3000 to 3249 (250 bytes) - effective
   - 3250 to 3499 (250 bytes) - no effect
   (subtotal 2250 bytes) - no effect
   (subtotal 750 bytes) - effective
   (grand total 3000 bytes) - no effect + effective

If neither the NFL42_UFLG_IO_ADVISE_THRU_MDS flag nor the
NFL4_UFLG_DENSE flag is set in the layout, then any IO_ADVISE request
sent to the data server with a byte range that overlaps a stripe unit
that the data server does not serve MUST NOT result in the status
NFS4ERR_PNFS_IO_HOLE.  Instead, the response SHOULD be successful,
and if the server applies IO_ADVISE hints on any stripe units that
overlap with the specified range, those hints SHOULD be indicated in
the response.

13.9.  Changes to Operation 51: LAYOUTRETURN

13.9.1.  Introduction

In the pNFS description provided in [1], the client is not able to
relay an error code from the DS to the MDS.  In the specification of
the Objects-Based Layout protocol [7], use is made of the opaque
lrf_body field of the LAYOUTRETURN argument to do such a relaying of
error codes.  In this section, we define a new data structure to
enable the passing of error codes back to the MDS and provide some
guidelines on what both the client and MDS should expect in such
circumstances.

There are two broad classes of errors, transient and persistent.  The
client SHOULD strive to only use this new mechanism to report
persistent errors.  It MUST be able to deal with transient issues by
itself.
Also, while the client might consider an issue to be
persistent, it MUST be prepared for the MDS to consider such issues
to be transient.  A prime example of this is if the MDS fences off a
client from either a stateid or a filehandle.  The client will get an
error from the DS and might relay either NFS4ERR_ACCESS or
NFS4ERR_BAD_STATEID back to the MDS, with the belief that this is a
hard error.  If the MDS is informed by the client that there is an
error, it can safely ignore it.  For the MDS, the mission is
accomplished in that the client has returned a layout that the MDS
had most likely recalled.

The client might also need to inform the MDS that it cannot reach one
or more of the DSes.  While the MDS can detect the connectivity of
both of these paths:

o  MDS to DS

o  MDS to client

it cannot determine if the client and DS path is working.  As with
the case of the DS passing errors to the client, it must be prepared
for the MDS to consider such outages as being transitory.

The existing LAYOUTRETURN operation is extended by introducing a new
data structure to report errors, layoutreturn_device_error4.  Also,
layoutreturn_error_report4 is introduced to enable an array of such
errors to be reported.

13.9.2.  ARGUMENT

The ARGUMENT specification of the LAYOUTRETURN operation in section
18.44.1 of [1] is augmented by the following XDR code [23]:

   struct layoutreturn_device_error4 {
           deviceid4       lrde_deviceid;
           nfsstat4        lrde_status;
           nfs_opnum4      lrde_opnum;
   };

   struct layoutreturn_error_report4 {
           layoutreturn_device_error4      lrer_errors<>;
   };

13.9.3.  RESULT

The RESULT of the LAYOUTRETURN operation is unchanged; see section
18.44.2 of [1].

13.9.4.  DESCRIPTION

The following text is added to the end of the LAYOUTRETURN operation
DESCRIPTION in section 18.44.3 of [1].
When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
then if the lrf_body field is NULL, it indicates to the MDS that the
client experienced no errors.  If lrf_body is non-NULL, then the
field references error information which is layout type specific.
I.e., the Objects-Based Layout protocol can continue to utilize
lrf_body as specified in [7].  For both Files-Based and Block-Based
Layouts, the field references a layoutreturn_error_report4, which
contains an array of layoutreturn_device_error4 entries.

Each individual layoutreturn_device_error4 describes a single error
associated with a DS, which is identified via lrde_deviceid.  The
operation which returned the error is identified via lrde_opnum.
Finally, the NFS error value (nfsstat4) encountered is provided via
lrde_status and may consist of the following error codes:

NFS4ERR_NXIO:  The client was unable to establish any communication
   with the DS.

NFS4ERR_*:  The client was able to establish communication with the
   DS and is returning one of the allowed error codes for the
   operation denoted by lrde_opnum.

13.9.5.  IMPLEMENTATION

The following text is added to the end of the LAYOUTRETURN operation
IMPLEMENTATION in section 18.44.4 of [1].

Clients are expected to tolerate transient storage device errors, and
hence clients SHOULD NOT use the LAYOUTRETURN error handling for
device access problems that may be transient.  The methods by which a
client decides whether a device access problem is transient or
persistent are implementation-specific, but may include retrying I/Os
to a data server under appropriate conditions.

When an I/O fails to a storage device, the client SHOULD retry the
failed I/O via the MDS.
In this situation, before retrying the I/O,
the client SHOULD return the layout, or the affected portion thereof,
and SHOULD indicate which storage device or devices were problematic.
The client needs to do this when the DS is unresponsive in order to
fence off any failed write attempts and ensure that they do not end
up overwriting any later data being written through the MDS.  If the
client does not do this, the MDS MAY issue a layout recall callback
in order to perform the retried I/O.

The client needs to be cognizant that since this error handling is
optional in the MDS, the MDS may silently ignore this functionality.
Also, as the MDS may consider some issues the client reports to be
expected (see Section 13.9.1), the client might find it difficult to
detect an MDS which has not implemented error handling via
LAYOUTRETURN.

If an MDS is aware that a storage device is proving problematic to a
client, the MDS SHOULD NOT include that storage device in any pNFS
layouts sent to that client.  If the MDS is aware that a storage
device is affecting many clients, then the MDS SHOULD NOT include
that storage device in any pNFS layouts sent out.  However, if a
client asks for a new layout for the file from the MDS, it MUST be
prepared for the MDS to return that storage device in the layout.
The MDS might not have any choice in using the storage device, i.e.,
there might only be one possible layout for the system.  Also, in the
case of existing files, the MDS might have no choice in which storage
devices to hand out to clients.

The MDS is not required to indefinitely retain per-client storage
device error information.  An MDS is also not required to
automatically reinstate use of a previously problematic storage
device; administrative intervention may be required instead.

13.10.  Operation 65: READ_PLUS

13.10.1.
ARGUMENT

   struct READ_PLUS4args {
           /* CURRENT_FH: file */
           stateid4        rpa_stateid;
           offset4         rpa_offset;
           count4          rpa_count;
   };

13.10.2.  RESULT

   struct data_info4 {
           offset4         di_offset;
           length4         di_length;
           bool            di_allocated;
   };

   struct data4 {
           offset4         d_offset;
           bool            d_allocated;
           opaque          d_data<>;
   };

   union read_plus_content switch (data_content4 rpc_content) {
   case NFS4_CONTENT_DATA:
           data4           rpc_data;
   case NFS4_CONTENT_APP_DATA_HOLE:
           app_data_hole4  rpc_adh;
   case NFS4_CONTENT_HOLE:
           data_info4      rpc_hole;
   default:
           void;
   };

   /*
    * Allow a return of an array of contents.
    */
   struct read_plus_res4 {
           bool                    rpr_eof;
           read_plus_content       rpr_contents<>;
   };

   union READ_PLUS4res switch (nfsstat4 rp_status) {
   case NFS4_OK:
           read_plus_res4  rp_resok4;
   default:
           void;
   };

13.10.3.  DESCRIPTION

The READ_PLUS operation is based upon the NFSv4.1 READ operation (see
Section 18.22 of [1]) and similarly reads data from the regular file
identified by the current filehandle.

The client provides an rpa_offset of where the READ_PLUS is to start
and an rpa_count of how many bytes are to be read.  An rpa_offset of
zero means to read data starting at the beginning of the file.  If
rpa_offset is greater than or equal to the size of the file, the
status NFS4_OK is returned with di_length (the data length) set to
zero and rpr_eof set to TRUE.

The READ_PLUS result is comprised of an array of rpr_contents, each
of which describes a data_content4 type of data (Section 6.1.2).  For
NFSv4.2, the allowed values are data, ADH, and hole.  A server is
required to support the data type, but is not required to support
either the ADH or the hole type.  Both an ADH and a hole must be
returned in their entirety; clients must be prepared to get more
information than they requested.
Both the start
and the end of the hole may exceed what was requested.  The array
contents MUST be contiguous in the file.

READ_PLUS has to support all of the errors which are returned by READ
plus NFS4ERR_UNION_NOTSUPP.  If the client asks for a hole and the
server does not support that arm of the discriminated union, but does
support one or more additional arms, it can signal to the client with
NFS4ERR_UNION_NOTSUPP that it supports the operation, but not that
arm.

If the data to be returned is comprised entirely of zeros, then the
server may elect to return that data as a hole.  The server
differentiates this to the client by setting di_allocated to TRUE in
this case.  Note that in such a scenario, the server is not required
to determine the full extent of the "hole" - it does not need to
determine where the zeros start and end.  Conversely, if the server
elects to return a hole as data, then it can set d_allocated to FALSE
in the rpc_data to indicate it is a hole.

The server may elect to return adjacent elements of the same type.
For example, the guard pattern or block size of an ADH might change,
which would require adjacent elements of type ADH.  Likewise, if the
server has a range of data comprised entirely of zeros and then a
hole, it might want to return two adjacent holes to the client.

If the client specifies an rpa_count value of zero, the READ_PLUS
succeeds and returns zero bytes of data.  In all situations, the
server may choose to return fewer bytes than specified by the client.
The client needs to check for this condition and handle it
appropriately.

If the client specifies an rpa_offset and rpa_count value that is
entirely contained within a hole of the file, then the di_offset and
di_length returned must be for the entire hole.
This result is
considered valid until the file is changed (detected via the change
attribute).  The server MUST provide the same semantics for the hole
as if the client read the region and received zeroes; the lifetime of
the implied hole's contents MUST be exactly the same as that of any
other read data.

If the client specifies an rpa_offset and rpa_count value that begins
in a non-hole of the file but extends into a hole, the server should
return an array comprised of both data and a hole.  The client MUST
be prepared for the server to return a short read describing just the
data.  The client will then issue another READ_PLUS for the remaining
bytes, to which the server will respond with information about the
hole in the file.

Except when special stateids are used, the stateid value for a
READ_PLUS request represents a value returned from a previous byte-
range lock or share reservation request or the stateid associated
with a delegation.  The stateid identifies the associated owners, if
any, and is used by the server to verify that the associated locks
are still valid (e.g., have not been revoked).

If the read ended at the end-of-file (formally, in a correctly formed
READ_PLUS operation, if rpa_offset + rpa_count is equal to the size
of the file), or the READ_PLUS operation extends beyond the size of
the file (if rpa_offset + rpa_count is greater than the size of the
file), rpr_eof is returned as TRUE; otherwise, it is FALSE.  A
successful READ_PLUS of an empty file will always return rpr_eof as
TRUE.

If the current filehandle is not an ordinary file, an error will be
returned to the client.  If the current filehandle represents an
object of type NF4DIR, NFS4ERR_ISDIR is returned.  If the current
filehandle designates a symbolic link, NFS4ERR_SYMLINK is returned.
In all other cases, NFS4ERR_WRONG_TYPE is returned.
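As a non-normative illustration of these segment semantics (the result array is contiguous, holes read back as zeros, and a returned hole may cover more than the request), the following Python sketch expands a READ_PLUS-style result into a flat buffer on the client side.  The Data and Hole classes and the expand() helper are illustrative assumptions, not part of the protocol XDR:

```python
# Illustrative client-side sketch (not from this specification):
# flattening a READ_PLUS-style rpr_contents array into bytes.
from dataclasses import dataclass

@dataclass
class Data:
    offset: int      # like d_offset: absolute file offset of the chunk
    data: bytes      # like d_data: the chunk contents

@dataclass
class Hole:
    offset: int      # like di_offset: absolute start of the hole
    length: int      # like di_length: hole length in bytes

def expand(contents, req_offset, req_count):
    """Flatten a contiguous result array into bytes, clipped to the
    requested [req_offset, req_offset + req_count) window.  Holes
    read as zeros; a short array yields a short read."""
    buf = bytearray()
    pos = req_offset
    for seg in contents:
        if isinstance(seg, Data):
            start, chunk = seg.offset, seg.data
        else:
            start, chunk = seg.offset, b"\0" * seg.length
        # A hole is returned in its entirety and may start before the
        # request, so clip any leading part.
        if start < pos:
            chunk = chunk[pos - start:]
            start = pos
        assert start == pos, "array contents MUST be contiguous"
        take = min(len(chunk), req_offset + req_count - pos)
        buf += chunk[:take]
        pos += take
        if pos >= req_offset + req_count:
            break
    return bytes(buf)
```

For example, a 64-byte request at offset 0 answered with 32 bytes of data followed by a large hole expands to 32 data bytes and 32 zeros, while a request that falls entirely inside a hole expands to all zeros.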
For a READ_PLUS with a stateid value of all bits equal to zero, the
server MAY allow the READ_PLUS to be serviced subject to mandatory
byte-range locks or the current share deny modes for the file.  For a
READ_PLUS with a stateid value of all bits equal to one, the server
MAY allow READ_PLUS operations to bypass locking checks at the
server.

On success, the current filehandle retains its value.

13.10.4.  IMPLEMENTATION

In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of
[1] also apply to READ_PLUS.  One difference is that when the owner
has a locked byte range, the server MUST return an array of
rpr_contents with values inside that range.

13.10.4.1.  Additional pNFS Implementation Information

With pNFS, the semantics of using READ_PLUS remains the same.  Any
data server MAY return a hole or ADH result for a READ_PLUS request
that it receives.  When a data server chooses to return such a
result, it has the option of returning information for the data
stored on that data server (as defined by the data layout), but it
MUST NOT return results for a byte range that includes data managed
by another data server.

A data server should do its best to return as much information about
an ADH as is feasible without having to contact the metadata server.
If communication with the metadata server is required, then every
attempt should be made to minimize the number of requests.

If mandatory locking is enforced, then the data server must also
ensure that it returns only information that is within the owner's
locked byte range.

13.10.5.  READ_PLUS with Sparse Files Example

The following table describes a sparse file.  For each byte range,
the file contains either non-zero data or a hole.  In addition, the
server in this example uses a Hole Threshold of 32K.
   +-------------+----------+
   | Byte-Range  | Contents |
   +-------------+----------+
   | 0-15999     | Hole     |
   | 16K-31999   | Non-Zero |
   | 32K-255999  | Hole     |
   | 256K-287999 | Non-Zero |
   | 288K-353999 | Hole     |
   | 354K-417999 | Non-Zero |
   +-------------+----------+

                   Table 7

Under the given circumstances, if a client were to read from the file
with a maximum read size of 64K, the following would be the results
for the given READ_PLUS calls.  This assumes the client has already
opened the file, acquired a valid stateid ('s' in the example), and
just needs to issue READ_PLUS requests.

1.  READ_PLUS(s, 0, 64K) --> NFS_OK, eof = false, <data[0, 32K],
    hole[32K, 224K]>.  Since the first hole is less than the server's
    Hole Threshold, the first 32K of the file is returned as data
    and the remaining 32K is returned as a hole which actually
    extends to 256K.

2.  READ_PLUS(s, 32K, 64K) --> NFS_OK, eof = false, <hole[32K,
    224K]>.  The requested range was all zeros, and the current hole
    begins at offset 32K and is 224K in length.  Note that the client
    should not have followed up the previous READ_PLUS request with
    this one, as the hole information from the previous call extended
    past what the client was requesting.

3.  READ_PLUS(s, 256K, 64K) --> NFS_OK, eof = false, <data[256K,
    32K], hole[288K, 66K]>.  Returns an array of the 32K of data and
    the hole which extends to 354K.

4.  READ_PLUS(s, 354K, 64K) --> NFS_OK, eof = true, <data[354K,
    64K]>.  Returns the final 64K of data and informs the client
    there is no more data in the file.

13.11.  Operation 66: SEEK

SEEK is an operation that allows a client to determine the location
of the next data_content4 in a file.  It allows an implementation of
the emerging extension to lseek(2) so that clients can determine
SEEK_HOLE and SEEK_DATA.

13.11.1.
ARGUMENT

   struct SEEK4args {
           /* CURRENT_FH: file */
           stateid4        sa_stateid;
           offset4         sa_offset;
           data_content4   sa_what;
   };

13.11.2.  RESULT

   union seek_content switch (data_content4 content) {
   case NFS4_CONTENT_DATA:
           data_info4      sc_data;
   case NFS4_CONTENT_APP_DATA_HOLE:
           app_data_hole4  sc_adh;
   case NFS4_CONTENT_HOLE:
           data_info4      sc_hole;
   default:
           void;
   };

   struct seek_res4 {
           bool            sr_eof;
           seek_content    sr_contents;
   };

   union SEEK4res switch (nfsstat4 status) {
   case NFS4_OK:
           seek_res4       resok4;
   default:
           void;
   };

13.11.3.  DESCRIPTION

From the given sa_offset, find the next data_content4 of type sa_what
in the file.  For either a hole or an ADH, this must return the
data_content4 in its entirety.  For data, it must not return the
actual data.

SEEK must follow the same rules for stateids as READ_PLUS
(Section 13.10.3).

If the server cannot find a corresponding sa_what, then the status
would still be NFS4_OK, but sr_eof would be TRUE.  The sr_contents
would contain a zeroed-out content of the appropriate type.

14.  NFSv4.2 Callback Operations

14.1.  Operation 15: CB_OFFLOAD - Report results of an asynchronous
       operation

14.1.1.  ARGUMENT

   struct write_response4 {
           stateid4        wr_callback_id<1>;
           count4          wr_count;
           stable_how4     wr_committed;
           verifier4       wr_writeverf;
   };

   union offload_info4 switch (nfsstat4 coa_status) {
   case NFS4_OK:
           write_response4 coa_resok4;
   default:
           length4         coa_bytes_copied;
   };

   struct CB_OFFLOAD4args {
           nfs_fh4         coa_fh;
           stateid4        coa_stateid;
           offload_info4   coa_offload_info;
   };

14.1.2.  RESULT

   struct CB_OFFLOAD4res {
           nfsstat4        cor_status;
   };

14.1.3.
DESCRIPTION

CB_OFFLOAD is used to report to the client the results of an
asynchronous operation, e.g., Server-side Copy or a hole punch.  The
coa_fh and coa_stateid identify the transaction, and the coa_status
indicates success or failure.  The coa_resok4.wr_callback_id MUST NOT
be set.  If the transaction failed, then coa_bytes_copied contains
the number of bytes copied before the failure occurred.  The
coa_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied.

If the client supports either the COPY or WRITE_PLUS operation, the
client is REQUIRED to support the CB_OFFLOAD operation.

There is a potential race between the reply to the original
transaction on the forechannel and the CB_OFFLOAD callback on the
backchannel.  Sections 2.10.6.3 and 20.9.3 in [1] describe how to
handle this type of issue.

14.1.3.1.  Server-side Copy

CB_OFFLOAD is used for both intra- and inter-server asynchronous
copies.  This operation is sent by the destination server to the
client in a CB_COMPOUND request.  Upon success,
coa_resok4.wr_count contains the total number of bytes copied.

14.1.3.2.  WRITE_PLUS

CB_OFFLOAD is used to report the completion of either a hole punch or
an ADH initialization.  Upon success, the coa_resok4 will contain the
same information that a synchronous WRITE_PLUS would have returned.

15.  IANA Considerations

This section uses terms that are defined in [24].

16.  References

16.1.  Normative References

   [1]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
         (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
         January 2010.

   [2]   Haynes, T., "Network File System (NFS) Version 4 Minor
         Version 2 External Data Representation Standard (XDR)
         Description", March 2011.

   [3]   Berners-Lee, T., Fielding, R., and L.
Masinter, "Uniform
         Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986,
         January 2005.

   [4]   Haynes, T. and N. Williams, "Remote Procedure Call (RPC)
         Security Version 3", draft-williams-rpcsecgssv3 (work in
         progress), 2011.

   [5]   The Open Group, "Section 'posix_fadvise()' of System
         Interfaces of The Open Group Base Specifications Issue 6,
         IEEE Std 1003.1, 2004 Edition", 2004.

   [6]   Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
         Specification", RFC 2203, September 1997.

   [7]   Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
         NFS (pNFS) Operations", RFC 5664, January 2010.

16.2.  Informative References

   [8]   Bradner, S., "Key words for use in RFCs to Indicate
         Requirement Levels", BCP 14, RFC 2119, March 1997.

   [9]   Haynes, T. and D. Noveck, "Network File System (NFS) version
         4 Protocol", draft-ietf-nfsv4-rfc3530bis-20 (Work In
         Progress), October 2012.

   [10]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M.
         Naik, "NSDB Protocol for Federated Filesystems",
         draft-ietf-nfsv4-federated-fs-protocol (Work In Progress),
         2010.

   [11]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M.
         Naik, "Administration Protocol for Federated Filesystems",
         draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010.

   [12]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter,
         L., Leach, P., and T. Berners-Lee, "Hypertext Transfer
         Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [13]  Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
         RFC 959, October 1985.

   [14]  Simpson, W., "PPP Challenge Handshake Authentication Protocol
         (CHAP)", RFC 1994, August 1996.

   [15]  Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments,
         of Oracle Database Concepts 11g Release 1 (11.1)",
         January 2011.
   [16]  Ashdown, L., "Chapter 15, Validating Database Files and
         Backups, of Oracle Database Backup and Recovery User's Guide
         11g Release 1 (11.1)", August 2008.

   [17]  McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory
         Corruption, of Solaris Internals", 2007.

   [18]  Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
         Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
         Corruption in the Storage Stack", Proceedings of the 6th
         USENIX Symposium on File and Storage Technologies (FAST '08),
         2008.

   [19]  Haynes, T., "Requirements for Labeled NFS",
         draft-ietf-nfsv4-labreqs-03 (work in progress).

   [20]  "Section 46.6. Multi-Level Security (MLS) of Deployment
         Guide: Deployment, configuration and administration of Red
         Hat Enterprise Linux 5, Edition 6", 2011.

   [21]  Quigley, D. and J. Lu, "Registry Specification for MAC
         Security Label Formats", draft-quigley-label-format-registry
         (work in progress), 2011.

   [22]  IESG, "IESG Processing of RFC Errata for the IETF Stream",
         2008.

   [23]  Eisler, M., "XDR: External Data Representation Standard",
         STD 67, RFC 4506, May 2006.

   [24]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
         Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

Appendix A.  Acknowledgments

For the pNFS Access Permissions Check, the original draft was by
Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow.  The work
was influenced by discussions with Benny Halevy and Bruce Fields.  A
review was done by Tom Haynes.

For the Sharing change attribute implementation details with NFSv4
clients, the original draft was by Trond Myklebust.

For the NFS Server-side Copy, the original draft was by James
Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul
Iyer.  Tom Talpey co-authored an unpublished version of that
document.
It was also reviewed by a number of individuals:
Pranoop Erasani, Tom Haynes, Arthur Lent, Trond Myklebust, Dave
Noveck, Theresa Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani,
and Nico Williams.

For the NFS space reservation operations, the original draft was by
Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer.

For the sparse file support, the original draft was by Dean
Hildebrand and Marc Eshel.  Valuable input and advice were received
from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and
Richard Scheffenegger.

For the Application IO Hints, the original draft was by Dean
Hildebrand, Mike Eisler, Trond Myklebust, and Sam Falkner.  Some
early reviewers included Benny Halevy and Pranoop Erasani.

For Labeled NFS, the original draft was by David Quigley, James
Morris, Jarret Lu, and Tom Haynes.  Peter Staubach, Trond Myklebust,
Stephen Smalley, Sorin Faibish, Nico Williams, and David Black also
contributed in the final push to get this accepted.

During the review process, Talia Reyes-Ortiz helped the sessions run
smoothly.  While many people contributed here and there, the core
reviewers were Andy Adamson, Pranoop Erasani, Bruce Fields, Chuck
Lever, Trond Myklebust, David Noveck, and Peter Staubach.

Appendix B.  RFC Editor Notes

[RFC Editor: please remove this section prior to publishing this
document as an RFC]

[RFC Editor: prior to publishing this document as an RFC, please
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
RFC number of this document]

Author's Address

   Thomas Haynes
   NetApp
   9110 E 66th St
   Tulsa, OK  74133
   USA

   Phone: +1 918 307 1415
   Email: thomas@netapp.com
   URI:   http://www.tulsalabs.com