NFSv4                                                          T. Haynes
Internet-Draft                                                    Editor
Intended status: Standards Track                          April 18, 2011
Expires: October 20, 2011

                      NFS Version 4 Minor Version 2
                 draft-haynes-nfsv4-minorversion2-01.txt

Abstract

This Internet-Draft describes NFS version 4 minor version two, focusing mainly on the protocol extensions made from NFS version 4 minor version 0 and NFS version 4 minor version 1.  Major extensions introduced in NFS version 4 minor version two include: Server-side Copy, Space Reservations, and Support for Sparse Files.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1].

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).
Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at http://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on October 20, 2011. 41 Copyright Notice 43 Copyright (c) 2011 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (http://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 56 This document may contain material from IETF Documents or IETF 57 Contributions published or made publicly available before November 58 10, 2008. The person(s) controlling the copyright in some of this 59 material may not have granted the IETF Trust the right to allow 60 modifications of such material outside the IETF Standards Process. 61 Without obtaining an adequate license from the person(s) controlling 62 the copyright in such materials, this document may not be modified 63 outside the IETF Standards Process, and derivative works of it may 64 not be created outside the IETF Standards Process, except to format 65 it for publication as an RFC or to translate it into languages other 66 than English. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 71 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . . 5 72 1.2. Scope of This Document . . . . . . . . . . . . . . . . . . 5 73 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5 74 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . . 5 75 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . . 5 76 2. pNFS LAYOUTRETURN Error Handling . . . . . . . . . . . . . . . 5 77 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 5 78 2.2. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 6 79 2.2.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 6 80 2.2.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 6 81 2.2.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 6 82 2.2.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 7 83 3. Sharing change attribute implementation details with NFSv4 84 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 85 3.1. Abstract . . . . . . . . . . . . . . . . . . . . . . . . . 8 86 3.2. Introduction . . . . . . . . . . . . . . . . . . . . . . . 9 87 3.3. Definition of the 'change_attr_type' per-file system 88 attribute . . . . . . . . . . . . . . . . . . . . . . . . 9 89 4. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 10 90 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 11 91 4.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 11 92 4.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 13 93 4.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 
14 94 4.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 17 95 4.3. Operations . . . . . . . . . . . . . . . . . . . . . . . . 19 96 4.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 19 97 4.3.2. Operation 61: COPY_NOTIFY - Notify a source server 98 of a future copy . . . . . . . . . . . . . . . . . . . 20 99 4.3.3. Operation 62: COPY_REVOKE - Revoke a destination 100 server's copy privileges . . . . . . . . . . . . . . . 22 101 4.3.4. Operation 59: COPY - Initiate a server-side copy . . . 23 102 4.3.5. Operation 60: COPY_ABORT - Cancel a server-side 103 copy . . . . . . . . . . . . . . . . . . . . . . . . . 31 104 4.3.6. Operation 63: COPY_STATUS - Poll for status of a 105 server-side copy . . . . . . . . . . . . . . . . . . . 32 106 4.3.7. Operation 15: CB_COPY - Report results of a 107 server-side copy . . . . . . . . . . . . . . . . . . . 33 108 4.3.8. Copy Offload Stateids . . . . . . . . . . . . . . . . 35 109 4.4. Security Considerations . . . . . . . . . . . . . . . . . 35 110 4.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 35 111 4.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 43 112 5. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 43 113 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 43 114 5.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 45 115 5.2.1. Space Reservation . . . . . . . . . . . . . . . . . . 45 116 5.2.2. Space freed on deletes . . . . . . . . . . . . . . . . 45 117 5.2.3. Operations and attributes . . . . . . . . . . . . . . 46 118 5.2.4. Attribute 77: space_reserve . . . . . . . . . . . . . 46 119 5.2.5. Attribute 78: space_freed . . . . . . . . . . . . . . 47 120 5.2.6. Attribute 79: max_hole_punch . . . . . . . . . . . . . 47 121 5.2.7. Operation 64: HOLE_PUNCH - Zero and deallocate 122 blocks backing the file in the specified range. . . . 47 123 5.3. Security Considerations . . . . . . . . . . . . . . . . . 49 124 5.4. IANA Considerations . . . . . . . . . . . . . . . . . . . 49 125 6. Simple and Efficient Read Support for Sparse Files . . . . . . 49 126 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 49 127 6.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 50 128 6.3. Applications and Sparse Files . . . . . . . . . . . . . . 50 129 6.4. Overview of Sparse Files and NFSv4 . . . . . . . . . . . . 51 130 6.5. Operation 65: READPLUS . . . . . . . . . . . . . . . . . . 52 131 6.5.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 53 132 6.5.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 54 133 6.5.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 54 134 6.5.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 56 135 6.5.5. READPLUS with Sparse Files Example . . . . . . . . . . 57 136 6.6. Related Work . . . . . . . . . . . . . . . . . . . . . . . 58 137 6.7. Other Proposed Designs . . . . . . . . . . . . . . . . . . 59 138 6.7.1. Multi-Data Server Hole Information . . . . . . . . . . 59 139 6.7.2. Data Result Array . . . . . . . . . . . . . . . . . . 59 140 6.7.3. User-Defined Sparse Mask . . . . . . . . . . . . . . . 60 141 6.7.4. Allocated flag . . . . . . . . . . . . . . . . . . . . 60 142 6.7.5. Dense and Sparse pNFS File Layouts . . . . . . . . . . 60 143 6.8. Security Considerations . . . . . . . . . . . . . . . . . 60 144 6.9. IANA Considerations . . . . . . . . . . . . . . . . . . . 60 145 7. Security Considerations . . . . . . . . . . . . . . . . . . . 60 146 8. IANA Considerations . . . . . . . . . . 
. . . . . . . . . . . 60
9.  References . . . . . . . . . . . . . . . . . . . . . . . . . .  60
9.1.  Normative References . . . . . . . . . . . . . . . . . . . .  60
9.2.  Informative References . . . . . . . . . . . . . . . . . . .  61
Appendix A.  Acknowledgments . . . . . . . . . . . . . . . . . . .  62
Appendix B.  RFC Editor Notes  . . . . . . . . . . . . . . . . . .  63
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . .  63

1.  Introduction

1.1.  The NFS Version 4 Minor Version 2 Protocol

The NFS version 4 minor version 2 (NFSv4.2) protocol is the third minor version of the NFS version 4 (NFSv4) protocol.  The first minor version, NFSv4.0, is described in [10], and the second minor version, NFSv4.1, is described in [2].  NFSv4.2 follows the guidelines for minor versioning that are listed in Section 11 of RFC 3530bis.

As a minor version, NFSv4.2 is consistent with the overall goals for NFSv4, but extends the protocol so as to better meet those goals, based on experience with NFSv4.1.  In addition, NFSv4.2 has adopted some additional goals, which motivate some of the major extensions in NFSv4.2.

1.2.  Scope of This Document

This document describes the NFSv4.2 protocol.  With respect to NFSv4.0 and NFSv4.1, this document does not:

o  describe the NFSv4.0 or NFSv4.1 protocols, except where needed to contrast with NFSv4.2.

o  modify the specification of the NFSv4.0 or NFSv4.1 protocols.

o  clarify the NFSv4.0 or NFSv4.1 protocols.

1.3.  NFSv4.2 Goals

1.4.  Overview of NFSv4.2 Features

1.5.  Differences from NFSv4.1

2.  pNFS LAYOUTRETURN Error Handling

2.1.  Introduction

In the pNFS description provided in [2], the client is not able to relay an error code from the DS to the MDS.  In the specification of the Objects-Based Layout protocol [3], use is made of the opaque lrf_body field of the LAYOUTRETURN argument to do such a relaying of error codes.  In this section, we define a new data structure to enable the passing of error codes back to the MDS and provide some guidelines on what both the client and the MDS should expect in such circumstances.

There are two broad classes of errors, transient and persistent.  The client SHOULD strive to use this new mechanism only to report persistent errors.  It MUST be able to deal with transient issues by itself.  Also, while the client might consider an issue to be persistent, it MUST be prepared for the MDS to consider such issues to be expected.  A prime example of this is the MDS fencing off a client from either a stateid or a filehandle.  The client will get an error from the DS and might relay either NFS4ERR_ACCESS or NFS4ERR_STALE_STATEID back to the MDS, with the belief that this is a hard error.  The MDS, on the other hand, is waiting for the client to report such an error.  For it, the mission is accomplished in that the client has returned a layout that the MDS had most likely recalled.

2.2.  Changes to Operation 51: LAYOUTRETURN

The existing LAYOUTRETURN operation is extended by introducing a new data structure to report errors, layoutreturn_device_error4.  Also, layoutreturn_error_report4 is introduced to enable an array of such errors to be reported.
2.2.1.  ARGUMENT

The ARGUMENT specification of the LAYOUTRETURN operation in section 18.44.1 of [2] is augmented by the following XDR code [11]:

   struct layoutreturn_device_error4 {
           deviceid4       lrde_deviceid;
           nfsstat4        lrde_status;
           nfs_opnum4      lrde_opnum;
   };

   struct layoutreturn_error_report4 {
           layoutreturn_device_error4 lrer_errors<>;
   };

2.2.2.  RESULT

The RESULT of the LAYOUTRETURN operation is unchanged; see section 18.44.2 of [2].

2.2.3.  DESCRIPTION

The following text is added to the end of the LAYOUTRETURN operation DESCRIPTION in section 18.44.3 of [2].

When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, then if the lrf_body field is NULL, it indicates to the MDS that the client experienced no errors.  If lrf_body is non-NULL, then the field references error information which is layout type specific.  I.e., the Objects-Based Layout protocol can continue to utilize lrf_body as specified in [3].  For the Files-Based Layout, the field references a layoutreturn_error_report4, which contains an array of layoutreturn_device_error4 structures.

Each individual layoutreturn_device_error4 describes a single error associated with a DS, which is identified via lrde_deviceid.  The operation which returned the error is identified via lrde_opnum.  Finally, the NFS error value (nfsstat4) encountered is provided via lrde_status and may be one of the following error codes:

NFS4_OK:  No issues were found for this device.

NFS4ERR_NXIO:  The client was unable to establish any communication with the DS.

NFS4ERR_*:  The client was able to establish communication with the DS and is returning one of the allowed error codes for the operation denoted by lrde_opnum.
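As a non-normative illustration, the fragment below shows how a client implementation might fill in a one-entry error report before XDR-encoding it into the opaque lrf_body of a LAYOUTRETURN4_FILE return.  It assumes rpcgen-style C bindings generated from the XDR above; the header name, the ds_deviceid variable, and the choice of a failed WRITE are hypothetical.

   #include <string.h>
   #include "nfs42_prot.h"   /* assumed rpcgen output for the XDR above */

   /* Sketch only: report a single WRITE failure against one DS.
    * ds_deviceid is the device ID of the unreachable DS, taken
    * from the layout being returned. */
   static void
   build_error_report(layoutreturn_error_report4 *report,
                      layoutreturn_device_error4 *lrde,
                      const deviceid4 ds_deviceid)
   {
       memcpy(lrde->lrde_deviceid, ds_deviceid, sizeof(deviceid4));
       lrde->lrde_status = NFS4ERR_NXIO;  /* no contact with the DS */
       lrde->lrde_opnum  = OP_WRITE;      /* operation that failed  */

       report->lrer_errors.lrer_errors_len = 1;
       report->lrer_errors.lrer_errors_val = lrde;
       /* The XDR encoding of *report is what the client places in
        * the lrf_body field when returning the affected layout. */
   }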
2.2.4.  IMPLEMENTATION

The following text is added to the end of the LAYOUTRETURN operation IMPLEMENTATION in section 18.44.4 of [2].

A client that expects to use pNFS for a mounted filesystem SHOULD check for pNFS support at mount time.  This check SHOULD be performed by sending a GETDEVICELIST operation, followed by layout-type-specific checks for accessibility of each storage device returned by GETDEVICELIST.  If the NFS server does not support pNFS, the GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP error; in this situation it is up to the client to determine whether it is acceptable to proceed with NFS-only access.

Clients are expected to tolerate transient storage device errors, and hence clients SHOULD NOT use the LAYOUTRETURN error handling for device access problems that may be transient.  The methods by which a client decides whether an access problem is transient or persistent are implementation-specific, but may include retrying I/Os to a data server under appropriate conditions.

When an I/O fails to a storage device, the client SHOULD retry the failed I/O via the MDS.  In this situation, before retrying the I/O, the client SHOULD return the layout, or the affected portion thereof, and SHOULD indicate which storage device or devices were problematic.  If the client does not do this, the MDS may issue a layout recall callback in order to perform the retried I/O.

The client needs to be cognizant that since this error handling is optional in the MDS, the MDS may silently ignore this functionality.  Also, as the MDS may consider some issues the client reports to be expected (see Section 2.1), the client might find it difficult to detect an MDS which has not implemented error handling via LAYOUTRETURN.

If an MDS is aware that a storage device is proving problematic to a client, the MDS SHOULD NOT include that storage device in any pNFS layouts sent to that client.  If the MDS is aware that a storage device is affecting many clients, then the MDS SHOULD NOT include that storage device in any pNFS layouts sent out.  Clients must still be aware that the MDS might not have any choice in using the storage device, i.e., there might only be one possible layout for the system.

Another interesting complication is that for existing files, the MDS might have no choice in which storage devices to hand out to clients.  The MDS might try to restripe a file across a different storage device, but clients need to be aware that not all implementations have restriping support.

An MDS SHOULD react to a client return of layouts with errors by not using the problematic storage devices in layouts for that client, but the MDS is not required to indefinitely retain per-client storage device error information.  An MDS is also not required to automatically reinstate use of a previously problematic storage device; administrative intervention may be required instead.

A client MAY perform I/O via the MDS even when the client holds a layout that covers the I/O; servers MUST support this client behavior, and MAY recall layouts as needed to complete I/Os.

3.  Sharing change attribute implementation details with NFSv4 clients

3.1.  Abstract

This document describes an extension to the NFSv4 protocol that allows the server to share information about the implementation of its change attribute with the client.  The aim is to improve the client's ability to determine the order in which parallel updates to the same file were processed.

3.2.  Introduction

Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the change attribute as being mandatory to implement, there is little in the way of guidance.  The only feature that is mandated by the spec is that the value must change whenever the file data or metadata change.

While this allows for a wide range of implementations, it also leaves the client with a conundrum: how does it determine which is the most recent value for the change attribute in a case where several RPC calls have been issued in parallel?  In other words, if two COMPOUNDs, both containing WRITE and GETATTR requests for the same file, have been issued in parallel, how does the client determine which of the two change attribute values returned in the replies to the GETATTR requests corresponds to the most recent state of the file?  In some cases, the only recourse may be to send another COMPOUND containing a third GETATTR that is fully serialised with the first two.

In order to avoid this kind of inefficiency, we propose a method to allow the server to share details about how the change attribute is expected to evolve, so that the client may immediately determine which, out of the several change attribute values returned by the server, is the most recent.

3.3.
Definition of the 'change_attr_type' per-file system attribute 368 enum change_attr_typeinfo { 369 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, 370 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, 371 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, 372 NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, 373 NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 374 }; 376 +------------------+----+---------------------------+-----+ 377 | Name | Id | Data Type | Acc | 378 +------------------+----+---------------------------+-----+ 379 | change_attr_type | XX | enum change_attr_typeinfo | R | 380 +------------------+----+---------------------------+-----+ 382 The proposed solution is to enable the NFS server to provide 383 additional information about how it expects the change attribute 384 value to evolve after the file data or metadata has changed. To do 385 so, we define a new recommended attribute, 'change_attr_type', which 386 may take values from enum change_attr_typeinfo as follows: 388 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST 389 monotonically increase for every atomic change to the file 390 attributes, data or directory contents. 392 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER: The change attribute value MUST 393 be incremented by one unit for every atomic change to the file 394 attributes, data or directory contents. This property is 395 preserved when writing to pNFS data servers. 397 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute 398 value MUST be incremented by one unit for every atomic change to 399 the file attributes, data or directory contents. In the case 400 where the client is writing to pNFS data servers, the number of 401 increments is not guaranteed to exactly match the number of 402 writes. 404 NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is 405 implemented as suggested in the NFSv4 spec [10] in terms of the 406 time_metadata attribute. 408 NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take 409 values that fit into any of these categories. 411 If either NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, 412 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or 413 NFS4_CHANGE_TYPE_IS_TIME_METADATA are set, then the client knows at 414 the very least that the change attribute is monotonically increasing, 415 which is sufficient to resolve the question of which value is the 416 most recent. 418 If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then 419 by inspecting the value of the 'time_delta' attribute it additionally 420 has the option of detecting rogue server implementations that use 421 time_metadata in violation of the spec. 423 Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it 424 has the ability to predict what the resulting change attribute value 425 should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. 426 This again allows it to detect changes made in parallel by another 427 client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits 428 the same, but only if the client is not doing pNFS WRITEs. 430 4. NFS Server-side Copy 431 4.1. Introduction 433 This document describes a server-side copy feature for the NFS 434 protocol. 436 The server-side copy feature provides a mechanism for the NFS client 437 to perform a file copy on the server without the data being 438 transmitted back and forth over the network. 440 Without this feature, an NFS client copies data from one location to 441 another by reading the data from the server over the network, and 442 then writing the data back over the network to the server. 
Using 443 this server-side copy operation, the client is able to instruct the 444 server to copy the data locally without the data being sent back and 445 forth over the network unnecessarily. 447 In general, this feature is useful whenever data is copied from one 448 location to another on the server. It is particularly useful when 449 copying the contents of a file from a backup. Backup-versions of a 450 file are copied for a number of reasons, including restoring and 451 cloning data. 453 If the source object and destination object are on different file 454 servers, the file servers will communicate with one another to 455 perform the copy operation. The server-to-server protocol by which 456 this is accomplished is not defined in this document. 458 4.2. Protocol Overview 460 The server-side copy offload operations support both intra-server and 461 inter-server file copies. An intra-server copy is a copy in which 462 the source file and destination file reside on the same server. In 463 an inter-server copy, the source file and destination file are on 464 different servers. In both cases, the copy may be performed 465 synchronously or asynchronously. 467 Throughout the rest of this document, we refer to the NFS server 468 containing the source file as the "source server" and the NFS server 469 to which the file is transferred as the "destination server". In the 470 case of an intra-server copy, the source server and destination 471 server are the same server. Therefore in the context of an intra- 472 server copy, the terms source server and destination server refer to 473 the single server performing the copy. 475 The operations described below are designed to copy files. Other 476 file system objects can be copied by building on these operations or 477 using other techniques. For example if the user wishes to copy a 478 directory, the client can synthesize a directory copy by first 479 creating the destination directory and then copying the source 480 directory's files to the new destination directory. If the user 481 wishes to copy a namespace junction [12] [13], the client can use the 482 ONC RPC Federated Filesystem protocol [13] to perform the copy. 483 Specifically the client can determine the source junction's 484 attributes using the FEDFS_LOOKUP_FSN procedure and create a 485 duplicate junction using the FEDFS_CREATE_JUNCTION procedure. 487 For the inter-server copy protocol, the operations are defined to be 488 compatible with a server-to-server copy protocol in which the 489 destination server reads the file data from the source server. This 490 model in which the file data is pulled from the source by the 491 destination has a number of advantages over a model in which the 492 source pushes the file data to the destination. The advantages of 493 the pull model include: 495 o The pull model only requires a remote server (i.e. the destination 496 server) to be granted read access. A push model requires a remote 497 server (i.e. the source server) to be granted write access, which 498 is more privileged. 500 o The pull model allows the destination server to stop reading if it 501 has run out of space. In a push model, the destination server 502 must flow control the source server in this situation. 504 o The pull model allows the destination server to easily flow 505 control the data stream by adjusting the size of its read 506 operations. In a push model, the destination server does not have 507 this ability. 
The source server in a push model is capable of 508 writing chunks larger than the destination server has requested in 509 attributes and session parameters. In theory, the destination 510 server could perform a "short" write in this situation, but this 511 approach is known to behave poorly in practice. 513 The following operations are provided to support server-side copy: 515 COPY_NOTIFY: For inter-server copies, the client sends this 516 operation to the source server to notify it of a future file copy 517 from a given destination server for the given user. 519 COPY_REVOKE: Also for inter-server copies, the client sends this 520 operation to the source server to revoke permission to copy a file 521 for the given user. 523 COPY: Used by the client to request a file copy. 525 COPY_ABORT: Used by the client to abort an asynchronous file copy. 527 COPY_STATUS: Used by the client to poll the status of an 528 asynchronous file copy. 530 CB_COPY: Used by the destination server to report the results of an 531 asynchronous file copy to the client. 533 These operations are described in detail in Section 4.3. This 534 section provides an overview of how these operations are used to 535 perform server-side copies. 537 4.2.1. Intra-Server Copy 539 To copy a file on a single server, the client uses a COPY operation. 540 The server may respond to the copy operation with the final results 541 of the copy or it may perform the copy asynchronously and deliver the 542 results using a CB_COPY operation callback. If the copy is performed 543 asynchronously, the client may poll the status of the copy using 544 COPY_STATUS or cancel the copy using COPY_ABORT. 546 A synchronous intra-server copy is shown in Figure 1. In this 547 example, the NFS server chooses to perform the copy synchronously. 548 The copy operation is completed, either successfully or 549 unsuccessfully, before the server replies to the client's request. 550 The server's reply contains the final result of the operation. 552 Client Server 553 + + 554 | | 555 |--- COPY ---------------------------->| Client requests 556 |<------------------------------------/| a file copy 557 | | 558 | | 560 Figure 1: A synchronous intra-server copy. 562 An asynchronous intra-server copy is shown in Figure 2. In this 563 example, the NFS server performs the copy asynchronously. The 564 server's reply to the copy request indicates that the copy operation 565 was initiated and the final result will be delivered at a later time. 566 The server's reply also contains a copy stateid. The client may use 567 this copy stateid to poll for status information (as shown) or to 568 cancel the copy using a COPY_ABORT. When the server completes the 569 copy, the server performs a callback to the client and reports the 570 results. 572 Client Server 573 + + 574 | | 575 |--- COPY ---------------------------->| Client requests 576 |<------------------------------------/| a file copy 577 | | 578 | | 579 |--- COPY_STATUS --------------------->| Client may poll 580 |<------------------------------------/| for status 581 | | 582 | . | Multiple COPY_STATUS 583 | . | operations may be sent. 584 | . | 585 | | 586 |<-- CB_COPY --------------------------| Server reports results 587 |\------------------------------------>| 588 | | 590 Figure 2: An asynchronous intra-server copy. 592 4.2.2. Inter-Server Copy 594 A copy may also be performed between two servers. The copy protocol 595 is designed to accommodate a variety of network topologies. 
As shown 596 in Figure 3, the client and servers may be connected by multiple 597 networks. In particular, the servers may be connected by a 598 specialized, high speed network (network 192.168.33.0/24 in the 599 diagram) that does not include the client. The protocol allows the 600 client to setup the copy between the servers (over network 601 10.11.78.0/24 in the diagram) and for the servers to communicate on 602 the high speed network if they choose to do so. 604 192.168.33.0/24 605 +-------------------------------------+ 606 | | 607 | | 608 | 192.168.33.18 | 192.168.33.56 609 +-------+------+ +------+------+ 610 | Source | | Destination | 611 +-------+------+ +------+------+ 612 | 10.11.78.18 | 10.11.78.56 613 | | 614 | | 615 | 10.11.78.0/24 | 616 +------------------+------------------+ 617 | 618 | 619 | 10.11.78.243 620 +-----+-----+ 621 | Client | 622 +-----------+ 624 Figure 3: An example inter-server network topology. 626 For an inter-server copy, the client notifies the source server that 627 a file will be copied by the destination server using a COPY_NOTIFY 628 operation. The client then initiates the copy by sending the COPY 629 operation to the destination server. The destination server may 630 perform the copy synchronously or asynchronously. 632 A synchronous inter-server copy is shown in Figure 4. In this case, 633 the destination server chooses to perform the copy before responding 634 to the client's COPY request. 636 An asynchronous copy is shown in Figure 5. In this case, the 637 destination server chooses to respond to the client's COPY request 638 immediately and then perform the copy asynchronously. 640 Client Source Destination 641 + + + 642 | | | 643 |--- COPY_NOTIFY --->| | 644 |<------------------/| | 645 | | | 646 | | | 647 |--- COPY ---------------------------->| 648 | | | 649 | | | 650 | |<----- read -----| 651 | |\--------------->| 652 | | | 653 | | . | Multiple reads may 654 | | . | be necessary 655 | | . | 656 | | | 657 | | | 658 |<------------------------------------/| Destination replies 659 | | | to COPY 661 Figure 4: A synchronous inter-server copy. 663 Client Source Destination 664 + + + 665 | | | 666 |--- COPY_NOTIFY --->| | 667 |<------------------/| | 668 | | | 669 | | | 670 |--- COPY ---------------------------->| 671 |<------------------------------------/| 672 | | | 673 | | | 674 | |<----- read -----| 675 | |\--------------->| 676 | | | 677 | | . | Multiple reads may 678 | | . | be necessary 679 | | . | 680 | | | 681 | | | 682 |--- COPY_STATUS --------------------->| Client may poll 683 |<------------------------------------/| for status 684 | | | 685 | | . | Multiple COPY_STATUS 686 | | . | operations may be sent 687 | | . | 688 | | | 689 | | | 690 | | | 691 |<-- CB_COPY --------------------------| Destination reports 692 |\------------------------------------>| results 693 | | | 695 Figure 5: An asynchronous inter-server copy. 697 4.2.3. Server-to-Server Copy Protocol 699 During an inter-server copy, the destination server reads the file 700 data from the source server. The source server and destination 701 server are not required to use a specific protocol to transfer the 702 file data. The choice of what protocol to use is ultimately the 703 destination server's decision. 705 4.2.3.1. Using NFSv4.x as a Server-to-Server Copy Protocol 707 The destination server MAY use standard NFSv4.x (where x >= 1) to 708 read the data from the source server. 
If NFSv4.x is used for the 709 server-to-server copy protocol, the destination server can use the 710 filehandle contained in the COPY request with standard NFSv4.x 711 operations to read data from the source server. Specifically, the 712 destination server may use the NFSv4.x OPEN operation's CLAIM_FH 713 facility to open the file being copied and obtain an open stateid. 714 Using the stateid, the destination server may then use NFSv4.x READ 715 operations to read the file. 717 4.2.3.2. Using an alternative Server-to-Server Copy Protocol 719 In a homogeneous environment, the source and destination servers 720 might be able to perform the file copy extremely efficiently using 721 specialized protocols. For example the source and destination 722 servers might be two nodes sharing a common file system format for 723 the source and destination file systems. Thus the source and 724 destination are in an ideal position to efficiently render the image 725 of the source file to the destination file by replicating the file 726 system formats at the block level. Another possibility is that the 727 source and destination might be two nodes sharing a common storage 728 area network, and thus there is no need to copy any data at all, and 729 instead ownership of the file and its contents might simply be re- 730 assigned to the destination. To allow for these possibilities, the 731 destination server is allowed to use a server-to-server copy protocol 732 of its choice. 734 In a heterogeneous environment, using a protocol other than NFSv4.x 735 (e.g. HTTP [14] or FTP [15]) presents some challenges. In 736 particular, the destination server is presented with the challenge of 737 accessing the source file given only an NFSv4.x filehandle. 739 One option for protocols that identify source files with path names 740 is to use an ASCII hexadecimal representation of the source 741 filehandle as the file name. 743 Another option for the source server is to use URLs to direct the 744 destination server to a specialized service. For example, the 745 response to COPY_NOTIFY could include the URL 746 ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII 747 hexadecimal representation of the source filehandle. When the 748 destination server receives the source server's URL, it would use 749 "_FH/0x12345" as the file name to pass to the FTP server listening on 750 port 9999 of s1.example.com. On port 9999 there would be a special 751 instance of the FTP service that understands how to convert NFS 752 filehandles to an open file descriptor (in many operating systems, 753 this would require a new system call, one which is the inverse of the 754 makefh() function that the pre-NFSv4 MOUNT service needs). 756 Authenticating and identifying the destination server to the source 757 server is also a challenge. Recommendations for how to accomplish 758 this are given in Section 4.4.1.2.4 and Section 4.4.1.4. 760 4.3. Operations 762 In the sections that follow, several operations are defined that 763 together provide the server-side copy feature. These operations are 764 intended to be OPTIONAL operations as defined in section 17 of [2]. 765 The COPY_NOTIFY, COPY_REVOKE, COPY, COPY_ABORT, and COPY_STATUS 766 operations are designed to be sent within an NFSv4 COMPOUND 767 procedure. The CB_COPY operation is designed to be sent within an 768 NFSv4 CB_COMPOUND procedure. 
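As an illustrative (non-normative) outline of how these operations are combined, an inter-server copy of a single file might involve COMPOUND requests of roughly the following shape, where the filehandles and the destination name are placeholders and only the operations relevant to the copy are shown:

   Client -> source server:
       SEQUENCE
       PUTFH source-fh
       COPY_NOTIFY (cna_destination_server = destination server)

   Client -> destination server:
       SEQUENCE
       PUTFH source-fh            (foreign filehandle from the source)
       SAVEFH
       PUTFH destination-dir-fh
       COPY (ca_destination = "newfile",
             ca_source_server = cnr_source_server from COPY_NOTIFY)

If the destination server performs the copy asynchronously, the client may subsequently send COPY_STATUS (or COPY_ABORT) requests to the destination server, and the destination server completes the exchange with a CB_COPY callback, as illustrated in Figure 5.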
770 Each operation is performed in the context of the user identified by 771 the ONC RPC credential of its containing COMPOUND or CB_COMPOUND 772 request. For example, a COPY_ABORT operation issued by a given user 773 indicates that a specified COPY operation initiated by the same user 774 be canceled. Therefore a COPY_ABORT MUST NOT interfere with a copy 775 of the same file initiated by another user. 777 An NFS server MAY allow an administrative user to monitor or cancel 778 copy operations using an implementation specific interface. 780 4.3.1. netloc4 - Network Locations 782 The server-side copy operations specify network locations using the 783 netloc4 data type shown below: 785 enum netloc_type4 { 786 NL4_NAME = 0, 787 NL4_URL = 1, 788 NL4_NETADDR = 2 789 }; 790 union netloc4 switch (netloc_type4 nl_type) { 791 case NL4_NAME: utf8str_cis nl_name; 792 case NL4_URL: utf8str_cis nl_url; 793 case NL4_NETADDR: netaddr4 nl_addr; 794 }; 796 If the netloc4 is of type NL4_NAME, the nl_name field MUST be 797 specified as a UTF-8 string. The nl_name is expected to be resolved 798 to a network address via DNS, LDAP, NIS, /etc/hosts, or some other 799 means. If the netloc4 is of type NL4_URL, a server URL [4] 800 appropriate for the server-to-server copy operation is specified as a 801 UTF-8 string. If the netloc4 is of type NL4_NETADDR, the nl_addr 802 field MUST contain a valid netaddr4 as defined in Section 3.3.9 of 803 [2]. 805 When netloc4 values are used for an inter-server copy as shown in 806 Figure 3, their values may be evaluated on the source server, 807 destination server, and client. The network environment in which 808 these systems operate should be configured so that the netloc4 values 809 are interpreted as intended on each system. 811 4.3.2. Operation 61: COPY_NOTIFY - Notify a source server of a future 812 copy 814 4.3.2.1. ARGUMENT 816 struct COPY_NOTIFY4args { 817 /* CURRENT_FH: source file */ 818 netloc4 cna_destination_server; 819 }; 821 4.3.2.2. RESULT 823 struct COPY_NOTIFY4resok { 824 nfstime4 cnr_lease_time; 825 netloc4 cnr_source_server<>; 826 }; 828 union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { 829 case NFS4_OK: 830 COPY_NOTIFY4resok resok4; 831 default: 832 void; 833 }; 835 4.3.2.3. DESCRIPTION 837 This operation is used for an inter-server copy. A client sends this 838 operation in a COMPOUND request to the source server to authorize a 839 destination server identified by cna_destination_server to read the 840 file specified by CURRENT_FH on behalf of the given user. 842 The cna_destination_server MUST be specified using the netloc4 843 network location format. The server is not required to resolve the 844 cna_destination_server address before completing this operation. 846 If this operation succeeds, the source server will allow the 847 cna_destination_server to copy the specified file on behalf of the 848 given user. If COPY_NOTIFY succeeds, the destination server is 849 granted permission to read the file as long as both of the following 850 conditions are met: 852 o The destination server begins reading the source file before the 853 cnr_lease_time expires. If the cnr_lease_time expires while the 854 destination server is still reading the source file, the 855 destination server is allowed to finish reading the file. 857 o The client has not issued a COPY_REVOKE for the same combination 858 of user, filehandle, and destination server. 860 The cnr_lease_time is chosen by the source server. A cnr_lease_time 861 of 0 (zero) indicates an infinite lease. 
To renew the copy lease 862 time the client should resend the same copy notification request to 863 the source server. 865 To avoid the need for synchronized clocks, copy lease times are 866 granted by the server as a time delta. However, there is a 867 requirement that the client and server clocks do not drift 868 excessively over the duration of the lease. There is also the issue 869 of propagation delay across the network which could easily be several 870 hundred milliseconds as well as the possibility that requests will be 871 lost and need to be retransmitted. 873 To take propagation delay into account, the client should subtract it 874 from copy lease times (e.g. if the client estimates the one-way 875 propagation delay as 200 milliseconds, then it can assume that the 876 lease is already 200 milliseconds old when it gets it). In addition, 877 it will take another 200 milliseconds to get a response back to the 878 server. So the client must send a lease renewal or send the copy 879 offload request to the cna_destination_server at least 400 880 milliseconds before the copy lease would expire. If the propagation 881 delay varies over the life of the lease (e.g. the client is on a 882 mobile host), the client will need to continuously subtract the 883 increase in propagation delay from the copy lease times. 885 The server's copy lease period configuration should take into account 886 the network distance of the clients that will be accessing the 887 server's resources. It is expected that the lease period will take 888 into account the network propagation delays and other network delay 889 factors for the client population. Since the protocol does not allow 890 for an automatic method to determine an appropriate copy lease 891 period, the server's administrator may have to tune the copy lease 892 period. 894 A successful response will also contain a list of names, addresses, 895 and URLs called cnr_source_server, on which the source is willing to 896 accept connections from the destination. These might not be 897 reachable from the client and might be located on networks to which 898 the client has no connection. 900 If the client wishes to perform an inter-server copy, the client MUST 901 send a COPY_NOTIFY to the source server. Therefore, the source 902 server MUST support COPY_NOTIFY. 904 For a copy only involving one server (the source and destination are 905 on the same server), this operation is unnecessary. 907 The COPY_NOTIFY operation may fail for the following reasons (this is 908 a partial list): 910 NFS4ERR_MOVED: The file system which contains the source file is not 911 present on the source server. The client can determine the 912 correct location and reissue the operation with the correct 913 location. 915 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 916 NFS server receiving this request. 918 NFS4ERR_WRONGSEC: The security mechanism being used by the client 919 does not match the server's security policy. 921 4.3.3. Operation 62: COPY_REVOKE - Revoke a destination server's copy 922 privileges 924 4.3.3.1. ARGUMENT 926 struct COPY_REVOKE4args { 927 /* CURRENT_FH: source file */ 928 netloc4 cra_destination_server; 929 }; 931 4.3.3.2. RESULT 933 struct COPY_REVOKE4res { 934 nfsstat4 crr_status; 935 }; 937 4.3.3.3. DESCRIPTION 939 This operation is used for an inter-server copy. 
A client sends this 940 operation in a COMPOUND request to the source server to revoke the 941 authorization of a destination server identified by 942 cra_destination_server from reading the file specified by CURRENT_FH 943 on behalf of given user. If the cra_destination_server has already 944 begun copying the file, a successful return from this operation 945 indicates that further access will be prevented. 947 The cra_destination_server MUST be specified using the netloc4 948 network location format. The server is not required to resolve the 949 cra_destination_server address before completing this operation. 951 The COPY_REVOKE operation is useful in situations in which the source 952 server granted a very long or infinite lease on the destination 953 server's ability to read the source file and all copy operations on 954 the source file have been completed. 956 For a copy only involving one server (the source and destination are 957 on the same server), this operation is unnecessary. 959 If the server supports COPY_NOTIFY, the server is REQUIRED to support 960 the COPY_REVOKE operation. 962 The COPY_REVOKE operation may fail for the following reasons (this is 963 a partial list): 965 NFS4ERR_MOVED: The file system which contains the source file is not 966 present on the source server. The client can determine the 967 correct location and reissue the operation with the correct 968 location. 970 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 971 NFS server receiving this request. 973 4.3.4. Operation 59: COPY - Initiate a server-side copy 975 4.3.4.1. ARGUMENT 977 const COPY4_GUARDED = 0x00000001; 978 const COPY4_METADATA = 0x00000002; 980 struct COPY4args { 981 /* SAVED_FH: source file */ 982 /* CURRENT_FH: destination file or */ 983 /* directory */ 984 offset4 ca_src_offset; 985 offset4 ca_dst_offset; 986 length4 ca_count; 987 uint32_t ca_flags; 988 component4 ca_destination; 989 netloc4 ca_source_server<>; 990 }; 992 4.3.4.2. RESULT 994 union COPY4res switch (nfsstat4 cr_status) { 995 case NFS4_OK: 996 stateid4 cr_callback_id<1>; 997 default: 998 length4 cr_bytes_copied; 999 }; 1001 4.3.4.3. DESCRIPTION 1003 The COPY operation is used for both intra- and inter-server copies. 1004 In both cases, the COPY is always sent from the client to the 1005 destination server of the file copy. The COPY operation requests 1006 that a file be copied from the location specified by the SAVED_FH 1007 value to the location specified by the combination of CURRENT_FH and 1008 ca_destination. 1010 The SAVED_FH must be a regular file. If SAVED_FH is not a regular 1011 file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. 1013 In order to set SAVED_FH to the source file handle, the compound 1014 procedure requesting the COPY will include a sub-sequence of 1015 operations such as 1017 PUTFH source-fh 1018 SAVEFH 1020 If the request is for a server-to-server copy, the source-fh is a 1021 filehandle from the source server and the compound procedure is being 1022 executed on the destination server. In this case, the source-fh is a 1023 foreign filehandle on the server receiving the COPY request. If 1024 either PUTFH or SAVEFH checked the validity of the filehandle, the 1025 operation would likely fail and return NFS4ERR_STALE. 1027 In order to avoid this problem, the minor version incorporating the 1028 COPY operations will need to make a few small changes in the handling 1029 of existing operations. 
If a server supports the server-to-server 1030 COPY feature, a PUTFH followed by a SAVEFH MUST NOT return 1031 NFS4ERR_STALE for either operation. These restrictions do not pose 1032 substantial difficulties for servers. The CURRENT_FH and SAVED_FH 1033 may be validated in the context of the operation referencing them and 1034 an NFS4ERR_STALE error returned for an invalid file handle at that 1035 point. 1037 The CURRENT_FH and ca_destination together specify the destination of 1038 the copy operation. If ca_destination is of 0 (zero) length, then 1039 CURRENT_FH specifies the target file. In this case, CURRENT_FH MUST 1040 be a regular file and not a directory. If ca_destination is not of 0 1041 (zero) length, the ca_destination argument specifies the file name to 1042 which the data will be copied within the directory identified by 1043 CURRENT_FH. In this case, CURRENT_FH MUST be a directory and not a 1044 regular file. 1046 If the file named by ca_destination does not exist and the operation 1047 completes successfully, the file will be visible in the file system 1048 namespace. If the file does not exist and the operation fails, the 1049 file MAY be visible in the file system namespace depending on when 1050 the failure occurs and on the implementation of the NFS server 1051 receiving the COPY operation. If the ca_destination name cannot be 1052 created in the destination file system (due to file name 1053 restrictions, such as case or length), the operation MUST fail. 1055 The ca_src_offset is the offset within the source file from which the 1056 data will be read, the ca_dst_offset is the offset within the 1057 destination file to which the data will be written, and the ca_count 1058 is the number of bytes that will be copied. An offset of 0 (zero) 1059 specifies the start of the file. A count of 0 (zero) requests that 1060 all bytes from ca_src_offset through EOF be copied to the 1061 destination. If concurrent modifications to the source file overlap 1062 with the source file region being copied, the data copied may include 1063 all, some, or none of the modifications. The client can use standard 1064 NFS operations (e.g. OPEN with OPEN4_SHARE_DENY_WRITE or mandatory 1065 byte range locks) to protect against concurrent modifications if the 1066 client is concerned about this. If the source file's end of file is 1067 being modified in parallel with a copy that specifies a count of 0 1068 (zero) bytes, the amount of data copied is implementation dependent 1069 (clients may guard against this case by specifying a non-zero count 1070 value or preventing modification of the source file as mentioned 1071 above). 1073 If the source offset or the source offset plus count is greater than 1074 or equal to the size of the source file, the operation will fail with 1075 NFS4ERR_INVAL. The destination offset or destination offset plus 1076 count may be greater than the size of the destination file. This 1077 allows for the client to issue parallel copies to implement 1078 operations such as "cat file1 file2 file3 file4 > dest". 1080 If the destination file is created as a result of this command, the 1081 destination file's size will be equal to the number of bytes 1082 successfully copied. If the destination file already existed, the 1083 destination file's size may increase as a result of this operation 1084 (e.g. if ca_dst_offset plus ca_count is greater than the 1085 destination's initial size). 
1087 If the ca_source_server list is specified, then this is an inter- 1088 server copy operation and the source file is on a remote server. The 1089 client is expected to have previously issued a successful COPY_NOTIFY 1090 request to the remote source server. The ca_source_server list 1091 SHOULD be the same as the COPY_NOTIFY response's cnr_source_server 1092 list. If the client includes the entries from the COPY_NOTIFY 1093 response's cnr_source_server list in the ca_source_server list, the 1094 source server can indicate a specific copy protocol for the 1095 destination server to use by returning a URL, which specifies both a 1096 protocol service and server name. Server-to-server copy protocol 1097 considerations are described in Section 4.2.3 and Section 4.4.1. 1099 The ca_flags argument allows the copy operation to be customized in 1100 the following ways using the guarded flag (COPY4_GUARDED) and the 1101 metadata flag (COPY4_METADATA). 1103 [NOTE: Earlier versions of this document defined a 1104 COPY4_SPACE_RESERVED flag for controlling space reservations on the 1105 destination file. This flag has been removed with the expectation 1106 that the space_reserve attribute defined in XXX_TDH_XXX will be 1107 adopted.] 1109 If the guarded flag is set and the destination exists on the server, 1110 this operation will fail with NFS4ERR_EXIST. 1112 If the guarded flag is not set and the destination exists on the 1113 server, the behavior is implementation dependent. 1115 If the metadata flag is set and the client is requesting a whole file 1116 copy (i.e. ca_count is 0 (zero)), a subset of the destination file's 1117 attributes MUST be the same as the source file's corresponding 1118 attributes and a subset of the destination file's attributes SHOULD 1119 be the same as the source file's corresponding attributes. The 1120 attributes in the MUST and SHOULD copy subsets will be defined for 1121 each NFS version. 1123 For NFSv4.1, Table 1 and Table 2 list the REQUIRED and RECOMMENDED 1124 attributes respectively. A "MUST" in the "Copy to destination file?" 1125 column indicates that the attribute is part of the MUST copy set. A 1126 "SHOULD" in the "Copy to destination file?" column indicates that the 1127 attribute is part of the SHOULD copy set. 1129 +--------------------+----+---------------------------+ 1130 | Name | Id | Copy to destination file? | 1131 +--------------------+----+---------------------------+ 1132 | supported_attrs | 0 | no | 1133 | type | 1 | MUST | 1134 | fh_expire_type | 2 | no | 1135 | change | 3 | SHOULD | 1136 | size | 4 | MUST | 1137 | link_support | 5 | no | 1138 | symlink_support | 6 | no | 1139 | named_attr | 7 | no | 1140 | fsid | 8 | no | 1141 | unique_handles | 9 | no | 1142 | lease_time | 10 | no | 1143 | rdattr_error | 11 | no | 1144 | filehandle | 19 | no | 1145 | suppattr_exclcreat | 75 | no | 1146 +--------------------+----+---------------------------+ 1148 Table 1 1150 +--------------------+----+---------------------------+ 1151 | Name | Id | Copy to destination file? 
| 1152 +--------------------+----+---------------------------+ 1153 | acl | 12 | MUST | 1154 | aclsupport | 13 | no | 1155 | archive | 14 | no | 1156 | cansettime | 15 | no | 1157 | case_insensitive | 16 | no | 1158 | case_preserving | 17 | no | 1159 | change_policy | 60 | no | 1160 | chown_restricted | 18 | MUST | 1161 | dacl | 58 | MUST | 1162 | dir_notif_delay | 56 | no | 1163 | dirent_notif_delay | 57 | no | 1164 | fileid | 20 | no | 1165 | files_avail | 21 | no | 1166 | files_free | 22 | no | 1167 | files_total | 23 | no | 1168 | fs_charset_cap | 76 | no | 1169 | fs_layout_type | 62 | no | 1170 | fs_locations | 24 | no | 1171 | fs_locations_info | 67 | no | 1172 | fs_status | 61 | no | 1173 | hidden | 25 | MUST | 1174 | homogeneous | 26 | no | 1175 | layout_alignment | 66 | no | 1176 | layout_blksize | 65 | no | 1177 | layout_hint | 63 | no | 1178 | layout_type | 64 | no | 1179 | maxfilesize | 27 | no | 1180 | maxlink | 28 | no | 1181 | maxname | 29 | no | 1182 | maxread | 30 | no | 1183 | maxwrite | 31 | no | 1184 | mdsthreshold | 68 | no | 1185 | mimetype | 32 | MUST | 1186 | mode | 33 | MUST | 1187 | mode_set_masked | 74 | no | 1188 | mounted_on_fileid | 55 | no | 1189 | no_trunc | 34 | no | 1190 | numlinks | 35 | no | 1191 | owner | 36 | MUST | 1192 | owner_group | 37 | MUST | 1193 | quota_avail_hard | 38 | no | 1194 | quota_avail_soft | 39 | no | 1195 | quota_used | 40 | no | 1196 | rawdev | 41 | no | 1197 | retentevt_get | 71 | MUST | 1198 | retentevt_set | 72 | no | 1199 | retention_get | 69 | MUST | 1200 | retention_hold | 73 | MUST | 1201 | retention_set | 70 | no | 1202 | sacl | 59 | MUST | 1203 | space_avail | 42 | no | 1204 | space_free | 43 | no | 1205 | space_total | 44 | no | 1206 | space_used | 45 | no | 1207 | system | 46 | MUST | 1208 | time_access | 47 | MUST | 1209 | time_access_set | 48 | no | 1210 | time_backup | 49 | no | 1211 | time_create | 50 | MUST | 1212 | time_delta | 51 | no | 1213 | time_metadata | 52 | SHOULD | 1214 | time_modify | 53 | MUST | 1215 | time_modify_set | 54 | no | 1216 +--------------------+----+---------------------------+ 1218 Table 2 1220 [NOTE: The space_reserve attribute XXX_TDH_XXX will be in the MUST 1221 set.] 1223 [NOTE: The source file's attribute values will take precedence over 1224 any attribute values inherited by the destination file.] 1225 In the case of an inter-server copy or an intra-server copy between 1226 file systems, the attributes supported for the source file and 1227 destination file could be different. By definition,the REQUIRED 1228 attributes will be supported in all cases. If the metadata flag is 1229 set and the source file has a RECOMMENDED attribute that is not 1230 supported for the destination file, the copy MUST fail with 1231 NFS4ERR_ATTRNOTSUPP. 1233 Any attribute supported by the destination server that is not set on 1234 the source file SHOULD be left unset. 1236 Metadata attributes not exposed via the NFS protocol SHOULD be copied 1237 to the destination file where appropriate. 1239 The destination file's named attributes are not duplicated from the 1240 source file. After the copy process completes, the client MAY 1241 attempt to duplicate named attributes using standard NFSv4 1242 operations. However, the destination file's named attribute 1243 capabilities MAY be different from the source file's named attribute 1244 capabilities. 1246 If the metadata flag is not set and the client is requesting a whole 1247 file copy (i.e. 
ca_count is 0 (zero)), the destination file's 1248 metadata is implementation dependent. 1250 If the client is requesting a partial file copy (i.e. ca_count is not 1251 0 (zero)), the client SHOULD NOT set the metadata flag and the server 1252 MUST ignore the metadata flag. 1254 If the operation does not result in an immediate failure, the server 1255 will return NFS4_OK, and the CURRENT_FH will remain the destination's 1256 filehandle. 1258 If an immediate failure does occur, cr_bytes_copied will be set to 1259 the number of bytes copied to the destination file before the error 1260 occurred. The cr_bytes_copied value indicates the number of bytes 1261 copied but not which specific bytes have been copied. 1263 A return of NFS4_OK indicates that either the operation is complete 1264 or the operation was initiated and a callback will be used to deliver 1265 the final status of the operation. 1267 If the cr_callback_id is returned, this indicates that the operation 1268 was initiated and a CB_COPY callback will deliver the final results 1269 of the operation. The cr_callback_id stateid is termed a copy 1270 stateid in this context. The server is given the option of returning 1271 the results in a callback because the data may require a relatively 1272 long period of time to copy. 1274 If no cr_callback_id is returned, the operation completed 1275 synchronously and no callback will be issued by the server. The 1276 completion status of the operation is indicated by cr_status. 1278 If the copy completes successfully, either synchronously or 1279 asynchronously, the data copied from the source file to the 1280 destination file MUST appear identical to the NFS client. However, 1281 the NFS server's on disk representation of the data in the source 1282 file and destination file MAY differ. For example, the NFS server 1283 might encrypt, compress, deduplicate, or otherwise represent the on 1284 disk data in the source and destination file differently. 1286 In the event of a failure, the state of the destination file is 1287 implementation dependent. The COPY operation may fail for the 1288 following reasons (this is a partial list). 1290 NFS4ERR_MOVED: The file system which contains the source file, or 1291 the destination file or directory, is not present. The client can 1292 determine the correct location and reissue the operation with the 1293 correct location. 1295 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 1296 NFS server receiving this request. 1298 NFS4ERR_PARTNER_NOTSUPP: The remote server does not support the 1299 server-to-server copy offload protocol. 1301 NFS4ERR_PARTNER_NO_AUTH: The remote server does not authorize a 1302 server-to-server copy offload operation. This may be due to the 1303 client's failure to send the COPY_NOTIFY operation to the remote 1304 server, the remote server receiving a server-to-server copy 1305 offload request after the copy lease time expired, or for some 1306 other permission problem. 1308 NFS4ERR_FBIG: The copy operation would have caused the file to grow 1309 beyond the server's limit. 1311 NFS4ERR_NOTDIR: The CURRENT_FH is a file and ca_destination has non- 1312 zero length. 1314 NFS4ERR_WRONG_TYPE: The SAVED_FH is not a regular file. 1316 NFS4ERR_ISDIR: The CURRENT_FH is a directory and ca_destination has 1317 zero length. 1319 NFS4ERR_INVAL: The source offset or offset plus count is greater 1320 than or equal to the size of the source file.
1322 NFS4ERR_DELAY: The server does not have the resources to perform the 1323 copy operation at the current time. The client should retry the 1324 operation sometime in the future. 1326 NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the 1327 same metadata as the source file. 1329 NFS4ERR_WRONGSEC: The security mechanism being used by the client 1330 does not match the server's security policy. 1332 4.3.5. Operation 60: COPY_ABORT - Cancel a server-side copy 1334 4.3.5.1. ARGUMENT 1336 struct COPY_ABORT4args { 1337 /* CURRENT_FH: destination file */ 1338 stateid4 caa_stateid; 1339 }; 1341 4.3.5.2. RESULT 1343 struct COPY_ABORT4res { 1344 nfsstat4 car_status; 1345 }; 1347 4.3.5.3. DESCRIPTION 1349 COPY_ABORT is used for both intra- and inter-server asynchronous 1350 copies. The COPY_ABORT operation allows the client to cancel a 1351 server-side copy operation that it initiated. This operation is sent 1352 in a COMPOUND request from the client to the destination server. 1353 This operation may be used to cancel a copy when the application that 1354 requested the copy exits before the operation is completed or for 1355 some other reason. 1357 The request contains the filehandle and copy stateid cookies that act 1358 as the context for the previously initiated copy operation. 1360 The result's car_status field indicates whether the cancel was 1361 successful or not. A value of NFS4_OK indicates that the copy 1362 operation was canceled and no callback will be issued by the server. 1363 A copy operation that is successfully canceled may result in none, 1364 some, or all of the data having been copied. 1366 If the server supports asynchronous copies, the server is REQUIRED to 1367 support the COPY_ABORT operation. 1369 The COPY_ABORT operation may fail for the following reasons (this is 1370 a partial list): 1372 NFS4ERR_NOTSUPP: The abort operation is not supported by the NFS 1373 server receiving this request. 1375 NFS4ERR_RETRY: The abort failed, but a retry at some time in the 1376 future MAY succeed. 1378 NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will 1379 deliver the results of the copy operation. 1381 NFS4ERR_SERVERFAULT: An error occurred on the server that does not 1382 map to a specific error code. 1384 4.3.6. Operation 63: COPY_STATUS - Poll for status of a server-side 1385 copy 1387 4.3.6.1. ARGUMENT 1389 struct COPY_STATUS4args { 1390 /* CURRENT_FH: destination file */ 1391 stateid4 csa_stateid; 1392 }; 1394 4.3.6.2. RESULT 1396 struct COPY_STATUS4resok { 1397 length4 csr_bytes_copied; 1398 nfsstat4 csr_complete<1>; 1399 }; 1401 union COPY_STATUS4res switch (nfsstat4 csr_status) { 1402 case NFS4_OK: 1403 COPY_STATUS4resok resok4; 1404 default: 1405 void; 1406 }; 1408 4.3.6.3. DESCRIPTION 1410 COPY_STATUS is used for both intra- and inter-server asynchronous 1411 copies. The COPY_STATUS operation allows the client to poll the 1412 server to determine the status of an asynchronous copy operation. 1413 This operation is sent by the client to the destination server. 1415 If this operation is successful, the number of bytes copied is 1416 returned to the client in the csr_bytes_copied field. The 1417 csr_bytes_copied value indicates the number of bytes copied but not 1418 which specific bytes have been copied. 1420 If the optional csr_complete field is present, the copy has 1421 completed. In this case the status value indicates the result of the 1422 asynchronous copy operation.
In all cases, the server will also 1423 deliver the final results of the asynchronous copy in a CB_COPY 1424 operation. 1426 The failure of this operation does not indicate the result of the 1427 asynchronous copy in any way. 1429 If the server supports asynchronous copies, the server is REQUIRED to 1430 support the COPY_STATUS operation. 1432 The COPY_STATUS operation may fail for the following reasons (this is 1433 a partial list): 1435 NFS4ERR_NOTSUPP: The copy status operation is not supported by the 1436 NFS server receiving this request. 1438 NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 4.3.8 1439 below). 1441 NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid 1442 section below). 1444 4.3.7. Operation 15: CB_COPY - Report results of a server-side copy 1445 4.3.7.1. ARGUMENT 1447 union copy_info4 switch (nfsstat4 cca_status) { 1448 case NFS4_OK: 1449 void; 1450 default: 1451 length4 cca_bytes_copied; 1452 }; 1454 struct CB_COPY4args { 1455 nfs_fh4 cca_fh; 1456 stateid4 cca_stateid; 1457 copy_info4 cca_copy_info; 1458 }; 1460 4.3.7.2. RESULT 1462 struct CB_COPY4res { 1463 nfsstat4 ccr_status; 1464 }; 1466 4.3.7.3. DESCRIPTION 1468 CB_COPY is used for both intra- and inter-server asynchronous copies. 1469 The CB_COPY callback informs the client of the result of an 1470 asynchronous server-side copy. This operation is sent by the 1471 destination server to the client in a CB_COMPOUND request. The copy 1472 is identified by the filehandle and stateid arguments. The result is 1473 indicated by the status field. If the copy failed, cca_bytes_copied 1474 contains the number of bytes copied before the failure occurred. The 1475 cca_bytes_copied value indicates the number of bytes copied but not 1476 which specific bytes have been copied. 1478 In the absence of an established backchannel, the server cannot 1479 signal the completion of the COPY via a CB_COPY callback. The loss 1480 of a callback channel would be indicated by the server setting the 1481 SEQ4_STATUS_CB_PATH_DOWN flag in the sr_status_flags field of the 1482 SEQUENCE operation. The client must re-establish the callback 1483 channel to receive the status of the COPY operation. Prolonged loss 1484 of the callback channel could result in the server dropping the COPY 1485 operation state and invalidating the copy stateid. 1487 If the client supports the COPY operation, the client is REQUIRED to 1488 support the CB_COPY operation. 1490 The CB_COPY operation may fail for the following reasons (this is a 1491 partial list): 1493 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 1494 NFS client receiving this request. 1496 4.3.8. Copy Offload Stateids 1498 A server may perform a copy offload operation asynchronously. An 1499 asynchronous copy is tracked using a copy offload stateid. Copy 1500 offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS, 1501 and CB_COPY operations. 1503 Section 8.2.4 of [2] specifies that stateids are valid until either 1504 (A) the client or server restart or (B) the client returns the 1505 resource. 1507 A copy offload stateid will be valid until either (A) the client or 1508 server restart or (B) the client returns the resource by issuing a 1509 COPY_ABORT operation or the client replies to a CB_COPY operation. 1511 A copy offload stateid's seqid MUST NOT be 0 (zero). In the context 1512 of a copy offload operation, it is ambiguous to indicate the most 1513 recent copy offload operation using a stateid with seqid of 0 (zero). 
1514 Therefore a copy offload stateid with seqid of 0 (zero) MUST be 1515 considered invalid. 1517 4.4. Security Considerations 1519 The security considerations pertaining to NFSv4 [10] apply to this 1520 document. 1522 The standard security mechanisms provided by NFSv4 [10] may be used to 1523 secure the protocol described in this document. 1525 NFSv4 clients and servers supporting the inter-server copy 1526 operations described in this document are REQUIRED to implement [5], 1527 including the RPCSEC_GSSv3 privileges copy_from_auth and 1528 copy_to_auth. If the server-to-server copy protocol is ONC RPC 1529 based, the servers are also REQUIRED to implement the RPCSEC_GSSv3 1530 privilege copy_confirm_auth. These requirements to implement are not 1531 requirements to use. NFSv4 clients and servers are RECOMMENDED to 1532 use [5] to secure server-side copy operations. 1534 4.4.1. Inter-Server Copy Security 1536 4.4.1.1. Requirements for Secure Inter-Server Copy 1538 Inter-server copy is driven by several requirements: 1540 o The specification MUST NOT mandate an inter-server copy protocol. 1541 There are many ways to copy data. Some will be more optimal than 1542 others depending on the identities of the source server and 1543 destination server. For example the source and destination 1544 servers might be two nodes sharing a common file system format for 1545 the source and destination file systems. Thus the source and 1546 destination are in an ideal position to efficiently render the 1547 image of the source file to the destination file by replicating 1548 the file system formats at the block level. In other cases, the 1549 source and destination might be two nodes sharing a common storage 1550 area network, and thus there is no need to copy any data at all, 1551 and instead ownership of the file and its contents simply gets re- 1552 assigned to the destination. 1554 o The specification MUST provide guidance for using NFSv4.x as a 1555 copy protocol. For those source and destination servers willing 1556 to use NFSv4.x there are specific security considerations that 1557 this specification can and does address. 1559 o The specification MUST NOT mandate pre-configuration between the 1560 source and destination server. Requiring that the source and 1561 destination first have a "copying relationship" increases the 1562 administrative burden. However the specification MUST NOT 1563 preclude implementations that require pre-configuration. 1565 o The specification MUST NOT mandate a trust relationship between 1566 the source and destination server. The NFSv4 security model 1567 requires mutual authentication between a principal on an NFS 1568 client and a principal on an NFS server. This model MUST continue 1569 with the introduction of COPY. 1571 4.4.1.2. Inter-Server Copy with RPCSEC_GSSv3 1573 When the client sends a COPY_NOTIFY to the source server, telling it to expect 1574 the destination to attempt to copy data from the source server, it is 1575 expected that this copy is being done on behalf of the principal 1576 (called the "user principal") that sent the RPC request that encloses 1577 the COMPOUND procedure that contains the COPY_NOTIFY operation. The 1578 user principal is identified by the RPC credentials. A mechanism 1579 is necessary that allows the user principal to authorize the destination server to 1580 perform the copy, in a manner that lets the source server properly 1581 authenticate the destination's copy without allowing the 1582 destination to exceed its authorization.
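As background for the privilege definitions that follow, the fragment below is a minimal sketch, in C and assuming a POSIX environment, of the first step a client implementation might take: the user principal generates a per-copy shared secret that is later carried in the cfap_shared_secret and ctap_shared_secret fields defined below.  The function name, the 32-byte length, and the use of /dev/urandom are illustrative choices, not requirements of this specification.

   #include <stdio.h>

   #define SHARED_SECRET_LEN 32      /* illustrative size, in bytes */

   /* Fill 'secret' with SHARED_SECRET_LEN random bytes; 0 on success. */
   static int
   generate_shared_secret(unsigned char secret[SHARED_SECRET_LEN])
   {
           FILE *f = fopen("/dev/urandom", "rb");
           size_t n;

           if (f == NULL)
                   return -1;
           n = fread(secret, 1, SHARED_SECRET_LEN, f);
           fclose(f);
           return (n == SHARED_SECRET_LEN) ? 0 : -1;
   }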
1584 An approach that sends delegated credentials of the client's user 1585 principal to the destination server is not used for the following 1586 reasons. If the client's user delegated its credentials, the 1587 destination would authenticate as the user principal. If the 1588 destination were using the NFSv4 protocol to perform the copy, then 1589 the source server would authenticate the destination server as the 1590 user principal, and the file copy would securely proceed. However, 1591 this approach would allow the destination server to copy other files. 1592 The user principal would have to trust the destination server to not 1593 do so. This is counter to the requirements, and therefore is not 1594 considered. Instead, an approach using RPCSEC_GSSv3 [5] privileges is 1595 proposed. 1597 One of the stated applications of the proposed RPCSEC_GSSv3 protocol 1598 is compound client host and user authentication [+ privilege 1599 assertion]. For inter-server file copy, we require compound NFS 1600 server host and user authentication [+ privilege assertion]. The 1601 distinction between the two is one without meaning. 1603 RPCSEC_GSSv3 introduces the notion of privileges. We define three 1604 privileges: 1606 copy_from_auth: A user principal is authorizing a source principal 1607 ("nfs@<source>") to allow a destination principal ("nfs@ 1608 <destination>") to copy a file from the source to the destination. 1609 This privilege is established on the source server before the user 1610 principal sends a COPY_NOTIFY operation to the source server. 1612 struct copy_from_auth_priv { 1613 secret4 cfap_shared_secret; 1614 netloc4 cfap_destination; 1615 /* the NFSv4 user name that the user principal maps to */ 1616 utf8str_mixed cfap_username; 1617 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 1618 unsigned int cfap_seq_num; 1619 }; 1621 cfap_shared_secret is a secret value the user principal generates. 1623 copy_to_auth: A user principal is authorizing a destination 1624 principal ("nfs@<destination>") to allow it to copy a file from 1625 the source to the destination. This privilege is established on 1626 the destination server before the user principal sends a COPY 1627 operation to the destination server. 1629 struct copy_to_auth_priv { 1630 /* equal to cfap_shared_secret */ 1631 secret4 ctap_shared_secret; 1632 netloc4 ctap_source; 1633 /* the NFSv4 user name that the user principal maps to */ 1634 utf8str_mixed ctap_username; 1635 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 1636 unsigned int ctap_seq_num; 1637 }; 1639 ctap_shared_secret is the secret value the user principal generated, 1640 which was used to establish the copy_from_auth privilege with the 1641 source principal. 1643 copy_confirm_auth: A destination principal is confirming with the 1644 source principal that it is authorized to copy data from the 1645 source on behalf of the user principal. When the inter-server 1646 copy protocol is NFSv4, or for that matter, any protocol capable 1647 of being secured via RPCSEC_GSSv3 (i.e. any ONC RPC protocol), 1648 this privilege is established before the file is copied from the 1649 source to the destination. 1651 struct copy_confirm_auth_priv { 1652 /* equal to GSS_GetMIC() of cfap_shared_secret */ 1653 opaque ccap_shared_secret_mic<>; 1654 /* the NFSv4 user name that the user principal maps to */ 1655 utf8str_mixed ccap_username; 1656 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 1657 unsigned int ccap_seq_num; 1658 }; 1660 4.4.1.2.1.
Establishing a Security Context 1662 When the user principal wants to COPY a file between two servers, if 1663 it has not established copy_from_auth and copy_to_auth privileges on 1664 the servers, it establishes them: 1666 o The user principal generates a secret it will share with the two 1667 servers. This shared secret will be placed in the 1668 cfap_shared_secret and ctap_shared_secret fields of the 1669 appropriate privilege data types, copy_from_auth_priv and 1670 copy_to_auth_priv. 1672 o An instance of copy_from_auth_priv is filled in with the shared 1673 secret, the destination server, and the NFSv4 user id of the user 1674 principal. It will be sent with an RPCSEC_GSS3_CREATE procedure, 1675 and so cfap_seq_num is set to the seq_num of the credential of the 1676 RPCSEC_GSS3_CREATE procedure. Because cfap_shared_secret is a 1677 secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with 1678 privacy) is invoked on copy_from_auth_priv. The 1679 RPCSEC_GSS3_CREATE procedure's arguments are: 1681 struct { 1682 rpc_gss3_gss_binding *compound_binding; 1683 rpc_gss3_chan_binding *chan_binding_mic; 1684 rpc_gss3_assertion assertions<>; 1685 rpc_gss3_extension extensions<>; 1686 } rpc_gss3_create_args; 1688 The string "copy_from_auth" is placed in assertions[0].privs. The 1689 output of GSS_Wrap() is placed in extensions[0].data. The field 1690 extensions[0].critical is set to TRUE. The source server calls 1691 GSS_Unwrap() on the privilege, and verifies that the seq_num 1692 matches the credential. It then verifies that the NFSv4 user id 1693 being asserted matches the source server's mapping of the user 1694 principal. If it does, the privilege is established on the source 1695 server as: <"copy_from_auth", user id, destination>. The 1696 successful reply to RPCSEC_GSS3_CREATE has: 1698 struct { 1699 opaque handle<>; 1700 rpc_gss3_chan_binding *chan_binding_mic; 1701 rpc_gss3_assertion granted_assertions<>; 1702 rpc_gss3_assertion server_assertions<>; 1703 rpc_gss3_extension extensions<>; 1704 } rpc_gss3_create_res; 1706 The field "handle" is the RPCSEC_GSSv3 handle that the client will 1707 use on COPY_NOTIFY requests involving the source and destination 1708 server. granted_assertions[0].privs will be equal to 1709 "copy_from_auth". The server will return a GSS_Wrap() of 1710 copy_to_auth_priv. 1712 o An instance of copy_to_auth_priv is filled in with the shared 1713 secret, the source server, and the NFSv4 user id. It will be sent 1714 with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set 1715 to the seq_num of the credential of the RPCSEC_GSS3_CREATE 1716 procedure. Because ctap_shared_secret is a secret, after XDR 1717 encoding copy_to_auth_priv, GSS_Wrap() is invoked on 1718 copy_to_auth_priv. The RPCSEC_GSS3_CREATE procedure's arguments 1719 are: 1721 struct { 1722 rpc_gss3_gss_binding *compound_binding; 1723 rpc_gss3_chan_binding *chan_binding_mic; 1724 rpc_gss3_assertion assertions<>; 1725 rpc_gss3_extension extensions<>; 1726 } rpc_gss3_create_args; 1728 The string "copy_to_auth" is placed in assertions[0].privs. The 1729 output of GSS_Wrap() is placed in extensions[0].data. The field 1730 extensions[0].critical is set to TRUE. After unwrapping, 1731 verifying the seq_num, and the user principal to NFSv4 user ID 1732 mapping, the destination establishes a privilege of 1733 <"copy_to_auth", user id, source>. 
The successful reply to 1734 RPCSEC_GSS3_CREATE has: 1736 struct { 1737 opaque handle<>; 1738 rpc_gss3_chan_binding *chan_binding_mic; 1739 rpc_gss3_assertion granted_assertions<>; 1740 rpc_gss3_assertion server_assertions<>; 1741 rpc_gss3_extension extensions<>; 1742 } rpc_gss3_create_res; 1744 The field "handle" is the RPCSEC_GSSv3 handle that the client will 1745 use on COPY requests involving the source and destination server. 1746 The field granted_assertions[0].privs will be equal to 1747 "copy_to_auth". The server will return a GSS_Wrap() of 1748 copy_to_auth_priv. 1750 4.4.1.2.2. Starting a Secure Inter-Server Copy 1752 When the client sends a COPY_NOTIFY request to the source server, it 1753 uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle. 1754 cna_destination_server in COPY_NOTIFY MUST be the same as the name of 1755 the destination server specified in copy_from_auth_priv. Otherwise, 1756 COPY_NOTIFY will fail with NFS4ERR_ACCESS. The source server 1757 verifies that the privilege <"copy_from_auth", user id, destination> 1758 exists, and annotates it with the source filehandle, if the user 1759 principal has read access to the source file, and if administrative 1760 policies give the user principal and the NFS client read access to 1761 the source file (i.e. if the ACCESS operation would grant read 1762 access). Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS. 1764 When the client sends a COPY request to the destination server, it 1765 uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle. 1766 ca_source_server in COPY MUST be the same as the name of the source 1767 server specified in copy_to_auth_priv. Otherwise, COPY will fail 1768 with NFS4ERR_ACCESS. The destination server verifies that the 1769 privilege <"copy_to_auth", user id, source> exists, and annotates it 1770 with the source and destination filehandles. If the client has 1771 failed to establish the "copy_to_auth" policy, the destination server will reject the 1772 request with NFS4ERR_PARTNER_NO_AUTH. 1774 If the client sends a COPY_REVOKE to the source server to rescind the 1775 destination server's copy privilege, it uses the privileged 1776 "copy_from_auth" RPCSEC_GSSv3 handle and the cra_destination_server 1777 in COPY_REVOKE MUST be the same as the name of the destination server 1778 specified in copy_from_auth_priv. The source server will then delete 1779 the <"copy_from_auth", user id, destination> privilege and fail any 1780 subsequent copy requests sent under the auspices of this privilege 1781 from the destination server. 1783 4.4.1.2.3. Securing ONC RPC Server-to-Server Copy Protocols 1785 After a destination server has a "copy_to_auth" privilege established 1786 on it, and it receives a COPY request, if it knows it will use an ONC 1787 RPC protocol to copy data, it will establish a "copy_confirm_auth" 1788 privilege on the source server, using nfs@<destination> as the 1789 initiator principal, and nfs@<source> as the target principal. 1791 The value of the field ccap_shared_secret_mic is a GSS_GetMIC() MIC of 1792 the shared secret passed in the copy_to_auth privilege. The field 1793 ccap_username is the mapping of the user principal to an NFSv4 user 1794 name ("user"@"domain" form), and MUST be the same as ctap_username 1795 and cfap_username. The field ccap_seq_num is the seq_num of the 1796 RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure the 1797 destination will send to the source server to establish the 1798 privilege.
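The following fragment sketches, in C against the GSS-API C bindings, how a destination server might produce the MIC that populates ccap_shared_secret_mic, assuming a security context has already been established between nfs@<destination> (initiator) and nfs@<source> (acceptor).  The helper name and the reduced error handling are illustrative only.

   #include <stddef.h>
   #include <gssapi/gssapi.h>

   /* Compute a MIC over the shared secret from the copy_to_auth
    * privilege; on success *mic holds the ccap_shared_secret_mic
    * value and should later be released with gss_release_buffer(). */
   static int
   mic_of_shared_secret(gss_ctx_id_t ctx, const void *secret,
                        size_t secret_len, gss_buffer_desc *mic)
   {
           OM_uint32 major, minor;
           gss_buffer_desc msg;

           msg.value  = (void *)secret;
           msg.length = secret_len;

           major = gss_get_mic(&minor, ctx, GSS_C_QOP_DEFAULT, &msg, mic);
           return (major == GSS_S_COMPLETE) ? 0 : -1;
   }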
1800 The source server verifies the privilege, and establishes a 1801 <"copy_confirm_auth", user id, destination> privilege. If the source 1802 server fails to verify the privilege, the COPY operation will be 1803 rejected with NFS4ERR_PARTNER_NO_AUTH. All subsequent ONC RPC 1804 requests sent from the destination to copy data from the source to 1805 the destination will use the RPCSEC_GSSv3 handle returned by the 1806 source's RPCSEC_GSS3_CREATE response. 1808 Note that the use of the "copy_confirm_auth" privilege accomplishes 1809 the following: 1811 o if a protocol like NFS is being used with export policies, the export 1812 policies can be overridden in case the destination server as an 1813 NFS client is not authorized 1815 o manual configuration to allow a copy relationship between the 1816 source and destination is not needed. 1818 If the attempt to establish a "copy_confirm_auth" privilege fails, 1819 then when the user principal sends a COPY request to the destination, the 1820 destination server will reject it with NFS4ERR_PARTNER_NO_AUTH. 1822 4.4.1.2.4. Securing Non ONC RPC Server-to-Server Copy Protocols 1824 If the destination will not be using ONC RPC to copy the data, then the 1825 source and destination are using an unspecified copy protocol. The 1826 destination could use the shared secret and the NFSv4 user id to 1827 prove to the source server that the user principal has authorized the 1828 copy. 1830 For protocols that authenticate user names with passwords (e.g. HTTP 1831 [14] and FTP [15]), the NFSv4 user id could be used as the user name, 1832 and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared 1833 secret could be used as the user password or as input into non- 1834 password authentication methods like CHAP [16]. 1836 4.4.1.3. Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3 1838 ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the 1839 server-side copy offload operations described in this document. In 1840 particular, host-based ONC RPC security flavors such as AUTH_NONE and 1841 AUTH_SYS MAY be used. If a host-based security flavor is used, a 1842 minimal level of protection for the server-to-server copy protocol is 1843 possible. 1845 In the absence of strong security mechanisms such as RPCSEC_GSSv3, 1846 the challenge is how the source server and destination server 1847 identify themselves to each other, especially in the presence of 1848 multi-homed source and destination servers. In a multi-homed 1849 environment, the destination server might not contact the source 1850 server from the same network address specified by the client in the 1851 COPY_NOTIFY. This can be overcome using the procedure described 1852 below. 1854 When the client sends the source server the COPY_NOTIFY operation, 1855 the source server may reply to the client with a list of target 1856 addresses, names, and/or URLs and assign them to the unique triple: 1857 <source fh, user ID, destination address>. If the destination uses 1858 one of these target netlocs to contact the source server, the source 1859 server will be able to uniquely identify the destination server, even 1860 if the destination server does not connect from the address specified 1861 by the client in COPY_NOTIFY. 1863 For example, suppose the network topology is as shown in Figure 3.
1864 If the source filehandle is 0x12345, the source server may respond to 1865 a COPY_NOTIFY for destination 10.11.78.56 with the URLs: 1867 nfs://10.11.78.18//_COPY/10.11.78.56/_FH/0x12345 1869 nfs://192.168.33.18//_COPY/10.11.78.56/_FH/0x12345 1871 The client will then send these URLs to the destination server in the 1872 COPY operation. Suppose that the 192.168.33.0/24 network is a high 1873 speed network and the destination server decides to transfer the file 1874 over this network. If the destination contacts the source server 1875 from 192.168.33.56 over this network using NFSv4.1, it does the 1876 following: 1878 COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP "10.11.78.56"; LOOKUP 1879 "_FH" ; OPEN "0x12345" ; GETFH } 1881 The source server will therefore know that these NFSv4.1 operations 1882 are being issued by the destination server identified in the 1883 COPY_NOTIFY. 1885 4.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 1887 The same techniques as in Section 4.4.1.3, using unique URLs for each 1888 destination server, can be used for other protocols (e.g. HTTP [14] 1889 and FTP [15]) as well. 1891 4.5. IANA Considerations 1893 This section has no actions for IANA. 1895 5. Space Reservation 1897 5.1. Introduction 1899 This section describes a set of operations that allow applications 1900 such as hypervisors to reserve space for a file, report the amount of 1901 actual disk space a file occupies, and free up the backing space of a 1902 file when it is not required. 1904 In virtualized environments, virtual disk files are often stored on 1905 NFS mounted volumes. Since virtual disk files represent the hard 1906 disks of virtual machines, hypervisors often have to guarantee 1907 certain properties for the file. 1909 One such example is space reservation. When a hypervisor creates a 1910 virtual disk file, it often tries to preallocate the space for the 1911 file so that there are no future allocation related errors during the 1912 operation of the virtual machine. Such errors prevent a virtual 1913 machine from continuing execution and result in downtime. 1915 Another useful feature would be the ability to report the number of 1916 blocks that would be freed when a file is deleted. Currently, NFS 1917 reports two size attributes: 1919 size The logical file size of the file. 1921 space_used The size in bytes that the file occupies on disk. 1923 While these attributes are sufficient for space accounting in 1924 traditional filesystems, they prove to be inadequate in modern 1925 filesystems that support block sharing. Having a way to tell the 1926 number of blocks that would be freed if the file was deleted would be 1927 useful to applications that wish to migrate files when a volume is 1928 low on space. 1930 Since virtual disks represent a hard drive in a virtual machine, a 1931 virtual disk can be viewed as a filesystem within a file. Since not 1932 all blocks within a filesystem are in use, there is an opportunity to 1933 reclaim blocks that are no longer in use. A call to deallocate 1934 blocks could result in better space efficiency. Less space MAY be 1935 consumed for backups after block deallocation. 1937 We propose the following operations and attributes for the 1938 aforementioned use cases: 1940 space_reserve This attribute specifies whether the blocks backing 1941 the file have been preallocated. 1943 space_freed This attribute specifies the space freed when a file is 1944 deleted, taking block sharing into consideration.
1946 max_hole_punch This attribute specifies the maximum sized hole that 1947 can be punched on the filesystem. 1949 HOLE_PUNCH This operation zeroes and/or deallocates the blocks 1950 backing a region of the file. 1952 5.2. Use Cases 1954 5.2.1. Space Reservation 1956 Some applications require that once a file of a certain size is 1957 created, writes to that file never fail with an out of space 1958 condition. One such example is that of a hypervisor writing to a 1959 virtual disk. An out of space condition while writing to virtual 1960 disks would mean that the virtual machine would need to be frozen. 1962 Currently, in order to achieve such a guarantee, applications zero 1963 the entire file. The initial zeroing allocates the backing blocks 1964 and all subsequent writes are overwrites of already allocated blocks. 1965 This approach is not only inefficient in terms of the amount of I/O 1966 done, but it is also not guaranteed to work on filesystems that are log 1967 structured or deduplicated. An efficient way of guaranteeing space 1968 reservation would be beneficial to such applications. 1970 If the space_reserved attribute is set on a file, it is guaranteed 1971 that writes that do not grow the file will not fail with 1972 NFS4ERR_NOSPC. 1974 5.2.2. Space freed on deletes 1976 Currently, files in NFS have two size attributes: 1978 size The logical file size of the file. 1980 space_used The size in bytes that the file occupies on disk. 1982 While these attributes are sufficient for space accounting in 1983 traditional filesystems, they prove to be inadequate in modern 1984 filesystems that support block sharing. In such filesystems, 1985 multiple inodes can point to a single block with a block reference 1986 count to guard against premature freeing. 1988 If space_used of a file is interpreted to mean the size in bytes of 1989 all disk blocks pointed to by the inode of the file, then shared 1990 blocks get double counted, over-reporting the space utilization. 1991 This also has the adverse effect that the deletion of a file with 1992 shared blocks frees up less than space_used bytes. 1994 On the other hand, if space_used is interpreted to mean the size in 1995 bytes of those disk blocks unique to the inode of the file, then 1996 shared blocks are not counted in any file, resulting in under- 1997 reporting of the space utilization. 1999 For example, two files A and B have 10 blocks each. Let 6 of these 2000 blocks be shared between them. Thus, the combined space utilized by 2001 the two files is 14 * BLOCK_SIZE bytes. In the former case, the 2002 combined space utilization of the two files would be reported as 20 * 2003 BLOCK_SIZE. However, deleting either would only result in 4 * 2004 BLOCK_SIZE being freed. Conversely, the latter interpretation would 2005 report that the space utilization is only 8 * BLOCK_SIZE. 2007 Adding another size attribute, space_freed, is helpful in solving 2008 this problem. space_freed is the number of bytes allocated 2009 to the given file that would be freed on its deletion. In the 2010 example, both A and B would report space_freed as 4 * BLOCK_SIZE and 2011 space_used as 10 * BLOCK_SIZE. If A is deleted, B will report 2012 space_freed as 10 * BLOCK_SIZE as the deletion of B would result in 2013 the deallocation of all 10 blocks. 2015 The addition of this attribute does not solve the problem of space being 2016 over-reported. However, over-reporting is better than under- 2017 reporting. 2019 5.2.3.
Operations and attributes 2021 In the sections that follow, one operation and three attributes are 2022 defined that together provide the space management facilities 2023 outlined earlier in the document. The operation is intended to be 2024 OPTIONAL and the attributes RECOMMENDED as defined in section 17 of 2025 [2]. 2027 5.2.4. Attribute 77: space_reserve 2029 The space_reserve attribute is a read/write attribute of type 2030 boolean. It is a per file attribute. When the space_reserved 2031 attribute is set via SETATTR, the server must ensure that there is 2032 disk space to accommodate every byte in the file before it can return 2033 success. If the server cannot guarantee this, it must return 2034 NFS4ERR_NOSPC. 2036 If the client tries to grow a file which has the space_reserved 2037 attribute set, the server must guarantee that there is disk space to 2038 accommodate every byte in the file with the new size before it can 2039 return success. If the server cannot guarantee this, it must return 2040 NFS4ERR_NOSPC. 2042 It is not required that the server allocate the space to the file 2043 before returning success. The allocation can be deferred; however, 2044 it must be guaranteed that it will not fail for lack of space. 2046 The value of space_reserved can be obtained at any time through 2047 GETATTR. 2049 In order to avoid ambiguity, the space_reserve bit cannot be set 2050 along with the size bit in SETATTR. Increasing the size of a file 2051 with space_reserve set will fail if space reservation cannot be 2052 guaranteed for the new size. If the file size is decreased, space 2053 reservation is only guaranteed for the new size and the extra blocks 2054 backing the file can be released. 2056 5.2.5. Attribute 78: space_freed 2058 space_freed gives the number of bytes freed if the file is deleted. 2059 This attribute is read only and is of type length4. It is a per file 2060 attribute. 2062 5.2.6. Attribute 79: max_hole_punch 2064 max_hole_punch specifies the maximum size of a hole that the 2065 HOLE_PUNCH operation can handle. This attribute is read only and of 2066 type length4. It is a per filesystem attribute. This attribute MUST 2067 be implemented if HOLE_PUNCH is implemented. 2069 5.2.7. Operation 64: HOLE_PUNCH - Zero and deallocate blocks backing 2070 the file in the specified range. 2072 5.2.7.1. ARGUMENT 2074 struct HOLE_PUNCH4args { 2075 /* CURRENT_FH: file */ 2076 offset4 hpa_offset; 2077 length4 hpa_count; 2078 }; 2080 5.2.7.2. RESULT 2082 struct HOLE_PUNCH4res { 2083 nfsstat4 hpr_status; 2084 }; 2086 5.2.7.3. DESCRIPTION 2088 Whenever a client wishes to deallocate the blocks backing a 2089 particular region in the file, it calls the HOLE_PUNCH operation with 2090 the current filehandle set to the filehandle of the file in question, 2091 and the start offset and length in bytes of the region set in hpa_offset and 2092 hpa_count, respectively. All further reads to this region MUST return 2093 zeros until overwritten. The filehandle specified must be that of a 2094 regular file. 2096 Situations may arise where hpa_offset and/or hpa_offset + hpa_count 2097 will not be aligned to a boundary that the server does allocations/ 2098 deallocations in. For most filesystems, this is the block size of 2099 the file system. In such a case, the server can deallocate as many 2100 bytes as it can in the region. The blocks that cannot be deallocated 2101 MUST be zeroed.
Except for the block deallocation and maximum hole 2102 punching capability, a HOLE_PUNCH operation is to be treated similarly 2103 to a write of zeroes. 2105 The server is not required to complete deallocating the blocks 2106 specified in the operation before returning. It is acceptable to 2107 have the deallocation be deferred. In fact, HOLE_PUNCH is merely a 2108 hint; it is valid for a server to return success without ever doing 2109 anything towards deallocating the blocks backing the region 2110 specified. However, any future reads to the region MUST return 2111 zeroes. 2113 HOLE_PUNCH will result in the space_used attribute being decreased by 2114 the number of bytes that were deallocated. The space_freed attribute 2115 may or may not decrease, depending on the support and whether the 2116 blocks backing the specified range were shared or not. The size 2117 attribute will remain unchanged. 2119 The HOLE_PUNCH operation MUST NOT change the space reservation 2120 guarantee of the file. While the server can deallocate the blocks 2121 specified by hpa_offset and hpa_count, future writes to this region 2122 MUST NOT fail with NFS4ERR_NOSPC. 2124 The HOLE_PUNCH operation may fail for the following reasons (this is 2125 a partial list): 2127 NFS4ERR_NOTSUPP The hole punch operation is not supported by the 2128 NFS server receiving this request. 2130 NFS4ERR_ISDIR The current filehandle is of type NF4DIR. 2132 NFS4ERR_SYMLINK The current filehandle is of type NF4LNK. 2134 NFS4ERR_WRONG_TYPE The current filehandle does not designate an 2135 ordinary file. 2137 5.3. Security Considerations 2139 There are no security considerations for this section. 2141 5.4. IANA Considerations 2143 This section has no actions for IANA. 2145 6. Simple and Efficient Read Support for Sparse Files 2147 6.1. Introduction 2149 NFS is now used in many data centers as the sole or primary method of 2150 data access. Consequently, more types of applications are using NFS 2151 than ever before, each with their own requirements and generated 2152 workloads. As part of this, sparse files are increasing in number 2153 while NFS continues to lack any specific knowledge of a sparse file's 2154 layout. This document puts forth a proposal for the NFSv4.2 protocol 2155 to support efficient reading of sparse files. 2157 A sparse file is a common way of representing a large file without 2158 having to reserve disk space for it. Consequently, a sparse file 2159 uses less physical space than its size indicates. This means the 2160 file contains 'holes', byte ranges within the file that contain no 2161 data. Most modern file systems support sparse files, including most 2162 UNIX file systems and NTFS, but notably not Apple's HFS+. Common 2163 examples of sparse files include VM OS/disk images, database files, 2164 log files, and even checkpoint recovery files most commonly used by 2165 the HPC community. 2167 If an application reads a hole in a sparse file, the file system must 2168 return all zeros to the application. For local data access there is 2169 little penalty, but with NFS these zeroes must be transferred back to 2170 the client. If an application uses the NFS client to read data into 2171 memory, this wastes time and bandwidth as the application waits for 2172 the zeroes to be transferred. Once the zeroes arrive, they then 2173 steal memory or cache space from real data.
To make matters worse, 2174 if an application then proceeds to write data to another file system, 2175 the zeros are written into the file, expanding the sparse file into a 2176 full sized regular file. Beyond wasting disk space, this can 2177 actually prevent large sparse files from ever being copied to another 2178 storage location due to space limitations. 2180 This document adds a new READPLUS operation to efficiently read from 2181 sparse files by avoiding the transfer of all zero regions from the 2182 server to the client. READPLUS supports all the features of READ but 2183 includes a minimal extension to support sparse files. In addition, 2184 the return value of READPLUS is now compatible with NFSv4.1 minor 2185 versioning rules and could support other future extensions without 2186 requiring yet another operation. READPLUS is guaranteed to perform 2187 no worse than READ, and can dramatically improve performance with 2188 sparse files. READPLUS does not depend on pNFS protocol features, 2189 but can be used by pNFS to support sparse files. 2191 6.2. Terminology 2193 Regular file: An object of file type NF4REG or NF4NAMEDATTR. 2195 Sparse file: A Regular file that contains one or more Holes. 2197 Hole: A byte range within a Sparse file that contains regions of all 2198 zeroes. For block-based file systems, this could also be an 2199 unallocated region of the file. 2201 Hole Threshold: The minimum length of a Hole as determined by the 2202 server. If a server chooses to define a Hole Threshold, then it 2203 would not return hole information (nfs_readplusreshole) with a 2204 hole_offset and hole_length that specify a range shorter than the 2205 Hole Threshold. 2207 6.3. Applications and Sparse Files 2209 Applications may cause an NFS client to read holes in a file for 2210 several reasons. This section describes three different application 2211 workloads that cause the NFS client to transfer data unnecessarily. 2212 These workloads are simply examples, and there are probably many more 2213 workloads that are negatively impacted by sparse files. 2215 The first workload that can cause holes to be read is sequential 2216 reads within a sparse file. When this happens, the NFS client may 2217 perform read requests ("readahead") into sections of the file not 2218 explicitly requested by the application. Since the NFS client cannot 2219 differentiate between holes and non-holes, the NFS client may 2220 prefetch empty sections of the file. 2222 This workload is exemplified by Virtual Machines and their associated 2223 file system images, e.g., VMware .vmdk files, which are large sparse 2224 files encapsulating an entire operating system. If a VM reads files 2225 within the file system image, this will translate to sequential NFS 2226 read requests into the much larger file system image file. Since NFS 2227 does not understand the internals of the file system image, it ends 2228 up performing readahead into file holes. 2230 The second workload is generated by copying a file from a directory 2231 in NFS to either the same NFS server, another file system, e.g., 2232 another NFS or Samba server, a local ext3 file system, or even a 2233 network socket. In this case, bandwidth and server resources are 2234 wasted as the entire file is transferred from the NFS server to the 2235 NFS client. Once a byte range of the file has been transferred to 2236 the client, it is up to the client application, e.g., rsync, cp, scp, 2237 on how it writes the data to the target location.
For example, cp 2238 supports sparse files and will not write all zero regions, whereas 2239 scp does not support sparse files and will transfer every byte of the 2240 file. 2242 The third workload is generated by applications that do not utilize 2243 the NFS client cache, but instead use direct I/O and manage cached 2244 data independently, e.g., databases. These applications may perform 2245 whole file caching with sparse files, which would mean that even the 2246 holes will be transferred to the clients and cached. 2248 6.4. Overview of Sparse Files and NFSv4 2250 This proposal seeks to provide sparse file support to the largest 2251 number of NFS client and server implementations, and as such proposes 2252 to add a new return code to the mandatory NFSv4.1 READPLUS operation 2253 instead of proposing additions or extensions of new or existing 2254 optional features (such as pNFS). 2256 As well, this document seeks to ensure that the proposed extensions 2257 are simple and do not transfer data between the client and server 2258 unnecessarily. For example, one possible way to implement sparse 2259 file read support would be to have the client, on the first hole 2260 encountered or at OPEN time, request a Data Region Map from the 2261 server. A Data Region Map would specify all zero and non-zero 2262 regions in a file. While this option seems simple, it is less useful 2263 and can become inefficient and cumbersome for several reasons: 2265 o Data Region Maps can be large, and transferring them can reduce 2266 overall read performance. For example, VMware's .vmdk files can 2267 have a file size of over 100 GBs and have a map well over several 2268 MBs. 2270 o Data Region Maps can change frequently, and become invalidated on 2271 every write to the file. NFSv4 has a single change attribute, 2272 which means any change to any region of a file will invalidate all 2273 Data Region Maps. This can result in the map being transferred 2274 multiple times with each update to the file. For example, a VM 2275 that updates a config file in its file system image would 2276 invalidate the Data Region Map not only for itself, but for all 2277 other clients accessing the same file system image. 2279 o Data Region Maps do not handle all zero-filled sections of the 2280 file, reducing the effectiveness of the solution. While it may be 2281 possible to modify the maps to handle zero-filled sections (at 2282 possibly great effort to the server), it is almost impossible with 2283 pNFS. With pNFS, the owner of the Data Region Map is the metadata 2284 server, which is not in the data path and has no knowledge of the 2285 contents of a data region. 2287 Another way to handle holes is compression, but this is not ideal since 2288 it requires all implementations to agree on a single compression 2289 algorithm and requires a fair amount of computational overhead. 2291 Note that supporting writing to a sparse file does not require 2292 changes to the protocol. Applications and/or NFS implementations can 2293 choose to ignore WRITE requests of all zeroes to the NFS server 2294 without consequence. 2296 6.5. Operation 65: READPLUS 2298 This section introduces a new read operation, named READPLUS, which 2299 allows NFS clients to avoid reading holes in a sparse file. READPLUS 2300 is guaranteed to perform no worse than READ, and can dramatically 2301 improve performance with sparse files.
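To make the intended client-side behavior concrete, the following hypothetical C sketch shows one way a client might consume the READPLUS results described in the remainder of this section, materializing holes as zeroes when filling a buffer.  The wrapper nfs_readplus() and the rp_result type are stand-ins for whatever a client implementation provides; only the handling of READ_OK versus READ_HOLE follows the semantics defined below.

   #include <stdint.h>
   #include <string.h>

   enum rp_type { RP_READ_OK, RP_READ_HOLE };

   struct rp_result {
           enum rp_type type;
           int          eof;
           const void  *data;        /* RP_READ_OK payload                   */
           uint64_t     data_len;
           uint64_t     hole_offset; /* HOLE_INFO; both zero on HOLE_NOINFO  */
           uint64_t     hole_length;
   };

   /* Assumed wrapper issuing a single READPLUS; returns 0 on NFS4_OK. */
   int nfs_readplus(void *fh, uint64_t off, uint64_t cnt,
                    struct rp_result *res);

   /* Read [offset, offset + count) into buf, zero-filling holes. */
   static int
   read_sparse(void *fh, uint64_t offset, uint64_t count, uint8_t *buf)
   {
           uint64_t done = 0;

           while (done < count) {
                   struct rp_result r;
                   uint64_t zlen;

                   if (nfs_readplus(fh, offset + done, count - done, &r) != 0)
                           return -1;

                   if (r.type == RP_READ_OK) {
                           if (r.data_len == 0 && !r.eof)
                                   return -1;          /* defensive: no progress */
                           memcpy(buf + done, r.data, r.data_len);
                           done += r.data_len;         /* short reads are legal  */
                   } else {
                           /* READ_HOLE: the requested range lies in a hole;
                            * hole info, when present, is assumed to cover the
                            * start of the request. */
                           zlen = count - done;
                           if (r.hole_length != 0) {
                                   uint64_t end = r.hole_offset + r.hole_length;
                                   if (end < offset + done + zlen)
                                           zlen = end - (offset + done);
                           }
                           if (zlen == 0)
                                   return -1;          /* defensive: no progress */
                           memset(buf + done, 0, zlen);
                           done += zlen;
                   }
                   if (r.eof)
                           break;
           }
           return 0;
   }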
2303 READPLUS supports all the features of the existing NFSv4.1 READ 2304 operation [2] and adds a simple yet significant extension to the 2305 format of its response. The change allows the client to avoid 2306 reading all zeroes from a file hole, which wastes computational and 2307 network resources and reduces performance. READPLUS uses a new 2308 result structure that tells the client that the result is all zeroes 2309 AND the byte-range of the hole in which the request was made. 2310 Returning the hole's byte-range, and only upon request, avoids 2311 transferring large Data Region Maps that may be soon invalidated and 2312 contain information about a file that may not even be read in its 2313 entirety. 2315 A new read operation is required due to NFSv4.1 minor versioning 2316 rules that do not allow modification of an existing operation's 2317 arguments or results. READPLUS is designed in such a way as to allow 2318 future extensions to the result structure. The same approach could 2319 be taken to extend the argument structure, but a good use case is 2320 first required to make such a change. 2322 6.5.1. ARGUMENT 2324 struct READPLUS4args { 2325 /* CURRENT_FH: file */ 2326 stateid4 rpa_stateid; 2327 offset4 rpa_offset; 2328 count4 rpa_count; 2329 }; 2331 6.5.2. RESULT 2333 struct readplus_hole_info { 2334 offset4 rphi_offset; 2335 length4 rphi_length; 2336 }; 2338 enum holeres4 { 2339 HOLE_NOINFO = 0, 2340 HOLE_INFO = 1 2341 }; 2343 union readplus_hole switch (holeres4 resop) { 2344 case HOLE_INFO: 2345 readplus_hole_info rph_info; 2346 case HOLE_NOINFO: 2347 void; 2348 }; 2350 enum readplusrestype4 { 2351 READ_OK = 0, 2352 READ_HOLE = 1 2353 }; 2355 union readplus_data switch (readplusrestype4 resop) { 2356 case READ_OK: 2357 opaque rpd_data<>; 2358 case READ_HOLE: 2359 readplus_hole rpd_hole4; 2360 }; 2362 struct READPLUS4resok { 2363 bool rpr_eof; 2364 readplus_data rpr_data; 2365 }; 2367 union READPLUS4res switch (nfsstat4 status) { 2368 case NFS4_OK: 2369 READPLUS4resok resok4; 2370 default: 2371 void; 2372 }; 2374 6.5.3. DESCRIPTION 2376 The READPLUS operation is based upon the NFSv4.1 READ operation [2], 2377 and similarly reads data from the regular file identified by the 2378 current filehandle. 2380 The client provides an offset of where the READPLUS is to start and a 2381 count of how many bytes are to be read. An offset of zero means to 2382 read data starting at the beginning of the file. If offset is 2383 greater than or equal to the size of the file, the status NFS4_OK is 2384 returned with nfs_readplusrestype4 set to READ_OK, data length set to 2385 zero, and eof set to TRUE. The READPLUS is subject to access 2386 permissions checking. 2388 If the client specifies a count value of zero, the READPLUS succeeds 2389 and returns zero bytes of data, again subject to access permissions 2390 checking. In all situations, the server may choose to return fewer 2391 bytes than specified by the client. The client needs to check for 2392 this condition and handle the condition appropriately. 2394 If the client specifies an offset and count value that is entirely 2395 contained within a hole of the file, the status NFS4_OK is returned 2396 with nfs_readplusresok4 set to READ_HOLE, and if information is 2397 available regarding the hole, a nfs_readplusreshole structure 2398 containing the offset and range of the entire hole. The 2399 nfs_readplusreshole structure is considered valid until the file is 2400 changed (detected via the change attribute).
The server MUST provide 2401 the same semantics for nfs_readplusreshole as if the client read the 2402 region and received zeroes; the implied hole's contents' lifetime MUST 2403 be exactly the same as that of any other read data. 2405 If the client specifies an offset and count value that begins in a 2406 non-hole of the file but extends into a hole, the server should return a 2407 short read with status NFS4_OK, nfs_readplusresok4 set to READ_OK, 2408 and data length set to the number of bytes returned. The client will 2409 then issue another READPLUS for the remaining bytes, to which the server 2410 will respond with information about the hole in the file. 2412 If the server knows that the requested byte range is within a hole of 2413 the file, but has no further information regarding the hole, it 2414 returns a nfs_readplusreshole structure with holeres4 set to 2415 HOLE_NOINFO. 2417 If hole information is available and can be returned to the client, 2418 the server returns a nfs_readplusreshole structure with the value of 2419 holeres4 set to HOLE_INFO. The values of hole_offset and hole_length 2420 define the byte-range for the current hole in the file. These values 2421 represent the information known to the server and may describe a 2422 byte-range smaller than the true size of the hole. 2424 Except when special stateids are used, the stateid value for a 2425 READPLUS request represents a value returned from a previous byte- 2426 range lock or share reservation request or the stateid associated 2427 with a delegation. The stateid identifies the associated owners if 2428 any and is used by the server to verify that the associated locks are 2429 still valid (e.g., have not been revoked). 2431 If the read ended at the end-of-file (formally, in a correctly formed 2432 READPLUS operation, if offset + count is equal to the size of the 2433 file), or the READPLUS operation extends beyond the size of the file 2434 (if offset + count is greater than the size of the file), eof is 2435 returned as TRUE; otherwise, it is FALSE. A successful READPLUS of 2436 an empty file will always return eof as TRUE. 2438 If the current filehandle is not an ordinary file, an error will be 2439 returned to the client. In the case that the current filehandle 2440 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 2441 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 2442 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 2444 For a READPLUS with a stateid value of all bits equal to zero, the 2445 server MAY allow the READPLUS to be serviced subject to mandatory 2446 byte-range locks or the current share deny modes for the file. For a 2447 READPLUS with a stateid value of all bits equal to one, the server 2448 MAY allow READPLUS operations to bypass locking checks at the server. 2450 On success, the current filehandle retains its value. 2452 6.5.4. IMPLEMENTATION 2454 If the server returns a "short read" (i.e., less data than requested 2455 and eof is set to FALSE), the client should send another READPLUS to 2456 get the remaining data. A server may return less data than requested 2457 under several circumstances. The file may have been truncated by 2458 another client or perhaps on the server itself, changing the file 2459 size from what the requesting client believes to be the case. This 2460 would reduce the actual amount of data available to the client. It 2461 is possible that the server may reduce the transfer size and so return a 2462 short read result.
2465 If mandatory byte-range locking is in effect for the file, and if
2466 the byte-range corresponding to the data to be read from the file is
2467 WRITE_LT locked by an owner not associated with the stateid, the
2468 server will return the NFS4ERR_LOCKED error.  The client should try
2469 to get the appropriate READ_LT via the LOCK operation before re-
2470 attempting the READPLUS.  When the READPLUS completes, the client
2471 should release the byte-range lock via LOCKU.  In addition, the
2472 server MUST return a nfs_readplusreshole structure with values of
2473 hole_offset and hole_length that are within the owner's locked byte
2474 range.

2476 If another client has an OPEN_DELEGATE_WRITE delegation for the file
2477 being read, the delegation must be recalled, and the operation
2478 cannot proceed until that delegation is returned or revoked.  Except
2479 where this happens very quickly, one or more NFS4ERR_DELAY errors
2480 will be returned to requests made while the delegation remains
2481 outstanding.  Normally, delegations will not be recalled as a result
2482 of a READPLUS operation since the recall will occur as a result of
2483 an earlier OPEN.  However, since it is possible for a READPLUS to be
2484 done with a special stateid, the server needs to check for this case
2485 even though the client should have done an OPEN previously.

2487 6.5.4.1.  Additional pNFS Implementation Information

2489 With pNFS, the semantics of using READPLUS remain the same.  Any
2490 data server MAY return a READ_HOLE result for a READPLUS request
2491 that it receives.

2493 When a data server chooses to return a READ_HOLE result, it has the
2494 option of returning hole information for the data stored on that
2495 data server (as defined by the data layout), but it MUST NOT return
2496 a nfs_readplusreshole structure with a byte range that includes data
2497 managed by another data server.

2499 1.  Data servers that cannot determine hole information SHOULD
2500     return HOLE_NOINFO.

2502 2.  Data servers that can obtain hole information for the parts of
2503     the file stored on that data server SHOULD return HOLE_INFO and
2504     the byte range of the hole stored on that data server.

2507 A data server should do its best to return as much information
2508 about a hole as is feasible without having to contact the metadata
2509 server.  If communication with the metadata server is required, then
2510 every attempt should be made to minimize the number of requests.

2512 If mandatory locking is enforced, then the data server must also
2513 ensure that it returns only information for a hole that is within
2514 the owner's locked byte range.
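The following non-normative sketch shows how a data server might
construct the hole information for a READ_HOLE reply while honoring
the restriction above: the locally known hole is clamped to the
byte-range stored on this data server, so it never describes data
managed by another data server.  For simplicity a single contiguous
stored extent is assumed (a striped layout would apply the same clamp
per stripe unit); ds_query_local_hole() is a hypothetical interface
to the local filesystem, and holeres4 reuses the definition from the
earlier sketch.  Under mandatory locking, the same clamp would
additionally be applied against the owner's locked byte-range.

      #include <stdint.h>

      /*
       * A byte-range; also used for the extent of the file stored on
       * this data server, as defined by the data layout.
       */
      struct ds_extent {
              uint64_t offset;
              uint64_t length;
      };

      /*
       * Hypothetical local query: returns nonzero and fills *hole if
       * the local filesystem can report the hole containing 'offset'.
       */
      extern int ds_query_local_hole(uint64_t offset,
                                     struct ds_extent *hole);

      /*
       * Build hole information for a READ_HOLE reply.  The reported
       * byte-range is clamped to the extent stored on this data server.
       */
      static void
      ds_fill_hole_result(const struct ds_extent *stored, uint64_t offset,
                          holeres4 *restype, struct ds_extent *out)
      {
              struct ds_extent hole;
              uint64_t start, end, s_end;

              if (!ds_query_local_hole(offset, &hole)) {
                      *restype = HOLE_NOINFO;   /* no local information */
                      return;
              }

              start = hole.offset;
              end   = hole.offset + hole.length;
              s_end = stored->offset + stored->length;

              if (start < stored->offset)
                      start = stored->offset;
              if (end > s_end)
                      end = s_end;

              *restype    = HOLE_INFO;
              out->offset = start;              /* becomes rphi_offset */
              out->length = end - start;        /* becomes rphi_length */
      }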
2516 6.5.5.  READPLUS with Sparse Files Example

2518 To see how the return value READ_HOLE will work, the following table
2519 describes a sparse file.  For each byte-range, the file contains
2520 either non-zero data or a hole.  In addition, the server in this
2521 example uses a hole threshold of 32K.

2523                  +-------------+----------+
2524                  | Byte-Range  | Contents |
2525                  +-------------+----------+
2526                  | 0-15999     | Hole     |
2527                  | 16K-31999   | Non-Zero |
2528                  | 32K-255999  | Hole     |
2529                  | 256K-287999 | Non-Zero |
2530                  | 288K-353999 | Hole     |
2531                  | 354K-417999 | Non-Zero |
2532                  +-------------+----------+
2534                           Table 3

2536 Under the given circumstances, if a client were to read the file
2537 from beginning to end with a maximum read size of 64K, the results
2538 would be as follows.  This assumes the client has already opened the
2539 file, acquired a valid stateid, and just needs to issue READPLUS requests.

2541 1.  READPLUS(s, 0, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
2542     eof = false, data<>[32K].  Return a short read, as the last half
2543     of the request was all zeroes.  Note that the first hole is read
2544     back as all zeroes as it is below the hole threshold.

2546 2.  READPLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
2547     nfs_readplusreshole(HOLE_INFO)(32K, 224K).  The requested range
2548     was all zeroes, and the current hole begins at offset 32K and is
2549     224K in length.

2551 3.  READPLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
2552     eof = false, data<>[32K].  Return a short read, as the last half
2553     of the request was all zeroes.

2555 4.  READPLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
2556     nfs_readplusreshole(HOLE_INFO)(288K, 66K).

2558 5.  READPLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
2559     eof = true, data<>[64K].

2561 6.6.  Related Work

2563 Solaris and ZFS support an extension to lseek(2) that allows
2564 applications to discover holes in a file.  The values, SEEK_HOLE and
2565 SEEK_DATA, allow applications to seek to the next hole or to the
2566 beginning of the next region of data, respectively.

2568 XFS supports the XFS_IOC_GETBMAP ioctl, which returns the Data
2569 Region Map for a file.  Clients can then use this information to
2570 avoid reading holes in a file.

2572 NTFS and CIFS support the FSCTL_SET_SPARSE control code, which
2573 allows applications to control whether empty regions of the file are
2574 preallocated and filled in with zeros or simply left unallocated.

2576 6.7.  Other Proposed Designs

2578 6.7.1.  Multi-Data Server Hole Information

2580 The current design prohibits pNFS data servers from returning hole
2581 information for regions of a file that are not stored on that data
2582 server.  Having data servers return information regarding other
2583 data servers changes the fundamental principle that all metadata
2584 information comes from the metadata server.

2586 Here is a brief description of how multi-data server hole
2587 information could be supported:

2589 A data server that can obtain hole information for the entire file
2590 without severe performance impact MAY return HOLE_INFO and the byte
2591 range of the entire file's hole.  When a pNFS client receives a
2592 READ_HOLE result and a non-empty nfs_readplusreshole structure, it
2593 MAY use this information in conjunction with a valid layout for the
2594 file to determine the next data server for the next region of data
2595 that is not in a hole.

2597 6.7.2.  Data Result Array

2599 If a single read request contains one or more Holes with a length
2600 greater than the Sparse Threshold, the current design would return
2601 results indicating a short read to the client.  A client would then
2602 send a series of read requests to the server to retrieve information
2603 for the Holes and the remaining data.  To avoid turning a single
2604 read request into several exchanges between the client and server,
2605 the server may need to choose a relatively large Sparse Threshold in
2606 order to decrease the number of short reads it creates.  A large
2607 Sparse Threshold may miss many smaller holes, which in turn may
2608 negate the benefits of sparse read support.

2610 To avoid this situation, one option is to have the READPLUS
2611 operation return information for multiple holes in a single return
2612 value.  This would allow several small holes to be described in a
2613 single read response without requiring multiple exchanges between
2614 the client and server, as sketched below.
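As a purely illustrative sketch of what such a design might look like
(none of the following types are defined by this document), the
result could carry an array of segments, each of which is either
literal data or a hole described by its offset and length; the client
would then walk the array, copying data segments and zero-filling
hole segments:

      #include <stdint.h>
      #include <string.h>

      /* Hypothetical multi-segment READPLUS result; illustrative only. */
      enum seg_type { SEG_DATA = 0, SEG_HOLE = 1 };

      struct readplus_segment {
              enum seg_type   type;
              uint64_t        offset;   /* file offset of this segment */
              uint64_t        length;   /* length of this segment      */
              const uint8_t  *data;     /* valid only when SEG_DATA    */
      };

      struct readplus_multi_result {
              int                       eof;
              uint32_t                  nseg;
              struct readplus_segment  *segs;
      };

      /*
       * Client-side consumption: copy data segments and zero-fill hole
       * segments into a buffer that begins at file offset 'base'.
       */
      static void
      consume_segments(const struct readplus_multi_result *res,
                       uint64_t base, uint8_t *buf)
      {
              uint32_t i;

              for (i = 0; i < res->nseg; i++) {
                      const struct readplus_segment *s = &res->segs[i];
                      uint8_t *dst = buf + (size_t)(s->offset - base);

                      if (s->type == SEG_DATA)
                              memcpy(dst, s->data, (size_t)s->length);
                      else
                              memset(dst, 0, (size_t)s->length);
              }
      }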
2616 One important item to consider with returning an array of data
2617 chunks is its impact on RDMA, which may use different block sizes on
2618 the client and server (among other things).

2620 6.7.3.  User-Defined Sparse Mask

2622 A user-defined sparse mask would allow a hole to be read back as a fill pattern other than zeroes; would the mask be specified by the server or by the client?

2624 6.7.4.  Allocated flag

2626 A Hole on the server may be an allocated byte-range consisting of
2627 all zeroes or may not be allocated at all.  To ensure this
2628 information is properly communicated to the client, it may be
2629 beneficial to add an 'alloc' flag to the HOLE_INFO section of
2630 nfs_readplusreshole.  This would allow an NFS client to copy a file
2631 from one file system to another and have it more closely resemble the original.

2633 6.7.5.  Dense and Sparse pNFS File Layouts

2635 The hole information returned from a data server must be understood
2636 by pNFS clients using either dense or sparse file layout types.
2637 Does the current READPLUS return value work for both layout types?
2638 Does the data server know whether it is using a dense or a sparse
2639 layout so that it can return the correct hole_offset and hole_length values?

2641 6.8.  Security Considerations

2643 The additions to the NFS protocol for supporting sparse file reads
2644 do not alter the security considerations of the NFSv4.1 protocol
2645 [2].

2647 6.9.  IANA Considerations

2649 There are no IANA considerations in this section.

2651 7.  Security Considerations

2653 8.  IANA Considerations

2655 This section uses terms that are defined in [17].

2657 9.  References

2659 9.1.  Normative References

2661 [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
2662       Levels", BCP 14, RFC 2119, March 1997.

2664 [2]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
2665       (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
2666       January 2010.

2668 [3]   Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
2669       NFS (pNFS) Operations", RFC 5664, January 2010.

2671 [4]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
2672       Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986,
2673       January 2005.

2675 [5]   Williams, N., "Remote Procedure Call (RPC) Security Version 3",
2676       draft-williams-rpcsecgssv3 (work in progress), 2008.

2678 [6]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
2679       (NFS) Version 4 Minor Version 1 External Data Representation
2680       Standard (XDR) Description", RFC 5662, January 2010.

2682 [7]   Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS)
2683       Block/Volume Layout", RFC 5663, January 2010.

2685 [8]   Haynes, T., "Network File System (NFS) Version 4 Minor Version
2686       2 External Data Representation Standard (XDR) Description",
2687       March 2011.

2689 [9]   Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
2690       Specification", RFC 2203, September 1997.

2692 9.2.  Informative References

2694 [10]  Haynes, T. and D. Noveck, "Network File System (NFS) version 4
2695       Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress),
2696       March 2011.

2698 [11]  Eisler, M., "XDR: External Data Representation Standard",
2699       RFC 4506, May 2006.

2701 [12]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
2702       "NSDB Protocol for Federated Filesystems",
2703       draft-ietf-nfsv4-federated-fs-protocol (Work In Progress),
2704       2010.

2706 [13]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
2707       "Administration Protocol for Federated Filesystems",
2708       draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010.

2710 [14]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
2711       Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
2712       HTTP/1.1", RFC 2616, June 1999.

2714 [15]  Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
2715       RFC 959, October 1985.

2717 [16]  Simpson, W., "PPP Challenge Handshake Authentication Protocol
2718       (CHAP)", RFC 1994, August 1996.

2720 [17]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
2721       Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

2723 [18]  Nowicki, B., "NFS: Network File System Protocol specification",
2724       RFC 1094, March 1989.

2726 [19]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3
2727       Protocol Specification", RFC 1813, June 1995.

2729 [20]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
2730       RFC 1833, August 1995.

2732 [21]  Eisler, M., "NFS Version 2 and Version 3 Security Issues and
2733       the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5",
2734       RFC 2623, June 1999.

2736 [22]  Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997.

2738 [23]  Shepler, S., "NFS Version 4 Design Considerations", RFC 2624,
2739       June 1999.

2741 [24]  Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On-
2742       line Database", RFC 3232, January 2002.

2744 [25]  Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964,
2745       June 1996.

2747 [26]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
2748       C., Eisler, M., and D. Noveck, "Network File System (NFS)
2749       version 4 Protocol", RFC 3530, April 2003.

2751 Appendix A.  Acknowledgments

2753 For the pNFS Access Permissions Check, the original draft was by
2754 Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow.  The
2755 work was influenced by discussions with Benny Halevy and Bruce
2756 Fields.  A review was done by Tom Haynes.

2758 For the Sharing change attribute implementation details with NFSv4
2759 clients, the original draft was by Trond Myklebust.

2761 For the NFS Server-side Copy, the original draft was by James
2762 Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul
2763 Iyer.  Tom Talpey co-authored an unpublished version of that
2764 document.  It was also reviewed by a number of individuals: Pranoop
2765 Erasani, Tom Haynes, Arthur Lent, Trond Myklebust, Dave Noveck,
2766 Theresa Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani, and
2767 Nico Williams.

2769 For the NFS space reservation operations, the original draft was by
2770 Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer.

2772 For the sparse file support, the original draft was by Dean
2773 Hildebrand and Marc Eshel.  Valuable input and advice were received
2774 from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and
2775 Richard Scheffenegger.

2777 Appendix B.  RFC Editor Notes

2779 [RFC Editor: please remove this section prior to publishing this
2780 document as an RFC]

2782 [RFC Editor: prior to publishing this document as an RFC, please
2783 replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
2784 RFC number of this document]

2786 Author's Address

2788 Thomas Haynes
2789 NetApp
2790 9110 E 66th St
2791 Tulsa, OK 74133
2792 USA

2794 Phone: +1 918 307 1415
2795 Email: thomas@netapp.com
2796 URI:   http://www.tulsalabs.com