idnits 2.17.1 draft-ietf-nfsv4-minorversion2-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 5 instances of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. == There are 5 instances of lines with private range IPv4 addresses in the document. If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: Furthermore, each DS MUST not report to a client either a sparse ADB or data which belongs to another DS. One implication of this requirement is that the app_data_block4's adb_block_size MUST be either be the stripe width or the stripe width must be an even multiple of it. == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: When a data server chooses to return a READ_HOLE result, it has the option of returning hole information for the data stored on that data server (as defined by the data layout), but it MUST not return a nfs_readplusreshole structure with a byte range that includes data managed by another data server. == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (May 09, 2011) is 4736 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: '0' is mentioned on line 1759, but not defined == Unused Reference: '7' is defined on line 3060, but no explicit reference was found in the text == Unused Reference: '8' is defined on line 3064, but no explicit reference was found in the text == Unused Reference: '9' is defined on line 3067, but no explicit reference was found in the text == Unused Reference: '22' is defined on line 3116, but no explicit reference was found in the text == Unused Reference: '23' is defined on line 3119, but no explicit reference was found in the text == Unused Reference: '24' is defined on line 3122, but no explicit reference was found in the text == Unused Reference: '25' is defined on line 3125, but no explicit reference was found in the text == Unused Reference: '26' is defined on line 3129, but no explicit reference was found in the text == Unused Reference: '27' is defined on line 3131, but no explicit reference was found in the text == Unused Reference: '28' is defined on line 3134, but no explicit reference was found in the text == Unused Reference: '29' is defined on line 3137, but no explicit reference was found in the text == Unused Reference: '30' is defined on line 3140, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. '1' ** Obsolete normative reference: RFC 5661 (ref. '2') (Obsoleted by RFC 8881) -- Possible downref: Non-RFC (?) normative reference: ref. '3' == Outdated reference: A later version (-35) exists of draft-ietf-nfsv4-rfc3530bis-09 -- Obsolete informational reference (is this intentional?): RFC 2616 (ref. '14') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) -- Obsolete informational reference (is this intentional?): RFC 5226 (ref. '21') (Obsoleted by RFC 8126) -- Obsolete informational reference (is this intentional?): RFC 3530 (ref. '30') (Obsoleted by RFC 7530) Summary: 1 error (**), 0 flaws (~~), 20 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 NFSv4 T. Haynes 3 Internet-Draft Editor 4 Intended status: Standards Track May 09, 2011 5 Expires: November 10, 2011 7 NFS Version 4 Minor Version 2 8 draft-ietf-nfsv4-minorversion2-02.txt 10 Abstract 12 This Internet-Draft describes NFS version 4 minor version two, 13 focusing mainly on the protocol extensions made from NFS version 4 14 minor version 0 and NFS version 4 minor version 1. Major extensions 15 introduced in NFS version 4 minor version two include: Server-side 16 Copy, Space Reservations, and Support for Sparse Files. 18 Requirements Language 20 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 21 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 22 document are to be interpreted as described in RFC 2119 [1]. 24 Status of this Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. 29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. 
The list of current Internet- 32 Drafts is at http://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on November 10, 2011. 41 Copyright Notice 43 Copyright (c) 2011 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (http://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 56 This document may contain material from IETF Documents or IETF 57 Contributions published or made publicly available before November 58 10, 2008. The person(s) controlling the copyright in some of this 59 material may not have granted the IETF Trust the right to allow 60 modifications of such material outside the IETF Standards Process. 61 Without obtaining an adequate license from the person(s) controlling 62 the copyright in such materials, this document may not be modified 63 outside the IETF Standards Process, and derivative works of it may 64 not be created outside the IETF Standards Process, except to format 65 it for publication as an RFC or to translate it into languages other 66 than English. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 71 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . . 5 72 1.2. Scope of This Document . . . . . . . . . . . . . . . . . . 5 73 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5 74 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . . 5 75 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . . 5 76 2. pNFS LAYOUTRETURN Error Handling . . . . . . . . . . . . . . . 5 77 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 5 78 2.2. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 6 79 2.2.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 6 80 2.2.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 6 81 2.2.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 6 82 2.2.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 7 83 3. Sharing change attribute implementation details with NFSv4 84 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 85 3.1. Abstract . . . . . . . . . . . . . . . . . . . . . . . . . 8 86 3.2. Introduction . . . . . . . . . . . . . . . . . . . . . . . 9 87 3.3. Definition of the 'change_attr_type' per-file system 88 attribute . . . . . . . . . . . . . . . . . . . . . . . . 9 89 4. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 10 90 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 11 91 4.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 11 92 4.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 13 93 4.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 14 94 4.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 17 95 4.3. 
Operations . . . . . . . . . . . . . . . . . . . . . . . . 19 96 4.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 19 97 4.3.2. Operation 61: COPY_NOTIFY - Notify a source server 98 of a future copy . . . . . . . . . . . . . . . . . . . 20 99 4.3.3. Operation 62: COPY_REVOKE - Revoke a destination 100 server's copy privileges . . . . . . . . . . . . . . . 22 101 4.3.4. Operation 59: COPY - Initiate a server-side copy . . . 23 102 4.3.5. Operation 60: COPY_ABORT - Cancel a server-side 103 copy . . . . . . . . . . . . . . . . . . . . . . . . . 31 104 4.3.6. Operation 63: COPY_STATUS - Poll for status of a 105 server-side copy . . . . . . . . . . . . . . . . . . . 32 106 4.3.7. Operation 15: CB_COPY - Report results of a 107 server-side copy . . . . . . . . . . . . . . . . . . . 33 108 4.3.8. Copy Offload Stateids . . . . . . . . . . . . . . . . 35 109 4.4. Security Considerations . . . . . . . . . . . . . . . . . 35 110 4.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 35 111 5. Application Data Block Support . . . . . . . . . . . . . . . . 43 112 5.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 44 113 5.1.1. Data Block Representation . . . . . . . . . . . . . . 45 114 5.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 45 115 5.2. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . . 45 116 5.2.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 46 117 5.2.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 46 118 5.2.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 47 119 5.3. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 48 120 5.3.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 48 121 5.3.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 49 122 5.3.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 49 123 5.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 50 124 5.5. An Example of Detecting Corruption . . . . . . . . . . . . 50 125 5.6. Example of READ_PLUS . . . . . . . . . . . . . . . . . . . 52 126 5.7. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 52 127 6. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 52 128 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 52 129 6.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 54 130 6.2.1. Space Reservation . . . . . . . . . . . . . . . . . . 54 131 6.2.2. Space freed on deletes . . . . . . . . . . . . . . . . 54 132 6.2.3. Operations and attributes . . . . . . . . . . . . . . 55 133 6.2.4. Attribute 77: space_reserved . . . . . . . . . . . . . 55 134 6.2.5. Attribute 78: space_freed . . . . . . . . . . . . . . 56 135 6.2.6. Attribute 79: max_hole_punch . . . . . . . . . . . . . 56 136 6.2.7. Operation 64: HOLE_PUNCH - Zero and deallocate 137 blocks backing the file in the specified range. . . . 56 138 7. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 57 139 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 57 140 7.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 58 141 7.3. Applications and Sparse Files . . . . . . . . . . . . . . 59 142 7.4. Overview of Sparse Files and NFSv4 . . . . . . . . . . . . 60 143 7.5. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 61 144 7.5.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 61 145 7.5.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 62 146 7.5.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 62 147 7.5.4. IMPLEMENTATION . . . . . . . . . . . . . . . . 
. . . . 64 148 7.5.5. READ_PLUS with Sparse Files Example . . . . . . . . . 65 149 7.6. Related Work . . . . . . . . . . . . . . . . . . . . . . . 66 150 7.7. Other Proposed Designs . . . . . . . . . . . . . . . . . . 66 151 7.7.1. Multi-Data Server Hole Information . . . . . . . . . . 66 152 7.7.2. Data Result Array . . . . . . . . . . . . . . . . . . 67 153 7.7.3. User-Defined Sparse Mask . . . . . . . . . . . . . . . 67 154 7.7.4. Allocated flag . . . . . . . . . . . . . . . . . . . . 67 155 7.7.5. Dense and Sparse pNFS File Layouts . . . . . . . . . . 68 156 8. Security Considerations . . . . . . . . . . . . . . . . . . . 68 157 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 68 158 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 68 159 10.1. Normative References . . . . . . . . . . . . . . . . . . . 68 160 10.2. Informative References . . . . . . . . . . . . . . . . . . 69 161 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 70 162 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 71 163 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 71 165 1. Introduction 167 1.1. The NFS Version 4 Minor Version 2 Protocol 169 The NFS version 4 minor version 2 (NFSv4.2) protocol is the third 170 minor version of the NFS version 4 (NFSv4) protocol. The first minor 171 version, NFSv4.0, is described in [10] and the second minor version, 172 NFSv4.1, is described in [2]. NFSv4.2 follows the guidelines for minor 173 versioning that are listed in Section 11 of RFC 3530bis. 175 As a minor version, NFSv4.2 is consistent with the overall goals for 176 NFSv4, but extends the protocol so as to better meet those goals, 177 based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted 178 some additional goals, which motivate some of the major extensions in 179 NFSv4.2. 181 1.2. Scope of This Document 183 This document describes the NFSv4.2 protocol. With respect to 184 NFSv4.0 and NFSv4.1, this document does not: 186 o describe the NFSv4.0 or NFSv4.1 protocols, except where needed to 187 contrast with NFSv4.2. 189 o modify the specification of the NFSv4.0 or NFSv4.1 protocols. 191 o clarify the NFSv4.0 or NFSv4.1 protocols. 193 The full XDR for NFSv4.2 is presented in [3]. 195 1.3. NFSv4.2 Goals 197 1.4. Overview of NFSv4.2 Features 199 1.5. Differences from NFSv4.1 201 2. pNFS LAYOUTRETURN Error Handling 203 2.1. Introduction 205 The pNFS description provided in [2] does not give the client a way to 206 relay an error code from the DS to the MDS. In the specification of 207 the Objects-Based Layout protocol [4], the opaque 208 lrf_body field of the LAYOUTRETURN argument is used to relay such 209 error codes. In this section, we define a new data structure to 210 enable the passing of error codes back to the MDS and provide some 211 guidelines on what both the client and MDS should expect in such 212 circumstances. 214 There are two broad classes of errors, transient and persistent. The 215 client SHOULD use this new mechanism only to report 216 persistent errors. It MUST be able to deal with transient issues by 217 itself. Also, while the client might consider an issue to be 218 persistent, it MUST be prepared for the MDS to consider such issues 219 to be expected. A prime example of this is if the MDS fences off a 220 client from either a stateid or a filehandle.
The client will get an 221 error from the DS and might relay either NFS4ERR_ACCESS or 222 NFS4ERR_STALE_STATEID back to the MDS, with the belief that this is a 223 hard error. The MDS, on the other hand, is waiting for the client to 224 report such an error. For it, the mission is accomplished in that 225 the client has returned a layout that the MDS had most likely 226 recalled. 228 2.2. Changes to Operation 51: LAYOUTRETURN 230 The existing LAYOUTRETURN operation is extended by introducing a new 231 data structure to report errors, layoutreturn_device_error4. Also, 232 layoutreturn_error_report4 is introduced to enable an array of such errors 233 to be reported. 235 2.2.1. ARGUMENT 237 The ARGUMENT specification of the LAYOUTRETURN operation in section 238 18.44.1 of [2] is augmented by the following XDR code [11]: 240 struct layoutreturn_device_error4 { 241 deviceid4 lrde_deviceid; 242 nfsstat4 lrde_status; 243 nfs_opnum4 lrde_opnum; 244 }; 246 struct layoutreturn_error_report4 { 247 layoutreturn_device_error4 lrer_errors<>; 248 }; 250 2.2.2. RESULT 252 The RESULT of the LAYOUTRETURN operation is unchanged; see section 253 18.44.2 of [2]. 255 2.2.3. DESCRIPTION 257 The following text is added to the end of the LAYOUTRETURN operation 258 DESCRIPTION in section 18.44.3 of [2]. 260 When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, 261 then if the lrf_body field is NULL, it indicates to the MDS that the 262 client experienced no errors. If lrf_body is non-NULL, then the 263 field references error information which is layout type specific. 264 That is, the Objects-Based Layout protocol can continue to utilize 265 lrf_body as specified in [4]. For Files-Based Layouts, the 266 field references a layoutreturn_error_report4, which contains an 267 array of layoutreturn_device_error4. 269 Each individual layoutreturn_device_error4 describes a single error 270 associated with a DS, which is identified via lrde_deviceid. The 271 operation which returned the error is identified via lrde_opnum. 272 Finally, the NFS error value (nfsstat4) encountered is provided via 273 lrde_status and may consist of the following error codes: 275 NFS4_OK: No issues were found for this device. 277 NFS4ERR_NXIO: The client was unable to establish any communication 278 with the DS. 280 NFS4ERR_*: The client was able to establish communication with the 281 DS and is returning one of the allowed error codes for the 282 operation denoted by lrde_opnum. 284 2.2.4. IMPLEMENTATION 286 The following text is added to the end of the LAYOUTRETURN operation 287 IMPLEMENTATION in section 18.44.4 of [2]. 289 A client that expects to use pNFS for a mounted filesystem SHOULD 290 check for pNFS support at mount time. This check SHOULD be performed 291 by sending a GETDEVICELIST operation, followed by layout-type- 292 specific checks for accessibility of each storage device returned by 293 GETDEVICELIST. If the NFS server does not support pNFS, the 294 GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP 295 error; in this situation it is up to the client to determine whether 296 it is acceptable to proceed with NFS-only access. 298 Clients are expected to tolerate transient storage device errors, and 299 hence clients SHOULD NOT use the LAYOUTRETURN error handling for 300 device access problems that may be transient. The methods by which a 301 client decides whether an access problem is transient vs.
persistent 302 are implementation-specific, but may include retrying I/Os to a data 303 server under appropriate conditions. 305 When an I/O fails to a storage device, the client SHOULD retry the 306 failed I/O via the MDS. In this situation, before retrying the I/O, 307 the client SHOULD return the layout, or the affected portion thereof, 308 and SHOULD indicate which storage device or devices were problematic. 309 If the client does not do this, the MDS may issue a layout recall 310 callback in order to perform the retried I/O. 312 The client needs to be cognizant that since this error handling is 313 optional in the MDS, the MDS may silently ignore this functionality. 314 Also, as the MDS may consider some issues the client reports to be 315 expected (see Section 2.1), the client might find it difficult to 316 detect an MDS which has not implemented error handling via 317 LAYOUTRETURN. 319 If an MDS is aware that a storage device is proving problematic to a 320 client, the MDS SHOULD NOT include that storage device in any pNFS 321 layouts sent to that client. If the MDS is aware that a storage 322 device is affecting many clients, then the MDS SHOULD NOT include 323 that storage device in any pNFS layouts sent out. Clients must still 324 be aware that the MDS might not have any choice in using the storage 325 device, i.e., there might only be one possible layout for the system. 327 Another interesting complication is that for existing files, the MDS 328 might have no choice in which storage devices to hand out to clients. 329 The MDS might try to restripe a file across a different storage 330 device, but clients need to be aware that not all implementations 331 have restriping support. 333 An MDS SHOULD react to a client return of layouts with errors by not 334 using the problematic storage devices in layouts for that client, but 335 the MDS is not required to indefinitely retain per-client storage 336 device error information. An MDS is also not required to 337 automatically reinstate use of a previously problematic storage 338 device; administrative intervention may be required instead. 340 A client MAY perform I/O via the MDS even when the client holds a 341 layout that covers the I/O; servers MUST support this client 342 behavior, and MAY recall layouts as needed to complete I/Os. 344 3. Sharing change attribute implementation details with NFSv4 clients 346 3.1. Abstract 348 This document describes an extension to the NFSv4 protocol that 349 allows the server to share information about the implementation of 350 its change attribute with the client. The aim is to improve the 351 client's ability to determine the order in which parallel updates to 352 the same file were processed. 354 3.2. Introduction 356 Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the 357 change attribute as being mandatory to implement, there is little in 358 the way of guidance. The only feature that is mandated by the spec 359 is that the value must change whenever the file data or metadata 360 change. 362 While this allows for a wide range of implementations, it also leaves 363 the client with a conundrum: how does it determine which is the most 364 recent value for the change attribute in a case where several RPC 365 calls have been issued in parallel?
In other words if two COMPOUNDs, 366 both containing WRITE and GETATTR requests for the same file, have 367 been issued in parallel, how does the client determine which of the 368 two change attribute values returned in the replies to the GETATTR 369 requests corresponds to the most recent state of the file? In some 370 cases, the only recourse may be to send another COMPOUND containing a 371 third GETATTR that is fully serialised with the first two. 373 In order to avoid this kind of inefficiency, we propose a method to 374 allow the server to share details about how the change attribute is 375 expected to evolve, so that the client may immediately determine 376 which, out of the several change attribute values returned by the 377 server, is the most recent. 379 3.3. Definition of the 'change_attr_type' per-file system attribute 381 enum change_attr_typeinfo { 382 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, 383 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, 384 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, 385 NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, 386 NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 387 }; 389 +------------------+----+---------------------------+-----+ 390 | Name | Id | Data Type | Acc | 391 +------------------+----+---------------------------+-----+ 392 | change_attr_type | XX | enum change_attr_typeinfo | R | 393 +------------------+----+---------------------------+-----+ 395 The proposed solution is to enable the NFS server to provide 396 additional information about how it expects the change attribute 397 value to evolve after the file data or metadata has changed. To do 398 so, we define a new recommended attribute, 'change_attr_type', which 399 may take values from enum change_attr_typeinfo as follows: 401 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST 402 monotonically increase for every atomic change to the file 403 attributes, data or directory contents. 405 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER: The change attribute value MUST 406 be incremented by one unit for every atomic change to the file 407 attributes, data or directory contents. This property is 408 preserved when writing to pNFS data servers. 410 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute 411 value MUST be incremented by one unit for every atomic change to 412 the file attributes, data or directory contents. In the case 413 where the client is writing to pNFS data servers, the number of 414 increments is not guaranteed to exactly match the number of 415 writes. 417 NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is 418 implemented as suggested in the NFSv4 spec [10] in terms of the 419 time_metadata attribute. 421 NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take 422 values that fit into any of these categories. 424 If either NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, 425 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or 426 NFS4_CHANGE_TYPE_IS_TIME_METADATA are set, then the client knows at 427 the very least that the change attribute is monotonically increasing, 428 which is sufficient to resolve the question of which value is the 429 most recent. 431 If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then 432 by inspecting the value of the 'time_delta' attribute it additionally 433 has the option of detecting rogue server implementations that use 434 time_metadata in violation of the spec. 
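The following non-normative sketch (Python; the helper name and the values shown are purely illustrative) shows how a client could apply the above rule to pick the newest of several change attribute values returned by parallel GETATTRs:

   # Non-normative sketch: ordering change attribute values with the
   # change_attr_type attribute.  The numeric values mirror
   # enum change_attr_typeinfo above.
   NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0
   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1
   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2
   NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3
   NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4

   # Types for which the text above guarantees a monotonically
   # increasing change attribute.
   MONOTONIC = {NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR,
                NFS4_CHANGE_TYPE_IS_VERSION_COUNTER,
                NFS4_CHANGE_TYPE_IS_TIME_METADATA}

   def most_recent_change(change_attr_type, values):
       """Pick the newest of several change attribute values returned
       by parallel GETATTRs, or return None if the client must fall
       back to a fully serialised GETATTR."""
       if change_attr_type in MONOTONIC:
           return max(values)
       return None

   # Two GETATTR replies raced back; the larger value is the newer.
   assert most_recent_change(NFS4_CHANGE_TYPE_IS_VERSION_COUNTER,
                             [1021, 1022]) == 1022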
436 Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it 437 has the ability to predict what the resulting change attribute value 438 should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. 439 This again allows it to detect changes made in parallel by another 440 client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits 441 the same, but only if the client is not doing pNFS WRITEs. 443 4. NFS Server-side Copy 444 4.1. Introduction 446 This document describes a server-side copy feature for the NFS 447 protocol. 449 The server-side copy feature provides a mechanism for the NFS client 450 to perform a file copy on the server without the data being 451 transmitted back and forth over the network. 453 Without this feature, an NFS client copies data from one location to 454 another by reading the data from the server over the network, and 455 then writing the data back over the network to the server. Using 456 this server-side copy operation, the client is able to instruct the 457 server to copy the data locally without the data being sent back and 458 forth over the network unnecessarily. 460 In general, this feature is useful whenever data is copied from one 461 location to another on the server. It is particularly useful when 462 copying the contents of a file from a backup. Backup-versions of a 463 file are copied for a number of reasons, including restoring and 464 cloning data. 466 If the source object and destination object are on different file 467 servers, the file servers will communicate with one another to 468 perform the copy operation. The server-to-server protocol by which 469 this is accomplished is not defined in this document. 471 4.2. Protocol Overview 473 The server-side copy offload operations support both intra-server and 474 inter-server file copies. An intra-server copy is a copy in which 475 the source file and destination file reside on the same server. In 476 an inter-server copy, the source file and destination file are on 477 different servers. In both cases, the copy may be performed 478 synchronously or asynchronously. 480 Throughout the rest of this document, we refer to the NFS server 481 containing the source file as the "source server" and the NFS server 482 to which the file is transferred as the "destination server". In the 483 case of an intra-server copy, the source server and destination 484 server are the same server. Therefore in the context of an intra- 485 server copy, the terms source server and destination server refer to 486 the single server performing the copy. 488 The operations described below are designed to copy files. Other 489 file system objects can be copied by building on these operations or 490 using other techniques. For example if the user wishes to copy a 491 directory, the client can synthesize a directory copy by first 492 creating the destination directory and then copying the source 493 directory's files to the new destination directory. If the user 494 wishes to copy a namespace junction [12] [13], the client can use the 495 ONC RPC Federated Filesystem protocol [13] to perform the copy. 496 Specifically the client can determine the source junction's 497 attributes using the FEDFS_LOOKUP_FSN procedure and create a 498 duplicate junction using the FEDFS_CREATE_JUNCTION procedure. 500 For the inter-server copy protocol, the operations are defined to be 501 compatible with a server-to-server copy protocol in which the 502 destination server reads the file data from the source server. 
This 503 model in which the file data is pulled from the source by the 504 destination has a number of advantages over a model in which the 505 source pushes the file data to the destination. The advantages of 506 the pull model include: 508 o The pull model only requires a remote server (i.e. the destination 509 server) to be granted read access. A push model requires a remote 510 server (i.e. the source server) to be granted write access, which 511 is more privileged. 513 o The pull model allows the destination server to stop reading if it 514 has run out of space. In a push model, the destination server 515 must flow control the source server in this situation. 517 o The pull model allows the destination server to easily flow 518 control the data stream by adjusting the size of its read 519 operations. In a push model, the destination server does not have 520 this ability. The source server in a push model is capable of 521 writing chunks larger than the destination server has requested in 522 attributes and session parameters. In theory, the destination 523 server could perform a "short" write in this situation, but this 524 approach is known to behave poorly in practice. 526 The following operations are provided to support server-side copy: 528 COPY_NOTIFY: For inter-server copies, the client sends this 529 operation to the source server to notify it of a future file copy 530 from a given destination server for the given user. 532 COPY_REVOKE: Also for inter-server copies, the client sends this 533 operation to the source server to revoke permission to copy a file 534 for the given user. 536 COPY: Used by the client to request a file copy. 538 COPY_ABORT: Used by the client to abort an asynchronous file copy. 540 COPY_STATUS: Used by the client to poll the status of an 541 asynchronous file copy. 543 CB_COPY: Used by the destination server to report the results of an 544 asynchronous file copy to the client. 546 These operations are described in detail in Section 4.3. This 547 section provides an overview of how these operations are used to 548 perform server-side copies. 550 4.2.1. Intra-Server Copy 552 To copy a file on a single server, the client uses a COPY operation. 553 The server may respond to the copy operation with the final results 554 of the copy or it may perform the copy asynchronously and deliver the 555 results using a CB_COPY operation callback. If the copy is performed 556 asynchronously, the client may poll the status of the copy using 557 COPY_STATUS or cancel the copy using COPY_ABORT. 559 A synchronous intra-server copy is shown in Figure 1. In this 560 example, the NFS server chooses to perform the copy synchronously. 561 The copy operation is completed, either successfully or 562 unsuccessfully, before the server replies to the client's request. 563 The server's reply contains the final result of the operation. 565 Client Server 566 + + 567 | | 568 |--- COPY ---------------------------->| Client requests 569 |<------------------------------------/| a file copy 570 | | 571 | | 573 Figure 1: A synchronous intra-server copy. 575 An asynchronous intra-server copy is shown in Figure 2. In this 576 example, the NFS server performs the copy asynchronously. The 577 server's reply to the copy request indicates that the copy operation 578 was initiated and the final result will be delivered at a later time. 579 The server's reply also contains a copy stateid. 
The client may use 580 this copy stateid to poll for status information (as shown) or to 581 cancel the copy using a COPY_ABORT. When the server completes the 582 copy, the server performs a callback to the client and reports the 583 results. 585 Client Server 586 + + 587 | | 588 |--- COPY ---------------------------->| Client requests 589 |<------------------------------------/| a file copy 590 | | 591 | | 592 |--- COPY_STATUS --------------------->| Client may poll 593 |<------------------------------------/| for status 594 | | 595 | . | Multiple COPY_STATUS 596 | . | operations may be sent. 597 | . | 598 | | 599 |<-- CB_COPY --------------------------| Server reports results 600 |\------------------------------------>| 601 | | 603 Figure 2: An asynchronous intra-server copy. 605 4.2.2. Inter-Server Copy 607 A copy may also be performed between two servers. The copy protocol 608 is designed to accommodate a variety of network topologies. As shown 609 in Figure 3, the client and servers may be connected by multiple 610 networks. In particular, the servers may be connected by a 611 specialized, high speed network (network 192.168.33.0/24 in the 612 diagram) that does not include the client. The protocol allows the 613 client to setup the copy between the servers (over network 614 10.11.78.0/24 in the diagram) and for the servers to communicate on 615 the high speed network if they choose to do so. 617 192.168.33.0/24 618 +-------------------------------------+ 619 | | 620 | | 621 | 192.168.33.18 | 192.168.33.56 622 +-------+------+ +------+------+ 623 | Source | | Destination | 624 +-------+------+ +------+------+ 625 | 10.11.78.18 | 10.11.78.56 626 | | 627 | | 628 | 10.11.78.0/24 | 629 +------------------+------------------+ 630 | 631 | 632 | 10.11.78.243 633 +-----+-----+ 634 | Client | 635 +-----------+ 637 Figure 3: An example inter-server network topology. 639 For an inter-server copy, the client notifies the source server that 640 a file will be copied by the destination server using a COPY_NOTIFY 641 operation. The client then initiates the copy by sending the COPY 642 operation to the destination server. The destination server may 643 perform the copy synchronously or asynchronously. 645 A synchronous inter-server copy is shown in Figure 4. In this case, 646 the destination server chooses to perform the copy before responding 647 to the client's COPY request. 649 An asynchronous copy is shown in Figure 5. In this case, the 650 destination server chooses to respond to the client's COPY request 651 immediately and then perform the copy asynchronously. 653 Client Source Destination 654 + + + 655 | | | 656 |--- COPY_NOTIFY --->| | 657 |<------------------/| | 658 | | | 659 | | | 660 |--- COPY ---------------------------->| 661 | | | 662 | | | 663 | |<----- read -----| 664 | |\--------------->| 665 | | | 666 | | . | Multiple reads may 667 | | . | be necessary 668 | | . | 669 | | | 670 | | | 671 |<------------------------------------/| Destination replies 672 | | | to COPY 674 Figure 4: A synchronous inter-server copy. 676 Client Source Destination 677 + + + 678 | | | 679 |--- COPY_NOTIFY --->| | 680 |<------------------/| | 681 | | | 682 | | | 683 |--- COPY ---------------------------->| 684 |<------------------------------------/| 685 | | | 686 | | | 687 | |<----- read -----| 688 | |\--------------->| 689 | | | 690 | | . | Multiple reads may 691 | | . | be necessary 692 | | . 
| 693 | | | 694 | | | 695 |--- COPY_STATUS --------------------->| Client may poll 696 |<------------------------------------/| for status 697 | | | 698 | | . | Multiple COPY_STATUS 699 | | . | operations may be sent 700 | | . | 701 | | | 702 | | | 703 | | | 704 |<-- CB_COPY --------------------------| Destination reports 705 |\------------------------------------>| results 706 | | | 708 Figure 5: An asynchronous inter-server copy. 710 4.2.3. Server-to-Server Copy Protocol 712 During an inter-server copy, the destination server reads the file 713 data from the source server. The source server and destination 714 server are not required to use a specific protocol to transfer the 715 file data. The choice of what protocol to use is ultimately the 716 destination server's decision. 718 4.2.3.1. Using NFSv4.x as a Server-to-Server Copy Protocol 720 The destination server MAY use standard NFSv4.x (where x >= 1) to 721 read the data from the source server. If NFSv4.x is used for the 722 server-to-server copy protocol, the destination server can use the 723 filehandle contained in the COPY request with standard NFSv4.x 724 operations to read data from the source server. Specifically, the 725 destination server may use the NFSv4.x OPEN operation's CLAIM_FH 726 facility to open the file being copied and obtain an open stateid. 727 Using the stateid, the destination server may then use NFSv4.x READ 728 operations to read the file. 730 4.2.3.2. Using an alternative Server-to-Server Copy Protocol 732 In a homogeneous environment, the source and destination servers 733 might be able to perform the file copy extremely efficiently using 734 specialized protocols. For example the source and destination 735 servers might be two nodes sharing a common file system format for 736 the source and destination file systems. Thus the source and 737 destination are in an ideal position to efficiently render the image 738 of the source file to the destination file by replicating the file 739 system formats at the block level. Another possibility is that the 740 source and destination might be two nodes sharing a common storage 741 area network, and thus there is no need to copy any data at all, and 742 instead ownership of the file and its contents might simply be re- 743 assigned to the destination. To allow for these possibilities, the 744 destination server is allowed to use a server-to-server copy protocol 745 of its choice. 747 In a heterogeneous environment, using a protocol other than NFSv4.x 748 (e.g. HTTP [14] or FTP [15]) presents some challenges. In 749 particular, the destination server is presented with the challenge of 750 accessing the source file given only an NFSv4.x filehandle. 752 One option for protocols that identify source files with path names 753 is to use an ASCII hexadecimal representation of the source 754 filehandle as the file name. 756 Another option for the source server is to use URLs to direct the 757 destination server to a specialized service. For example, the 758 response to COPY_NOTIFY could include the URL 759 ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII 760 hexadecimal representation of the source filehandle. When the 761 destination server receives the source server's URL, it would use 762 "_FH/0x12345" as the file name to pass to the FTP server listening on 763 port 9999 of s1.example.com. 
On port 9999 there would be a special 764 instance of the FTP service that understands how to convert NFS 765 filehandles to an open file descriptor (in many operating systems, 766 this would require a new system call, one which is the inverse of the 767 makefh() function that the pre-NFSv4 MOUNT service needs). 769 Authenticating and identifying the destination server to the source 770 server is also a challenge. Recommendations for how to accomplish 771 this are given in Section 4.4.1.2.4 and Section 4.4.1.4. 773 4.3. Operations 775 In the sections that follow, several operations are defined that 776 together provide the server-side copy feature. These operations are 777 intended to be OPTIONAL operations as defined in section 17 of [2]. 778 The COPY_NOTIFY, COPY_REVOKE, COPY, COPY_ABORT, and COPY_STATUS 779 operations are designed to be sent within an NFSv4 COMPOUND 780 procedure. The CB_COPY operation is designed to be sent within an 781 NFSv4 CB_COMPOUND procedure. 783 Each operation is performed in the context of the user identified by 784 the ONC RPC credential of its containing COMPOUND or CB_COMPOUND 785 request. For example, a COPY_ABORT operation issued by a given user 786 indicates that a specified COPY operation initiated by the same user 787 be canceled. Therefore a COPY_ABORT MUST NOT interfere with a copy 788 of the same file initiated by another user. 790 An NFS server MAY allow an administrative user to monitor or cancel 791 copy operations using an implementation specific interface. 793 4.3.1. netloc4 - Network Locations 795 The server-side copy operations specify network locations using the 796 netloc4 data type shown below: 798 enum netloc_type4 { 799 NL4_NAME = 0, 800 NL4_URL = 1, 801 NL4_NETADDR = 2 802 }; 803 union netloc4 switch (netloc_type4 nl_type) { 804 case NL4_NAME: utf8str_cis nl_name; 805 case NL4_URL: utf8str_cis nl_url; 806 case NL4_NETADDR: netaddr4 nl_addr; 807 }; 809 If the netloc4 is of type NL4_NAME, the nl_name field MUST be 810 specified as a UTF-8 string. The nl_name is expected to be resolved 811 to a network address via DNS, LDAP, NIS, /etc/hosts, or some other 812 means. If the netloc4 is of type NL4_URL, a server URL [5] 813 appropriate for the server-to-server copy operation is specified as a 814 UTF-8 string. If the netloc4 is of type NL4_NETADDR, the nl_addr 815 field MUST contain a valid netaddr4 as defined in Section 3.3.9 of 816 [2]. 818 When netloc4 values are used for an inter-server copy as shown in 819 Figure 3, their values may be evaluated on the source server, 820 destination server, and client. The network environment in which 821 these systems operate should be configured so that the netloc4 values 822 are interpreted as intended on each system. 824 4.3.2. Operation 61: COPY_NOTIFY - Notify a source server of a future 825 copy 827 4.3.2.1. ARGUMENT 829 struct COPY_NOTIFY4args { 830 /* CURRENT_FH: source file */ 831 netloc4 cna_destination_server; 832 }; 834 4.3.2.2. RESULT 836 struct COPY_NOTIFY4resok { 837 nfstime4 cnr_lease_time; 838 netloc4 cnr_source_server<>; 839 }; 841 union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { 842 case NFS4_OK: 843 COPY_NOTIFY4resok resok4; 844 default: 845 void; 846 }; 848 4.3.2.3. DESCRIPTION 850 This operation is used for an inter-server copy. A client sends this 851 operation in a COMPOUND request to the source server to authorize a 852 destination server identified by cna_destination_server to read the 853 file specified by CURRENT_FH on behalf of the given user. 
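As a purely illustrative, non-normative example (the filehandle and operation argument shown are placeholders), the COMPOUND sent to the source server could contain the sub-sequence

   PUTFH        source-fh
   COPY_NOTIFY  cna_destination_server

A successful reply's cnr_lease_time and cnr_source_server values are then used by the client when it sends the COPY operation to the destination server (see Section 4.3.4).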
855 The cna_destination_server MUST be specified using the netloc4 856 network location format. The server is not required to resolve the 857 cna_destination_server address before completing this operation. 859 If this operation succeeds, the source server will allow the 860 cna_destination_server to copy the specified file on behalf of the 861 given user. If COPY_NOTIFY succeeds, the destination server is 862 granted permission to read the file as long as both of the following 863 conditions are met: 865 o The destination server begins reading the source file before the 866 cnr_lease_time expires. If the cnr_lease_time expires while the 867 destination server is still reading the source file, the 868 destination server is allowed to finish reading the file. 870 o The client has not issued a COPY_REVOKE for the same combination 871 of user, filehandle, and destination server. 873 The cnr_lease_time is chosen by the source server. A cnr_lease_time 874 of 0 (zero) indicates an infinite lease. To renew the copy lease 875 time the client should resend the same copy notification request to 876 the source server. 878 To avoid the need for synchronized clocks, copy lease times are 879 granted by the server as a time delta. However, there is a 880 requirement that the client and server clocks do not drift 881 excessively over the duration of the lease. There is also the issue 882 of propagation delay across the network which could easily be several 883 hundred milliseconds as well as the possibility that requests will be 884 lost and need to be retransmitted. 886 To take propagation delay into account, the client should subtract it 887 from copy lease times (e.g. if the client estimates the one-way 888 propagation delay as 200 milliseconds, then it can assume that the 889 lease is already 200 milliseconds old when it gets it). In addition, 890 it will take another 200 milliseconds to get a response back to the 891 server. So the client must send a lease renewal or send the copy 892 offload request to the cna_destination_server at least 400 893 milliseconds before the copy lease would expire. If the propagation 894 delay varies over the life of the lease (e.g. the client is on a 895 mobile host), the client will need to continuously subtract the 896 increase in propagation delay from the copy lease times. 898 The server's copy lease period configuration should take into account 899 the network distance of the clients that will be accessing the 900 server's resources. It is expected that the lease period will take 901 into account the network propagation delays and other network delay 902 factors for the client population. Since the protocol does not allow 903 for an automatic method to determine an appropriate copy lease 904 period, the server's administrator may have to tune the copy lease 905 period. 907 A successful response will also contain a list of names, addresses, 908 and URLs called cnr_source_server, on which the source is willing to 909 accept connections from the destination. These might not be 910 reachable from the client and might be located on networks to which 911 the client has no connection. 913 If the client wishes to perform an inter-server copy, the client MUST 914 send a COPY_NOTIFY to the source server. Therefore, the source 915 server MUST support COPY_NOTIFY. 917 For a copy only involving one server (the source and destination are 918 on the same server), this operation is unnecessary. 
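The lease timing arithmetic described above can be summarised in the following non-normative sketch (Python; the function and variable names are illustrative only, and all times are in seconds):

   # Non-normative sketch of copy lease timing.  one_way_delay is the
   # client's estimate of the propagation delay to the source server.
   def renewal_deadline(time_reply_received, cnr_lease_time, one_way_delay):
       """Latest local time at which the client should send a lease
       renewal (or the copy offload request itself) so that it reaches
       the source server before the copy lease expires.  A
       cnr_lease_time of 0 indicates an infinite lease, which never
       needs renewal."""
       if cnr_lease_time == 0:
           return None
       # The lease is already one_way_delay old when the reply arrives,
       # and the renewal takes another one_way_delay to reach the
       # server, hence the factor of two (400 ms in the example above).
       return time_reply_received + cnr_lease_time - 2 * one_way_delay

   # A 90 second lease with an estimated 200 ms one-way delay:
   print(renewal_deadline(1000.0, 90.0, 0.2))   # -> 1089.6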
920 The COPY_NOTIFY operation may fail for the following reasons (this is 921 a partial list): 923 NFS4ERR_MOVED: The file system which contains the source file is not 924 present on the source server. The client can determine the 925 correct location and reissue the operation with the correct 926 location. 928 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 929 NFS server receiving this request. 931 NFS4ERR_WRONGSEC: The security mechanism being used by the client 932 does not match the server's security policy. 934 4.3.3. Operation 62: COPY_REVOKE - Revoke a destination server's copy 935 privileges 937 4.3.3.1. ARGUMENT 939 struct COPY_REVOKE4args { 940 /* CURRENT_FH: source file */ 941 netloc4 cra_destination_server; 942 }; 944 4.3.3.2. RESULT 946 struct COPY_REVOKE4res { 947 nfsstat4 crr_status; 948 }; 950 4.3.3.3. DESCRIPTION 952 This operation is used for an inter-server copy. A client sends this 953 operation in a COMPOUND request to the source server to revoke the 954 authorization of a destination server identified by 955 cra_destination_server from reading the file specified by CURRENT_FH 956 on behalf of given user. If the cra_destination_server has already 957 begun copying the file, a successful return from this operation 958 indicates that further access will be prevented. 960 The cra_destination_server MUST be specified using the netloc4 961 network location format. The server is not required to resolve the 962 cra_destination_server address before completing this operation. 964 The COPY_REVOKE operation is useful in situations in which the source 965 server granted a very long or infinite lease on the destination 966 server's ability to read the source file and all copy operations on 967 the source file have been completed. 969 For a copy only involving one server (the source and destination are 970 on the same server), this operation is unnecessary. 972 If the server supports COPY_NOTIFY, the server is REQUIRED to support 973 the COPY_REVOKE operation. 975 The COPY_REVOKE operation may fail for the following reasons (this is 976 a partial list): 978 NFS4ERR_MOVED: The file system which contains the source file is not 979 present on the source server. The client can determine the 980 correct location and reissue the operation with the correct 981 location. 983 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 984 NFS server receiving this request. 986 4.3.4. Operation 59: COPY - Initiate a server-side copy 988 4.3.4.1. ARGUMENT 990 const COPY4_GUARDED = 0x00000001; 991 const COPY4_METADATA = 0x00000002; 993 struct COPY4args { 994 /* SAVED_FH: source file */ 995 /* CURRENT_FH: destination file or */ 996 /* directory */ 997 offset4 ca_src_offset; 998 offset4 ca_dst_offset; 999 length4 ca_count; 1000 uint32_t ca_flags; 1001 component4 ca_destination; 1002 netloc4 ca_source_server<>; 1003 }; 1005 4.3.4.2. RESULT 1007 union COPY4res switch (nfsstat4 cr_status) { 1008 case NFS4_OK: 1009 stateid4 cr_callback_id<1>; 1010 default: 1011 length4 cr_bytes_copied; 1012 }; 1014 4.3.4.3. DESCRIPTION 1016 The COPY operation is used for both intra- and inter-server copies. 1017 In both cases, the COPY is always sent from the client to the 1018 destination server of the file copy. The COPY operation requests 1019 that a file be copied from the location specified by the SAVED_FH 1020 value to the location specified by the combination of CURRENT_FH and 1021 ca_destination. 1023 The SAVED_FH must be a regular file. 
If SAVED_FH is not a regular 1024 file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. 1026 In order to set SAVED_FH to the source file handle, the compound 1027 procedure requesting the COPY will include a sub-sequence of 1028 operations such as 1030 PUTFH source-fh 1031 SAVEFH 1033 If the request is for a server-to-server copy, the source-fh is a 1034 filehandle from the source server and the compound procedure is being 1035 executed on the destination server. In this case, the source-fh is a 1036 foreign filehandle on the server receiving the COPY request. If 1037 either PUTFH or SAVEFH checked the validity of the filehandle, the 1038 operation would likely fail and return NFS4ERR_STALE. 1040 In order to avoid this problem, the minor version incorporating the 1041 COPY operations will need to make a few small changes in the handling 1042 of existing operations. If a server supports the server-to-server 1043 COPY feature, a PUTFH followed by a SAVEFH MUST NOT return 1044 NFS4ERR_STALE for either operation. These restrictions do not pose 1045 substantial difficulties for servers. The CURRENT_FH and SAVED_FH 1046 may be validated in the context of the operation referencing them and 1047 an NFS4ERR_STALE error returned for an invalid file handle at that 1048 point. 1050 The CURRENT_FH and ca_destination together specify the destination of 1051 the copy operation. If ca_destination is of 0 (zero) length, then 1052 CURRENT_FH specifies the target file. In this case, CURRENT_FH MUST 1053 be a regular file and not a directory. If ca_destination is not of 0 1054 (zero) length, the ca_destination argument specifies the file name to 1055 which the data will be copied within the directory identified by 1056 CURRENT_FH. In this case, CURRENT_FH MUST be a directory and not a 1057 regular file. 1059 If the file named by ca_destination does not exist and the operation 1060 completes successfully, the file will be visible in the file system 1061 namespace. If the file does not exist and the operation fails, the 1062 file MAY be visible in the file system namespace depending on when 1063 the failure occurs and on the implementation of the NFS server 1064 receiving the COPY operation. If the ca_destination name cannot be 1065 created in the destination file system (due to file name 1066 restrictions, such as case or length), the operation MUST fail. 1068 The ca_src_offset is the offset within the source file from which the 1069 data will be read, the ca_dst_offset is the offset within the 1070 destination file to which the data will be written, and the ca_count 1071 is the number of bytes that will be copied. An offset of 0 (zero) 1072 specifies the start of the file. A count of 0 (zero) requests that 1073 all bytes from ca_src_offset through EOF be copied to the 1074 destination. If concurrent modifications to the source file overlap 1075 with the source file region being copied, the data copied may include 1076 all, some, or none of the modifications. The client can use standard 1077 NFS operations (e.g. OPEN with OPEN4_SHARE_DENY_WRITE or mandatory 1078 byte range locks) to protect against concurrent modifications if the 1079 client is concerned about this. If the source file's end of file is 1080 being modified in parallel with a copy that specifies a count of 0 1081 (zero) bytes, the amount of data copied is implementation dependent 1082 (clients may guard against this case by specifying a non-zero count 1083 value or preventing modification of the source file as mentioned 1084 above). 
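As a non-normative illustration of the offset and count semantics above (Python; names are illustrative only):

   # Non-normative sketch: number of source bytes a COPY will cover.
   # src_size_snapshot is the source file's size as observed by the
   # client; if the file is being extended or truncated in parallel,
   # the amount actually copied is implementation dependent.
   def effective_copy_count(ca_src_offset, ca_count, src_size_snapshot):
       """A ca_count of zero asks for everything from ca_src_offset
       through the end of the source file."""
       if ca_count == 0:
           return src_size_snapshot - ca_src_offset
       return ca_count

   print(effective_copy_count(0, 0, 1 << 20))        # whole 1 MiB file
   print(effective_copy_count(8192, 4096, 1 << 20))  # 4 KiB at offset 8192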
1086 If the source offset or the source offset plus count is greater than 1087 or equal to the size of the source file, the operation will fail with 1088 NFS4ERR_INVAL. The destination offset or destination offset plus 1089 count may be greater than the size of the destination file. This 1090 allows for the client to issue parallel copies to implement 1091 operations such as "cat file1 file2 file3 file4 > dest". 1093 If the destination file is created as a result of this command, the 1094 destination file's size will be equal to the number of bytes 1095 successfully copied. If the destination file already existed, the 1096 destination file's size may increase as a result of this operation 1097 (e.g. if ca_dst_offset plus ca_count is greater than the 1098 destination's initial size). 1100 If the ca_source_server list is specified, then this is an inter- 1101 server copy operation and the source file is on a remote server. The 1102 client is expected to have previously issued a successful COPY_NOTIFY 1103 request to the remote source server. The ca_source_server list 1104 SHOULD be the same as the COPY_NOTIFY response's cnr_source_server 1105 list. If the client includes the entries from the COPY_NOTIFY 1106 response's cnr_source_server list in the ca_source_server list, the 1107 source server can indicate a specific copy protocol for the 1108 destination server to use by returning a URL, which specifies both a 1109 protocol service and server name. Server-to-server copy protocol 1110 considerations are described in Section 4.2.3 and Section 4.4.1. 1112 The ca_flags argument allows the copy operation to be customized in 1113 the following ways using the guarded flag (COPY4_GUARDED) and the 1114 metadata flag (COPY4_METADATA). 1116 [NOTE: Earlier versions of this document defined a 1117 COPY4_SPACE_RESERVED flag for controlling space reservations on the 1118 destination file. This flag has been removed with the expectation 1119 that the space_reserve attribute defined in XXX_TDH_XXX will be 1120 adopted.] 1122 If the guarded flag is set and the destination exists on the server, 1123 this operation will fail with NFS4ERR_EXIST. 1125 If the guarded flag is not set and the destination exists on the 1126 server, the behavior is implementation dependent. 1128 If the metadata flag is set and the client is requesting a whole file 1129 copy (i.e. ca_count is 0 (zero)), a subset of the destination file's 1130 attributes MUST be the same as the source file's corresponding 1131 attributes and a subset of the destination file's attributes SHOULD 1132 be the same as the source file's corresponding attributes. The 1133 attributes in the MUST and SHOULD copy subsets will be defined for 1134 each NFS version. 1136 For NFSv4.1, Table 1 and Table 2 list the REQUIRED and RECOMMENDED 1137 attributes respectively. A "MUST" in the "Copy to destination file?" 1138 column indicates that the attribute is part of the MUST copy set. A 1139 "SHOULD" in the "Copy to destination file?" column indicates that the 1140 attribute is part of the SHOULD copy set. 1142 +--------------------+----+---------------------------+ 1143 | Name | Id | Copy to destination file? 
| 1144 +--------------------+----+---------------------------+ 1145 | supported_attrs | 0 | no | 1146 | type | 1 | MUST | 1147 | fh_expire_type | 2 | no | 1148 | change | 3 | SHOULD | 1149 | size | 4 | MUST | 1150 | link_support | 5 | no | 1151 | symlink_support | 6 | no | 1152 | named_attr | 7 | no | 1153 | fsid | 8 | no | 1154 | unique_handles | 9 | no | 1155 | lease_time | 10 | no | 1156 | rdattr_error | 11 | no | 1157 | filehandle | 19 | no | 1158 | suppattr_exclcreat | 75 | no | 1159 +--------------------+----+---------------------------+ 1161 Table 1 1163 +--------------------+----+---------------------------+ 1164 | Name | Id | Copy to destination file? | 1165 +--------------------+----+---------------------------+ 1166 | acl | 12 | MUST | 1167 | aclsupport | 13 | no | 1168 | archive | 14 | no | 1169 | cansettime | 15 | no | 1170 | case_insensitive | 16 | no | 1171 | case_preserving | 17 | no | 1172 | change_policy | 60 | no | 1173 | chown_restricted | 18 | MUST | 1174 | dacl | 58 | MUST | 1175 | dir_notif_delay | 56 | no | 1176 | dirent_notif_delay | 57 | no | 1177 | fileid | 20 | no | 1178 | files_avail | 21 | no | 1179 | files_free | 22 | no | 1180 | files_total | 23 | no | 1181 | fs_charset_cap | 76 | no | 1182 | fs_layout_type | 62 | no | 1183 | fs_locations | 24 | no | 1184 | fs_locations_info | 67 | no | 1185 | fs_status | 61 | no | 1186 | hidden | 25 | MUST | 1187 | homogeneous | 26 | no | 1188 | layout_alignment | 66 | no | 1189 | layout_blksize | 65 | no | 1190 | layout_hint | 63 | no | 1191 | layout_type | 64 | no | 1192 | maxfilesize | 27 | no | 1193 | maxlink | 28 | no | 1194 | maxname | 29 | no | 1195 | maxread | 30 | no | 1196 | maxwrite | 31 | no | 1197 | mdsthreshold | 68 | no | 1198 | mimetype | 32 | MUST | 1199 | mode | 33 | MUST | 1200 | mode_set_masked | 74 | no | 1201 | mounted_on_fileid | 55 | no | 1202 | no_trunc | 34 | no | 1203 | numlinks | 35 | no | 1204 | owner | 36 | MUST | 1205 | owner_group | 37 | MUST | 1206 | quota_avail_hard | 38 | no | 1207 | quota_avail_soft | 39 | no | 1208 | quota_used | 40 | no | 1209 | rawdev | 41 | no | 1210 | retentevt_get | 71 | MUST | 1211 | retentevt_set | 72 | no | 1212 | retention_get | 69 | MUST | 1213 | retention_hold | 73 | MUST | 1214 | retention_set | 70 | no | 1215 | sacl | 59 | MUST | 1216 | space_avail | 42 | no | 1217 | space_free | 43 | no | 1218 | space_total | 44 | no | 1219 | space_used | 45 | no | 1220 | system | 46 | MUST | 1221 | time_access | 47 | MUST | 1222 | time_access_set | 48 | no | 1223 | time_backup | 49 | no | 1224 | time_create | 50 | MUST | 1225 | time_delta | 51 | no | 1226 | time_metadata | 52 | SHOULD | 1227 | time_modify | 53 | MUST | 1228 | time_modify_set | 54 | no | 1229 +--------------------+----+---------------------------+ 1231 Table 2 1233 [NOTE: The space_reserve attribute XXX_TDH_XXX will be in the MUST 1234 set.] 1236 [NOTE: The source file's attribute values will take precedence over 1237 any attribute values inherited by the destination file.] 1238 In the case of an inter-server copy or an intra-server copy between 1239 file systems, the attributes supported for the source file and 1240 destination file could be different. By definition,the REQUIRED 1241 attributes will be supported in all cases. If the metadata flag is 1242 set and the source file has a RECOMMENDED attribute that is not 1243 supported for the destination file, the copy MUST fail with 1244 NFS4ERR_ATTRNOTSUPP. 
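As a non-normative illustration of the metadata flag handling just described, the following C fragment sketches the per-attribute decision a destination server might make for a whole file copy; the enum, the function, and its arguments are hypothetical, and only the NFS4ERR_ATTRNOTSUPP outcome is mandated by the text above (the nfsstat4 values shown are those defined in the NFSv4.1 XDR [2]).

   /*
    * Illustrative sketch only: per-attribute handling when
    * COPY4_METADATA is set on a whole file copy.  The MUST and
    * SHOULD copy sets are those listed in Table 1 and Table 2.
    */
   enum { NFS4_OK = 0, NFS4ERR_ATTRNOTSUPP = 10032 };

   enum copy_set { COPY_SET_NO, COPY_SET_SHOULD, COPY_SET_MUST };

   static int copy_one_attribute(enum copy_set set,
                                 int set_on_source,
                                 int supported_on_destination)
   {
       if (set == COPY_SET_NO || !set_on_source)
           return NFS4_OK;              /* nothing to propagate */

       if (!supported_on_destination)
           /* A RECOMMENDED attribute is set on the source but is
            * not supported for the destination file: the copy
            * MUST fail with NFS4ERR_ATTRNOTSUPP. */
           return NFS4ERR_ATTRNOTSUPP;

       /* ... copy the source value to the destination file ... */
       return NFS4_OK;
   }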
1246 Any attribute supported by the destination server that is not set on 1247 the source file SHOULD be left unset. 1249 Metadata attributes not exposed via the NFS protocol SHOULD be copied 1250 to the destination file where appropriate. 1252 The destination file's named attributes are not duplicated from the 1253 source file. After the copy process completes, the client MAY 1254 attempt to duplicate named attributes using standard NFSv4 1255 operations. However, the destination file's named attribute 1256 capabilities MAY be different from the source file's named attribute 1257 capabilities. 1259 If the metadata flag is not set and the client is requesting a whole 1260 file copy (i.e. ca_count is 0 (zero)), the destination file's 1261 metadata is implementation dependent. 1263 If the client is requesting a partial file copy (i.e. ca_count is not 1264 0 (zero)), the client SHOULD NOT set the metadata flag and the server 1265 MUST ignore the metadata flag. 1267 If the operation does not result in an immediate failure, the server 1268 will return NFS4_OK, and the CURRENT_FH will remain the destination's 1269 filehandle. 1271 If an immediate failure does occur, cr_bytes_copied will be set to 1272 the number of bytes copied to the destination file before the error 1273 occurred. The cr_bytes_copied value indicates the number of bytes 1274 copied but not which specific bytes have been copied. 1276 A return of NFS4_OK indicates that either the operation is complete 1277 or the operation was initiated and a callback will be used to deliver 1278 the final status of the operation. 1280 If the cr_callback_id is returned, this indicates that the operation 1281 was initiated and a CB_COPY callback will deliver the final results 1282 of the operation. The cr_callback_id stateid is termed a copy 1283 stateid in this context. The server is given the option of returning 1284 the results in a callback because the data may require a relatively 1285 long period of time to copy. 1287 If no cr_callback_id is returned, the operation completed 1288 synchronously and no callback will be issued by the server. The 1289 completion status of the operation is indicated by cr_status. 1291 If the copy completes successfully, either synchronously or 1292 asynchronously, the data copied from the source file to the 1293 destination file MUST appear identical to the NFS client. However, 1294 the NFS server's on disk representation of the data in the source 1295 file and destination file MAY differ. For example, the NFS server 1296 might encrypt, compress, deduplicate, or otherwise represent the on 1297 disk data in the source and destination file differently. 1299 In the event of a failure the state of the destination file is 1300 implementation dependent. The COPY operation may fail for the 1301 following reasons (this is a partial list). 1303 NFS4ERR_MOVED: The file system which contains the source file, or 1304 the destination file or directory is not present. The client can 1305 determine the correct location and reissue the operation with the 1306 correct location. 1308 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 1309 NFS server receiving this request. 1311 NFS4ERR_PARTNER_NOTSUPP: The remote server does not support the 1312 server-to-server copy offload protocol. 1314 NFS4ERR_PARTNER_NO_AUTH: The remote server does not authorize a 1315 server-to-server copy offload operation. 
This may be due to the 1316 client's failure to send the COPY_NOTIFY operation to the remote 1317 server, the remote server receiving a server-to-server copy 1318 offload request after the copy lease time expired, or for some 1319 other permission problem. 1321 NFS4ERR_FBIG: The copy operation would have caused the file to grow 1322 beyond the server's limit. 1324 NFS4ERR_NOTDIR: The CURRENT_FH is a file and ca_destination has non- 1325 zero length. 1327 NFS4ERR_WRONG_TYPE: The SAVED_FH is not a regular file. 1329 NFS4ERR_ISDIR: The CURRENT_FH is a directory and ca_destination has 1330 zero length. 1332 NFS4ERR_INVAL: The source offset or offset plus count are greater 1333 than or equal to the size of the source file. 1335 NFS4ERR_DELAY: The server does not have the resources to perform the 1336 copy operation at the current time. The client should retry the 1337 operation sometime in the future. 1339 NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the 1340 same metadata as the source file. 1342 NFS4ERR_WRONGSEC: The security mechanism being used by the client 1343 does not match the server's security policy. 1345 4.3.5. Operation 60: COPY_ABORT - Cancel a server-side copy 1347 4.3.5.1. ARGUMENT 1349 struct COPY_ABORT4args { 1350 /* CURRENT_FH: desination file */ 1351 stateid4 caa_stateid; 1352 }; 1354 4.3.5.2. RESULT 1356 struct COPY_ABORT4res { 1357 nfsstat4 car_status; 1358 }; 1360 4.3.5.3. DESCRIPTION 1362 COPY_ABORT is used for both intra- and inter-server asynchronous 1363 copies. The COPY_ABORT operation allows the client to cancel a 1364 server-side copy operation that it initiated. This operation is sent 1365 in a COMPOUND request from the client to the destination server. 1366 This operation may be used to cancel a copy when the application that 1367 requested the copy exits before the operation is completed or for 1368 some other reason. 1370 The request contains the filehandle and copy stateid cookies that act 1371 as the context for the previously initiated copy operation. 1373 The result's car_status field indicates whether the cancel was 1374 successful or not. A value of NFS4_OK indicates that the copy 1375 operation was canceled and no callback will be issued by the server. 1376 A copy operation that is successfully canceled may result in none, 1377 some, or all of the data copied. 1379 If the server supports asynchronous copies, the server is REQUIRED to 1380 support the COPY_ABORT operation. 1382 The COPY_ABORT operation may fail for the following reasons (this is 1383 a partial list): 1385 NFS4ERR_NOTSUPP: The abort operation is not supported by the NFS 1386 server receiving this request. 1388 NFS4ERR_RETRY: The abort failed, but a retry at some time in the 1389 future MAY succeed. 1391 NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will 1392 deliver the results of the copy operation. 1394 NFS4ERR_SERVERFAULT: An error occurred on the server that does not 1395 map to a specific error code. 1397 4.3.6. Operation 63: COPY_STATUS - Poll for status of a server-side 1398 copy 1400 4.3.6.1. ARGUMENT 1402 struct COPY_STATUS4args { 1403 /* CURRENT_FH: destination file */ 1404 stateid4 csa_stateid; 1405 }; 1407 4.3.6.2. RESULT 1409 struct COPY_STATUS4resok { 1410 length4 csr_bytes_copied; 1411 nfsstat4 csr_complete<1>; 1412 }; 1414 union COPY_STATUS4res switch (nfsstat4 csr_status) { 1415 case NFS4_OK: 1416 COPY_STATUS4resok resok4; 1417 default: 1418 void; 1419 }; 1421 4.3.6.3. 
DESCRIPTION 1423 COPY_STATUS is used for both intra- and inter-server asynchronous 1424 copies. The COPY_STATUS operation allows the client to poll the 1425 server to determine the status of an asynchronous copy operation. 1426 This operation is sent by the client to the destination server. 1428 If this operation is successful, the number of bytes copied are 1429 returned to the client in the csr_bytes_copied field. The 1430 csr_bytes_copied value indicates the number of bytes copied but not 1431 which specific bytes have been copied. 1433 If the optional csr_complete field is present, the copy has 1434 completed. In this case the status value indicates the result of the 1435 asynchronous copy operation. In all cases, the server will also 1436 deliver the final results of the asynchronous copy in a CB_COPY 1437 operation. 1439 The failure of this operation does not indicate the result of the 1440 asynchronous copy in any way. 1442 If the server supports asynchronous copies, the server is REQUIRED to 1443 support the COPY_STATUS operation. 1445 The COPY_STATUS operation may fail for the following reasons (this is 1446 a partial list): 1448 NFS4ERR_NOTSUPP: The copy status operation is not supported by the 1449 NFS server receiving this request. 1451 NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 4.3.8 1452 below). 1454 NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid 1455 section below). 1457 4.3.7. Operation 15: CB_COPY - Report results of a server-side copy 1458 4.3.7.1. ARGUMENT 1460 union copy_info4 switch (nfsstat4 cca_status) { 1461 case NFS4_OK: 1462 void; 1463 default: 1464 length4 cca_bytes_copied; 1465 }; 1467 struct CB_COPY4args { 1468 nfs_fh4 cca_fh; 1469 stateid4 cca_stateid; 1470 copy_info4 cca_copy_info; 1471 }; 1473 4.3.7.2. RESULT 1475 struct CB_COPY4res { 1476 nfsstat4 ccr_status; 1477 }; 1479 4.3.7.3. DESCRIPTION 1481 CB_COPY is used for both intra- and inter-server asynchronous copies. 1482 The CB_COPY callback informs the client of the result of an 1483 asynchronous server-side copy. This operation is sent by the 1484 destination server to the client in a CB_COMPOUND request. The copy 1485 is identified by the filehandle and stateid arguments. The result is 1486 indicated by the status field. If the copy failed, cca_bytes_copied 1487 contains the number of bytes copied before the failure occurred. The 1488 cca_bytes_copied value indicates the number of bytes copied but not 1489 which specific bytes have been copied. 1491 In the absence of an established backchannel, the server cannot 1492 signal the completion of the COPY via a CB_COPY callback. The loss 1493 of a callback channel would be indicated by the server setting the 1494 SEQ4_STATUS_CB_PATH_DOWN flag in the sr_status_flags field of the 1495 SEQUENCE operation. The client must re-establish the callback 1496 channel to receive the status of the COPY operation. Prolonged loss 1497 of the callback channel could result in the server dropping the COPY 1498 operation state and invalidating the copy stateid. 1500 If the client supports the COPY operation, the client is REQUIRED to 1501 support the CB_COPY operation. 1503 The CB_COPY operation may fail for the following reasons (this is a 1504 partial list): 1506 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 1507 NFS client receiving this request. 1509 4.3.8. Copy Offload Stateids 1511 A server may perform a copy offload operation asynchronously. An 1512 asynchronous copy is tracked using a copy offload stateid. 
Copy 1513 offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS, 1514 and CB_COPY operations. 1516 Section 8.2.4 of [2] specifies that stateids are valid until either 1517 (A) the client or server restart or (B) the client returns the 1518 resource. 1520 A copy offload stateid will be valid until either (A) the client or 1521 server restart or (B) the client returns the resource by issuing a 1522 COPY_ABORT operation or the client replies to a CB_COPY operation. 1524 A copy offload stateid's seqid MUST NOT be 0 (zero). In the context 1525 of a copy offload operation, it is ambiguous to indicate the most 1526 recent copy offload operation using a stateid with seqid of 0 (zero). 1527 Therefore a copy offload stateid with seqid of 0 (zero) MUST be 1528 considered invalid. 1530 4.4. Security Considerations 1532 The security considerations pertaining to NFSv4 [10] apply to this 1533 document. 1535 The standard security mechanisms provided by NFSv4 [10] may be used to 1536 secure the protocol described in this document. 1538 NFSv4 clients and servers supporting the inter-server copy 1539 operations described in this document are REQUIRED to implement [6], 1540 including the RPCSEC_GSSv3 privileges copy_from_auth and 1541 copy_to_auth. If the server-to-server copy protocol is ONC RPC 1542 based, the servers are also REQUIRED to implement the RPCSEC_GSSv3 1543 privilege copy_confirm_auth. These requirements to implement are not 1544 requirements to use. NFSv4 clients and servers are RECOMMENDED to 1545 use [6] to secure server-side copy operations. 1547 4.4.1. Inter-Server Copy Security 1549 4.4.1.1. Requirements for Secure Inter-Server Copy 1551 Inter-server copy is driven by several requirements: 1553 o The specification MUST NOT mandate an inter-server copy protocol. 1554 There are many ways to copy data. Some will be more optimal than 1555 others depending on the identities of the source server and 1556 destination server. For example, the source and destination 1557 servers might be two nodes sharing a common file system format for 1558 the source and destination file systems. Thus the source and 1559 destination are in an ideal position to efficiently render the 1560 image of the source file to the destination file by replicating 1561 the file system formats at the block level. In other cases, the 1562 source and destination might be two nodes sharing a common storage 1563 area network, and thus there is no need to copy any data at all, 1564 and instead ownership of the file and its contents simply gets re- 1565 assigned to the destination. 1567 o The specification MUST provide guidance for using NFSv4.x as a 1568 copy protocol. For those source and destination servers willing 1569 to use NFSv4.x there are specific security considerations that 1570 this specification can and does address. 1572 o The specification MUST NOT mandate pre-configuration between the 1573 source and destination server. Requiring that the source and 1574 destination first have a "copying relationship" increases the 1575 administrative burden. However, the specification MUST NOT 1576 preclude implementations that require pre-configuration. 1578 o The specification MUST NOT mandate a trust relationship between 1579 the source and destination server. The NFSv4 security model 1580 requires mutual authentication between a principal on an NFS 1581 client and a principal on an NFS server. This model MUST continue 1582 with the introduction of COPY. 1584 4.4.1.2.
Inter-Server Copy with RPCSEC_GSSv3 1586 When the client sends a COPY_NOTIFY to the source server to indicate 1587 that the destination server will attempt to copy data from the source server, it is 1588 expected that this copy is being done on behalf of the principal 1589 (called the "user principal") that sent the RPC request that encloses 1590 the COMPOUND procedure that contains the COPY_NOTIFY operation. The 1591 user principal is identified by the RPC credentials. A mechanism 1592 is necessary that allows the user principal to authorize the destination server to 1593 perform the copy, that lets the source server properly 1594 authenticate the destination's copy operation, and that does not allow the 1595 destination server to exceed its authorization. 1597 An approach that sends delegated credentials of the client's user 1598 principal to the destination server is not used for the following 1599 reasons. If the client's user principal delegated its credentials, the 1600 destination would authenticate as the user principal. If the 1601 destination were using the NFSv4 protocol to perform the copy, then 1602 the source server would authenticate the destination server as the 1603 user principal, and the file copy would securely proceed. However, 1604 this approach would allow the destination server to copy other files. 1605 The user principal would have to trust the destination server to not 1606 do so. This is counter to the requirements, and therefore is not 1607 considered. Instead, an approach using RPCSEC_GSSv3 [6] privileges is 1608 proposed. 1610 One of the stated applications of the proposed RPCSEC_GSSv3 protocol 1611 is compound client host and user authentication [+ privilege 1612 assertion]. For inter-server file copy, we require compound NFS 1613 server host and user authentication [+ privilege assertion]. The 1614 distinction between the two is one without meaning. 1616 RPCSEC_GSSv3 introduces the notion of privileges. We define three 1617 privileges: 1619 copy_from_auth: A user principal is authorizing a source principal 1620 ("nfs@<source>") to allow a destination principal ("nfs@ 1621 <destination>") to copy a file from the source to the destination. 1622 This privilege is established on the source server before the user 1623 principal sends a COPY_NOTIFY operation to the source server. 1625 struct copy_from_auth_priv { 1626 secret4 cfap_shared_secret; 1627 netloc4 cfap_destination; 1628 /* the NFSv4 user name that the user principal maps to */ 1629 utf8str_mixed cfap_username; 1630 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 1631 unsigned int cfap_seq_num; 1632 }; 1634 cfap_shared_secret is a secret value the user principal generates. 1636 copy_to_auth: A user principal is authorizing a destination 1637 principal ("nfs@<destination>") to allow it to copy a file from 1638 the source to the destination. This privilege is established on 1639 the destination server before the user principal sends a COPY 1640 operation to the destination server. 1642 struct copy_to_auth_priv { 1643 /* equal to cfap_shared_secret */ 1644 secret4 ctap_shared_secret; 1645 netloc4 ctap_source; 1646 /* the NFSv4 user name that the user principal maps to */ 1647 utf8str_mixed ctap_username; 1648 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 1649 unsigned int ctap_seq_num; 1650 }; 1652 ctap_shared_secret is the secret value the user principal generated 1653 and used to establish the copy_from_auth privilege with the 1654 source principal.
1656 copy_confirm_auth: A destination principal is confirming with the 1657 source principal that it is authorized to copy data from the 1658 source on behalf of the user principal. When the inter-server 1659 copy protocol is NFSv4, or for that matter, any protocol capable 1660 of being secured via RPCSEC_GSSv3 (i.e. any ONC RPC protocol), 1661 this privilege is established before the file is copied from the 1662 source to the destination. 1664 struct copy_confirm_auth_priv { 1665 /* equal to GSS_GetMIC() of cfap_shared_secret */ 1666 opaque ccap_shared_secret_mic<>; 1667 /* the NFSv4 user name that the user principal maps to */ 1668 utf8str_mixed ccap_username; 1669 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 1670 unsigned int ccap_seq_num; 1671 }; 1673 4.4.1.2.1. Establishing a Security Context 1675 When the user principal wants to COPY a file between two servers, if 1676 it has not established copy_from_auth and copy_to_auth privileges on 1677 the servers, it establishes them: 1679 o The user principal generates a secret it will share with the two 1680 servers. This shared secret will be placed in the 1681 cfap_shared_secret and ctap_shared_secret fields of the 1682 appropriate privilege data types, copy_from_auth_priv and 1683 copy_to_auth_priv. 1685 o An instance of copy_from_auth_priv is filled in with the shared 1686 secret, the destination server, and the NFSv4 user id of the user 1687 principal. It will be sent with an RPCSEC_GSS3_CREATE procedure, 1688 and so cfap_seq_num is set to the seq_num of the credential of the 1689 RPCSEC_GSS3_CREATE procedure. Because cfap_shared_secret is a 1690 secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with 1691 privacy) is invoked on copy_from_auth_priv. The 1692 RPCSEC_GSS3_CREATE procedure's arguments are: 1694 struct { 1695 rpc_gss3_gss_binding *compound_binding; 1696 rpc_gss3_chan_binding *chan_binding_mic; 1697 rpc_gss3_assertion assertions<>; 1698 rpc_gss3_extension extensions<>; 1699 } rpc_gss3_create_args; 1701 The string "copy_from_auth" is placed in assertions[0].privs. The 1702 output of GSS_Wrap() is placed in extensions[0].data. The field 1703 extensions[0].critical is set to TRUE. The source server calls 1704 GSS_Unwrap() on the privilege, and verifies that the seq_num 1705 matches the credential. It then verifies that the NFSv4 user id 1706 being asserted matches the source server's mapping of the user 1707 principal. If it does, the privilege is established on the source 1708 server as: <"copy_from_auth", user id, destination>. The 1709 successful reply to RPCSEC_GSS3_CREATE has: 1711 struct { 1712 opaque handle<>; 1713 rpc_gss3_chan_binding *chan_binding_mic; 1714 rpc_gss3_assertion granted_assertions<>; 1715 rpc_gss3_assertion server_assertions<>; 1716 rpc_gss3_extension extensions<>; 1717 } rpc_gss3_create_res; 1719 The field "handle" is the RPCSEC_GSSv3 handle that the client will 1720 use on COPY_NOTIFY requests involving the source and destination 1721 server. granted_assertions[0].privs will be equal to 1722 "copy_from_auth". The server will return a GSS_Wrap() of 1723 copy_to_auth_priv. 1725 o An instance of copy_to_auth_priv is filled in with the shared 1726 secret, the source server, and the NFSv4 user id. It will be sent 1727 with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set 1728 to the seq_num of the credential of the RPCSEC_GSS3_CREATE 1729 procedure. Because ctap_shared_secret is a secret, after XDR 1730 encoding copy_to_auth_priv, GSS_Wrap() is invoked on 1731 copy_to_auth_priv. 
The RPCSEC_GSS3_CREATE procedure's arguments 1732 are: 1734 struct { 1735 rpc_gss3_gss_binding *compound_binding; 1736 rpc_gss3_chan_binding *chan_binding_mic; 1737 rpc_gss3_assertion assertions<>; 1738 rpc_gss3_extension extensions<>; 1739 } rpc_gss3_create_args; 1741 The string "copy_to_auth" is placed in assertions[0].privs. The 1742 output of GSS_Wrap() is placed in extensions[0].data. The field 1743 extensions[0].critical is set to TRUE. After unwrapping, 1744 verifying the seq_num, and the user principal to NFSv4 user ID 1745 mapping, the destination establishes a privilege of 1746 <"copy_to_auth", user id, source>. The successful reply to 1747 RPCSEC_GSS3_CREATE has: 1749 struct { 1750 opaque handle<>; 1751 rpc_gss3_chan_binding *chan_binding_mic; 1752 rpc_gss3_assertion granted_assertions<>; 1753 rpc_gss3_assertion server_assertions<>; 1754 rpc_gss3_extension extensions<>; 1755 } rpc_gss3_create_res; 1757 The field "handle" is the RPCSEC_GSSv3 handle that the client will 1758 use on COPY requests involving the source and destination server. 1759 The field granted_assertions[0].privs will be equal to 1760 "copy_to_auth". The server will return a GSS_Wrap() of 1761 copy_to_auth_priv. 1763 4.4.1.2.2. Starting a Secure Inter-Server Copy 1765 When the client sends a COPY_NOTIFY request to the source server, it 1766 uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle. 1767 cna_destination_server in COPY_NOTIFY MUST be the same as the name of 1768 the destination server specified in copy_from_auth_priv. Otherwise, 1769 COPY_NOTIFY will fail with NFS4ERR_ACCESS. The source server 1770 verifies that the privilege <"copy_from_auth", user id, destination> 1771 exists, and annotates it with the source filehandle, if the user 1772 principal has read access to the source file, and if administrative 1773 policies give the user principal and the NFS client read access to 1774 the source file (i.e. if the ACCESS operation would grant read 1775 access). Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS. 1777 When the client sends a COPY request to the destination server, it 1778 uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle. 1779 ca_source_server in COPY MUST be the same as the name of the source 1780 server specified in copy_to_auth_priv. Otherwise, COPY will fail 1781 with NFS4ERR_ACCESS. The destination server verifies that the 1782 privilege <"copy_to_auth", user id, source> exists, and annotates it 1783 with the source and destination filehandles. If the client has 1784 failed to establish the "copy_to_auth" policy it will reject the 1785 request with NFS4ERR_PARTNER_NO_AUTH. 1787 If the client sends a COPY_REVOKE to the source server to rescind the 1788 destination server's copy privilege, it uses the privileged 1789 "copy_from_auth" RPCSEC_GSSv3 handle and the cra_destination_server 1790 in COPY_REVOKE MUST be the same as the name of the destination server 1791 specified in copy_from_auth_priv. The source server will then delete 1792 the <"copy_from_auth", user id, destination> privilege and fail any 1793 subsequent copy requests sent under the auspices of this privilege 1794 from the destination server. 1796 4.4.1.2.3. 
Securing ONC RPC Server-to-Server Copy Protocols 1798 After a destination server has a "copy_to_auth" privilege established 1799 on it, and it receives a COPY request, if it knows it will use an ONC 1800 RPC protocol to copy data, it will establish a "copy_confirm_auth" 1801 privilege on the source server, using nfs@<destination> as the 1802 initiator principal, and nfs@<source> as the target principal. 1804 The value of the field ccap_shared_secret_mic is a GSS_VerifyMIC() of 1805 the shared secret passed in the copy_to_auth privilege. The field 1806 ccap_username is the mapping of the user principal to an NFSv4 user 1807 name ("user"@"domain" form), and MUST be the same as ctap_username 1808 and cfap_username. The field ccap_seq_num is the seq_num of the 1809 RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure the 1810 destination will send to the source server to establish the 1811 privilege. 1813 The source server verifies the privilege, and establishes a 1814 <"copy_confirm_auth", user id, destination> privilege. If the source 1815 server fails to verify the privilege, the COPY operation will be 1816 rejected with NFS4ERR_PARTNER_NO_AUTH. All subsequent ONC RPC 1817 requests sent from the destination to copy data from the source to 1818 the destination will use the RPCSEC_GSSv3 handle returned by the 1819 source's RPCSEC_GSS3_CREATE response. 1821 Note that the use of the "copy_confirm_auth" privilege accomplishes 1822 the following: 1824 o if a protocol like NFS is being used, with export policies, the export 1825 policies can be overridden in case the destination server as-an- 1826 NFS-client is not authorized. 1828 o manual configuration to allow a copy relationship between the 1829 source and destination is not needed. 1831 If the attempt to establish a "copy_confirm_auth" privilege fails, 1832 then when the user principal sends a COPY request to the destination, the 1833 destination server will reject it with NFS4ERR_PARTNER_NO_AUTH. 1835 4.4.1.2.4. Securing Non ONC RPC Server-to-Server Copy Protocols 1837 If the destination will not be using ONC RPC to copy the data, then the 1838 source and destination are using an unspecified copy protocol. The 1839 destination could use the shared secret and the NFSv4 user id to 1840 prove to the source server that the user principal has authorized the 1841 copy. 1843 For protocols that authenticate user names with passwords (e.g. HTTP 1844 [14] and FTP [15]), the NFSv4 user id could be used as the user name, 1845 and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared 1846 secret could be used as the user password or as input into non- 1847 password authentication methods like CHAP [16]. 1849 4.4.1.3. Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3 1851 ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the 1852 server-side copy offload operations described in this document. In 1853 particular, host-based ONC RPC security flavors such as AUTH_NONE and 1854 AUTH_SYS MAY be used. If a host-based security flavor is used, a 1855 minimal level of protection for the server-to-server copy protocol is 1856 possible. 1858 In the absence of strong security mechanisms such as RPCSEC_GSSv3, 1859 the challenge is how the source server and destination server 1860 identify themselves to each other, especially in the presence of 1861 multi-homed source and destination servers.
In a multi-homed 1862 environment, the destination server might not contact the source 1863 server from the same network address specified by the client in the 1864 COPY_NOTIFY. This can be overcome using the procedure described 1865 below. 1867 When the client sends the source server the COPY_NOTIFY operation, 1868 the source server may reply to the client with a list of target 1869 addresses, names, and/or URLs and assign them to the unique triple: 1870 <source fh, user ID, destination address>. If the destination uses 1871 one of these target netlocs to contact the source server, the source 1872 server will be able to uniquely identify the destination server, even 1873 if the destination server does not connect from the address specified 1874 by the client in COPY_NOTIFY. 1876 For example, suppose the network topology is as shown in Figure 3. 1877 If the source filehandle is 0x12345, the source server may respond to 1878 a COPY_NOTIFY for destination 10.11.78.56 with the URLs: 1880 nfs://10.11.78.18//_COPY/10.11.78.56/_FH/0x12345 1882 nfs://192.168.33.18//_COPY/10.11.78.56/_FH/0x12345 1884 The client will then send these URLs to the destination server in the 1885 COPY operation. Suppose that the 192.168.33.0/24 network is a high 1886 speed network and the destination server decides to transfer the file 1887 over this network. If the destination contacts the source server 1888 from 192.168.33.56 over this network using NFSv4.1, it does the 1889 following: 1891 COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP "10.11.78.56"; LOOKUP 1892 "_FH" ; OPEN "0x12345" ; GETFH } 1894 The source server will therefore know that these NFSv4.1 operations 1895 are being issued by the destination server identified in the 1896 COPY_NOTIFY. 1898 4.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 1900 The same techniques as Section 4.4.1.3, using unique URLs for each 1901 destination server, can be used for other protocols (e.g. HTTP [14] 1902 and FTP [15]) as well. 1904 5. Application Data Block Support 1906 At the OS level, files are stored in disk blocks. Applications 1907 are also free to impose structure on the data contained in a file and 1908 we can define an Application Data Block (ADB) to be such a structure. 1909 From the application's viewpoint, it only wants to handle ADBs and 1910 not raw bytes (see [17]). An ADB is typically composed of two 1911 sections: a header and data. The header describes the 1912 characteristics of the block and can provide a means to detect 1913 corruption in the data payload. The data section is typically 1914 initialized to all zeros. 1916 The format of the header is application specific, but there are two 1917 main components typically encountered: 1919 1. An ADB Number (ADBN), which allows the application to determine 1920 which data block is being referenced. The ADBN is a logical 1921 block number and is useful when the client is not storing the 1922 blocks in contiguous memory. 1924 2. Fields to describe the state of the ADB and a means to detect 1925 block corruption. For both pieces of data, a useful property is 1926 that the allowed values be unique in the sense that, if they are passed across the 1927 network, corruption due to translation between big and little 1928 endian architectures is detectable. For example, 0xF0DEDEF0 has 1929 the same bit pattern in both architectures. 1931 Applications already impose structures on files [17] and detect 1932 corruption in data blocks [18]. What they are not able to do is 1933 efficiently transfer and store ADBs.
To initialize a file with ADBs, 1934 the client must send the full ADB to the server and that must be 1935 stored on the server. When the application is initializing a file to 1936 have the ADB structure, it could compress the ADBs to just the 1937 information necessary to later reconstruct the header portion of 1938 the ADB when the contents are read back. Using sparse file 1939 techniques, the disk blocks described by the ADB would not be allocated. 1940 Unlike sparse file techniques, there would be a small cost to store 1941 the compressed header data. 1943 In this section, we are going to define a generic framework for an 1944 ADB, present one approach to detecting corruption in a given ADB 1945 implementation, and describe the model for how the client and server 1946 can support efficient initialization of ADBs, reading of ADB holes, 1947 punching holes in ADBs, and space reservation. Further, we need to 1948 be able to extend this model to applications which do not support 1949 ADBs, but wish to be able to handle sparse files, hole punching, and 1950 space reservation. 1952 5.1. Generic Framework 1954 We want the representation of the ADB to be flexible enough to 1955 support many different applications. The most basic approach is no 1956 imposition of a block at all, which means we are working with the raw 1957 bytes. Such an approach would be useful for storing holes, punching 1958 holes, etc. In more complex deployments, a server might be 1959 supporting multiple applications, each with its own definition of 1960 the ADB. One might store the ADBN at the start of the block and then 1961 have a guard pattern to detect corruption [19]. Another might store 1962 the ADBN at an offset of 100 bytes within the block and have no guard 1963 pattern at all. The point is that existing applications might 1964 already have well defined formats for their data blocks. 1966 The guard pattern can be used to represent the state of the block, to 1967 protect against corruption, or both. Again, it needs to be able to 1968 be placed anywhere within the ADB. 1970 We need to be able to represent the starting offset of the block and 1971 the size of the block. Note that nothing prevents the application 1972 from defining different sized blocks in a file. 1974 5.1.1. Data Block Representation 1976 struct app_data_block4 { 1977 offset4 adb_offset; 1978 length4 adb_block_size; 1979 length4 adb_block_count; 1980 length4 adb_reloff_blocknum; 1981 count4 adb_block_num; 1982 length4 adb_reloff_pattern; 1983 opaque adb_pattern<>; 1984 }; 1986 The app_data_block4 structure captures the abstraction presented for 1987 the ADB. The additional fields present are to allow the transmission 1988 of adb_block_count ADBs at one time. We also use adb_block_num to 1989 convey the ADBN of the first block in the sequence. Each ADB will 1990 contain the same adb_pattern string. 1992 As both adb_block_num and adb_pattern are optional, if either 1993 adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX, 1994 then the corresponding field is not set in any of the ADBs. 1996 5.1.2. Data Content 1998 /* 1999 * Use an enum such that we can extend new types. 2000 */ 2001 enum data_content4 { 2002 NFS4_CONTENT_DATA = 0, 2003 NFS4_CONTENT_APP_BLOCK = 1, 2004 NFS4_CONTENT_HOLE = 2 2005 }; 2007 New operations might need to differentiate between wanting to access 2008 data versus an ADB. Also, future minor versions might want to 2009 introduce new data formats. This enumeration allows that to occur. 2011 5.2.
Operation 64: INITIALIZE 2013 The server has no concept of the structure imposed by the 2014 application. It is only when the application writes to a section of 2015 the file does order get imposed. In order to detect corruption even 2016 before the application utilizes the file, the application will want 2017 to initialize a range of ADBs. It uses the INITIALIZE operation to 2018 do so. 2020 5.2.1. ARGUMENT 2022 /* 2023 * We use data_content4 in case we wish to 2024 * extend new types later. Note that we 2025 * are explicitly disallowing data. 2026 */ 2027 union initialize_arg4 switch (data_content4 content) { 2028 case NFS4_CONTENT_APP_BLOCK: 2029 app_data_block4 ia_adb; 2030 case NFS4_CONTENT_HOLE: 2031 length4 ia_hole_length; 2032 default: 2033 void; 2034 }; 2036 struct INITIALIZE4args { 2037 /* CURRENT_FH: file */ 2038 stateid4 ia_stateid; 2039 stable_how4 ia_stable; 2040 offset4 ia_offset; 2041 initialize_arg4 ia_data<>; 2042 }; 2044 5.2.2. RESULT 2046 struct INITIALIZE4resok { 2047 count4 ir_count; 2048 stable_how4 ir_committed; 2049 verifier4 ir_writeverf; 2050 data_content4 ir_sparse; 2051 }; 2053 union INITIALIZE4res switch (nfsstat4 status) { 2054 case NFS4_OK: 2055 INITIALIZE4resok resok4; 2056 default: 2057 void; 2058 }; 2060 5.2.3. DESCRIPTION 2062 When the client invokes the INITIALIZE operation, it has two desired 2063 results: 2065 1. The structure described by the app_data_block4 be imposed on the 2066 file. 2068 2. The contents described by the app_data_block4 be sparse. 2070 If the server supports the INITIALIZE operation, it still might not 2071 support sparse files. So if it receives the INITIALIZE operation, 2072 then it MUST populate the contents of the file with the initialized 2073 ADBs. In other words, if the server supports INITIALIZE, then it 2074 supports the concept of ADBs. [[Comment.1: Do we want to support an 2075 asynchronous INITIALIZE? Do we have to? --TH]] 2077 If the data was already initialized, There are two interesting 2078 scenarios: 2080 1. The data blocks are allocated. 2082 2. Initializing in the middle of an existing ADB. 2084 If the data blocks were already allocated, then the INITIALIZE is a 2085 hole punch operation. If INITIALIZE supports sparse files, then the 2086 data blocks are to be deallocated. If not, then the data blocks are 2087 to be rewritten in the indicated ADB format. [[Comment.2: Need to 2088 document interaction between space reservation and hole punching? 2089 --TH]] 2091 Since the server has no knowledge of ADBs, it should not report 2092 misaligned creation of ADBs. Even while it can detect them, it 2093 cannot disallow them, as the application might be in the process of 2094 changing the size of the ADBs. Thus the server must be prepared to 2095 handle an INITIALIZE into an existing ADB. 2097 This document does not mandate the manner in which the server stores 2098 ADBs sparsely for a file. It does assume that if ADBs are stored 2099 sparsely, then the server can detect when an INITIALIZE arrives that 2100 will force a new ADB to start inside an existing ADB. For example, 2101 assume that ADBi has a adb_block_size of 4k and that an INITIALIZE 2102 starts 1k inside ADBi. The server should [[Comment.3: Need to flesh 2103 this out. --TH]] 2105 5.3. Operation 65: READ_PLUS 2107 If the client sends a READ operation, it is explicitly stating that 2108 it is not supporting sparse files. So if a READ occurs on a sparse 2109 ADB, then the server must expand such ADBs to be raw bytes. 
If a 2110 READ occurs in the middle of an ADB, the server can only send back 2111 bytes starting from that offset. 2113 Such an operation is inefficient for transfer of sparse sections of 2114 the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead, 2115 a client should issue READ_PLUS. Note that as the client has no a 2116 priori knowledge of whether an ADB is present or not, it should 2117 always use READ_PLUS. 2119 5.3.1. ARGUMENT 2121 struct READ_PLUS4args { 2122 /* CURRENT_FH: file */ 2123 stateid4 rpa_stateid; 2124 offset4 rpa_offset; 2125 count4 rpa_count; 2126 }; 2128 5.3.2. RESULT 2130 union read_plus_content switch (data_content4 content) { 2131 case NFS4_CONTENT_DATA: 2132 opaque rpc_data<>; 2133 case NFS4_CONTENT_APP_BLOCK: 2134 app_data_block4 rpc_block; 2135 case NFS4_CONTENT_HOLE: 2136 length4 rpc_hole_length; 2137 default: 2138 void; 2139 }; 2141 /* 2142 * Allow a return of an array of contents. 2143 */ 2144 struct read_plus_res4 { 2145 bool rpr_eof; 2146 read_plus_content rpr_contents<>; 2147 }; 2149 union READ_PLUS4res switch (nfsstat4 status) { 2150 case NFS4_OK: 2151 read_plus_res4 resok4; 2152 default: 2153 void; 2154 }; 2156 5.3.3. DESCRIPTION 2158 Over the given range, READ_PLUS will return all data and ADBs found 2159 as an array of read_plus_content. It is possible to have consecutive 2160 ADBs in the array as either different definitions of ADBs are present 2161 or as the guard pattern changes. 2163 Edge cases exist for ADBs which either begin before the rpa_offset 2164 requested by the READ_PLUS or end after the rpa_count requested - 2165 both of which may occur as not all applications which access the file 2166 are aware of the main application imposing a format on the file 2167 contents, i.e., tar, dd, cp, etc. READ_PLUS MUST retrieve whole 2168 ADBs, but it need not retrieve an entire sequence of ADBs. 2170 The server MUST return a whole ADB; if it cannot, it must 2171 expand that partial ADB into data before it sends it to the client. E.g., if 2172 an ADB had a block size of 64k and the READ_PLUS was for 128k 2173 starting at an offset of 32k inside the ADB, then the first 32k would 2174 be converted to data. 2176 5.4. pNFS Considerations 2178 While this document does not mandate how sparse ADBs are recorded on 2179 the server, it does make the assumption that such information is not 2180 in the file. I.e., the information is metadata. As such, the 2181 INITIALIZE operation is defined not to be supported by the DS - it 2182 must be issued to the MDS. But since the client must not assume a 2183 priori whether a read is sparse or not, the READ_PLUS operation MUST 2184 be supported by both the DS and the MDS. I.e., the client might 2185 impose on the MDS to asynchronously read the data from the DS. 2187 Furthermore, each DS MUST NOT report to a client either a sparse ADB 2188 or data which belongs to another DS. One implication of this 2189 requirement is that the app_data_block4's adb_block_size MUST 2190 either be the stripe width or the stripe width must be an even 2191 multiple of it. 2193 The second implication here is that the DS must be able to use the 2194 Control Protocol to determine from the MDS where the sparse ADBs 2195 occur. [[Comment.4: Need to discuss what happens if after the file 2196 is being written to and an INITIALIZE occurs? --TH]] Perhaps instead 2197 of the DS pulling from the MDS, the MDS pushes to the DS? Thus an 2198 INITIALIZE causes a new push?
[[Comment.5: Still need to consider 2199 race cases of the DS getting a WRITE and the MDS getting an 2200 INITIALIZE. --TH]] 2202 5.5. An Example of Detecting Corruption 2204 In this section, we define an ADB format in which corruption can be 2205 detected. Note that this is just one possible format and means to 2206 detect corruption. 2208 Consider a very basic implementation of an operating system's disk 2209 blocks. A block is either data or it is an indirect block which 2210 allows for files to be larger than one block. It is desired to be 2211 able to initialize a block. Lastly, to quickly unlink a file, a 2212 block can be marked invalid. The contents remain intact - which 2213 would enable this OS application to undelete a file. 2215 The application defines 4k sized data blocks, with an 8 byte block 2216 counter occurring at offset 0 in the block, and with the guard 2217 pattern occurring at offset 8 inside the block. Furthermore, the 2218 guard pattern can take one of four states: 2220 0xfeedface - This is the FREE state and indicates that the ADB 2221 format has been applied. 2223 0xcafedead - This is the DATA state and indicates that real data 2224 has been written to this block. 2226 0xe4e5c001 - This is the INDIRECT state and indicates that the 2227 block contains block counter numbers that are chained off of this 2228 block. 2230 0xba1ed4a3 - This is the INVALID state and indicates that the block 2231 contains data whose contents are garbage. 2233 Finally, it also defines an 8 byte checksum [20] starting at byte 16 2234 which applies to the remaining contents of the block. If the state 2235 is FREE, then that checksum is trivially zero. As such, the 2236 application has no need to transfer the checksum implicitly inside 2237 the ADB - it need not make the transfer layer aware of the fact that 2238 there is a checksum (see [18] for an example of checksums used to 2239 detect corruption in application data blocks). 2241 Corruption in each ADB can be detected thusly: 2243 o If the guard pattern is anything other than one of the allowed 2244 values, including all zeros. 2246 o If the guard pattern is FREE and any other byte in the remainder 2247 of the ADB is anything other than zero. 2249 o If the guard pattern is anything other than FREE, then if the 2250 stored checksum does not match the computed checksum. 2252 o If the guard pattern is INDIRECT and one of the stored indirect 2253 block numbers has a value greater than the number of ADBs in the 2254 file. 2256 o If the guard pattern is INDIRECT and one of the stored indirect 2257 block numbers is a duplicate of another stored indirect block 2258 number. 2260 As can be seen, the application can detect errors based on the 2261 combination of the guard pattern state and the checksum. But also, 2262 the application can detect corruption based on the state and the 2263 contents of the ADB. This last point is important in validating the 2264 minimum amount of data we incorporated into our generic framework. 2265 I.e., the guard pattern is sufficient in allowing applications to 2266 design their own corruption detection. 2268 Finally, it is important to note that none of these corruption checks 2269 occur in the transport layer. The server and client components are 2270 totally unaware of the file format and might report everything as 2271 being transferred correctly even in the case the application detects 2272 corruption. 2274 5.6. 
Example of READ_PLUS 2276 The hypothetical application presented in Section 5.5 can be used to 2277 illustrate how READ_PLUS would return an array of results. A file is 2278 created and initialized with 100 4k ADBs in the FREE state: 2280 INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface} 2282 Further, assume the application writes a single ADB at 16k, changing 2283 the guard pattern to 0xcafedead, we would then have in memory: 2285 0 -> (16k - 1) : 4k, 4, 0, 0, 8, 0xfeedface 2286 16k -> (20k - 1) : 00 00 00 05 ca fe de ad XX XX ... XX XX 2287 20k -> 400k : 4k, 95, 0, 6, 0xfeedface 2289 And when the client did a READ_PLUS of 64k at the start of the file, 2290 it would get back a result of an ADB, some data, and a final ADB: 2292 ADB {0, 4, 0, 0, 8, 0xfeedface} 2293 data 4k 2294 ADB {20k, 4k, 59, 0, 6, 0xfeedface} 2296 5.7. Zero Filled Holes 2298 As applications are free to define the structure of an ADB, it is 2299 trivial to define an ADB which supports zero filled holes. Such a 2300 case would encompass the traditional definitions of a sparse file and 2301 hole punching. For example, to punch a 64k hole, starting at 100M, 2302 into an existing file which has no ADB structure: 2304 INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX, 2305 0, NFS4_UINT64_MAX, 0x0} 2307 6. Space Reservation 2309 6.1. Introduction 2311 This section describes a set of operations that allow applications 2312 such as hypervisors to reserve space for a file, report the amount of 2313 actual disk space a file occupies and freeup the backing space of a 2314 file when it is not required. 2316 In virtualized environments, virtual disk files are often stored on 2317 NFS mounted volumes. Since virtual disk files represent the hard 2318 disks of virtual machines, hypervisors often have to guarantee 2319 certain properties for the file. 2321 One such example is space reservation. When a hypervisor creates a 2322 virtual disk file, it often tries to preallocate the space for the 2323 file so that there are no future allocation related errors during the 2324 operation of the virtual machine. Such errors prevent a virtual 2325 machine from continuing execution and result in downtime. 2327 Another useful feature would be the ability to report the number of 2328 blocks that would be freed when a file is deleted. Currently, NFS 2329 reports two size attributes: 2331 size The logical file size of the file. 2333 space_used The size in bytes that the file occupies on disk 2335 While these attributes are sufficient for space accounting in 2336 traditional filesystems, they prove to be inadequate in modern 2337 filesystems that support block sharing. Having a way to tell the 2338 number of blocks that would be freed if the file was deleted would be 2339 useful to applications that wish to migrate files when a volume is 2340 low on space. 2342 Since virtual disks represent a hard drive in a virtual machine, a 2343 virtual disk can be viewed as a filesystem within a file. Since not 2344 all blocks within a filesystem are in use, there is an opportunity to 2345 reclaim blocks that are no longer in use. A call to deallocate 2346 blocks could result in better space efficiency. Lesser space MAY be 2347 consumed for backups after block deallocation. 2349 We propose the following operations and attributes for the 2350 aforementioned use cases: 2352 space_reserved This attribute specifies whether the blocks backing 2353 the file have been preallocated. 
2355 space_freed This attribute specifies the space freed when a file is 2356 deleted, taking block sharing into consideration. 2358 max_hole_punch This attribute specifies the maximum-sized hole that 2359 can be punched on the filesystem. 2361 HOLE_PUNCH This operation zeroes and/or deallocates the blocks 2362 backing a region of the file. 2364 6.2. Use Cases 2366 6.2.1. Space Reservation 2368 Some applications require that once a file of a certain size is 2369 created, writes to that file never fail with an out-of-space 2370 condition. One such example is that of a hypervisor writing to a 2371 virtual disk. An out-of-space condition while writing to virtual 2372 disks would mean that the virtual machine would need to be frozen. 2374 Currently, in order to achieve such a guarantee, applications zero 2375 the entire file. The initial zeroing allocates the backing blocks 2376 and all subsequent writes are overwrites of already allocated blocks. 2377 This approach is not only inefficient in terms of the amount of I/O 2378 done, it is also not guaranteed to work on filesystems that are log 2379 structured or deduplicated. An efficient way of guaranteeing space 2380 reservation would be beneficial to such applications. 2382 If the space_reserved attribute is set on a file, it is guaranteed 2383 that writes that do not grow the file will not fail with 2384 NFS4ERR_NOSPC. 2386 6.2.2. Space freed on deletes 2388 Currently, files in NFS have two size attributes: 2390 size The logical file size of the file. 2392 space_used The size in bytes that the file occupies on disk. 2394 While these attributes are sufficient for space accounting in 2395 traditional filesystems, they prove to be inadequate in modern 2396 filesystems that support block sharing. In such filesystems, 2397 multiple inodes can point to a single block with a block reference 2398 count to guard against premature freeing. 2400 If space_used of a file is interpreted to mean the size in bytes of 2401 all disk blocks pointed to by the inode of the file, then shared 2402 blocks get double counted, over-reporting the space utilization. 2403 This also has the adverse effect that the deletion of a file with 2404 shared blocks frees up less than space_used bytes. 2406 On the other hand, if space_used is interpreted to mean the size in 2407 bytes of those disk blocks unique to the inode of the file, then 2408 shared blocks are not counted in any file, resulting in under- 2409 reporting of the space utilization. 2411 For example, two files A and B have 10 blocks each. Let 6 of these 2412 blocks be shared between them. Thus, the combined space utilized by 2413 the two files is 14 * BLOCK_SIZE bytes. In the former case, the 2414 combined space utilization of the two files would be reported as 20 * 2415 BLOCK_SIZE. However, deleting either would only result in 4 * 2416 BLOCK_SIZE being freed. Conversely, the latter interpretation would 2417 report that the space utilization is only 8 * BLOCK_SIZE. 2419 Adding another size attribute, space_freed, is helpful in solving 2420 this problem. space_freed is the number of blocks that are allocated 2421 to the given file that would be freed on its deletion. In the 2422 example, both A and B would report space_freed as 4 * BLOCK_SIZE and 2423 space_used as 10 * BLOCK_SIZE. If A is deleted, B will report 2424 space_freed as 10 * BLOCK_SIZE as the deletion of B would result in 2425 the deallocation of all 10 blocks. 2427 The addition of this attribute does not solve the problem of space being 2428 over-reported.
However, over-reporting is better than under- 2429 reporting. 2431 6.2.3. Operations and attributes 2433 In the sections that follow, one operation and three attributes are 2434 defined that together provide the space management facilities 2435 outlined earlier in the document. The operation is intended to be 2436 OPTIONAL and the attributes RECOMMENDED as defined in Section 17 of 2437 [2]. 2439 6.2.4. Attribute 77: space_reserved 2441 The space_reserved attribute is a read/write attribute of type 2442 boolean. It is a per file attribute. When the space_reserved 2443 attribute is set via SETATTR, the server must ensure that there is 2444 disk space to accommodate every byte in the file before it can return 2445 success. If the server cannot guarantee this, it must return 2446 NFS4ERR_NOSPC. 2448 If the client tries to grow a file which has the space_reserved 2449 attribute set, the server must guarantee that there is disk space to 2450 accommodate every byte in the file with the new size before it can 2451 return success. If the server cannot guarantee this, it must return 2452 NFS4ERR_NOSPC. 2454 It is not required that the server allocate the space to the file 2455 before returning success. The allocation can be deferred; however, 2456 it must be guaranteed that it will not fail for lack of space. 2458 The value of space_reserved can be obtained at any time through 2459 GETATTR. 2461 In order to avoid ambiguity, the space_reserved bit cannot be set 2462 along with the size bit in SETATTR. Increasing the size of a file 2463 with space_reserved set will fail if space reservation cannot be 2464 guaranteed for the new size. If the file size is decreased, space 2465 reservation is only guaranteed for the new size and the extra blocks 2466 backing the file can be released. 2468 6.2.5. Attribute 78: space_freed 2470 space_freed gives the number of bytes freed if the file is deleted. 2471 This attribute is read only and is of type length4. It is a per file 2472 attribute. 2474 6.2.6. Attribute 79: max_hole_punch 2476 max_hole_punch specifies the maximum size of a hole that the 2477 HOLE_PUNCH operation can handle. This attribute is read only and of 2478 type length4. It is a per filesystem attribute. This attribute MUST 2479 be implemented if HOLE_PUNCH is implemented. 2481 6.2.7. Operation 64: HOLE_PUNCH - Zero and deallocate blocks backing 2482 the file in the specified range. 2484 WARNING: Most of this section is now obsolete. Parts of it need to 2485 be scavenged for the ADB discussion, but for the most part, it cannot 2486 be trusted. 2488 6.2.7.1. DESCRIPTION 2490 Whenever a client wishes to deallocate the blocks backing a 2491 particular region in the file, it calls the HOLE_PUNCH operation with 2492 the current filehandle set to the filehandle of the file in question, 2493 and the start offset and length in bytes of the region set in hpa_offset and 2494 hpa_count, respectively. All further reads to this region MUST return 2495 zeros until overwritten. The filehandle specified must be that of a 2496 regular file. 2498 Situations may arise where hpa_offset and/or hpa_offset + hpa_count 2499 will not be aligned to a boundary at which the server does allocations/ 2500 deallocations. For most filesystems, this is the block size of 2501 the file system. In such a case, the server can deallocate as many 2502 bytes as it can in the region. The blocks that cannot be deallocated 2503 MUST be zeroed.
2507 The server is not required to complete deallocating the blocks 2508 specified in the operation before returning. It is acceptable to 2509 have the deallocation be deferred. In fact, HOLE_PUNCH is merely a 2510 hint; it is valid for a server to return success without ever doing 2511 anything towards deallocating the blocks backing the region 2512 specified. However, any future reads to the region MUST return 2513 zeroes. 2515 HOLE_PUNCH will result in the space_used attribute being decreased by 2516 the number of bytes that were deallocated. The space_freed attribute 2517 may or may not decrease, depending on the support for the attribute and whether the 2518 blocks backing the specified range were shared or not. The size 2519 attribute will remain unchanged. 2521 The HOLE_PUNCH operation MUST NOT change the space reservation 2522 guarantee of the file. While the server can deallocate the blocks 2523 specified by hpa_offset and hpa_count, future writes to this region 2524 MUST NOT fail with NFS4ERR_NOSPC. 2526 The HOLE_PUNCH operation may fail for the following reasons (this is 2527 a partial list): 2529 NFS4ERR_NOTSUPP The HOLE_PUNCH operation is not supported by the 2530 NFS server receiving this request. 2532 NFS4ERR_ISDIR The current filehandle is of type NF4DIR. 2534 NFS4ERR_SYMLINK The current filehandle is of type NF4LNK. 2536 NFS4ERR_WRONG_TYPE The current filehandle does not designate an 2537 ordinary file. 2539 7. Sparse Files 2541 WARNING: Most of this section needs to be reworked because of the 2542 work going on in the ADB section. 2544 7.1. Introduction 2546 A sparse file is a common way of representing a large file without 2547 having to utilize all of the disk space for it. Consequently, a 2548 sparse file uses less physical space than its size indicates. This 2549 means the file contains 'holes', byte ranges within the file that 2550 contain no data. Most modern file systems support sparse files, 2551 including most UNIX file systems and NTFS, but notably not Apple's 2552 HFS+. Common examples of sparse files include Virtual Machine (VM) 2553 OS/disk images, database files, log files, and even checkpoint 2554 recovery files most commonly used by the HPC community. 2556 If an application reads a hole in a sparse file, the file system must 2557 return all zeros to the application. For local data access there is 2558 little penalty, but with NFS these zeroes must be transferred back to 2559 the client. If an application uses the NFS client to read data into 2560 memory, this wastes time and bandwidth as the application waits for 2561 the zeroes to be transferred. 2563 A sparse file is typically created by initializing the file to be all 2564 zeros - nothing is written to the data in the file; instead, the hole 2565 is recorded in the metadata for the file. So an 8G disk image might 2566 be represented initially by a couple hundred bits in the inode and 2567 nothing on the disk. If the VM then writes 100M to a file in the 2568 middle of the image, there would now be two holes represented in the 2569 metadata and 100M in the data. 2571 Other applications want to initialize a file to patterns other than 2572 zero. The problem with initializing to zero is that it is often 2573 difficult to distinguish a byte-range initialized to all zeroes 2574 from data corruption, since a pattern of zeroes is a probable pattern 2575 for corruption.
Instead, some applications, such as database 2576 management systems, use patterns consisting of bytes or words of non- 2577 zero values. 2579 Besides reading sparse files and initializing them, applications 2580 might want to hole punch, which is the deallocation of the data 2581 blocks which back a region of the file. At such time, the affected 2582 blocks are reinitialized to a pattern. 2584 This section introduces a new operation to read patterns from a file, 2585 READ_PLUS, and a new operation to both initialize patterns and to 2586 punch pattern holes into a file, WRITE_PLUS. READ_PLUS supports all 2587 the features of READ but includes an extension to support sparse 2588 pattern files. READ_PLUS is guaranteed to perform no worse than 2589 READ, and can dramatically improve performance with sparse files. 2590 READ_PLUS does not depend on pNFS protocol features, but can be used 2591 by pNFS to support sparse files. 2593 7.2. Terminology 2595 Regular file: An object of file type NF4REG or NF4NAMEDATTR. 2597 Sparse file: A Regular file that contains one or more Holes. 2599 Hole: A byte range within a Sparse file that contains regions of all 2600 zeroes. For block-based file systems, this could also be an 2601 unallocated region of the file. 2603 Hole Threshold: The minimum length of a Hole as determined by the 2604 server. If a server chooses to define a Hole Threshold, then it 2605 would not return hole information (nfs_readplusreshole) with a 2606 hole_offset and hole_length that specify a range shorter than the 2607 Hole Threshold. 2609 7.3. Applications and Sparse Files 2611 Applications may cause an NFS client to read holes in a file for 2612 several reasons. This section describes three different application 2613 workloads that cause the NFS client to transfer data unnecessarily. 2614 These workloads are simply examples, and there are probably many more 2615 workloads that are negatively impacted by sparse files. 2617 The first workload that can cause holes to be read is sequential 2618 reads within a sparse file. When this happens, the NFS client may 2619 perform read requests ("readahead") into sections of the file not 2620 explicitly requested by the application. Since the NFS client cannot 2621 differentiate between holes and non-holes, the NFS client may 2622 prefetch empty sections of the file. 2624 This workload is exemplified by Virtual Machines and their associated 2625 file system images, e.g., VMware .vmdk files, which are large sparse 2626 files encapsulating an entire operating system. If a VM reads files 2627 within the file system image, this will translate to sequential NFS 2628 read requests into the much larger file system image file. Since NFS 2629 does not understand the internals of the file system image, it ends 2630 up performing readahead of file holes. 2632 The second workload is generated by copying a file from a directory 2633 in NFS either to the same NFS server, to another file system, e.g., 2634 another NFS or Samba server, to a local ext3 file system, or even to a 2635 network socket. In this case, bandwidth and server resources are 2636 wasted as the entire file is transferred from the NFS server to the 2637 NFS client. Once a byte range of the file has been transferred to 2638 the client, it is up to the client application, e.g., rsync, cp, or scp, 2639 how it writes the data to the target location. For example, cp 2640 supports sparse files and will not write all zero regions, whereas 2641 scp does not support sparse files and will transfer every byte of the 2642 file.
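The sparse-aware behavior attributed to cp above can be illustrated with the following non-normative sketch of a copy loop that seeks over all-zero blocks instead of writing them; the transfer size is arbitrary and error handling is omitted for brevity.

      /*
       * Non-normative sketch: copy loop that preserves sparseness by
       * seeking over all-zero blocks instead of writing them.
       */
      #include <string.h>
      #include <unistd.h>

      #define COPY_BLOCK 65536               /* arbitrary transfer size */

      static int all_zero(const char *buf, size_t len)
      {
          return buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0;
      }

      void sparse_copy(int src_fd, int dst_fd)
      {
          char    buf[COPY_BLOCK];
          ssize_t n;

          while ((n = read(src_fd, buf, sizeof(buf))) > 0) {
              if (all_zero(buf, (size_t)n))
                  lseek(dst_fd, n, SEEK_CUR);    /* leave a hole behind */
              else
                  write(dst_fd, buf, (size_t)n);
          }
          /* A trailing hole still requires the destination file size to
           * be extended, e.g., via ftruncate(); omitted for brevity. */
      }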
2644 The third workload is generated by applications that do not utilize 2645 the NFS client cache, but instead use direct I/O and manage cached 2646 data independently, e.g., databases. These applications may perform 2647 whole file caching with sparse files, which would mean that even the 2648 holes will be transferred to the clients and cached. 2650 7.4. Overview of Sparse Files and NFSv4 2652 This proposal seeks to provide sparse file support to the largest 2653 number of NFS client and server implementations, and as such proposes 2654 to add a new return code to the new READ_PLUS operation 2655 instead of proposing additions or extensions of new or existing 2656 optional features (such as pNFS). 2658 As well, this document seeks to ensure that the proposed extensions 2659 are simple and do not transfer data between the client and server 2660 unnecessarily. For example, one possible way to implement sparse 2661 file read support would be to have the client, on the first hole 2662 encountered or at OPEN time, request a Data Region Map from the 2663 server. A Data Region Map would specify all zero and non-zero 2664 regions in a file. While this option seems simple, it is less useful 2665 and can become inefficient and cumbersome for several reasons: 2667 o Data Region Maps can be large, and transferring them can reduce 2668 overall read performance. For example, VMware's .vmdk files can 2669 have a file size of over 100 GBs and have a map well over several 2670 MBs. 2672 o Data Region Maps can change frequently, and become invalidated on 2673 every write to the file. NFSv4 has a single change attribute, 2674 which means any change to any region of a file will invalidate all 2675 Data Region Maps. This can result in the map being transferred 2676 multiple times with each update to the file. For example, a VM 2677 that updates a config file in its file system image would 2678 invalidate the Data Region Map not only for itself, but for all 2679 other clients accessing the same file system image. 2681 o Data Region Maps do not handle all zero-filled sections of the 2682 file, reducing the effectiveness of the solution. While it may be 2683 possible to modify the maps to handle zero-filled sections (at 2684 possibly great effort to the server), it is almost impossible with 2685 pNFS. With pNFS, the owner of the Data Region Map is the metadata 2686 server, which is not in the data path and has no knowledge of the 2687 contents of a data region. 2689 Another way to handle holes is compression, but this is not ideal since 2690 it requires all implementations to agree on a single compression 2691 algorithm and requires a fair amount of computational overhead. 2693 Note that supporting writing to a sparse file does not require 2694 changes to the protocol. Applications and/or NFS implementations can 2695 choose to ignore WRITE requests of all zeroes to the NFS server 2696 without consequence. 2698 7.5. Operation 65: READ_PLUS 2700 This section introduces a new read operation, named READ_PLUS, which 2701 allows NFS clients to avoid reading holes in a sparse file. 2702 READ_PLUS is guaranteed to perform no worse than READ, and can 2703 dramatically improve performance with sparse files.
2705 READ_PLUS supports all the features of the existing NFSv4.1 READ 2706 operation [2] and adds a simple yet significant extension to the 2707 format of its response. The change allows the server to avoid 2708 returning all zeroes for a file hole, which would waste computational and 2709 network resources and reduce performance. READ_PLUS uses a new 2710 result structure that tells the client that the result is all zeroes 2711 AND the byte-range of the hole in which the request was made. 2712 Returning the hole's byte-range, and only upon request, avoids 2713 transferring large Data Region Maps that may be soon invalidated and 2714 contain information about a file that may not even be read in its 2715 entirety. 2717 A new read operation is required due to NFSv4.1 minor versioning 2718 rules that do not allow modification of an existing operation's 2719 arguments or results. READ_PLUS is designed in such a way as to allow 2720 future extensions to the result structure. The same approach could 2721 be taken to extend the argument structure, but a good use case is 2722 first required to make such a change. 2724 7.5.1. ARGUMENT 2726 struct READ_PLUS4args { 2727 /* CURRENT_FH: file */ 2728 stateid4 rpa_stateid; 2729 offset4 rpa_offset; 2730 count4 rpa_count; 2731 }; 2733 7.5.2. RESULT 2735 union read_plus_content switch (data_content4 content) { 2736 case NFS4_CONTENT_DATA: 2737 opaque rpc_data<>; 2738 case NFS4_CONTENT_APP_BLOCK: 2739 app_data_block4 rpc_block; 2740 case NFS4_CONTENT_HOLE: 2741 length4 rpc_hole_length; 2742 default: 2743 void; 2744 }; 2746 /* 2747 * Allow a return of an array of contents. 2748 */ 2749 struct read_plus_res4 { 2750 bool rpr_eof; 2751 read_plus_content rpr_contents<>; 2752 }; 2754 union READ_PLUS4res switch (nfsstat4 status) { 2755 case NFS4_OK: 2756 read_plus_res4 resok4; 2757 default: 2758 void; 2759 }; 2761 7.5.3. DESCRIPTION 2763 The READ_PLUS operation is based upon the NFSv4.1 READ operation [2], 2764 and similarly reads data from the regular file identified by the 2765 current filehandle. 2767 The client provides an offset of where the READ_PLUS is to start and 2768 a count of how many bytes are to be read. An offset of zero means to 2769 read data starting at the beginning of the file. If offset is 2770 greater than or equal to the size of the file, the status NFS4_OK is 2771 returned with nfs_readplusrestype4 set to READ_OK, data length set to 2772 zero, and eof set to TRUE. The READ_PLUS is subject to access 2773 permissions checking. 2775 If the client specifies a count value of zero, the READ_PLUS succeeds 2776 and returns zero bytes of data, again subject to access permissions 2777 checking. In all situations, the server may choose to return fewer 2778 bytes than specified by the client. The client needs to check for 2779 this condition and handle the condition appropriately. 2781 If the client specifies an offset and count value that is entirely 2782 contained within a hole of the file, the status NFS4_OK is returned 2783 with nfs_readplusresok4 set to READ_HOLE, and if information is 2784 available regarding the hole, a nfs_readplusreshole structure 2785 containing the offset and range of the entire hole. The 2786 nfs_readplusreshole structure is considered valid until the file is 2787 changed (detected via the change attribute). The server MUST provide 2788 the same semantics for nfs_readplusreshole as if the client read the 2789 region and received zeroes; the implied hole contents' lifetime MUST 2790 be exactly the same as that of any other read data.
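The following non-normative C sketch shows how a client might flatten the array of contents defined in the RESULT above into a single buffer, substituting zeroes for hole segments. The structures are hand-written stand-ins for the decoded XDR types, not generated code, and the NFS4_CONTENT_APP_BLOCK arm is omitted for brevity.

      /*
       * Non-normative sketch: turning a decoded read_plus_res4 into a
       * flat buffer.  The structs below are C stand-ins for the XDR
       * types in the RESULT above; the APP_BLOCK arm is omitted.
       */
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      enum data_content { CONTENT_DATA, CONTENT_HOLE };

      struct read_plus_content {
          enum data_content  type;
          const void        *data;      /* valid for CONTENT_DATA       */
          uint64_t           length;    /* data or hole length in bytes */
      };

      struct read_plus_res {
          int                              eof;
          unsigned int                     nr_contents;
          const struct read_plus_content  *contents;
      };

      /* Copy the reply into 'buf', zero-filling holes; returns how many
       * bytes of the request were satisfied. */
      size_t materialize(const struct read_plus_res *res,
                         char *buf, size_t buflen)
      {
          size_t done = 0;

          for (unsigned int i = 0;
               i < res->nr_contents && done < buflen; i++) {
              const struct read_plus_content *c = &res->contents[i];
              uint64_t len = c->length;

              if (len > buflen - done)
                  len = buflen - done;
              if (c->type == CONTENT_DATA)
                  memcpy(buf + done, c->data, (size_t)len);  /* file data */
              else
                  memset(buf + done, 0, (size_t)len);  /* hole: zeroes */
              done += (size_t)len;
          }
          return done;
      }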
2792 If the client specifies an offset and count value that begins in a 2793 non-hole of the file but extends into a hole, the server should return a 2794 short read with status NFS4_OK, nfs_readplusresok4 set to READ_OK, 2795 and data length set to the number of bytes returned. The client will 2796 then issue another READ_PLUS for the remaining bytes, to which the 2797 server will respond with information about the hole in the file. 2799 If the server knows that the requested byte range falls within a hole of 2800 the file, but has no further information regarding the hole, it 2801 returns a nfs_readplusreshole structure with holeres4 set to 2802 HOLE_NOINFO. 2804 If hole information is available and can be returned to the client, 2805 the server returns a nfs_readplusreshole structure with the value of 2806 holeres4 set to HOLE_INFO. The values of hole_offset and hole_length 2807 define the byte-range for the current hole in the file. These values 2808 represent the information known to the server and may describe a 2809 byte-range smaller than the true size of the hole. 2811 Except when special stateids are used, the stateid value for a 2812 READ_PLUS request represents a value returned from a previous byte- 2813 range lock or share reservation request or the stateid associated 2814 with a delegation. The stateid identifies the associated owners, if 2815 any, and is used by the server to verify that the associated locks are 2816 still valid (e.g., have not been revoked). 2818 If the read ended at the end-of-file (formally, in a correctly formed 2819 READ_PLUS operation, if offset + count is equal to the size of the 2820 file), or the READ_PLUS operation extends beyond the size of the file 2821 (if offset + count is greater than the size of the file), eof is 2822 returned as TRUE; otherwise, it is FALSE. A successful READ_PLUS of 2823 an empty file will always return eof as TRUE. 2825 If the current filehandle is not an ordinary file, an error will be 2826 returned to the client. In the case that the current filehandle 2827 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 2828 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 2829 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 2831 For a READ_PLUS with a stateid value of all bits equal to zero, the 2832 server MAY allow the READ_PLUS to be serviced subject to mandatory 2833 byte-range locks or the current share deny modes for the file. For a 2834 READ_PLUS with a stateid value of all bits equal to one, the server 2835 MAY allow READ_PLUS operations to bypass locking checks at the 2836 server. 2838 On success, the current filehandle retains its value. 2840 7.5.4. IMPLEMENTATION 2842 If the server returns a "short read" (i.e., less data than requested 2843 and eof is set to FALSE), the client should send another READ_PLUS to 2844 get the remaining data. A server may return less data than requested 2845 under several circumstances. The file may have been truncated by 2846 another client or perhaps on the server itself, changing the file 2847 size from what the requesting client believes to be the case. This 2848 would reduce the actual amount of data available to the client. It 2849 is possible that the server has reduced the transfer size and so returns a 2850 short read result. Server resource exhaustion may also result in a 2851 short read.
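As a non-normative illustration of the short-read handling described above, the sketch below keeps issuing READ_PLUS requests until the caller's byte range has been satisfied or the server reports eof. The read_plus_once() helper is hypothetical; it stands for code that sends a single READ_PLUS and fills the buffer, zero-filling any hole segments it is told about.

      /*
       * Non-normative sketch: client loop that handles short READ_PLUS
       * replies.  read_plus_once() is a hypothetical wrapper that sends
       * one READ_PLUS, fills 'buf' (zero-filling hole segments), and
       * reports how many bytes it produced and whether eof was set.
       */
      #include <stddef.h>
      #include <stdint.h>

      extern size_t read_plus_once(int fh, uint64_t offset,
                                   char *buf, size_t count, int *eof);

      size_t read_plus_range(int fh, uint64_t offset,
                             char *buf, size_t count)
      {
          size_t done = 0;
          int    eof  = 0;

          while (done < count && !eof) {
              size_t n = read_plus_once(fh, offset + done,
                                        buf + done, count - done, &eof);
              if (n == 0 && !eof)
                  break;          /* defensive: avoid looping forever */
              done += n;
          }
          return done;            /* may be < count if eof was reached */
      }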
2853 If mandatory byte-range locking is in effect for the file, and if the 2854 byte-range corresponding to the data to be read from the file is 2855 WRITE_LT locked by an owner not associated with the stateid, the 2856 server will return the NFS4ERR_LOCKED error. The client should try 2857 to get the appropriate READ_LT via the LOCK operation before re- 2858 attempting the READ_PLUS. When the READ_PLUS completes, the client 2859 should release the byte-range lock via LOCKU. In addition, the 2860 server MUST return a nfs_readplusreshole structure with values of 2861 hole_offset and hole_length that are within the owner's locked byte 2862 range. 2864 If another client has an OPEN_DELEGATE_WRITE delegation for the file 2865 being read, the delegation must be recalled, and the operation cannot 2866 proceed until that delegation is returned or revoked. Except where 2867 this happens very quickly, one or more NFS4ERR_DELAY errors will be 2868 returned to requests made while the delegation remains outstanding. 2869 Normally, delegations will not be recalled as a result of a READ_PLUS 2870 operation since the recall will occur as a result of an earlier OPEN. 2871 However, since it is possible for a READ_PLUS to be done with a 2872 special stateid, the server needs to check for this case even though 2873 the client should have done an OPEN previously. 2875 7.5.4.1. Additional pNFS Implementation Information 2877 With pNFS, the semantics of using READ_PLUS remains the same. Any 2878 data server MAY return a READ_HOLE result for a READ_PLUS request 2879 that it receives. 2881 When a data server chooses to return a READ_HOLE result, it has the 2882 option of returning hole information for the data stored on that data 2883 server (as defined by the data layout), but it MUST NOT return a 2884 nfs_readplusreshole structure with a byte range that includes data 2885 managed by another data server. 2887 1. Data servers that cannot determine hole information SHOULD return 2888 HOLE_NOINFO. 2890 2. Data servers that can obtain hole information for the parts of 2891 the file stored on that data server SHOULD 2892 return HOLE_INFO and the byte range of the hole stored on that 2893 data server. 2895 A data server should do its best to return as much information about 2896 a hole as is feasible without having to contact the metadata server. 2897 If communication with the metadata server is required, then every 2898 attempt should be made to minimize the number of requests. 2900 If mandatory locking is enforced, then the data server must also 2901 ensure that it returns only information for a Hole that is within the 2902 owner's locked byte range. 2904 7.5.5. READ_PLUS with Sparse Files Example 2906 To see how the return value READ_HOLE will work, the following table 2907 describes a sparse file. For each byte range, the file contains 2908 either non-zero data or a hole. In addition, the server in this 2909 example uses a hole threshold of 32K. 2911 +-------------+----------+ 2912 | Byte-Range | Contents | 2913 +-------------+----------+ 2914 | 0-15999 | Hole | 2915 | 16K-31999 | Non-Zero | 2916 | 32K-255999 | Hole | 2917 | 256K-287999 | Non-Zero | 2918 | 288K-353999 | Hole | 2919 | 354K-417999 | Non-Zero | 2920 +-------------+----------+ 2922 Table 3 2924 Under the given circumstances, if a client were to read the file from 2925 beginning to end with a max read size of 64K, the following will be 2926 the result.
This assumes the client has already opened the file and 2927 acquired a valid stateid and just needs to issue READ_PLUS requests. 2929 1. READ_PLUS(s, 0, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof 2930 = false, data<>[32K]. Return a short read, as the last half of 2931 the request was all zeroes. Note that the first hole is read 2932 back as all zeros as it is below the hole threshold. 2934 2. READ_PLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE, 2935 nfs_readplusreshole(HOLE_INFO)(32K, 224K). The requested range 2936 was all zeros, and the current hole begins at offset 32K and is 2937 224K in length. 2939 3. READ_PLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK, 2940 eof = false, data<>[32K]. Return a short read, as the last half 2941 of the request was all zeroes. 2943 4. READ_PLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE, 2944 nfs_readplusreshole(HOLE_INFO)(288K, 66K). 2946 5. READ_PLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK, 2947 eof = true, data<>[64K]. 2949 7.6. Related Work 2951 Solaris and ZFS support an extension to lseek(2) that allows 2952 applications to discover holes in a file. The values, SEEK_HOLE and 2953 SEEK_DATA, allow clients to seek to the next hole or beginning of 2954 data, respectively. 2956 XFS supports the XFS_IOC_GETBMAP ioctl, which returns 2957 the Data Region Map for a file. Clients can then use this 2958 information to avoid reading holes in a file. 2960 NTFS and CIFS support the FSCTL_SET_SPARSE control code, which allows 2961 applications to control whether empty regions of the file are 2962 preallocated and filled in with zeros or simply left unallocated. 2964 7.7. Other Proposed Designs 2966 7.7.1. Multi-Data Server Hole Information 2968 The current design prohibits pNFS data servers from returning hole 2969 information for regions of a file that are not stored on that data 2970 server. Having data servers return information regarding other data 2971 servers changes the fundamental principle that all metadata 2972 information comes from the metadata server. 2974 Here is a brief description of how multi-data 2975 server hole information could be supported: 2977 A data server that can obtain hole information for the entire 2978 file without severe performance impact MAY return HOLE_INFO and 2979 the byte range of the entire file hole. When a pNFS client receives 2980 a READ_HOLE result and a non-empty nfs_readplusreshole structure, it 2981 MAY use this information in conjunction with a valid layout for the 2982 file to determine the next data server for the next region of data 2983 that is not in a hole. 2985 7.7.2. Data Result Array 2987 If a single read request contains one or more Holes with a length 2988 greater than the Sparse Threshold, the current design would return 2989 results indicating a short read to the client. A client would then 2990 send a series of read requests to the server to retrieve information 2991 for the Holes and the remaining data. To avoid turning a single read 2992 request into several exchanges between the client and server, the 2993 server may need to choose a relatively large Sparse Threshold in 2994 order to decrease the number of short reads it creates. A large 2995 Sparse Threshold may miss many smaller holes, which in turn may 2996 negate the benefits of sparse read support. 2998 To avoid this situation, one option is to have the READ_PLUS 2999 operation return information for multiple holes in a single return 3000 value.
This would allow several small holes to be described in a 3001 single read response without requiring multiple exchanges between 3002 the client and server. 3004 One important item to consider with returning an array of data chunks 3005 is its impact on RDMA, which may use different block sizes on the 3006 client and server (among other things). 3008 7.7.3. User-Defined Sparse Mask 3010 Add a mask (instead of just zeroes). Specified by the server or the client? 3012 7.7.4. Allocated flag 3014 A Hole on the server may be an allocated byte-range consisting of all 3015 zeroes or may not be allocated at all. To ensure this information is 3016 properly communicated to the client, it may be beneficial to add an 3017 'alloc' flag to the HOLE_INFO section of nfs_readplusreshole. This 3018 would allow an NFS client to copy a file from one file system to 3019 another and have it more closely resemble the original. 3021 7.7.5. Dense and Sparse pNFS File Layouts 3023 The hole information returned from a data server must be understood 3024 by pNFS clients using both the Dense and Sparse file layout types. Does 3025 the current READ_PLUS return value work for both layout types? Does 3026 the data server know if it is using dense or sparse so that it can 3027 return the correct hole_offset and hole_length values? 3029 8. Security Considerations 3031 9. IANA Considerations 3033 This section uses terms that are defined in [21]. 3035 10. References 3037 10.1. Normative References 3039 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 3040 Levels", March 1997. 3042 [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System 3043 (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, 3044 January 2010. 3046 [3] Haynes, T., "Network File System (NFS) Version 4 Minor Version 3047 2 External Data Representation Standard (XDR) Description", 3048 March 2011. 3050 [4] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel 3051 NFS (pNFS) Operations", RFC 5664, January 2010. 3053 [5] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 3054 Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, 3055 January 2005. 3057 [6] Williams, N., "Remote Procedure Call (RPC) Security Version 3", 3058 draft-williams-rpcsecgssv3 (work in progress), 2008. 3060 [7] Shepler, S., Eisler, M., and D. Noveck, "Network File System 3061 (NFS) Version 4 Minor Version 1 External Data Representation 3062 Standard (XDR) Description", RFC 5662, January 2010. 3064 [8] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS) 3065 Block/Volume Layout", RFC 5663, January 2010. 3067 [9] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 3068 Specification", RFC 2203, September 1997. 3070 10.2. Informative References 3072 [10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4 3073 Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress), 3074 March 2011. 3076 [11] Eisler, M., "XDR: External Data Representation Standard", 3077 RFC 4506, May 2006. 3079 [12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, 3080 "NSDB Protocol for Federated Filesystems", 3081 draft-ietf-nfsv4-federated-fs-protocol (Work In Progress), 3082 2010. 3084 [13] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, 3085 "Administration Protocol for Federated Filesystems", 3086 draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010. 3088 [14] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., 3089 Leach, P., and T.
Berners-Lee, "Hypertext Transfer Protocol -- 3090 HTTP/1.1", RFC 2616, June 1999. 3092 [15] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, 3093 RFC 959, October 1985. 3095 [16] Simpson, W., "PPP Challenge Handshake Authentication Protocol 3096 (CHAP)", RFC 1994, August 1996. 3098 [17] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of 3099 Oracle Database Concepts 11g Release 1 (11.1)", January 2011. 3101 [18] Ashdown, L., "Chapter 15, Validating Database Files and 3102 Backups, of Oracle Database Backup and Recovery User's Guide 3103 11g Release 1 (11.1)", August 2008. 3105 [19] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory 3106 Corruption of Solaris Internals", 2007. 3108 [20] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci- 3109 Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data 3110 Corruption in the Storage Stack", Proceedings of the 6th USENIX 3111 Symposium on File and Storage Technologies (FAST '08), 2008. 3113 [21] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA 3114 Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. 3116 [22] Nowicki, B., "NFS: Network File System Protocol specification", 3117 RFC 1094, March 1989. 3119 [23] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 3120 Protocol Specification", RFC 1813, June 1995. 3122 [24] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 3123 RFC 1833, August 1995. 3125 [25] Eisler, M., "NFS Version 2 and Version 3 Security Issues and 3126 the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", 3127 RFC 2623, June 1999. 3129 [26] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997. 3131 [27] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, 3132 June 1999. 3134 [28] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On- 3135 line Database", RFC 3232, January 2002. 3137 [29] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964, 3138 June 1996. 3140 [30] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 3141 C., Eisler, M., and D. Noveck, "Network File System (NFS) 3142 version 4 Protocol", RFC 3530, April 2003. 3144 Appendix A. Acknowledgments 3146 For the pNFS Access Permissions Check, the original draft was by 3147 Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work 3148 was influenced by discussions with Benny Halevy and Bruce Fields. A 3149 review was done by Tom Haynes. 3151 For the Sharing change attribute implementation details with NFSv4 3152 clients, the original draft was by Trond Myklebust. 3154 For the NFS Server-side Copy, the original draft was by James 3155 Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul 3156 Iyer. Talpey co-authored an unpublished version of that document. 3157 It was also reviewed by a number of individuals: Pranoop Erasani, 3158 Tom Haynes, Arthur Lent, Trond Myklebust, Dave Noveck, Theresa 3159 Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani, and Nico 3160 Williams. 3162 For the NFS space reservation operations, the original draft was by 3163 Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer. 3165 For the sparse file support, the original draft was by Dean 3166 Hildebrand and Marc Eshel. Valuable input and advice were received 3167 from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and 3168 Richard Scheffenegger. 3170 Appendix B.
RFC Editor Notes 3172 [RFC Editor: please remove this section prior to publishing this 3173 document as an RFC] 3175 [RFC Editor: prior to publishing this document as an RFC, please 3176 replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the 3177 RFC number of this document] 3179 Author's Address 3181 Thomas Haynes 3182 NetApp 3183 9110 E 66th St 3184 Tulsa, OK 74133 3185 USA 3187 Phone: +1 918 307 1415 3188 Email: thomas@netapp.com 3189 URI: http://www.tulsalabs.com