2 NFSv4 T. Haynes 3 Internet-Draft Editor 4 Intended status: Standards Track August 14, 2011 5 Expires: February 15, 2012 7 NFS Version 4 Minor Version 2 8 draft-ietf-nfsv4-minorversion2-03.txt 10 Abstract 12 This Internet-Draft describes NFS version 4 minor version two, 13 focusing mainly on the protocol extensions made from NFS version 4 14 minor version 0 and NFS version 4 minor version 1. Major extensions 15 introduced in NFS version 4 minor version two include: Server-side 16 Copy, Space Reservations, and Support for Sparse Files. 18 Requirements Language 20 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 21 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 22 document are to be interpreted as described in RFC 2119 [1]. 24 Status of this Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79.
29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at http://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on February 15, 2012. 41 Copyright Notice 43 Copyright (c) 2011 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (http://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 56 This document may contain material from IETF Documents or IETF 57 Contributions published or made publicly available before November 58 10, 2008. The person(s) controlling the copyright in some of this 59 material may not have granted the IETF Trust the right to allow 60 modifications of such material outside the IETF Standards Process. 61 Without obtaining an adequate license from the person(s) controlling 62 the copyright in such materials, this document may not be modified 63 outside the IETF Standards Process, and derivative works of it may 64 not be created outside the IETF Standards Process, except to format 65 it for publication as an RFC or to translate it into languages other 66 than English. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 71 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . . 6 72 1.2. Scope of This Document . . . . . . . . . . . . . . . . . . 6 73 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6 74 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . . 6 75 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . . 6 76 2. pNFS LAYOUTRETURN Error Handling . . . . . . . . . . . . . . . 6 77 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 6 78 2.2. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 7 79 2.2.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 7 80 2.2.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 7 81 2.2.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 7 82 2.2.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 8 83 3. Sharing change attribute implementation details with NFSv4 84 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 85 3.1. Abstract . . . . . . . . . . . . . . . . . . . . . . . . . 9 86 3.2. Introduction . . . . . . . . . . . . . . . . . . . . . . . 10 87 3.3. Definition of the 'change_attr_type' per-file system 88 attribute . . . . . . . . . . . . . . . . . . . . . . . . 10 89 4. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 11 90 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 12 91 4.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 12 92 4.2.1. Intra-Server Copy . . . . . 
. . . . . . . . . . . . . 14 93 4.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 15 94 4.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 18 95 4.3. Operations . . . . . . . . . . . . . . . . . . . . . . . . 20 96 4.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 20 97 4.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 21 98 4.4. Security Considerations . . . . . . . . . . . . . . . . . 21 99 4.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 21 100 5. Application Data Block Support . . . . . . . . . . . . . . . . 29 101 5.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 30 102 5.1.1. Data Block Representation . . . . . . . . . . . . . . 31 103 5.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 31 104 5.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 31 105 5.3. An Example of Detecting Corruption . . . . . . . . . . . . 32 106 5.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . . 34 107 5.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 34 108 6. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 34 109 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 34 110 6.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 35 111 6.2.1. Space Reservation . . . . . . . . . . . . . . . . . . 36 112 6.2.2. Space freed on deletes . . . . . . . . . . . . . . . . 36 113 6.2.3. Operations and attributes . . . . . . . . . . . . . . 37 114 6.2.4. Attribute 77: space_reserved . . . . . . . . . . . . . 37 115 6.2.5. Attribute 78: space_freed . . . . . . . . . . . . . . 38 116 6.2.6. Attribute 79: max_hole_punch . . . . . . . . . . . . . 38 117 6.2.7. Operation 64: HOLE_PUNCH - Zero and deallocate 118 blocks backing the file in the specified range. . . . 38 119 7. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 39 120 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 39 121 7.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 40 122 7.3. Applications and Sparse Files . . . . . . . . . . . . . . 41 123 7.4. Overview of Sparse Files and NFSv4 . . . . . . . . . . . . 42 124 7.5. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 43 125 7.5.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 43 126 7.5.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 44 127 7.5.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 44 128 7.5.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 46 129 7.5.5. READ_PLUS with Sparse Files Example . . . . . . . . . 47 130 7.6. Related Work . . . . . . . . . . . . . . . . . . . . . . . 48 131 7.7. Other Proposed Designs . . . . . . . . . . . . . . . . . . 48 132 7.7.1. Multi-Data Server Hole Information . . . . . . . . . . 48 133 7.7.2. Data Result Array . . . . . . . . . . . . . . . . . . 49 134 7.7.3. User-Defined Sparse Mask . . . . . . . . . . . . . . . 49 135 7.7.4. Allocated flag . . . . . . . . . . . . . . . . . . . . 49 136 7.7.5. Dense and Sparse pNFS File Layouts . . . . . . . . . . 50 137 8. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 50 138 8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 50 139 8.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 52 140 8.3. MAC Security Attribute . . . . . . . . . . . . . . . . . . 52 141 8.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 53 142 8.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 54 143 8.3.3. Permission Checking . . . . . . . . . . . . . . . . . 
54 144 8.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 55 145 8.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 55 146 8.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 55 147 8.4. Procedure 16: CB_ATTR_CHANGED - Notify Client that the 148 File's Attributes Changed . . . . . . . . . . . . . . . . 56 149 8.5. pNFS Considerations . . . . . . . . . . . . . . . . . . . 57 150 8.6. Discovery of Server LNFS Support . . . . . . . . . . . . . 57 151 8.7. MAC Security NFS Modes of Operation . . . . . . . . . . . 58 152 8.7.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 58 153 8.7.2. Smart Client Mode . . . . . . . . . . . . . . . . . . 59 154 8.7.3. Smart Server Mode . . . . . . . . . . . . . . . . . . 60 155 8.8. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 61 156 8.8.1. Full MAC labeling support for remotely mounted 157 filesystems . . . . . . . . . . . . . . . . . . . . . 61 158 8.8.2. MAC labeling of virtual machine images stored on 159 the network . . . . . . . . . . . . . . . . . . . . . 61 160 8.8.3. International Traffic in Arms Regulations (ITAR) . . . 62 161 8.8.4. Legal Hold/eDiscovery . . . . . . . . . . . . . . . . 62 162 8.8.5. Simple security label storage . . . . . . . . . . . . 63 163 8.8.6. Diskless Linux . . . . . . . . . . . . . . . . . . . . 63 164 8.8.7. Multi-Level Security . . . . . . . . . . . . . . . . . 64 165 8.9. Security Considerations . . . . . . . . . . . . . . . . . 65 166 9. Security Considerations . . . . . . . . . . . . . . . . . . . 66 167 10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 66 168 11. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 69 169 11.1. Operation 59: COPY - Initiate a server-side copy . . . . . 69 170 11.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . . 77 171 11.3. Operation 61: COPY_NOTIFY - Notify a source server of 172 a future copy . . . . . . . . . . . . . . . . . . . . . . 78 173 11.4. Operation 62: COPY_REVOKE - Revoke a destination 174 server's copy privileges . . . . . . . . . . . . . . . . . 80 175 11.5. Operation 63: COPY_STATUS - Poll for status of a 176 server-side copy . . . . . . . . . . . . . . . . . . . . . 81 177 11.6. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . . 83 178 11.7. Modification to Operation 42: EXCHANGE_ID - 179 Instantiate Client ID . . . . . . . . . . . . . . . . . . 85 180 11.8. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 86 181 12. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 88 182 12.1. Operation 15: CB_COPY - Report results of a 183 server-side copy . . . . . . . . . . . . . . . . . . . . . 88 184 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 89 185 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 89 186 14.1. Normative References . . . . . . . . . . . . . . . . . . . 89 187 14.2. Informative References . . . . . . . . . . . . . . . . . . 90 188 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 91 189 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 92 190 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 92 192 1. Introduction 194 1.1. The NFS Version 4 Minor Version 2 Protocol 196 The NFS version 4 minor version 2 (NFSv4.2) protocol is the third 197 minor version of the NFS version 4 (NFSv4) protocol. The first minor 198 version, NFSv4.0, is described in [10] and the second minor version, 199 NFSv4.1, is described in [2]. 
It follows the guidelines for minor 200 versioning that are listed in Section 11 of RFC 3530bis. 202 As a minor version, NFSv4.2 is consistent with the overall goals for 203 NFSv4, but extends the protocol so as to better meet those goals, 204 based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted 205 some additional goals, which motivate some of the major extensions in 206 NFSv4.2. 208 1.2. Scope of This Document 210 This document describes the NFSv4.2 protocol. With respect to 211 NFSv4.0 and NFSv4.1, this document does not: 213 o describe the NFSv4.0 or NFSv4.1 protocols, except where needed to 214 contrast with NFSv4.2. 216 o modify the specification of the NFSv4.0 or NFSv4.1 protocols. 218 o clarify the NFSv4.0 or NFSv4.1 protocols. 220 The full XDR for NFSv4.2 is presented in [3]. 222 1.3. NFSv4.2 Goals 224 1.4. Overview of NFSv4.2 Features 226 1.5. Differences from NFSv4.1 228 2. pNFS LAYOUTRETURN Error Handling 230 2.1. Introduction 232 In the pNFS description provided in [2], the client has no way to 233 relay an error code from the DS to the MDS. The specification of 234 the Objects-Based Layout protocol [4] uses the opaque 235 lrf_body field of the LAYOUTRETURN argument to relay such 236 error codes. In this section, we define a new data structure to 237 enable the passing of error codes back to the MDS and provide some 238 guidelines on what both the client and MDS should expect in such 239 circumstances. 241 There are two broad classes of errors, transient and persistent. The 242 client SHOULD strive to only use this new mechanism to report 243 persistent errors. It MUST be able to deal with transient issues by 244 itself. Also, while the client might consider an issue to be 245 persistent, it MUST be prepared for the MDS to consider such issues 246 to be expected. A prime example of this is if the MDS fences off a 247 client from either a stateid or a filehandle. The client will get an 248 error from the DS and might relay either NFS4ERR_ACCESS or 249 NFS4ERR_STALE_STATEID back to the MDS, with the belief that this is a 250 hard error. The MDS, on the other hand, is waiting for the client to 251 report such an error. For it, the mission is accomplished in that 252 the client has returned a layout that the MDS had most likely 253 recalled. 255 2.2. Changes to Operation 51: LAYOUTRETURN 257 The existing LAYOUTRETURN operation is extended by introducing a new 258 data structure to report errors, layoutreturn_device_error4. Also, 259 layoutreturn_error_report4 is introduced to enable an array of such errors 260 to be reported. 262 2.2.1. ARGUMENT 264 The ARGUMENT specification of the LAYOUTRETURN operation in section 265 18.44.1 of [2] is augmented by the following XDR code [11]: 267 struct layoutreturn_device_error4 { 268 deviceid4 lrde_deviceid; 269 nfsstat4 lrde_status; 270 nfs_opnum4 lrde_opnum; 271 }; 273 struct layoutreturn_error_report4 { 274 layoutreturn_device_error4 lrer_errors<>; 275 }; 277 2.2.2. RESULT 279 The RESULT of the LAYOUTRETURN operation is unchanged; see section 280 18.44.2 of [2]. 282 2.2.3. DESCRIPTION 284 The following text is added to the end of the LAYOUTRETURN operation 285 DESCRIPTION in section 18.44.3 of [2]. 287 When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, 288 then if the lrf_body field is NULL, it indicates to the MDS that the 289 client experienced no errors. If lrf_body is non-NULL, then the 290 field references error information which is layout type specific.
291 For example, the Objects-Based Layout protocol can continue to utilize 292 lrf_body as specified in [4]. For the Files-Based Layout, the 293 field references a layoutreturn_error_report4, which contains an 294 array of layoutreturn_device_error4. 296 Each individual layoutreturn_device_error4 describes a single error 297 associated with a DS, which is identified via lrde_deviceid. The 298 operation which returned the error is identified via lrde_opnum. 299 Finally, the NFS error value (nfsstat4) encountered is provided via 300 lrde_status and may consist of the following error codes: 302 NFS4_OK: No issues were found for this device. 304 NFS4ERR_NXIO: The client was unable to establish any communication 305 with the DS. 307 NFS4ERR_*: The client was able to establish communication with the 308 DS and is returning one of the allowed error codes for the 309 operation denoted by lrde_opnum. 311 2.2.4. IMPLEMENTATION 313 The following text is added to the end of the LAYOUTRETURN operation 314 IMPLEMENTATION in section 18.44.4 of [2]. 316 A client that expects to use pNFS for a mounted filesystem SHOULD 317 check for pNFS support at mount time. This check SHOULD be performed 318 by sending a GETDEVICELIST operation, followed by layout-type- 319 specific checks for accessibility of each storage device returned by 320 GETDEVICELIST. If the NFS server does not support pNFS, the 321 GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP 322 error; in this situation it is up to the client to determine whether 323 it is acceptable to proceed with NFS-only access. 325 Clients are expected to tolerate transient storage device errors, and 326 hence clients SHOULD NOT use the LAYOUTRETURN error handling for 327 device access problems that may be transient. The methods by which a 328 client decides whether an access problem is transient vs. persistent 329 are implementation-specific, but may include retrying I/Os to a data 330 server under appropriate conditions. 332 When an I/O fails to a storage device, the client SHOULD retry the 333 failed I/O via the MDS. In this situation, before retrying the I/O, 334 the client SHOULD return the layout, or the affected portion thereof, 335 and SHOULD indicate which storage device or devices were problematic. 336 If the client does not do this, the MDS may issue a layout recall 337 callback in order to perform the retried I/O. 339 The client needs to be cognizant that since this error handling is 340 optional in the MDS, the MDS may silently ignore this functionality. 341 Also, as the MDS may consider some issues the client reports to be 342 expected (see Section 2.1), the client might find it difficult to 343 detect an MDS which has not implemented error handling via 344 LAYOUTRETURN. 346 If an MDS is aware that a storage device is proving problematic to a 347 client, the MDS SHOULD NOT include that storage device in any pNFS 348 layouts sent to that client. If the MDS is aware that a storage 349 device is affecting many clients, then the MDS SHOULD NOT include 350 that storage device in any pNFS layouts sent out. Clients must still 351 be aware that the MDS might not have any choice in using the storage 352 device, i.e., there might only be one possible layout for the system. 354 Another interesting complication is that for existing files, the MDS 355 might have no choice in which storage devices to hand out to clients. 356 The MDS might try to restripe a file across a different storage 357 device, but clients need to be aware that not all implementations 358 have restriping support. 360 An MDS SHOULD react to a client return of layouts with errors by not 361 using the problematic storage devices in layouts for that client, but 362 the MDS is not required to indefinitely retain per-client storage 363 device error information. An MDS is also not required to 364 automatically reinstate use of a previously problematic storage 365 device; administrative intervention may be required instead. 367 A client MAY perform I/O via the MDS even when the client holds a 368 layout that covers the I/O; servers MUST support this client 369 behavior, and MAY recall layouts as needed to complete I/Os.
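The following is a minimal, purely illustrative sketch (in Python, which is not part of this specification) of the client-side division of labor described above: transient errors are retried locally, while persistent errors are accumulated as layoutreturn_device_error4-style entries for LAYOUTRETURN. The classification sets and helper names are assumptions of the example; the actual policy is implementation-specific.

   # Illustrative only: classify DS errors and build the entries that
   # would populate layoutreturn_error_report4's lrer_errors<> array.
   TRANSIENT = {"NFS4ERR_DELAY", "NFS4ERR_GRACE"}     # retry locally

   def record_ds_error(report, device_id, opnum, status):
       """Return True if the error was queued for LAYOUTRETURN; False
       if the caller should instead retry the I/O itself."""
       if status in TRANSIENT:
           return False                  # transient: do not report
       report.append({"lrde_deviceid": device_id,   # persistent:
                      "lrde_opnum": opnum,          # report to MDS
                      "lrde_status": status})
       return True

   errors = []
   record_ds_error(errors, device_id=7, opnum="OP_READ",
                   status="NFS4ERR_NXIO")
   print(errors)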
371 3. Sharing change attribute implementation details with NFSv4 clients 373 3.1. Abstract 375 This section describes an extension to the NFSv4 protocol that 376 allows the server to share information about the implementation of 377 its change attribute with the client. The aim is to improve the 378 client's ability to determine the order in which parallel updates to 379 the same file were processed. 381 3.2. Introduction 383 Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the 384 change attribute as being mandatory to implement, there is little in 385 the way of implementation guidance. The only feature that is mandated by the spec 386 is that the value must change whenever the file data or metadata 387 change. 389 While this allows for a wide range of implementations, it also leaves 390 the client with a conundrum: how does it determine which is the most 391 recent value for the change attribute in a case where several RPC 392 calls have been issued in parallel? In other words, if two COMPOUNDs, 393 both containing WRITE and GETATTR requests for the same file, have 394 been issued in parallel, how does the client determine which of the 395 two change attribute values returned in the replies to the GETATTR 396 requests corresponds to the most recent state of the file? In some 397 cases, the only recourse may be to send another COMPOUND containing a 398 third GETATTR that is fully serialized with the first two. 400 In order to avoid this kind of inefficiency, we propose a method to 401 allow the server to share details about how the change attribute is 402 expected to evolve, so that the client may immediately determine 403 which, out of the several change attribute values returned by the 404 server, is the most recent. 406 3.3. Definition of the 'change_attr_type' per-file system attribute 408 enum change_attr_typeinfo { 409 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, 410 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, 411 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, 412 NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, 413 NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 414 }; 416 +------------------+----+---------------------------+-----+ 417 | Name | Id | Data Type | Acc | 418 +------------------+----+---------------------------+-----+ 419 | change_attr_type | XX | enum change_attr_typeinfo | R | 420 +------------------+----+---------------------------+-----+ 422 The proposed solution is to enable the NFS server to provide 423 additional information about how it expects the change attribute 424 value to evolve after the file data or metadata has changed. To do 425 so, we define a new recommended attribute, 'change_attr_type', which 426 may take values from enum change_attr_typeinfo as follows: 428 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST 429 monotonically increase for every atomic change to the file 430 attributes, data or directory contents. 432 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER: The change attribute value MUST 433 be incremented by one unit for every atomic change to the file 434 attributes, data or directory contents. This property is 435 preserved when writing to pNFS data servers. 437 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute 438 value MUST be incremented by one unit for every atomic change to 439 the file attributes, data or directory contents. In the case 440 where the client is writing to pNFS data servers, the number of 441 increments is not guaranteed to exactly match the number of 442 writes. 444 NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is 445 implemented as suggested in the NFSv4 spec [10] in terms of the 446 time_metadata attribute. 448 NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take 449 values that fit into any of these categories. 451 If NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, 452 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or 453 NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at 454 the very least that the change attribute is monotonically increasing, 455 which is sufficient to resolve the question of which value is the 456 most recent. 458 If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then 459 by inspecting the value of the 'time_delta' attribute it additionally 460 has the option of detecting rogue server implementations that use 461 time_metadata in violation of the spec. 463 Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it 464 has the ability to predict what the resulting change attribute value 465 should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. 466 This again allows it to detect changes made in parallel by another 467 client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits 468 the same, but only if the client is not doing pNFS WRITEs.
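As a brief illustration of why the attribute helps, here is a hedged Python sketch (not part of the protocol) of a client choosing the most recent of several change attribute values returned by parallel GETATTRs; the helper name and fallback behavior are assumptions of the example.

   # If the server advertises a monotonic change_attr_type, the newest
   # value is simply the largest; otherwise the client must fall back
   # to a fully serialized GETATTR.
   MONOTONIC = {"NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR",
                "NFS4_CHANGE_TYPE_IS_VERSION_COUNTER",
                "NFS4_CHANGE_TYPE_IS_TIME_METADATA"}

   def most_recent(change_attr_type, observed):
       if change_attr_type in MONOTONIC:
           return max(observed)
       return None     # no ordering guarantee; serialize instead

   print(most_recent("NFS4_CHANGE_TYPE_IS_VERSION_COUNTER",
                     [41, 43, 42]))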
470 4. NFS Server-side Copy 471 4.1. Introduction 473 This section describes a server-side copy feature for the NFS 474 protocol. 476 The server-side copy feature provides a mechanism for the NFS client 477 to perform a file copy on the server without the data being 478 transmitted back and forth over the network. 480 Without this feature, an NFS client copies data from one location to 481 another by reading the data from the server over the network, and 482 then writing the data back over the network to the server. Using 483 this server-side copy operation, the client is able to instruct the 484 server to copy the data locally without the data being sent back and 485 forth over the network unnecessarily. 487 In general, this feature is useful whenever data is copied from one 488 location to another on the server. It is particularly useful when 489 copying the contents of a file from a backup. Backup versions of a 490 file are copied for a number of reasons, including restoring and 491 cloning data. 493 If the source object and destination object are on different file 494 servers, the file servers will communicate with one another to 495 perform the copy operation. The server-to-server protocol by which 496 this is accomplished is not defined in this document. 498 4.2.
Protocol Overview 500 The server-side copy offload operations support both intra-server and 501 inter-server file copies. An intra-server copy is a copy in which 502 the source file and destination file reside on the same server. In 503 an inter-server copy, the source file and destination file are on 504 different servers. In both cases, the copy may be performed 505 synchronously or asynchronously. 507 Throughout the rest of this document, we refer to the NFS server 508 containing the source file as the "source server" and the NFS server 509 to which the file is transferred as the "destination server". In the 510 case of an intra-server copy, the source server and destination 511 server are the same server. Therefore in the context of an intra- 512 server copy, the terms source server and destination server refer to 513 the single server performing the copy. 515 The operations described below are designed to copy files. Other 516 file system objects can be copied by building on these operations or 517 using other techniques. For example if the user wishes to copy a 518 directory, the client can synthesize a directory copy by first 519 creating the destination directory and then copying the source 520 directory's files to the new destination directory. If the user 521 wishes to copy a namespace junction [12] [13], the client can use the 522 ONC RPC Federated Filesystem protocol [13] to perform the copy. 523 Specifically the client can determine the source junction's 524 attributes using the FEDFS_LOOKUP_FSN procedure and create a 525 duplicate junction using the FEDFS_CREATE_JUNCTION procedure. 527 For the inter-server copy protocol, the operations are defined to be 528 compatible with a server-to-server copy protocol in which the 529 destination server reads the file data from the source server. This 530 model in which the file data is pulled from the source by the 531 destination has a number of advantages over a model in which the 532 source pushes the file data to the destination. The advantages of 533 the pull model include: 535 o The pull model only requires a remote server (i.e., the 536 destination server) to be granted read access. A push model 537 requires a remote server (i.e., the source server) to be granted 538 write access, which is more privileged. 540 o The pull model allows the destination server to stop reading if it 541 has run out of space. In a push model, the destination server 542 must flow control the source server in this situation. 544 o The pull model allows the destination server to easily flow 545 control the data stream by adjusting the size of its read 546 operations. In a push model, the destination server does not have 547 this ability. The source server in a push model is capable of 548 writing chunks larger than the destination server has requested in 549 attributes and session parameters. In theory, the destination 550 server could perform a "short" write in this situation, but this 551 approach is known to behave poorly in practice. 553 The following operations are provided to support server-side copy: 555 COPY_NOTIFY: For inter-server copies, the client sends this 556 operation to the source server to notify it of a future file copy 557 from a given destination server for the given user. 559 COPY_REVOKE: Also for inter-server copies, the client sends this 560 operation to the source server to revoke permission to copy a file 561 for the given user. 563 COPY: Used by the client to request a file copy. 
565 COPY_ABORT: Used by the client to abort an asynchronous file copy. 567 COPY_STATUS: Used by the client to poll the status of an 568 asynchronous file copy. 570 CB_COPY: Used by the destination server to report the results of an 571 asynchronous file copy to the client. 573 These operations are described in detail in Section 4.3. This 574 section provides an overview of how these operations are used to 575 perform server-side copies. 577 4.2.1. Intra-Server Copy 579 To copy a file on a single server, the client uses a COPY operation. 580 The server may respond to the copy operation with the final results 581 of the copy or it may perform the copy asynchronously and deliver the 582 results using a CB_COPY operation callback. If the copy is performed 583 asynchronously, the client may poll the status of the copy using 584 COPY_STATUS or cancel the copy using COPY_ABORT. 586 A synchronous intra-server copy is shown in Figure 1. In this 587 example, the NFS server chooses to perform the copy synchronously. 588 The copy operation is completed, either successfully or 589 unsuccessfully, before the server replies to the client's request. 590 The server's reply contains the final result of the operation. 592 Client Server 593 + + 594 | | 595 |--- COPY ---------------------------->| Client requests 596 |<------------------------------------/| a file copy 597 | | 598 | | 600 Figure 1: A synchronous intra-server copy. 602 An asynchronous intra-server copy is shown in Figure 2. In this 603 example, the NFS server performs the copy asynchronously. The 604 server's reply to the copy request indicates that the copy operation 605 was initiated and the final result will be delivered at a later time. 606 The server's reply also contains a copy stateid. The client may use 607 this copy stateid to poll for status information (as shown) or to 608 cancel the copy using a COPY_ABORT. When the server completes the 609 copy, the server performs a callback to the client and reports the 610 results. 612 Client Server 613 + + 614 | | 615 |--- COPY ---------------------------->| Client requests 616 |<------------------------------------/| a file copy 617 | | 618 | | 619 |--- COPY_STATUS --------------------->| Client may poll 620 |<------------------------------------/| for status 621 | | 622 | . | Multiple COPY_STATUS 623 | . | operations may be sent. 624 | . | 625 | | 626 |<-- CB_COPY --------------------------| Server reports results 627 |\------------------------------------>| 628 | | 630 Figure 2: An asynchronous intra-server copy.
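The client-side control flow of Figures 1 and 2 can be summarized in the following illustrative Python sketch; the two callables stand in for issuing COPY and COPY_STATUS COMPOUNDs, and the reply shapes are assumptions of the example (a real client would also handle CB_COPY and COPY_ABORT).

   import itertools, time

   def intra_server_copy(send_copy, send_copy_status,
                         poll_interval=0.01):
       reply = send_copy()                     # COPY
       if reply["done"]:                       # server chose sync copy
           return reply["result"]
       stateid = reply["copy_stateid"]         # server chose async copy
       while True:
           status = send_copy_status(stateid)  # poll with COPY_STATUS
           if status["done"]:
               return status["result"]
           time.sleep(poll_interval)

   # Toy demo: the copy completes on the second COPY_STATUS poll.
   polls = itertools.count()
   print(intra_server_copy(
       lambda: {"done": False, "copy_stateid": 0xABC},
       lambda sid: {"done": next(polls) >= 1, "result": "NFS4_OK"}))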
632 4.2.2. Inter-Server Copy 634 A copy may also be performed between two servers. The copy protocol 635 is designed to accommodate a variety of network topologies. As shown 636 in Figure 3, the client and servers may be connected by multiple 637 networks. In particular, the servers may be connected by a 638 specialized, high speed network (network 192.0.2.0/24 in the 639 diagram) that does not include the client. The protocol allows the 640 client to set up the copy between the servers (over network 641 198.51.100.0/24 in the diagram) and for the servers to communicate on 642 the high speed network if they choose to do so. 644 192.0.2.0/24 645 +-------------------------------------+ 646 | | 647 | | 648 | 192.0.2.18 | 192.0.2.56 649 +-------+------+ +------+------+ 650 | Source | | Destination | 651 +-------+------+ +------+------+ 652 | 198.51.100.18 | 198.51.100.56 653 | | 654 | | 655 | 198.51.100.0/24 | 656 +------------------+------------------+ 657 | 658 | 659 | 198.51.100.243 660 +-----+-----+ 661 | Client | 662 +-----------+ 664 Figure 3: An example inter-server network topology. 666 For an inter-server copy, the client notifies the source server that 667 a file will be copied by the destination server using a COPY_NOTIFY 668 operation. The client then initiates the copy by sending the COPY 669 operation to the destination server. The destination server may 670 perform the copy synchronously or asynchronously. 672 A synchronous inter-server copy is shown in Figure 4. In this case, 673 the destination server chooses to perform the copy before responding 674 to the client's COPY request. 676 An asynchronous copy is shown in Figure 5. In this case, the 677 destination server chooses to respond to the client's COPY request 678 immediately and then perform the copy asynchronously. 680 Client Source Destination 681 + + + 682 | | | 683 |--- COPY_NOTIFY --->| | 684 |<------------------/| | 685 | | | 686 | | | 687 |--- COPY ---------------------------->| 688 | | | 689 | | | 690 | |<----- read -----| 691 | |\--------------->| 692 | | | 693 | | . | Multiple reads may 694 | | . | be necessary 695 | | . | 696 | | | 697 | | | 698 |<------------------------------------/| Destination replies 699 | | | to COPY 701 Figure 4: A synchronous inter-server copy. 703 Client Source Destination 704 + + + 705 | | | 706 |--- COPY_NOTIFY --->| | 707 |<------------------/| | 708 | | | 709 | | | 710 |--- COPY ---------------------------->| 711 |<------------------------------------/| 712 | | | 713 | | | 714 | |<----- read -----| 715 | |\--------------->| 716 | | | 717 | | . | Multiple reads may 718 | | . | be necessary 719 | | . | 720 | | | 721 | | | 722 |--- COPY_STATUS --------------------->| Client may poll 723 |<------------------------------------/| for status 724 | | | 725 | | . | Multiple COPY_STATUS 726 | | . | operations may be sent 727 | | . | 728 | | | 729 | | | 730 | | | 731 |<-- CB_COPY --------------------------| Destination reports 732 |\------------------------------------>| results 733 | | | 735 Figure 5: An asynchronous inter-server copy. 737 4.2.3. Server-to-Server Copy Protocol 739 During an inter-server copy, the destination server reads the file 740 data from the source server. The source server and destination 741 server are not required to use a specific protocol to transfer the 742 file data. The choice of what protocol to use is ultimately the 743 destination server's decision. 745 4.2.3.1. Using NFSv4.x as a Server-to-Server Copy Protocol 747 The destination server MAY use standard NFSv4.x (where x >= 1) to 748 read the data from the source server. If NFSv4.x is used for the 749 server-to-server copy protocol, the destination server can use the 750 filehandle contained in the COPY request with standard NFSv4.x 751 operations to read data from the source server. Specifically, the 752 destination server may use the NFSv4.x OPEN operation's CLAIM_FH 753 facility to open the file being copied and obtain an open stateid. 754 Using the stateid, the destination server may then use NFSv4.x READ 755 operations to read the file.
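A hedged sketch of the destination server's read loop described above; open_claim_fh() and nfs_read() are hypothetical stand-ins for OPEN with CLAIM_FH and READ issued over NFSv4.x, and the chunk size is an arbitrary example value.

   def pull_file(source_fh, open_claim_fh, nfs_read, chunk=1 << 20):
       """Yield the source file's bytes by pulling them from the
       source server with standard NFSv4.x operations."""
       stateid = open_claim_fh(source_fh)      # OPEN by filehandle
       offset = 0
       while True:
           data, eof = nfs_read(stateid, offset, chunk)
           if data:
               yield data
           offset += len(data)
           if eof:
               return

   # Toy demo over an in-memory "file".
   blob = b"x" * (2 * 1024 * 1024 + 5)
   print(sum(map(len, pull_file(
       b"\x12\x34",
       lambda fh: "stateid-1",
       lambda sid, off, cnt: (blob[off:off + cnt],
                              off + cnt >= len(blob))))))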
757 4.2.3.2. Using an alternative Server-to-Server Copy Protocol 759 In a homogeneous environment, the source and destination servers 760 might be able to perform the file copy extremely efficiently using 761 specialized protocols. For example the source and destination 762 servers might be two nodes sharing a common file system format for 763 the source and destination file systems. Thus the source and 764 destination are in an ideal position to efficiently render the image 765 of the source file to the destination file by replicating the file 766 system formats at the block level. Another possibility is that the 767 source and destination might be two nodes sharing a common storage 768 area network, and thus there is no need to copy any data at all, and 769 instead ownership of the file and its contents might simply be re- 770 assigned to the destination. To allow for these possibilities, the 771 destination server is allowed to use a server-to-server copy protocol 772 of its choice. 774 In a heterogeneous environment, using a protocol other than NFSv4.x 775 (e.g., HTTP [14] or FTP [15]) presents some challenges. In 776 particular, the destination server is presented with the challenge of 777 accessing the source file given only an NFSv4.x filehandle. 779 One option for protocols that identify source files with path names 780 is to use an ASCII hexadecimal representation of the source 781 filehandle as the file name. 783 Another option for the source server is to use URLs to direct the 784 destination server to a specialized service. For example, the 785 response to COPY_NOTIFY could include the URL 786 ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII 787 hexadecimal representation of the source filehandle. When the 788 destination server receives the source server's URL, it would use 789 "_FH/0x12345" as the file name to pass to the FTP server listening on 790 port 9999 of s1.example.com. On port 9999 there would be a special 791 instance of the FTP service that understands how to convert NFS 792 filehandles to an open file descriptor (in many operating systems, 793 this would require a new system call, one which is the inverse of the 794 makefh() function that the pre-NFSv4 MOUNT service needs). 796 Authenticating and identifying the destination server to the source 797 server is also a challenge. Recommendations for how to accomplish 798 this are given in Section 4.4.1.2.4 and Section 4.4.1.4.
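The filehandle-to-name convention sketched above is simple enough to show in a few lines of illustrative Python; the URL shape follows the ftp://s1.example.com:9999/_FH/0x12345 example, and the helper itself is hypothetical.

   def source_file_url(filehandle: bytes, host: str, port: int) -> str:
       # ASCII hexadecimal rendering of the source filehandle,
       # embedded in a URL for the specialized copy service.
       return "ftp://%s:%d/_FH/0x%s" % (host, port, filehandle.hex())

   print(source_file_url(b"\x12\x34\x56", "s1.example.com", 9999))
   # -> ftp://s1.example.com:9999/_FH/0x123456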
800 4.3. Operations 802 In the sections that follow, several operations are defined that 803 together provide the server-side copy feature. These operations are 804 intended to be OPTIONAL operations as defined in section 17 of [2]. 805 The COPY_NOTIFY, COPY_REVOKE, COPY, COPY_ABORT, and COPY_STATUS 806 operations are designed to be sent within an NFSv4 COMPOUND 807 procedure. The CB_COPY operation is designed to be sent within an 808 NFSv4 CB_COMPOUND procedure. 810 Each operation is performed in the context of the user identified by 811 the ONC RPC credential of its containing COMPOUND or CB_COMPOUND 812 request. For example, a COPY_ABORT operation issued by a given user 813 indicates that a specified COPY operation initiated by the same user 814 be canceled. Therefore a COPY_ABORT MUST NOT interfere with a copy 815 of the same file initiated by another user. 817 An NFS server MAY allow an administrative user to monitor or cancel 818 copy operations using an implementation-specific interface. 820 4.3.1. netloc4 - Network Locations 822 The server-side copy operations specify network locations using the 823 netloc4 data type shown below: 825 enum netloc_type4 { 826 NL4_NAME = 0, 827 NL4_URL = 1, 828 NL4_NETADDR = 2 829 }; 830 union netloc4 switch (netloc_type4 nl_type) { 831 case NL4_NAME: utf8str_cis nl_name; 832 case NL4_URL: utf8str_cis nl_url; 833 case NL4_NETADDR: netaddr4 nl_addr; 834 }; 836 If the netloc4 is of type NL4_NAME, the nl_name field MUST be 837 specified as a UTF-8 string. The nl_name is expected to be resolved 838 to a network address via DNS, LDAP, NIS, /etc/hosts, or some other 839 means. If the netloc4 is of type NL4_URL, a server URL [5] 840 appropriate for the server-to-server copy operation is specified as a 841 UTF-8 string. If the netloc4 is of type NL4_NETADDR, the nl_addr 842 field MUST contain a valid netaddr4 as defined in Section 3.3.9 of 843 [2]. 845 When netloc4 values are used for an inter-server copy as shown in 846 Figure 3, their values may be evaluated on the source server, 847 destination server, and client. The network environment in which 848 these systems operate should be configured so that the netloc4 values 849 are interpreted as intended on each system. 851 4.3.2. Copy Offload Stateids 853 A server may perform a copy offload operation asynchronously. An 854 asynchronous copy is tracked using a copy offload stateid. Copy 855 offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS, 856 and CB_COPY operations. 858 Section 8.2.4 of [2] specifies that stateids are valid until either 859 (A) the client or server restarts or (B) the client returns the 860 resource. 862 A copy offload stateid will be valid until either (A) the client or 863 server restarts or (B) the client returns the resource by issuing a 864 COPY_ABORT operation or the client replies to a CB_COPY operation. 866 A copy offload stateid's seqid MUST NOT be 0 (zero). In the context 867 of a copy offload operation, it is ambiguous to indicate the most 868 recent copy offload operation using a stateid with seqid of 0 (zero). 869 Therefore a copy offload stateid with seqid of 0 (zero) MUST be 870 considered invalid.
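As a minimal illustration of the netloc4 type defined in Section 4.3.1, the sketch below (Python, illustrative only) dispatches on nl_type to obtain something a copy participant can connect to; resolving NL4_NAME via socket.gethostbyname() is just one example of the "other means" mentioned above, and URL handling is deliberately left abstract.

   import socket

   def resolve_netloc(nl_type, value):
       if nl_type == "NL4_NAME":           # host name: resolve it
           return socket.gethostbyname(value)
       if nl_type == "NL4_URL":            # server URL: use as given
           return value
       if nl_type == "NL4_NETADDR":        # already a netaddr4
           return value
       raise ValueError("unknown netloc_type4")

   print(resolve_netloc("NL4_NAME", "localhost"))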
872 4.4. Security Considerations 874 The security considerations pertaining to NFSv4 [10] apply to this 875 document. 877 The standard security mechanisms provided by NFSv4 [10] may be used 878 to secure the protocol described in this document. 880 NFSv4 clients and servers supporting the inter-server copy 881 operations described in this document are REQUIRED to implement [6], 882 including the RPCSEC_GSSv3 privileges copy_from_auth and 883 copy_to_auth. If the server-to-server copy protocol is ONC RPC 884 based, the servers are also REQUIRED to implement the RPCSEC_GSSv3 885 privilege copy_confirm_auth. These requirements to implement are not 886 requirements to use. NFSv4 clients and servers are RECOMMENDED to 887 use [6] to secure server-side copy operations. 889 4.4.1. Inter-Server Copy Security 891 4.4.1.1. Requirements for Secure Inter-Server Copy 893 Inter-server copy is driven by several requirements: 895 o The specification MUST NOT mandate an inter-server copy protocol. 896 There are many ways to copy data. Some will be more optimal than 897 others depending on the identities of the source server and 898 destination server. For example the source and destination 899 servers might be two nodes sharing a common file system format for 900 the source and destination file systems. Thus the source and 901 destination are in an ideal position to efficiently render the 902 image of the source file to the destination file by replicating 903 the file system formats at the block level. In other cases, the 904 source and destination might be two nodes sharing a common storage 905 area network, and thus there is no need to copy any data at all, 906 and instead ownership of the file and its contents simply gets re- 907 assigned to the destination. 909 o The specification MUST provide guidance for using NFSv4.x as a 910 copy protocol. For those source and destination servers willing 911 to use NFSv4.x there are specific security considerations that 912 this specification can and does address. 914 o The specification MUST NOT mandate pre-configuration between the 915 source and destination server. Requiring that the source and 916 destination first have a "copying relationship" increases the 917 administrative burden. However the specification MUST NOT 918 preclude implementations that require pre-configuration. 920 o The specification MUST NOT mandate a trust relationship between 921 the source and destination server. The NFSv4 security model 922 requires mutual authentication between a principal on an NFS 923 client and a principal on an NFS server. This model MUST continue 924 with the introduction of COPY. 926 4.4.1.2. Inter-Server Copy with RPCSEC_GSSv3 928 When the client sends a COPY_NOTIFY to the source server to tell it 929 to expect the destination server to attempt to copy data from it, the 930 copy is expected to be done on behalf of the principal (called the 931 "user principal") that sent the RPC request that encloses the COMPOUND 932 procedure containing the COPY_NOTIFY operation. The user principal 933 is identified by the RPC credentials. A mechanism is therefore 934 necessary that allows the user principal to authorize the destination 935 server to perform the copy, that lets the source server properly 936 authenticate the destination's copy requests, and that does not allow 937 the destination to exceed its authorization. 939 An approach that sends delegated credentials of the client's user 940 principal to the destination server is not used for the following 941 reasons. If the client's user principal delegated its credentials, the 942 destination would authenticate as the user principal. If the 943 destination were using the NFSv4 protocol to perform the copy, then 944 the source server would authenticate the destination server as the 945 user principal, and the file copy would securely proceed. However, 946 this approach would allow the destination server to copy other files. 947 The user principal would have to trust the destination server to not 948 do so. This is counter to the requirements, and therefore is not 949 considered. Instead, an approach using RPCSEC_GSSv3 [6] privileges is 950 proposed. 952 One of the stated applications of the proposed RPCSEC_GSSv3 protocol 953 is compound client host and user authentication [+ privilege 954 assertion]. For inter-server file copy, we require compound NFS 955 server host and user authentication [+ privilege assertion]. The 956 distinction between the two is one without meaning. 958 RPCSEC_GSSv3 introduces the notion of privileges. We define three 959 privileges:
961 copy_from_auth: A user principal is authorizing a source principal 962 ("nfs@<source>") to allow a destination principal 963 ("nfs@<destination>") to copy a file from the source to the 964 destination. This privilege is established on the source server 965 before the user principal sends a COPY_NOTIFY operation to the 966 source server. 967 struct copy_from_auth_priv { 968 secret4 cfap_shared_secret; 969 netloc4 cfap_destination; 970 /* the NFSv4 user name that the user principal maps to */ 971 utf8str_mixed cfap_username; 972 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 973 unsigned int cfap_seq_num; 974 }; 976 cfap_shared_secret is a secret value the user principal generates. 978 copy_to_auth: A user principal is authorizing a destination 979 principal ("nfs@<destination>") to allow it to copy a file from 980 the source to the destination. This privilege is established on 981 the destination server before the user principal sends a COPY 982 operation to the destination server. 984 struct copy_to_auth_priv { 985 /* equal to cfap_shared_secret */ 986 secret4 ctap_shared_secret; 987 netloc4 ctap_source; 988 /* the NFSv4 user name that the user principal maps to */ 989 utf8str_mixed ctap_username; 990 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 991 unsigned int ctap_seq_num; 992 }; 994 ctap_shared_secret is a secret value the user principal generated 995 and that was used to establish the copy_from_auth privilege with the 996 source principal. 998 copy_confirm_auth: A destination principal is confirming with the 999 source principal that it is authorized to copy data from the 1000 source on behalf of the user principal. When the inter-server 1001 copy protocol is NFSv4, or for that matter, any protocol capable 1002 of being secured via RPCSEC_GSSv3 (i.e., any ONC RPC protocol), 1003 this privilege is established before the file is copied from the 1004 source to the destination. 1006 struct copy_confirm_auth_priv { 1007 /* equal to GSS_GetMIC() of cfap_shared_secret */ 1008 opaque ccap_shared_secret_mic<>; 1009 /* the NFSv4 user name that the user principal maps to */ 1010 utf8str_mixed ccap_username; 1011 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 1012 unsigned int ccap_seq_num; 1013 };
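Before walking through the context establishment, the following illustrative Python sketch shows how a user principal might fill in the first two privilege payloads with a freshly generated shared secret; plain dictionaries stand in for XDR encoding, and GSS_Wrap()/GSS_GetMIC() protection is omitted. All names and values here are assumptions of the example.

   import os

   secret = os.urandom(32)             # cfap/ctap_shared_secret

   copy_from_auth_priv = {
       "cfap_shared_secret": secret,
       "cfap_destination": "dest.example.com",   # netloc4
       "cfap_username": "user@example.com",      # NFSv4 user name
       "cfap_seq_num": 1,   # seq_num of the RPCSEC_GSS3_CREATE cred
   }
   copy_to_auth_priv = {
       "ctap_shared_secret": secret,             # same secret
       "ctap_source": "source.example.com",
       "ctap_username": "user@example.com",
       "ctap_seq_num": 1,
   }
   assert (copy_from_auth_priv["cfap_shared_secret"] ==
           copy_to_auth_priv["ctap_shared_secret"])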
1015 4.4.1.2.1. Establishing a Security Context 1017 When the user principal wants to COPY a file between two servers, if 1018 it has not established copy_from_auth and copy_to_auth privileges on 1019 the servers, it establishes them: 1021 o The user principal generates a secret it will share with the two 1022 servers. This shared secret will be placed in the 1023 cfap_shared_secret and ctap_shared_secret fields of the 1024 appropriate privilege data types, copy_from_auth_priv and 1025 copy_to_auth_priv. 1027 o An instance of copy_from_auth_priv is filled in with the shared 1028 secret, the destination server, and the NFSv4 user id of the user 1029 principal. It will be sent with an RPCSEC_GSS3_CREATE procedure, 1030 and so cfap_seq_num is set to the seq_num of the credential of the 1031 RPCSEC_GSS3_CREATE procedure. Because cfap_shared_secret is a 1032 secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with 1033 privacy) is invoked on copy_from_auth_priv. The 1034 RPCSEC_GSS3_CREATE procedure's arguments are: 1036 struct { 1037 rpc_gss3_gss_binding *compound_binding; 1038 rpc_gss3_chan_binding *chan_binding_mic; 1039 rpc_gss3_assertion assertions<>; 1040 rpc_gss3_extension extensions<>; 1041 } rpc_gss3_create_args; 1043 The string "copy_from_auth" is placed in assertions[0].privs. The 1044 output of GSS_Wrap() is placed in extensions[0].data. The field 1045 extensions[0].critical is set to TRUE. The source server calls 1046 GSS_Unwrap() on the privilege, and verifies that the seq_num 1047 matches the credential. It then verifies that the NFSv4 user id 1048 being asserted matches the source server's mapping of the user 1049 principal. If it does, the privilege is established on the source 1050 server as: <"copy_from_auth", user id, destination>. The 1051 successful reply to RPCSEC_GSS3_CREATE has: 1053 struct { 1054 opaque handle<>; 1055 rpc_gss3_chan_binding *chan_binding_mic; 1056 rpc_gss3_assertion granted_assertions<>; 1057 rpc_gss3_assertion server_assertions<>; 1058 rpc_gss3_extension extensions<>; 1059 } rpc_gss3_create_res; 1061 The field "handle" is the RPCSEC_GSSv3 handle that the client will 1062 use on COPY_NOTIFY requests involving the source and destination 1063 server. granted_assertions[0].privs will be equal to 1064 "copy_from_auth". The server will return a GSS_Wrap() of 1065 copy_from_auth_priv. 1067 o An instance of copy_to_auth_priv is filled in with the shared 1068 secret, the source server, and the NFSv4 user id. It will be sent 1069 with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set 1070 to the seq_num of the credential of the RPCSEC_GSS3_CREATE 1071 procedure. Because ctap_shared_secret is a secret, after XDR 1072 encoding copy_to_auth_priv, GSS_Wrap() is invoked on 1073 copy_to_auth_priv. The RPCSEC_GSS3_CREATE procedure's arguments 1074 are: 1076 struct { 1077 rpc_gss3_gss_binding *compound_binding; 1078 rpc_gss3_chan_binding *chan_binding_mic; 1079 rpc_gss3_assertion assertions<>; 1080 rpc_gss3_extension extensions<>; 1081 } rpc_gss3_create_args; 1083 The string "copy_to_auth" is placed in assertions[0].privs. The 1084 output of GSS_Wrap() is placed in extensions[0].data. The field 1085 extensions[0].critical is set to TRUE. After unwrapping, 1086 verifying the seq_num, and the user principal to NFSv4 user ID 1087 mapping, the destination establishes a privilege of 1088 <"copy_to_auth", user id, source>. The successful reply to 1089 RPCSEC_GSS3_CREATE has: 1091 struct { 1092 opaque handle<>; 1093 rpc_gss3_chan_binding *chan_binding_mic; 1094 rpc_gss3_assertion granted_assertions<>; 1095 rpc_gss3_assertion server_assertions<>; 1096 rpc_gss3_extension extensions<>; 1097 } rpc_gss3_create_res; 1099 The field "handle" is the RPCSEC_GSSv3 handle that the client will 1100 use on COPY requests involving the source and destination server. 1101 The field granted_assertions[0].privs will be equal to 1102 "copy_to_auth". The server will return a GSS_Wrap() of 1103 copy_to_auth_priv. 1105 4.4.1.2.2. Starting a Secure Inter-Server Copy 1107 When the client sends a COPY_NOTIFY request to the source server, it 1108 uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle. 1109 cna_destination_server in COPY_NOTIFY MUST be the same as the name of 1110 the destination server specified in copy_from_auth_priv. Otherwise, 1111 COPY_NOTIFY will fail with NFS4ERR_ACCESS. The source server 1112 verifies that the privilege <"copy_from_auth", user id, destination> 1113 exists, and annotates it with the source filehandle, if the user 1114 principal has read access to the source file, and if administrative 1115 policies give the user principal and the NFS client read access to 1116 the source file (i.e., if the ACCESS operation would grant read 1117 access). Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS. 1119 When the client sends a COPY request to the destination server, it 1120 uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle.
1121  ca_source_server in COPY MUST be the same as the name of the source
1122  server specified in copy_to_auth_priv.  Otherwise, COPY will fail
1123  with NFS4ERR_ACCESS.  The destination server verifies that the
1124  privilege <"copy_to_auth", user id, source> exists, and annotates it
1125  with the source and destination filehandles.  If the client has
1126  failed to establish the "copy_to_auth" privilege, the destination
1127  server will reject the request with NFS4ERR_PARTNER_NO_AUTH.

1129  If the client sends a COPY_REVOKE to the source server to rescind the
1130  destination server's copy privilege, it uses the privileged
1131  "copy_from_auth" RPCSEC_GSSv3 handle and the cra_destination_server
1132  in COPY_REVOKE MUST be the same as the name of the destination server
1133  specified in copy_from_auth_priv.  The source server will then delete
1134  the <"copy_from_auth", user id, destination> privilege and fail any
1135  subsequent copy requests sent under the auspices of this privilege
1136  from the destination server.

1138  4.4.1.2.3.  Securing ONC RPC Server-to-Server Copy Protocols

1140  After a destination server has a "copy_to_auth" privilege established
1141  on it, and it receives a COPY request, if it knows it will use an ONC
1142  RPC protocol to copy data, it will establish a "copy_confirm_auth"
1143  privilege on the source server, using nfs@<destination> as the
1144  initiator principal, and nfs@<source> as the target principal.

1146  The value of the field ccap_shared_secret_mic is a GSS_GetMIC() of
1147  the shared secret passed in the copy_to_auth privilege.  The field
1148  ccap_username is the mapping of the user principal to an NFSv4 user
1149  name ("user"@"domain" form), and MUST be the same as ctap_username
1150  and cfap_username.  The field ccap_seq_num is the seq_num of the
1151  RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure the
1152  destination will send to the source server to establish the
1153  privilege.

1155  The source server verifies the privilege, and establishes a
1156  <"copy_confirm_auth", user id, destination> privilege.  If the source
1157  server fails to verify the privilege, the COPY operation will be
1158  rejected with NFS4ERR_PARTNER_NO_AUTH.  All subsequent ONC RPC
1159  requests sent from the destination to copy data from the source to
1160  the destination will use the RPCSEC_GSSv3 handle returned by the
1161  source's RPCSEC_GSS3_CREATE response.

1163  Note that the use of the "copy_confirm_auth" privilege accomplishes
1164  the following:

1166  o  if a protocol like NFS is being used, with export policies, the
1167     export policies can be overridden in case the destination server,
1168     acting as an NFS client, is not authorized

1170  o  manual configuration to allow a copy relationship between the
1171     source and destination is not needed.

1173  If the attempt to establish a "copy_confirm_auth" privilege fails,
1174  then when the user principal sends a COPY request to the destination,
1175  the destination server will reject it with NFS4ERR_PARTNER_NO_AUTH.

1177  4.4.1.2.4.  Securing Non ONC RPC Server-to-Server Copy Protocols

1179  If the destination won't be using ONC RPC to copy the data, then the
1180  source and destination are using an unspecified copy protocol.  The
1181  destination could use the shared secret and the NFSv4 user id to
1182  prove to the source server that the user principal has authorized the
1183  copy.
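One possibility, expanded upon below, is to present the shared secret
as an ASCII hexadecimal token.  The following non-normative C sketch
shows such a rendering; hex_secret() and its calling convention are
illustrative assumptions, not part of the protocol:

   #include <stddef.h>
   #include <stdint.h>

   /* Render a shared secret as ASCII hexadecimal (illustrative).
    * The output buffer must hold at least 2 * len + 1 bytes. */
   static void hex_secret(const uint8_t *secret, size_t len, char *out)
   {
       static const char hex[] = "0123456789abcdef";
       for (size_t i = 0; i < len; i++) {
           out[2 * i]     = hex[secret[i] >> 4];
           out[2 * i + 1] = hex[secret[i] & 0x0f];
       }
       out[2 * len] = '\0';
   }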
1185  For protocols that authenticate user names with passwords (e.g., HTTP
1186  [14] and FTP [15]), the NFSv4 user id could be used as the user name,
1187  and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared
1188  secret could be used as the user password or as input into non-
1189  password authentication methods like CHAP [16].

1191  4.4.1.3.  Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3

1193  ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the
1194  server-side copy offload operations described in this document.  In
1195  particular, host-based ONC RPC security flavors such as AUTH_NONE and
1196  AUTH_SYS MAY be used.  If a host-based security flavor is used, a
1197  minimal level of protection for the server-to-server copy protocol is
1198  possible.

1200  In the absence of strong security mechanisms such as RPCSEC_GSSv3,
1201  the challenge is how the source server and destination server
1202  identify themselves to each other, especially in the presence of
1203  multi-homed source and destination servers.  In a multi-homed
1204  environment, the destination server might not contact the source
1205  server from the same network address specified by the client in the
1206  COPY_NOTIFY.  This can be overcome using the procedure described
1207  below.

1209  When the client sends the source server the COPY_NOTIFY operation,
1210  the source server may reply to the client with a list of target
1211  addresses, names, and/or URLs and assign them to the unique triple:
1212  <source fh, user ID, destination address>.  If the destination uses
1213  one of these target netlocs to contact the source server, the source
1214  server will be able to uniquely identify the destination server, even
1215  if the destination server does not connect from the address specified
1216  by the client in COPY_NOTIFY.

1218  For example, suppose the network topology is as shown in Figure 3.
1219  If the source filehandle is 0x12345, the source server may respond to
1220  a COPY_NOTIFY for destination 192.0.2.56 with the URLs:

1222  nfs://192.0.2.18//_COPY/192.0.2.56/_FH/0x12345

1224  nfs://198.51.100.18//_COPY/192.0.2.56/_FH/0x12345

1226  The client will then send these URLs to the destination server in the
1227  COPY operation.  Suppose that the 198.51.100.0/24 network is a high
1228  speed network and the destination server decides to transfer the file
1229  over this network.  If the destination contacts the source server
1230  from 198.51.100.56 over this network using NFSv4.1, it does the
1231  following:

1233  COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP "192.0.2.56"; LOOKUP
1234  "_FH" ; OPEN "0x12345" ; GETFH }

1236  The source server will therefore know that these NFSv4.1 operations
1237  are being issued by the destination server identified in the
1238  COPY_NOTIFY.

1240  4.4.1.4.  Inter-Server Copy without ONC RPC and RPCSEC_GSSv3

1242  The same techniques as in Section 4.4.1.3, using unique URLs for each
1243  destination server, can be used for other protocols (e.g., HTTP [14]
1244  and FTP [15]) as well.
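As a non-normative sketch of the unique-URL technique under the _COPY
naming scheme shown above, a source server might generate one URL per
local network address for a given destination address and source
filehandle; make_copy_urls() is an illustrative assumption:

   #include <stdio.h>

   /* Emit one copy URL per source-server address (illustrative). */
   static void make_copy_urls(const char *const src_addrs[], int n,
                              const char *dst_addr, const char *fh)
   {
       for (int i = 0; i < n; i++)
           printf("nfs://%s//_COPY/%s/_FH/%s\n",
                  src_addrs[i], dst_addr, fh);
   }

   /* Example reproducing the URLs above:
    *   const char *a[] = { "192.0.2.18", "198.51.100.18" };
    *   make_copy_urls(a, 2, "192.0.2.56", "0x12345");
    */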
1246  5.  Application Data Block Support

1248  At the OS level, files are stored in disk blocks.  Applications
1249  are also free to impose structure on the data contained in a file and
1250  we can define an Application Data Block (ADB) to be such a structure.
1251  From the application's viewpoint, it only wants to handle ADBs and
1252  not raw bytes (see [17]).  An ADB is typically composed of two
1253  sections: a header and data.  The header describes the
1254  characteristics of the block and can provide a means to detect
1255  corruption in the data payload.  The data section is typically
1256  initialized to all zeros.

1258  The format of the header is application specific, but there are two
1259  main components typically encountered:

1261  1.  An ADB Number (ADBN), which allows the application to determine
1262      which data block is being referenced.  The ADBN is a logical
1263      block number and is useful when the client is not storing the
1264      blocks in contiguous memory.

1266  2.  Fields to describe the state of the ADB and a means to detect
1267      block corruption.  For both pieces of data, a useful property is
1268      that the allowed values be unique in that, if passed across the
1269      network, corruption due to translation between big and little
1270      endian architectures is detectable.  For example, 0xF0DEDEF0 has
1271      the same bit pattern in both architectures, as the sketch
      following this list demonstrates.
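A non-normative C sketch of that property: byte-swapping 0xF0DEDEF0,
as a big endian to little endian translation would, yields the same
word, so a correctly stored guard value reads back identically on
either architecture.

   #include <stdint.h>
   #include <stdio.h>

   /* Swap the byte order of a 32-bit word, as a cross-endian
    * transfer without conversion would. */
   static uint32_t bswap32(uint32_t v)
   {
       return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
              ((v << 8) & 0x00ff0000u) | (v << 24);
   }

   int main(void)
   {
       uint32_t guard = 0xf0dedef0u;
       /* Prints 1: the byte-symmetric pattern survives the swap. */
       printf("%d\n", bswap32(guard) == guard);
       return 0;
   }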
1273  Applications already impose structures on files [17] and detect
1274  corruption in data blocks [18].  What they are not able to do is
1275  efficiently transfer and store ADBs.  To initialize a file with ADBs,
1276  the client must send the full ADB to the server and that must be
1277  stored on the server.  When the application is initializing a file to
1278  have the ADB structure, it could compress the ADBs to just the
1279  information necessary to later reconstruct the header portion of
1280  the ADB when the contents are read back.  Using sparse file
1281  techniques, the disk blocks described by the ADBs would not be
1282  allocated.  Unlike sparse file techniques, there would be a small
1283  cost to store the compressed header data.

1285  In this section, we are going to define a generic framework for an
1286  ADB, present one approach to detecting corruption in a given ADB
1287  implementation, and describe the model for how the client and server
1288  can support efficient initialization of ADBs, reading of ADB holes,
1289  punching holes in ADBs, and space reservation.  Further, we need to
1290  be able to extend this model to applications which do not support
1291  ADBs, but wish to be able to handle sparse files, hole punching, and
1292  space reservation.

1294  5.1.  Generic Framework

1296  We want the representation of the ADB to be flexible enough to
1297  support many different applications.  The most basic approach is no
1298  imposition of a block at all, which means we are working with the raw
1299  bytes.  Such an approach would be useful for storing holes, punching
1300  holes, etc.  In more complex deployments, a server might be
1301  supporting multiple applications, each with their own definition of
1302  the ADB.  One might store the ADBN at the start of the block and then
1303  have a guard pattern to detect corruption [19].  The next might store
1304  the ADBN at an offset of 100 bytes within the block and have no guard
1305  pattern at all.  The point is that existing applications might
1306  already have well defined formats for their data blocks.

1308  The guard pattern can be used to represent the state of the block, to
1309  protect against corruption, or both.  Again, it needs to be able to
1310  be placed anywhere within the ADB.

1312  We need to be able to represent the starting offset of the block and
1313  the size of the block.  Note that nothing prevents the application
1314  from defining different sized blocks in a file.

1316  5.1.1.  Data Block Representation

1318  struct app_data_block4 {
1319      offset4     adb_offset;
1320      length4     adb_block_size;
1321      length4     adb_block_count;
1322      length4     adb_reloff_blocknum;
1323      count4      adb_block_num;
1324      length4     adb_reloff_pattern;
1325      opaque      adb_pattern<>;
1326  };

1328  The app_data_block4 structure captures the abstraction presented for
1329  the ADB.  The additional fields present are to allow the transmission
1330  of adb_block_count ADBs at one time.  We also use adb_block_num to
1331  convey the ADBN of the first block in the sequence.  Each ADB will
1332  contain the same adb_pattern string.

1334  As both adb_block_num and adb_pattern are optional, if either
1335  adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX,
1336  then the corresponding field is not set in any of the ADBs.

1338  5.1.2.  Data Content

1340  /*
1341   * Use an enum such that we can extend it with new types.
1342   */
1343  enum data_content4 {
1344      NFS4_CONTENT_DATA = 0,
1345      NFS4_CONTENT_APP_BLOCK = 1,
1346      NFS4_CONTENT_HOLE = 2
1347  };

1349  New operations might need to differentiate between wanting to access
1350  data versus an ADB.  Also, future minor versions might want to
1351  introduce new data formats.  This enumeration allows that to occur.

1353  5.2.  pNFS Considerations

1355  While this document does not mandate how sparse ADBs are recorded on
1356  the server, it does make the assumption that such information is not
1357  in the file.  I.e., the information is metadata.  As such, the
1358  INITIALIZE operation is defined to be not supported by the DS - it
1359  must be issued to the MDS.  But since the client must not assume a
1360  priori whether a read is sparse or not, the READ_PLUS operation MUST
1361  be supported by both the DS and the MDS.  I.e., servicing the request
1362  might require the MDS to asynchronously read the data from the DS.

1364  Furthermore, each DS MUST NOT report to a client either a sparse ADB
1365  or data which belongs to another DS.  One implication of this
1366  requirement is that the app_data_block4's adb_block_size MUST either
1367  be the stripe width or the stripe width MUST be an even
1368  multiple of it.

1370  The second implication here is that the DS must be able to use the
1371  Control Protocol to determine from the MDS where the sparse ADBs
1372  occur.  [[Comment.1: Need to discuss what happens if the file
1373  is being written to and an INITIALIZE occurs.  --TH]]  Perhaps instead
1374  of the DS pulling from the MDS, the MDS pushes to the DS?  Thus an
1375  INITIALIZE causes a new push?  [[Comment.2: Still need to consider
1376  race cases of the DS getting a WRITE and the MDS getting an
1377  INITIALIZE.  --TH]]
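The stripe-width constraint above can be stated compactly.  The
following non-normative helper returns true exactly when the
constraint holds; adb_size_ok() is not a protocol element:

   #include <stdbool.h>
   #include <stdint.h>

   /* True when adb_block_size equals the stripe width or the stripe
    * width is an even multiple of adb_block_size. */
   static bool adb_size_ok(uint64_t adb_block_size,
                           uint64_t stripe_width)
   {
       return adb_block_size != 0 &&
              stripe_width % adb_block_size == 0;
   }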
1379  5.3.  An Example of Detecting Corruption

1381  In this section, we define an ADB format in which corruption can be
1382  detected.  Note that this is just one possible format and means to
1383  detect corruption.

1385  Consider a very basic implementation of an operating system's disk
1386  blocks.  A block is either data or it is an indirect block which
1387  allows for files to be larger than one block.  It is desired to be
1388  able to initialize a block.  Lastly, to quickly unlink a file, a
1389  block can be marked invalid.  The contents remain intact - which
1390  would enable this OS application to undelete a file.

1392  The application defines 4k sized data blocks, with an 8 byte block
1393  counter occurring at offset 0 in the block, and with the guard
1394  pattern occurring at offset 8 inside the block.  Furthermore, the
1395  guard pattern can take one of four states:

1397  0xfeedface -  This is the FREE state and indicates that the ADB
1398     format has been applied.

1400  0xcafedead -  This is the DATA state and indicates that real data
1401     has been written to this block.

1403  0xe4e5c001 -  This is the INDIRECT state and indicates that the
1404     block contains block counter numbers that are chained off of this
1405     block.

1407  0xba1ed4a3 -  This is the INVALID state and indicates that the block
1408     contains data whose contents are garbage.

1410  Finally, it also defines an 8 byte checksum [20] starting at byte 16
1411  which applies to the remaining contents of the block.  If the state
1412  is FREE, then that checksum is trivially zero.  As such, the
1413  application has no need to transfer the checksum explicitly - it
1414  need not make the transfer layer aware of the fact that
1415  there is a checksum (see [18] for an example of checksums used to
1416  detect corruption in application data blocks).

1418  Corruption in each ADB can be detected as follows (a non-normative
      check function follows the list):

1420  o  If the guard pattern is anything other than one of the allowed
1421     values, including all zeros.

1423  o  If the guard pattern is FREE and any other byte in the remainder
1424     of the ADB is anything other than zero.

1426  o  If the guard pattern is anything other than FREE, then if the
1427     stored checksum does not match the computed checksum.

1429  o  If the guard pattern is INDIRECT and one of the stored indirect
1430     block numbers has a value greater than the number of ADBs in the
1431     file.

1433  o  If the guard pattern is INDIRECT and one of the stored indirect
1434     block numbers is a duplicate of another stored indirect block
1435     number.
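A minimal C sketch of the first three checks, assuming the layout
above (ADBN at offset 0, guard at offset 8, checksum at offset 16):
checksum64() stands in for the application's checksum algorithm, and
the INDIRECT checks are omitted because they require file-wide
context.

   #include <stdbool.h>
   #include <stdint.h>
   #include <string.h>

   #define ADB_SIZE       4096
   #define GUARD_FREE     0xfeedfaceu
   #define GUARD_DATA     0xcafedeadu
   #define GUARD_INDIRECT 0xe4e5c001u
   #define GUARD_INVALID  0xba1ed4a3u

   extern uint64_t checksum64(const uint8_t *buf, size_t len);

   static bool adb_is_corrupt(const uint8_t blk[ADB_SIZE])
   {
       uint32_t guard;
       uint64_t stored;

       memcpy(&guard, blk + 8, sizeof guard);
       memcpy(&stored, blk + 16, sizeof stored);

       if (guard != GUARD_FREE && guard != GUARD_DATA &&
           guard != GUARD_INDIRECT && guard != GUARD_INVALID)
           return true;              /* unknown guard value */

       if (guard == GUARD_FREE) {
           for (size_t i = 12; i < ADB_SIZE; i++)
               if (blk[i] != 0)
                   return true;      /* FREE blocks must be zero */
           return false;
       }

       /* Non-FREE: stored checksum must match the computed one. */
       return stored != checksum64(blk + 24, ADB_SIZE - 24);
   }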
1437  As can be seen, the application can detect errors based on the
1438  combination of the guard pattern state and the checksum.  But also,
1439  the application can detect corruption based on the state and the
1440  contents of the ADB.  This last point is important in validating the
1441  minimum amount of data we incorporated into our generic framework.
1442  I.e., the guard pattern is sufficient to allow applications to
1443  design their own corruption detection.

1445  Finally, it is important to note that none of these corruption checks
1446  occur in the transport layer.  The server and client components are
1447  totally unaware of the file format and might report everything as
1448  being transferred correctly even in cases where the application
1449  detects corruption.

1451  5.4.  Example of READ_PLUS

1453  The hypothetical application presented in Section 5.3 can be used to
1454  illustrate how READ_PLUS would return an array of results.  A file is
1455  created and initialized with 100 4k ADBs in the FREE state:

1457  INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface}

1459  Further, if the application then writes a single ADB at 16k, changing
1460  the guard pattern to 0xcafedead, we would have in memory:

1462  0 -> (16k - 1)   : 4k, 4, 0, 0, 8, 0xfeedface
1463  16k -> (20k - 1) : 00 00 00 00 00 00 00 04 ca fe de ad XX XX ... XX XX
1464  20k -> 400k      : 4k, 95, 0, 5, 8, 0xfeedface

1466  And when the client did a READ_PLUS of 64k at the start of the file,
1467  it would get back a result of an ADB, some data, and a final ADB:

1469  ADB {0, 4k, 4, 0, 0, 8, 0xfeedface}
1470  data 4k
1471  ADB {20k, 4k, 11, 0, 5, 8, 0xfeedface}

1473  5.5.  Zero Filled Holes

1475  As applications are free to define the structure of an ADB, it is
1476  trivial to define an ADB which supports zero filled holes.  Such a
1477  case would encompass the traditional definitions of a sparse file and
1478  hole punching.  For example, to punch a 64k hole, starting at 100M,
1479  into an existing file which has no ADB structure:

1481  INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX,
1482              0, NFS4_UINT64_MAX, 0x0}

1484  6.  Space Reservation

1486  6.1.  Introduction

1488  This section describes a set of operations that allow applications
1489  such as hypervisors to reserve space for a file, report the amount of
1490  actual disk space a file occupies, and free up the backing space of a
1491  file when it is not required.

1493  In virtualized environments, virtual disk files are often stored on
1494  NFS mounted volumes.  Since virtual disk files represent the hard
1495  disks of virtual machines, hypervisors often have to guarantee
1496  certain properties for the file.

1498  One such example is space reservation.  When a hypervisor creates a
1499  virtual disk file, it often tries to preallocate the space for the
1500  file so that there are no future allocation related errors during the
1501  operation of the virtual machine.  Such errors prevent a virtual
1502  machine from continuing execution and result in downtime.

1504  Another useful feature would be the ability to report the number of
1505  blocks that would be freed when a file is deleted.  Currently, NFS
1506  reports two size attributes:

1508  size  The logical file size of the file.

1510  space_used  The size in bytes that the file occupies on disk.

1512  While these attributes are sufficient for space accounting in
1513  traditional filesystems, they prove to be inadequate in modern
1514  filesystems that support block sharing.  Having a way to tell the
1515  number of blocks that would be freed if the file was deleted would be
1516  useful to applications that wish to migrate files when a volume is
1517  low on space.

1519  Since virtual disks represent a hard drive in a virtual machine, a
1520  virtual disk can be viewed as a filesystem within a file.  Since not
1521  all blocks within a filesystem are in use, there is an opportunity to
1522  reclaim blocks that are no longer in use.  A call to deallocate
1523  blocks could result in better space efficiency.  Less space may be
1524  consumed for backups after block deallocation.

1526  We propose the following operations and attributes for the
1527  aforementioned use cases:

1529  space_reserved  This attribute specifies whether the blocks backing
1530     the file have been preallocated.

1532  space_freed  This attribute specifies the space freed when a file is
1533     deleted, taking block sharing into consideration.

1535  max_hole_punch  This attribute specifies the maximum sized hole that
1536     can be punched on the filesystem.

1538  HOLE_PUNCH  This operation zeroes and/or deallocates the blocks
1539     backing a region of the file.

1541  6.2.  Use Cases

1542  6.2.1.  Space Reservation

1544  Some applications require that once a file of a certain size is
1545  created, writes to that file never fail with an out of space
1546  condition.  One such example is that of a hypervisor writing to a
1547  virtual disk.  An out of space condition while writing to virtual
1548  disks would mean that the virtual machine would need to be frozen.

1550  Currently, in order to achieve such a guarantee, applications zero
1551  the entire file.  The initial zeroing allocates the backing blocks
1552  and all subsequent writes are overwrites of already allocated blocks.
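A non-normative POSIX sketch of that workaround: every byte of the
file is written once, purely to force block allocation.  zero_fill()
is illustrative.

   #include <string.h>
   #include <sys/types.h>
   #include <unistd.h>

   /* Preallocate by writing zeroes over [0, size) - the costly
    * workaround that space_reserved is meant to replace. */
   static int zero_fill(int fd, off_t size)
   {
       char buf[65536];
       memset(buf, 0, sizeof buf);
       for (off_t off = 0; off < size; ) {
           size_t n = sizeof buf;
           if ((off_t)n > size - off)
               n = (size_t)(size - off);
           ssize_t w = pwrite(fd, buf, n, off);
           if (w < 0)
               return -1;
           off += w;
       }
       return 0;
   }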
1553  This approach is not only inefficient in terms of the amount of I/O
1554  done, it is also not guaranteed to work on filesystems that are log
1555  structured or deduplicated.  An efficient way of guaranteeing space
1556  reservation would be beneficial to such applications.

1558  If the space_reserved attribute is set on a file, it is guaranteed
1559  that writes that do not grow the file will not fail with
1560  NFS4ERR_NOSPC.

1562  6.2.2.  Space freed on deletes

1564  Currently, files in NFS have two size attributes:

1566  size  The logical file size of the file.

1568  space_used  The size in bytes that the file occupies on disk.

1570  While these attributes are sufficient for space accounting in
1571  traditional filesystems, they prove to be inadequate in modern
1572  filesystems that support block sharing.  In such filesystems,
1573  multiple inodes can point to a single block with a block reference
1574  count to guard against premature freeing.

1576  If space_used of a file is interpreted to mean the size in bytes of
1577  all disk blocks pointed to by the inode of the file, then shared
1578  blocks get double counted, over-reporting the space utilization.
1579  This also has the adverse effect that the deletion of a file with
1580  shared blocks frees up less than space_used bytes.

1582  On the other hand, if space_used is interpreted to mean the size in
1583  bytes of those disk blocks unique to the inode of the file, then
1584  shared blocks are not counted in any file, resulting in under-
1585  reporting of the space utilization.

1587  For example, two files A and B have 10 blocks each.  Let 6 of these
1588  blocks be shared between them.  Thus, the combined space utilized by
1589  the two files is 14 * BLOCK_SIZE bytes.  In the former case, the
1590  combined space utilization of the two files would be reported as 20 *
1591  BLOCK_SIZE.  However, deleting either file would only result in 4 *
1592  BLOCK_SIZE being freed.  Conversely, the latter interpretation would
1593  report that the space utilization is only 8 * BLOCK_SIZE.

1595  Adding another size attribute, space_freed, is helpful in solving
1596  this problem.  space_freed is the number of blocks that are allocated
1597  to the given file that would be freed on its deletion.  In the
1598  example, both A and B would report space_freed as 4 * BLOCK_SIZE and
1599  space_used as 10 * BLOCK_SIZE.  If A is deleted, B will report
1600  space_freed as 10 * BLOCK_SIZE as the deletion of B would result in
1601  the deallocation of all 10 blocks.

1603  The addition of this attribute does not solve the problem of space
1604  being over-reported.  However, over-reporting is better than under-
1605  reporting.
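The example can be restated as a non-normative toy computation over
per-block reference counts; the helpers and constants are
illustrative:

   #include <stdio.h>

   #define BLOCK_SIZE 4096L

   /* Files A and B each have 10 blocks; 6 are shared (refcount 2). */
   int main(void)
   {
       int shared = 6, unique = 4;
       int shared_refcnt = 2;

       long space_used  = (long)(shared + unique) * BLOCK_SIZE;
       /* space_freed counts only blocks whose refcount would drop
        * to zero when this file is deleted. */
       long space_freed = (long)unique * BLOCK_SIZE +
           (shared_refcnt == 1 ? (long)shared * BLOCK_SIZE : 0);

       printf("space_used  = %ld\n", space_used);  /* 10 * BLOCK_SIZE */
       printf("space_freed = %ld\n", space_freed); /*  4 * BLOCK_SIZE */
       return 0;
   }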
1607  6.2.3.  Operations and attributes

1609  In the sections that follow, one operation and three attributes are
1610  defined that together provide the space management facilities
1611  outlined earlier in the document.  The operation is intended to be
1612  OPTIONAL and the attributes RECOMMENDED as defined in section 17 of
1613  [2].

1615  6.2.4.  Attribute 77: space_reserved

1617  The space_reserved attribute is a read/write attribute of type
1618  boolean.  It is a per file attribute.  When the space_reserved
1619  attribute is set via SETATTR, the server must ensure that there is
1620  disk space to accommodate every byte in the file before it can return
1621  success.  If the server cannot guarantee this, it must return
1622  NFS4ERR_NOSPC.

1624  If the client tries to grow a file which has the space_reserved
1625  attribute set, the server must guarantee that there is disk space to
1626  accommodate every byte in the file with the new size before it can
1627  return success.  If the server cannot guarantee this, it must return
1628  NFS4ERR_NOSPC.

1630  It is not required that the server allocate the space to the file
1631  before returning success.  The allocation can be deferred; however,
1632  it must be guaranteed that it will not fail for lack of space.

1634  The value of space_reserved can be obtained at any time through
1635  GETATTR.

1637  In order to avoid ambiguity, the space_reserved bit cannot be set
1638  along with the size bit in SETATTR.  Increasing the size of a file
1639  with space_reserved set will fail if space reservation cannot be
1640  guaranteed for the new size.  If the file size is decreased, space
1641  reservation is only guaranteed for the new size and the extra blocks
1642  backing the file can be released.

1644  6.2.5.  Attribute 78: space_freed

1646  space_freed gives the number of bytes freed if the file is deleted.
1647  This attribute is read only and is of type length4.  It is a per file
1648  attribute.

1650  6.2.6.  Attribute 79: max_hole_punch

1652  max_hole_punch specifies the maximum size of a hole that the
1653  HOLE_PUNCH operation can handle.  This attribute is read only and of
1654  type length4.  It is a per filesystem attribute.  This attribute MUST
1655  be implemented if HOLE_PUNCH is implemented.

1657  6.2.7.  Operation 64: HOLE_PUNCH - Zero and deallocate blocks backing
1658          the file in the specified range.

1660  WARNING: Most of this section is now obsolete.  Parts of it need to
1661  be scavenged for the ADB discussion, but for the most part, it cannot
1662  be trusted.

1664  6.2.7.1.  DESCRIPTION

1666  Whenever a client wishes to deallocate the blocks backing a
1667  particular region in the file, it calls the HOLE_PUNCH operation with
1668  the current filehandle set to the filehandle of the file in question,
1669  and the start offset and length in bytes of the region set in
1670  hpa_offset and hpa_count, respectively.  All further reads to this
1671  region MUST return zeros until overwritten.  The filehandle
1672  specified must be that of a regular file.

1674  Situations may arise where hpa_offset and/or hpa_offset + hpa_count
1675  will not be aligned to a boundary that the server does allocations/
1676  deallocations in.  For most filesystems, this is the block size of
1677  the file system.  In such a case, the server can deallocate as many
1678  bytes as it can in the region.  The blocks that cannot be deallocated
1679  MUST be zeroed.  Except for the block deallocation and maximum hole
1680  punching capability, a HOLE_PUNCH operation is to be treated
1681  similarly to a write of zeroes.

1683  The server is not required to complete deallocating the blocks
1684  specified in the operation before returning.  It is acceptable to
1685  have the deallocation be deferred.  In fact, HOLE_PUNCH is merely a
1686  hint; it is valid for a server to return success without ever doing
1687  anything towards deallocating the blocks backing the region
1688  specified.  However, any future reads to the region MUST return
1689  zeroes.

1691  HOLE_PUNCH will result in the space_used attribute being decreased by
1692  the number of bytes that were deallocated.  The space_freed attribute
1693  may or may not decrease, depending on the support and whether the
1694  blocks backing the specified range were shared or not.  The size
1695  attribute will remain unchanged.

1697  The HOLE_PUNCH operation MUST NOT change the space reservation
1698  guarantee of the file.  While the server can deallocate the blocks
1699  specified by hpa_offset and hpa_count, future writes to this region
1700  MUST NOT fail with NFS4ERR_NOSPC.

1702  The HOLE_PUNCH operation may fail for the following reasons (this is
1703  a partial list):

1705  NFS4ERR_NOTSUPP  The HOLE_PUNCH operation is not supported by the
1706     NFS server receiving this request.

1708  NFS4ERR_ISDIR  The current filehandle is of type NF4DIR.

1710  NFS4ERR_SYMLINK  The current filehandle is of type NF4LNK.

1712  NFS4ERR_WRONG_TYPE  The current filehandle does not designate an
1713     ordinary file.
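To summarize, a minimal, hypothetical C rendering of the arguments
implied by the description (this is not normative XDR, particularly
given the warning above):

   #include <stdint.h>

   /* Hypothetical rendering of the arguments described above. */
   struct HOLE_PUNCH4args {
           /* CURRENT_FH: the regular file to punch */
           uint64_t hpa_offset;  /* first byte of the region */
           uint64_t hpa_count;   /* length of the region in bytes */
   };

   /* Post-condition: reads of [hpa_offset, hpa_offset + hpa_count)
    * return zeroes until overwritten; blocks that cannot be
    * deallocated are zeroed in place. */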
1715  7.  Sparse Files

1717  WARNING: Most of this section needs to be reworked because of the
1718  work going on in the ADB section.

1720  7.1.  Introduction

1722  A sparse file is a common way of representing a large file without
1723  having to utilize all of the disk space for it.  Consequently, a
1724  sparse file uses less physical space than its size indicates.  This
1725  means the file contains 'holes', byte ranges within the file that
1726  contain no data.  Most modern file systems support sparse files,
1727  including most UNIX file systems and NTFS, but notably not Apple's
1728  HFS+.  Common examples of sparse files include Virtual Machine (VM)
1729  OS/disk images, database files, log files, and even checkpoint
1730  recovery files most commonly used by the HPC community.

1732  If an application reads a hole in a sparse file, the file system must
1733  return all zeros to the application.  For local data access there is
1734  little penalty, but with NFS these zeroes must be transferred back to
1735  the client.  If an application uses the NFS client to read data into
1736  memory, this wastes time and bandwidth as the application waits for
1737  the zeroes to be transferred.

1739  A sparse file is typically created by initializing the file to be all
1740  zeros - nothing is written to the data in the file; instead the hole
1741  is recorded in the metadata for the file.  So an 8G disk image might
1742  be represented initially by a couple hundred bits in the inode and
1743  nothing on the disk.  If the VM then writes 100M to a file in the
1744  middle of the image, there would now be two holes represented in the
1745  metadata and 100M in the data.

1747  Other applications want to initialize a file to patterns other than
1748  zero.  The problem with initializing to zero is that it is often
1749  difficult to distinguish a byte-range initialized to all zeroes
1750  from data corruption, since a pattern of zeroes is a probable pattern
1751  for corruption.  Instead, some applications, such as database
1752  management systems, use patterns consisting of bytes or words of non-
1753  zero values.

1755  Besides reading sparse files and initializing them, applications
1756  might want to hole punch, which is the deallocation of the data
1757  blocks which back a region of the file.  At such time, the affected
1758  blocks are reinitialized to a pattern.

1760  This section introduces a new operation to read patterns from a file,
1761  READ_PLUS, and a new operation to both initialize patterns and to
1762  punch pattern holes into a file, WRITE_PLUS.  READ_PLUS supports all
1763  the features of READ but includes an extension to support sparse
1764  pattern files.  READ_PLUS is guaranteed to perform no worse than
1765  READ, and can dramatically improve performance with sparse files.
1766  READ_PLUS does not depend on pNFS protocol features, but can be used
1767  by pNFS to support sparse files.

1769  7.2.  Terminology

1771  Regular file:  An object of file type NF4REG or NF4NAMEDATTR.

1773  Sparse file:  A Regular file that contains one or more Holes.

1775  Hole:  A byte range within a Sparse file that contains regions of all
1776     zeroes.  For block-based file systems, this could also be an
1777     unallocated region of the file.

1779  Hole Threshold:  The minimum length of a Hole as determined by the
1780     server.  If a server chooses to define a Hole Threshold, then it
1781     would not return hole information (nfs_readplusreshole) with a
1782     hole_offset and hole_length that specify a range shorter than the
1783     Hole Threshold.

1785  7.3.  Applications and Sparse Files

1787  Applications may cause an NFS client to read holes in a file for
1788  several reasons.  This section describes three different application
1789  workloads that cause the NFS client to transfer data unnecessarily.
1790  These workloads are simply examples, and there are probably many more
1791  workloads that are negatively impacted by sparse files.

1793  The first workload that can cause holes to be read is sequential
1794  reads within a sparse file.  When this happens, the NFS client may
1795  perform read requests ("readahead") into sections of the file not
1796  explicitly requested by the application.  Since the NFS client cannot
1797  differentiate between holes and non-holes, the NFS client may
1798  prefetch empty sections of the file.

1800  This workload is exemplified by Virtual Machines and their associated
1801  file system images, e.g., VMware .vmdk files, which are large sparse
1802  files encapsulating an entire operating system.  If a VM reads files
1803  within the file system image, this will translate to sequential NFS
1804  read requests into the much larger file system image file.  Since NFS
1805  does not understand the internals of the file system image, it ends
1806  up performing readahead of file holes.

1808  The second workload is generated by copying a file from a directory
1809  in NFS to the same NFS server, to another file system, e.g.,
1810  another NFS or Samba server, to a local ext3 file system, or even to
1811  a network socket.  In this case, bandwidth and server resources are
1812  wasted as the entire file is transferred from the NFS server to the
1813  NFS client.  Once a byte range of the file has been transferred to
1814  the client, it is up to the client application, e.g., rsync, cp, or
1815  scp, how it writes the data to the target location.  For example, cp
1816  supports sparse files and will not write all zero regions, whereas
1817  scp does not support sparse files and will transfer every byte of the
1818  file.

1820  The third workload is generated by applications that do not utilize
1821  the NFS client cache, but instead use direct I/O and manage cached
1822  data independently, e.g., databases.  These applications may perform
1823  whole file caching with sparse files, which would mean that even the
1824  holes will be transferred to the clients and cached.

1826  7.4.  Overview of Sparse Files and NFSv4

1828  This proposal seeks to provide sparse file support to the largest
1829  number of NFS client and server implementations, and as such proposes
1830  to add a new return code to the mandatory NFSv4.1 READ_PLUS operation
1831  instead of proposing additions or extensions of new or existing
1832  optional features (such as pNFS).
1834  As well, this document seeks to ensure that the proposed extensions
1835  are simple and do not transfer data between the client and server
1836  unnecessarily.  For example, one possible way to implement sparse
1837  file read support would be to have the client, on the first hole
1838  encountered or at OPEN time, request a Data Region Map from the
1839  server.  A Data Region Map would specify all zero and non-zero
1840  regions in a file.  While this option seems simple, it is less useful
1841  and can become inefficient and cumbersome for several reasons:

1843  o  Data Region Maps can be large, and transferring them can reduce
1844     overall read performance.  For example, VMware's .vmdk files can
1845     have a file size of over 100 GBs and have a map well over several
1846     MBs.

1848  o  Data Region Maps can change frequently, and become invalidated on
1849     every write to the file.  NFSv4 has a single change attribute,
1850     which means any change to any region of a file will invalidate all
1851     Data Region Maps.  This can result in the map being transferred
1852     multiple times with each update to the file.  For example, a VM
1853     that updates a config file in its file system image would
1854     invalidate the Data Region Map not only for itself, but for all
1855     other clients accessing the same file system image.

1857  o  Data Region Maps do not handle all zero-filled sections of the
1858     file, reducing the effectiveness of the solution.  While it may be
1859     possible to modify the maps to handle zero-filled sections (at
1860     possibly great effort to the server), it is almost impossible with
1861     pNFS.  With pNFS, the owner of the Data Region Map is the metadata
1862     server, which is not in the data path and has no knowledge of the
1863     contents of a data region.

1865  Another way to handle holes is compression, but this is not ideal
1866  since it requires all implementations to agree on a single
1867  compression algorithm and requires a fair amount of computational
      overhead.

1869  Note that supporting writing to a sparse file does not require
1870  changes to the protocol.  Applications and/or NFS implementations can
1871  choose to ignore WRITE requests of all zeroes to the NFS server
1872  without consequence.

1874  7.5.  Operation 65: READ_PLUS

1876  This section introduces a new read operation, named READ_PLUS, which
1877  allows NFS clients to avoid reading holes in a sparse file.
1878  READ_PLUS is guaranteed to perform no worse than READ, and can
1879  dramatically improve performance with sparse files.

1881  READ_PLUS supports all the features of the existing NFSv4.1 READ
1882  operation [2] and adds a simple yet significant extension to the
1883  format of its response.  The change allows the server to avoid
1884  returning all zeroes from a file hole, which wastes computational and
1885  network resources and reduces performance.  READ_PLUS uses a new
1886  result structure that tells the client that the result is all zeroes
1887  AND the byte-range of the hole in which the request was made.
1888  Returning the hole's byte-range, and only upon request, avoids
1889  transferring large Data Region Maps that may be soon invalidated and
1890  contain information about a file that may not even be read in its
1891  entirety.

1893  A new read operation is required due to NFSv4.1 minor versioning
1894  rules that do not allow modification of an existing operation's
1895  arguments or results.  READ_PLUS is designed in such a way as to
1896  allow future extensions to the result structure.
The same approach could
1897  be taken to extend the argument structure, but a good use case is
1898  first required to make such a change.

1900  7.5.1.  ARGUMENT

1902  struct READ_PLUS4args {
1903      /* CURRENT_FH: file */
1904      stateid4        rpa_stateid;
1905      offset4         rpa_offset;
1906      count4          rpa_count;
1907  };

1909  7.5.2.  RESULT

1911  union read_plus_content switch (data_content4 content) {
1912      case NFS4_CONTENT_DATA:
1913          opaque                  rpc_data<>;
1914      case NFS4_CONTENT_APP_BLOCK:
1915          app_data_block4         rpc_block;
1916      case NFS4_CONTENT_HOLE:
1917          hole_info4              rpc_hole;
1918      default:
1919          void;
1920  };

1922  /*
1923   * Allow a return of an array of contents.
1924   */
1925  struct read_plus_res4 {
1926      bool                    rpr_eof;
1927      read_plus_content       rpr_contents<>;
1928  };

1930  union READ_PLUS4res switch (nfsstat4 status) {
1931      case NFS4_OK:
1932          read_plus_res4      resok4;
1933      default:
1934          void;
1935  };

1937  7.5.3.  DESCRIPTION

1939  The READ_PLUS operation is based upon the NFSv4.1 READ operation [2],
1940  and similarly reads data from the regular file identified by the
1941  current filehandle.

1943  The client provides an offset of where the READ_PLUS is to start and
1944  a count of how many bytes are to be read.  An offset of zero means to
1945  read data starting at the beginning of the file.  If offset is
1946  greater than or equal to the size of the file, the status NFS4_OK is
1947  returned with nfs_readplusrestype4 set to READ_OK, data length set to
1948  zero, and eof set to TRUE.  The READ_PLUS is subject to access
1949  permissions checking.

1951  If the client specifies a count value of zero, the READ_PLUS succeeds
1952  and returns zero bytes of data, again subject to access permissions
1953  checking.  In all situations, the server may choose to return fewer
1954  bytes than specified by the client.  The client needs to check for
1955  this condition and handle the condition appropriately.

1957  If the client specifies an offset and count value that is entirely
1958  contained within a hole of the file, the status NFS4_OK is returned
1959  with nfs_readplusresok4 set to READ_HOLE, and if information is
1960  available regarding the hole, an nfs_readplusreshole structure
1961  containing the offset and range of the entire hole.  The
1962  nfs_readplusreshole structure is considered valid until the file is
1963  changed (detected via the change attribute).  The server MUST provide
1964  the same semantics for nfs_readplusreshole as if the client read the
1965  region and received zeroes; the implied hole's contents lifetime MUST
1966  be exactly the same as any other read data.

1968  If the client specifies an offset and count value that begins in a
1969  non-hole of the file but extends into a hole, the server should
1970  return a short read with status NFS4_OK, nfs_readplusresok4 set to
1971  READ_OK, and data length set to the number of bytes returned.  The
1972  client will then issue another READ_PLUS for the remaining bytes, to
1973  which the server will respond with information about the hole in the
      file.

1975  If the server knows that the requested byte range is into a hole of
1976  the file, but has no further information regarding the hole, it
1977  returns an nfs_readplusreshole structure with holeres4 set to
1978  HOLE_NOINFO.

1980  If hole information is available and can be returned to the client,
1981  the server returns an nfs_readplusreshole structure with the value of
1982  holeres4 set to HOLE_INFO.  The values of hole_offset and hole_length
1983  define the byte-range for the current hole in the file.
These values
1984  represent the information known to the server and may describe a
1985  byte-range smaller than the true size of the hole.

1987  Except when special stateids are used, the stateid value for a
1988  READ_PLUS request represents a value returned from a previous byte-
1989  range lock or share reservation request or the stateid associated
1990  with a delegation.  The stateid identifies the associated owners, if
1991  any, and is used by the server to verify that the associated locks
1992  are still valid (e.g., have not been revoked).

1994  If the read ended at the end-of-file (formally, in a correctly formed
1995  READ_PLUS operation, if offset + count is equal to the size of the
1996  file), or the READ_PLUS operation extends beyond the size of the file
1997  (if offset + count is greater than the size of the file), eof is
1998  returned as TRUE; otherwise, it is FALSE.  A successful READ_PLUS of
1999  an empty file will always return eof as TRUE.

2001  If the current filehandle is not an ordinary file, an error will be
2002  returned to the client.  In the case that the current filehandle
2003  represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
2004  the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
2005  returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.

2007  For a READ_PLUS with a stateid value of all bits equal to zero, the
2008  server MAY allow the READ_PLUS to be serviced subject to mandatory
2009  byte-range locks or the current share deny modes for the file.  For a
2010  READ_PLUS with a stateid value of all bits equal to one, the server
2011  MAY allow READ_PLUS operations to bypass locking checks at the
2012  server.

2014  On success, the current filehandle retains its value.

2016  7.5.4.  IMPLEMENTATION

2018  If the server returns a "short read" (i.e., fewer bytes than
2019  requested and eof is set to FALSE), the client should send another
2020  READ_PLUS to get the remaining data.  A server may return less data
2021  than requested under several circumstances.  The file may have been
2022  truncated by another client or perhaps on the server itself, changing
2023  the file size from what the requesting client believes to be the
2024  case.  This would reduce the actual amount of data available to the
2025  client.  It is possible that the server may reduce the transfer size
2026  and so return a short read result.  Server resource exhaustion may
2027  also result in a short read.

2029  If mandatory byte-range locking is in effect for the file, and if the
2030  byte-range corresponding to the data to be read from the file is
2031  WRITE_LT locked by an owner not associated with the stateid, the
2032  server will return the NFS4ERR_LOCKED error.  The client should try
2033  to get the appropriate READ_LT via the LOCK operation before re-
2034  attempting the READ_PLUS.  When the READ_PLUS completes, the client
2035  should release the byte-range lock via LOCKU.  In addition, the
2036  server MUST return an nfs_readplusreshole structure with values of
2037  hole_offset and hole_length that are within the owner's locked byte
2038  range.

2040  If another client has an OPEN_DELEGATE_WRITE delegation for the file
2041  being read, the delegation must be recalled, and the operation cannot
2042  proceed until that delegation is returned or revoked.  Except where
2043  this happens very quickly, one or more NFS4ERR_DELAY errors will be
2044  returned to requests made while the delegation remains outstanding.
2045  Normally, delegations will not be recalled as a result of a READ_PLUS
2046  operation since the recall will occur as a result of an earlier OPEN.
2047  However, since it is possible for a READ_PLUS to be done with a
2048  special stateid, the server needs to check for this case even though
2049  the client should have done an OPEN previously.
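A non-normative client-side sketch of the short-read and hole
handling described above; nfs_read_plus() and struct rp_result are
illustrative stand-ins for an implementation's RPC and decoding
layers, and the HOLE_NOINFO case is omitted for brevity:

   #include <stdint.h>
   #include <string.h>

   enum rp_type { READ_OK, READ_HOLE };

   struct rp_result {
       enum rp_type type;
       int          eof;
       uint64_t     hole_offset, hole_length; /* for READ_HOLE */
       uint32_t     data_len;                 /* for READ_OK   */
       uint8_t      data[65536];
   };

   extern void nfs_read_plus(uint64_t off, uint32_t count,
                             struct rp_result *out);

   /* Read `count` bytes at `off`, zero-filling reported holes.
    * Assumes a reported hole covers the requested offset. */
   static uint64_t read_range(uint8_t *buf, uint64_t off,
                              uint64_t count)
   {
       uint64_t done = 0;
       while (done < count) {
           struct rp_result r;
           nfs_read_plus(off + done, (uint32_t)(count - done), &r);
           if (r.type == READ_HOLE) {
               /* Zero-fill the part of the hole we asked for. */
               uint64_t end = r.hole_offset + r.hole_length;
               uint64_t n = end - (off + done);
               if (n > count - done)
                   n = count - done;
               memset(buf + done, 0, (size_t)n);
               done += n;
           } else {
               memcpy(buf + done, r.data, r.data_len);
               done += r.data_len;  /* short reads simply loop */
           }
           if (r.eof)
               break;
       }
       return done;
   }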
2051  7.5.4.1.  Additional pNFS Implementation Information

2053  With pNFS, the semantics of using READ_PLUS remain the same.  Any
2054  data server MAY return a READ_HOLE result for a READ_PLUS request
2055  that it receives.

2057  When a data server chooses to return a READ_HOLE result, it has the
2058  option of returning hole information for the data stored on that data
2059  server (as defined by the data layout), but it MUST NOT return an
2060  nfs_readplusreshole structure with a byte range that includes data
2061  managed by another data server.

2063  1.  Data servers that cannot determine hole information SHOULD return
2064      HOLE_NOINFO.

2066  2.  Data servers that can obtain hole information for the parts of
2067      the file stored on that data server SHOULD return HOLE_INFO and
2068      the byte range of the hole stored on that data server.

2070  A data server should do its best to return as much information about
2071  a hole as is feasible without having to contact the metadata server.
2072  If communication with the metadata server is required, then every
2073  attempt should be made to minimize the number of requests.

2075  If mandatory locking is enforced, then the data server must also
2076  ensure that it returns only information for a Hole that is within the
2077  owner's locked byte range.

2080  7.5.5.  READ_PLUS with Sparse Files Example

2082  To see how the return value READ_HOLE will work, the following table
2083  describes a sparse file.  For each byte range, the file contains
2084  either non-zero data or a hole.  In addition, the server in this
2085  example uses a hole threshold of 32K.

2087               +-------------+----------+
2088               | Byte-Range  | Contents |
2089               +-------------+----------+
2090               | 0-15999     | Hole     |
2091               | 16K-31999   | Non-Zero |
2092               | 32K-255999  | Hole     |
2093               | 256K-287999 | Non-Zero |
2094               | 288K-353999 | Hole     |
2095               | 354K-417999 | Non-Zero |
2096               +-------------+----------+

2098               Table 1

2100  Under the given circumstances, if a client was to read the file from
2101  beginning to end with a max read size of 64K, the following will be
2102  the result.  This assumes the client has already opened the file and
2103  acquired a valid stateid and just needs to issue READ_PLUS requests.

2105  1.  READ_PLUS(s, 0, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof
2106      = false, data<>[32K].  Return a short read, as the last half of
2107      the request was all zeroes.  Note that the first hole is read
2108      back as all zeros as it is below the hole threshold.

2110  2.  READ_PLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
2111      nfs_readplusreshole(HOLE_INFO)(32K, 224K).  The requested range
2112      was all zeros, and the current hole begins at offset 32K and is
2113      224K in length.

2115  3.  READ_PLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
2116      eof = false, data<>[32K].  Return a short read, as the last half
2117      of the request was all zeroes.

2119  4.  READ_PLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
2120      nfs_readplusreshole(HOLE_INFO)(288K, 66K).

2122  5.  READ_PLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
2123      eof = true, data<>[64K].

2125  7.6.  Related Work
2127  Solaris and ZFS support an extension to lseek(2) that allows
2128  applications to discover holes in a file.  The values, SEEK_HOLE and
2129  SEEK_DATA, allow clients to seek to the next hole or beginning of
2130  data, respectively.

2132  XFS supports the XFS_IOC_GETBMAP ioctl, which returns
2133  the Data Region Map for a file.  Clients can then use this
2134  information to avoid reading holes in a file.

2136  NTFS and CIFS support the FSCTL_SET_SPARSE attribute, which allows
2137  applications to control whether empty regions of the file are
2138  preallocated and filled in with zeros or simply left unallocated.

2140  7.7.  Other Proposed Designs

2142  7.7.1.  Multi-Data Server Hole Information

2144  The current design prohibits pNFS data servers from returning hole
2145  information for regions of a file that are not stored on that data
2146  server.  Having data servers return information regarding other data
2147  servers changes the fundamental principle that all metadata
2148  information comes from the metadata server.

2150  Here is a brief description of what would be required if we did
2151  choose to support multi-data-server hole information:

2153  For a data server that can obtain hole information for the entire
2154  file without severe performance impact, it MAY return HOLE_INFO and
2155  the byte range of the entire file hole.  When a pNFS client receives
2156  a READ_HOLE result and a non-empty nfs_readplusreshole structure, it
2157  MAY use this information in conjunction with a valid layout for the
2158  file to determine the next data server for the next region of data
2159  that is not in a hole.

2161  7.7.2.  Data Result Array

2163  If a single read request contains one or more Holes with a length
2164  greater than the Sparse Threshold, the current design would return
2165  results indicating a short read to the client.  A client would then
2166  send a series of read requests to the server to retrieve information
2167  for the Holes and the remaining data.  To avoid turning a single read
2168  request into several exchanges between the client and server, the
2169  server may need to choose a relatively large Sparse Threshold in
2170  order to decrease the number of short reads it creates.  A large
2171  Sparse Threshold may miss many smaller holes, which in turn may
2172  negate the benefits of sparse read support.

2174  To avoid this situation, one option is to have the READ_PLUS
2175  operation return information for multiple holes in a single return
2176  value.  This would allow several small holes to be described in a
2177  single read response without requiring multiple exchanges between
2178  the client and server.

2180  One important item to consider with returning an array of data chunks
2181  is its impact on RDMA, which may use different block sizes on the
2182  client and server (among other things).

2184  7.7.3.  User-Defined Sparse Mask

2186  Add mask (instead of just zeroes).  Specified by server or client?

2188  7.7.4.  Allocated flag

2190  A Hole on the server may be an allocated byte-range consisting of all
2191  zeroes or may not be allocated at all.  To ensure this information is
2192  properly communicated to the client, it may be beneficial to add an
2193  'alloc' flag to the HOLE_INFO section of nfs_readplusreshole.  This
2194  would allow an NFS client to copy a file from one file system to
2195  another and have it more closely resemble the original.

2197  7.7.5.  Dense and Sparse pNFS File Layouts
2199  The hole information returned from a data server must be understood
2200  by pNFS clients using either Dense or Sparse file layout types.  Does
2201  the current READ_PLUS return value work for both layout types?  Does
2202  the data server know if it is using dense or sparse so that it can
2203  return the correct hole_offset and hole_length values?

2205  8.  Labeled NFS

2207  WARNING: Need to pull out the requirements.

2209  8.1.  Introduction

2211  Mandatory Access Control (MAC) systems have been mainstreamed in
2212  modern operating systems such as Linux (R), FreeBSD (R), Solaris
2213  (TM), and Windows Vista (R).  MAC systems bind security attributes to
2214  subjects (processes) and objects within a system.  These attributes
2215  are used with other information in the system to make access control
2216  decisions.

2218  Access control models such as Unix permissions or Access Control
2219  Lists are commonly referred to as Discretionary Access Control (DAC)
2220  models.  These systems base their access decisions on user identity
2221  and resource ownership.  In contrast, MAC models base their access
2222  control decisions on the label on the subject (usually a process) and
2223  the object it wishes to access.  These labels may contain user
2224  identity information but usually contain additional information.  In
2225  DAC systems, users are free to specify the access rules for resources
2226  that they own.  MAC models base their security decisions on a system-
2227  wide policy, established by an administrator or organization, which
2228  the users do not have the ability to override.  DAC systems offer no
2229  real protection against malicious or flawed software due to each
2230  program running with the full permissions of the user executing it.
2231  Conversely, MAC models can confine malicious or flawed software and
2232  usually act at a finer granularity than their DAC counterparts.

2234  People desire to use NFSv4 with these systems.  A mechanism is
2235  required to provide security attribute information to NFSv4 clients
2236  and servers.  This mechanism has the following requirements:

2238  (1)  Clients must be able to convey to the server the security
2239       attribute of the subject making the access request.  The server
2240       may provide a mechanism to enforce MAC policy based on the
2241       requesting subject's security attribute.

2243  (2)  Servers must be able to store and retrieve the security
2244       attribute of exported files as requested by the client.

2246  (3)  Servers must provide a mechanism for notifying clients of
2247       attribute changes of files on the server.

2249  (4)  Clients and Servers must be able to negotiate Label Formats and
2250       Domains of Interpretation (DOI) and provide a mechanism to
2251       translate between them as needed.

2253  These four requirements are key to the system, with only requirements
2254  (2) and (3) requiring changes to NFSv4.  The ability to convey the
2255  security attribute of the subject as described in requirement (1)
2256  falls upon the RPC layer to implement (see [6]).  Requirement (4)
2257  allows communication between different MAC implementations.  The
2258  management of label formats, DOIs, and the translation between them
2259  does not require any support from NFSv4 on a protocol level and is
2260  out of the scope of this document.

2262  The first change necessary is to devise a method for transporting and
2263  storing security label data on NFSv4 file objects.
Security labels
2264  have several semantic requirements that are met by NFSv4 recommended
2265  attributes, such as the ability to set the label value upon object
2266  creation.  Access control on these attributes is done through a
      combination of
2267  two mechanisms.  As with other recommended attributes on file
2268  objects, the usual DAC checks (ACLs and permission bits) will be
2269  performed to ensure that proper file ownership is enforced.  In
2270  addition, a MAC system MAY be employed on the client, server, or both
2271  to enforce additional policy on what subjects may modify security
2272  label information.

2274  The second change is to provide a method for the server to notify the
2275  client that the attribute changed on an open file on the server.  If
2276  the file is closed, then during the open attempt, the client will
2277  gather the new attribute value.  The server MUST NOT communicate the
2278  new value of the attribute; the client MUST query it.  This
2279  requirement stems from the need for the client to provide sufficient
2280  access rights to the attribute.

2282  The final change necessary is a modification to the RPC layer used in
2283  NFSv4 in the form of a new version of the RPCSEC_GSS [7] framework.
2284  In order for an NFSv4 server to apply MAC checks it must obtain
2285  additional information from the client.  Several methods were
2286  explored for performing this and it was decided that the best
2287  approach was to incorporate the ability to make security attribute
2288  assertions through the RPC mechanism.  RPCSEC_GSSv3 [6] outlines a
2289  method to assert additional security information such as security
2290  labels on GSS context creation and have that data bound to all RPC
2291  requests that make use of that context.

2293  8.2.  Definitions

2295  Label Format Specifier (LFS):  is an identifier used by the client to
2296     establish the syntactic format of the security label and the
2297     semantic meaning of its components.  These specifiers exist in a
2298     registry associated with documents describing the format and
2299     semantics of the label.

2301  Label Format Registry:  is the IANA registry containing all
2302     registered LFS along with references to the documents that
2303     describe the syntactic format and semantics of the security label.

2305  Policy Identifier (PI):  is an optional part of the definition of a
2306     Label Format Specifier which allows for clients and server to
2307     identify specific security policies.

2309  Domain of Interpretation (DOI):  represents an administrative
2310     security boundary, where all systems within the DOI have
2311     semantically coherent labeling.  That is, a security attribute
2312     must always mean exactly the same thing anywhere within the DOI.

2314  Object:  is a passive resource within the system that we wish to be
2315     protected.  Objects can be entities such as files, directories,
2316     pipes, sockets, and many other system resources relevant to the
2317     protection of the system state.

2319  Subject:  A subject is an active entity, usually a process, which is
2320     requesting access to an object.

2322  Multi-Level Security (MLS):  is a traditional model where objects are
2323     given a sensitivity level (Unclassified, Secret, Top Secret, etc.)
2324     and a category set [21].

2326  8.3.  MAC Security Attribute

2328  MAC models base access decisions on security attributes bound to
2329  subjects and objects.  This information can range from a user
2330  identity for an identity-based MAC model, to sensitivity levels for
2331  Multi-Level Security, to a type for Type Enforcement.
These models 2332 base their decisions on different criteria, but the semantics of the 2333 security attribute remain the same. The semantics required by the 2334 security attributes are listed below: 2336 o Must provide flexibility with respect to the MAC model. 2338 o Must provide the ability to atomically set security information 2339 upon object creation. 2341 o Must provide the ability to enforce access control decisions both 2342 on the client and the server. 2344 o Must not expose an object to either the client or server name 2345 space before its security information has been bound to it. 2347 NFSv4 provides several options for implementing the security 2348 attribute. The first option is to implement the security attribute 2349 as a named attribute. Named attributes provide flexibility since 2350 they are treated as an opaque field but lack a way to atomically set 2351 the attribute on creation. In addition, named attributes themselves 2352 are file system objects which need to be assigned a security 2353 attribute. This raises the question of how to assign security 2354 attributes to the files and directories used to hold the security 2355 attribute for the file in question. The inability to atomically 2356 assign the security attribute on file creation and the necessity to 2357 assign security attributes to its sub-components make named 2358 attributes unacceptable as a method for storing security attributes. 2360 The second option is to implement the security attribute as a 2361 recommended attribute. These attributes have a fixed format and 2362 semantics, which conflicts with the flexible nature of the security 2363 attribute. To resolve this, the security attribute consists of two 2364 components. The first component is an LFS, as defined in [22], to allow 2365 for interoperability between MAC mechanisms. The second component is 2366 an opaque field which is the actual security attribute data. To 2367 allow for various MAC models, NFSv4 should be used solely as a 2368 transport mechanism for the security attribute. It is the 2369 responsibility of the endpoints to consume the security attribute and 2370 make access decisions based on their respective models. In addition, 2371 creation of objects through OPEN and CREATE allows for the security 2372 attribute to be specified upon creation. By providing an atomic 2373 create and set operation for the security attribute, it is possible to 2374 enforce the second and fourth requirements. The recommended 2375 attribute FATTR4_SEC_LABEL will be used to satisfy this requirement. 2377 8.3.1. Interpreting FATTR4_SEC_LABEL 2379 The XDR [11] necessary to implement Labeled NFSv4 is presented in 2380 Figure 6: 2382 const FATTR4_SEC_LABEL = 81; 2384 typedef uint32_t policy4; 2385 struct labelformat_spec4 { 2386 policy4 lfs_lfs; 2387 policy4 lfs_pi; 2388 }; 2390 struct sec_label_attr_info { 2391 labelformat_spec4 slai_lfs; 2392 opaque slai_data<>; 2393 }; 2395 Figure 6 2397 The FATTR4_SEC_LABEL attribute consists of two components, the 2398 first component being an LFS. It serves to provide the receiving end 2399 with the information necessary to translate the security attribute 2400 into a form that is usable by the endpoint. Label Formats assigned 2401 an LFS may optionally choose to include a Policy Identifier field to 2402 allow for complex policy deployments. The LFS and Label Format 2403 Registry are described in detail in [22].
The translation used to 2404 interpret the security attribute is not specified as part of the 2405 protocol, as it may depend on various factors. The second component 2406 is an opaque section which contains the data of the attribute. This 2407 component is dependent on the MAC model to interpret and enforce. 2409 In particular, it is the responsibility of the LFS specification to 2410 define a maximum size for the opaque section, slai_data<>. When 2411 creating or modifying a label for an object, the client needs to be 2412 guaranteed that the server will accept a label that is sized 2413 correctly. By both client and server being part of a specific MAC 2414 model, the client will be aware of the size. 2416 8.3.2. Delegations 2418 In the event that a security attribute is changed on the server while 2419 a client holds a delegation on the file, the client should follow the 2420 existing protocol with respect to attribute changes. It should flush 2421 all changes back to the server and relinquish the delegation. 2423 8.3.3. Permission Checking 2425 It is not feasible to enumerate all possible MAC models and even 2426 levels of protection within a subset of these models. This means 2427 that NFSv4 clients and servers cannot be expected to directly make 2428 access control decisions based on the security attribute. Instead, 2429 NFSv4 should defer permission checking on this attribute to the host 2430 system. These checks are performed in addition to the existing DAC and 2431 ACL checks outlined in the NFSv4 protocol. Section 8.7 gives a 2432 specific example of how the security attribute is handled under a 2433 particular MAC model. 2435 8.3.4. Object Creation 2437 When creating files in NFSv4, the OPEN and CREATE operations are used. 2438 One of the parameters to these operations is an fattr4 structure 2439 containing the attributes the file is to be created with. This 2440 allows NFSv4 to atomically set the security attribute of files upon 2441 creation. When a client is MAC aware, it must always provide the 2442 initial security attribute upon file creation. In the event that the 2443 server is the only MAC aware entity in the system, it should ignore 2444 the security attribute specified by the client and instead make the 2445 determination itself. A more in-depth explanation can be found in 2446 Section 8.7. 2448 8.3.5. Existing Objects 2450 Note that under the MAC model, all objects must have labels. 2451 Therefore, if an existing server is upgraded to include LNFS support, 2452 then it is the responsibility of the security system to define the 2453 behavior for existing objects. For example, if the security system 2454 is LFS 0, which means the server just stores and returns labels, then 2455 existing files should return labels which are set to an empty value. 2457 8.3.6. Label Changes 2459 As per the requirements, when a file's security label is modified, 2460 the server must notify all clients which have the file opened of the 2461 change in label. It does so with CB_ATTR_CHANGED. There are 2462 preconditions to making an attribute change imposed by NFSv4, and the 2463 security system might want to impose others. In the process of 2464 meeting these preconditions, the server may choose either to service the 2465 request in whole or to return NFS4ERR_DELAY to the SETATTR operation. 2467 If there are open delegations on the file belonging to clients other 2468 than the one making the label change, then the process described in 2469 Section 8.3.2 must be followed.
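To make that sequencing concrete, the following sketch (in Python, which this specification does not otherwise use) shows one way a server might process a SETATTR of FATTR4_SEC_LABEL. Every name here (preconditions_met, delegation_holders, open_holders, cb_attr_changed) is a hypothetical server-internal helper, not protocol:

   def handle_setattr_sec_label(srv, requester, fh, new_label):
       # NFSv4 imposes preconditions on attribute changes, and the
       # security system may impose its own; if they cannot be met
       # now, the SETATTR is answered with NFS4ERR_DELAY.
       if not srv.preconditions_met(fh, new_label):
           return "NFS4ERR_DELAY"
       # Open delegations held by other clients are dealt with first,
       # per Section 8.3.2 (changes flushed, delegation relinquished).
       for client in srv.delegation_holders(fh):
           if client != requester:
               srv.recall_delegation(client, fh)
       srv.labels[fh] = new_label
       # Notify every client holding the file open that a critical
       # attribute changed; each client then queries the new label
       # itself, since the server never pushes the value.
       for client in srv.open_holders(fh):
           srv.cb_attr_changed(client, fh,
                               critical=["FATTR4_SEC_LABEL"])
       return "NFS4_OK"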
2471 As the server is always presented with the subject label from the 2472 client, it does not necessarily need to communicate the fact that the 2473 label has changed to the client. In the cases where the change 2474 outright denies the client access, the client will be able to quickly 2475 determine that there is a new label in effect. It is in cases where 2476 the client shares the same object among multiple subjects, or where the 2477 security system is not strictly hierarchical, that the 2478 CB_ATTR_CHANGED callback is most useful. It allows the server to 2479 inform the clients that the cached security attribute is now stale. 2481 In the scenario presented in Section 8.8.5, the clients are smart and 2482 the server has a very simple security system which just stores the 2483 labels. In this system, the MAC label check always allows access, 2484 regardless of the subject label. 2486 MAC labels are enforced by the smart client. So 2487 if client A changes a security label on a file, then the server MUST 2488 inform all clients that have the file opened that the label has 2489 changed via CB_ATTR_CHANGED. Then the clients MUST retrieve the new 2490 label and MUST enforce access via the new attribute values. 2492 [[Comment.3: Describe a LFS of 0, which will be the means to indicate 2493 such a deployment. In the current LFR, 0 is marked as reserved. If 2494 we use it, then we define the default LFS to be used by a LNFS aware 2495 server. I.e., it lets smart clients work together in the face of a 2496 dumb server. Note that while supporting this system is optional, it 2497 will make for a very good debugging mode during development. I.e., 2498 even if a server does not deploy with another security system, this 2499 mode gets your foot in the door. --TH]] 2501 8.4. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's 2502 Attributes Changed 2504 8.4.1. ARGUMENTS 2506 struct CB_ATTR_CHANGED4args { 2507 nfs_fh4 acca_fh; 2508 bitmap4 acca_critical; 2509 bitmap4 acca_info; 2510 }; 2512 8.4.2. RESULTS 2514 struct CB_ATTR_CHANGED4res { 2515 nfsstat4 accr_status; 2516 }; 2518 8.4.3. DESCRIPTION 2520 The CB_ATTR_CHANGED callback operation is used by the server to 2521 indicate to the client that the file's attributes have been modified 2522 on the server. The server does not convey how the attributes have 2523 changed, just that they have been modified. The server can inform 2524 the client about both critical and informational attribute changes in 2525 the bitmask arguments. The client SHOULD query the server about all 2526 attributes set in acca_critical. For all changes reflected in 2527 acca_info, the client can decide whether or not it wants to poll the 2528 server. 2530 The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set 2531 in acca_critical is the method used by the server to indicate that 2532 the MAC label for the file referenced by acca_fh has changed. In 2533 many ways, the server does not care about the result returned by the 2534 client. 2536 8.5. pNFS Considerations 2538 This section examines the issues in deploying LNFS in a pNFS 2539 community of servers. 2541 8.5.1. MAC Label Checks 2543 The new FATTR4_SEC_LABEL attribute is metadata information and, as 2544 such, the DS is not aware of the value contained on the MDS. 2545 Fortunately, the NFSv4.1 protocol [2] already has provisions for 2546 doing access level checks from the DS to the MDS.
In order for the 2547 DS to validate the subject label presented by the client, it SHOULD 2548 utilize this mechanism. 2550 If a file's FATTR4_SEC_LABEL is changed, then the MDS should utilize 2551 CB_ATTR_CHANGED to inform the client of that fact. If the MDS is 2552 maintaining 2554 8.6. Discovery of Server LNFS Support 2556 The server can easily determine that a client supports LNFS when it 2557 queries for the FATTR4_SEC_LABEL attribute of an object. Note that it 2558 cannot assume that the presence of RPCSEC_GSSv3 indicates LNFS 2559 support. The client might need to discover which LFS the server 2560 supports. 2562 A server which supports LNFS MUST allow a client with any subject 2563 label to retrieve the FATTR4_SEC_LABEL attribute for the root 2564 filehandle, ROOTFH. The following compound must always succeed as 2565 far as a MAC label check is concerned: 2567 PUTROOTFH, GETATTR {FATTR4_SEC_LABEL} 2569 Note that the server might have imposed a security flavor on the root 2570 that precludes such access. I.e., if the server requires Kerberized 2571 access and the client presents a compound with AUTH_SYS, then the 2572 server is allowed to return NFS4ERR_WRONGSEC in this case. But if 2573 the client presents a correct security flavor, then the server MUST 2574 return the FATTR4_SEC_LABEL attribute with the supported LFS filled 2575 in. 2577 8.7. MAC Security NFS Modes of Operation 2579 A system using Labeled NFS may operate in three modes. The first 2580 mode provides the most protection and is called "full mode". In this 2581 mode, both the client and server implement a MAC model, allowing each 2582 end to make an access control decision. The remaining two modes are 2583 variations on each other and are called "smart client" and "smart 2584 server" modes. In these modes, one end of the connection is not 2585 implementing a MAC model and, because of this, these operating modes 2586 offer less protection than full mode. 2588 8.7.1. Full Mode 2590 Full mode environments consist of MAC aware NFSv4 servers and clients 2591 and may be composed of mixed MAC models and policies. The system 2592 requires that both the client and server have an opportunity to 2593 perform an access control check based on all relevant information 2594 within the network. The file object security attribute is provided 2595 using the mechanism described in Section 8.3. The security attribute 2596 of the subject making the request is transported at the RPC layer 2597 using the mechanism described in RPCSEC_GSSv3 [6]. 2599 8.7.1.1. Initial Labeling and Translation 2601 The ability to create a file is an action that a MAC model may wish 2602 to mediate. The client is given the responsibility to determine the 2603 initial security attribute to be placed on a file. This allows the 2604 client to make a decision as to the acceptable security attributes to 2605 create a file with before sending the request to the server. Once 2606 the server receives the creation request from the client, it may 2607 choose to evaluate whether the security attribute is acceptable. 2609 Security attributes on the client and server may vary based on MAC 2610 model and policy. To handle this, the security attribute field has an 2611 LFS component. This component is a mechanism for the host to 2612 identify the format and meaning of the opaque portion of the security 2613 attribute. A full mode environment may contain hosts operating in 2614 several different LFSs and DOIs.
In this case, a mechanism for 2615 translating the opaque portion of the security attribute is needed. 2616 The actual translation function will vary based on MAC model and 2617 policy and is out of the scope of this document. If a translation is 2618 unavailable for a given LFS and DOI, then the request SHOULD be 2619 denied. Another recourse is to allow the host to provide a fallback 2620 mapping for unknown security attributes. 2622 8.7.1.2. Policy Enforcement 2624 In full mode, access control decisions are made by both the clients 2625 and servers. When a client makes a request, it takes the security 2626 attribute from the requesting process and makes an access control 2627 decision based on that attribute and the security attribute of the 2628 object it is trying to access. If the client denies that access, an 2629 RPC call to the server is never made. If, however, the access is 2630 allowed, the client will make a call to the NFS server. 2632 When the server receives the request from the client, it extracts the 2633 security attribute conveyed in the RPC request. The server then uses 2634 this security attribute and the attribute of the object the client is 2635 trying to access to make an access control decision. If the server's 2636 policy allows this access, it will fulfill the client's request; 2637 otherwise, it will return NFS4ERR_ACCESS. 2639 Implementations MAY validate security attributes supplied over the 2640 network to ensure that they are within a set of attributes permitted 2641 from a specific peer, and if not, reject them. Note that a system 2642 may permit a different set of attributes to be accepted from each 2643 peer. An example of this can be seen in Section 8.8.7.1. 2645 8.7.2. Smart Client Mode 2647 Smart client environments consist of NFSv4 servers that are not MAC 2648 aware but NFSv4 clients that are. Clients in this environment 2649 may consist of groups implementing different MAC models and policies. 2650 The system requires that all clients in the environment be 2651 responsible for access control checks. Due to the amount of trust 2652 placed in the clients, this mode is only to be used in a trusted 2653 environment. 2655 8.7.2.1. Initial Labeling and Translation 2657 Just as in full mode, the client is responsible for determining the 2658 initial label upon object creation. The server in smart client mode 2659 does not implement a MAC model; however, it may provide the ability 2660 to restrict the creation and labeling of objects with certain labels 2661 based on different criteria, as described in Section 8.7.1.2. 2663 In a smart client environment, a group of clients operates in a single 2664 DOI. This removes the need for the clients to maintain a set of DOI 2665 translations. Servers should provide a method to allow different 2666 groups of clients to access the server at the same time. However, a 2667 server should not allow two groups of clients operating in different DOIs to 2668 access the same files. 2670 8.7.2.2. Policy Enforcement 2672 In smart client mode, access control decisions are made by the 2673 clients. When a client accesses an object, it obtains the security 2674 attribute of the object from the server and combines it with the 2675 security attribute of the process making the request to make an 2676 access control decision. This check is in addition to the DAC checks 2677 provided by NFSv4, so access may fail based on the DAC criteria even if 2678 the MAC policy grants access.
As the policy check is located on the 2679 client, an access control denial should take the form that is native 2680 to the platform. 2682 8.7.3. Smart Server Mode 2684 Smart server environments consist of NFSv4 servers that are MAC aware 2685 and one or more MAC unaware clients. The server is the only entity 2686 enforcing policy, and may selectively provide standard NFS services 2687 to clients based on their authentication credentials and/or 2688 associated network attributes (e.g., IP address, network interface). 2689 The level of trust and access extended to a client in this mode is 2690 configuration-specific. 2692 8.7.3.1. Initial Labeling and Translation 2694 In smart server mode, all labeling and access control decisions are 2695 performed by the NFSv4 server. In this environment, the NFSv4 clients 2696 are not MAC aware, so they cannot provide input into the access 2697 control decision. This requires the server to determine the initial 2698 labeling of objects. Normally, the subject to use in this calculation 2699 would originate from the client. Instead, the NFSv4 server may choose 2700 to assign the subject security attribute based on the client's 2701 authentication credentials and/or associated network attributes 2702 (e.g., IP address, network interface). 2704 In smart server mode, security attributes are contained solely within 2705 the NFSv4 server. This means that all security attributes used in 2706 the system remain within a single LFS and DOI. Since security 2707 attributes will not cross DOIs or change format, there is no need to 2708 provide any translation functionality above that which is needed 2709 internally by the MAC model. 2711 8.7.3.2. Policy Enforcement 2713 All access control decisions in smart server mode are made by the 2714 server. The server will assign the subject a security attribute 2715 based on some criteria (e.g., IP address, network interface). Using 2716 the newly calculated security attribute and the security attribute of 2717 the object being requested, the MAC model makes the access control 2718 check and returns NFS4ERR_ACCESS on a denial and NFS4_OK on success. 2719 This check is done transparently to the client, so if the MAC 2720 permission check fails, the client may be unaware of the reason for 2721 the permission failure. When operating in this mode, administrators 2722 attempting to debug permission failures should remember to check the 2723 MAC policy running on the server in addition to the DAC settings. 2725 8.8. Use Cases 2727 MAC labeling is meant to allow NFSv4 to be deployed in site- 2728 configurable security schemes. The LFS and opaque data scheme allows 2729 for flexibility to meet these different implementations. In this 2730 section, we provide some examples of how NFSv4 could be deployed to 2731 meet existing needs. This is not an exhaustive listing. 2733 8.8.1. Full MAC labeling support for remotely mounted filesystems 2735 In this case, we assume a local networked environment where the 2736 servers and clients are under common administrative control. All 2737 systems in this network have the same MAC implementation and 2738 semantically identical MAC security labels for objects (i.e., labels 2739 mean the same thing on different systems, even if the policies on 2740 each system may differ to some extent). Clients will be able to 2741 apply fine-grained MAC policy to objects accessed via NFS mounts, and 2742 thus improve the overall consistency of MAC policy application within 2743 this environment.
2745 An example of this case would be where user home directories are 2746 remotely mounted, and fine-grained MAC policy is implemented to 2747 protect, for example, private user data from being read by malicious 2748 web scripts running in the user's browser. With Labeled NFS, fine- 2749 grained MAC labeling of the user's files will allow the local MAC 2750 policy to be implemented and provide the desired protection. 2752 8.8.2. MAC labeling of virtual machine images stored on the network 2754 Virtualization is now a commonly implemented feature of modern 2755 operating systems, and there is a need to ensure that MAC security 2756 policy is able to protect virtualized resources. A common 2757 implementation scheme involves storing virtualized guest filesystems 2758 on a networked server, which are then mounted remotely by guests upon 2759 instantiation. In this case, there is a need to ensure that the 2760 local guest kernel is able to access fine-grained MAC labels on the 2761 remotely mounted filesystem so that its MAC security policy can be 2762 applied. 2764 8.8.3. International Traffic in Arms Regulations (ITAR) 2766 The International Traffic in Arms Regulations (ITAR) is put forth by 2767 the United States Department of State, Directorate of Defense 2768 Trade Controls. ITAR places strict requirements on the export and 2769 thus access of defense articles and defense services. Organizations 2770 that manage projects with articles and services deemed within the 2771 scope of ITAR must ensure the regulations are met. The regulations 2772 require an assurance that ITAR information is accessed on a need-to- 2773 know basis, thus requiring strict, centrally managed access controls 2774 on items labeled as ITAR. Additionally, organizations must be able 2775 to prove that the controls were adequately maintained and that 2776 foreign nationals were not permitted access to these defense articles 2777 or services. ITAR control applicability may be dynamic; information 2778 may become subject to ITAR after creation (e.g., when the defense 2779 implications of technology are recognized). 2781 8.8.4. Legal Hold/eDiscovery 2783 Increased cases of legal holds on electronic sources of information 2784 (ESI) have resulted in organizations taking a proactive approach to 2785 reduce the scope and thus the costs associated with these activities. 2786 ESI Data Maps are increasing in use and require support in operating 2787 systems to strictly manage access controls in the case of a legal 2788 hold. The sizeable quantity of information involved in a legal 2789 discovery request may preclude making a copy of the information to a 2790 separate system that manages the legal hold on the copies; this 2791 results in a need to enforce the legal hold on the original 2792 information. 2794 Organizations are taking steps to map out the sources of information 2795 that are most likely to be placed under a legal hold; these efforts 2796 result in ESI Data Maps. ESI Data Maps specify the Electronic Source 2797 of Information and the requirements for sensitivity and criticality. 2798 In the case of a legal hold, the ESI data map and labels can be used 2799 to ensure the legal hold is properly enforced on the predetermined 2800 set of information. An ESI data map narrows the scope of a legal 2801 hold to the predetermined ESI. The information must then be 2802 protected at a level of security at which the weight and 2803 admissibility of that evidence can be proved in a court of law.
2804 Current systems use application-level controls and do not adequately 2805 meet the requirements. Labels may be applied in advance, when an ESI 2806 data map exercise is conducted, with controls being applied at the 2807 time of a hold; alternatively, labels may be applied to data sets during an 2808 eDiscovery exercise to ensure the data protections are adequate 2809 during the legal hold period. 2811 Note that this use case requires multi-attribute labels, as both 2812 information sensitivity (e.g., to disclosure) and information 2813 criticality (e.g., to continued business operations) need to be 2814 captured. 2816 8.8.5. Simple security label storage 2818 In this case, a mixed and loosely administered network is assumed, 2819 where nodes may be running a variety of operating systems with 2820 different security mechanisms and security policies. It is desired 2821 that network file servers be simply capable of storing and retrieving 2822 MAC security labels for clients which use such labels. The Labeled 2823 NFS protocol would be implemented here solely to enable transport of 2824 MAC security labels across the network. It should be noted that in 2825 such an environment, overall security cannot be as strongly enforced 2826 as in the case of Section 8.8.1, and that this scheme is aimed at allowing MAC-capable 2827 clients to function with local MAC security policy enabled rather 2828 than perhaps disabling it entirely. 2830 8.8.6. Diskless Linux 2832 A number of popular operating system distributions depend on a 2833 mandatory access control (MAC) model to implement a kernel-enforced 2834 security policy. Typically, such models assign particular roles to 2835 individual processes, which limit or permit performing certain 2836 operations on a set of files, directories, sockets, or other objects. 2837 While enforcement of the policy is typically a matter for the 2838 diskless NFS client itself, the filesystem objects in such models 2839 will typically carry MAC labels that are used to define policy on 2840 access. These policies may, for instance, describe privilege 2841 transitions that cannot be replicated using standard NFS ACL-based 2842 models. 2844 For instance, on a SYSV-compatible system, if the 'init' process 2845 spawns a process that attempts to start the 'NetworkManager' 2846 executable, there may be a policy that sets up a role transition if 2847 the 'init' process and 'NetworkManager' file labels match a 2848 particular rule. Without this role transition, the process may find 2849 itself having insufficient privileges to perform its primary job of 2850 configuring network interfaces. 2852 In setups of this type, many of the policy targets (such as sockets 2853 or privileged system calls) are entirely local to the client. The 2854 use of RPCSEC_GSSv3 for enforcing compliance at the server level is 2855 therefore of limited value. The ability to permanently label files 2856 and have those labels read back by the client is, however, crucial to 2857 the ability to enforce that policy. 2859 8.8.7. Multi-Level Security 2861 In an MLS system, objects are generally assigned a sensitivity level 2862 and a set of compartments. The sensitivity levels within the system 2863 are given an order ranging from lowest to highest classification 2864 level. Read access to an object is allowed when the sensitivity 2865 level of the subject "dominates" the object it wants to access.
This 2866 means that the sensitivity level of the subject is greater than or equal to that 2867 of the object it wishes to access and that its set of compartments is 2868 a superset of the compartments on the object. 2870 The rest of this section will use only sensitivity levels. In general, 2871 the example is a client that wishes to list the contents of a 2872 directory. The system defines the sensitivity levels as Unclassified 2873 (U), Secret (S), and Top Secret (TS). The directory to be searched 2874 is labeled Top Secret, which means access to read the directory will 2875 only be granted if the subject making the request is also labeled Top 2876 Secret. 2878 8.8.7.1. Full Mode 2880 In the first part of this example, a process on the client is running 2881 at the Secret level. The process issues a readdir system call, which 2882 enters the kernel. Before translating the readdir system call into a 2883 request to the NFSv4 server, the host operating system will consult 2884 the MAC module to see if the operation is allowed. Since the process 2885 is operating at Secret and the directory to be accessed is labeled 2886 Top Secret, the MAC module will deny the request, and an error code is 2887 returned to user space. 2889 Consider a second case where, instead of running at Secret, the process 2890 is running at Top Secret. In this case, the sensitivity of the 2891 process is equal to or greater than that of the directory, so the MAC 2892 module will allow the request. Now the readdir is translated into 2893 the necessary NFSv4 call to the server. For the RPC request, the 2894 client uses the proper credential to assert to the server that 2895 the process is running at Top Secret. 2897 When the server receives the request, it extracts the security label 2898 from the RPC session and retrieves the label on the directory. The 2899 server then checks with its MAC module whether a Top Secret process is 2900 allowed to read the contents of the Top Secret directory. Since this 2901 is allowed by the policy, the server will return the appropriate 2902 information to the client. 2904 In this example, the policies on the client and server were the 2905 same. In the event that they were running different policies, a 2906 translation of the labels might be needed. In this case, it could be 2907 possible for a check to pass on the client and fail on the server. 2908 The server may consider additional information when making its policy 2909 decisions. For example, the server could determine that a certain 2910 subnet is only cleared for data up to Secret classification. If that 2911 constraint were in place for the example above, the client check would still 2912 succeed, but the server would fail the request, since the client is asserting a 2913 label that it is not able to use (Top Secret on a Secret network). 2915 8.8.7.2. Smart Client Mode 2917 In smart client mode, the example is identical to the first part of a 2918 full mode operation. A process on the client labeled Secret wishes 2919 to access a Top Secret directory. As in the full mode example, this 2920 is denied, since Secret does not dominate Top Secret. If the process 2921 were operating at Top Secret, it would pass the local access control 2922 check, and the NFSv4 operation would proceed as in a normal NFSv4 2923 environment. 2925 8.8.7.3. Smart Server Mode 2927 In smart server mode, the client behaves as if it were in a normal 2928 NFSv4 environment.
Since the process on the client does not provide 2929 a security attribute, the server must define a mechanism for labeling 2930 all requests from a client. Assume that the server is using the same 2931 criteria used in the full mode example. The server sees the request 2932 as coming from a subnet that is a Secret network. The server 2933 determines that all clients on that subnet will have their requests 2934 labeled with Secret. Since the directory on the server is labeled 2935 Top Secret, and Secret does not dominate Top Secret, the server would 2936 fail the request with NFS4ERR_ACCESS. 2938 8.9. Security Considerations 2940 This entire document deals with security issues. 2942 Depending on the level of protection the MAC system offers, there may 2943 be a requirement to tightly bind the security attribute to the data. 2945 When either the client is in Smart Client Mode or the server is in Smart 2946 Server Mode, it is important to realize that the other side is not 2947 enforcing MAC protections. Alternate methods might be in use to 2948 handle the lack of MAC support, and care should be taken to identify 2949 and mitigate threats from possible tampering outside of these 2950 methods. 2952 An example of this is that a smart server that modifies READDIR or 2953 LOOKUP results based on the client's subject label might want to 2954 always construct the same subject label for a client which does not 2955 present one. This will prevent a non-LNFS client from mixing entries 2956 in the directory cache. 2958 9. Security Considerations 2960 10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 2962 The following tables summarize the operations of the NFSv4.2 protocol 2963 and the corresponding designation of REQUIRED, RECOMMENDED, and 2964 OPTIONAL to implement or MUST NOT implement. The designation of MUST 2965 NOT implement is reserved for those operations that were defined in 2966 either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2. 2968 For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation 2969 for operations sent by the client is for the server implementation. 2970 The client is generally required to implement the operations needed 2971 for the operating environment it serves. For example, a 2972 read-only NFSv4.2 client would have no need to implement the WRITE 2973 operation and is not required to do so. 2975 The REQUIRED or OPTIONAL designation for callback operations sent by 2976 the server is for both the client and server. Generally, the client 2977 has the option of creating the backchannel and sending the operations 2978 on the fore channel that will be a catalyst for the server sending 2979 callback operations. A partial exception is CB_RECALL_SLOT; the only 2980 way the client can avoid supporting this operation is by not creating 2981 a backchannel. 2983 Since this is a summary of the operations and their designation, 2984 there are subtleties that are not presented here. Therefore, if 2985 there is a question about the requirements of implementation, the 2986 operation descriptions themselves must be consulted, along with other 2987 relevant explanatory text within either this specification or that of 2988 NFSv4.1 [2]. 2990 The abbreviations used in the second and third columns of the table 2991 are defined as follows.
2993 REQ REQUIRED to implement 2995 REC RECOMMENDED to implement 2997 OPT OPTIONAL to implement 2998 MNI MUST NOT implement 3000 For the NFSv4.2 features that are OPTIONAL, the operations that 3001 support those features are OPTIONAL, and the server would return 3002 NFS4ERR_NOTSUPP in response to the client's use of those operations. 3003 If an OPTIONAL feature is supported, it is possible that a set of 3004 operations related to the feature become REQUIRED to implement. The 3005 third column of the table designates the feature(s) and whether the 3006 operation is REQUIRED or OPTIONAL in the presence of support for the 3007 feature. 3009 The OPTIONAL features identified and their abbreviations are as 3010 follows: 3012 pNFS Parallel NFS 3014 FDELG File Delegations 3016 DDELG Directory Delegations 3018 COPY Server Side Copy 3020 ADB Application Data Blocks 3022 Operations 3024 +----------------------+--------------------+-----------------------+ 3025 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, or | 3026 | | MNI | OPT) | 3027 +----------------------+--------------------+-----------------------+ 3028 | ACCESS | REQ | | 3029 | BACKCHANNEL_CTL | REQ | | 3030 | BIND_CONN_TO_SESSION | REQ | | 3031 | CLOSE | REQ | | 3032 | COMMIT | REQ | | 3033 | COPY | OPT | COPY (REQ) | 3034 | COPY_ABORT | OPT | COPY (REQ) | 3035 | COPY_NOTIFY | OPT | COPY (REQ) | 3036 | COPY_REVOKE | OPT | COPY (REQ) | 3037 | COPY_STATUS | OPT | COPY (REQ) | 3038 | CREATE | REQ | | 3039 | CREATE_SESSION | REQ | | 3040 | DELEGPURGE | OPT | FDELG (REQ) | 3041 | DELEGRETURN | OPT | FDELG, DDELG, pNFS | 3042 | | | (REQ) | 3043 | DESTROY_CLIENTID | REQ | | 3044 | DESTROY_SESSION | REQ | | 3045 | EXCHANGE_ID | REQ | | 3046 | FREE_STATEID | REQ | | 3047 | GETATTR | REQ | | 3048 | GETDEVICEINFO | OPT | pNFS (REQ) | 3049 | GETDEVICELIST | OPT | pNFS (OPT) | 3050 | GETFH | REQ | | 3051 | INITIALIZE | OPT | ADB (REQ) | 3052 | GET_DIR_DELEGATION | OPT | DDELG (REQ) | 3053 | LAYOUTCOMMIT | OPT | pNFS (REQ) | 3054 | LAYOUTGET | OPT | pNFS (REQ) | 3055 | LAYOUTRETURN | OPT | pNFS (REQ) | 3056 | LINK | OPT | | 3057 | LOCK | REQ | | 3058 | LOCKT | REQ | | 3059 | LOCKU | REQ | | 3060 | LOOKUP | REQ | | 3061 | LOOKUPP | REQ | | 3062 | NVERIFY | REQ | | 3063 | OPEN | REQ | | 3064 | OPENATTR | OPT | | 3065 | OPEN_CONFIRM | MNI | | 3066 | OPEN_DOWNGRADE | REQ | | 3067 | PUTFH | REQ | | 3068 | PUTPUBFH | REQ | | 3069 | PUTROOTFH | REQ | | 3070 | READ | OPT | | 3071 | READDIR | REQ | | 3072 | READLINK | OPT | | 3073 | READ_PLUS | OPT | ADB (REQ) | 3074 | RECLAIM_COMPLETE | REQ | | 3075 | RELEASE_LOCKOWNER | MNI | | 3076 | REMOVE | REQ | | 3077 | RENAME | REQ | | 3078 | RENEW | MNI | | 3079 | RESTOREFH | REQ | | 3080 | SAVEFH | REQ | | 3081 | SECINFO | REQ | | 3082 | SECINFO_NO_NAME | REC | pNFS file layout | 3083 | | | (REQ) | 3084 | SEQUENCE | REQ | | 3085 | SETATTR | REQ | | 3086 | SETCLIENTID | MNI | | 3087 | SETCLIENTID_CONFIRM | MNI | | 3088 | SET_SSV | REQ | | 3089 | TEST_STATEID | REQ | | 3090 | VERIFY | REQ | | 3091 | WANT_DELEGATION | OPT | FDELG (OPT) | 3092 | WRITE | REQ | | 3093 +----------------------+--------------------+-----------------------+ 3094 Callback Operations 3096 +-------------------------+-------------------+---------------------+ 3097 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, | 3098 | | MNI | or OPT) | 3099 +-------------------------+-------------------+---------------------+ 3100 | CB_COPY | OPT | COPY (REQ) | 3101 | CB_GETATTR | OPT | FDELG (REQ) | 3102 | CB_LAYOUTRECALL | OPT | pNFS (REQ) | 3103 | CB_NOTIFY
| OPT | DDELG (REQ) | 3104 | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | 3105 | CB_NOTIFY_LOCK | OPT | | 3106 | CB_PUSH_DELEG | OPT | FDELG (OPT) | 3107 | CB_RECALL | OPT | FDELG, DDELG, pNFS | 3108 | | | (REQ) | 3109 | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | 3110 | | | (REQ) | 3111 | CB_RECALL_SLOT | REQ | | 3112 | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | 3113 | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | 3114 | | | (REQ) | 3115 | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | 3116 | | | (REQ) | 3117 +-------------------------+-------------------+---------------------+ 3119 11. NFSv4.2 Operations 3121 11.1. Operation 59: COPY - Initiate a server-side copy 3123 11.1.1. ARGUMENT 3125 const COPY4_GUARDED = 0x00000001; 3126 const COPY4_METADATA = 0x00000002; 3128 struct COPY4args { 3129 /* SAVED_FH: source file */ 3130 /* CURRENT_FH: destination file or */ 3131 /* directory */ 3132 offset4 ca_src_offset; 3133 offset4 ca_dst_offset; 3134 length4 ca_count; 3135 uint32_t ca_flags; 3136 component4 ca_destination; 3137 netloc4 ca_source_server<>; 3138 }; 3140 11.1.2. RESULT 3142 union COPY4res switch (nfsstat4 cr_status) { 3143 case NFS4_OK: 3144 stateid4 cr_callback_id<1>; 3145 default: 3146 length4 cr_bytes_copied; 3147 }; 3149 11.1.3. DESCRIPTION 3151 The COPY operation is used for both intra- and inter-server copies. 3152 In both cases, the COPY is always sent from the client to the 3153 destination server of the file copy. The COPY operation requests 3154 that a file be copied from the location specified by the SAVED_FH 3155 value to the location specified by the combination of CURRENT_FH and 3156 ca_destination. 3158 The SAVED_FH must be a regular file. If SAVED_FH is not a regular 3159 file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. 3161 In order to set SAVED_FH to the source file handle, the compound 3162 procedure requesting the COPY will include a sub-sequence of 3163 operations such as 3165 PUTFH source-fh 3166 SAVEFH 3168 If the request is for a server-to-server copy, the source-fh is a 3169 filehandle from the source server and the compound procedure is being 3170 executed on the destination server. In this case, the source-fh is a 3171 foreign filehandle on the server receiving the COPY request. If 3172 either PUTFH or SAVEFH checked the validity of the filehandle, the 3173 operation would likely fail and return NFS4ERR_STALE. 3175 In order to avoid this problem, the minor version incorporating the 3176 COPY operations will need to make a few small changes in the handling 3177 of existing operations. If a server supports the server-to-server 3178 COPY feature, a PUTFH followed by a SAVEFH MUST NOT return 3179 NFS4ERR_STALE for either operation. These restrictions do not pose 3180 substantial difficulties for servers. The CURRENT_FH and SAVED_FH 3181 may be validated in the context of the operation referencing them and 3182 an NFS4ERR_STALE error returned for an invalid file handle at that 3183 point. 3185 The CURRENT_FH and ca_destination together specify the destination of 3186 the copy operation. If ca_destination is of 0 (zero) length, then 3187 CURRENT_FH specifies the target file. In this case, CURRENT_FH MUST 3188 be a regular file and not a directory. If ca_destination is not of 0 3189 (zero) length, the ca_destination argument specifies the file name to 3190 which the data will be copied within the directory identified by 3191 CURRENT_FH. In this case, CURRENT_FH MUST be a directory and not a 3192 regular file. 
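To make the filehandle plumbing concrete, the sketch below (in Python, purely illustrative) assembles the COMPOUND for an intra-server copy into an existing destination file, using a zero-length ca_destination so that CURRENT_FH names the file itself. The copy_compound helper and its tuple encoding are hypothetical stand-ins for an RPC library, not part of this specification:

   # Hypothetical encoding of the COMPOUND described above.
   def copy_compound(src_fh, dst_fh, src_off, dst_off, count, flags=0):
       return [
           ("PUTFH", src_fh),    # source file becomes CURRENT_FH ...
           ("SAVEFH",),          # ... and is then saved into SAVED_FH
           ("PUTFH", dst_fh),    # destination file becomes CURRENT_FH
           ("COPY", {
               "ca_src_offset": src_off,
               "ca_dst_offset": dst_off,
               "ca_count": count,        # 0 (zero) means "through EOF"
               "ca_flags": flags,
               "ca_destination": "",     # zero length: CURRENT_FH is the file
               "ca_source_server": [],   # empty: intra-server copy
           }),
       ]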
3194 If the file named by ca_destination does not exist and the operation 3195 completes successfully, the file will be visible in the file system 3196 namespace. If the file does not exist and the operation fails, the 3197 file MAY be visible in the file system namespace depending on when 3198 the failure occurs and on the implementation of the NFS server 3199 receiving the COPY operation. If the ca_destination name cannot be 3200 created in the destination file system (due to file name 3201 restrictions, such as case or length), the operation MUST fail. 3203 The ca_src_offset is the offset within the source file from which the 3204 data will be read, the ca_dst_offset is the offset within the 3205 destination file to which the data will be written, and the ca_count 3206 is the number of bytes that will be copied. An offset of 0 (zero) 3207 specifies the start of the file. A count of 0 (zero) requests that 3208 all bytes from ca_src_offset through EOF be copied to the 3209 destination. If concurrent modifications to the source file overlap 3210 with the source file region being copied, the data copied may include 3211 all, some, or none of the modifications. The client can use standard 3212 NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory 3213 byte range locks) to protect against concurrent modifications if the 3214 client is concerned about this. If the source file's end of file is 3215 being modified in parallel with a copy that specifies a count of 0 3216 (zero) bytes, the amount of data copied is implementation dependent 3217 (clients may guard against this case by specifying a non-zero count 3218 value or preventing modification of the source file as mentioned 3219 above). 3221 If the source offset or the source offset plus count is greater than 3222 or equal to the size of the source file, the operation will fail with 3223 NFS4ERR_INVAL. The destination offset or destination offset plus 3224 count may be greater than the size of the destination file. This 3225 allows the client to issue parallel copies to implement 3226 operations such as "cat file1 file2 file3 file4 > dest". 3228 If the destination file is created as a result of this command, the 3229 destination file's size will be equal to the number of bytes 3230 successfully copied. If the destination file already existed, the 3231 destination file's size may increase as a result of this operation 3232 (e.g., if ca_dst_offset plus ca_count is greater than the 3233 destination's initial size). 3235 If the ca_source_server list is specified, then this is an inter- 3236 server copy operation and the source file is on a remote server. The 3237 client is expected to have previously issued a successful COPY_NOTIFY 3238 request to the remote source server. The ca_source_server list 3239 SHOULD be the same as the COPY_NOTIFY response's cnr_source_server 3240 list. If the client includes the entries from the COPY_NOTIFY 3241 response's cnr_source_server list in the ca_source_server list, the 3242 source server can indicate a specific copy protocol for the 3243 destination server to use by returning a URL, which specifies both a 3244 protocol service and server name. Server-to-server copy protocol 3245 considerations are described in Section 4.2.3 and Section 4.4.1. 3247 The ca_flags argument allows the copy operation to be customized in 3248 the following ways using the guarded flag (COPY4_GUARDED) and the 3249 metadata flag (COPY4_METADATA), as illustrated in the sketch below.
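Continuing the earlier copy_compound sketch (again hypothetical, not normative), the "cat" example above can be implemented by issuing one COPY per source file, with each ca_dst_offset set to the running total of the source sizes; the two ca_flags bits are set as shown:

   COPY4_GUARDED  = 0x00000001
   COPY4_METADATA = 0x00000002

   def concatenate(sources, dst_fh, size_of, issue):
       # 'size_of' returns a source file's current size and 'issue'
       # sends one COMPOUND; both are hypothetical helpers.
       dst_off = 0
       for src_fh in sources:
           # ca_count of 0 copies from offset 0 through EOF; writing
           # past the destination's current EOF is permitted, so these
           # COPY operations may be issued in parallel.
           issue(copy_compound(src_fh, dst_fh, 0, dst_off, 0))
           dst_off += size_of(src_fh)

   # A guarded, whole-file copy that also carries metadata would pass
   # flags=COPY4_GUARDED | COPY4_METADATA with count=0 instead.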
3251 If the guarded flag is set and the destination exists on the server, 3252 this operation will fail with NFS4ERR_EXIST. 3254 If the guarded flag is not set and the destination exists on the 3255 server, the behavior is implementation dependent. 3257 If the metadata flag is set and the client is requesting a whole file 3258 copy (i.e., ca_count is 0 (zero)), a subset of the destination file's 3259 attributes MUST be the same as the source file's corresponding 3260 attributes and a subset of the destination file's attributes SHOULD 3261 be the same as the source file's corresponding attributes. The 3262 attributes in the MUST and SHOULD copy subsets will be defined for 3263 each NFS version. 3265 For NFSv4.1, Table 2 and Table 3 list the REQUIRED and RECOMMENDED 3266 attributes respectively. A "MUST" in the "Copy to destination file?" 3267 column indicates that the attribute is part of the MUST copy set. A 3268 "SHOULD" in the "Copy to destination file?" column indicates that the 3269 attribute is part of the SHOULD copy set. 3271 +--------------------+----+---------------------------+ 3272 | Name | Id | Copy to destination file? | 3273 +--------------------+----+---------------------------+ 3274 | supported_attrs | 0 | no | 3275 | type | 1 | MUST | 3276 | fh_expire_type | 2 | no | 3277 | change | 3 | SHOULD | 3278 | size | 4 | MUST | 3279 | link_support | 5 | no | 3280 | symlink_support | 6 | no | 3281 | named_attr | 7 | no | 3282 | fsid | 8 | no | 3283 | unique_handles | 9 | no | 3284 | lease_time | 10 | no | 3285 | rdattr_error | 11 | no | 3286 | filehandle | 19 | no | 3287 | suppattr_exclcreat | 75 | no | 3288 +--------------------+----+---------------------------+ 3290 Table 2 3292 +--------------------+----+---------------------------+ 3293 | Name | Id | Copy to destination file? 
| 3294 +--------------------+----+---------------------------+ 3295 | acl | 12 | MUST | 3296 | aclsupport | 13 | no | 3297 | archive | 14 | no | 3298 | cansettime | 15 | no | 3299 | case_insensitive | 16 | no | 3300 | case_preserving | 17 | no | 3301 | change_policy | 60 | no | 3302 | chown_restricted | 18 | MUST | 3303 | dacl | 58 | MUST | 3304 | dir_notif_delay | 56 | no | 3305 | dirent_notif_delay | 57 | no | 3306 | fileid | 20 | no | 3307 | files_avail | 21 | no | 3308 | files_free | 22 | no | 3309 | files_total | 23 | no | 3310 | fs_charset_cap | 76 | no | 3311 | fs_layout_type | 62 | no | 3312 | fs_locations | 24 | no | 3313 | fs_locations_info | 67 | no | 3314 | fs_status | 61 | no | 3315 | hidden | 25 | MUST | 3316 | homogeneous | 26 | no | 3317 | layout_alignment | 66 | no | 3318 | layout_blksize | 65 | no | 3319 | layout_hint | 63 | no | 3320 | layout_type | 64 | no | 3321 | maxfilesize | 27 | no | 3322 | maxlink | 28 | no | 3323 | maxname | 29 | no | 3324 | maxread | 30 | no | 3325 | maxwrite | 31 | no | 3326 | max_hole_punch | 31 | no | 3327 | mdsthreshold | 68 | no | 3328 | mimetype | 32 | MUST | 3329 | mode | 33 | MUST | 3330 | mode_set_masked | 74 | no | 3331 | mounted_on_fileid | 55 | no | 3332 | no_trunc | 34 | no | 3333 | numlinks | 35 | no | 3334 | owner | 36 | MUST | 3335 | owner_group | 37 | MUST | 3336 | quota_avail_hard | 38 | no | 3337 | quota_avail_soft | 39 | no | 3338 | quota_used | 40 | no | 3339 | rawdev | 41 | no | 3340 | retentevt_get | 71 | MUST | 3341 | retentevt_set | 72 | no | 3342 | retention_get | 69 | MUST | 3343 | retention_hold | 73 | MUST | 3344 | retention_set | 70 | no | 3345 | sacl | 59 | MUST | 3346 | space_avail | 42 | no | 3347 | space_free | 43 | no | 3348 | space_freed | 78 | no | 3349 | space_reserved | 77 | MUST | 3350 | space_total | 44 | no | 3351 | space_used | 45 | no | 3352 | system | 46 | MUST | 3353 | time_access | 47 | MUST | 3354 | time_access_set | 48 | no | 3355 | time_backup | 49 | no | 3356 | time_create | 50 | MUST | 3357 | time_delta | 51 | no | 3358 | time_metadata | 52 | SHOULD | 3359 | time_modify | 53 | MUST | 3360 | time_modify_set | 54 | no | 3361 +--------------------+----+---------------------------+ 3363 Table 3 3365 [NOTE: The source file's attribute values will take precedence over 3366 any attribute values inherited by the destination file.] 3367 In the case of an inter-server copy or an intra-server copy between 3368 file systems, the attributes supported for the source file and 3369 destination file could be different. By definition, the REQUIRED 3370 attributes will be supported in all cases. If the metadata flag is 3371 set and the source file has a RECOMMENDED attribute that is not 3372 supported for the destination file, the copy MUST fail with 3373 NFS4ERR_ATTRNOTSUPP. 3375 Any attribute supported by the destination server that is not set on 3376 the source file SHOULD be left unset. 3378 Metadata attributes not exposed via the NFS protocol SHOULD be copied 3379 to the destination file where appropriate. 3381 The destination file's named attributes are not duplicated from the 3382 source file. After the copy process completes, the client MAY 3383 attempt to duplicate named attributes using standard NFSv4 3384 operations. However, the destination file's named attribute 3385 capabilities MAY be different from the source file's named attribute 3386 capabilities.
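The MUST/SHOULD subsets in Tables 2 and 3 lend themselves to a table-driven implementation. The sketch below is illustrative only; the copy sets are abbreviated, and the failure rule is simplified to the attributes in those sets:

   # Abbreviated copy subsets drawn from Tables 2 and 3.
   MUST_COPY = {"type", "size", "acl", "dacl", "sacl", "hidden", "mode",
                "owner", "owner_group", "system", "time_access",
                "time_create", "time_modify", "space_reserved"}
   SHOULD_COPY = {"change", "time_metadata"}

   def copy_metadata(src_attrs, dst_supported):
       """Return the attributes to set on the destination file."""
       dst_attrs = {}
       for name in (MUST_COPY | SHOULD_COPY) & set(src_attrs):
           if name not in dst_supported:
               # A RECOMMENDED attribute set on the source but not
               # supported on the destination fails the whole copy.
               raise ValueError("NFS4ERR_ATTRNOTSUPP: " + name)
           dst_attrs[name] = src_attrs[name]
       # Attributes not set on the source are left unset here.
       return dst_attrs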
3388 If the metadata flag is not set and the client is requesting a whole 3389 file copy (i.e., ca_count is 0 (zero)), the destination file's 3390 metadata is implementation dependent. 3392 If the client is requesting a partial file copy (i.e., ca_count is 3393 not 0 (zero)), the client SHOULD NOT set the metadata flag, and the 3394 server MUST ignore the metadata flag. 3396 If the operation does not result in an immediate failure, the server 3397 will return NFS4_OK, and the CURRENT_FH will remain the destination's 3398 filehandle. 3400 If an immediate failure does occur, cr_bytes_copied will be set to 3401 the number of bytes copied to the destination file before the error 3402 occurred. The cr_bytes_copied value indicates the number of bytes 3403 copied but not which specific bytes have been copied. 3405 A return of NFS4_OK indicates that either the operation is complete 3406 or the operation was initiated and a callback will be used to deliver 3407 the final status of the operation. 3409 If the cr_callback_id is returned, this indicates that the operation 3410 was initiated and a CB_COPY callback will deliver the final results 3411 of the operation. The cr_callback_id stateid is termed a copy 3412 stateid in this context. The server is given the option of returning 3413 the results in a callback because the data may require a relatively 3414 long period of time to copy. 3416 If no cr_callback_id is returned, the operation completed 3417 synchronously, and no callback will be issued by the server. The 3418 completion status of the operation is indicated by cr_status. 3420 If the copy completes successfully, either synchronously or 3421 asynchronously, the data copied from the source file to the 3422 destination file MUST appear identical to the NFS client. However, 3423 the NFS server's on-disk representation of the data in the source 3424 file and destination file MAY differ. For example, the NFS server 3425 might encrypt, compress, deduplicate, or otherwise represent the on- 3426 disk data in the source and destination file differently. 3428 In the event of a failure, the state of the destination file is 3429 implementation dependent. The COPY operation may fail for the 3430 following reasons (this is a partial list). 3432 NFS4ERR_MOVED: The file system that contains the source file, or 3433 the destination file or directory, is not present. The client can 3434 determine the correct location and reissue the operation with the 3435 correct location. 3437 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3438 NFS server receiving this request. 3440 NFS4ERR_PARTNER_NOTSUPP: The remote server does not support the 3441 server-to-server copy offload protocol. 3443 NFS4ERR_PARTNER_NO_AUTH: The remote server does not authorize a 3444 server-to-server copy offload operation. This may be due to the 3445 client's failure to send the COPY_NOTIFY operation to the remote 3446 server, the remote server receiving a server-to-server copy 3447 offload request after the copy lease time expired, or some 3448 other permission problem. 3450 NFS4ERR_FBIG: The copy operation would have caused the file to grow 3451 beyond the server's limit. 3453 NFS4ERR_NOTDIR: The CURRENT_FH is a file and ca_destination has non- 3454 zero length. 3456 NFS4ERR_WRONG_TYPE: The SAVED_FH is not a regular file. 3458 NFS4ERR_ISDIR: The CURRENT_FH is a directory and ca_destination has 3459 zero length.
3461 NFS4ERR_INVAL: The source offset or the source offset plus count is 3462 greater than or equal to the size of the source file. 3464 NFS4ERR_DELAY: The server does not have the resources to perform the 3465 copy operation at the current time. The client should retry the 3466 operation sometime in the future. 3468 NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the 3469 same metadata as the source file. 3471 NFS4ERR_WRONGSEC: The security mechanism being used by the client 3472 does not match the server's security policy. 3474 11.2. Operation 60: COPY_ABORT - Cancel a server-side copy 3476 11.2.1. ARGUMENT 3478 struct COPY_ABORT4args { 3479 /* CURRENT_FH: destination file */ 3480 stateid4 caa_stateid; 3481 }; 3483 11.2.2. RESULT 3485 struct COPY_ABORT4res { 3486 nfsstat4 car_status; 3487 }; 3489 11.2.3. DESCRIPTION 3491 COPY_ABORT is used for both intra- and inter-server asynchronous 3492 copies. The COPY_ABORT operation allows the client to cancel a 3493 server-side copy operation that it initiated. This operation is sent 3494 in a COMPOUND request from the client to the destination server. 3495 This operation may be used to cancel a copy when the application that 3496 requested the copy exits before the operation is completed or for 3497 some other reason. 3499 The request contains the filehandle and copy stateid cookies that act 3500 as the context for the previously initiated copy operation. 3502 The result's car_status field indicates whether the cancel was 3503 successful or not. A value of NFS4_OK indicates that the copy 3504 operation was canceled and no callback will be issued by the server. 3505 A copy operation that is successfully canceled may result in none, 3506 some, or all of the data having been copied. 3508 If the server supports asynchronous copies, the server is REQUIRED to 3509 support the COPY_ABORT operation. 3511 The COPY_ABORT operation may fail for the following reasons (this is 3512 a partial list): 3514 NFS4ERR_NOTSUPP: The abort operation is not supported by the NFS 3515 server receiving this request. 3517 NFS4ERR_RETRY: The abort failed, but a retry at some time in the 3518 future MAY succeed. 3520 NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will 3521 deliver the results of the copy operation. 3523 NFS4ERR_SERVERFAULT: An error occurred on the server that does not 3524 map to a specific error code. 3526 11.3. Operation 61: COPY_NOTIFY - Notify a source server of a future 3527 copy 3529 11.3.1. ARGUMENT 3531 struct COPY_NOTIFY4args { 3532 /* CURRENT_FH: source file */ 3533 netloc4 cna_destination_server; 3534 }; 3536 11.3.2. RESULT 3538 struct COPY_NOTIFY4resok { 3539 nfstime4 cnr_lease_time; 3540 netloc4 cnr_source_server<>; 3541 }; 3543 union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { 3544 case NFS4_OK: 3545 COPY_NOTIFY4resok resok4; 3546 default: 3547 void; 3548 }; 3550 11.3.3. DESCRIPTION 3552 This operation is used for an inter-server copy. A client sends this 3553 operation in a COMPOUND request to the source server to authorize a 3554 destination server identified by cna_destination_server to read the 3555 file specified by CURRENT_FH on behalf of the given user. 3557 The cna_destination_server MUST be specified using the netloc4 3558 network location format. The server is not required to resolve the 3559 cna_destination_server address before completing this operation. 3561 If this operation succeeds, the source server will allow the 3562 cna_destination_server to copy the specified file on behalf of the 3563 given user.
If COPY_NOTIFY succeeds, the destination server is 3564 granted permission to read the file as long as both of the following 3565 conditions are met: 3567 o The destination server begins reading the source file before the 3568 cnr_lease_time expires. If the cnr_lease_time expires while the 3569 destination server is still reading the source file, the 3570 destination server is allowed to finish reading the file. 3572 o The client has not issued a COPY_REVOKE for the same combination 3573 of user, filehandle, and destination server. 3575 The cnr_lease_time is chosen by the source server. A cnr_lease_time 3576 of 0 (zero) indicates an infinite lease. To renew the copy lease 3577 time, the client should resend the same copy notification request to 3578 the source server. 3580 To avoid the need for synchronized clocks, copy lease times are 3581 granted by the server as a time delta. However, there is a 3582 requirement that the client and server clocks do not drift 3583 excessively over the duration of the lease. There is also the issue 3584 of propagation delay across the network, which could easily be 3585 several hundred milliseconds, as well as the possibility that 3586 requests will be lost and need to be retransmitted. 3588 To take propagation delay into account, the client should subtract it 3589 from copy lease times (e.g., if the client estimates the one-way 3590 propagation delay as 200 milliseconds, then it can assume that the 3591 lease is already 200 milliseconds old when it gets it). In addition, 3592 it will take another 200 milliseconds for a lease renewal request to 3593 reach the server. So the client must send a lease renewal or send 3594 the copy offload request to the cna_destination_server at least 400 3595 milliseconds before the copy lease would expire. If the propagation 3596 delay varies over the life of the lease (e.g., the client is on a 3597 mobile host), the client will need to continuously subtract the 3598 increase in propagation delay from the copy lease times. A non- normative sketch of this arithmetic follows the error list below. 3600 The server's copy lease period configuration should take into account 3601 the network distance of the clients that will be accessing the 3602 server's resources. It is expected that the lease period will take 3603 into account the network propagation delays and other network delay 3604 factors for the client population. Since the protocol does not allow 3605 for an automatic method to determine an appropriate copy lease 3606 period, the server's administrator may have to tune the copy lease 3607 period. 3609 A successful response will also contain a list of names, addresses, 3610 and URLs, called cnr_source_server, on which the source server is 3611 willing to accept connections from the destination server. These 3612 might not be reachable from the client and might be located on 3613 networks to which the client has no connection. 3615 If the client wishes to perform an inter-server copy, the client MUST 3616 send a COPY_NOTIFY to the source server. Therefore, the source 3617 server MUST support COPY_NOTIFY. 3619 For a copy only involving one server (the source and destination are 3620 on the same server), this operation is unnecessary. 3622 The COPY_NOTIFY operation may fail for the following reasons (this is 3623 a partial list): 3625 NFS4ERR_MOVED: The file system which contains the source file is not 3626 present on the source server. The client can determine the 3627 correct location and reissue the operation with the correct 3628 location. 3630 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3631 NFS server receiving this request. 3633 NFS4ERR_WRONGSEC: The security mechanism being used by the client 3634 does not match the server's security policy.
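
   The following Python fragment is the non-normative sketch of the
   renewal arithmetic promised above; the function and variable names
   are illustrative only and are not part of the protocol.

      def renewal_send_deadline(granted_at_ms, lease_ms, delay_ms):
          """Latest local time (in milliseconds) at which the client
          should send a lease renewal.  The lease is assumed to be
          delay_ms old on arrival, and a renewal takes another
          delay_ms to reach the source server."""
          if lease_ms == 0:
              return None  # a cnr_lease_time of 0 is an infinite lease
          expires_at_ms = granted_at_ms + lease_ms - delay_ms
          return expires_at_ms - delay_ms

      # With the 200 millisecond example above, a 10 second lease
      # received at local time 0 should be renewed by 9.6 seconds.
      assert renewal_send_deadline(0, 10000, 200) == 9600
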
3636 11.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy 3637 privileges 3639 11.4.1. ARGUMENT 3641 struct COPY_REVOKE4args { 3642 /* CURRENT_FH: source file */ 3643 netloc4 cra_destination_server; 3644 }; 3646 11.4.2. RESULT 3648 struct COPY_REVOKE4res { 3649 nfsstat4 crr_status; 3650 }; 3652 11.4.3. DESCRIPTION 3654 This operation is used for an inter-server copy. A client sends this 3655 operation in a COMPOUND request to the source server to revoke the 3656 authorization of a destination server identified by 3657 cra_destination_server from reading the file specified by CURRENT_FH 3658 on behalf of the given user. If the cra_destination_server has 3659 already begun copying the file, a successful return from this 3660 operation indicates that further access will be prevented. 3662 The cra_destination_server MUST be specified using the netloc4 3663 network location format. The server is not required to resolve the 3664 cra_destination_server address before completing this operation. 3666 The COPY_REVOKE operation is useful in situations in which the source 3667 server granted a very long or infinite lease on the destination 3668 server's ability to read the source file, and all copy operations on 3669 the source file have been completed. 3671 For a copy only involving one server (the source and destination are 3672 on the same server), this operation is unnecessary. 3674 If the server supports COPY_NOTIFY, the server is REQUIRED to support 3675 the COPY_REVOKE operation. 3677 The COPY_REVOKE operation may fail for the following reasons (this is 3678 a partial list): 3680 NFS4ERR_MOVED: The file system which contains the source file is not 3681 present on the source server. The client can determine the 3682 correct location and reissue the operation with the correct 3683 location. 3685 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3686 NFS server receiving this request. 3688 11.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy 3689 11.5.1. ARGUMENT 3691 struct COPY_STATUS4args { 3692 /* CURRENT_FH: destination file */ 3693 stateid4 csa_stateid; 3694 }; 3696 11.5.2. RESULT 3698 struct COPY_STATUS4resok { 3699 length4 csr_bytes_copied; 3700 nfsstat4 csr_complete<1>; 3701 }; 3703 union COPY_STATUS4res switch (nfsstat4 csr_status) { 3704 case NFS4_OK: 3705 COPY_STATUS4resok resok4; 3706 default: 3707 void; 3708 }; 3710 11.5.3. DESCRIPTION 3712 COPY_STATUS is used for both intra- and inter-server asynchronous 3713 copies. The COPY_STATUS operation allows the client to poll the 3714 server to determine the status of an asynchronous copy operation. 3715 This operation is sent by the client to the destination server. 3717 If this operation is successful, the number of bytes copied is 3718 returned to the client in the csr_bytes_copied field. The 3719 csr_bytes_copied value indicates the number of bytes copied but not 3720 which specific bytes have been copied. 3722 If the optional csr_complete field is present, the copy has 3723 completed. In this case, the status value indicates the result of 3724 the asynchronous copy operation. In all cases, the server will also 3725 deliver the final results of the asynchronous copy in a CB_COPY 3726 operation.
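
   A client might poll an asynchronous copy to completion as sketched
   below.  This is a non-normative illustration; the client object and
   its copy_status() method are hypothetical stand-ins for a real
   COMPOUND implementation.

      import time

      def wait_for_copy(client, dst_fh, copy_stateid, interval=1.0):
          # Poll COPY_STATUS until the optional csr_complete field is
          # present.  csr_complete<1> holds at most one nfsstat4
          # value, so it is modeled here as a list.
          while True:
              res = client.copy_status(dst_fh, copy_stateid)
              if res.csr_complete:            # copy has completed
                  return res.csr_complete[0]  # final copy status
              # res.csr_bytes_copied reports progress so far, but not
              # which bytes have been copied.
              time.sleep(interval)
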
3728 The failure of this operation does not indicate the result of the 3729 asynchronous copy in any way. 3731 If the server supports asynchronous copies, the server is REQUIRED to 3732 support the COPY_STATUS operation. 3734 The COPY_STATUS operation may fail for the following reasons (this is 3735 a partial list): 3737 NFS4ERR_NOTSUPP: The copy status operation is not supported by the 3738 NFS server receiving this request. 3740 NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 4.3.2 3741 below). 3743 NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid 3744 section below). 3746 11.6. Operation 64: INITIALIZE 3748 The server has no concept of the structure imposed by the 3749 application. It is only when the application writes to a section of 3750 the file that order gets imposed. In order to detect corruption even 3751 before the application utilizes the file, the application will want 3752 to initialize a range of ADBs. It uses the INITIALIZE operation to 3753 do so. 3755 11.6.1. ARGUMENT 3757 /* 3758 * We use data_content4 in case we wish to 3759 * add new types later. Note that we 3760 * are explicitly disallowing data. 3761 */ 3762 union initialize_arg4 switch (data_content4 content) { 3763 case NFS4_CONTENT_APP_BLOCK: 3764 app_data_block4 ia_adb; 3765 case NFS4_CONTENT_HOLE: 3766 hole_info4 ia_hole; 3767 default: 3768 void; 3769 }; 3771 struct INITIALIZE4args { 3772 /* CURRENT_FH: file */ 3773 stateid4 ia_stateid; 3774 stable_how4 ia_stable; 3775 initialize_arg4 ia_data<>; 3776 }; 3778 11.6.2. RESULT 3780 struct INITIALIZE4resok { 3781 count4 ir_count; 3782 stable_how4 ir_committed; 3783 verifier4 ir_writeverf; 3784 data_content4 ir_sparse; 3785 }; 3787 union INITIALIZE4res switch (nfsstat4 status) { 3788 case NFS4_OK: 3789 INITIALIZE4resok resok4; 3790 default: 3791 void; 3792 }; 3794 11.6.3. DESCRIPTION 3796 When the client invokes the INITIALIZE operation, it has two desired 3797 results: 3799 1. That the structure described by the app_data_block4 be imposed on 3800 the file. 3802 2. That the contents described by the app_data_block4 be sparse. 3804 If the server supports the INITIALIZE operation, it still might not 3805 support sparse files. So if it receives the INITIALIZE operation, 3806 then it MUST populate the contents of the file with the initialized 3807 ADBs. In other words, if the server supports INITIALIZE, then it 3808 supports the concept of ADBs. [[Comment.4: Do we want to support an 3809 asynchronous INITIALIZE? Do we have to? --TH]] 3811 If the data was already initialized, there are two interesting 3812 scenarios: 3814 1. The data blocks are allocated. 3816 2. Initializing in the middle of an existing ADB. 3818 If the data blocks were already allocated, then the INITIALIZE is a 3819 hole punch operation. If the server supports sparse files, then the 3820 data blocks are to be deallocated. If not, then the data blocks are 3821 to be rewritten in the indicated ADB format. [[Comment.5: Need to 3822 document interaction between space reservation and hole punching? 3823 --TH]] 3824 Since the server has no knowledge of ADBs, it should not report 3825 misaligned creation of ADBs. Even though it can detect them, it 3826 cannot disallow them, as the application might be in the process of 3827 changing the size of the ADBs. Thus the server must be prepared to 3828 handle an INITIALIZE into an existing ADB.
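
   The hole punch branch just described can be sketched, non-
   normatively, as follows.  The fs helpers and the adb_offset and
   adb_block_count field names are assumptions made for illustration
   and are not normative protocol elements.

      def initialize_allocated_range(fs, fh, adb, supports_sparse):
          # Length of the range covered by the ADB description
          # (assumed fields).
          length = adb.adb_block_size * adb.adb_block_count
          if supports_sparse:
              # INITIALIZE acts as a hole punch: deallocate blocks.
              fs.punch_hole(fh, adb.adb_offset, length)
          else:
              # No sparse file support: rewrite the range in the
              # indicated ADB format.
              fs.write_adb_format(fh, adb.adb_offset, length, adb)
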
3830 This document does not mandate the manner in which the server stores 3831 ADBs sparsely for a file. It does assume that if ADBs are stored 3832 sparsely, then the server can detect when an INITIALIZE arrives that 3833 will force a new ADB to start inside an existing ADB. For example, 3834 assume that ADBi has an adb_block_size of 4k and that an INITIALIZE 3835 starts 1k inside ADBi. The server should [[Comment.6: Need to flesh 3836 this out. --TH]] 3838 11.7. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID 3840 11.7.1. ARGUMENT 3842 /* new */ 3843 const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004; 3845 11.7.2. RESULT 3847 Unchanged 3849 11.7.3. MOTIVATION 3851 Enterprise applications require guarantees that an operation has 3852 either aborted or completed. NFSv4.1 provides this guarantee as long 3853 as the session is alive: simply send a SEQUENCE operation on the same 3854 slot with a new sequence number, and the successful return of 3855 SEQUENCE indicates the previous operation has completed. However, if 3856 the session is lost, there is no way to know when any in-progress 3857 operations have aborted or completed. In hindsight, the NFSv4.1 3858 specification should have mandated that DESTROY_SESSION abort/ 3859 complete all outstanding operations. 3861 11.7.4. DESCRIPTION 3863 A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability 3864 when it sends an EXCHANGE_ID operation. The server SHOULD set this 3865 capability in the EXCHANGE_ID reply whether the client requests it or 3866 not. If the client ID is created with this capability, then the 3867 following will occur: 3869 o The server will not reply to DESTROY_SESSION until all operations 3870 in progress are completed or aborted. 3872 o The server will not reply to subsequent EXCHANGE_ID invoked on the 3873 same Client Owner with a new verifier until all operations in 3874 progress on the Client ID's session are completed or aborted. 3876 o When DESTROY_CLIENTID is invoked, any sessions (both idle 3877 and non-idle), opens, locks, delegations, layouts, and/or wants 3878 (Section 18.49) associated with the client ID are removed. 3879 Pending operations will be completed or aborted before the 3880 sessions, opens, locks, delegations, layouts, and/or wants are 3881 deleted. 3883 o The NFS server SHOULD support client ID trunking, and if it does 3884 and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a 3885 session ID created on one node of the storage cluster MUST be 3886 destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID 3887 and an EXCHANGE_ID with a new verifier affect all sessions 3888 regardless of which node the sessions were created on. 3890 11.8. Operation 65: READ_PLUS 3892 If the client sends a READ operation, it is explicitly stating that 3893 it does not support sparse files. So if a READ occurs on a sparse 3894 ADB, then the server must expand such ADBs to be raw bytes. If a 3895 READ occurs in the middle of an ADB, the server can only send back 3896 bytes starting from that offset. 3898 Such an operation is inefficient for transfer of sparse sections of 3899 the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead, 3900 a client should issue READ_PLUS. Note that as the client has no a 3901 priori knowledge of whether an ADB is present or not, it should 3902 always use READ_PLUS. 3904 11.8.1. ARGUMENT 3906 struct READ_PLUS4args { 3907 /* CURRENT_FH: file */ 3908 stateid4 rpa_stateid; 3909 offset4 rpa_offset; 3910 count4 rpa_count; 3911 }; 3913 11.8.2. RESULT
3915 union read_plus_content switch (data_content4 content) { 3916 case NFS4_CONTENT_DATA: 3917 opaque rpc_data<>; 3918 case NFS4_CONTENT_APP_BLOCK: 3919 app_data_block4 rpc_block; 3920 case NFS4_CONTENT_HOLE: 3921 hole_info4 rpc_hole; 3922 default: 3923 void; 3924 }; 3926 /* 3927 * Allow a return of an array of contents. 3928 */ 3929 struct read_plus_res4 { 3930 bool rpr_eof; 3931 read_plus_content rpr_contents<>; 3932 }; 3934 union READ_PLUS4res switch (nfsstat4 status) { 3935 case NFS4_OK: 3936 read_plus_res4 resok4; 3937 default: 3938 void; 3939 }; 3941 11.8.3. DESCRIPTION 3943 Over the given range, READ_PLUS will return all data and ADBs found 3944 as an array of read_plus_content. It is possible to have consecutive 3945 ADBs in the array, either because different definitions of ADBs are 3946 present or because the guard pattern changes. 3948 Edge cases exist for ADBs which either begin before the rpa_offset 3949 requested by the READ_PLUS or end after the rpa_count requested - 3950 both of which may occur because not all applications which access the 3951 file are aware of the main application imposing a format on the file 3952 contents, e.g., tar, dd, cp, etc. READ_PLUS MUST retrieve whole 3953 ADBs, but it need not retrieve an entire sequence of ADBs. 3955 The server MUST return a whole ADB because, if it does not, it must 3956 expand that partial ADB into raw data before it sends it to the 3957 client. For example, if an ADB had a block size of 64k and the 3958 READ_PLUS was for 128k starting at an offset of 32k inside the ADB, 3959 then the first 32k would be converted to data. 3961 12. NFSv4.2 Callback Operations 3963 12.1. Operation 15: CB_COPY - Report results of a server-side copy 3965 12.1.1. ARGUMENT 3967 union copy_info4 switch (nfsstat4 cca_status) { 3968 case NFS4_OK: 3969 void; 3970 default: 3971 length4 cca_bytes_copied; 3972 }; 3974 struct CB_COPY4args { 3975 nfs_fh4 cca_fh; 3976 stateid4 cca_stateid; 3977 copy_info4 cca_copy_info; 3978 }; 3980 12.1.2. RESULT 3982 struct CB_COPY4res { 3983 nfsstat4 ccr_status; 3984 }; 3986 12.1.3. DESCRIPTION 3988 CB_COPY is used for both intra- and inter-server asynchronous copies. 3989 The CB_COPY callback informs the client of the result of an 3990 asynchronous server-side copy. This operation is sent by the 3991 destination server to the client in a CB_COMPOUND request. The copy 3992 is identified by the filehandle and stateid arguments. The result is 3993 indicated by the status field. If the copy failed, cca_bytes_copied 3994 contains the number of bytes copied before the failure occurred. The 3995 cca_bytes_copied value indicates the number of bytes copied but not 3996 which specific bytes have been copied. 3998 In the absence of an established backchannel, the server cannot 3999 signal the completion of the COPY via a CB_COPY callback. The loss 4000 of a callback channel would be indicated by the server setting the 4001 SEQ4_STATUS_CB_PATH_DOWN flag in the sr_status_flags field of the 4002 SEQUENCE operation. The client must re-establish the callback 4003 channel to receive the status of the COPY operation. Prolonged loss 4004 of the callback channel could result in the server dropping the COPY 4005 operation state and invalidating the copy stateid. 4007 If the client supports the COPY operation, the client is REQUIRED to 4008 support the CB_COPY operation.
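
   A non-normative sketch of client-side handling of this callback
   follows.  The registry of pending copies and its complete() method
   are hypothetical; only the cca_* names come from the ARGUMENT
   definition above.

      NFS4_OK = 0  # nfsstat4 success value

      def handle_cb_copy(pending_copies, cca_fh, cca_stateid,
                         cca_status, cca_bytes_copied=None):
          # Match the callback to the outstanding copy by filehandle
          # and copy stateid.
          copy = pending_copies.pop((cca_fh, cca_stateid))
          if cca_status == NFS4_OK:
              copy.complete(success=True)
          else:
              # On failure, cca_bytes_copied reports how many bytes
              # were copied before the error, but not which bytes.
              copy.complete(success=False,
                            bytes_copied=cca_bytes_copied)
          return NFS4_OK  # becomes ccr_status in the CB_COPY reply
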
4010 The CB_COPY operation may fail for the following reasons (this is a 4011 partial list): 4013 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 4014 NFS client receiving this request. 4016 13. IANA Considerations 4018 This section uses terms that are defined in [23]. 4020 14. References 4022 14.1. Normative References 4024 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 4025 Levels", BCP 14, RFC 2119, March 1997. 4027 [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System 4028 (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, 4029 January 2010. 4031 [3] Haynes, T., "Network File System (NFS) Version 4 Minor Version 4032 2 External Data Representation Standard (XDR) Description", 4033 March 2011. 4035 [4] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel 4036 NFS (pNFS) Operations", RFC 5664, January 2010. 4038 [5] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 4039 Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, 4040 January 2005. 4042 [6] Haynes, T. and N. Williams, "Remote Procedure Call (RPC) 4043 Security Version 3", draft-williams-rpcsecgssv3 (Work In 4044 Progress), 2011. 4046 [7] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 4047 Specification", RFC 2203, September 1997. 4049 [8] Shepler, S., Eisler, M., and D. Noveck, "Network File System 4050 (NFS) Version 4 Minor Version 1 External Data Representation 4051 Standard (XDR) Description", RFC 5662, January 2010. 4053 [9] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS) 4054 Block/Volume Layout", RFC 5663, January 2010. 4056 14.2. Informative References 4058 [10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4 4059 Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress), 4060 March 2011. 4062 [11] Eisler, M., "XDR: External Data Representation Standard", 4063 RFC 4506, May 2006. 4065 [12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, 4066 "NSDB Protocol for Federated Filesystems", 4067 draft-ietf-nfsv4-federated-fs-protocol (Work In Progress), 4068 2010. 4070 [13] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, 4071 "Administration Protocol for Federated Filesystems", 4072 draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010. 4074 [14] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., 4075 Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- 4076 HTTP/1.1", RFC 2616, June 1999. 4078 [15] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, 4079 RFC 959, October 1985. 4081 [16] Simpson, W., "PPP Challenge Handshake Authentication Protocol 4082 (CHAP)", RFC 1994, August 1996. 4084 [17] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of 4085 Oracle Database Concepts 11g Release 1 (11.1)", January 2011. 4087 [18] Ashdown, L., "Chapter 15, Validating Database Files and 4088 Backups, of Oracle Database Backup and Recovery User's Guide 4089 11g Release 1 (11.1)", August 2008. 4091 [19] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory 4092 Corruption of Solaris Internals", 2007. 4094 [20] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci- 4095 Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data 4096 Corruption in the Storage Stack", Proceedings of the 6th USENIX 4097 Symposium on File and Storage Technologies (FAST '08), 2008. 4099 [21] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide: 4100 Deployment, configuration and administration of Red Hat 4101 Enterprise Linux 5, Edition 6", 2011.
4103 [22] Quigley, D. and J. Lu, "Registry Specification for MAC Security 4104 Label Formats", draft-quigley-label-format-registry (Work In 4105 Progress), 2011. 4107 [23] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA 4108 Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. 4110 [24] Nowicki, B., "NFS: Network File System Protocol specification", 4111 RFC 1094, March 1989. 4113 [25] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 4114 Protocol Specification", RFC 1813, June 1995. 4116 [26] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 4117 RFC 1833, August 1995. 4119 [27] Eisler, M., "NFS Version 2 and Version 3 Security Issues and 4120 the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", 4121 RFC 2623, June 1999. 4123 [28] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997. 4125 [29] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, 4126 June 1999. 4128 [30] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On- 4129 line Database", RFC 3232, January 2002. 4131 [31] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964, 4132 June 1996. 4134 [32] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 4135 C., Eisler, M., and D. Noveck, "Network File System (NFS) 4136 version 4 Protocol", RFC 3530, April 2003. 4138 Appendix A. Acknowledgments 4140 For the pNFS Access Permissions Check, the original draft was by 4141 Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work 4142 was influenced by discussions with Benny Halevy and Bruce Fields. A 4143 review was done by Tom Haynes. 4145 For the Sharing change attribute implementation details with NFSv4 4146 clients, the original draft was by Trond Myklebust. 4148 For the NFS Server-side Copy, the original draft was by James 4149 Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul 4150 Iyer. Talpey co-authored an unpublished version of that document. 4152 It was also reviewed by a number of individuals: Pranoop Erasani, 4153 Tom Haynes, Arthur Lent, Trond Myklebust, Dave Noveck, Theresa 4154 Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani, and Nico 4155 Williams. 4157 For the NFS space reservation operations, the original draft was by 4158 Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer. 4160 For the sparse file support, the original draft was by Dean 4161 Hildebrand and Marc Eshel. Valuable input and advice were received 4162 from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and 4163 Richard Scheffenegger. 4165 For Labeled NFS, the original draft was by David Quigley, James 4166 Morris, Jarret Lu, and Tom Haynes. Peter Staubach, Trond Myklebust, 4167 Sorin Faibish, Nico Williams, and David Black also contributed in 4168 the final push to get this accepted. 4170 Appendix B. RFC Editor Notes 4172 [RFC Editor: please remove this section prior to publishing this 4173 document as an RFC] 4175 [RFC Editor: prior to publishing this document as an RFC, please 4176 replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the 4177 RFC number of this document] 4179 Author's Address 4181 Thomas Haynes 4182 NetApp 4183 9110 E 66th St 4184 Tulsa, OK 74133 4185 USA 4187 Phone: +1 918 307 1415 4188 Email: thomas@netapp.com 4189 URI: http://www.tulsalabs.com