idnits 2.17.1

draft-ietf-nfsv4-minorversion1-10.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update
     this to the boilerplate described in the IETF Trust License Policy
     document (see https://trustee.ietf.org/license-info), which is
     required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by
     RFC 4748 on line 21340.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on
     line 21351.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on
     line 21358.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on
     line 21364.

  Checking nits according to
  https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in
     the document.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match
     the current year.

  == Line 882 has weird spacing: '...privacy no ...'
  == Line 901 has weird spacing: '...privacy no ...'
  == Line 912 has weird spacing: '...privacy no ...'
  == Line 3056 has weird spacing: '...str_cis ser...'
  == Line 3358 has weird spacing: '...str_cis nii_...'
  == (24 more instances...)

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL',
     'SHOULD', or 'RECOMMENDED' is not an accepted usage according to
     RFC 2119.  Please use uppercase 'NOT' together with RFC 2119
     keywords (if that is what you mean).
     Found 'MUST not' in this paragraph:

     If the server determines that the client holds no associated state
     for its client ID (including sessions, opens, locks, delegations,
     layouts, and wants), the server may choose to unilaterally release
     the client ID.  The server may make this choice for an inactive
     client so that resources are not consumed by those intermittently
     active clients.  If the client contacts the server after this
     release, the server must ensure the client receives the appropriate
     error so that it will use the EXCHANGE_ID/CREATE_SESSION sequence
     to establish a new identity.  It should be clear that the server
     must be very hesitant to release a client ID since the resulting
     work on the client to recover from such an event will be the same
     burden as if the server had failed and restarted.  Typically a
     server would not release a client ID unless there had been no
     activity from that client for many minutes.  As long as there are
     sessions, opens, locks, delegations, layouts, or wants, the server
     MUST not release the client ID.  See Section 2.10.8.1.4 for
     discussion on releasing inactive sessions.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL',
     'SHOULD', or 'RECOMMENDED' is not an accepted usage according to
     RFC 2119.  Please use uppercase 'NOT' together with RFC 2119
     keywords (if that is what you mean).

     Found 'MUST not' in this paragraph:

     When the server gets a EXCHANGE_ID for a client owner that
     currently has state, or an unexpired lease, and the principal that
     issues the EXCHANGE_ID is different than principal the previously
     established the client owner, the server MUST not destroy the any
     state that currently exists for client owner.  Regardless, the
     server has two choices.  First, it can return NFS4ERR_CLID_INUSE.
     Second, it can allow the EXCHANGE_ID, and simply treat the client
     owner as consisting of both the co_ownerid and the principal that
     issued the EXCHANGE_ID.
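The client ID release rule quoted in the first paragraph above can be sketched as a server-side check. This is a minimal illustration in Python, not part of the draft; `ClientRecord`, `may_release_client_id`, and the 600-second idle threshold ("many minutes") are all hypothetical names and values chosen for the example:

```python
from dataclasses import dataclass, field

@dataclass
class ClientRecord:
    # Hypothetical per-client-ID bookkeeping; the fields mirror the
    # kinds of state the draft lists: sessions, opens, locks,
    # delegations, layouts, and wants.
    sessions: list = field(default_factory=list)
    opens: list = field(default_factory=list)
    locks: list = field(default_factory=list)
    delegations: list = field(default_factory=list)
    layouts: list = field(default_factory=list)
    wants: list = field(default_factory=list)
    idle_seconds: float = 0.0

IDLE_THRESHOLD = 600.0  # "many minutes" -- an assumed value, not from the draft

def may_release_client_id(c: ClientRecord) -> bool:
    # The MUST NOT: while any associated state exists, never release.
    if any([c.sessions, c.opens, c.locks,
            c.delegations, c.layouts, c.wants]):
        return False
    # Even a stateless client ID is released only after long inactivity,
    # since recovery costs the client as much as a server restart.
    return c.idle_seconds >= IDLE_THRESHOLD
```

A client released this way would, on its next contact, receive an error directing it back through EXCHANGE_ID/CREATE_SESSION.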
  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL',
     'SHOULD', or 'RECOMMENDED' is not an accepted usage according to
     RFC 2119.  Please use uppercase 'NOT' together with RFC 2119
     keywords (if that is what you mean).

     Found 'MUST not' in this paragraph:

     The NFSv4.1 server MUST not return NFS4ERR_WRONGSEC to any
     operation other than a put filehandle operation, LOOKUP, LOOKUPP,
     and OPEN (by component name).

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL',
     'SHOULD', or 'RECOMMENDED' is not an accepted usage according to
     RFC 2119.  Please use uppercase 'NOT' together with RFC 2119
     keywords (if that is what you mean).

     Found 'SHOULD not' in this paragraph:

     Note that for two such file systems, any information within the
     fs_locations_info attribute that indicates the need for special
     transition activity, i.e. the appearance of the two file system
     instances with different _handle_, _fileid_, _verifier_, _change_
     classes, MUST be ignored by the client.  The server SHOULD not
     indicate that these instances belong to different _handle_,
     _fileid_, _verifier_, _change_ classes, whether the two instances
     are shown belonging to the same _simultaneous-use_ class or not.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL',
     'SHOULD', or 'RECOMMENDED' is not an accepted usage according to
     RFC 2119.  Please use uppercase 'NOT' together with RFC 2119
     keywords (if that is what you mean).

     Found 'MUST not' in this paragraph:

     The set of fls_info data is subject to expansion in a future minor
     version, or in a standard-track RFC, within the context of a
     single minor version.  The server SHOULD NOT send and the client
     MUST not use indices within the fls_info array that are not
     defined in standards-track RFC's.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL',
     'SHOULD', or 'RECOMMENDED' is not an accepted usage according to
     RFC 2119.  Please use uppercase 'NOT' together with RFC 2119
     keywords (if that is what you mean).
     Found 'MUST not' in this paragraph:

     o  OPEN4_RESULT_CONFIRM is deprecated and MUST not be returned by
        an NFSv4.1 server.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL',
     'SHOULD', or 'RECOMMENDED' is not an accepted usage according to
     RFC 2119.  Please use uppercase 'NOT' together with RFC 2119
     keywords (if that is what you mean).

     Found 'SHOULD not' in this paragraph:

     If the clora_changed field is TRUE, then the client SHOULD not
     write and commit its modified data to the storage devices
     specified by the layout being recalled.  Instead, it is preferable
     for the client to write and commit the modified data through the
     metadata server.  Alternatively, the client may attempt to obtain
     a new layout.  Note: in order to obtain a new layout the client
     must first return the old layout.  Since obtaining a new layout is
     not guaranteed to succeed, the client must be ready to write and
     commit its modified data through the metadata server.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but
     may have content which was first submitted before 10 November
     2008.  If you have contacted all the original authors and they are
     all willing to grant the BCP78 rights to the IETF Trust, then this
     is fine, and you can ignore this comment.  If not, you may need to
     add the pre-RFC5378 disclaimer.  (See the Legal Provisions
     document at https://trustee.ietf.org/license-info for more
     information.)

  -- The document date (March 4, 2007) is 6253 days in the past.  Is
     this intentional?

  -- Found something which looks like a code comment -- if you have
     code sections in the document, please surround them with
     '<CODE BEGINS>' and '<CODE ENDS>' lines.
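The clora_changed rule quoted in the last 'SHOULD not' paragraph above can be illustrated with a minimal client-side sketch. This is Python pseudocode written for this report, not code from the draft; `RecalledLayoutClient`, `on_layoutrecall`, and the destination strings are hypothetical names:

```python
class RecalledLayoutClient:
    # Hypothetical client state for a single layout being recalled:
    # a set of dirty blocks plus a record of where each flush went.
    def __init__(self, dirty_blocks):
        self.dirty_blocks = list(dirty_blocks)
        self.flushed = []  # (destination, block) pairs, in flush order

    def _flush(self, destination):
        for block in self.dirty_blocks:
            self.flushed.append((destination, block))
        self.dirty_blocks = []

    def on_layoutrecall(self, clora_changed):
        if clora_changed:
            # Layout contents have changed: the client SHOULD NOT write
            # through the recalled storage devices, so modified data
            # goes through the metadata server instead.
            self._flush("metadata-server")
        else:
            # Layout is still usable for I/O: data may be written and
            # committed to the storage devices before the layout is
            # returned.
            self._flush("storage-devices")
```

The draft's alternative, obtaining a new layout after returning the old one, is not shown; since that can fail, a real client still needs the metadata-server path above as a fallback.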
  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative
     references to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'RPCRDMA' on line 2404
  -- Looks like a reference, but probably isn't: 'NFSDDP' on line 2348
  -- Looks like a reference, but probably isn't: 'RDDP' on line 2453
  -- Looks like a reference, but probably isn't: 'XNFS' on line 6002
  -- Looks like a reference, but probably isn't: 'Floyd' on line 6641

  == Missing Reference: '0' is mentioned on line 12009, but not defined

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  ** Obsolete normative reference: RFC 3530 (ref. '2') (Obsoleted by
     RFC 7530)
  ** Obsolete normative reference: RFC 1831 (ref. '4') (Obsoleted by
     RFC 5531)
  ** Obsolete normative reference: RFC 1884 (ref. '9') (Obsoleted by
     RFC 2373)

  -- Possible downref: Non-RFC (?) normative reference: ref. '10'

  ** Obsolete normative reference: RFC 3454 (ref. '12') (Obsoleted by
     RFC 7564)
  ** Obsolete normative reference: RFC 3491 (ref. '13') (Obsoleted by
     RFC 5891)
  ** Downref: Normative reference to an Informational RFC: RFC 2104
     (ref. '14')
  ** Obsolete normative reference: RFC 2434 (ref. '16') (Obsoleted by
     RFC 5226)

  == Outdated reference: A later version (-02) exists of
     draft-zelenka-pnfs-obj-01

  -- Obsolete informational reference (is this intentional?): RFC 3720
     (ref. '29') (Obsoleted by RFC 7143)

     Summary: 8 errors (**), 0 flaws (~~), 17 warnings (==),
     16 comments (--).

     Run idnits with the --verbose option for more detailed information
     about the items above.

--------------------------------------------------------------------------------

NFSv4                                                         S. Shepler
Internet-Draft                                                 M. Eisler
Intended status: Standards Track                               D. Noveck
Expires: September 5, 2007                                       Editors
                                                           March 4, 2007

                         NFSv4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-10.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 5, 2007.

Copyright Notice

   Copyright (C) The IETF Trust (2007).

Abstract

   This Internet-Draft describes NFSv4 minor version one, including
   features retained from the base protocol and protocol extensions
   made subsequently.  The current draft includes descriptions of the
   major extensions: Sessions, Directory Delegations, and parallel NFS
   (pNFS).  This Internet-Draft is an active work item of the NFSv4
   working group.  Active and resolved issues may be found in the issue
   tracker at: http://www.nfsv4-editor.org/cgi-bin/roundup/nfsv4.  New
   issues related to this document should be raised with the NFSv4
   Working Group at nfsv4@ietf.org and logged in the issue tracker.
Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Introduction
       1.1.  The NFSv4.1 Protocol
       1.2.  NFS Version 4 Goals
       1.3.  Minor Version 1 Goals
       1.4.  Overview of NFS version 4.1 Features
             1.4.1.  RPC and Security
             1.4.2.  Protocol Structure
             1.4.3.  File System Model
             1.4.4.  Locking Facilities
       1.5.  General Definitions
       1.6.  Differences from NFSv4.0
   2.  Core Infrastructure
       2.1.  Introduction
       2.2.  RPC and XDR
             2.2.1.  RPC-based Security
       2.3.  COMPOUND and CB_COMPOUND
       2.4.  Client Identifiers and Client Owners
             2.4.1.  Server Release of Client ID
             2.4.2.  Handling Client Owner Conflicts
       2.5.  Server Owners
       2.6.  Security Service Negotiation
             2.6.1.  NFSv4 Security Tuples
             2.6.2.  SECINFO and SECINFO_NO_NAME
             2.6.3.  Security Error
       2.7.  Minor Versioning
       2.8.  Non-RPC-based Security Services
             2.8.1.  Authorization
             2.8.2.  Auditing
             2.8.3.  Intrusion Detection
       2.9.  Transport Layers
             2.9.1.  Required and Recommended Properties of Transports
             2.9.2.  Client and Server Transport Behavior
             2.9.3.  Ports
       2.10. Session
             2.10.1. Motivation and Overview
             2.10.2. NFSv4 Integration
             2.10.3. Channels
             2.10.4. Exactly Once Semantics
             2.10.5. RDMA Considerations
             2.10.6. Sessions Security
             2.10.7. Session Mechanics - Steady State
             2.10.8. Session Mechanics - Recovery
             2.10.9. Parallel NFS and Sessions
   3.  Protocol Data Types
       3.1.  Basic Data Types
       3.2.  Structured Data Types
   4.  Filehandles
       4.1.  Obtaining the First Filehandle
             4.1.1.  Root Filehandle
             4.1.2.  Public Filehandle
       4.2.  Filehandle Types
             4.2.1.  General Properties of a Filehandle
             4.2.2.  Persistent Filehandle
             4.2.3.  Volatile Filehandle
       4.3.  One Method of Constructing a Volatile Filehandle
       4.4.  Client Recovery from Filehandle Expiration
   5.  File Attributes
       5.1.  Mandatory Attributes
       5.2.  Recommended Attributes
       5.3.  Named Attributes
       5.4.  Classification of Attributes
       5.5.  Mandatory Attributes - Definitions
       5.6.  Recommended Attributes - Definitions
       5.7.  Time Access
       5.8.  Interpreting owner and owner_group
       5.9.  Character Case Attributes
       5.10. Quota Attributes
       5.11. mounted_on_fileid
       5.12. Directory Notification Attributes
             5.12.1. dir_notif_delay
             5.12.2. dirent_notif_delay
       5.13. PNFS Attributes
             5.13.1. fs_layout_type
             5.13.2. layout_alignment
             5.13.3. layout_blksize
             5.13.4. layout_hint
             5.13.5. layout_type
             5.13.6. mdsthreshold
       5.14. Retention Attributes
   6.  Access Control Lists
       6.1.  Goals
       6.2.  File Attributes Discussion
             6.2.1.  ACL Attribute
             6.2.2.  dacl and sacl Attributes
             6.2.3.  mode Attribute
             6.2.4.  mode_set_masked Attribute
       6.3.  Common Methods
             6.3.1.  Interpreting an ACL
             6.3.2.  Computing a Mode Attribute from an ACL
       6.4.  Requirements
             6.4.1.  Setting the mode and/or ACL Attributes
             6.4.2.  Retrieving the mode and/or ACL Attributes
             6.4.3.  Creating New Objects
   7.  Single-server Name Space
       7.1.  Server Exports
       7.2.  Browsing Exports
       7.3.  Server Pseudo File System
       7.4.  Multiple Roots
       7.5.  Filehandle Volatility
       7.6.  Exported Root
       7.7.  Mount Point Crossing
       7.8.  Security Policy and Name Space Presentation
   8.  File Locking and Share Reservations
       8.1.  Locking
             8.1.1.  Client and Session ID
             8.1.2.  State-owner Definition
             8.1.3.  Stateid Definition
             8.1.4.  Use of the Stateid and Locking
       8.2.  Lock Ranges
       8.3.  Upgrading and Downgrading Locks
       8.4.  Blocking Locks
       8.5.  Lease Renewal
       8.6.  Crash Recovery
             8.6.1.  Client Failure and Recovery
             8.6.2.  Server Failure and Recovery
             8.6.3.  Network Partitions and Recovery
       8.7.  Server Revocation of Locks
       8.8.  Share Reservations
       8.9.  OPEN/CLOSE Operations
       8.10. Open Upgrade and Downgrade
       8.11. Short and Long Leases
       8.12. Clocks, Propagation Delay, and Calculating Lease Expiration
       8.13. Vestigial Locking Infrastructure From V4.0
   9.  Client-Side Caching
       9.1.  Performance Challenges for Client-Side Caching
       9.2.  Delegation and Callbacks
             9.2.1.  Delegation Recovery
       9.3.  Data Caching
             9.3.1.  Data Caching and OPENs
             9.3.2.  Data Caching and File Locking
             9.3.3.  Data Caching and Mandatory File Locking
             9.3.4.  Data Caching and File Identity
       9.4.  Open Delegation
             9.4.1.  Open Delegation and Data Caching
             9.4.2.  Open Delegation and File Locks
             9.4.3.  Handling of CB_GETATTR
             9.4.4.  Recall of Open Delegation
             9.4.5.  Clients that Fail to Honor Delegation Recalls
             9.4.6.  Delegation Revocation
       9.5.  Data Caching and Revocation
             9.5.1.  Revocation Recovery for Write Open Delegation
       9.6.  Attribute Caching
       9.7.  Data and Metadata Caching and Memory Mapped Files
       9.8.  Name Caching
       9.9.  Directory Caching
   10. Multi-Server Name Space
       10.1. Location attributes
       10.2. File System Presence or Absence
       10.3. Getting Attributes for an Absent File System
             10.3.1. GETATTR Within an Absent File System
             10.3.2. READDIR and Absent File Systems
       10.4. Uses of Location Information
             10.4.1. File System Replication
             10.4.2. File System Migration
             10.4.3. Referrals
       10.5. Additional Client-side Considerations
       10.6. Effecting File System Transitions
             10.6.1. File System Transitions and Simultaneous Access
             10.6.2. Simultaneous Use and Transparent Transitions
             10.6.3. Filehandles and File System Transitions
             10.6.4. Fileid's and File System Transitions
             10.6.5. Fsids and File System Transitions
             10.6.6. The Change Attribute and File System Transitions
             10.6.7. Lock State and File System Transitions
             10.6.8. Write Verifiers and File System Transitions
       10.7. Effecting File System Referrals
             10.7.1. Referral Example (LOOKUP)
             10.7.2. Referral Example (READDIR)
       10.8. The Attribute fs_absent
       10.9. The Attribute fs_locations
       10.10. The Attribute fs_locations_info
             10.10.1. The fs_locations_server4 Structure
             10.10.2. The fs_locations_info4 Structure
             10.10.3. The fs_locations_item4 Structure
       10.11. The Attribute fs_status
   11. Directory Delegations
       11.1. Introduction to Directory Delegations
       11.2. Directory Delegation Design
       11.3. Attributes in Support of Directory Notifications
       11.4. Delegation Recall
       11.5. Directory Delegation Recovery
   12. Parallel NFS (pNFS)
       12.1. Introduction
       12.2. PNFS Definitions
             12.2.1. Metadata
             12.2.2. Metadata Server
             12.2.3. Client
             12.2.4. Storage Device
             12.2.5. Data Server
             12.2.6. Storage Protocol or Data Protocol
             12.2.7. Control Protocol
             12.2.8. Layout
             12.2.9. Layout Types
             12.2.10. Layout Iomode
             12.2.11. Layout Segment
             12.2.12. Device IDs
       12.3. PNFS Operations
       12.4. PNFS Attributes
       12.5. Layout Semantics
             12.5.1. Guarantees Provided by Layouts
             12.5.2. Getting a Layout
             12.5.3. Committing a Layout
             12.5.4. Recalling a Layout
             12.5.5. Metadata Server Write Propagation
       12.6. PNFS Mechanics
       12.7. Recovery
             12.7.1. Client Recovery
             12.7.2. Dealing with Lease Expiration on the Client
             12.7.3. Dealing with Loss of Layout State on the Metadata
                     Server
             12.7.4. Recovery from Metadata Server Restart
             12.7.5. Operations During Metadata Server Grace Period
             12.7.6. Storage Device Recovery
       12.8. Metadata and Storage Device Roles
       12.9. Security Considerations
   13. PNFS: NFSv4.1 File Layout Type
       13.1. Session Considerations
       13.2. File Layout Definitions
       13.3. File Layout Data Types
       13.4. Interpreting the File Layout
       13.5. Sparse and Dense Stripe Unit Packing
       13.6. Data Server Multipathing
       13.7. Operations Issued to NFSv4.1 Data Servers
       13.8. COMMIT Through Metadata Server
       13.9. Global Stateid Requirements
       13.10. The Layout Iomode
       13.11. Data Server State Propagation
             13.11.1. Lock State Propagation
             13.11.2. Open-mode Validation
             13.11.3. File Attributes
       13.12. Data Server Component File Size
       13.13. Recovery Considerations
       13.14. Security Considerations for the File Layout Type
   14. Internationalization
       14.1. Stringprep profile for the utf8str_cs type
       14.2. Stringprep profile for the utf8str_cis type
       14.3. Stringprep profile for the utf8str_mixed type
       14.4. UTF-8 Related Errors
   15. Error Values
       15.1. Error Definitions
       15.2. Operations and their valid errors
       15.3. Callback operations and their valid errors
       15.4. Errors and the operations that use them
   16. NFS version 4.1 Procedures
       16.1. Procedure 0: NULL - No Operation
       16.2. Procedure 1: COMPOUND - Compound Operations
   17. NFS version 4.1 Operations
       17.1.  Operation 3: ACCESS - Check Access Rights
       17.2.  Operation 4: CLOSE - Close File
       17.3.  Operation 5: COMMIT - Commit Cached Data
       17.4.  Operation 6: CREATE - Create a Non-Regular File Object
       17.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting
              Recovery
       17.6.  Operation 8: DELEGRETURN - Return Delegation
       17.7.  Operation 9: GETATTR - Get Attributes
       17.8.  Operation 10: GETFH - Get Current Filehandle
       17.9.  Operation 11: LINK - Create Link to a File
       17.10. Operation 12: LOCK - Create Lock
       17.11. Operation 13: LOCKT - Test For Lock
       17.12. Operation 14: LOCKU - Unlock File
       17.13. Operation 15: LOOKUP - Lookup Filename
       17.14. Operation 16: LOOKUPP - Lookup Parent Directory
       17.15. Operation 17: NVERIFY - Verify Difference in Attributes
       17.16. Operation 18: OPEN - Open a Regular File
       17.17. Operation 19: OPENATTR - Open Named Attribute Directory
       17.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
       17.19. Operation 22: PUTFH - Set Current Filehandle
       17.20. Operation 23: PUTPUBFH - Set Public Filehandle
       17.21. Operation 24: PUTROOTFH - Set Root Filehandle
       17.22. Operation 25: READ - Read from File
       17.23. Operation 26: READDIR - Read Directory
       17.24. Operation 27: READLINK - Read Symbolic Link
       17.25. Operation 28: REMOVE - Remove File System Object
       17.26. Operation 29: RENAME - Rename Directory Entry
       17.27. Operation 31: RESTOREFH - Restore Saved Filehandle
       17.28. Operation 32: SAVEFH - Save Current Filehandle
       17.29. Operation 33: SECINFO - Obtain Available Security
       17.30. Operation 34: SETATTR - Set Attributes
       17.31. Operation 37: VERIFY - Verify Same Attributes
       17.32. Operation 38: WRITE - Write to File
       17.33. Operation 40: BACKCHANNEL_CTL - Backchannel control
       17.34. Operation 41: BIND_CONN_TO_SESSION
       17.35. Operation 42: EXCHANGE_ID - Instantiate Client ID
       17.36. Operation 43: CREATE_SESSION - Create New Session and
              Confirm Client ID
       17.37. Operation 44: DESTROY_SESSION - Destroy existing session
       17.38. Operation 45: FREE_STATEID - Free stateid with no locks
       17.39. Operation 46: GET_DIR_DELEGATION - Get a directory
              delegation
       17.40. Operation 47: GETDEVICEINFO - Get Device Information
       17.41. Operation 48: GETDEVICELIST
       17.42. Operation 49: LAYOUTCOMMIT - Commit writes made using a
              layout
       17.43. Operation 50: LAYOUTGET - Get Layout Information
       17.44. Operation 51: LAYOUTRETURN - Release Layout Information
       17.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed
              Object
       17.46. Operation 53: SEQUENCE - Supply per-procedure sequencing
              and control
       17.47. Operation 54: SET_SSV
       17.48. Operation 55: TEST_STATEID - Test stateids for validity
       17.49. Operation 56: WANT_DELEGATION
       17.50. Operation 57: DESTROY_CLIENTID - Destroy existing client
              ID
       17.51. Operation 10044: ILLEGAL - Illegal operation
   18. NFS version 4.1 Callback Procedures
       18.1. Procedure 0: CB_NULL - No Operation
       18.2. Procedure 1: CB_COMPOUND - Compound Operations
   19. NFS version 4.1 Callback Operations
       19.1.  Operation 3: CB_GETATTR - Get Attributes
       19.2.  Operation 4: CB_RECALL - Recall an Open Delegation
       19.3.  Operation 5: CB_LAYOUTRECALL
       19.4.  Operation 6: CB_NOTIFY - Notify directory changes
       19.5.  Operation 7: CB_PUSH_DELEG
       19.6.  Operation 8: CB_RECALL_ANY - Keep any N delegations
       19.7.  Operation 9: CB_RECALLABLE_OBJ_AVAIL
       19.8.  Operation 10: CB_RECALL_SLOT - change flow control limits
       19.9.  Operation 11: CB_SEQUENCE - Supply callback channel
              sequencing and control
       19.10. Operation 12: CB_WANTS_CANCELLED
       19.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible lock
              availability
       19.12. Operation 10044: CB_ILLEGAL - Illegal Callback Operation
   20. Security Considerations
   21. IANA Considerations
       21.1. Defining new layout types
   22. References
       22.1. Normative References
       22.2. Informative References
   Appendix A.  Acknowledgments
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

1.1.  The NFSv4.1 Protocol

   The NFSv4.1 protocol is a minor version of the NFSv4 protocol
   described in [2].
It generally follows the guidelines for the
   minor versioning model laid out in Section 10 of RFC 3530.
   However, it diverges from guideline 11 ("a client and server that
   supports minor version X must support minor versions 0 through
   X-1") and guideline 12 ("no features may be introduced as mandatory
   in a minor version").  These divergences are due to the
   introduction of the sessions model for managing non-idempotent
   operations and the RECLAIM_COMPLETE operation.  These two new
   features are infrastructural in nature and simplify implementation
   of existing and other new features.  Making them optional would add
   undue complexity to protocol definition and implementation.
   NFSv4.1 accordingly updates the Minor Versioning guidelines
   (Section 2.7).

   NFSv4.1, as a minor version, is consistent with the overall goals
   for NFS Version 4, but extends the protocol so as to better meet
   those goals, based on experiences with NFSv4.0.  In addition,
   NFSv4.1 has adopted some additional goals, which motivate some of
   the major extensions in minor version 1.

1.2.  NFS Version 4 Goals

   The NFS version 4 protocol is a further revision of the NFS
   protocol already defined by versions 2 [17] and 3 [18].  It retains
   the essential characteristics of previous versions: design for easy
   recovery; independence of transport protocols, operating systems,
   and file systems; simplicity; and good performance.  The NFS
   version 4 revision has the following goals:

   o  Improved access and good performance on the Internet.

      The protocol is designed to transit firewalls easily, perform
      well where latency is high and bandwidth is low, and scale to
      very large numbers of clients per server.

   o  Strong security with negotiation built into the protocol.

      The protocol builds on the work of the ONCRPC working group in
      supporting the RPCSEC_GSS protocol.
Additionally, the NFS
      version 4 protocol provides a mechanism that allows clients and
      servers to negotiate security, and requires clients and servers
      to support a minimal set of security schemes.

   o  Good cross-platform interoperability.

      The protocol features a file system model that provides a
      useful, common set of features that does not unduly favor one
      file system or operating system over another.

   o  Designed for protocol extensions.

      The protocol is designed to accept standard extensions within a
      framework that enables and encourages backward compatibility.

1.3.  Minor Version 1 Goals

   Minor version one has the following goals, within the framework
   established by the overall version 4 goals.

   o  To correct significant structural weaknesses and oversights
      discovered in the base protocol.

   o  To add clarity and specificity to areas left unaddressed or not
      addressed in sufficient detail in the base protocol.

   o  To add specific features based on experience with the existing
      protocol and recent industry developments.

   o  To provide protocol support to take advantage of clustered
      server deployments, including the ability to provide scalable
      parallel access to files distributed among multiple servers.

1.4.  Overview of NFS version 4.1 Features

   To provide a reasonable context for the reader, the major features
   of the NFS version 4.1 protocol will be reviewed in brief.  This
   will be done to provide an appropriate context for both the reader
   who is familiar with the previous versions of the NFS protocol and
   the reader who is new to the NFS protocols.  For the reader new to
   the NFS protocols, there is still a set of fundamental knowledge
   that is expected.  The reader should be familiar with the XDR and
   RPC protocols as described in [3] and [4].  A basic knowledge of
   file systems and distributed file systems is expected as well.
This description of version 4.1 features will not distinguish those
   added in minor version one from those present in the base protocol
   but will treat minor version 1 as a unified whole.  See Section 1.6
   for a description of the differences between the two minor
   versions.

1.4.1.  RPC and Security

   As with previous versions of NFS, the External Data Representation
   (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS
   version 4.1 protocol are those defined in [3] and [4].  To meet
   end-to-end security requirements, the RPCSEC_GSS framework [5] will
   be used to extend the basic RPC security.  With the use of
   RPCSEC_GSS, various mechanisms can be provided to offer
   authentication, integrity, and privacy to the NFS version 4
   protocol.  Kerberos V5 will be used as described in [6] to provide
   one security framework.  The LIPKEY and SPKM-3 GSS-API mechanisms
   described in [7] will be used to provide for the use of user
   passwords and client/server public key certificates by the NFS
   version 4 protocol.  With the use of RPCSEC_GSS, other mechanisms
   may also be specified and used for NFS version 4.1 security.

   To enable in-band security negotiation, the NFS version 4.1
   protocol has operations that provide the client a method of
   querying the server about its policies regarding which security
   mechanisms must be used for access to the server's file system
   resources.  With this, the client can securely match the security
   mechanism that meets the policies specified at both the client and
   server.

1.4.2.  Protocol Structure

1.4.2.1.  Core Protocol

   Unlike NFS Versions 2 and 3, which used a series of ancillary
   protocols (e.g., NLM, NSM, MOUNT), within all minor versions of NFS
   version 4 only a single RPC protocol is used to make requests of
   the server.  Facilities that had been separate protocols, such as
   locking, are now integrated within a single unified protocol.
1.4.2.2.  Parallel Access

   Minor version one supports high-performance data access to a
   clustered server implementation by enabling a separation of
   metadata access and data access, with the latter done to multiple
   servers in parallel.

   Such parallel data access is controlled by recallable objects known
   as "layouts", which are integrated into the protocol locking model.
   Clients direct requests for data access to a set of data servers
   specified by the layout via a data storage protocol which may be
   NFSv4.1 or may be another protocol.

1.4.3.  File System Model

   The general file system model used for the NFS version 4.1 protocol
   is the same as that of previous versions.  The server file system
   is hierarchical, with the regular files contained within being
   treated as opaque octet streams.  In a slight departure, file and
   directory names are encoded with UTF-8 to deal with the basics of
   internationalization.

   The NFS version 4.1 protocol does not require a separate protocol
   to provide for the initial mapping between path name and
   filehandle.  All file systems exported by a server are presented as
   a tree so that all file systems are reachable from a special
   per-server global root filehandle.  This allows LOOKUP operations
   to be used to perform functions previously provided by the MOUNT
   protocol.  The server provides any necessary pseudo file systems to
   bridge gaps that arise from unexported portions of the namespace
   lying between exported file systems.

1.4.3.1.  Filehandles

   As in previous versions of the NFS protocol, opaque filehandles are
   used to identify individual files and directories.  Lookup-type and
   create operations are used to go from file and directory names to
   the filehandle, which is then used to identify the object in
   subsequent operations.
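   The name-to-filehandle lookup model just described can be
   illustrated with a short, non-normative sketch.  The toy namespace,
   the filehandle strings, and the function names below are all
   invented for illustration; real filehandles are opaque octet
   strings generated by the server:

```python
# Non-normative sketch: resolving a path to a filehandle by chaining
# lookup-type operations, as a client might do with PUTROOTFH followed
# by one LOOKUP per path component.

# A toy server namespace: each directory filehandle maps component
# names to child filehandles.  All values here are invented.
NAMESPACE = {
    "fh-root":   {"export": "fh-export"},
    "fh-export": {"home": "fh-home"},
    "fh-home":   {"data.txt": "fh-data"},
}

def lookup(dir_fh, name):
    """One LOOKUP: map (directory filehandle, component name) to the
    child's filehandle, as the server would."""
    return NAMESPACE[dir_fh][name]

def resolve(path):
    """Walk from the root filehandle, one LOOKUP per path component."""
    fh = "fh-root"                      # PUTROOTFH sets the current filehandle
    for component in path.split("/"):
        fh = lookup(fh, component)      # LOOKUP replaces the current filehandle
    return fh

print(resolve("export/home/data.txt"))  # prints fh-data
```

   In the real protocol, the per-component LOOKUPs would typically be
   sent together in a single COMPOUND request rather than as separate
   round trips.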
The NFS version 4.1 protocol provides support for persistent
   filehandles, guaranteed to be valid for the lifetime of the file
   system object designated.  In addition, it provides support for
   servers to provide filehandles with more limited validity
   guarantees, called volatile filehandles.

1.4.3.2.  File Attributes

   The NFS version 4.1 protocol has a rich and extensible attribute
   structure.  Only a small set of the defined attributes are
   mandatory and must be provided by all server implementations.  The
   other attributes are known as "recommended" attributes.

   One significant recommended file attribute is the Access Control
   List (ACL) attribute.  This attribute provides for directory and
   file access control beyond the model used in NFS Versions 2 and 3.
   The ACL definition allows for specification of specific sets of
   permissions for individual users and groups.  In addition, ACL
   inheritance allows propagation of access permissions and
   restrictions down a directory tree as file system objects are
   created.

   One other type of attribute is the named attribute.  A named
   attribute is an opaque octet stream that is associated with a
   directory or file and referred to by a string name.  Named
   attributes are meant to be used by client applications as a method
   to associate application-specific data with a regular file or
   directory.

1.4.3.3.  Multi-server Namespace

   NFS Version 4.1 contains a number of features to allow
   implementation of namespaces that cross server boundaries and that
   allow and facilitate a non-disruptive transfer of support for
   individual file systems between servers.  They are all based upon
   attributes that allow one file system to specify alternate or new
   locations for that file system.

   These attributes may be used together with the concept of absent
   file systems, which provide specifications for additional locations
   but no actual file system content.
This allows a number of
   important facilities:

   o  Location attributes may be used with absent file systems to
      implement referrals whereby one server may direct the client to
      a file system provided by another server.  This allows extensive
      multi-server namespaces to be constructed.

   o  Location attributes may be provided for present file systems to
      provide the locations of alternate file system instances or
      replicas to be used in the event that the current file system
      instance becomes unavailable.

   o  Location attributes may be provided when a previously present
      file system becomes absent.  This allows non-disruptive
      migration of file systems to alternate servers.

1.4.4.  Locking Facilities

   As mentioned previously, NFSv4.1 is a single protocol that includes
   locking facilities.  These locking facilities include support for
   many types of locks, including several kinds of recallable locks.
   Recallable locks such as delegations allow the client to be assured
   that certain events will not occur so long as that lock is held.
   When circumstances change, the lock is recalled via a callback
   request.  The assurances provided by delegations allow more
   extensive caching to be done safely when circumstances allow it.
   The types of locks are:

   o  Share reservations as established by OPEN operations.

   o  Byte-range locks.

   o  File delegations, which are recallable locks that assure the
      holder that inconsistent opens and file changes cannot occur so
      long as the delegation is held.

   o  Directory delegations, which are recallable delegations that
      assure the holder that inconsistent directory modifications
      cannot occur so long as the delegation is held.
o  Layouts, which are recallable objects that assure the holder that
      direct access to the file data may be performed directly by the
      client and that no change to the data's location inconsistent
      with that access may be made so long as the layout is held.

   All locks for a given client are tied together under a single
   client-wide lease.  All requests made on sessions associated with
   the client renew that lease.  When leases are not promptly renewed,
   locks are subject to revocation.  In the event of server
   reinitialization, clients have the opportunity to safely reclaim
   their locks within a special grace period.

1.5.  General Definitions

   The following definitions are provided for the purpose of providing
   an appropriate context for the reader.

   Client  The "client" is the entity that accesses the NFS server's
      resources.  The client may be an application which contains the
      logic to access the NFS server directly.  The client may also be
      the traditional operating system client that provides remote
      file system services for a set of applications.

      A client is uniquely identified by a Client Owner.

      In the case of file locking, the client is the entity that
      maintains a set of locks on behalf of one or more applications.
      This client is responsible for crash or failure recovery for
      those locks it manages.

      Note that multiple clients may share the same transport and
      connection, and multiple clients may exist on the same network
      node.

   Client ID  A 64-bit quantity used as a unique, short-hand reference
      to a client-supplied verifier and client owner.  The server is
      responsible for supplying the client ID.

   Client Owner  The client owner is a unique string, opaque to the
      server, which identifies a client.  Multiple network connections
      and source network addresses originating those connections may
      share a client owner.
The server is expected to treat
      requests from connections with the same client owner as coming
      from the same client.

   Lease  An interval of time defined by the server for which the
      client is irrevocably granted a lock.  At the end of a lease
      period the lock may be revoked if the lease has not been
      extended.  The lock must be revoked if a conflicting lock has
      been granted after the lease interval.

      All leases granted by a server have the same fixed interval.
      Note that the fixed interval was chosen to alleviate the expense
      a server would have in maintaining state about variable length
      leases across server failures.

   Lock  The term "lock" is used to refer to any of record
      (octet-range) locks, share reservations, delegations, or layouts
      unless specifically stated otherwise.

   Server  The "Server" is the entity responsible for coordinating
      client access to a set of file systems.  A server can span
      multiple network addresses.  In NFSv4.1, a server is a
      two-tiered entity; this allows servers consisting of multiple
      components the flexibility to tightly or loosely couple their
      components without requiring tight synchronization among the
      components.  Every server has a "Server Owner", which reflects
      the two tiers of a server entity.

   Server Owner  The "Server Owner" identifies the server to the
      client.  The server owner consists of a major and minor
      identifier.  When the client has two connections, each to a peer
      with the same major and minor identifier, the client assumes
      both peers are the same server (the server namespace is the same
      via each connection), and further assumes session and lock state
      is sharable across both connections.  When each peer has the
      same major identifier but a different minor identifier, the
      client assumes both peers can serve the same namespace, but
      session and lock state is not sharable across both connections.
Stable Storage  NFS version 4 servers must be able to recover
      without data loss from multiple power failures (including
      cascading power failures, that is, several power failures in
      quick succession), operating system failures, and hardware
      failure of components other than the storage medium itself (for
      example, disk, nonvolatile RAM).

      Some examples of stable storage that are allowable for an NFS
      server include:

      1.  Media commit of data, that is, the modified data has been
          successfully written to the disk media, for example, the
          disk platter.

      2.  An immediate reply disk drive with battery-backed on-drive
          intermediate storage or uninterruptible power system (UPS).

      3.  Server commit of data with battery-backed intermediate
          storage and recovery software.

      4.  Cache commit with uninterruptible power system (UPS) and
          recovery software.

   Stateid  A 128-bit quantity returned by a server that uniquely
      defines the open and locking state provided by the server for a
      specific open or lock owner for a specific file and type of
      lock.

   Verifier  A 64-bit quantity generated by the client that the server
      can use to determine if the client has restarted and lost all
      previous lock state.

1.6.  Differences from NFSv4.0

   The following summarizes the differences between minor version one
   and the base protocol:

   o  Implementation of the sessions model.

   o  Support for parallel access to data.

   o  Addition of the RECLAIM_COMPLETE operation to better structure
      the lock reclamation process.

   o  Support for delegations on directories and other file types in
      addition to regular files.

   o  Operations to re-obtain a delegation.

   o  Support for client and server implementation IDs.

2.  Core Infrastructure

2.1.  Introduction

   NFS version 4.1 (NFSv4.1) relies on core infrastructure common to
   nearly every operation.
This core infrastructure is described in
   the remainder of this section.

2.2.  RPC and XDR

   The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call
   (RPC) application that uses RPC version 2 and the corresponding
   eXternal Data Representation (XDR) as defined in RFC1831 [4] and
   RFC4506 [3].

2.2.1.  RPC-based Security

   Previous NFS versions have been thought of as having a host-based
   authentication model, where the NFS server authenticates the NFS
   client and trusts the client to authenticate all users.  Actually,
   NFS has always depended on RPC for authentication.  The first form
   of RPC authentication required a host-based authentication
   approach.  NFSv4 also depends on RPC for basic security services,
   and mandates RPC support for a user-based authentication model.
   The user-based authentication model has user principals
   authenticated by a server, and, in turn, the server authenticated
   by user principals.  RPC provides some basic security services
   which are used by NFSv4.

2.2.1.1.  RPC Security Flavors

   As described in section 7.2 "Authentication" of [4], RPC security
   is encapsulated in the RPC header, via a security or authentication
   flavor, and information specific to the specification of the
   security flavor.  Every RPC header conveys information used to
   identify and authenticate a client and server.  As discussed in
   Section 2.2.1.1.1, some security flavors provide additional
   security services.

   NFSv4 clients and servers MUST implement RPCSEC_GSS.  (This
   requirement to implement is not a requirement to use.)  Other
   flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as
   well.

2.2.1.1.1.  RPCSEC_GSS and Security Services

   RPCSEC_GSS ([5]) uses the functionality of GSS-API RFC2743 [8].
   This allows for the use of various security mechanisms by the RPC
   layer without the additional implementation overhead of adding RPC
   security flavors.
2.2.1.1.1.1.  Identification, Authentication, Integrity, Privacy

   Via the GSS-API, RPCSEC_GSS can be used to identify and
   authenticate users on clients to servers, and servers to users.  It
   can also perform integrity checking on the entire RPC message,
   including the RPC header, and the arguments or results.  Finally,
   privacy, usually via encryption, is a service available with
   RPCSEC_GSS.  Privacy is performed on the arguments and results.
   Note that if privacy is selected, integrity, authentication, and
   identification are enabled.  If privacy is not selected, but
   integrity is selected, authentication and identification are
   enabled.  If integrity and privacy are not selected, but
   authentication is enabled, identification is enabled.  RPCSEC_GSS
   does not provide identification as a separate service.

   Although GSS-API has an authentication service distinct from its
   privacy and integrity services, GSS-API's authentication service is
   not used for RPCSEC_GSS's authentication service.  Instead, each
   RPC request and response header is integrity protected with the
   GSS-API integrity service, and this allows RPCSEC_GSS to offer
   per-RPC authentication and identity.  See [5] for more information.

   NFSv4 clients and servers MUST support RPCSEC_GSS's integrity and
   authentication services.  NFSv4.1 servers MUST support RPCSEC_GSS's
   privacy service.

2.2.1.1.1.2.  Security mechanisms for NFS version 4

   RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that
   provide security services.  Therefore, NFSv4 clients and servers
   MUST support three security mechanisms: Kerberos V5, SPKM-3, and
   LIPKEY.

   The use of RPCSEC_GSS requires selection of: mechanism, quality of
   protection (QOP), and service (authentication, integrity, privacy).
For the mandated security mechanisms, NFSv4 specifies that a QOP of
   zero (0) is used, leaving it up to the mechanism or the mechanism's
   configuration to use an appropriate level of protection that QOP
   zero maps to.  Each mandated mechanism specifies a minimum set of
   cryptographic algorithms for implementing integrity and privacy.
   NFSv4 clients and servers MUST be implemented on operating
   environments that comply with the mandatory cryptographic
   algorithms of each mandated mechanism.

2.2.1.1.1.2.1.  Kerberos V5

   The Kerberos V5 GSS-API mechanism as described in RFC1964 [6]
   ([[Comment.1: need new Kerberos RFC]]) MUST be implemented with the
   RPCSEC_GSS services as specified in the following table:

   column descriptions:
   1 == number of pseudo flavor
   2 == name of pseudo flavor
   3 == mechanism's OID
   4 == RPCSEC_GSS service
   5 == NFSv4.1 clients MUST support
   6 == NFSv4.1 servers MUST support

   1      2      3                    4                      5    6
   ------------------------------------------------------------------
   390003 krb5   1.2.840.113554.1.2.2 rpc_gss_svc_none       yes  yes
   390004 krb5i  1.2.840.113554.1.2.2 rpc_gss_svc_integrity  yes  yes
   390005 krb5p  1.2.840.113554.1.2.2 rpc_gss_svc_privacy    no   yes

   Note that the number and name of the pseudo flavor are presented
   here as a mapping aid to the implementor.  Because the NFSv4
   protocol includes a method to negotiate security and it understands
   the GSS-API mechanism, the pseudo flavor is not needed.  The pseudo
   flavor is needed for NFS version 3 since the security negotiation
   is done via the MOUNT protocol as described in [19].

2.2.1.1.1.2.2.  LIPKEY

   The LIPKEY GSS-API mechanism as described in [7] MUST be
   implemented with the RPCSEC_GSS services as specified in the
   following table:

   1      2         3              4                      5    6
   ------------------------------------------------------------------
   390006 lipkey    1.3.6.1.5.5.9  rpc_gss_svc_none       yes  yes
   390007 lipkey-i  1.3.6.1.5.5.9  rpc_gss_svc_integrity  yes  yes
   390008 lipkey-p  1.3.6.1.5.5.9  rpc_gss_svc_privacy    no   yes

2.2.1.1.1.2.3.  SPKM-3 as a security triple

   The SPKM-3 GSS-API mechanism as described in [7] MUST be
   implemented with the RPCSEC_GSS services as specified in the
   following table:

   1      2       3                4                      5    6
   ------------------------------------------------------------------
   390009 spkm3   1.3.6.1.5.5.1.3  rpc_gss_svc_none       yes  yes
   390010 spkm3i  1.3.6.1.5.5.1.3  rpc_gss_svc_integrity  yes  yes
   390011 spkm3p  1.3.6.1.5.5.1.3  rpc_gss_svc_privacy    no   yes

2.2.1.1.1.3.  GSS Server Principal

   Regardless of what security mechanism under RPCSEC_GSS is being
   used, the NFS server MUST identify itself in GSS-API via a
   GSS_C_NT_HOSTBASED_SERVICE name type.  GSS_C_NT_HOSTBASED_SERVICE
   names are of the form:

      service@hostname

   For NFS, the "service" element is

      nfs

   Implementations of security mechanisms will convert nfs@hostname to
   various forms.  For Kerberos V5, LIPKEY, and SPKM-3, the following
   form is RECOMMENDED:

      nfs/hostname

2.3.  COMPOUND and CB_COMPOUND

   A significant departure from the versions of the NFS protocol
   before version 4 is the introduction of the COMPOUND procedure.
   For the NFSv4 protocol, in all minor versions, there are exactly
   two RPC procedures, NULL and COMPOUND.  The COMPOUND procedure is
   defined as a series of individual operations, and these operations
   perform the sorts of functions performed by traditional NFS
   procedures.

   The operations combined within a COMPOUND request are evaluated in
   order by the server, without any atomicity guarantees.
A limited
   set of facilities exists to pass results from one operation to
   another.  Once an operation returns a failing result, the
   evaluation ends and the results of all evaluated operations are
   returned to the client.

   With the use of the COMPOUND procedure, the client is able to build
   simple or complex requests.  These COMPOUND requests allow for a
   reduction in the number of RPCs needed for logical file system
   operations.  For example, multi-component lookup requests can be
   constructed by combining multiple LOOKUP operations.  Those can be
   further combined with operations such as GETATTR, READDIR, or OPEN
   plus READ to do more complicated sets of operations without
   incurring additional latency.

   NFSv4 also contains a considerable set of callback operations in
   which the server makes an RPC directed at the client.  Callback
   RPCs have a similar structure to that of the normal server
   requests.  For the NFS version 4 protocol callbacks, in all minor
   versions, there are two RPC procedures, NULL and CB_COMPOUND.  The
   CB_COMPOUND procedure is defined in an analogous fashion to that of
   COMPOUND with its own set of callback operations.

   Addition of new server and callback operations within the COMPOUND
   and CB_COMPOUND request framework provides a means of extending the
   protocol in subsequent minor versions.

   Except for a small number of operations needed for session
   creation, server requests and callback requests are performed
   within the context of a session.  Sessions provide a client context
   for every request and support robust replay protection for
   non-idempotent requests.

2.4.  Client Identifiers and Client Owners

   For each operation that obtains or depends on locking state, the
   specific client must be determinable by the server.
In NFSv4,
   each distinct client instance is represented by a client ID, which
   is a 64-bit identifier that identifies a specific client at a given
   time and which is changed whenever the client or the server
   re-initializes.  Client IDs are used to support lock identification
   and crash recovery.

   In NFSv4.1, during steady state operation, the client ID associated
   with each operation is derived from the session (see Section 2.10)
   on which the operation is issued.  Each session is associated with
   a specific client ID at session creation, and that client ID then
   becomes the client ID associated with all requests issued using it.
   Therefore, unlike NFSv4.0, the only NFSv4.1 operations possible
   before a client ID is established are those directly connected with
   establishing the client ID.

   A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION
   operation using that client ID (eir_clientid as returned from
   EXCHANGE_ID) is required to establish the identification on the
   server.  Establishment of identification by a new incarnation of
   the client also has the effect of immediately releasing any locking
   state that a previous incarnation of that same client might have
   had on the server.  Such released state would include all lock,
   share reservation, and, where the server is not supporting the
   CLAIM_DELEGATE_PREV claim type, all delegation state associated
   with the same client with the same identity.  For discussion of
   delegation state recovery, see Section 9.2.1.

   Releasing such state requires that the server be able to determine
   that one client instance is the successor of another.  Where this
   cannot be done, for any of a number of reasons, the locking state
   will remain for a time subject to lease expiration (see
   Section 8.5), and the new client will need to wait for such state
   to be removed if it makes conflicting lock requests.
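   The EXCHANGE_ID/CREATE_SESSION sequence can be sketched as follows.
   This is a non-normative, heavily simplified illustration: the
   Server class, its bookkeeping, and the session ID format are
   invented, and the release of a previous incarnation's state is
   shown at EXCHANGE_ID time, ignoring the confirmation details of the
   real protocol:

```python
# Non-normative sketch of client ID establishment: an EXCHANGE_ID
# followed by a CREATE_SESSION using the returned client ID
# (eir_clientid).  All data structures here are invented; only the
# two-step sequence mirrors the protocol.
import itertools

class Server:
    def __init__(self):
        self._next_id = itertools.count(1)
        self.clients = {}   # co_ownerid -> (co_verifier, client ID)
        self.state = {}     # client ID -> leased locking state
        self.sessions = {}  # session ID -> client ID

    def exchange_id(self, co_verifier, co_ownerid):
        prev = self.clients.get(co_ownerid)
        if prev is not None and prev[0] != co_verifier:
            # Same co_ownerid, new verifier: a new incarnation of the
            # same client, so the old incarnation's leased state goes.
            self.state.pop(prev[1], None)
        clientid = next(self._next_id)
        self.clients[co_ownerid] = (co_verifier, clientid)
        return clientid                       # eir_clientid

    def create_session(self, clientid):
        sessionid = f"sess-{clientid}"        # invented format
        self.sessions[sessionid] = clientid   # every request on this
        return sessionid                      # session implies clientid

server = Server()
eir_clientid = server.exchange_id(co_verifier=0x1234, co_ownerid=b"client-A")
session = server.create_session(eir_clientid)
```

   Once the session exists, requests issued on it need not carry the
   client ID explicitly; the server derives it from the session, as
   described above.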
Client identification is encapsulated in the following Client Owner
   structure:

   struct client_owner4 {
           verifier4       co_verifier;
           opaque          co_ownerid<NFS4_OPAQUE_LIMIT>;
   };

   The first field, co_verifier, is a client incarnation verifier that
   is used to detect client reboots.  Only if the co_verifier is
   different from that which the server had previously recorded for
   the client (as identified by the second field of the structure,
   co_ownerid) does the server start the process of canceling the
   client's leased state.

   The second field, co_ownerid, is a variable-length string that
   uniquely defines the client so that subsequent instances of the
   same client bear the same co_ownerid with a different verifier.

   There are several considerations for how the client generates the
   co_ownerid string:

   o  The string should be unique so that multiple clients do not
      present the same string.  The consequences of two clients
      presenting the same string range from one client getting an
      error to one client having its leased state abruptly and
      unexpectedly canceled.

   o  The string should be selected so that subsequent incarnations
      (e.g., reboots) of the same client cause the client to present
      the same string.  The implementor is cautioned against an
      approach that requires the string to be recorded in a local file
      because this precludes the use of the implementation in an
      environment where there is no local disk and all file access is
      from an NFS version 4 server.

   o  The string should be the same for each server network address
      that the client accesses, rather than common to all server
      network addresses (note: the precise opposite was advised in
      RFC3530).  This way, if a server has multiple interfaces, the
      client can trunk traffic over multiple network paths as
      described in Section 2.10.3.4.1.
o  The algorithm for generating the string should not assume that the
   client's network address will not change, unless the client
   implementation knows it is using statically assigned network
   addresses.  This includes changes between client incarnations and
   even changes while the client is still running in its current
   incarnation.  This means that if the client includes just the
   client's network address in the co_ownerid string, there is a real
   risk that, with dynamic address assignment, after the client gives
   up the network address, another client using a similar algorithm
   for generating the co_ownerid string would generate a conflicting
   co_ownerid string.

Given the above considerations, an example of a well-generated
co_ownerid string is one that includes:

o  If applicable, the client's statically assigned network address.

o  Additional information that tends to be unique, such as one or
   more of:

   *  The client machine's serial number (for privacy reasons, it is
      best to perform some one-way function on the serial number).

   *  A MAC address (again, a one-way function should be performed).

   *  The timestamp of when the NFS version 4 software was first
      installed on the client (though this is subject to the
      previously mentioned caution about using information that is
      stored in a file, because the file might only be accessible
      over NFS version 4).

   *  A true random number.  However, since this number ought to be
      the same between client incarnations, this shares the same
      problem as using the timestamp of the software installation.

o  For a user-level NFS version 4 client, it should contain
   additional information to distinguish the client from other user-
   level clients running on the same host, such as a process
   identifier or other unique sequence.
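As a sketch of these considerations (illustrative only; the function
name and choice of inputs are this example's, not the
specification's), a client might derive its co_ownerid by applying a
one-way hash to stable, host-unique inputs:

```python
import hashlib

def make_co_ownerid(mac_address, install_timestamp, process_id=None):
    """Build a co_ownerid string from stable inputs.

    A one-way hash hides the raw MAC address (the privacy
    consideration noted above).  The same inputs yield the same
    string across client incarnations; process_id distinguishes
    user-level clients sharing one host.
    """
    h = hashlib.sha256()
    h.update(mac_address.encode("ascii"))
    h.update(str(install_timestamp).encode("ascii"))
    if process_id is not None:
        h.update(str(process_id).encode("ascii"))
    return h.hexdigest()
```

Note that this sketch inherits the caution above: if
install_timestamp lives only in a file reachable over NFS version 4,
it cannot be used to identify the client to that same server.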
As a security measure, the server MUST NOT cancel a client's leased
state if the principal that established the state for a given
co_ownerid string is not the same as the principal issuing the
EXCHANGE_ID.

A server may compare a client_owner4 in an EXCHANGE_ID with an
nfs_client_id4 established using SETCLIENTID using NFSv4 minor
version 0, so that an NFSv4.1 client is not forced to delay until
lease expiration for locking state established by the earlier client
using minor version 0.  This requires that the client_owner4 be
constructed the same way as the nfs_client_id4.  If the latter's
contents included the server's network address, and the NFSv4.1
client does not wish to use a client ID that prevents trunking, it
should issue two EXCHANGE_ID operations.  The first EXCHANGE_ID will
have a client_owner4 equal to the nfs_client_id4; this will clear the
state created by the NFSv4.0 client.  The second EXCHANGE_ID will not
have the server's network address.  The state created for the second
EXCHANGE_ID will not have to wait for lease expiration, because there
will be no state to expire.

Once an EXCHANGE_ID has been done, and the resulting client ID
established as associated with a session, all requests made on that
session implicitly identify that client ID, which in turn designates
the client specified using the long-form client_owner4 structure.
The shorthand client identifier (a client ID) is assigned by the
server (the eir_clientid result from EXCHANGE_ID) and should be
chosen so that it will not conflict with a client ID previously
assigned by the server.  This applies across server restarts or
reboots.

In the event of a server restart, a client may find out that its
current client ID is no longer valid when it receives an
NFS4ERR_STALE_CLIENTID error.
The precise circumstances depend on the characteristics of the
sessions involved, specifically whether the session is persistent
(see Section 2.10.4.5).

When a session is not persistent, the client will need to create a
new session.  When the existing client ID is presented to a server as
part of creating a session and that client ID is not recognized, as
would happen after a server reboot, the server will reject the
request with the error NFS4ERR_STALE_CLIENTID.  When this happens,
the client must obtain a new client ID by use of the EXCHANGE_ID
operation, then use that client ID as the basis of a new session, and
then proceed to any other necessary recovery for the server reboot
case (see Section 8.6.2).

In the case of the session being persistent, the client will re-
establish communication using the existing session after the reboot.
This session will be associated with a client ID that has had state
revoked (but the persistent session is never associated with a stale
client ID, because if the session is persistent, the client ID MUST
persist), and the client will receive an indication of that fact in
the sr_status_flags field returned by the SEQUENCE operation (see
Section 17.46.4).  The client can then use the existing session to do
whatever operations are necessary to determine the status of requests
outstanding at the time of reboot, while avoiding issuing new
requests, particularly any involving locking on that session.  Such
requests would fail with an NFS4ERR_STALE_STATEID error, if
attempted.

See the detailed descriptions of EXCHANGE_ID (Section 17.35) and
CREATE_SESSION (Section 17.36) for a complete specification of these
operations.

2.4.1.
Server Release of Client ID

NFSv4.1 introduces a new operation called DESTROY_CLIENTID
(Section 17.50), which the client SHOULD use to destroy a client ID
it no longer needs.  This permits graceful, bilateral release of a
client ID.

If the server determines that the client holds no associated state
for its client ID (including sessions, opens, locks, delegations,
layouts, and wants), the server may choose to unilaterally release
the client ID.  The server may make this choice for an inactive
client so that resources are not consumed by intermittently active
clients.  If the client contacts the server after this release, the
server must ensure that the client receives the appropriate error so
that it will use the EXCHANGE_ID/CREATE_SESSION sequence to establish
a new identity.  It should be clear that the server must be very
hesitant to release a client ID, since the resulting work on the
client to recover from such an event will be the same burden as if
the server had failed and restarted.  Typically, a server would not
release a client ID unless there had been no activity from that
client for many minutes.  As long as there are sessions, opens,
locks, delegations, layouts, or wants, the server MUST NOT release
the client ID.  See Section 2.10.8.1.4 for discussion on releasing
inactive sessions.

2.4.2.  Handling Client Owner Conflicts

If the co_ownerid string in an EXCHANGE_ID request is properly
constructed, and if the client takes care to use the same principal
for each successive use of EXCHANGE_ID, then, barring an active
denial of service attack, conflicts are not possible.
However, client bugs, server bugs, or perhaps a deliberate change of
the principal owner of the co_ownerid string (such as the case of a
client that changes security flavors, and under the new flavor, there
is no mapping to the previous owner) will in rare cases result in a
conflict.

When the server gets an EXCHANGE_ID for a client owner that currently
has no state, or that has state but an expired lease, the server MUST
allow the EXCHANGE_ID, and confirm the new client ID if it is
followed by the appropriate CREATE_SESSION.

When the server gets an EXCHANGE_ID for a client owner that currently
has state, or an unexpired lease, and the principal that issues the
EXCHANGE_ID is different from the principal that previously
established the client owner, the server MUST NOT destroy any state
that currently exists for the client owner.  Regardless, the server
has two choices.  First, it can return NFS4ERR_CLID_INUSE.  Second,
it can allow the EXCHANGE_ID, and simply treat the client owner as
consisting of both the co_ownerid and the principal that issued the
EXCHANGE_ID.

2.5.  Server Owners

The Server Owner is somewhat similar to a Client Owner (Section 2.4),
but unlike the Client Owner, there is no shorthand serverid.  The
Server Owner is defined in the following structure:

   struct server_owner4 {
           uint64_t        so_minor_id;
           opaque          so_major_id;
   };

The Server Owner is returned in the results of EXCHANGE_ID.  When the
so_major_id fields are the same in two EXCHANGE_ID results, the
connections over which each EXCHANGE_ID was sent can be assumed to
address the same server (as defined in Section 1.5).  If the
so_minor_id fields are also the same, then not only do both
connections connect to the same server, but the session and other
state can be shared across both connections.
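The comparison just described can be sketched as follows
(illustrative Python; the function name and return strings are this
sketch's, not the protocol's):

```python
def compare_server_owners(owner_a, owner_b):
    """Classify two EXCHANGE_ID results, each modeled as a
    (so_major_id, so_minor_id) pair.

    Same so_major_id: the two connections address the same server.
    Same so_minor_id as well: sessions and other state may be
    shared across both connections.
    """
    major_a, minor_a = owner_a
    major_b, minor_b = owner_b
    if major_a != major_b:
        return "different servers"
    if minor_a != minor_b:
        return "same server, state not shareable"
    return "same server, state shareable"
```

As the next paragraph cautions, a client should not act on this
classification without verifying the servers' claims.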
The reader is cautioned that multiple servers may deliberately or
accidentally claim to have the same so_major_id or so_major_id/
so_minor_id; the reader should examine Section 2.10.3.4.1 and
Section 17.35.

The considerations for generating an so_major_id are similar to those
for generating a co_ownerid string (see Section 2.4).  The
consequences of two servers generating conflicting so_major_id values
are less dire than they are for co_ownerid conflicts, because the
client can use RPCSEC_GSS to compare the authenticity of each server
(see Section 2.10.3.4.1).

2.6.  Security Service Negotiation

With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server.  The
NFS server may have multiple points within its file system namespace
that are available for use by NFS clients.  These points can be
considered security policy boundaries, and in some NFS
implementations are tied to NFS export points.  In turn, the NFS
server may be configured such that each of these security policy
boundaries may have different or multiple security mechanisms in use.

The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired.
See Section 20 for further discussion.

2.6.1.  NFSv4 Security Tuples

An NFS server can assign one or more "security tuples" to each
security policy boundary in its namespace.  Each security tuple
consists of a security flavor (see Section 2.2.1.1) and, if the
flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of
protection, and an RPCSEC_GSS service.

2.6.2.
SECINFO and SECINFO_NO_NAME

The SECINFO and SECINFO_NO_NAME operations allow the client to
determine, on a per-filehandle basis, what security tuple is to be
used for server access.  In general, the client will not have to use
either operation except during initial communication with the server
or when the client crosses security policy boundaries at the server.
However, it is possible that the server's policies change during the
client's interaction, thereby forcing the client to negotiate a new
security tuple.

Where the use of different security tuples would affect the type of
access that would be allowed if a request was issued over the same
connection used for the SECINFO or SECINFO_NO_NAME operation (e.g.
read-only vs. read-write access), security tuples that allow greater
access should be presented first.  Where the general level of access
is the same and different security flavors limit the range of
principals whose privileges are recognized (e.g. allowing or
disallowing root access), flavors supporting the greatest range of
principals should be listed first.

2.6.3.  Security Error

Based on the assumption that each NFS version 4 client and server
must support a minimum set of security (i.e., LIPKEY, SPKM-3, and
Kerberos-V5, all under RPCSEC_GSS), the NFS client will initiate file
access to the server with one of the minimal security tuples.  During
communication with the server, the client may receive an NFS error of
NFS4ERR_WRONGSEC.  This error allows the server to notify the client
that the security tuple currently being used contravenes the server's
security policy.  The client is then responsible for determining (see
Section 2.6.3.1) what security tuples are available at the server and
choosing one that is appropriate for the client.

2.6.3.1.
Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME

This section explains the mechanics of NFSv4.1 security negotiation.
The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH,
PUTFH, and RESTOREFH.

2.6.3.1.1.  Put Filehandle Operation + SAVEFH

The client is saving a filehandle for a future RESTOREFH.  The server
MUST NOT return NFS4ERR_WRONGSEC to either the put filehandle
operation or SAVEFH.

2.6.3.1.2.  Two or More Put Filehandle Operations

For a series of N put filehandle operations, the server MUST NOT
return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations.
The Nth put filehandle operation is handled as if it were the first
in a series of operations in which the second operation is not a put
filehandle operation.  For example, if the server receives PUTFH,
PUTROOTFH, LOOKUP, then the PUTFH is ignored for NFS4ERR_WRONGSEC
purposes, and the PUTROOTFH, LOOKUP subseries is processed according
to Section 2.6.3.1.3.

2.6.3.1.3.  Put Filehandle Operation + LOOKUP (or OPEN by Name)

This situation also applies to a put filehandle operation followed by
a LOOKUP or an OPEN operation that specifies a component name.

In this situation, the client is potentially crossing a security
policy boundary, and the set of security tuples the parent directory
supports may differ from those of the child.  The server
implementation may decide whether to impose any restrictions on
security policy administration.  There are at least three approaches
(sec_policy_child is the tuple set of the child export,
sec_policy_parent is that of the parent):

a)  sec_policy_child <= sec_policy_parent (<= for subset).  This
    means that the set of security tuples specified in the security
    policy of a child directory is always a subset of that of its
    parent directory.
b)  sec_policy_child ^ sec_policy_parent != {} (^ for intersection,
    {} for the empty set).  This means that the set of security
    tuples specified in the security policy of a child directory
    always has a non-empty intersection with that of the parent.

c)  sec_policy_child ^ sec_policy_parent == {}.  This means that the
    set of tuples specified in the security policy of a child
    directory may not intersect with that of the parent.  In other
    words, there are no restrictions on how the system administrator
    may set up these tuples.

For a server to support approach (b) (when the client chooses a
flavor that is not a member of sec_policy_parent) and approach (c),
the put filehandle operation MUST NOT return NFS4ERR_WRONGSEC in case
of a security mismatch.  Instead, the error should be returned from
the LOOKUP (or OPEN by component name) that follows.

Since the above guideline does not contradict approach (a), it should
be followed in general.  Even if approach (a) is implemented, it is
possible for the security tuple used to be acceptable for the target
of LOOKUP but not for the filehandle used in the put filehandle
operation.  The put filehandle operation could be a PUTROOTFH or
PUTPUBFH, where the client cannot know the security tuples for the
root or public filehandle.  Or the security policy for the filehandle
used by the put filehandle operation could have changed since the
time the filehandle was obtained.

Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in
response to the put filehandle operation if the operation is
immediately followed by a LOOKUP or an OPEN by component name.

2.6.3.1.4.  Put Filehandle Operation + LOOKUPP

Since SECINFO only works its way down, there is no way LOOKUPP can
return NFS4ERR_WRONGSEC without SECINFO_NO_NAME.
SECINFO_NO_NAME solves this issue because, via the style
SECINFO_STYLE4_PARENT, it works in the opposite direction from
SECINFO.  As with Section 2.6.3.1.3, the put filehandle operation
MUST NOT return NFS4ERR_WRONGSEC whenever it is followed by LOOKUPP.
If the server does not support SECINFO_NO_NAME, the client's only
recourse is to issue the put filehandle operation, LOOKUPP, GETFH
sequence of operations with every security tuple it supports.

Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server
MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle
operation if the operation is immediately followed by a LOOKUPP.

2.6.3.1.5.  Put Filehandle Operation + SECINFO/SECINFO_NO_NAME

A security-sensitive client is allowed to choose a strong security
tuple when querying a server to determine a file object's permitted
security tuples.  The security tuple chosen by the client does not
have to be included in the tuple list of the security policy of
either the parent directory indicated in the put filehandle
operation, or the child file object indicated in SECINFO (or any
parent directory indicated in SECINFO_NO_NAME).  Of course, the
server has to be configured for whatever security tuple the client
selects; otherwise the request will fail at the RPC layer with an
appropriate authentication error.

In theory, there is no connection between the security flavor used by
SECINFO or SECINFO_NO_NAME and those supported by the security
policy.  But in practice, the client may start looking for strong
flavors from those supported by the security policy, followed by
those in the mandatory set.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put
filehandle operation whenever it is immediately followed by SECINFO
or SECINFO_NO_NAME.  The NFSv4.1 server MUST NOT return
NFS4ERR_WRONGSEC from SECINFO or SECINFO_NO_NAME.
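The per-case rules of Sections 2.6.3.1.1 through 2.6.3.1.5, together
with the two cases that follow, reduce to one question: what
operation immediately follows the put filehandle operation?  This can
be sketched as a toy decision function (illustrative Python;
operation names are strings of this model, and OPEN by name vs. OPEN
by filehandle are distinguished explicitly, which the wire protocol
does not do by operation code):

```python
PUT_FH_OPS = {"PUTROOTFH", "PUTPUBFH", "PUTFH", "RESTOREFH"}

# Followers for which the put filehandle operation MUST NOT return
# NFS4ERR_WRONGSEC; the error, if any, comes later (or not at all).
_DEFERRING_FOLLOWERS = {"SAVEFH", "LOOKUP", "LOOKUPP", "OPEN_BY_NAME",
                        "SECINFO", "SECINFO_NO_NAME"} | PUT_FH_OPS

def putfh_may_return_wrongsec(compound_ops, i):
    """May the put filehandle operation at index i of this toy
    COMPOUND return NFS4ERR_WRONGSEC on a security tuple mismatch?"""
    assert compound_ops[i] in PUT_FH_OPS
    follower = compound_ops[i + 1] if i + 1 < len(compound_ops) else None
    if follower is None or follower in _DEFERRING_FOLLOWERS:
        return False  # includes "put filehandle operation + nothing"
    return True       # "anything else", e.g. OPEN by filehandle, READ
```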
2.6.3.1.6.  Put Filehandle Operation + Nothing

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.

2.6.3.1.7.  Put Filehandle Operation + Anything Else

"Anything Else" includes OPEN by filehandle.

The security policy enforcement applies to the filehandle specified
in the put filehandle operation.  Therefore, PUTFH MUST return
NFS4ERR_WRONGSEC in the case of a security tuple mismatch.  This
avoids the complexity of adding NFS4ERR_WRONGSEC as an allowable
error to every other operation.

A COMPOUND containing the series put filehandle operation +
SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way
for the client to recover from NFS4ERR_WRONGSEC.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation
other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by
component name).

2.7.  Minor Versioning

To address the requirement of an NFS protocol that can evolve as the
need arises, the NFS version 4 protocol contains the rules and
framework to allow for future minor changes or versioning.

The base assumption with respect to minor versioning is that any
future accepted minor version must follow the IETF process and be
documented in a standards track RFC.  Therefore, each minor version
number will correspond to an RFC.  Minor version zero of the NFS
version 4 protocol is represented by [2], and minor version one is
represented by this document [[Comment.2: change "document" to "RFC"
when we publish]].  The COMPOUND and CB_COMPOUND procedures support
the encoding of the minor version being requested by the client.

The following items represent the basic rules for the development of
minor versions.  Note that a future minor version may decide to
modify or add to the following rules as part of the minor version
definition.

1.
Procedures are not added or deleted.

    To maintain the general RPC model, NFS version 4 minor versions
    will not add to or delete procedures from the NFS program.

2.  Minor versions may add operations to the COMPOUND and CB_COMPOUND
    procedures.

    The addition of operations to the COMPOUND and CB_COMPOUND
    procedures does not affect the RPC model.

    *  Minor versions may append attributes to GETATTR4args, bitmap4,
       and GETATTR4res.

       This allows for the expansion of the attribute model to allow
       for future growth or adaptation.

    *  Minor version X must append any new attributes after the last
       documented attribute.

       Since attribute results are specified as an opaque array of
       per-attribute XDR-encoded results, the complexity of adding
       new attributes in the midst of the current definitions would
       be too burdensome.

3.  Minor versions must not modify the structure of an existing
    operation's arguments or results.

    Again, the complexity of handling multiple structure definitions
    for a single operation is too burdensome.  New operations should
    be added instead of modifying existing structures for a minor
    version.

    This rule does not preclude the following adaptations in a minor
    version:

    *  adding bits to flag fields, such as new attributes to
       GETATTR's bitmap4 data type

    *  adding bits to existing attributes like ACLs that have flag
       words

    *  extending enumerated types (including NFS4ERR_*) with new
       values

4.  Minor versions may not modify the structure of existing
    attributes.

5.  Minor versions may not delete operations.

    This prevents the potential reuse of a particular operation
    "slot" in a future minor version.

6.  Minor versions may not delete attributes.

7.  Minor versions may not delete flag bits or enumeration values.

8.  Minor versions may declare an operation as mandatory to NOT
    implement.
    Specifying an operation as "mandatory to not implement" is
    equivalent to obsoleting an operation.  For the client, it means
    that the operation should not be sent to the server.  For the
    server, an NFS error can be returned as opposed to "dropping" the
    request as an XDR decode error.  This approach allows for the
    obsolescence of an operation while maintaining its structure so
    that a future minor version can reintroduce the operation.

    *  Minor versions may declare attributes mandatory to NOT
       implement.

    *  Minor versions may declare flag bits or enumeration values as
       mandatory to NOT implement.

9.  Minor versions may downgrade features from mandatory to
    recommended, or recommended to optional.

10. Minor versions may upgrade features from optional to recommended,
    or recommended to mandatory.

11. A client and server that support minor version X should support
    minor versions 0 (zero) through X-1 as well.

12. Except for infrastructural changes, no new features may be
    introduced as mandatory in a minor version.

    This rule allows for the introduction of new functionality and
    forces the use of implementation experience before designating a
    feature as mandatory.  On the other hand, some classes of
    features are infrastructural and have broad effects.  Allowing
    such features to not be mandatory complicates implementation of
    the minor version.

13. A client MUST NOT attempt to use a stateid, filehandle, or
    similar returned object from the COMPOUND procedure with minor
    version X for another COMPOUND procedure with minor version Y,
    where X != Y.

2.8.  Non-RPC-based Security Services

As described in Section 2.2.1.1.1.1, NFSv4 relies on RPC for
identification, authentication, integrity, and privacy.  NFSv4 itself
provides additional security services as described in the next
several subsections.

2.8.1.
Authorization

Authorization to access a file object via an NFSv4 operation is
ultimately determined by the NFSv4 server.  A client can predetermine
its access to a file object via the OPEN (Section 17.16) and the
ACCESS (Section 17.1) operations.

Principals with appropriate access rights can modify the
authorization on a file object via the SETATTR (Section 17.30)
operation.  Four attributes that affect access rights are: mode,
owner, owner_group, and acl.  See Section 5.

2.8.2.  Auditing

NFSv4 provides auditing on a per-file-object basis, via the ACL
attribute as described in Section 6.  It is outside the scope of this
specification to specify audit log formats or management policies.

2.8.3.  Intrusion Detection

NFSv4 provides alarm control on a per-file-object basis, via the ACL
attribute as described in Section 6.  Alarms may serve as the basis
for intrusion detection.  It is outside the scope of this
specification to specify heuristics for detecting intrusion via
alarms.

2.9.  Transport Layers

2.9.1.  Required and Recommended Properties of Transports

NFSv4 works over RDMA and non-RDMA-based transports with the
following attributes:

o  The transport supports reliable delivery of data, which NFSv4
   requires but neither NFSv4 nor RPC has facilities for ensuring
   [20].

o  The transport delivers data in the order it was sent.  Ordered
   delivery simplifies detection of transmit errors, and simplifies
   the sending of arbitrary sized requests and responses via the
   record marking protocol [4].

Where an NFS version 4 implementation supports operation over the IP
network protocol, any transport used between NFS and IP MUST be among
the IETF-approved congestion control transport protocols.  At the
time this document was written, the only two transports that had the
above attributes were TCP and SCTP.
To enhance the possibilities for interoperability, an NFS version 4
implementation MUST support operation over the TCP transport
protocol.

Even if NFS version 4 is used over a non-IP network protocol, it is
RECOMMENDED that the transport support congestion control.

It is permissible for a connectionless transport to be used under
NFSv4.1; however, reliable and in-order delivery of data by the
connectionless transport is still required.  NFSv4.1 assumes that a
client transport address and server transport address used to send
data over a transport together constitute a connection, even if the
underlying transport eschews the concept of a connection.

2.9.2.  Client and Server Transport Behavior

If a connection-oriented transport (e.g. TCP) is used, the client and
server SHOULD use long-lived connections for at least three reasons:

1.  This will prevent the weakening of the transport's congestion
    control mechanisms via short-lived connections.

2.  This will improve performance for the WAN environment by
    eliminating the need for connection setup handshakes.

3.  The NFSv4.1 callback model differs from NFSv4.0, and requires the
    client and server to maintain a client-created channel (see
    Section 2.10.3.4) for the server to use.

In order to reduce congestion, if a connection-oriented transport is
used, and the request is not the NULL procedure:

o  A requester MUST NOT retry a request unless the connection the
   request was issued over was disconnected before the reply was
   received.

o  A replier MUST NOT silently drop a request, even if the request is
   a retry.  (The silent drop behavior of RPCSEC_GSS [5] does not
   apply because this behavior happens at the RPCSEC_GSS layer, a
   lower layer in the request processing.)  Instead, the replier
   SHOULD return an appropriate error (see Section 2.10.4.1) or it
   MAY disconnect the connection.
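The requester-side rule above can be condensed into a small predicate
(illustrative Python; the function and parameter names are this
sketch's inventions):

```python
def requester_may_retry(is_null_proc, reply_received, connection_lost):
    """Over a connection-oriented transport, a requester may retry a
    request only if it is the NULL procedure (to which the rules
    above do not apply), or if the connection the request was issued
    over was lost before the reply arrived."""
    if is_null_proc:
        return True
    return connection_lost and not reply_received
```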
When using RDMA transports, there are other reasons for not
tolerating retries over the same connection:

o  RDMA transports use "credits" to enforce flow control, where a
   credit is a right to a peer to transmit a message.  If one peer
   were to retransmit a request (or reply), it would consume an
   additional credit.  If the replier retransmitted a reply, it would
   certainly result in an RDMA connection loss, since the requester
   would typically only post a single receive buffer for each
   request.  If the requester retransmitted a request, the additional
   credit consumed on the server might lead to RDMA connection
   failure unless the client accounted for it and decreased its
   available credit, leading to wasted resources.

o  RDMA credits present a new issue to the reply cache in NFSv4.1.
   The reply cache may be used when a connection within a session is
   lost, such as after the client reconnects.  Credit information is
   a dynamic property of the RDMA connection, and stale values must
   not be replayed from the cache.  This implies that the reply cache
   contents must not be blindly used when replies are issued from it,
   and credit information appropriate to the channel must be
   refreshed by the RPC layer.

In addition, the NFSv4.1 requester is not allowed to stop waiting for
a reply, as described in Section 2.10.4.2.

2.9.3.  Ports

Historically, NFS version 2 and version 3 servers have resided on
port 2049.  The registered port 2049 (RFC 3232 [21]) for the NFS
protocol should be the default configuration.  NFSv4 clients SHOULD
NOT use the RPC binding protocols described in RFC 1833 [22].

2.10.  Session

2.10.1.  Motivation and Overview

Previous versions and minor versions of NFS have suffered from the
following:

o  Lack of support for exactly once semantics (EOS).  This includes
   lack of support for EOS through server failure and recovery.
o  Limited callback support, including no support for sending
   callbacks through firewalls, and races between responses from
   normal requests and callbacks.

o  Limited trunking over multiple network paths.

o  Requiring machine credentials for fully secure operation.

Through the introduction of a session, NFSv4.1 addresses the above
shortfalls with practical solutions:

o  EOS is enabled by a reply cache with a bounded size, making it
   feasible to keep on persistent storage and enable EOS through
   server failure and recovery.  One reason that previous revisions
   of NFS did not support EOS was that some EOS approaches often
   limited parallelism.  As will be explained in Section 2.10.4,
   NFSv4.1 supports both EOS and unlimited parallelism.

o  The NFSv4.1 client creates transport connections and gives them to
   the server for sending callbacks, thus solving the firewall issue
   (Section 17.34).  Races between responses from client requests and
   callbacks caused by the requests are detected via the session's
   sequencing properties, which are a byproduct of EOS
   (Section 2.10.4.3).

o  The NFSv4.1 client can add an arbitrary number of connections to
   the session, and thus provide trunking (Section 2.10.3.4.1).

o  The NFSv4.1 session produces a session key, independent of client
   and server machine credentials, which can be used to compute a
   digest for protecting key session management operations
   (Section 2.10.6.3).

o  The NFSv4.1 client can also create secure RPCSEC_GSS contexts for
   use by the session's callback channel that do not require the
   server to authenticate to a client machine principal
   (Section 2.10.6.2).

A session is a dynamically created, long-lived server object created
by a client, used over time from one or more transport connections.
1732 Its function is to maintain the server's state relative to the 1733 connection(s) belonging to a client instance. This state is entirely 1734 independent of the connection itself, and indeed the state exists 1735 whether the connection exists or not (though locks, delegations, etc., 1736 generally expire in the extended absence of an open connection). 1737 The session in effect becomes the object representing an active 1738 client on a set of zero or more connections. 1740 2.10.2. NFSv4 Integration 1742 Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major 1743 infrastructure change like sessions would require a new major version 1744 number for an RPC program like NFS. However, because NFSv4 1745 encapsulates its functionality in a single procedure, COMPOUND, and 1746 because COMPOUND can support an arbitrary number of operations, 1747 sessions are almost trivially added. COMPOUND includes a minor 1748 version number field, and for NFSv4.1 this minor version is set to 1. 1749 When the NFSv4 server processes a COMPOUND with the minor version set 1750 to 1, it expects a different set of operations than it does for 1751 NFSv4.0. One operation it expects is the SEQUENCE operation, which 1752 is required for every COMPOUND that operates over an established 1753 session. 1755 2.10.2.1. SEQUENCE and CB_SEQUENCE 1757 In NFSv4.1, when the SEQUENCE operation is present, it is always the 1758 first operation in the COMPOUND procedure. The primary purpose of 1759 SEQUENCE is to carry the session identifier. The session identifier 1760 associates all other operations in the COMPOUND procedure with a 1761 particular session. SEQUENCE also contains required information for 1762 maintaining EOS (see Section 2.10.4). Session-enabled NFSv4.1 1763 COMPOUND requests thus have the form: 1765 +-----+--------------+-----------+------------+-----------+---- 1766 | tag | minorversion | numops |SEQUENCE op | op + args | ...
| | (== 1) | (limited) | + args | | 1768 +-----+--------------+-----------+------------+-----------+---- 1770 and the reply's structure is: 1772 +------------+-----+--------+-------------------------------+--// 1773 |last status | tag | numres |status + SEQUENCE op + results | // 1774 +------------+-----+--------+-------------------------------+--// 1775 //-----------------------+---- 1776 // status + op + results | ... 1777 //-----------------------+---- 1779 A CB_COMPOUND procedure request and reply has a similar form, but 1780 instead of a SEQUENCE operation, there is a CB_SEQUENCE operation, 1781 and there is an additional field called "callback_ident", which is 1782 superfluous in NFSv4.1. CB_SEQUENCE has the same information as 1783 SEQUENCE, but includes other information needed to solve callback 1784 races (Section 2.10.4.3). 1786 2.10.2.2. Client ID and Session Association 1788 Sessions are subordinate to the client ID (Section 2.4). Each client 1789 ID can have zero or more active sessions. A client ID and a session 1790 bound to it are required to do anything useful in NFSv4.1. Each time 1791 a session is used, the state leased to its associated client ID is 1792 automatically renewed. 1794 State such as share reservations, locks, delegations, and layouts 1795 (Section 1.4.4) is tied to the client ID, not the sessions of the 1796 client ID. Successive state changing operations from a given state 1797 owner can go over different sessions, as long as each session is 1798 associated with the same client ID. Callbacks can arrive over a 1799 different session than the session that sent the operation that 1800 acquired the state that the callback is for. For example, if session 1801 A is used to acquire a delegation, a request to recall the delegation 1802 can arrive over session B. 1804 2.10.3.
Channels 1806 Each session has one or two channels: the "operation" or "fore" 1807 channel used for ordinary requests from client to server, and the 1808 "back" channel, used for callback requests from server to client. 1809 The session allocates resources for each channel, including separate 1810 reply caches (see Section 2.10.4.1). These resources are for the 1811 most part specified at the time the session is created. 1813 2.10.3.1. Operation Channel 1815 The operation channel carries COMPOUND requests and responses. A 1816 session always has an operation channel. 1818 2.10.3.2. Backchannel 1820 The backchannel carries CB_COMPOUND requests and responses. Whether 1821 there is a backchannel or not is a decision of the client; NFSv4.1 1822 servers MUST support backchannels. 1824 2.10.3.3. Session and Channel Association 1826 Because there are at most two channels per session, and because each 1827 channel has a distinct purpose, channels are not assigned 1828 identifiers. The operation and backchannel are implicitly created 1829 and associated when the session is created. 1831 2.10.3.4. Connection and Channel Association 1833 Each channel is associated with zero or more transport connections. 1834 A connection can be bound to one channel or both channels of a 1835 session; the client and server negotiate whether a connection will 1836 carry traffic for one channel or both channels via the CREATE_SESSION 1837 (Section 17.36) and the BIND_CONN_TO_SESSION (Section 17.34) 1838 operations. When a session is created via CREATE_SESSION, it is 1839 automatically bound to the operation channel, and optionally the 1840 backchannel. If the client does not specify connection binding 1841 enforcement when the session is created, then additional connections 1842 are automatically bound to the operation channel when they are used 1843 with a SEQUENCE operation that has the session's sessionid. 1845 A connection MAY be bound to the channels of other sessions.
The 1846 client decides, and the NFSv4.1 server MUST allow it. A connection 1847 MAY be bound to the channels of other sessions of other clientids. 1848 Again, the client decides, and the server MUST allow it. 1850 It is permissible for connections of multiple types to be bound to 1851 the same channel. For example, a TCP connection and an RDMA connection can be bound 1852 to the operation channel. In the event an RDMA and non-RDMA 1853 connection are bound to the same channel, the maximum number of slots 1854 must be at least one more than the total number of credits. This way, 1855 if all RDMA credits are in use, the non-RDMA connection can have at 1856 least one outstanding request. 1858 It is permissible for a connection of one type to be bound to the 1859 operation channel, and another type bound to the backchannel. 1861 2.10.3.4.1. Trunking 1863 A client is allowed to issue EXCHANGE_ID multiple times to the same 1864 server. The client may be unaware that two different server network 1865 addresses refer to the same server. The use of EXCHANGE_ID allows a 1866 client to become aware that an additional network address refers to a 1867 server the client already has an established client ID and session 1868 for. The eir_server_owner and eir_server_scope results from 1869 EXCHANGE_ID give a client a hint that the server it is connected to 1870 may be the same as the server it is connected to via another 1871 connection. When EXCHANGE_ID is issued over two different 1872 connections, and each returns the same eir_server_owner.so_major_id 1873 and eir_server_scope, the client treats the connections as connected 1874 to the same server (subject to verification, as described later in 1875 this section (Paragraph 2), even if the destination network addresses 1876 are different).
As long as two unrelated servers have not selected and 1877 returned a conflicting pair of eir_server_owner.so_major_id and eir_server_scope, the 1878 client has not used different co_ownerid values in each 1879 EXCHANGE_ID request, and the server has not lost client ID state (e.g. the 1880 server has rebooted), the server MUST return the same eir_clientid 1881 result. The client and server then use the common eir_clientid 1882 to identify the client. The eir_server_owner.so_minor_id field 1883 allows the server to control binding of connections to sessions. 1884 When two connections have a matching eir_server_scope, so_major_id 1885 and so_minor_id, the client may bind both connections to a common 1886 session; this is session trunking. When two connections have a 1887 matching so_major_id and eir_server_scope, but different so_minor_id, 1888 the client will need to create a new session for the client ID in 1889 order to use the connection; this is client ID trunking. In either 1890 session or client ID trunking, the bandwidth capacity can scale with 1891 the number of connections. 1893 When two servers over two connections claim matching or partially 1894 matching eir_server_owner, eir_server_scope, and eir_clientid values, 1895 the client does not have to trust the servers' claims. The client 1896 may verify these claims before trunking traffic in the following 1897 ways: 1899 o For session trunking, clients and servers can reliably verify if 1900 connections between different network paths are in fact bound to 1901 the same NFSv4.1 server and usable on the same session. The 1902 SET_SSV (Section 17.47) operation allows a client and server to 1903 establish a unique, shared key value (the SSV). When a new 1904 connection is bound to the session (via the BIND_CONN_TO_SESSION 1905 operation, see Section 17.34), the client offers a digest that is 1906 based on the SSV.
If the client mistakenly tries to bind a 1907 connection to a session of a wrong server, the server will either 1908 reject the attempt because it is not aware of the session 1909 identifier of the BIND_CONN_TO_SESSION arguments, or it will 1910 reject the attempt because the digest for the SSV does not match 1911 what the server expects. Even if the server mistakenly or 1912 maliciously accepts the connection bind attempt, the digest it 1913 computes in the response will not be verified by the client, so the 1914 client will know it cannot use the connection for trunking the 1915 specified channel. 1917 o In the case of client ID trunking, the client can use RPCSEC_GSS 1918 to verify that each connection is aimed at the same server. When 1919 the client invokes EXCHANGE_ID, it should use RPCSEC_GSS. If each 1920 RPCSEC_GSS context over each connection has the same server 1921 principal, then -- barring a compromise of the server's GSS 1922 credentials -- the servers at the end of each connection are the 1923 same. 1925 2.10.4. Exactly Once Semantics 1927 Via the session, NFSv4.1 offers exactly once semantics (EOS) for 1928 requests sent over a channel. EOS is supported on both the operation 1929 and back channels. 1931 Each COMPOUND or CB_COMPOUND request that is issued with a leading 1932 SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver 1933 exactly once. This requirement holds regardless of whether the request is 1934 issued with reply caching specified (see Section 2.10.4.1.2). The 1935 requirement holds even if the requester is issuing the request over a 1936 session created between a pNFS data client and pNFS data server. The 1937 rationale for this requirement is understood by categorizing requests 1938 into three classifications: 1940 o Nonidempotent requests. 1942 o Idempotent modifying requests. 1944 o Idempotent non-modifying requests. 1946 An example of a non-idempotent request is RENAME.
It is obvious that 1947 if a replier executes the same RENAME request twice, and the first 1948 execution succeeds, the re-execution will fail. If the replier 1949 returns the result from the re-execution, this result is incorrect. 1950 Therefore, EOS is required for nonidempotent requests. 1952 An example of an idempotent modifying request is a COMPOUND request 1953 containing a WRITE operation. Repeated execution of the same WRITE 1954 has the same effect as a single execution of that WRITE. Nevertheless, 1955 enforcing EOS for WRITEs and other idempotent modifying 1956 requests is necessary to avoid data corruption. 1958 Suppose a client issues WRITEs A, B, C to a noncompliant server that 1959 does not enforce EOS, and receives no response, perhaps due to a 1960 network partition. The client reconnects to the server and re-issues 1961 all three WRITEs. Now, the server has two outstanding instances of 1962 each of A, B, and C. The server can be in a situation in which it 1963 executes and replies to the retries of A, B, and C while the first A, 1964 B, and C are still waiting around in the server's I/O system for some 1965 resource. Upon receiving the replies to the second attempts of 1966 WRITEs A, B, and C, the client believes its writes are done so it is 1967 free to issue WRITE D which overlaps the range of one or more of 1968 A, B, C. If any of A, B, or C is subsequently executed a 1969 second time, then what has been written by D can be overwritten and 1970 thus corrupted. 1972 Note that it is not required that the server cache the reply to the 1973 modifying operation to avoid data corruption (but if the client 1974 specified the reply to be cached, the server must cache it). 1976 An example of an idempotent non-modifying request is a COMPOUND 1977 containing SEQUENCE, PUTFH, READLINK and nothing else. The re- 1978 execution of such a request will not cause data corruption, or 1979 produce an incorrect result.
Nonetheless, for simplicity, the 1980 replier MUST enforce EOS for such requests. 1982 2.10.4.1. Slot Identifiers and Reply Cache 1984 The RPC layer provides a transaction ID (xid), which, while required 1985 to be unique, is not especially convenient for tracking requests. 1986 The xid is only meaningful to the requester; it cannot be interpreted 1987 by the replier except to test for equality with previously issued 1988 requests. Because RPC operations may be completed by the replier in 1989 any order, many transaction IDs may be outstanding at any time. The 1990 requester may therefore perform a computationally expensive lookup 1991 operation in the process of demultiplexing each reply. 1993 In NFSv4.1, there is a limit to the number of active requests. 1994 This immediately enables a computationally efficient index for each 1995 request which is designated as a Slot Identifier, or slotid. 1997 When the requester issues a new request, it selects a slotid in the 1998 range 0..N-1, where N is the replier's current "outstanding requests" 1999 limit granted to the requester on the session over which the request 2000 is to be issued. The value of N outstanding requests starts out as 2001 the value of ca_maxrequests (Section 17.36), but can be adjusted by 2002 the response to SEQUENCE or CB_SEQUENCE as described later in this 2003 section. The slotid must be unused by any of the requests that the 2004 requester already has active on the session. "Unused" here means the 2005 requester has no outstanding request for that slotid. Because the 2006 slotid is always an integer in the range 0..N-1, requester 2007 implementations can use the slotid from a replier response to 2008 efficiently match responses with outstanding requests, such as, for 2009 example, by using the slotid to index into an outstanding request 2010 array. This can be used to avoid expensive hashing and lookup 2011 functions in the performance-critical receive path.
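The requester-side slot handling described above can be sketched as follows. This is a minimal illustration, not part of the protocol definition; the class and method names are hypothetical.

```python
# Illustrative sketch of requester-side slot management: pick the lowest
# unused slotid in 0..N-1, and match a reply to its request by indexing
# the outstanding-request array with the slotid from the response.

class SlotTable:
    def __init__(self, max_requests):
        # slotid -> outstanding request, or None if the slot is unused
        self.outstanding = [None] * max_requests

    def acquire(self, request):
        # Use the lowest available slot, so the replier can retire
        # higher-numbered slot entries sooner.
        for slotid, entry in enumerate(self.outstanding):
            if entry is None:
                self.outstanding[slotid] = request
                return slotid
        return None        # all N slots busy: the requester must wait

    def complete(self, slotid):
        # The slotid in the reply indexes straight into the array,
        # avoiding hashing or lookup in the receive path.
        request = self.outstanding[slotid]
        self.outstanding[slotid] = None
        return request
```

A requester holding a table of N slots can thus never have more than N requests in flight on the channel, which is exactly what bounds the replier's reply cache.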
2013 The sequenceid, which accompanies the slotid in each request, is for 2014 an important check at the server: the replier must be able to determine 2015 efficiently whether a request using a certain slotid is a retransmit 2016 or a new, never-before-seen request. It is not feasible to implement this 2017 by having the client assert that it is retransmitting, because 2018 for any given request the client cannot know whether the server has seen it 2019 unless the server actually replies. Of course, if the client has 2020 seen the server's reply, the client would not retransmit. 2022 The sequenceid MUST increase monotonically for each new transmit of a 2023 given slotid, and MUST remain unchanged for any retransmission. The 2024 server must in turn compare each newly received request's sequenceid 2025 with the last one previously received for that slotid, to see if the 2026 new request is: 2028 o A new request, in which the sequenceid is one greater than that 2029 previously seen in the slot (accounting for sequence wraparound). 2030 The replier proceeds to execute the new request. 2032 o A retransmitted request, in which the sequenceid is equal to that 2033 last seen in the slot. Note that this request may be either 2034 complete, or in progress. The replier performs replay processing 2035 in these cases. 2037 o A misordered replay, in which the sequenceid is less 2038 (accounting for sequence wraparound) than that previously seen in 2039 the slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the 2040 result from SEQUENCE or CB_SEQUENCE). 2042 o A misordered new request, in which the sequenceid is greater by two or more 2043 (accounting for sequence wraparound) than that previously 2044 seen in the slot. Note that because the sequenceid must 2045 wrap around once it reaches 0xFFFFFFFF, a misordered new request and 2046 a misordered replay cannot be distinguished. Thus, the replier 2047 MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or 2048 CB_SEQUENCE).
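The four cases above reduce to a single modular comparison. The following sketch (illustrative only; the function name is not from the specification) shows the replier's check for a slot, with 32-bit wraparound:

```python
# Sketch of the replier's sequenceid check for one slot. Sequence IDs
# are 32-bit unsigned values that wrap around after 0xFFFFFFFF.

def classify(received, last_seen):
    delta = (received - last_seen) % (1 << 32)
    if delta == 1:
        return "new request"      # execute it
    if delta == 0:
        return "retransmit"       # perform replay processing
    # A misordered replay (less than last seen) and a misordered new
    # request (two or more ahead) cannot be distinguished under
    # wraparound, so both produce the same error.
    return "NFS4ERR_SEQ_MISORDERED"
```

Note how the wraparound case falls out naturally: a request with sequenceid 0 following last-seen 0xFFFFFFFF is a valid new request.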
2050 Unlike the XID, the slotid is always within a specific range; this 2051 has two implications. The first implication is that for a given 2052 session, the replier need only cache the results of a limited number 2053 of COMPOUND requests. The second implication derives from the first: 2054 unlike XID-indexed reply caches (also known as duplicate 2055 request caches, or DRCs), the slotid-based reply cache cannot be 2056 overflowed. Through use of the sequenceid to identify retransmitted 2057 requests, the replier does not need to actually cache the request 2058 itself, reducing the storage requirements of the reply cache further. 2059 These new facilities make it practical to maintain all the required 2060 entries for an effective reply cache. 2062 The slotid and sequenceid therefore take over the traditional role of 2063 the XID and port number in the replier reply cache implementation, 2064 and the session replaces the IP address. This approach is 2065 considerably more portable and completely robust; it is not subject 2066 to the frequent reassignment of ports as clients reconnect over IP 2067 networks. In addition, the RPC XID is not used in the reply cache, 2068 enhancing robustness of the cache in the face of any rapid reuse of 2069 XIDs by the client. [[Comment.3: We need to discuss the requirements 2070 of the client for changing the XID.]] 2072 The slotid information is included in each request, without violating 2073 the minor versioning rules of the NFSv4.0 specification, by encoding 2074 it in the SEQUENCE operation within each NFSv4.1 COMPOUND and 2075 CB_COMPOUND procedure. The operation easily piggybacks within 2076 existing messages. [[Comment.4: Need a better term than piggyback]] 2078 The receipt of a new sequenced request arriving on any valid slot is 2079 an indication that the previous reply cache contents of that slot may 2080 be discarded.
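A bounded, slot-indexed reply cache along these lines can be sketched as below. This is an illustrative simplification (it omits the misordered-sequenceid checks, which a real replier must perform first); all names are hypothetical.

```python
# Sketch of a bounded slot-based reply cache: exactly one (sequenceid,
# reply) entry per slotid, so the cache can never overflow, and a new
# sequenceid on a slot discards the slot's previous contents.

class ReplyCache:
    def __init__(self, nslots):
        # Sequence IDs start at 0 here, so the first request on a slot
        # (sequenceid 1) is treated as new.
        self.slots = [(0, None)] * nslots

    def process(self, slotid, sequenceid, execute):
        last_seq, cached = self.slots[slotid]
        if sequenceid == last_seq:
            return cached          # retransmit: replay from the cache
        # New request: previous contents of the slot are discarded.
        reply = execute()
        self.slots[slotid] = (sequenceid, reply)
        return reply
```

Because retransmissions are identified by (slotid, sequenceid) alone, the cache never needs to store or compare the request bodies themselves.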
2082 The SEQUENCE (and CB_SEQUENCE) operation also carries a 2083 "highest_slotid" value which carries additional client slot usage 2084 information. The requester must always provide a slotid representing 2085 the outstanding request with the highest-numbered slot value. The 2086 requester should in all cases provide the most conservative value 2087 possible, although it can be increased somewhat above the actual 2088 instantaneous usage to maintain some minimum or optimal level. This 2089 provides a way for the requester to yield unused request slots back 2090 to the replier, which in turn can use the information to reallocate 2091 resources. 2093 The replier responds with both a new target highest_slotid, and an 2094 enforced highest_slotid, described as follows: 2096 o The target highest_slotid is an indication to the requester of the 2097 highest_slotid the replier wishes the requester to be using. This 2098 permits the replier to withdraw (or add) resources from a 2099 requester that has been found to not be using them, in order to 2100 more fairly share resources among a varying level of demand from 2101 other requesters. The requester must always comply with the 2102 replier's value updates, since they indicate newly established 2103 hard limits on the requester's access to session resources. 2104 However, because of request pipelining, the requester may have 2105 active requests in flight reflecting prior values; therefore, the 2106 replier must not immediately require the requester to comply. 2108 o The enforced highest_slotid indicates the highest slotid the 2109 requester is permitted to use on a subsequent SEQUENCE or 2110 CB_SEQUENCE operation. 2112 The requester is required to use the lowest available slot when 2113 issuing a new request. This way, the replier may be able to retire 2114 slot entries faster. However, where the replier is actively 2115 adjusting its granted maximum request count (i.e.
the highest_slotid) 2116 to the requester, it will not be able to rely on just the receipt 2117 of the slotid and highest_slotid in the request. Neither the slotid nor 2118 the highest_slotid used in a request may reflect the replier's 2119 current idea of the requester's session limit, because the request 2120 may have been sent from the requester before the update was received. 2121 Therefore, in the downward adjustment case, the replier may have to 2122 retain a number of reply cache entries at least as large as the old 2123 value of maximum requests outstanding, until operation sequencing 2124 rules allow it to infer that the requester has seen its reply. 2126 2.10.4.1.1. Errors from SEQUENCE and CB_SEQUENCE 2128 Any time SEQUENCE or CB_SEQUENCE returns an error, the sequenceid of 2129 the slot MUST NOT change. The replier MUST NOT modify the reply 2130 cache entry for the slot whenever an error is returned from SEQUENCE 2131 or CB_SEQUENCE. 2133 2.10.4.1.2. Optional Reply Caching 2135 On a per-request basis the requester can choose to direct the replier 2136 to cache the reply to all operations after the first operation 2137 (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis 2138 fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it 2139 would not direct the replier to cache the entire reply is that the 2140 request is composed of all idempotent operations [20]. Caching the 2141 reply may offer little benefit, and if the reply is too large (see 2142 Section 2.10.4.4), it may not be cacheable anyway. 2144 Whether the requester requests the reply to be cached or not has no 2145 effect on the slot processing. If the results of SEQUENCE or 2146 CB_SEQUENCE are NFS4_OK, then the slot's sequenceid MUST be 2147 incremented by one. If a requester does not direct the replier to 2148 cache the reply, the replier MUST do one of the following: 2150 o The replier can cache the entire original reply.
Even though 2151 sa_cachethis or csa_cachethis are FALSE, the replier is always 2152 free to cache. It may choose this approach in order to simplify 2153 implementation. 2155 o The replier enters into its reply cache a reply consisting of the 2156 original results to the SEQUENCE or CB_SEQUENCE operation, 2157 followed by the error NFS4ERR_RETRY_UNCACHED_REP. Thus if the 2158 requester later retries the request, it will get 2159 NFS4ERR_RETRY_UNCACHED_REP. 2161 2.10.4.1.3. Multiple Connections and Sharing the Reply Cache 2163 Multiple connections can be bound to a session's channel, hence the 2164 connections share the same table of slotids. For connections over 2165 non-RDMA transports like TCP, there are no particular considerations. 2166 Considerations for multiple RDMA connections sharing a slot table are 2167 discussed in Section 2.10.5.1. [[Comment.5: Also need to discuss 2168 when RDMA and non-RDMA share a slot table.]] 2170 2.10.4.2. Retry and Replay 2172 A client MUST NOT retry a request, unless the connection it used to 2173 send the request disconnects. The client can then reconnect and 2174 resend the request, or it can resend the request over a different 2175 connection. In the case of the server resending over the 2176 backchannel, it cannot reconnect, and either resends the request over 2177 another connection that the client has bound to the backchannel, or 2178 if there is no other backchannel connection, waits for the client to 2179 bind a connection to the backchannel. 2181 A client MUST wait for a reply to a request before using the slot for 2182 another request. If it does not wait for a reply, then the client 2183 does not know what sequenceid to use for the slot on its next 2184 request. For example, suppose a client sends a request with 2185 sequenceid 1, and does not wait for the response. The next time it 2186 uses the slot, it sends the new request with sequenceid 2.
If the 2187 server has not seen the request with sequenceid 1, then the server is 2188 still expecting sequenceid 1, and rejects the client's new request with 2189 NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE). 2191 RDMA fabrics do not guarantee that the memory handles (Steering Tags) 2192 within each RDMA three-tuple are valid on a scope [[Comment.6: What 2193 is a three-tuple?]] outside that of a single connection. Therefore, 2194 handles used by the direct operations become invalid after connection 2195 loss. The server must ensure that any RDMA operations which must be 2196 replayed from the reply cache use the newly provided handle(s) from 2197 the most recent request. 2199 2.10.4.3. Resolving server callback races with sessions 2201 It is possible for server callbacks to arrive at the client before 2202 the reply from related forward channel operations. For example, a 2203 client may have been granted a delegation to a file it has opened, 2204 but the reply to the OPEN (informing the client of the granting of 2205 the delegation) may be delayed in the network. If a conflicting 2206 operation arrives at the server, it will recall the delegation using 2207 the callback channel, which may be on a different transport 2208 connection, perhaps even a different network. In NFSv4.0, if the 2209 callback request arrives before the related reply, the client may 2210 reply to the server with an error. 2212 The presence of a session between client and server alleviates this 2213 issue. When a session is in place, each client request is uniquely 2214 identified by its { slotid, sequenceid } pair. By the rules under 2215 which slot entries (reply cache entries) are retired, the server has 2216 knowledge whether the client has "seen" each of the server's replies. 2217 The server can therefore provide sufficient information to the client 2218 to allow it to disambiguate between an erroneous or conflicting 2219 callback and a race condition.
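The client side of this disambiguation, using the referring { slotid, sequenceid } pairs that the paragraphs below describe the server supplying in CB_SEQUENCE, can be sketched as follows. The function and argument names are illustrative, not from the specification.

```python
# Sketch of the client-side race check: the client tracks requests
# whose replies it has not yet seen, keyed by slotid. A callback whose
# referring pairs name any such request is known to have raced the
# reply for that request.

def callback_races_reply(referring_pairs, outstanding):
    """referring_pairs: list of (slotid, sequenceid) from CB_SEQUENCE.
    outstanding: dict mapping slotid -> sequenceid of a request still
    awaiting its reply."""
    return any(outstanding.get(slotid) == seqid
               for slotid, seqid in referring_pairs)
```

If the check is true, the client knows the callback is not erroneous; it can delay processing the callback until the related reply arrives (or until a timeout, as discussed below).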
2221 For each client operation which might result in some sort of server 2222 callback, the server should "remember" the { slotid, sequenceid } 2223 pair of the client request until the slotid retirement rules allow 2224 the server to determine that the client has, in fact, seen the 2225 server's reply. Until the time the { slotid, sequenceid } request 2226 pair can be retired, any recalls of the associated object MUST carry 2227 an array of these referring identifiers (in the CB_SEQUENCE 2228 operation's arguments), for the benefit of the client. After this 2229 time, it is not necessary for the server to provide this information 2230 in related callbacks, since it is certain that a race condition can 2231 no longer occur. 2233 The CB_SEQUENCE operation which begins each server callback carries a 2234 list of "referring" { slotid, sequenceid } tuples. If the client 2235 finds the request corresponding to the referring slotid and 2236 sequenceid to be currently outstanding (i.e. the server's reply has not been 2237 seen by the client), it can determine that the callback has raced the 2238 reply, and act accordingly. 2240 The client must not simply wait forever for the expected server reply 2241 to arrive on any of the session's operation channels, because it is 2242 possible that they will be delayed indefinitely. However, it should 2243 wait for a period of time, and if the time expires it can provide a 2244 more meaningful error such as NFS4ERR_DELAY. 2246 [[Comment.7: We need to consider the clients' options here, and 2247 describe them... NFS4ERR_DELAY has been discussed as a legal reply 2248 to CB_RECALL?]] 2250 There are other scenarios under which callbacks may race replies, 2251 among them pNFS layout recalls, described in Section 12.5.4.2 2252 [[Comment.8: fill in the blanks w/others, etc...]] 2254 2.10.4.4.
COMPOUND and CB_COMPOUND Construction Issues 2256 Very large requests and replies may pose both buffer management 2257 issues (especially with RDMA) and reply cache issues. When the 2258 session is created (Section 17.36), the client and server negotiate 2259 the maximum sized request they will send or process 2260 (ca_maxrequestsize), the maximum sized reply they will return or 2261 process (ca_maxresponsesize), and the maximum sized reply they will 2262 store in the reply cache (ca_maxresponsesize_cached). 2264 If a request exceeds ca_maxrequestsize, the reply will have the 2265 status NFS4ERR_REQ_TOO_BIG. A replier may return NFS4ERR_REQ_TOO_BIG 2266 as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the 2267 request, or it may choose to return it on a subsequent operation. 2269 If a reply exceeds ca_maxresponsesize, the reply will have the status 2270 NFS4ERR_REP_TOO_BIG. A replier may return NFS4ERR_REP_TOO_BIG as the 2271 status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, 2272 or it may choose to return it on a subsequent operation. 2274 If sa_cachethis or csa_cachethis are TRUE, then the replier MUST 2275 cache a reply except if an error is returned by the SEQUENCE or 2276 CB_SEQUENCE operation (see Section 2.10.4.1.1). If the reply exceeds 2277 ca_maxresponsesize_cached, (and sa_cachethis or csa_cachethis are 2278 TRUE) then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even 2279 if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) 2280 is returned on an operation other than the first operation (SEQUENCE or 2281 CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or 2282 csa_cachethis are TRUE. For example, if a COMPOUND has eleven 2283 operations, including SEQUENCE, the fifth operation is a RENAME, and 2284 the tenth operation is a READ for one million bytes, the server may 2285 return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation.
Since 2286 the server executed several operations, especially the non-idempotent 2287 RENAME, the client's request to cache the reply needs to be honored 2288 in order for correct operation of exactly once semantics. If the 2289 client retries the request, the server will have cached a reply that 2290 contains results for ten of the eleven requested operations, with the 2291 tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE. 2293 A client needs to take care, when sending operations that change 2294 the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH), 2295 that it does not exceed the maximum reply buffer before the GETFH 2296 operation. Otherwise the client will have to retry the operation 2297 that changed the current filehandle, in order to obtain the desired 2298 filehandle. For the OPEN operation (see Section 17.16), retry is not 2299 always available as an option. The following guidelines for the 2300 handling of filehandle changing operations are advised: 2302 o A client SHOULD issue GETFH immediately after a current filehandle 2303 changing operation. This is especially important after any 2304 current filehandle changing non-idempotent operation. It is 2305 critical to issue GETFH immediately after OPEN. 2307 o A server MAY return NFS4ERR_REP_TOO_BIG or 2308 NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a 2309 filehandle changing operation if the reply would be too large on 2310 the next operation. 2312 o A server SHOULD return NFS4ERR_REP_TOO_BIG or 2313 NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a 2314 filehandle changing non-idempotent operation if the reply would be 2315 too large on the next operation, especially if the operation is 2316 OPEN. 2318 o A server MAY return NFS4ERR_UNSAFE_COMPOUND if it looks at the 2319 next operation after a non-idempotent current filehandle changing 2320 operation, and finds it is not GETFH.
The server would do this if 2321 it is unable to determine in advance whether the total response 2322 size would exceed ca_maxresponsesize_cached or ca_maxresponsesize. 2324 2.10.4.5. Persistence 2326 Since the reply cache is bounded, it is practical for the server 2327 reply cache to persist across server reboots, and to be kept in 2328 stable storage (a client's reply cache for callbacks need not persist 2329 across client reboots unless the client intends for its session and 2330 other state to persist across reboots). To support a persistent reply cache, the following must be stored in stable storage: 2332 o The slot table including the sequenceid and cached reply for each 2333 slot. 2335 o The sessionid. 2337 o The client ID. 2339 o The SSV (see Section 2.10.6.3). 2341 The CREATE_SESSION operation (see Section 17.36) determines the 2342 persistence of the reply cache. 2344 2.10.5. RDMA Considerations 2346 A complete discussion of the operation of RPC-based protocols atop 2347 RDMA transports is in [RPCRDMA]. A discussion of the operation of 2348 NFSv4, including NFSv4.1, over RDMA is in [NFSDDP]. Where RDMA is 2349 considered, this specification assumes the use of such a layering; it 2350 addresses only the upper layer issues relevant to making best use of 2351 RPC/RDMA. 2353 2.10.5.1. RDMA Connection Resources 2355 RDMA requires its consumers to register memory and post buffers of a 2356 specific size and number for receive operations. 2358 Registration of memory can be a relatively high-overhead operation, 2359 since it requires pinning of buffers, assignment of attributes (e.g. 2360 readable/writable), and initialization of hardware translation. 2361 Preregistration is desirable to reduce overhead. These registrations 2362 are specific to hardware interfaces and even to RDMA connection 2363 endpoints; therefore, negotiation of their limits is desirable to 2364 manage resources effectively. 2366 Following the basic registration, these buffers must be posted by the 2367 RPC layer to handle receives.
These buffers remain in use by the 2368 RPC/NFSv4 implementation; the size and number of them must be known 2369 to the remote peer in order to avoid RDMA errors which would cause a 2370 fatal error on the RDMA connection. 2372 NFSv4.1 manages slots as resources on a per session basis (see 2373 Section 2.10), while RDMA connections manage credits on a per 2374 connection basis. This means that in order for a peer to send data 2375 over RDMA to a remote buffer, it has to have both an NFSv4.1 slot, 2376 and an RDMA credit. 2378 2.10.5.2. Flow Control 2380 NFSv4.0 and all previous versions do not provide for any form of flow 2381 control; instead they rely on the windowing provided by transports 2382 like TCP to throttle requests. This does not work with RDMA, which 2383 provides no operation flow control and will terminate a connection in 2384 error when limits are exceeded. Limits such as maximum number of 2385 requests outstanding are therefore negotiated when a session is 2386 created (see the ca_maxrequests field in Section 17.36). These 2387 limits then provide the maxima each session's channels' connections 2388 must operate within. RDMA connections are managed within these 2389 limits as described in section 3.3 of [RPCRDMA]; if there are 2390 multiple RDMA connections, then the maximum requests for a channel 2391 will be divided among the RDMA connections. The limits may also be 2392 modified dynamically at the server's choosing by manipulating certain 2393 parameters present in each NFSv4.1 request. In addition, the 2394 CB_RECALL_SLOT callback operation (see Section 19.8) can be issued by 2395 a server to a client to return RDMA credits to the server, thereby 2396 lowering the maximum number of requests a client can have outstanding 2397 to the server. 2399 2.10.5.3.
Padding 2401 Header padding is requested by each peer at session initiation (see 2402 the csa_headerpadsize argument to CREATE_SESSION in Section 17.36), 2403 and subsequently used by the RPC RDMA layer, as described in 2404 [RPCRDMA]. Zero padding is permitted. 2406 Padding leverages the useful property that RDMA receives preserve 2407 alignment of data, even when they are placed into anonymous 2408 (untagged) buffers. If requested, client inline writes will insert 2409 appropriate pad bytes within the request header to align the data 2410 payload on the specified boundary. The client is encouraged to add 2411 sufficient padding (up to the negotiated size) so that the "data" 2412 field of the NFSv4.1 WRITE operation is aligned. Most servers can 2413 make good use of such padding, which allows them to chain receive 2414 buffers in such a way that any data carried by client requests will 2415 be placed into appropriate buffers at the server, ready for file 2416 system processing. The receiver's RPC layer encounters no overhead 2417 from skipping over pad bytes, and the RDMA layer's high performance 2418 makes the insertion and transmission of padding on the sender a 2419 significant optimization. In this way, the need for servers to 2420 perform RDMA Read to satisfy all but the largest client writes is 2421 obviated. An added benefit is the reduction of message round trips 2422 on the network - a potentially good trade, where latency is present. 2424 The value to choose for padding is subject to a number of criteria. 2425 A primary source of variable-length data in the RPC header is the 2426 authentication information, the form of which is client-determined, 2427 possibly in response to server specification. The contents of 2428 COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all 2429 go into the determination of a maximal NFSv4 request size and 2430 therefore minimal buffer size. 
The client must select its offered 2431 value carefully, so as not to overburden the server, and vice versa. 2432 The payoff of an appropriate padding value is higher performance. 2434 Sender gather: 2435 |RPC Request|Pad bytes|Length| -> |User data...| 2436 \------+---------------------/ \ 2437 \ \ 2438 \ Receiver scatter: \-----------+- ... 2439 /-----+----------------\ \ \ 2440 |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->... 2442 In the above case, the server may recycle unused buffers to the next 2443 posted receive if unused by the actual received request, or may pass 2444 the now-complete buffers by reference for normal write processing. 2445 For a server which can make use of it, this removes any need for data 2446 copies of incoming data, without resorting to complicated end-to-end 2447 buffer advertisement and management. This includes most kernel-based 2448 and integrated server designs, among many others. The client may 2449 perform similar optimizations, if desired. 2451 2.10.5.4. Dual RDMA and Non-RDMA Transports 2453 Some RDMA transports (for example, see [RDDP]), [[Comment.9: need 2454 xref]] require a "streaming" (non-RDMA) phase, where ordinary traffic 2455 might flow before "stepping" up to RDMA mode, commencing RDMA 2456 traffic. Some RDMA transports start connections always in RDMA mode. 2457 NFSv4.1 allows, but does not assume, a streaming phase before RDMA 2458 mode. When a connection is bound to a session, the client and server 2459 negotiate whether the connection is used in RDMA or non-RDMA mode 2460 (see Section 17.36 and Section 17.34). 2462 2.10.6. Sessions Security 2464 2.10.6.1. Session Callback Security 2466 Via session connection binding, NFSv4.1 improves security over that 2467 provided by NFSv4.0 for the callback channel. The connection is 2468 client-initiated (see Section 17.34), and subject to the same 2469 firewall and routing checks as the operations channel.
The 2470 connection cannot be hijacked by an attacker who connects to the 2471 client port prior to the intended server. At the client's option 2472 (see Section 17.36), binding is fully authenticated before being 2473 activated (see Section 17.34). Traffic from the server over the 2474 callback channel is authenticated exactly as the client specifies 2475 (see Section 2.10.6.2). 2477 2.10.6.2. Backchannel RPC Security 2479 When the NFSv4.1 client establishes the backchannel, it informs the 2480 server what security flavors and principals it must use when sending 2481 requests over the backchannel. If the security flavor is RPCSEC_GSS, 2482 the client expresses the principal in the form of an established 2483 RPCSEC_GSS context. The server is free to use any flavor/principal 2484 combination the client offers, but MUST NOT use unoffered 2485 combinations. 2487 This way, the client does not have to provide a target GSS principal 2488 as it did with NFSv4.0, and the server does not have to implement an 2489 RPCSEC_GSS initiator as it did with NFSv4.0. [[Comment.10: xrefs]] 2491 The CREATE_SESSION (Section 17.36) and BACKCHANNEL_CTL 2492 (Section 17.33) operations allow the client to specify flavor/ 2493 principal combinations. 2495 2.10.6.3. Protection from Unauthorized State Changes 2497 Under some conditions, NFSv4.0 is vulnerable to a denial of service 2498 issue with respect to its state management. 2500 The attack works via an unauthorized client faking an open_owner4, an 2501 open_owner/lock_owner pair, or stateid, combined with a seqid. The 2502 operation is sent to the NFSv4 server. The NFSv4 server accepts the 2503 state information, and as long as any status code from the result of 2504 this operation is not NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID, 2505 NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR, 2506 NFS4ERR_RESOURCE, or NFS4ERR_NOFILEHANDLE, the sequence number is 2507 incremented.
When the authorized client issues an operation, it gets 2508 back NFS4ERR_BAD_SEQID, because its idea of the current sequence 2509 number is off by one. The authorized client's recovery options are 2510 pretty limited, with SETCLIENTID, followed by complete reclaim of 2511 state, which may or may not succeed completely. That qualifies as a 2512 denial of service attack. 2514 If the client uses RPCSEC_GSS authentication and integrity, and every 2515 client maps each open_owner and lock_owner to one and only one 2516 principal, and the server enforces this binding, then the conditions 2517 leading to vulnerability to the denial of service do not exist. One 2518 should keep in mind that if AUTH_SYS is being used, far simpler 2519 denial of service and other attacks are possible. 2521 With NFSv4.1 sessions, the per-operation sequence number is ignored 2522 (see Section 8.13); therefore, the NFSv4.0 denial of service 2523 vulnerability described above does not apply. However, as described 2524 to this point in the specification, an attacker could forge the 2525 sessionid and issue a SEQUENCE with a slotid that it expects the 2526 legitimate client to use next. The legitimate client could then use 2527 the slotid with the same sequence number, and the server returns the 2528 attacker's result from the replay cache, thereby disrupting the 2529 legitimate client. 2531 If we give each NFSv4.1 user their own session, and each user uses 2532 RPCSEC_GSS authentication and integrity, then the denial of service 2533 issue is solved, at the cost of additional per session state. The 2534 alternative NFSv4.1 specifies is described as follows. 2536 Transport connections MUST be bound to a session by the client. The 2537 server MUST return an error to an operation (other than the operation 2538 that binds the connection to the session) that uses an unbound 2539 connection.
As a simplification, the transport connection used by 2540 CREATE_SESSION (see Section 17.36) is automatically bound to the 2541 session. Additional connections are bound to a session via 2542 BIND_CONN_TO_SESSION (see Section 17.34). 2544 To prevent attackers from issuing BIND_CONN_TO_SESSION operations, 2545 the arguments to BIND_CONN_TO_SESSION include a digest of a shared 2546 secret called the secret session verifier (SSV) that only the client 2547 and server know. The digest is created via a one way, collision 2548 resistant hash function, making it intractable for the attacker to 2549 forge. 2551 The SSV is sent to the server via SET_SSV (see Section 17.47). To 2552 prevent eavesdropping, the SET_SSV operation SHOULD be protected via 2553 RPCSEC_GSS with the privacy service. The SSV can be changed by the 2554 client at any time, by any principal. However, several aspects of SSV 2555 changing prevent an attacker from engaging in a successful denial of 2556 service attack: 2558 o A SET_SSV on the SSV does not replace the SSV with the argument to 2559 SET_SSV. Instead, the current SSV on the server is logically 2560 exclusive ORed (XORed) with the argument to SET_SSV. SET_SSV MUST 2561 NOT be called with an SSV value that is zero. 2563 o The arguments to and results of SET_SSV include digests of the old 2564 and new SSV, respectively. 2566 o Because the initial value of the SSV is zero and therefore known, a 2567 client that opts for connection binding enforcement MUST issue at 2568 least one SET_SSV operation before the first BIND_CONN_TO_SESSION 2569 operation. A client SHOULD issue SET_SSV as soon as a session is 2570 created. 2572 If a connection is disconnected, BIND_CONN_TO_SESSION is required to 2573 bind a connection to the session, even if the disconnected connection 2574 was the one over which CREATE_SESSION was issued.
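The XOR semantics of SET_SSV and the digest check performed on BIND_CONN_TO_SESSION can be sketched as follows. This is an illustrative model only: the class and method names are hypothetical, and HMAC-SHA256 stands in for whatever one-way, collision-resistant digest the implementations actually use.

```python
import hmac
import hashlib

class SsvState:
    """Toy model of one side's secret session verifier (SSV) handling."""

    def __init__(self, size=32):
        # The initial SSV is all zeros, and therefore known to everyone.
        self.ssv = bytes(size)

    def set_ssv(self, argument: bytes) -> None:
        # SET_SSV does not replace the SSV; the argument is XORed into
        # the current SSV.  An all-zero argument is forbidden, since it
        # would leave the SSV unchanged.
        if all(b == 0 for b in argument):
            raise ValueError("SET_SSV argument must not be zero")
        self.ssv = bytes(a ^ b for a, b in zip(self.ssv, argument))

    def digest(self, message: bytes) -> bytes:
        # One-way, collision-resistant digest keyed by the SSV.
        return hmac.new(self.ssv, message, hashlib.sha256).digest()

    def verify_bind(self, message: bytes, presented: bytes) -> bool:
        # BIND_CONN_TO_SESSION is accepted only if the digest matches,
        # i.e. only a peer that knows the SSV can bind a connection.
        return hmac.compare_digest(self.digest(message), presented)

server, client = SsvState(), SsvState()

# The client XORs fresh secret material into the SSV via SET_SSV
# (privacy-protected on the wire), and both sides update in step.
secret = bytes(range(1, 33))
client.set_ssv(secret)
server.set_ssv(secret)

msg = b"BIND_CONN_TO_SESSION arguments"
assert server.verify_bind(msg, client.digest(msg))          # accepted
assert not server.verify_bind(msg, SsvState().digest(msg))  # attacker: zero SSV
```

Because the attacker's SSV is still the known initial zero value, its digest fails verification, which is the property the bullets above rely on.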
2576 If a client is assigned a machine principal, then the client SHOULD 2577 use the machine principal's RPCSEC_GSS context to privacy protect the 2578 SSV from eavesdropping during the SET_SSV operation. If a machine 2579 principal is not being used, then the client MAY use the non-machine 2580 principal's RPCSEC_GSS context to privacy protect the SSV. The 2581 server MUST accept either type of principal. A client SHOULD change 2582 the SSV each time a new principal uses the session. 2584 Here are the types of attacks that can be attempted by an attacker 2585 named Eve, and how the connection to session binding approach 2586 addresses each attack: 2588 o If Eve creates a connection after the legitimate client 2589 establishes an SSV via privacy protection from a machine 2590 principal's RPCSEC_GSS session, she does not know the SSV and so 2591 cannot compute a digest that BIND_CONN_TO_SESSION will accept. 2592 Users on the legitimate client cannot be disrupted by Eve. 2594 o If Eve is the first one to log into the legitimate client, and the 2595 client does not use machine principals, then Eve can cause an SSV 2596 to be created via the legitimate client's NFSv4.1 implementation, 2597 protected by the RPCSEC_GSS context created by the legitimate 2598 client (which uses Eve's GSS principal and credentials). Eve can 2599 then eavesdrop on the network, and because she knows her 2600 credentials, she can decrypt the SSV. Eve can compute a digest 2601 BIND_CONN_TO_SESSION will accept, and so bind a new connection to 2602 the session. Eve can change the slotid, sequence state, and/or 2603 the SSV state in such a way that when Bob accesses the server via 2604 the legitimate client, the legitimate client will be unable to use 2605 the session. 2607 The client's only recourse is to create a new session, which will 2608 cause any state Eve created on the legitimate client over the old 2609 (but hijacked) session to be lost.
This disrupts Eve, but because 2610 she is the attacker, this is acceptable. 2612 Once the legitimate client establishes an SSV over the new session 2613 using Bob's RPCSEC_GSS context, Eve can use the new session via 2614 the legitimate client, but she cannot disrupt Bob. Moreover, 2615 because the client SHOULD have modified the SSV due to Eve using 2616 the new session, Bob cannot get revenge on Eve by binding a rogue 2617 connection to the session. 2619 The question is how does the legitimate client detect that Eve has 2620 hijacked the old session? When the client detects that a new 2621 principal, Bob, wants to use the session, it SHOULD have issued a 2622 SET_SSV. 2624 * Let us suppose that from the rogue connection, Eve issued a 2625 SET_SSV with the same slotid and sequence that the legitimate 2626 client later uses. The server will assume this is a replay, 2627 and return to the legitimate client the reply it sent Eve. 2628 However, unless Eve can correctly guess the SSV the legitimate 2629 client will use, the digest verification checks in the SET_SSV 2630 response will fail. That is the clue to the client that the 2631 session has been hijacked. 2633 * Alternatively, Eve issued a SET_SSV with a different slotid 2634 than the legitimate client uses for its SET_SSV. Then the 2635 digest verification on the server fails, and the client is 2636 again clued that the session has been hijacked. 2638 * Alternatively, Eve issued an operation other than SET_SSV, but 2639 with the same slotid and sequence that the legitimate client 2640 uses for its SET_SSV. The server returns to the legitimate 2641 client the response it sent Eve. The client sees that the 2642 response is not at all what it expects. The client assumes 2643 either session hijacking or server bug, and either way destroys 2644 the old session. 2646 o Eve binds a rogue connection to the session as above, and then 2647 destroys the session. 
Again, Bob goes to use the server from the 2648 legitimate client. The client has a very clear indication that 2649 its session was hijacked, and does not even have to destroy the 2650 old session before creating a new session, which Eve will be 2651 unable to hijack because it will be protected with an SSV created 2652 via Bob's RPCSEC_GSS protection. 2654 o If Eve creates a connection before the legitimate client 2655 establishes an SSV, because the initial value of the SSV is zero 2656 and therefore known, Eve can issue a SET_SSV that will pass the 2657 digest verification check. However because the new connection has 2658 not been bound to the session, the SET_SSV is rejected for that 2659 reason. 2661 o The connection to session binding model does not prevent 2662 connection hijacking. However, if an attacker can perform 2663 connection hijacking, it can issue denial of service attacks that 2664 are less difficult than attacks based on forging sessions. 2666 2.10.7. Session Mechanics - Steady State 2668 2.10.7.1. Obligations of the Server 2670 The server has the primary obligation to monitor the state of 2671 backchannel resources that the client has created for the server 2672 (RPCSEC_GSS contexts and back channel connections). When these 2673 resources go away, the server takes action as specified in 2674 Section 2.10.8.2. 2676 2.10.7.2. Obligations of the Client 2678 The client has the following obligations in order to utilize the 2679 session: 2681 o Keep a necessary session from going idle on the server. A client 2682 that requires a session, but nonetheless is not sending operations 2683 risks having the session be destroyed by the server. This is 2684 because sessions consume resources, and resource limitations may 2685 force the server to cull the least recently used session. 2687 o Destroy the session when idle. When a session has no state other 2688 than the session, and no outstanding requests, the client should 2689 consider destroying the session. 
2691 o Maintain GSS contexts for callback. If the client requires the 2692 server to use the RPCSEC_GSS security flavor for callbacks, then 2693 it needs to be sure the contexts handed to the server via 2694 BACKCHANNEL_CTL are unexpired. A good practice is to keep at 2695 least two contexts outstanding, where the expiration time of the 2696 newest context at the time it was created is N times that of the 2697 oldest context, where N is the number of contexts available for 2698 callbacks. 2700 o Maintain an active connection. The server requires a callback 2701 path in order to gracefully recall recallable state, or notify the 2702 client of certain events. 2704 2.10.7.3. Steps the Client Takes To Establish a Session 2706 The client issues EXCHANGE_ID to establish a client ID. 2708 The client uses the client ID to issue a CREATE_SESSION on a 2709 connection to the server. The results of CREATE_SESSION indicate 2710 whether the server will persist the session replay cache through a 2711 server reboot or not, and the client notes this for future reference. 2713 The client SHOULD have specified connection binding enforcement when 2714 the session was created. If so, the client SHOULD issue SET_SSV in 2715 the first COMPOUND after the session is created. If it is not using 2716 machine credentials, then each time a new principal goes to use the 2717 session, it SHOULD issue a SET_SSV again. 2719 If the client wants to use delegations, layouts, directory 2720 notifications, or any other state that requires a callback channel, 2721 then it MUST add a connection to the backchannel if CREATE_SESSION 2722 did not already do so. The client creates a connection, and calls 2723 BIND_CONN_TO_SESSION to bind the connection to the session and the 2724 session's backchannel. If CREATE_SESSION did not already do so, the 2725 client MUST tell the server what security is required in order for 2726 the client to accept callbacks. The client does this via 2727 BACKCHANNEL_CTL.
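As a rough illustration, the establishment steps above amount to an ordered sequence of operations. The sketch below is not a real RPC binding; the function and flag names are hypothetical placeholders used only to show the ordering constraints.

```python
# Hypothetical client-side sketch of the session-establishment order.
def establish_session(enforce_binding=True,
                      need_backchannel=True,
                      backchannel_on_create=False):
    ops = []
    ops.append("EXCHANGE_ID")       # 1. obtain a client ID
    ops.append("CREATE_SESSION")    # 2. create the session under that client ID
    if enforce_binding:
        # 3. with connection binding enforcement, SET_SSV belongs in the
        #    first COMPOUND after the session is created
        ops.append("SET_SSV")
    if need_backchannel and not backchannel_on_create:
        # 4. add a backchannel connection if CREATE_SESSION did not
        ops.append("BIND_CONN_TO_SESSION")
        # 5. tell the server what callback security to use
        ops.append("BACKCHANNEL_CTL")
    return ops

assert establish_session() == ["EXCHANGE_ID", "CREATE_SESSION", "SET_SSV",
                               "BIND_CONN_TO_SESSION", "BACKCHANNEL_CTL"]
```

The point of the ordering is that SET_SSV precedes any BIND_CONN_TO_SESSION, since the initial SSV of zero is known to everyone.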
2729 If the client wants to use additional connections for the 2730 backchannel, then it MUST call BIND_CONN_TO_SESSION on each 2731 connection it wants to use with the session. If the client wants to 2732 use additional connections for the operation channel, then it MUST 2733 call BIND_CONN_TO_SESSION if it specified connection binding 2734 enforcement before using the connection. 2736 At this point the client has reached a steady state as far as session 2737 use. 2739 2.10.8. Session Mechanics - Recovery 2741 2.10.8.1. Events Requiring Client Action 2743 The following events require client action to recover. 2745 2.10.8.1.1. RPCSEC_GSS Context Loss by Callback Path 2747 If all RPCSEC_GSS contexts granted by the client to the server for 2748 callback use have expired, the client MUST establish a new context 2749 via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE 2750 results indicates when callback contexts are nearly expired, or fully 2751 expired (see Section 17.46.4). 2753 2.10.8.1.2. Connection Disconnect 2755 If the client loses the last connection of the session, then it MUST 2756 create a new connection, and if connection binding enforcement was 2757 specified when the session was created, bind it to the session via 2758 BIND_CONN_TO_SESSION. 2760 If there were requests outstanding at the time of the connection 2761 disconnect, then the client MUST retry the request, as described in 2762 Section 2.10.4.2. Note that it is not necessary to retry requests 2763 over a connection with the same source network address or the same 2764 destination network address as the disconnected connection. As long 2765 as the sessionid, slotid, and sequenceid in the retry match that of 2766 the original request, the server will recognize the request as a 2767 retry if it did see the request prior to disconnect.
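The retry behavior just described can be sketched with a minimal reply cache model: the server keys replies on the (sessionid, slotid, sequenceid) triple, not on the connection's network addresses, so a retry over any connection bound to the session replays the cached result. The class and method names below are illustrative, not part of the protocol.

```python
# Sketch: a reply cache that recognizes retries by the
# (sessionid, slotid, sequenceid) triple, independent of connection.
class ReplyCache:
    def __init__(self):
        # (sessionid, slotid) -> (sequenceid, cached_reply)
        self.slots = {}

    def process(self, sessionid, slotid, sequenceid, execute):
        key = (sessionid, slotid)
        if key in self.slots:
            cached_seq, cached_reply = self.slots[key]
            if sequenceid == cached_seq:
                # Retry: replay the cached reply, do not re-execute.
                return cached_reply
        reply = execute()                      # new request: execute it
        self.slots[key] = (sequenceid, reply)  # and cache the result
        return reply

cache = ReplyCache()
calls = []
first = cache.process(b"sess1", 0, 7, lambda: calls.append("exec") or "result")
# The retry arrives over a different connection, but the triple matches,
# so the server replays the cached reply instead of executing again.
retry = cache.process(b"sess1", 0, 7, lambda: calls.append("exec") or "BUG")
assert first == retry == "result"
assert calls == ["exec"]   # the operation ran exactly once
```

This is why exactly-once semantics survive a reconnect: the retried request is identified by session state, not by the transport it arrives on.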
2769 If the connection that was bound to the backchannel is lost, the 2770 client may need to reconnect, and use BIND_CONN_TO_SESSION, to give 2771 the connection to the backchannel. If the connection that was lost 2772 was the last one bound to the backchannel, the client MUST reconnect, 2773 and bind the connection to the session and backchannel. The server 2774 should indicate when it has no callback connection via the 2775 sr_status_flags result from SEQUENCE. 2777 2.10.8.1.3. Backchannel GSS Context Loss 2779 Via the sr_status_flags result of the SEQUENCE operation or other 2780 means, the client will learn if some or all of the RPCSEC_GSS 2781 contexts it assigned to the backchannel have been lost. The client 2782 may need to use BACKCHANNEL_CTL to assign new contexts. It MUST 2783 assign new contexts if there are no more contexts. 2785 2.10.8.1.4. Loss of Session 2787 The server may lose a record of the session. Causes include: 2789 o Server crash and reboot 2791 o A catastrophe that causes the cache to be corrupted or lost on the 2792 media it was stored on. This applies even if the server indicated 2793 in the CREATE_SESSION results that it would persist the cache. 2795 o The server purges the session of a client that has been inactive 2796 for a very extended period of time. [[Comment.11: XXX - Should we 2797 add a value to the CREATE_SESSION results that tells a client how 2798 long he can let a session stay idle before losing it?]] 2800 Loss of replay cache is equivalent to loss of session. The server 2801 indicates loss of session to the client by returning 2802 NFS4ERR_BADSESSION on the next operation that uses the sessionid 2803 associated with the lost session. 2805 After an event like a server reboot, the client may have lost its 2806 connections. The client assumes for the moment that the session has 2807 not been lost. 
It reconnects, and if it specified connection binding 2808 enforcement when the session was created, it invokes 2809 BIND_CONN_TO_SESSION using the sessionid. Otherwise, it invokes 2810 SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns 2811 NFS4ERR_BADSESSION, the client knows the session was lost. If the 2812 connection survives session loss, then the next SEQUENCE operation 2813 the client issues over the connection will get back 2814 NFS4ERR_BADSESSION. The client again knows the session was lost. 2816 When the client detects session loss, it must call CREATE_SESSION to 2817 recover. Any non-idempotent operations that were in progress may 2818 have been performed on the server at the time of session loss. The 2819 client has no general way to recover from this. 2821 Note that loss of session does not imply loss of lock, open, 2822 delegation, or layout state. Nor does loss of lock, open, 2823 delegation, or layout state imply loss of session state. 2824 [[Comment.12: Add reference to lock recovery section]]. A session 2825 can survive a server reboot, but lock recovery may still be needed. 2826 The converse is also true. 2828 It is possible that CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID 2829 (for example, the server reboots and does not preserve client ID 2830 state). If so, the client needs to call EXCHANGE_ID, followed by 2831 CREATE_SESSION. 2833 2.10.8.1.5. Failover 2835 [[Comment.13: Dave Noveck requested this section; not sure what is 2836 needed here if this refers to failover to a replica. What are the 2837 session ramifications?]] 2839 2.10.8.2. Events Requiring Server Action 2841 The following events require server action to recover. 2843 2.10.8.2.1. Client Crash and Reboot 2845 As described in Section 17.35, a rebooted client causes the server to 2846 delete any sessions it had. 2848 2.10.8.2.2. Client Crash with No Reboot 2850 If a client crashes and never comes back, it will never issue 2851 EXCHANGE_ID with its old client owner.
Thus the server has session 2852 state that will never be used again. After an extended period of 2853 time and if the server has resource constraints, it MAY destroy the 2854 old session. 2856 2.10.8.2.3. Extended Network Partition 2858 To the server, the extended network partition may be no different 2859 than a client crash with no reboot (see Section 2.10.8.2.2). Unless 2860 the server can discern that there is a network partition, it is free 2861 to treat the situation as if the client has crashed for good. 2863 2.10.8.2.4. Backchannel Connection Loss 2865 If there were callback requests outstanding at the time of a 2866 connection disconnect, then the server MUST retry the request, as 2867 described in Section 2.10.4.2. Note that it is not necessary to 2868 retry requests over a connection with the same source network address 2869 or the same destination network address as the disconnected 2870 connection. As long as the sessionid, slotid, and sequenceid in the 2871 retry match that of the original request, the callback target will 2872 recognize the request as a retry if it did see the request prior to 2873 disconnect. 2875 If the connection lost is the last one bound to the backchannel, then 2876 the server MUST indicate that in the sr_status_flags field of the 2877 next SEQUENCE reply. 2879 2.10.8.2.5. GSS Context Loss 2881 The server SHOULD monitor when the last RPCSEC_GSS context assigned 2882 to the backchannel is near expiry (i.e., between one and two periods 2883 of lease time), and indicate so in the sr_status_flags field of the 2884 next SEQUENCE reply. The server MUST indicate when the backchannel's 2885 last RPCSEC_GSS context has expired in the sr_status_flags field of 2886 the next SEQUENCE reply. 2888 2.10.9. Parallel NFS and Sessions 2890 A client and server can potentially be a non-pNFS implementation, a 2891 metadata server implementation, a data server implementation, or a 2892 combination of two or three of these types.
The EXCHGID4_FLAG_USE_NON_PNFS, 2893 EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not 2894 mutually exclusive) are passed in the EXCHANGE_ID arguments and 2895 results to allow the client to indicate how it wants to use sessions 2896 created under the client ID, and to allow the server to indicate how 2897 it will allow the sessions to be used. See Section 13.1 for pNFS 2898 sessions considerations. 2900 3. Protocol Data Types 2902 The syntax and semantics to describe the data types of the NFS 2903 version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831 2904 [4] documents. The next sections build upon the XDR data types to 2905 define types and structures specific to this protocol. 2907 3.1. Basic Data Types 2909 These are the base NFSv4 data types. 2911 +---------------+---------------------------------------------------+ 2912 | Data Type | Definition | 2913 +---------------+---------------------------------------------------+ 2914 | int32_t | typedef int int32_t; | 2915 | uint32_t | typedef unsigned int uint32_t; | 2916 | int64_t | typedef hyper int64_t; | 2917 | uint64_t | typedef unsigned hyper uint64_t; | 2918 | attrlist4 | typedef opaque attrlist4<>; | 2919 | | Used for file/directory attributes | 2920 | bitmap4 | typedef uint32_t bitmap4<>; | 2921 | | Used in attribute array encoding. 
| 2922 | changeid4 | typedef uint64_t changeid4; | 2923 | | Used in definition of change_info | 2924 | clientid4 | typedef uint64_t clientid4; | 2925 | | Shorthand reference to client identification | 2926 | component4 | typedef utf8str_cs component4; | 2927 | | Represents path name components | 2928 | count4 | typedef uint32_t count4; | 2929 | | Various count parameters (READ, WRITE, COMMIT) | 2930 | length4 | typedef uint64_t length4; | 2931 | | Describes LOCK lengths | 2932 | linktext4 | typedef utf8str_cs linktext4; | 2933 | | Symbolic link contents | 2934 | mode4 | typedef uint32_t mode4; | 2935 | | Mode attribute data type | 2936 | nfs_cookie4 | typedef uint64_t nfs_cookie4; | 2937 | | Opaque cookie value for READDIR | 2938 | nfs_fh4 | typedef opaque nfs_fh4<NFS4_FHSIZE>; | 2939 | | Filehandle definition; NFS4_FHSIZE is defined as | 2940 | | 128 | 2941 | nfs_ftype4 | enum nfs_ftype4; | 2942 | | Various defined file types | 2943 | nfsstat4 | enum nfsstat4; | 2944 | | Return value for operations | 2945 | offset4 | typedef uint64_t offset4; | 2946 | | Various offset designations (READ, WRITE, LOCK, | 2947 | | COMMIT) | 2948 | pathname4 | typedef component4 pathname4<>; | 2949 | | Represents path name for fs_locations | 2950 | qop4 | typedef uint32_t qop4; | 2951 | | Quality of protection designation in SECINFO | 2952 | sec_oid4 | typedef opaque sec_oid4<>; | 2953 | | Security Object Identifier. The sec_oid4 data type | 2954 | | is not really opaque. Instead, it contains an ASN.1 | 2955 | | OBJECT IDENTIFIER as used by GSS-API in the | 2956 | | mech_type argument to GSS_Init_sec_context. See | 2957 | | RFC2743 [8] for details. | 2958 | sequenceid4 | typedef uint32_t sequenceid4; | 2959 | | Sequence number used for various session | 2960 | | operations (EXCHANGE_ID, CREATE_SESSION, | 2961 | | SEQUENCE, CB_SEQUENCE).
| 2962 | seqid4 | typedef uint32_t seqid4; | 2963 | | Sequence identifier used for file locking | 2964 | sessionid4 | typedef opaque sessionid4[16]; | 2965 | | Session identifier | 2966 | slotid4 | typedef uint32_t slotid4; | 2967 | | Sequencing artifact for various session | 2968 | | operations (SEQUENCE, CB_SEQUENCE). | 2969 | utf8string | typedef opaque utf8string<>; | 2970 | | UTF-8 encoding for strings | 2971 | utf8str_cis | typedef opaque utf8str_cis; | 2972 | | Case-insensitive UTF-8 string | 2973 | utf8str_cs | typedef opaque utf8str_cs; | 2974 | | Case-sensitive UTF-8 string | 2975 | utf8str_mixed | typedef opaque utf8str_mixed; | 2976 | | UTF-8 strings with a case-sensitive prefix and a | 2977 | | case-insensitive suffix. | 2978 | verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; | 2979 | | Verifier used for various operations (COMMIT, | 2980 | | CREATE, EXCHANGE_ID, OPEN, READDIR, WRITE) | 2981 | | NFS4_VERIFIER_SIZE is defined as 8. | 2982 +---------------+---------------------------------------------------+ 2984 End of Base Data Types 2986 Table 1 2988 3.2. Structured Data Types 2990 3.2.1. nfstime4 2992 struct nfstime4 { 2993 int64_t seconds; 2994 uint32_t nseconds; 2995 }; 2997 The nfstime4 structure gives the number of seconds and nanoseconds 2998 since midnight or 0 hour January 1, 1970 Coordinated Universal Time 2999 (UTC). Values greater than zero for the seconds field denote dates 3000 after the 0 hour January 1, 1970. Values less than zero for the 3001 seconds field denote dates before the 0 hour January 1, 1970. In 3002 both cases, the nseconds field is to be added to the seconds field 3003 for the final time representation. For example, if the time to be 3004 represented is one-half second before 0 hour January 1, 1970, the 3005 seconds field would have a value of negative one (-1) and the 3006 nseconds field would have a value of one-half second (500000000). 3007 Values greater than 999,999,999 for nseconds are considered invalid.
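As a non-normative sketch of the encoding just described (the helper names are illustrative, not part of the protocol), a time value can be split into the nfstime4 seconds and nseconds fields as follows:

```python
import math
from fractions import Fraction

NSEC_PER_SEC = 1_000_000_000

def to_nfstime4(t):
    # Floor to whole nanoseconds, then split.  Because nseconds is
    # always non-negative and is added to seconds, a time one-half
    # second before the epoch becomes seconds = -1, nseconds =
    # 500000000, matching the example above.
    total_ns = math.floor(t * NSEC_PER_SEC)
    return divmod(total_ns, NSEC_PER_SEC)

def from_nfstime4(seconds, nseconds):
    # Reject the values the specification declares invalid.
    if not 0 <= nseconds <= 999_999_999:
        raise ValueError("nseconds greater than 999,999,999 is invalid")
    return Fraction(seconds) + Fraction(nseconds, NSEC_PER_SEC)
```

For instance, `to_nfstime4(Fraction(-1, 2))` yields `(-1, 500000000)`, and `from_nfstime4(-1, 500000000)` recovers -0.5 seconds.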
3009 This data type is used to pass time and date information. A server 3010 converts to and from its local representation of time when processing 3011 time values, preserving as much accuracy as possible. If the 3012 precision of timestamps stored for a file system object is less than 3013 that defined, loss of precision can occur. An adjunct time maintenance 3014 protocol is recommended to reduce client and server time skew. 3016 3.2.2. time_how4 3018 enum time_how4 { 3019 SET_TO_SERVER_TIME4 = 0, 3020 SET_TO_CLIENT_TIME4 = 1 3021 }; 3023 3.2.3. settime4 3025 union settime4 switch (time_how4 set_it) { 3026 case SET_TO_CLIENT_TIME4: 3027 nfstime4 time; 3028 default: 3029 void; 3030 }; 3032 The above definitions are used as the attribute definitions to set 3033 time values. If set_it is SET_TO_SERVER_TIME4, then the server uses 3034 its local representation of time for the time value. 3036 3.2.4. specdata4 3038 struct specdata4 { 3039 uint32_t specdata1; /* major device number */ 3040 uint32_t specdata2; /* minor device number */ 3041 }; 3043 This data type represents additional information for the device file 3044 types NF4CHR and NF4BLK. 3046 3.2.5. fsid4 3048 struct fsid4 { 3049 uint64_t major; 3050 uint64_t minor; 3051 }; 3053 3.2.6. fs_location4 3055 struct fs_location4 { 3056 utf8str_cis server<>; 3057 pathname4 rootpath; 3058 }; 3060 3.2.7. fs_locations4 3062 struct fs_locations4 { 3063 pathname4 fs_root; 3064 fs_location4 locations<>; 3065 }; 3067 The fs_location4 and fs_locations4 data types are used for the 3068 fs_locations recommended attribute, which is used for migration and 3069 replication support. 3071 3.2.8. fattr4 3073 struct fattr4 { 3074 bitmap4 attrmask; 3075 attrlist4 attr_vals; 3076 }; 3078 The fattr4 structure is used to represent file and directory 3079 attributes. 3081 The bitmap is a counted array of 32 bit integers used to contain bit values.
The position of the integer in the array that contains bit n 3083 can be computed from the expression (n / 32) and its bit within that 3084 integer is (n mod 32). 3086 0 1 3087 +-----------+-----------+-----------+-- 3088 | count | 31 .. 0 | 63 .. 32 | 3089 +-----------+-----------+-----------+-- 3091 3.2.9. change_info4 3093 struct change_info4 { 3094 bool atomic; 3095 changeid4 before; 3096 changeid4 after; 3097 }; 3099 This structure is used with the CREATE, LINK, REMOVE, and RENAME 3100 operations to let the client know the value of the change attribute 3101 for the directory in which the target file system object resides. 3103 3.2.10. netaddr4 3105 struct netaddr4 { 3106 /* see struct rpcb in RFC1833 */ 3107 string r_netid<>; /* network id */ 3108 string r_addr<>; /* universal address */ 3109 }; 3111 The netaddr4 structure is used to identify TCP/IP based endpoints. 3112 The r_netid and r_addr fields are specified in RFC1833 [22], but they 3113 are underspecified in RFC1833 [22] as far as what they should look 3114 like for specific protocols. 3116 For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the 3117 US-ASCII string: 3119 h1.h2.h3.h4.p1.p2 3121 The prefix, "h1.h2.h3.h4", is the standard textual form for 3122 representing an IPv4 address, which is always four octets long. 3123 Assuming big-endian ordering, h1, h2, h3, and h4, are respectively, 3124 the first through fourth octets each converted to ASCII-decimal. 3125 Assuming big-endian ordering, p1 and p2 are, respectively, the first 3126 and second octets each converted to ASCII-decimal. For example, if a 3127 host, in big-endian order, has an address of 0x0A010307 and there is 3128 a service listening on, in big-endian order, port 0x020F (decimal 3129 527), then the complete universal address is "10.1.3.7.2.15". 3131 For TCP over IPv4 the value of r_netid is the string "tcp". For UDP 3132 over IPv4 the value of r_netid is the string "udp".
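The r_addr encoding just described can be sketched as follows (illustrative helper functions, not part of the protocol):

```python
def ipv4_universal_address(dotted_host, port):
    # The universal address is the dotted-quad host followed by the
    # high and low octets of the 16-bit port, each in ASCII-decimal.
    return "%s.%d.%d" % (dotted_host, (port >> 8) & 0xFF, port & 0xFF)

def parse_ipv4_universal_address(r_addr):
    # Split "h1.h2.h3.h4.p1.p2" back into a host string and a port.
    parts = r_addr.split(".")
    host = ".".join(parts[:4])
    port = (int(parts[4]) << 8) | int(parts[5])
    return host, port
```

With host 10.1.3.7 and port 527 (0x020F) this yields "10.1.3.7.2.15", matching the example above.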
That this 3133 document specifies the universal address and netid for UDP/IPv4 does 3134 not imply that UDP/IPv4 is a legal transport for NFSv4.1 (see 3135 Section 2.9). 3137 For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the 3138 US-ASCII string: 3140 x1:x2:x3:x4:x5:x6:x7:x8.p1.p2 3142 The suffix "p1.p2" is the service port, and is computed the same way 3143 as with universal addresses for TCP and UDP over IPv4. The prefix, 3144 "x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for 3145 representing an IPv6 address as defined in Section 2.2 of RFC1884 3146 [9]. Additionally, the two alternative forms specified in Section 3147 2.2 of RFC1884 [9] are also acceptable. 3149 For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP 3150 over IPv6 the value of r_netid is the string "udp6". That this 3151 document specifies the universal address and netid for UDP/IPv6 does 3152 not imply that UDP/IPv6 is a legal transport for NFSv4.1 (see 3153 Section 2.9). 3155 3.2.11. open_owner4 3157 struct open_owner4 { 3158 clientid4 clientid; 3159 opaque owner<NFS4_OPAQUE_LIMIT>; 3160 }; 3162 This structure is used to identify the owner of open state. 3163 NFS4_OPAQUE_LIMIT is defined as 1024. 3165 3.2.12. lock_owner4 3167 struct lock_owner4 { 3168 clientid4 clientid; 3169 opaque owner<NFS4_OPAQUE_LIMIT>; 3170 }; 3172 This structure is used to identify the owner of file locking state. 3174 3.2.13. open_to_lock_owner4 3176 struct open_to_lock_owner4 { 3177 seqid4 open_seqid; 3178 stateid4 open_stateid; 3179 seqid4 lock_seqid; 3180 lock_owner4 lock_owner; 3181 }; 3183 This structure is used for the first LOCK operation done for an 3184 open_owner4. It provides both the open_stateid and lock_owner such 3185 that the transition is made from a valid open_stateid sequence to 3186 that of the new lock_stateid sequence. Using this mechanism avoids 3187 the confirmation of the lock_owner/lock_seqid pair since it is tied 3188 to established state in the form of the open_stateid/open_seqid.
3190 3.2.14. stateid4 3192 struct stateid4 { 3193 uint32_t seqid; 3194 opaque other[12]; 3195 }; 3197 This structure is used for the various state sharing mechanisms 3198 between the client and server. For the client, this data structure 3199 is read-only. The starting value of the seqid field is undefined. 3200 The server is required to increment the seqid field monotonically at 3201 each transition of the stateid. This is important since the client 3202 will inspect the seqid in OPEN stateids to determine the order of 3203 OPEN processing done by the server. 3205 3.2.15. layouttype4 3207 enum layouttype4 { 3208 LAYOUT4_NFSV4_1_FILES = 1, 3209 LAYOUT4_OSD2_OBJECTS = 2, 3210 LAYOUT4_BLOCK_VOLUME = 3 3211 }; 3213 A layout type specifies the layout being used. The implication is 3214 that clients have "layout drivers" that support one or more layout 3215 types. The file server advertises the layout types it supports 3216 through the fs_layout_type file system attribute (Section 5.13.1). A 3217 client asks for layouts of a particular type in LAYOUTGET, and passes 3218 those layouts to its layout driver. 3220 The layouttype4 structure is 32 bits in length. The range 3221 represented by the layout type is split into three parts. Type 0x0 3222 is reserved. Types within the range 0x00000001-0x7FFFFFFF are 3223 globally unique and are assigned according to the description in 3224 Section 21.1; they are maintained by IANA. Types within the range 3225 0x80000000-0xFFFFFFFF are site specific and for "private use" only. 3227 The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file 3228 layout type is to be used. The LAYOUT4_OSD2_OBJECTS enumeration 3229 specifies that the object layout, as defined in [23], is to be used. 3230 Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the 3231 block/volume layout, as defined in [24], is to be used. 3233 3.2.16.
deviceid4 3235 typedef uint32_t deviceid4; /* 32-bit device ID */ 3237 Layout information includes device IDs that specify a storage device 3238 through a compact handle. Addressing and type information is 3239 obtained with the GETDEVICEINFO operation. A client must not assume 3240 that device IDs are valid across metadata server reboots. The device 3241 ID is qualified by the layout type and is unique per file system 3242 (FSID). This allows different layout drivers to generate device IDs 3243 without the need for coordination. See Section 12.2.12 for more 3244 details. 3246 3.2.17. device_addr4 3248 struct device_addr4 { 3249 layouttype4 da_layout_type; 3250 opaque da_addr_body<>; 3251 }; 3253 The device address is used to set up a communication channel with the 3254 storage device. Different layout types will require different types 3255 of structures to define how they communicate with storage devices. 3256 The opaque da_addr_body field must be interpreted based on the 3257 specified da_layout_type field. 3259 This document defines the device address for the NFSv4.1 file layout 3260 ([[Comment.14: need xref]]), which identifies a storage device by 3261 network IP address and port number. This is sufficient for the 3262 clients to communicate with the NFSv4.1 storage devices, and may be 3263 sufficient for other layout types as well. Device types for object 3264 storage devices and block storage devices (e.g., SCSI volume labels) 3265 will be defined by their respective layout specifications. 3267 3.2.18. devlist_item4 3269 struct devlist_item4 { 3270 deviceid4 dli_id; 3271 device_addr4 dli_device_addr<>; 3272 }; 3274 An array of these values is returned by the GETDEVICELIST operation. 3275 They define the set of devices associated with a file system for the 3276 layout type specified in the GETDEVICELIST4args. 3278 3.2.19.
layout_content4 3280 struct layout_content4 { 3281 layouttype4 loc_type; 3282 opaque loc_body<>; 3283 }; 3285 The loc_body field must be interpreted based on the layout type 3286 (loc_type). This document defines the loc_body for the NFSv4.1 file 3287 layout type; see Section 13.3 for its definition. 3289 3.2.20. layout4 3291 struct layout4 { 3292 offset4 lo_offset; 3293 length4 lo_length; 3294 layoutiomode4 lo_iomode; 3295 layout_content4 lo_content; 3296 }; 3298 The layout4 structure defines a layout for a file. The layout type 3299 specific data is opaque within lo_content. Since layouts are sub- 3300 dividable, the offset and length together with the file's filehandle, 3301 the client ID, iomode, and layout type, identify the layout. 3303 3.2.21. layoutupdate4 3305 struct layoutupdate4 { 3306 layouttype4 lou_type; 3307 opaque lou_body<>; 3308 }; 3310 The layoutupdate4 structure is used by the client to return 'updated' 3311 layout information to the metadata server at LAYOUTCOMMIT time. This 3312 structure provides a channel to pass layout type specific information 3313 (in field lou_body) back to the metadata server. E.g., for block/ 3314 volume layout types this could include the list of reserved blocks 3315 that were written. The contents of the opaque lou_body argument are 3316 determined by the layout type and are defined in their context. The 3317 NFSv4.1 file-based layout does not use this structure, thus the 3318 lou_body field should have a zero length. 3320 3.2.22. layouthint4 3322 struct layouthint4 { 3323 layouttype4 loh_type; 3324 opaque loh_body<>; 3325 }; 3327 The layouthint4 structure is used by the client to pass in a hint 3328 about the type of layout it would like created for a particular file. 3329 It is the structure specified by the layout_hint attribute described 3330 in Section 5.13.4. The metadata server may ignore the hint, or may 3331 selectively ignore fields within the hint.
This hint should be 3332 provided at create time as part of the initial attributes within 3333 OPEN. The loh_body field is specific to the type of layout 3334 (loh_type). The NFSv4.1 file-based layout uses the 3335 nfsv4_1_file_layouthint4 structure as defined in Section 13.3. 3337 3.2.23. layoutiomode4 3339 enum layoutiomode4 { 3340 LAYOUTIOMODE4_READ = 1, 3341 LAYOUTIOMODE4_RW = 2, 3342 LAYOUTIOMODE4_ANY = 3 3343 }; 3345 The iomode specifies whether the client intends to read or write 3346 (with the possibility of reading) the data represented by the layout. 3347 The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be 3348 used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies 3349 that layouts pertaining to both READ and RW iomodes are being 3350 returned or recalled, respectively. The metadata server's use of the 3351 iomode may depend on the layout type being used. The storage devices 3352 may validate I/O accesses against the iomode and reject invalid 3353 accesses. 3355 3.2.24. nfs_impl_id4 3357 struct nfs_impl_id4 { 3358 utf8str_cis nii_domain; 3359 utf8str_cs nii_name; 3360 nfstime4 nii_date; 3361 }; 3363 This structure is used to identify client and server implementation 3364 details. The nii_domain field is the DNS domain name that the 3365 implementer is associated with. The nii_name field is the product 3366 name of the implementation and is completely free form. It is 3367 recommended that the nii_name be used to distinguish machine 3368 architecture, machine platforms, revisions, versions, and patch 3369 levels. The nii_date field is the timestamp of when the software 3370 instance was published or built. 3372 3.2.25.
threshold_item4 3374 struct threshold_item4 { 3375 layouttype4 thi_layout_type; 3376 bitmap4 thi_hintset; 3377 opaque thi_hintlist<>; 3378 }; 3380 This structure contains a list of hints specific to a layout type for 3381 helping the client determine when it should issue I/O directly 3382 through the metadata server vs. the data servers. The hint structure 3383 consists of the layout type (thi_layout_type), a bitmap (thi_hintset) 3384 describing the set of hints supported by the server (they may differ 3385 based on the layout type), and a list of hints (thi_hintlist), whose 3386 structure is determined by the hintset bitmap. See the mdsthreshold 3387 attribute for more details. 3389 The thi_hintset field is a bitmap of the following values: 3391 +-------------------------+---+---------+---------------------------+ 3392 | name | # | Data | Description | 3393 | | | Type | | 3394 +-------------------------+---+---------+---------------------------+ 3395 | threshold4_read_size | 0 | length4 | The file size below which | 3396 | | | | it is recommended to read | 3397 | | | | data through the MDS. | 3398 | threshold4_write_size | 1 | length4 | The file size below which | 3399 | | | | it is recommended to | 3400 | | | | write data through the | 3401 | | | | MDS. | 3402 | threshold4_read_iosize | 2 | length4 | For read I/O sizes below | 3403 | | | | this threshold it is | 3404 | | | | recommended to read data | 3405 | | | | through the MDS | 3406 | threshold4_write_iosize | 3 | length4 | For write I/O sizes below | 3407 | | | | this threshold it is | 3408 | | | | recommended to write data | 3409 | | | | through the MDS | 3410 +-------------------------+---+---------+---------------------------+ 3412 3.2.26. mdsthreshold4 3414 struct mdsthreshold4 { 3415 threshold_item4 mth_hints<>; 3416 }; 3418 This structure holds an array of threshold_item4 structures each of 3419 which is valid for a particular layout type. 
An array is necessary 3420 since a server can support multiple layout types for a single file. 3422 4. Filehandles 3424 The filehandle in the NFS protocol is a per server unique identifier 3425 for a file system object. The contents of the filehandle are opaque 3426 to the client. Therefore, the server is responsible for translating 3427 the filehandle to an internal representation of the file system 3428 object. 3430 4.1. Obtaining the First Filehandle 3432 The operations of the NFS protocol are defined in terms of one or 3433 more filehandles. Therefore, the client needs a filehandle to 3434 initiate communication with the server. With the NFS version 2 3435 protocol RFC1094 [17] and the NFS version 3 protocol RFC1813 [18], 3436 there exists an ancillary protocol to obtain this first filehandle. 3437 The MOUNT protocol, RPC program number 100005, provides the mechanism 3438 of translating a string based file system path name to a filehandle 3439 which can then be used by the NFS protocols. 3441 The MOUNT protocol has deficiencies in the area of security and use 3442 via firewalls. This is one reason that the use of the public 3443 filehandle was introduced in RFC2054 [25] and RFC2055 [26]. With the 3444 use of the public filehandle in combination with the LOOKUP operation 3445 in the NFS version 2 and 3 protocols, it has been demonstrated that 3446 the MOUNT protocol is unnecessary for viable interaction between NFS 3447 client and server. 3449 Therefore, the NFS version 4 protocol will not use an ancillary 3450 protocol for translation from string based path names to a 3451 filehandle. Two special filehandles will be used as starting points 3452 for the NFS client. 3454 4.1.1. Root Filehandle 3456 The first of the special filehandles is the ROOT filehandle. The 3457 ROOT filehandle is the "conceptual" root of the file system name 3458 space at the NFS server. The client uses or starts with the ROOT 3459 filehandle by employing the PUTROOTFH operation. 
The PUTROOTFH 3460 operation instructs the server to set the "current" filehandle to the 3461 ROOT of the server's file tree. Once this PUTROOTFH operation is 3462 used, the client can then traverse the entirety of the server's file 3463 tree with the LOOKUP operation. A complete discussion of the server 3464 name space is in the section "NFS Server Name Space". 3466 4.1.2. Public Filehandle 3468 The second special filehandle is the PUBLIC filehandle. Unlike the 3469 ROOT filehandle, the PUBLIC filehandle may be bound to or represent 3470 an arbitrary file system object at the server. The server is 3471 responsible for this binding. It may be that the PUBLIC filehandle 3472 and the ROOT filehandle refer to the same file system object. 3473 However, it is up to the administrative software at the server and 3474 the policies of the server administrator to define the binding of the 3475 PUBLIC filehandle and server file system object. The client may not 3476 make any assumptions about this binding. The client uses the PUBLIC 3477 filehandle via the PUTPUBFH operation. 3479 4.2. Filehandle Types 3481 In the NFS version 2 and 3 protocols, there was one type of 3482 filehandle with a single set of semantics. This type of filehandle 3483 is termed "persistent" in NFS Version 4. The semantics of a 3484 persistent filehandle remain the same as before. A new type of 3485 filehandle introduced in NFS Version 4 is the "volatile" filehandle, 3486 which attempts to accommodate certain server environments. 3488 The volatile filehandle type was introduced to address server 3489 functionality or implementation issues which make correct 3490 implementation of a persistent filehandle infeasible. Some server 3491 environments do not provide a file system level invariant that can be 3492 used to construct a persistent filehandle.
The underlying server 3493 file system may not provide the invariant or the server's file system 3494 programming interfaces may not provide access to the needed 3495 invariant. Volatile filehandles may ease the implementation of 3496 server functionality such as hierarchical storage management or file 3497 system reorganization or migration. However, the volatile filehandle 3498 increases the implementation burden for the client. 3500 Since the client will need to handle persistent and volatile 3501 filehandles differently, a file attribute is defined which may be 3502 used by the client to determine the filehandle types being returned 3503 by the server. 3505 4.2.1. General Properties of a Filehandle 3507 The filehandle contains all the information the server needs to 3508 distinguish an individual file. To the client, the filehandle is 3509 opaque. The client stores filehandles for use in a later request and 3510 can compare two filehandles from the same server for equality by 3511 doing an octet-by-octet comparison. However, the client MUST NOT 3512 otherwise interpret the contents of filehandles. If two filehandles 3513 from the same server are equal, they MUST refer to the same file. 3514 Servers SHOULD try to maintain a one-to-one correspondence between 3515 filehandles and files but this is not required. Clients MUST use 3516 filehandle comparisons only to improve performance, not for correct 3517 behavior. All clients need to be prepared for situations in which it 3518 cannot be determined whether two filehandles denote the same object 3519 and in such cases, avoid making invalid assumptions which might cause 3520 incorrect behavior. Further discussion of filehandle and attribute 3521 comparison in the context of data caching is presented in the section 3522 "Data Caching and File Identity". 
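The comparison rules above can be sketched as follows (a client-side illustration; the function name is hypothetical and not part of the protocol):

```python
def filehandles_same_object(server_a, fh_a, server_b, fh_b):
    """Return True when two filehandles provably denote the same object,
    or None when nothing can be concluded.  Equality is meaningful only
    for handles from the same server, and inequality proves nothing,
    so the result may be used only as a performance hint."""
    if server_a != server_b:
        return None            # cross-server comparison is meaningless
    if fh_a == fh_b:           # bytes compare octet-by-octet in Python
        return True            # equal handles MUST be the same file
    return None                # may or may not be the same object
```

Note that the `None` results capture the requirement that clients avoid invalid assumptions when sameness cannot be determined.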
3524 As an example, in the case that two different path names, when 3525 traversed at the server, terminate at the same file system object, the 3526 server SHOULD return the same filehandle for each path. This can 3527 occur if a hard link is used to create two file names which refer to 3528 the same underlying file object and associated data. For example, if 3529 paths /a/b/c and /a/d/c refer to the same file, the server SHOULD 3530 return the same filehandle for both path name traversals. 3532 4.2.2. Persistent Filehandle 3534 A persistent filehandle is defined as having a fixed value for the 3535 lifetime of the file system object to which it refers. Once the 3536 server creates the filehandle for a file system object, the server 3537 MUST accept the same filehandle for the object for the lifetime of 3538 the object. If the server restarts or reboots, the NFS server must 3539 honor the same filehandle value as it did in the server's previous 3540 instantiation. Similarly, if the file system is migrated, the new 3541 NFS server must honor the same filehandle as the old NFS server. 3543 The persistent filehandle will become stale or invalid when the 3544 file system object is removed. When the server is presented with a 3545 persistent filehandle that refers to a deleted object, it MUST return 3546 an error of NFS4ERR_STALE. A filehandle may become stale when the 3547 file system containing the object is no longer available. The file 3548 system may become unavailable if it exists on removable media and the 3549 media is no longer available at the server or the file system in 3550 whole has been destroyed or the file system has simply been removed 3551 from the server's name space (i.e. unmounted in a UNIX environment). 3553 4.2.3. Volatile Filehandle 3555 A volatile filehandle does not share the same longevity 3556 characteristics of a persistent filehandle.
The server may determine 3557 that a volatile filehandle is no longer valid at many different 3558 points in time. If the server can definitively determine that a 3559 volatile filehandle refers to an object that has been removed, the 3560 server should return NFS4ERR_STALE to the client (as is the case for 3561 persistent filehandles). In all other cases where the server 3562 determines that a volatile filehandle can no longer be used, it 3563 should return an error of NFS4ERR_FHEXPIRED. 3565 The mandatory attribute "fh_expire_type" is used by the client to 3566 determine what type of filehandle the server is providing for a 3567 particular file system. This attribute is a bitmask with the 3568 following values: 3570 FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a 3571 persistent filehandle, which is valid until the object is removed 3572 from the file system. The server will not return 3573 NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined 3574 as a value in which none of the bits specified below are set. 3576 FH4_VOLATILE_ANY The filehandle may expire at any time, except as 3577 specifically excluded (i.e. FH4_NOEXPIRE_WITH_OPEN). 3579 FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set. 3580 If this bit is set, then the meaning of FH4_VOLATILE_ANY is 3581 qualified to exclude any expiration of the filehandle when it is 3582 open. 3584 FH4_VOL_MIGRATION The filehandle will expire as a result of a file 3585 system transition (migration or replication), in those cases in 3586 which the continuity of filehandle use is not specified by 3587 _handle_ class information within the fs_locations_info attribute. 3588 When this bit is set, clients without access to fs_locations_info 3589 information should assume filehandles will expire on file system 3590 transitions. 3592 FH4_VOL_RENAME The filehandle will expire during rename. This 3593 includes a rename by the requesting client or a rename by any 3594 other client.
If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant. 3604 Servers which provide volatile filehandles that may expire while open 3605 require special care as regards the handling of RENAMEs and REMOVEs. 3606 This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is 3607 set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not 3608 set, or if a non-readonly file system has a transition target in a 3609 different _handle_ class. In these cases, the server should deny a 3610 RENAME or REMOVE that would affect an OPEN file of any of the 3611 components leading to the OPEN file. In addition, the server should 3612 deny all RENAME or REMOVE requests during the grace period upon 3613 server restart, in order to make sure that reclaims of files where 3614 filehandles may have expired do not do a reclaim for the wrong file. 3616 4.3. One Method of Constructing a Volatile Filehandle 3618 A volatile filehandle, while opaque to the client, could contain: 3620 [volatile bit = 1 | server boot time | slot | generation number] 3621 o slot is an index in the server volatile filehandle table 3623 o generation number is the generation number for the table entry/ 3624 slot 3626 When the client presents a volatile filehandle, the server makes the 3627 following checks, which assume that the check for the volatile bit 3628 has passed. If the boot time in the filehandle is less than the 3629 current server boot time, return NFS4ERR_FHEXPIRED. If slot is out 3630 of range, return NFS4ERR_BADHANDLE. If the generation number does 3631 not match, return NFS4ERR_FHEXPIRED.
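The checks above can be sketched as follows (a non-normative illustration; the error code values are those assigned by the protocol, but the decoded-handle representation and table layout are assumptions of this sketch):

```python
NFS4_OK           = 0
NFS4ERR_BADHANDLE = 10001   # illegal NFS filehandle
NFS4ERR_FHEXPIRED = 10014   # volatile filehandle has expired

def check_volatile_fh(fh, current_boot_time, table):
    """fh: dict with 'boot_time', 'slot', and 'generation' decoded from
    the opaque handle; table: list mapping slot -> current generation."""
    if fh["boot_time"] < current_boot_time:
        return NFS4ERR_FHEXPIRED    # handle predates this server instance
    if not 0 <= fh["slot"] < len(table):
        return NFS4ERR_BADHANDLE    # slot out of range
    if table[fh["slot"]] != fh["generation"]:
        return NFS4ERR_FHEXPIRED    # slot reused; generation mismatch
    return NFS4_OK
```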
3633 When the server reboots, the table is gone (it is volatile). 3635 If volatile bit is 0, then it is a persistent filehandle with a 3636 different structure following it. 3638 4.4. Client Recovery from Filehandle Expiration 3640 If possible, the client SHOULD recover from the receipt of an 3641 NFS4ERR_FHEXPIRED error. The client must take on additional 3642 responsibility so that it may prepare itself to recover from the 3643 expiration of a volatile filehandle. If the server returns 3644 persistent filehandles, the client does not need these additional 3645 steps. 3647 For volatile filehandles, most commonly the client will need to store 3648 the component names leading up to and including the file system 3649 object in question. With these names, the client should be able to 3650 recover by finding a filehandle in the name space that is still 3651 available or by starting at the root of the server's file system name 3652 space. 3654 If the expired filehandle refers to an object that has been removed 3655 from the file system, obviously the client will not be able to 3656 recover from the expired filehandle. 3658 It is also possible that the expired filehandle refers to a file that 3659 has been renamed. If the file was renamed by another client, again 3660 it is possible that the original client will not be able to recover. 3661 However, in the case that the client itself is renaming the file and 3662 the file is open, it is possible that the client may be able to 3663 recover. The client can determine the new path name based on the 3664 processing of the rename request. The client can then regenerate the 3665 new filehandle based on the new path name. The client could also use 3666 the compound operation mechanism to construct a set of operations 3667 like: 3669 RENAME A B 3670 LOOKUP B 3671 GETFH 3673 Note that the COMPOUND procedure does not provide atomicity. This 3674 example only reduces the overhead of recovering from an expired 3675 filehandle. 
3677 5. File Attributes 3679 To meet the requirements of extensibility and increased 3680 interoperability with non-UNIX platforms, attributes must be handled 3681 in a flexible manner. The NFS version 3 fattr3 structure contains a 3682 fixed list of attributes that not all clients and servers are able to 3683 support or care about. The fattr3 structure cannot be extended as 3684 new needs arise and it provides no way to indicate non-support. With 3685 the NFS version 4 protocol, the client is able to query what attributes 3686 the server supports and construct requests with only those supported 3687 attributes (or a subset thereof). 3689 To this end, attributes are divided into three groups: mandatory, 3690 recommended, and named. Both mandatory and recommended attributes 3691 are supported in the NFS version 4 protocol by a specific and well- 3692 defined encoding and are identified by number. They are requested by 3693 setting a bit in the bit vector sent in the GETATTR request; the 3694 server response includes a bit vector to list what attributes were 3695 returned in the response. New mandatory or recommended attributes 3696 may be added to the NFS protocol between major revisions by 3697 publishing a standards-track RFC which allocates a new attribute 3698 number value and defines the encoding for the attribute. See the 3699 section "Minor Versioning" for further discussion. 3701 Named attributes are accessed by the new OPENATTR operation, which 3702 accesses a hidden directory of attributes associated with a file 3703 system object. OPENATTR takes a filehandle for the object and 3704 returns the filehandle for the attribute hierarchy. The filehandle 3705 for the named attributes is a directory object accessible by LOOKUP 3706 or READDIR and contains files whose names represent the named 3707 attributes and whose data bytes are the value of the attribute.
For 3708 example: 3710 +----------+-----------+---------------------------------+ 3711 | LOOKUP | "foo" | ; look up file | 3712 | GETATTR | attrbits | | 3713 | OPENATTR | | ; access foo's named attributes | 3714 | LOOKUP | "x11icon" | ; look up specific attribute | 3715 | READ | 0,4096 | ; read stream of bytes | 3716 +----------+-----------+---------------------------------+ 3718 Named attributes are intended for data needed by applications rather 3719 than by an NFS client implementation. NFS implementors are strongly 3720 encouraged to define their new attributes as recommended attributes 3721 by bringing them to the IETF standards-track process. 3723 The set of attributes which are classified as mandatory is 3724 deliberately small since servers must do whatever it takes to support 3725 them. A server should support as many of the recommended attributes 3726 as possible but by their definition, the server is not required to 3727 support all of them. Attributes are deemed mandatory if the data is 3728 both needed by a large number of clients and is not otherwise 3729 reasonably computable by the client when support is not provided on 3730 the server. 3732 Note that the hidden directory returned by OPENATTR is a convenience 3733 for protocol processing. The client should not make any assumptions 3734 about the server's implementation of named attributes and whether the 3735 underlying file system at the server has a named attribute directory 3736 or not. Therefore, operations such as SETATTR and GETATTR on the 3737 named attribute directory are undefined. 3739 5.1. Mandatory Attributes 3741 These MUST be supported by every NFS version 4 client and server in 3742 order to ensure a minimum level of interoperability. The server must 3743 store and return these attributes and the client must be able to 3744 function with an attribute set limited to these attributes. 
With 3745 just the mandatory attributes some client functionality may be 3746 impaired or limited in some ways. A client may ask for any of these 3747 attributes to be returned by setting a bit in the GETATTR request and 3748 the server must return their value. 3750 5.2. Recommended Attributes 3752 These attributes are understood well enough to warrant support in the 3753 NFS version 4 protocol. However, they may not be supported on all 3754 clients and servers. A client may ask for any of these attributes to 3755 be returned by setting a bit in the GETATTR request but must handle 3756 the case where the server does not return them. A client may ask for 3757 the set of attributes the server supports and should not request 3758 attributes the server does not support. A server should be tolerant 3759 of requests for unsupported attributes and simply not return them 3760 rather than considering the request an error. It is expected that 3761 servers will support all attributes they comfortably can and only 3762 fail to support attributes which are difficult to support in their 3763 operating environments. A server should provide attributes whenever 3764 it does not have to "tell lies" to the client. For example, a file 3765 modification time should be either an accurate time or should not be 3766 supported by the server. This will not always be comfortable to 3767 clients, but the client is better positioned to decide whether and how to 3768 fabricate or construct an attribute or whether to do without the 3769 attribute. 3771 5.3. Named Attributes 3773 These attributes are not supported by direct encoding in the NFS 3774 Version 4 protocol but are accessed by string names rather than 3775 numbers and correspond to an uninterpreted stream of bytes which are 3776 stored with the file system object. The name space for these 3777 attributes may be accessed by using the OPENATTR operation.
The 3778 OPENATTR operation returns a filehandle for a virtual "attribute 3779 directory" and further perusal of the name space may be done using 3780 READDIR and LOOKUP operations on this filehandle. Named attributes 3781 may then be examined or changed by normal READ and WRITE and CREATE 3782 operations on the filehandles returned from READDIR and LOOKUP. 3783 Named attributes may have attributes. 3785 It is recommended that servers support arbitrary named attributes. A 3786 client should not depend on the ability to store any named attributes 3787 in the server's file system. If a server does support named 3788 attributes, a client which is also able to handle them should be able 3789 to copy a file's data and meta-data with complete transparency from 3790 one location to another; this would imply that names allowed for 3791 regular directory entries are valid for named attribute names as 3792 well. 3794 Names of attributes will not be controlled by this document or other 3795 IETF standards track documents. See the section "IANA 3796 Considerations" for further discussion. 3798 5.4. Classification of Attributes 3800 Each of the Mandatory and Recommended attributes can be classified in 3801 one of three categories: per server, per file system, or per file 3802 system object. Note that it is possible that some per file system 3803 attributes may vary within the file system. See the "homogeneous" 3804 attribute for its definition. Note that the attributes 3805 time_access_set and time_modify_set are not listed in this section 3806 because they are write-only attributes corresponding to time_access 3807 and time_modify, and are used in a special instance of SETATTR. 
3809 o The per server attribute is: 3811 lease_time 3813 o The per file system attributes are: 3815 supp_attr, fh_expire_type, link_support, symlink_support, 3816 unique_handles, aclsupport, cansettime, case_insensitive, 3817 case_preserving, chown_restricted, files_avail, files_free, 3818 files_total, fs_locations, homogeneous, maxfilesize, maxname, 3819 maxread, maxwrite, no_trunc, space_avail, space_free, 3820 space_total, time_delta, fs_status, fs_layout_type, 3821 fs_locations_info 3823 o The per file system object attributes are: 3825 type, change, size, named_attr, fsid, rdattr_error, filehandle, 3826 ACL, archive, fileid, hidden, maxlink, mimetype, mode, 3827 numlinks, owner, owner_group, rawdev, space_used, system, 3828 time_access, time_backup, time_create, time_metadata, 3829 time_modify, mounted_on_fileid, dir_notif_delay, 3830 dirent_notif_delay, dacl, sacl, layout_type, layout_hint, 3831 layout_blksize, layout_alignment, mdsthreshold, retention_get, 3832 retention_set, retentevt_get, retentevt_set, retention_hold, 3833 mode_set_masked 3835 For quota_avail_hard, quota_avail_soft, and quota_used see their 3836 definitions below for the appropriate classification. 3838 5.5. Mandatory Attributes - Definitions 3840 +-----------------+----+------------+--------+----------------------+ 3841 | name | # | Data Type | Access | Description | 3842 +-----------------+----+------------+--------+----------------------+ 3843 | supp_attr | 0 | bitmap | READ | The bit vector which | 3844 | | | | | would retrieve all | 3845 | | | | | mandatory and | 3846 | | | | | recommended | 3847 | | | | | attributes that are | 3848 | | | | | supported for this | 3849 | | | | | object. The scope | 3850 | | | | | of this attribute | 3851 | | | | | applies to all | 3852 | | | | | objects with a | 3853 | | | | | matching fsid. | 3854 | type | 1 | nfs4_ftype | READ | The type of the | 3855 | | | | | object (file, | 3856 | | | | | directory, symlink, | 3857 | | | | | etc.) 
| 3858 | fh_expire_type | 2 | uint32 | READ | Server uses this to | 3859 | | | | | specify filehandle | 3860 | | | | | expiration behavior | 3861 | | | | | to the client. See | 3862 | | | | | the section | 3863 | | | | | "Filehandles" for | 3864 | | | | | additional | 3865 | | | | | description. | 3866 | change | 3 | uint64 | READ | A value created by | 3867 | | | | | the server that the | 3868 | | | | | client can use to | 3869 | | | | | determine if file | 3870 | | | | | data, directory | 3871 | | | | | contents or | 3872 | | | | | attributes of the | 3873 | | | | | object have been | 3874 | | | | | modified. The | 3875 | | | | | server may return | 3876 | | | | | the object's | 3877 | | | | | time_metadata | 3878 | | | | | attribute for this | 3879 | | | | | attribute's value | 3880 | | | | | but only if the file | 3881 | | | | | system object can | 3882 | | | | | not be updated more | 3883 | | | | | frequently than the | 3884 | | | | | resolution of | 3885 | | | | | time_metadata. | 3886 | size | 4 | uint64 | R/W | The size of the | 3887 | | | | | object in bytes. | 3888 | link_support | 5 | bool | READ | True, if the | 3889 | | | | | object's file system | 3890 | | | | | supports hard links. | 3891 | symlink_support | 6 | bool | READ | True, if the | 3892 | | | | | object's file system | 3893 | | | | | supports symbolic | 3894 | | | | | links. | 3895 | named_attr | 7 | bool | READ | True, if this object | 3896 | | | | | has named | 3897 | | | | | attributes. In | 3898 | | | | | other words, object | 3899 | | | | | has a non-empty | 3900 | | | | | named attribute | 3901 | | | | | directory. | 3902 | fsid | 8 | fsid4 | READ | Unique file system | 3903 | | | | | identifier for the | 3904 | | | | | file system holding | 3905 | | | | | this object. fsid | 3906 | | | | | contains major and | 3907 | | | | | minor components | 3908 | | | | | each of which are | 3909 | | | | | uint64. 
| 3910 | unique_handles | 9 | bool | READ | True, if two | 3911 | | | | distinct filehandles | 3912 | | | | are guaranteed to refer | 3913 | | | | to two different | 3914 | | | | file system objects. | 3915 | lease_time | 10 | nfs_lease4 | READ | Duration of leases | 3916 | | | | at server in | 3917 | | | | seconds. | 3918 | rdattr_error | 11 | enum | READ | Error returned from | 3919 | | | | getattr during | 3920 | | | | readdir. | 3921 | filehandle | 19 | nfs_fh4 | READ | The filehandle of | 3922 | | | | this object | 3923 | | | | (primarily for | 3924 | | | | readdir requests). | 3925 +-----------------+----+------------+--------+----------------------+ 3927 5.6. Recommended Attributes - Definitions 3928 +-------------------+----+----------------+--------+----------------+ 3929 | name | # | Data Type | Access | Description | 3930 +-------------------+----+----------------+--------+----------------+ 3931 | ACL | 12 | nfsace4<> | R/W | The access | 3932 | | | | | control list | 3933 | | | | | for the | 3934 | | | | | object. | 3935 | aclsupport | 13 | uint32 | READ | Indicates what | 3936 | | | | | types of ACLs | 3937 | | | | | are supported | 3938 | | | | | on the current | 3939 | | | | | file system. | 3940 | archive | 14 | bool | R/W | True, if this | 3941 | | | | | file has been | 3942 | | | | | archived since | 3943 | | | | | the time of | 3944 | | | | | last | 3945 | | | | | modification | 3946 | | | | | (deprecated in | 3947 | | | | | favor of | 3948 | | | | | time_backup). | 3949 | cansettime | 15 | bool | READ | True, if the | 3950 | | | | | server is able to | 3951 | | | | | change the | 3952 | | | | | times for a | 3953 | | | | | file system | 3954 | | | | | object as | 3955 | | | | | specified in a | 3956 | | | | | SETATTR | 3957 | | | | | operation.
| 3958 | case_insensitive | 16 | bool | READ | True, if | 3959 | | | | filename | 3960 | | | | comparisons on | 3961 | | | | this file | 3962 | | | | system are | 3963 | | | | case | 3964 | | | | insensitive. | 3965 | case_preserving | 17 | bool | READ | True, if | 3966 | | | | filename case | 3967 | | | | on this file | 3968 | | | | system is | 3969 | | | | preserved. | 3970 | chown_restricted | 18 | bool | READ | If TRUE, the | 3971 | | | | server will | 3972 | | | | reject any | 3973 | | | | request to | 3974 | | | | change either | 3975 | | | | the owner or | 3976 | | | | the group | 3977 | | | | associated | 3978 | | | | with a file if | 3979 | | | | the caller is | 3980 | | | | not a | 3981 | | | | privileged | 3982 | | | | user (for | 3983 | | | | example, | 3984 | | | | "root" in UNIX | 3985 | | | | operating | 3986 | | | | environments | 3987 | | | | or in Windows | 3988 | | | | 2000 the "Take | 3989 | | | | Ownership" | 3990 | | | | privilege). | 3991 | dacl | 58 | nfsacl41 | R/W | Automatically | 3992 | | | | inheritable | 3993 | | | | access control | 3994 | | | | list used for | 3995 | | | | determining | 3996 | | | | access to file | 3997 | | | | system | 3998 | | | | objects. | 3999 | dir_notif_delay | 56 | nfstime4 | READ | notification | 4000 | | | | delays on | 4001 | | | | directory | 4002 | | | | attributes | 4003 | dirent_ | 57 | nfstime4 | READ | notification | 4004 | notif_delay | | | | delays on | 4005 | | | | child | 4006 | | | | attributes | 4007 | fileid | 20 | uint64 | READ | A number | 4008 | | | | uniquely | 4009 | | | | identifying | 4010 | | | | the file | 4011 | | | | within the | 4012 | | | | file system.
| 4013 | files_avail | 21 | uint64 | READ | File slots | 4014 | | | | | available to | 4015 | | | | | this user on | 4016 | | | | | the file | 4017 | | | | | system | 4018 | | | | | containing | 4019 | | | | | this object - | 4020 | | | | | this should be | 4021 | | | | | the smallest | 4022 | | | | | relevant | 4023 | | | | | limit. | 4024 | files_free | 22 | uint64 | READ | Free file | 4025 | | | | | slots on the | 4026 | | | | | file system | 4027 | | | | | containing | 4028 | | | | | this object - | 4029 | | | | | this should be | 4030 | | | | | the smallest | 4031 | | | | | relevant | 4032 | | | | | limit. | 4033 | files_total | 23 | uint64 | READ | Total file | 4034 | | | | | slots on the | 4035 | | | | | file system | 4036 | | | | | containing | 4037 | | | | | this object. | 4038 | fs_absent | 60 | bool | READ | Is current | 4039 | | | | | file system | 4040 | | | | | present or | 4041 | | | | | absent. | 4042 | fs_layout_type | 62 | layouttype4<> | READ | Layout types | 4043 | | | | | available for | 4044 | | | | | the file | 4045 | | | | | system. | 4046 | fs_locations | 24 | fs_locations | READ | Locations | 4047 | | | | | where this | 4048 | | | | | file system | 4049 | | | | | may be found. | 4050 | | | | | If the server | 4051 | | | | | returns | 4052 | | | | | NFS4ERR_MOVED | 4053 | | | | | as an error, | 4054 | | | | | this attribute | 4055 | | | | | MUST be | 4056 | | | | | supported. | 4057 | fs_locations_info | 67 | | READ | Full function | 4058 | | | | | file system | 4059 | | | | | location. | 4060 | fs_status | 61 | fs4_status | READ | Generic file | 4061 | | | | | system type | 4062 | | | | | information. | 4063 | hidden | 25 | bool | R/W | True, if the | 4064 | | | | | file is | 4065 | | | | | considered | 4066 | | | | | hidden with | 4067 | | | | | respect to the | 4068 | | | | | Windows API? 
| 4069 | homogeneous | 26 | bool | READ | True, if this | 4070 | | | | | object's file | 4071 | | | | | system is | 4072 | | | | | homogeneous, | 4073 | | | | | i.e. are per | 4074 | | | | | file system | 4075 | | | | | attributes the | 4076 | | | | | same for all | 4077 | | | | | file system's | 4078 | | | | | objects. | 4079 | layout_alignment | 66 | uint32_t | READ | Preferred | 4080 | | | | | alignment for | 4081 | | | | | layout related | 4082 | | | | | I/O. | 4083 | layout_blksize | 65 | uint32_t | READ | Preferred | 4084 | | | | | block size for | 4085 | | | | | layout related | 4086 | | | | | I/O. | 4087 | layout_hint | 63 | layouthint4 | WRITE | Client | 4088 | | | | | specified hint | 4089 | | | | | for file | 4090 | | | | | layout. | 4091 | layout_type | 64 | layouttype4<> | READ | Layout types | 4092 | | | | | available for | 4093 | | | | | the file. | 4094 | maxfilesize | 27 | uint64 | READ | Maximum | 4095 | | | | | supported file | 4096 | | | | | size for the | 4097 | | | | | file system of | 4098 | | | | | this object. | 4099 | maxlink | 28 | uint32 | READ | Maximum number | 4100 | | | | | of links for | 4101 | | | | | this object. | 4102 | maxname | 29 | uint32 | READ | Maximum | 4103 | | | | | filename size | 4104 | | | | | supported for | 4105 | | | | | this object. | 4106 | maxread | 30 | uint64 | READ | Maximum read | 4107 | | | | | size supported | 4108 | | | | | for this | 4109 | | | | | object. | 4110 | maxwrite | 31 | uint64 | READ | Maximum write | 4111 | | | | | size supported | 4112 | | | | | for this | 4113 | | | | | object. This | 4114 | | | | | attribute | 4115 | | | | | SHOULD be | 4116 | | | | | supported if | 4117 | | | | | the file is | 4118 | | | | | writable. | 4119 | | | | | Lack of this | 4120 | | | | | attribute can | 4121 | | | | | lead to the | 4122 | | | | | client either | 4123 | | | | | wasting | 4124 | | | | | bandwidth or | 4125 | | | | | not receiving | 4126 | | | | | the best | 4127 | | | | | performance. 
| 4128 | mdsthreshold | 68 | mdsthreshold4 | READ | Hint to client | 4129 | | | | as to when to | 4130 | | | | write through | 4131 | | | | the pnfs | 4132 | | | | metadata | 4133 | | | | server. | 4134 | mimetype | 32 | utf8<> | R/W | MIME body | 4135 | | | | type/subtype | 4136 | | | | of this | 4137 | | | | object. | 4138 | mode | 33 | mode4 | R/W | UNIX-style | 4139 | | | | mode including | 4140 | | | | permission | 4141 | | | | bits for this | 4142 | | | | object. | 4143 | mode_set_masked | 74 | mode_masked4 | WRITE | Allows setting | 4144 | | | | or resetting a | 4145 | | | | subset of the | 4146 | | | | bits in a | 4147 | | | | UNIX-style | 4148 | | | | mode | 4149 | mounted_on_fileid | 55 | uint64 | READ | Like fileid, | 4150 | | | | but if the | 4151 | | | | target | 4152 | | | | filehandle is | 4153 | | | | the root of a | 4154 | | | | file system | 4155 | | | | return the | 4156 | | | | fileid of the | 4157 | | | | underlying | 4158 | | | | directory. | 4159 | no_trunc | 34 | bool | READ | True, if a | 4160 | | | | name longer | 4161 | | | | than name_max | 4162 | | | | is used, an | 4163 | | | | error will be | 4164 | | | | returned and | 4165 | | | | name is not | 4166 | | | | truncated. | 4167 | numlinks | 35 | uint32 | READ | Number of hard | 4168 | | | | links to this | 4169 | | | | object. | 4170 | owner | 36 | utf8<> | R/W | The string | 4171 | | | | name of the | 4172 | | | | owner of this | 4173 | | | | object. | 4174 | owner_group | 37 | utf8<> | R/W | The string | 4175 | | | | name of the | 4176 | | | | group | 4177 | | | | ownership of | 4178 | | | | this object. | 4179 | quota_avail_hard | 38 | uint64 | READ | For definition | 4180 | | | | see "Quota | 4181 | | | | Attributes" | 4182 | | | | section below. | 4183 | quota_avail_soft | 39 | uint64 | READ | For definition | 4184 | | | | see "Quota | 4185 | | | | Attributes" | 4186 | | | | section below.
| 4187 | quota_used | 40 | uint64 | READ | For definition | 4188 | | | | see "Quota | 4189 | | | | Attributes" | 4190 | | | | section below. | 4191 | rawdev | 41 | specdata4 | READ | Raw device | 4192 | | | | identifier. | 4193 | | | | UNIX device | 4194 | | | | major/minor | 4195 | | | | node | 4196 | | | | information. | 4197 | | | | If the value | 4198 | | | | of type is not | 4199 | | | | NF4BLK or | 4200 | | | | NF4CHR, the | 4201 | | | | value returned | 4202 | | | | SHOULD NOT be | 4203 | | | | considered | 4204 | | | | useful. | 4205 | retentevt_get | 71 | retention_get4 | READ | Get the | 4206 | | | | event-based | 4207 | | | | retention | 4208 | | | | duration, and | 4209 | | | | if enabled, | 4210 | | | | the | 4211 | | | | event-based | 4212 | | | | retention | 4213 | | | | begin time of | 4214 | | | | the file | 4215 | | | | object. | 4216 | | | | GETATTR use | 4217 | | | | only. | 4218 | retentevt_set | 72 | retention_set4 | WRITE | Set the | 4219 | | | | event-based | 4220 | | | | retention | 4221 | | | | duration, and | 4222 | | | | optionally | 4223 | | | | enable | 4224 | | | | event-based | 4225 | | | | retention on | 4226 | | | | the file | 4227 | | | | object. | 4228 | | | | SETATTR use | 4229 | | | | only. | 4230 | retention_get | 69 | retention_get4 | READ | Get the | 4231 | | | | retention | 4232 | | | | duration, and | 4233 | | | | if enabled, | 4234 | | | | the retention | 4235 | | | | begin time of | 4236 | | | | the file | 4237 | | | | object. | 4238 | | | | GETATTR use | 4239 | | | | only. | 4240 | retention_hold | 73 | uint64_t | R/W | Get or set | 4241 | | | | administrative | 4242 | | | | retention | 4243 | | | | holds, one | 4244 | | | | hold per bit | 4245 | | | | position.
| 4246 | retention_set | 70 | retention_set4 | WRITE | Set the | 4247 | | | | | retention | 4248 | | | | | duration, and | 4249 | | | | | optionally | 4250 | | | | | enable | 4251 | | | | | retention on | 4252 | | | | | the file | 4253 | | | | | object. | 4254 | | | | | SETATTR use | 4255 | | | | | only. | 4256 | sacl | 59 | nfsacl41 | R/W | Automatically | 4257 | | | | | inheritable | 4258 | | | | | access control | 4259 | | | | | list used for | 4260 | | | | | auditing | 4261 | | | | | access to | 4262 | | | | | files. | 4263 | space_avail | 42 | uint64 | READ | Disk space in | 4264 | | | | | bytes | 4265 | | | | | available to | 4266 | | | | | this user on | 4267 | | | | | the file | 4268 | | | | | system | 4269 | | | | | containing | 4270 | | | | | this object - | 4271 | | | | | this should be | 4272 | | | | | the smallest | 4273 | | | | | relevant | 4274 | | | | | limit. | 4275 | space_free | 43 | uint64 | READ | Free disk | 4276 | | | | | space in bytes | 4277 | | | | | on the file | 4278 | | | | | system | 4279 | | | | | containing | 4280 | | | | | this object - | 4281 | | | | | this should be | 4282 | | | | | the smallest | 4283 | | | | | relevant | 4284 | | | | | limit. | 4285 | space_total | 44 | uint64 | READ | Total disk | 4286 | | | | | space in bytes | 4287 | | | | | on the file | 4288 | | | | | system | 4289 | | | | | containing | 4290 | | | | | this object. | 4291 | space_used | 45 | uint64 | READ | Number of file | 4292 | | | | | system bytes | 4293 | | | | | allocated to | 4294 | | | | | this object. | 4295 | system | 46 | bool | R/W | True, if this | 4296 | | | | | file is a | 4297 | | | | | "system" file | 4298 | | | | | with respect | 4299 | | | | | to the Windows | 4300 | | | | | API? | 4301 | time_access | 47 | nfstime4 | READ | The time of | 4302 | | | | | last access to | 4303 | | | | | the object by | 4304 | | | | | a read that | 4305 | | | | | was satisfied | 4306 | | | | | by the server. 
| 4307 | time_access_set | 48 | settime4 | WRITE | Set the time | 4308 | | | | | of last access | 4309 | | | | | to the object. | 4310 | | | | | SETATTR use | 4311 | | | | | only. | 4312 | time_backup | 49 | nfstime4 | R/W | The time of | 4313 | | | | | last backup of | 4314 | | | | | the object. | 4315 | time_create | 50 | nfstime4 | R/W | The time of | 4316 | | | | | creation of | 4317 | | | | | the object. | 4318 | | | | | This attribute | 4319 | | | | | does not have | 4320 | | | | | any relation | 4321 | | | | | to the | 4322 | | | | | traditional | 4323 | | | | | UNIX file | 4324 | | | | | attribute | 4325 | | | | | "ctime" or | 4326 | | | | | "change time". | 4327 | time_delta | 51 | nfstime4 | READ | Smallest | 4328 | | | | | useful server | 4329 | | | | | time | 4330 | | | | | granularity. | 4331 | time_metadata | 52 | nfstime4 | READ | The time of | 4332 | | | | | last meta-data | 4333 | | | | | modification | 4334 | | | | | of the object. | 4335 | time_modify | 53 | nfstime4 | READ | The time of | 4336 | | | | | last | 4337 | | | | | modification | 4338 | | | | | to the object. | 4339 | time_modify_set | 54 | settime4 | WRITE | Set the time | 4340 | | | | | of last | 4341 | | | | | modification | 4342 | | | | | to the object. | 4343 | | | | | SETATTR use | 4344 | | | | | only. | 4345 +-------------------+----+----------------+--------+----------------+ 4347 5.7. Time Access 4349 As defined above, the time_access attribute represents the time of 4350 last access to the object by a read that was satisfied by the server. 4351 The notion of what is an "access" depends on server's operating 4352 environment and/or the server's file system semantics. For example, 4353 for servers obeying POSIX semantics, time_access would be updated 4354 only by the READLINK, READ, and READDIR operations and not any of the 4355 operations that modify the content of the object. 
Of course, setting 4356 the corresponding time_access_set attribute is another way to modify 4357 the time_access attribute. 4359 Whenever the file object resides on a writable file system, the 4360 server should make best efforts to record time_access into stable 4361 storage. However, to mitigate the performance effects of doing so, 4362 and most especially whenever the server is satisfying the read of the 4363 object's content from its cache, the server MAY cache access time 4364 updates and lazily write them to stable storage. It is also 4365 acceptable to give administrators of the server the option to disable 4366 time_access updates. 4368 5.8. Interpreting owner and owner_group 4370 The recommended attributes "owner" and "owner_group" (and also users 4371 and groups within the "acl" attribute) are represented in terms of a 4372 UTF-8 string. To avoid a representation that is tied to a particular 4373 underlying implementation at the client or server, the use of the 4374 UTF-8 string has been chosen. Note that section 6.1 of RFC2624 [27] 4375 provides additional rationale. It is expected that the client and 4376 server will have their own local representation of owner and 4377 owner_group that is used for local storage or presentation to the end 4378 user. Therefore, it is expected that when these attributes are 4379 transferred between the client and server, the local 4380 representation is translated to a syntax of the form "user@ 4381 dns_domain". This allows a client and server that do not use 4382 the same local representation to translate to a common 4383 syntax that can be interpreted by both. 4385 Similarly, security principals may be represented in different ways 4386 by different security mechanisms. Servers normally translate these 4387 representations into a common format, generally that used by local 4388 storage, to serve as a means of identifying the users corresponding 4389 to these security principals.
When these local identifiers are 4390 translated into the form of the owner attribute and associated with files 4391 created by such principals, they identify, in a common format, the 4392 users associated with each corresponding set of security principals. 4394 The translation used to interpret owner and group strings is not 4395 specified as part of the protocol. This allows various solutions to 4396 be employed. For example, a local translation table may be consulted 4397 that maps a numeric id to the user@dns_domain syntax. A name 4398 service may also be used to accomplish the translation. A server may 4399 provide a more general service, not limited by any particular 4400 translation (which would only translate a limited set of possible 4401 strings), by storing the owner and owner_group attributes in local 4402 storage without any translation, or it may augment a translation 4403 method by storing the entire string for attributes for which no 4404 translation is available while using the local representation for 4405 those cases in which a translation is available. 4407 Servers that do not provide support for all possible values of the 4408 owner and owner_group attributes should return an error 4409 (NFS4ERR_BADOWNER) when a string is presented that has no 4410 translation, as the value to be set for a SETATTR of the owner, 4411 owner_group, or acl attributes. When a server does accept an owner 4412 or owner_group value as valid on a SETATTR (and similarly for the 4413 owner and group strings in an acl), it is promising to return that 4414 same string when a corresponding GETATTR is done. Configuration 4415 changes and ill-constructed name translations (those that contain 4416 aliasing) may make that promise impossible to honor. Servers should 4417 make appropriate efforts to avoid a situation in which these 4418 attributes have their values changed when no real change to ownership 4419 has occurred.
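The translation behavior this section describes can be sketched as follows. This is a minimal illustration under invented assumptions: USER_TABLE is a hypothetical local account table, and a real server would instead consult its own account database or a name service. The numeric-string compatibility form handled below is the one this section describes for v2/v3 interoperability.

```python
# Hypothetical sketch of owner/owner_group string translation.
# USER_TABLE is invented for illustration; it is not a real API.

USER_TABLE = {"alice@example.org": 1000}          # name@dns_domain -> local uid
UID_TABLE = {uid: name for name, uid in USER_TABLE.items()}

def owner_to_uid(owner):
    """Map an owner attribute value to a local uid.
    Returns None where a server would return NFS4ERR_BADOWNER."""
    if owner in USER_TABLE:
        return USER_TABLE[owner]
    # Optional v2/v3 compatibility form: a decimal numeric string with
    # no "@" and no leading zeros, treated as a uid by receivers that
    # choose to support it.
    if "@" not in owner and owner.isdigit() and (owner == "0" or owner[0] != "0"):
        return int(owner)
    return None

def uid_to_owner(uid):
    """Translate a local uid to the name@dns_domain syntax; when no
    translation is available, send the bare numeric string (no "@"),
    signaling the receiver not to attempt translation."""
    return UID_TABLE.get(uid, str(uid))
```

Note that a conforming server would additionally reject the numeric form with NFS4ERR_BADOWNER when a valid name translation exists for that uid, so that the compatibility mechanism cannot be used to bypass name-based translation.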
4421 The "dns_domain" portion of the owner string is meant to be a DNS 4422 domain name. For example, user@example.org. Servers should accept as 4423 valid a set of users for at least one domain. A server may treat 4424 other domains as having no valid translations. A more general 4425 service is provided when a server is capable of accepting users for 4426 multiple domains, or for all domains, subject to security 4427 constraints. 4429 In the case where there is no translation available to the client or 4430 server, the attribute value must be constructed without the "@". 4431 Therefore, the absence of the @ from the owner or owner_group 4432 attribute signifies that no translation was available at the sender 4433 and that the receiver of the attribute should not use that string as 4434 a basis for translation into its own internal format. Even though 4435 the attribute value cannot be translated, it may still be useful. 4436 In the case of a client, the attribute string may be used for local 4437 display of ownership. 4439 To provide a greater degree of compatibility with previous versions 4440 of NFS (i.e. v2 and v3), which identified users and groups by 32-bit 4441 unsigned uids and gids, owner and group strings that consist of 4442 decimal numeric values with no leading zeros can be given a special 4443 interpretation by clients and servers which choose to provide such 4444 support. The receiver may treat such a user or group string as 4445 representing the same user as would be represented by a v2/v3 uid or 4446 gid having the corresponding numeric value. A server is not 4447 obligated to accept such a string, but may return an NFS4ERR_BADOWNER 4448 instead. To avoid this mechanism being used to subvert user and 4449 group translation, so that a client might pass all of the owners and 4450 groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER 4451 error when there is a valid translation for the user or owner 4452 designated in this way.
In that case, the client must use the 4453 appropriate name@domain string and not the special form for 4454 compatibility. 4456 The owner string "nobody" may be used to designate an anonymous user, 4457 which will be associated with a file created by a security principal 4458 that cannot be mapped through normal means to the owner attribute. 4460 5.9. Character Case Attributes 4462 With respect to the case_insensitive and case_preserving attributes, 4463 each UCS-4 character (which UTF-8 encodes) has a "long descriptive 4464 name" RFC1345 [28] which may or may not include the word "CAPITAL" 4465 or "SMALL". The presence of SMALL or CAPITAL allows an NFS server to 4466 implement unambiguous and efficient table driven mappings for case 4467 insensitive comparisons, and non-case-preserving storage. For 4468 general character handling and internationalization issues, see the 4469 section "Internationalization". 4471 5.10. Quota Attributes 4473 For the attributes related to file system quotas, the following 4474 definitions apply: 4476 quota_avail_soft The value in bytes which represents the amount of 4477 additional disk space that can be allocated to this file or 4478 directory before the user may reasonably be warned. It is 4479 understood that this space may be consumed by allocations to other 4480 files or directories though there is a rule as to which other 4481 files or directories. 4483 quota_avail_hard The value in bytes which represents the amount of 4484 additional disk space beyond the current allocation that can be 4485 allocated to this file or directory before further allocations 4486 will be refused. It is understood that this space may be consumed 4487 by allocations to other files or directories.
4489 quota_used The value in bytes which represents the amount of disk 4490 space used by this file or directory and possibly a number of 4491 other similar files or directories, where the set of "similar" 4492 meets at least the criterion that allocating space to any file or 4493 directory in the set will reduce the "quota_avail_hard" of every 4494 other file or directory in the set. 4496 Note that there may be a number of distinct but overlapping sets 4497 of files or directories for which a quota_used value is 4498 maintained, e.g., "all files with a given owner", "all files with 4499 a given group owner", etc. 4501 The server is at liberty to choose any of those sets but should do 4502 so in a repeatable way. The rule may be configured per file 4503 system or may be "choose the set with the smallest quota". 4505 5.11. mounted_on_fileid 4507 UNIX-based operating environments connect a file system into the 4508 namespace by connecting (mounting) the file system onto the existing 4509 file object (the mount point, usually a directory) of an existing 4510 file system. When the mount point's parent directory is read via an 4511 API like readdir(), the return results are directory entries, each 4512 with a component name and a fileid. The fileid of the mount point's 4513 directory entry will be different from the fileid that the stat() 4514 system call returns. The stat() system call is returning the fileid 4515 of the root of the mounted file system, whereas readdir() is 4516 returning the fileid stat() would have returned before any file 4517 systems were mounted on the mount point. 4519 Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request 4520 to cross other file systems. The client detects the file system 4521 crossing whenever the filehandle argument of LOOKUP has an fsid 4522 attribute different from that of the filehandle returned by LOOKUP. 4523 A UNIX-based client will consider this a "mount point crossing".
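The fsid comparison that drives this detection can be sketched as follows; the struct mirrors the protocol's fsid (major, minor) pair, but the helper itself is a hypothetical illustration, not part of the specification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The protocol's fsid attribute is a (major, minor) pair. */
typedef struct {
    uint64_t major;
    uint64_t minor;
} fsid4;

/* A LOOKUP result whose fsid differs from the fsid of the directory
 * filehandle passed to LOOKUP indicates a file system (mount point)
 * crossing from the client's point of view. */
bool crossed_mount_point(fsid4 dir_fsid, fsid4 looked_up_fsid)
{
    return dir_fsid.major != looked_up_fsid.major ||
           dir_fsid.minor != looked_up_fsid.minor;
}
```

A client performing LOOKUP would apply this check between the fsid of the directory filehandle it supplied and the fsid of the object the LOOKUP returned.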
4524 UNIX has a legacy scheme for allowing a process to determine its 4525 current working directory. This relies on readdir() of a mount 4526 point's parent and stat() of the mount point returning fileids as 4527 previously described. The mounted_on_fileid attribute corresponds to 4528 the fileid that readdir() would have returned as described 4529 previously. 4531 While the NFS version 4 client could simply fabricate a fileid 4532 corresponding to what mounted_on_fileid provides (and if the server 4533 does not support mounted_on_fileid, the client has no choice), there 4534 is a risk that the client will generate a fileid that conflicts with 4535 one that is already assigned to another object in the file system. 4536 Instead, if the server can provide the mounted_on_fileid, the 4537 potential for client operational problems in this area is eliminated. 4539 If the server detects that there is no mount point at the target 4540 file object, then the value for mounted_on_fileid that it returns is 4541 the same as that of the fileid attribute. 4543 The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD 4544 provide it if possible, and for a UNIX-based server, this is 4545 straightforward. Usually, mounted_on_fileid will be requested during 4546 a READDIR operation, in which case it is trivial (at least for UNIX- 4547 based servers) to return mounted_on_fileid since it is equal to the 4548 fileid of a directory entry returned by readdir(). If 4549 mounted_on_fileid is requested in a GETATTR operation, the server 4550 should obey an invariant that has it returning a value that is equal 4551 to the file object's entry in the object's parent directory, i.e., 4552 what readdir() would have returned. Some operating environments 4553 allow a series of two or more file systems to be mounted onto a 4554 single mount point.
In this case, for the server to obey the 4555 aforementioned invariant, it will need to find the base mount point, 4556 and not the intermediate mount points. 4558 5.12. Directory Notification Attributes 4560 As described in Section 17.39, the client can request a minimum delay 4561 for notifications of changes to attributes, but the server is free 4562 to ignore what the client requests. The client can determine in advance 4563 what notification delays the server will accept by issuing a GETATTR 4564 for either or both of two directory notification attributes. When 4565 the client calls the GET_DIR_DELEGATION operation and asks for 4566 attribute change notifications, it should request notification 4567 delays that are no less than the values in the server-provided 4568 attributes. 4570 5.12.1. dir_notif_delay 4572 The dir_notif_delay attribute is the minimum number of seconds the 4573 server will delay before notifying the client of a change to the 4574 directory's attributes. 4576 5.12.2. dirent_notif_delay 4578 The dirent_notif_delay attribute is the minimum number of seconds the 4579 server will delay before notifying the client of a change to a file 4580 object that has an entry in the directory. 4582 5.13. pNFS Attributes 4584 5.13.1. fs_layout_type 4586 The fs_layout_type attribute (data type layouttype4, see 4587 Section 3.2.15) applies to a file system and indicates what layout 4588 types are supported by the file system. This attribute is expected 4589 to be queried when a client encounters a new fsid. This attribute is 4590 used by the client to determine if it supports the layout type. 4592 5.13.2. layout_alignment 4594 The layout_alignment attribute indicates the preferred alignment for 4595 I/O to files on the file system the client has layouts for. Where 4596 possible, the client should issue READ and WRITE operations with 4597 offsets that are whole multiples of the layout_alignment attribute. 4599 5.13.3.
layout_blksize 4601 The layout_blksize attribute indicates the preferred block size for 4602 I/O to files on the file system the client has layouts for. Where 4603 possible, the client should issue READ operations with a count 4604 argument that is a whole multiple of layout_blksize, and WRITE 4605 operations with a data argument of size that is a whole multiple of 4606 layout_blksize. 4608 5.13.4. layout_hint 4610 The layout_hint attribute (data type layouthint4, see Section 3.2.22) 4611 may be set on newly created files to influence the metadata server's 4612 choice for the file's layout. It is suggested that this attribute be 4613 set as one of the initial attributes within the OPEN call. The 4614 metadata server may ignore this attribute. This attribute is a 4615 subset of the layout structure returned by LAYOUTGET. For example, 4616 instead of specifying particular devices, this would be used to 4617 suggest the stripe width of a file. It is up to the server 4618 implementation to determine which fields within the layout it uses. 4620 5.13.5. layout_type 4622 This attribute indicates the particular layout type(s) used for a 4623 file. This is for informational purposes only. The client needs to 4624 use the LAYOUTGET operation in order to get enough information (e.g., 4625 specific device information) to perform I/O. 4627 5.13.6. mdsthreshold 4629 This attribute acts as a hint to the client to help it determine when 4630 it is more efficient to issue read and write requests to the metadata 4631 server vs. the data server. Two types of thresholds are described: 4632 file size thresholds and I/O size thresholds. If a file's size is 4633 smaller than the file size threshold, data accesses should be issued 4634 to the metadata server. If an I/O is below the I/O size threshold, 4635 the I/O should be issued to the metadata server. Each threshold can 4636 be specified independently for read and write requests.
For either 4637 threshold type, a value of 0 indicates no read or write should be 4638 issued to the metadata server, while a value of all 1s indicates all 4639 reads or writes should be issued to the metadata server. 4641 The attribute is available on a per filehandle basis. If the current 4642 filehandle refers to a non-pNFS file or directory, the metadata 4643 server should return an attribute that is representative of the 4644 filehandle's file system. It is suggested that this attribute be 4645 queried as part of the OPEN operation. Due to dynamic system 4646 changes, the client should not assume that the attribute will remain 4647 constant for any specific time period, thus it should be periodically 4648 refreshed. 4650 5.14. Retention Attributes 4652 Retention is a concept whereby a file object can be placed in an 4653 immutable, undeletable, unrenamable state for a fixed or infinite 4654 duration of time. Once in this "retained" state, the file cannot be 4655 moved out of the state until the duration of retention has been 4656 reached. 4658 When retention is enabled, retention MUST extend to the data of the 4659 file, and the name of the file. The server MAY extend retention to any 4660 other property of the file, including any subset of mandatory, 4661 recommended, and named attributes, with the exceptions noted in this 4662 section. 4664 Servers MAY support or not support retention on any file object type. 4666 There are five retention attributes: 4668 o retention_get. This attribute is only readable via GETATTR and 4669 not settable via SETATTR. The value of the attribute consists of: 4671 const RET4_DURATION_INFINITE = 0xffffffffffffffff; 4672 struct retention_get4 { 4673 uint64_t rg_duration; 4674 nfstime4 rg_begin_time<1>; 4675 }; 4677 The field rg_duration is the duration in seconds indicating how long 4678 the file will be retained once retention is enabled. The field 4679 rg_begin_time is an array of up to one absolute time value.
If 4680 the array is zero length, no beginning retention time has been 4681 established, and retention is not enabled. If rg_duration is 4682 equal to RET4_DURATION_INFINITE, the file, once retention is 4683 enabled, will be retained for an infinite duration. 4685 o retention_set. This attribute corresponds to retention_get. This 4686 attribute is only settable via SETATTR and not readable via 4687 GETATTR. The value of the attribute consists of: 4689 struct retention_set4 { 4690 bool rs_enable; 4691 uint64_t rs_duration<1>; 4692 }; 4693 If the client sets rs_enable to TRUE, then it is enabling 4694 retention on the file object with the begin time of retention 4695 commencing from the server's current time and date. The duration 4696 of the retention can also be provided if the rs_duration array is 4697 of length one. The duration is in seconds from the begin 4698 time of retention, and if set to RET4_DURATION_INFINITE, the file 4699 is to be retained forever. If retention is enabled, with no 4700 duration specified in either this SETATTR or a previous SETATTR, 4701 the duration defaults to zero seconds. The server MAY restrict 4702 the enabling of retention or the duration of retention on the 4703 basis of the ACE4_WRITE_RETENTION ACL permission. The enabling of 4704 retention does not prevent the enabling of event-based retention 4705 nor the modification of the retention_hold attribute. 4707 o retentevt_get. This attribute is like retention_get, but refers 4708 to event-based retention. The event that triggers event-based 4709 retention is not defined by the NFSv4.1 specification. 4711 o retentevt_set. This attribute corresponds to retentevt_get and is 4712 like retention_set, but refers to event-based retention. When 4713 event-based retention is set, the file MUST be retained even if 4714 non-event-based retention has been set, and the duration of non- 4715 event-based retention has been reached.
Conversely, when non- 4716 event-based retention has been set, the file MUST be retained even 4717 if event-based retention has been set, and the duration of event- 4718 based retention has been reached. The server MAY restrict the 4719 enabling of event-based retention or the duration of event-based 4720 retention on the basis of the ACE4_WRITE_RETENTION ACL permission. 4721 The enabling of event-based retention does not prevent the 4722 enabling of non-event-based retention nor the modification of the 4723 retention_hold attribute. 4725 o retention_hold. This attribute allows up to 64 administrative 4726 holds, one hold per bit on the attribute. If retention_hold is 4727 not zero, then the file MUST NOT be deleted, renamed, or modified, 4728 even if the duration of enabled event-based or non-event-based retention 4729 has been reached. The server MAY restrict the modification of 4730 retention_hold on the basis of the ACE4_WRITE_RETENTION_HOLD ACL 4731 permission. The enabling of administrative retention holds does 4732 not prevent the enabling of event-based or non-event-based 4733 retention. 4735 6. Access Control Lists 4737 Access Control Lists (ACLs) are file attributes that specify fine- 4738 grained access control. This chapter covers the "acl", "dacl", 4739 "sacl", "aclsupport", "mode", "mode_set_masked" file attributes, and 4740 their interactions. 4742 6.1. Goals 4744 ACLs and modes represent two well-established but different models 4745 for specifying permissions. This chapter specifies requirements that 4746 attempt to meet the following goals: 4748 o If a server supports the mode attribute, it should provide 4749 reasonable semantics to clients that only set and retrieve the 4750 mode attribute. 4752 o If a server supports the ACL attribute, it should provide 4753 reasonable semantics to clients that only set and retrieve the ACL 4754 attribute.
4756 o On servers that support the mode attribute, if the ACL attribute 4757 has never been set on an object, via inheritance or explicitly, 4758 the behavior should be traditional UNIX-like behavior. 4760 o On servers that support the mode attribute, if the ACL attribute 4761 has been previously set on an object, either explicitly or via 4762 inheritance: 4764 * Setting only the mode attribute should effectively control the 4765 traditional UNIX-like permissions of read, write, and execute 4766 on owner, owner_group, and other. 4768 * Setting only the mode attribute should provide reasonable 4769 security. For example, setting a mode of 000 should be enough 4770 to ensure that future opens for read or write by any principal 4771 should fail, regardless of a previously existing or inherited 4772 ACL. 4774 o This minor version of NFSv4 should not introduce significantly 4775 different semantics relating to the mode and ACL attributes, nor 4776 should it render invalid any existing implementations. Rather, 4777 this chapter provides clarifications based on previous 4778 implementations and discussions around them. 4780 o If a server supports the ACL attribute, then at any time, the 4781 server can provide an ACL attribute when requested. The ACL 4782 attribute will describe all permissions on the file object, except 4783 for the three high-order bits of the mode attribute (described in 4784 Section 6.2.3). The ACL attribute will not conflict with the mode 4785 attribute, on servers that support the mode attribute. 4787 o If a server supports the mode attribute, then at any time, the 4788 server can provide a mode attribute when requested. The mode 4789 attribute will not conflict with the ACL attribute, on servers 4790 that support the ACL attribute. 4792 o When a mode attribute is set on an object, the ACL attribute may 4793 need to be modified so as to not conflict with the new mode. 
In 4794 such cases, it is desirable that the ACL keep as much information 4795 as possible. This includes information about inheritance, AUDIT 4796 and ALARM ACEs, and permissions granted and denied that do not 4797 conflict with the new mode. 4799 6.2. File Attributes Discussion 4801 6.2.1. ACL Attribute 4803 The NFS version 4 ACL attribute is an array of access control entries 4804 (ACEs). Although the client can read and write the ACL attribute, 4805 the server is responsible for using the ACL to perform access 4806 control. The client can use the OPEN or ACCESS operations to check 4807 access without modifying or reading data or metadata. 4809 The NFS ACE attribute is defined as follows: 4811 typedef uint32_t acetype4; 4812 typedef uint32_t aceflag4; 4813 typedef uint32_t acemask4; 4815 struct nfsace4 { 4816 acetype4 type; 4817 aceflag4 flag; 4818 acemask4 access_mask; 4819 utf8str_mixed who; 4820 }; 4822 To determine if a request succeeds, the server processes each nfsace4 4823 entry in order. Only ACEs which have a "who" that matches the 4824 requester are considered. Each ACE is processed until all of the 4825 bits of the requester's access have been ALLOWED. Once a bit (see 4826 below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer 4827 considered in the processing of later ACEs. If an ACCESS_DENIED_ACE 4828 is encountered where the requester's access still has unALLOWED bits 4829 in common with the "access_mask" of the ACE, the request is denied. 4830 When the ACL is fully processed, if there are bits in the requester's 4831 mask that have not been ALLOWED or DENIED, access is denied. 4833 Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do 4834 not affect a requester's access, and instead are for triggering 4835 events as a result of a requester's access attempt. Therefore, all 4836 AUDIT and ALARM ACEs are processed until the end of the ACL. 4838 The NFS version 4 ACL model is quite rich.
Some server platforms may 4839 provide access control functionality that goes beyond the UNIX-style 4840 mode attribute, but which is not as rich as the NFS ACL model. So 4841 that users can take advantage of this more limited functionality, the 4842 server may indicate that it supports ACLs as long as it follows the 4843 guidelines for mapping between its ACL model and the NFS version 4 4844 ACL model. 4846 The situation is complicated by the fact that a server may have 4847 multiple modules that enforce ACLs. For example, the enforcement for 4848 NFS version 4 access may be different from the enforcement for local 4849 access, and both may be different from the enforcement for access 4850 through other protocols such as SMB. So it may be useful for a 4851 server to accept an ACL even if not all of its modules are able to 4852 support it. 4854 The guiding principle in all cases is that the server must not accept 4855 ACLs that appear to make the file more secure than it really is. 4857 6.2.1.1. ACE Type 4859 The constants used for the type field (acetype4) are as follows: 4861 const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000; 4862 const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001; 4863 const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002; 4864 const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003; 4866 +------------------------------+--------------+---------------------+ 4867 | Value | Abbreviation | Description | 4868 +------------------------------+--------------+---------------------+ 4869 | ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW | Explicitly grants | 4870 | | | the access defined | 4871 | | | in acemask4 to the | 4872 | | | file or directory. | 4873 | ACE4_ACCESS_DENIED_ACE_TYPE | DENY | Explicitly denies | 4874 | | | the access defined | 4875 | | | in acemask4 to the | 4876 | | | file or directory. 
| 4877 | ACE4_SYSTEM_AUDIT_ACE_TYPE | AUDIT | LOG (system | 4878 | | | dependent) any | 4879 | | | access attempt to a | 4880 | | | file or directory | 4881 | | | which uses any of | 4882 | | | the access methods | 4883 | | | specified in | 4884 | | | acemask4. | 4885 | ACE4_SYSTEM_ALARM_ACE_TYPE | ALARM | Generate a system | 4886 | | | ALARM (system | 4887 | | | dependent) when any | 4888 | | | access attempt is | 4889 | | | made to a file or | 4890 | | | directory for the | 4891 | | | access methods | 4892 | | | specified in | 4893 | | | acemask4. | 4894 +------------------------------+--------------+---------------------+ 4896 The "Abbreviation" column denotes how the types will be referred to 4897 throughout the rest of this document. 4899 6.2.1.2. The aclsupport Attribute 4901 A server need not support all of the above ACE types. The bitmask 4902 constants used to represent the above definitions within the 4903 aclsupport attribute are as follows: 4905 const ACL4_SUPPORT_ALLOW_ACL = 0x00000001; 4906 const ACL4_SUPPORT_DENY_ACL = 0x00000002; 4907 const ACL4_SUPPORT_AUDIT_ACL = 0x00000004; 4908 const ACL4_SUPPORT_ALARM_ACL = 0x00000008; 4910 Clients should not attempt to set an ACE unless the server claims 4911 support for that ACE type. If the server receives a request to set 4912 an ACE that it cannot store, it MUST reject the request with 4913 NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE 4914 that it can store but cannot enforce, the server SHOULD reject the 4915 request with NFS4ERR_ATTRNOTSUPP. 4917 Example: suppose a server can enforce NFS ACLs for NFS access but 4918 cannot enforce ACLs for local access. If arbitrary processes can run 4919 on the server, then the server SHOULD NOT indicate ACL support. On 4920 the other hand, if only trusted administrative programs run locally, 4921 then the server may indicate ACL support. 4923 6.2.1.3. 
ACE Access Mask 4925 The bitmask constants used for the access mask field are as follows: 4927 const ACE4_READ_DATA = 0x00000001; 4928 const ACE4_LIST_DIRECTORY = 0x00000001; 4929 const ACE4_WRITE_DATA = 0x00000002; 4930 const ACE4_ADD_FILE = 0x00000002; 4931 const ACE4_APPEND_DATA = 0x00000004; 4932 const ACE4_ADD_SUBDIRECTORY = 0x00000004; 4933 const ACE4_READ_NAMED_ATTRS = 0x00000008; 4934 const ACE4_WRITE_NAMED_ATTRS = 0x00000010; 4935 const ACE4_EXECUTE = 0x00000020; 4936 const ACE4_DELETE_CHILD = 0x00000040; 4937 const ACE4_READ_ATTRIBUTES = 0x00000080; 4938 const ACE4_WRITE_ATTRIBUTES = 0x00000100; 4939 const ACE4_WRITE_RETENTION = 0x00000200; 4940 const ACE4_WRITE_RETENTION_HOLD = 0x00000400; 4941 const ACE4_DELETE = 0x00010000; 4942 const ACE4_READ_ACL = 0x00020000; 4943 const ACE4_WRITE_ACL = 0x00040000; 4944 const ACE4_WRITE_OWNER = 0x00080000; 4945 const ACE4_SYNCHRONIZE = 0x00100000; 4947 6.2.1.3.1. Discussion of Mask Attributes 4949 ACE4_READ_DATA 4950 Operation(s) affected: 4951 READ 4952 OPEN 4953 Discussion: 4954 Permission to read the data of the file. 4956 Servers SHOULD allow a user the ability to read the data 4957 of the file when only the ACE4_EXECUTE access mask bit is 4958 allowed. 4960 ACE4_LIST_DIRECTORY 4961 Operation(s) affected: 4962 READDIR 4963 Discussion: 4964 Permission to list the contents of a directory. 4966 ACE4_WRITE_DATA 4967 Operation(s) affected: 4968 WRITE 4969 OPEN 4970 SETATTR of size 4972 Discussion: 4973 Permission to modify a file's data anywhere in the file's 4974 offset range. This includes the ability to write to any 4975 arbitrary offset and as a result to grow the file. 4977 ACE4_ADD_FILE 4978 Operation(s) affected: 4979 CREATE 4980 OPEN 4981 Discussion: 4982 Permission to add a new file in a directory. The CREATE 4983 operation is affected when nfs_ftype4 is NF4LNK, NF4BLK, 4984 NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because 4985 it is covered by ACE4_ADD_SUBDIRECTORY.) 
OPEN is affected 4986 when used to create a regular file. 4988 ACE4_APPEND_DATA 4989 Operation(s) affected: 4990 WRITE 4991 OPEN 4992 SETATTR of size 4993 Discussion: 4994 The ability to modify a file's data, but only starting at 4995 EOF. This allows for the notion of append-only files, by 4996 allowing ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to 4997 the same user or group. If a file has an ACL such as the 4998 one described above and a WRITE request is made for 4999 somewhere other than EOF, the server SHOULD return 5000 NFS4ERR_ACCESS. 5002 ACE4_ADD_SUBDIRECTORY 5003 Operation(s) affected: 5004 CREATE 5005 Discussion: 5006 Permission to create a subdirectory in a directory. The 5007 CREATE operation is affected when nfs_ftype4 is NF4DIR. 5009 ACE4_READ_NAMED_ATTRS 5010 Operation(s) affected: 5011 OPENATTR 5012 Discussion: 5013 Permission to read the named attributes of a file or to 5014 look up the named attributes directory. OPENATTR is 5015 affected when it is not used to create a named attribute 5016 directory. This is when 1.) createdir is TRUE, but a 5017 named attribute directory already exists, or 2.) createdir 5018 is FALSE. 5020 ACE4_WRITE_NAMED_ATTRS 5021 Operation(s) affected: 5022 OPENATTR 5023 Discussion: 5024 Permission to write the named attributes of a file or 5025 to create a named attribute directory. OPENATTR is 5026 affected when it is used to create a named attribute 5027 directory. This is when createdir is TRUE and no named 5028 attribute directory exists. The ability to check whether 5029 or not a named attribute directory exists depends on the 5030 ability to look it up; therefore, users also need the 5031 ACE4_READ_NAMED_ATTRS permission in order to create a 5032 named attribute directory. 5034 ACE4_EXECUTE 5035 Operation(s) affected: 5036 LOOKUP 5037 READ 5038 OPEN 5039 Discussion: 5040 Permission to execute a file or traverse/search a 5041 directory.
5043 Servers SHOULD allow a user the ability to read the data 5044 of the file when only the ACE4_EXECUTE access mask bit is 5045 allowed. This is because there is no way to execute a 5046 file without reading the contents. Though a server may 5047 treat ACE4_EXECUTE and ACE4_READ_DATA bits identically 5048 when deciding to permit a READ operation, it SHOULD still 5049 allow the two bits to be set independently in ACLs, and 5050 MUST distinguish between them when replying to ACCESS 5051 operations. In particular, servers SHOULD NOT silently 5052 turn on one of the two bits when the other is set, as 5053 that would make it impossible for the client to correctly 5054 enforce the distinction between read and execute 5055 permissions. 5057 As an example, following a SETATTR of the following ACL: 5058 nfsuser:ACE4_EXECUTE:ALLOW 5060 A subsequent GETATTR of ACL for that file SHOULD return: 5061 nfsuser:ACE4_EXECUTE:ALLOW 5062 Rather than: 5063 nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW 5065 ACE4_DELETE_CHILD 5066 Operation(s) affected: 5067 REMOVE 5069 Discussion: 5070 Permission to delete a file or directory within a 5071 directory. See section "ACE4_DELETE vs. ACE4_DELETE_CHILD" 5072 for information on how these two access mask bits interact. 5074 ACE4_READ_ATTRIBUTES 5075 Operation(s) affected: 5076 GETATTR of file system object attributes 5077 Discussion: 5078 The ability to read basic attributes (non-ACLs) of a file. 5079 On a UNIX system, basic attributes can be thought of as 5080 the stat level attributes. Allowing this access mask bit 5081 would mean the entity can execute "ls -l" and stat. 5083 ACE4_WRITE_ATTRIBUTES 5084 Operation(s) affected: 5085 SETATTR of time_access_set, time_backup, 5086 time_create, time_modify_set, mimetype, hidden, system 5087 Discussion: 5088 Permission to change the times associated with a file 5089 or directory to an arbitrary value. Also permission 5090 to change the mimetype, hidden and system attributes. 
5091 A user having ACE4_WRITE_DATA permission, but lacking 5092 ACE4_WRITE_ATTRIBUTES, must be allowed to implicitly set 5093 the times associated with a file. 5095 ACE4_WRITE_RETENTION 5096 Operation(s) affected: 5097 SETATTR of retention_set, retentevt_set. 5098 Discussion: 5099 Permission to modify the durations of event and non-event-based 5100 retention. Also permission to enable event and non-event-based 5101 retention. A server MAY map ACE4_WRITE_ATTRIBUTES to 5102 ACE4_WRITE_RETENTION. 5104 ACE4_WRITE_RETENTION_HOLD 5105 Operation(s) affected: 5106 SETATTR of retention_hold. 5107 Discussion: 5108 Permission to modify the administrative retention holds. 5109 A server MAY map ACE4_WRITE_ATTRIBUTES to 5110 ACE4_WRITE_RETENTION_HOLD. 5112 ACE4_DELETE 5113 Operation(s) affected: 5114 REMOVE 5115 Discussion: 5116 Permission to delete the file or directory. See section 5117 "ACE4_DELETE vs. ACE4_DELETE_CHILD" for information on how 5118 these two access mask bits interact. 5120 ACE4_READ_ACL 5121 Operation(s) affected: 5122 GETATTR of acl 5123 Discussion: 5124 Permission to read the ACL. 5126 ACE4_WRITE_ACL 5127 Operation(s) affected: 5128 SETATTR of acl and mode 5129 Discussion: 5130 Permission to write the acl and mode attributes. 5132 ACE4_WRITE_OWNER 5133 Operation(s) affected: 5134 SETATTR of owner and owner_group 5135 Discussion: 5136 Permission to write the owner and owner_group attributes. 5137 On UNIX systems, this is the ability to execute chown(). 5139 ACE4_SYNCHRONIZE 5140 Operation(s) affected: 5141 NONE 5142 Discussion: 5143 Permission to access the file locally at the server with 5144 synchronized reads and writes. 5146 Server implementations need not provide the granularity of control 5147 that is implied by this list of masks.
For example, POSIX-based 5148 systems might not distinguish ACE4_APPEND_DATA (the ability to append 5149 to a file) from ACE4_WRITE_DATA (the ability to modify existing 5150 contents); both masks would be tied to a single "write" permission. 5151 When such a server returns attributes to the client, it would show 5152 both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the write 5153 permission is enabled. 5155 If a server receives a SETATTR request that it cannot accurately 5156 implement, it should error in the direction of more restricted 5157 access. For example, suppose a server cannot distinguish overwriting 5158 data from appending new data, as described in the previous paragraph. 5159 If a client submits an ACE where ACE4_APPEND_DATA is set but 5160 ACE4_WRITE_DATA is not (or vice versa), the server should reject the 5161 request with NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type 5162 DENY, the server may silently turn on the other bit, so that both 5163 ACE4_APPEND_DATA and ACE4_WRITE_DATA are denied. 5165 6.2.1.3.2. ACE4_DELETE vs. ACE4_DELETE_CHILD 5167 Two access mask bits govern the ability to delete a file or directory 5168 object: ACE4_DELETE on the object itself, and ACE4_DELETE_CHILD on 5169 the object's parent directory. 5171 Many systems also consult the "sticky bit" (MODE4_SVTX) and write 5172 mode bit on the parent directory when determining whether to allow a 5173 file to be deleted. The mode bit for write corresponds to 5174 ACE4_WRITE_DATA, which is the same physical bit as ACE4_ADD_FILE. 5175 Therefore, ACE4_ADD_FILE can come into play when determining 5176 permission to delete. 5178 In the algorithm below, the strategy is that ACE4_DELETE and 5179 ACE4_DELETE_CHILD take precedence over the sticky bit, and the sticky 5180 bit takes precedence over the "write" mode bits (reflected in 5181 ACE4_ADD_FILE). 5183 Server implementations SHOULD grant or deny permission to delete 5184 based on the following algorithm. 
5186 if ACE4_EXECUTE is denied by the parent directory ACL: 5187 deny delete 5188 else if ACE4_DELETE is allowed by the target object ACL: 5189 allow delete 5190 else if ACE4_DELETE_CHILD is allowed by the parent 5191 directory ACL: 5192 allow delete 5193 else if ACE4_DELETE_CHILD is denied by the 5194 parent directory ACL: 5195 deny delete 5196 else if ACE4_ADD_FILE is allowed by the parent directory ACL: 5197 if MODE4_SVTX is set for the parent directory: 5198 if the principal owns the parent directory OR 5199 the principal owns the target object OR 5200 ACE4_WRITE_DATA is allowed by the target 5201 object ACL: 5202 allow delete 5203 else: 5204 deny delete 5205 else: 5206 allow delete 5207 else: 5208 deny delete 5210 6.2.1.4. ACE flag 5212 The bitmask constants used for the flag field are as follows: 5214 const ACE4_FILE_INHERIT_ACE = 0x00000001; 5215 const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; 5216 const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; 5217 const ACE4_INHERIT_ONLY_ACE = 0x00000008; 5218 const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; 5219 const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; 5220 const ACE4_IDENTIFIER_GROUP = 0x00000040; 5221 const ACE4_INHERITED_ACE = 0x00000080; 5223 A server need not support any of these flags. If the server supports 5224 flags that are similar to, but not exactly the same as, these flags, 5225 the implementation may define a mapping between the protocol-defined 5226 flags and the implementation-defined flags. Again, the guiding 5227 principle is that the file not appear to be more secure than it 5228 really is. 5230 For example, suppose a client tries to set an ACE with 5231 ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the 5232 server does not support any form of ACL inheritance, the server 5233 should reject the request with NFS4ERR_ATTRNOTSUPP. 
If the server 5234 supports a single "inherit ACE" flag that applies to both files and 5235 directories, the server may reject the request (i.e., requiring the 5236 client to set both the file and directory inheritance flags). The 5237 server may also accept the request and silently turn on the 5238 ACE4_DIRECTORY_INHERIT_ACE flag. 5240 6.2.1.4.1. Discussion of Flag Bits 5242 ACE4_FILE_INHERIT_ACE 5243 Can be placed on a directory and indicates that this ACE should be 5244 added to each new non-directory file created. 5246 ACE4_DIRECTORY_INHERIT_ACE 5247 Can be placed on a directory and indicates that this ACE should be 5248 added to each new directory created. 5250 ACE4_INHERIT_ONLY_ACE 5251 Can be placed on a directory but does not apply to the directory; 5252 ALLOW and DENY ACEs with this bit set do not affect access to the 5253 directory, and AUDIT and ALARM ACEs with this bit set do not 5254 trigger log or alarm events. Such ACEs only take effect once they 5255 are applied (with this bit cleared) to newly created files and 5256 directories as specified by the above two flags. 5258 ACE4_NO_PROPAGATE_INHERIT_ACE 5259 Can be placed on a directory. This flag tells the server that 5260 inheritance of this ACE should stop at newly created child 5261 directories. 5263 ACE4_INHERITED_ACE 5264 Indicates that this ACE is inherited from a parent directory. A 5265 server that supports automatic inheritance will place this flag on 5266 any ACEs inherited from the parent directory when creating a new 5267 object. Client applications will use this to perform automatic 5268 inheritance. Clients and servers MUST clear this bit in the acl 5269 attribute; it may only be used in the dacl and sacl attributes. 
5271 ACE4_SUCCESSFUL_ACCESS_ACE_FLAG 5273 ACE4_FAILED_ACCESS_ACE_FLAG 5274 The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and 5275 ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits relate only to 5276 ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE 5277 (ALARM) ACE types. If, during the processing of the file's ACL, 5278 the server encounters an AUDIT or ALARM ACE that matches the 5279 principal attempting the OPEN, the server notes that fact, and the 5280 presence, if any, of the SUCCESS and FAILED flags encountered in 5281 the AUDIT or ALARM ACE. Once the server completes the ACL 5282 processing, it then notes if the operation succeeded or failed. 5283 If the operation succeeded, and if the SUCCESS flag was set for a 5284 matching AUDIT or ALARM ACE, then the appropriate AUDIT or ALARM 5285 event occurs. If the operation failed, and if the FAILED flag was 5286 set for the matching AUDIT or ALARM ACE, then the appropriate 5287 AUDIT or ALARM event occurs. Either or both of the SUCCESS and 5288 FAILED flags can be set, but if neither is set, the AUDIT or ALARM ACE 5289 is not useful. 5291 The previously described processing applies to the ACCESS 5292 operation as well, the difference being that "success" and 5293 "failure" do not refer to whether ACCESS returns NFS4_OK. 5294 Success means that ACCESS returned all requested and supported 5295 bits; failure means that ACCESS failed to return a bit that 5296 was requested and supported. 5298 ACE4_IDENTIFIER_GROUP 5299 Indicates that the "who" refers to a GROUP as defined under UNIX 5300 or a GROUP ACCOUNT as defined under Windows. Clients and servers 5301 MUST ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who 5302 value equal to one of the special identifiers outlined in 5303 Section 6.2.1.5. 5305 6.2.1.5. ACE Who 5307 The "who" field of an ACE is an identifier that specifies the 5308 principal or principals to whom the ACE applies.
It may refer to a 5309 user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying 5310 which. 5312 There are several special identifiers that need to be understood 5313 universally, rather than in the context of a particular DNS domain. 5314 Some of these identifiers cannot be understood when an NFS client 5315 accesses the server, but have meaning when a local process accesses 5316 the file. The ability to display and modify these permissions is 5317 permitted over NFS, even if none of the access methods on the server 5318 understands the identifiers. 5320 +---------------+--------------------------------------------------+ 5321 | Who | Description | 5322 +---------------+--------------------------------------------------+ 5323 | OWNER | The owner of the file. | 5324 | GROUP | The group associated with the file. | 5325 | EVERYONE | The world, including the owner and owning group. | 5326 | INTERACTIVE | Accessed from an interactive terminal. | 5327 | NETWORK | Accessed via the network. | 5328 | DIALUP | Accessed as a dialup user to the server. | 5329 | BATCH | Accessed from a batch job. | 5330 | ANONYMOUS | Accessed without any authentication. | 5331 | AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS). | 5332 | SERVICE | Access from a system service. | 5333 +---------------+--------------------------------------------------+ 5335 Table 7 5337 To avoid conflict, these special identifiers are distinguished by an 5338 appended "@" and should appear in the form "xxxx@" (note: no domain 5339 name after the "@"). For example: ANONYMOUS@. 5341 6.2.1.5.1. Discussion of EVERYONE@ 5343 It is important to note that "EVERYONE@" is not equivalent to the 5344 UNIX "other" entity. This is because, by definition, UNIX "other" 5345 does not include the owner or owning group of a file. "EVERYONE@" 5346 means literally everyone, including the owner or owning group. 5348 6.2.2.
dacl and sacl Attributes 5350 The dacl and sacl attributes are like the acl attribute, but dacl and 5351 sacl each allow only certain types of ACEs. The dacl attribute 5352 allows just ALLOW and DENY ACEs. The sacl attribute allows just 5353 AUDIT and ALARM ACEs. The dacl and sacl attributes also have 5354 improved support for automatic inheritance (see Section 6.4.3.2). 5355 The separation of ACE types and inheritance support make dacl and 5356 sacl a better choice (over acl) for clients when setting ACEs on a 5357 file. 5359 6.2.3. mode Attribute 5361 The NFS version 4 mode attribute is based on the UNIX mode bits. The 5362 following bits are defined: 5364 const MODE4_SUID = 0x800; /* set user id on execution */ 5365 const MODE4_SGID = 0x400; /* set group id on execution */ 5366 const MODE4_SVTX = 0x200; /* save text even after use */ 5367 const MODE4_RUSR = 0x100; /* read permission: owner */ 5368 const MODE4_WUSR = 0x080; /* write permission: owner */ 5369 const MODE4_XUSR = 0x040; /* execute permission: owner */ 5370 const MODE4_RGRP = 0x020; /* read permission: group */ 5371 const MODE4_WGRP = 0x010; /* write permission: group */ 5372 const MODE4_XGRP = 0x008; /* execute permission: group */ 5373 const MODE4_ROTH = 0x004; /* read permission: other */ 5374 const MODE4_WOTH = 0x002; /* write permission: other */ 5375 const MODE4_XOTH = 0x001; /* execute permission: other */ 5377 Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal 5378 identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and 5379 MODE4_XGRP apply to principals identified in the owner_group 5380 attribute but who are not identified in the owner attribute. Bits 5381 MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any principal that does 5382 not match that in the owner attribute, and does not have a group 5383 matching that of the owner_group attribute. 5385 Bits within the mode other than those specified above are not defined 5386 by this protocol. 
A server MUST NOT return bits other than those 5387 defined above in a GETATTR or READDIR operation, and it MUST return 5388 NFS4ERR_INVAL if bits other than those defined above are set in a 5389 SETATTR, CREATE, or OPEN operation. 5391 6.2.4. mode_set_masked Attribute 5393 The mode_set_masked attribute is a write-only attribute that allows 5394 individual bits in the mode attribute to be set or reset, without 5395 changing others. It allows, for example, the bits MODE4_SUID, 5396 MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified 5397 any of the nine low-order mode bits devoted to permissions. 5399 The mode_set_masked attribute consists of two words, each in the form 5400 of a mode4. The first consists of the value to be applied to the 5401 current mode value and the second is a mask. Only bits set to one in 5402 the mask word are changed (set or reset) in the file's mode. All 5403 other bits in the mode remain unchanged. Bits in the first word that 5404 correspond to bits which are zero in the mask are ignored, except 5405 that undefined bits are checked for validity and can result in 5406 NFS4ERR_INVAL as described below. 5408 The mode_set_masked attribute is only valid in a SETATTR operation. 5409 If it is used in a CREATE or OPEN operation, the server MUST return 5410 NFS4ERR_INVAL. 5412 Bits not defined as valid in the mode attribute are not valid in 5413 either word of the mode_set_masked attribute. The server MUST return 5414 NFS4ERR_INVAL if any of those bits are set in a SETATTR. If the mode and 5415 mode_set_masked attributes are both specified in the same SETATTR, 5416 the server MUST also return NFS4ERR_INVAL. 5418 6.3. Common Methods 5420 The requirements in this section will be referred to in future 5421 sections, especially Section 6.4. 5423 6.3.1. Interpreting an ACL 5425 6.3.1.1. Server Considerations 5427 The server uses the algorithm described in Section 6.2.1 to determine 5428 whether an ACL allows access to an object.
However, the ACL may not 5429 be the sole determiner of access. For example: 5431 o In the case of a file system exported as read-only, the server may 5432 deny write permissions even though an object's ACL grants it. 5434 o Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL 5435 permissions in order to prevent the owner from getting into a 5436 situation where they can never modify the ACL. 5438 o Many servers will allow a user to read the data of the 5439 file when only the execute permission is granted (i.e., if the ACL 5440 denies the user ACE4_READ_DATA access but allows the user 5441 ACE4_EXECUTE, the server will allow the user to read the data of 5442 the file). 5444 o Many servers have the notion of owner-override in which the owner 5445 of the object is allowed to override accesses that are denied by 5446 the ACL. This may be helpful, for example, to allow users 5447 continued access to open files on which the permissions have 5448 changed. 5450 6.3.1.2. Client Considerations 5452 Clients SHOULD NOT do their own access checks based on their 5453 interpretation of the ACL, but rather use the OPEN and ACCESS operations 5454 to do access checks. This allows the client to act on the results of 5455 having the server determine whether or not access should be granted 5456 based on its interpretation of the ACL. 5458 Clients must be aware of situations in which an object's ACL will 5459 define a certain access even though the server will not enforce it. 5460 In general, but especially in these situations, the client needs to 5461 do its part in the enforcement of access as defined by the ACL. To 5462 do this, the client MAY issue the appropriate ACCESS operation prior 5463 to servicing the request of the user or application in order to 5464 determine whether the user or application should be granted the 5465 access requested. For examples in which the ACL may define accesses 5466 that the server doesn't enforce, see Section 6.3.1.1.
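The ordered ACE evaluation of Section 6.2.1, combined with a server-side override such as a read-only export, can be sketched as follows. This is an illustrative model only, with invented helper names, and is not a normative algorithm:

```python
# Sketch of server-side access checking (Sections 6.2.1 and 6.3.1.1).
# An ACE is modeled as (ace_type, who, access_mask); ACEs are evaluated
# in order, and the ACL is not the sole determiner of access.

ACE4_ACCESS_ALLOWED_ACE_TYPE = 0
ACE4_ACCESS_DENIED_ACE_TYPE = 1

ACE4_READ_DATA = 0x00000001
ACE4_WRITE_DATA = 0x00000002
ACE4_APPEND_DATA = 0x00000004

def acl_check(acl, who_matches, requested):
    """Ordered ACE walk: a DENY ACE denies any still-needed bit it names;
    ALLOW ACEs accumulate granted bits until all requested bits are granted."""
    granted = 0
    for ace_type, who, mask in acl:
        if not who_matches(who):
            continue
        needed = requested & ~granted
        if ace_type == ACE4_ACCESS_DENIED_ACE_TYPE and (mask & needed):
            return False
        if ace_type == ACE4_ACCESS_ALLOWED_ACE_TYPE:
            granted |= mask & requested
            if granted == requested:
                return True
    return granted == requested

def server_access_check(acl, who_matches, requested, export_read_only):
    """The ACL is not the sole determiner: a read-only export denies
    modification even when the ACL would grant it."""
    if export_read_only and requested & (ACE4_WRITE_DATA | ACE4_APPEND_DATA):
        return False
    return acl_check(acl, who_matches, requested)
```

A server would extend `server_access_check` with further local policy, such as the owner-override behavior mentioned above.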
5468 6.3.2. Computing a Mode Attribute from an ACL 5470 The following method can be used to calculate the MODE4_R*, MODE4_W*, 5471 and MODE4_X* bits of a mode attribute, based upon an ACL. 5473 1. To determine MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH: 5475 1. If the special identifier EVERYONE@ is granted 5476 ACE4_READ_DATA, then the bit MODE4_ROTH SHOULD be set. 5477 Otherwise, MODE4_ROTH SHOULD NOT be set. 5479 2. If the special identifier EVERYONE@ is granted 5480 ACE4_WRITE_DATA or ACE4_APPEND_DATA, then the bit MODE4_WOTH 5481 SHOULD be set. Otherwise, MODE4_WOTH SHOULD NOT be set. 5483 3. If the special identifier EVERYONE@ is granted ACE4_EXECUTE, 5484 then the bit MODE4_XOTH SHOULD be set. Otherwise, MODE4_XOTH 5485 SHOULD NOT be set. 5487 2. To determine MODE4_RGRP, MODE4_WGRP, and MODE4_XGRP, note that 5488 the EVERYONE@ special identifier SHOULD be taken into account. 5489 In other words, when determining if the GROUP@ special identifier 5490 is granted a permission, ACEs with the identifier EVERYONE@ 5491 should take effect just as ACEs with the special identifier 5492 GROUP@ would. 5494 1. If the special identifier GROUP@ is granted ACE4_READ_DATA, 5495 then the bit MODE4_RGRP SHOULD be set. Otherwise, MODE4_RGRP 5496 SHOULD NOT be set. 5498 2. If the special identifier GROUP@ is granted ACE4_WRITE_DATA 5499 or ACE4_APPEND_DATA, then the bit MODE4_WGRP SHOULD be set. 5500 Otherwise, MODE4_WGRP SHOULD NOT be set. 5502 3. If the special identifier GROUP@ is granted ACE4_EXECUTE, 5503 then the bit MODE4_XGRP SHOULD be set. Otherwise, MODE4_XGRP 5504 SHOULD NOT be set. 5506 3. To determine MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR, note that 5507 the EVERYONE@ special identifier SHOULD be taken into account. 5508 In other words, when determining if the OWNER@ special identifier 5509 is granted a permission, ACEs with the identifier EVERYONE@ 5510 should take effect just as ACEs with the special identifier OWNER@ 5511 would. 5513 1.
If the special identifier OWNER@ is granted ACE4_READ_DATA, 5514 then the bit MODE4_RUSR SHOULD be set. Otherwise, MODE4_RUSR 5515 SHOULD NOT be set. 5517 2. If the special identifier OWNER@ is granted ACE4_WRITE_DATA 5518 or ACE4_APPEND_DATA, then the bit MODE4_WUSR SHOULD be set. 5519 Otherwise, MODE4_WUSR SHOULD NOT be set. 5521 3. If the special identifier OWNER@ is granted ACE4_EXECUTE, 5522 then the bit MODE4_XUSR SHOULD be set. Otherwise, MODE4_XUSR 5523 SHOULD NOT be set. 5525 6.3.2.1. Discussion 5527 The nine low-order mode bits (MODE4_R*, MODE4_W*, MODE4_X*) 5528 correspond to ACE4_READ_DATA, ACE4_WRITE_DATA/ACE4_APPEND_DATA, and 5529 ACE4_EXECUTE for OWNER@, GROUP@, and EVERYONE@. On some 5530 implementations, mode bits may represent a superset of these 5531 permissions, e.g., if a specific user is granted ACE4_WRITE_DATA, then 5532 MODE4_WGRP will be set, even though the file's owner_group is not 5533 granted ACE4_WRITE_DATA. 5535 Server implementations are discouraged from doing this, as experience 5536 has shown that it is confusing and annoying to end users. The 5537 specifications above also discourage this practice to enforce the 5538 semantic that setting the mode attribute effectively specifies read, 5539 write, and execute for owner, group, and other. 5541 6.4. Requirements 5543 A server that supports both mode and ACL must take care to 5544 synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the 5545 ACEs which have respective who fields of "OWNER@", "GROUP@", and 5546 "EVERYONE@", so that the client can see that semantically equivalent access 5547 permissions exist whether it asks for the owner, owner_group, and 5548 mode attributes, or for just the ACL. 5550 In this section, much is made of the methods in Section 6.3.2. Many 5551 requirements refer to this section. But note that the methods have 5552 behaviors specified with "SHOULD".
This is intentional, to avoid 5553 invalidating existing implementations that compute the mode according 5554 to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by 5555 actual permissions on owner, group, and other. 5557 6.4.1. Setting the mode and/or ACL Attributes 5559 6.4.1.1. Setting mode and not ACL 5561 When any mode permission bits are subject to change, either because 5562 the mode attribute was set or because the mode_set_masked attribute 5563 was set and the mask included one or more bits from the low-order 5564 nine mode bits that control permissions, and the ACL attribute is not 5565 explicitly set, the ACL attribute must be modified in accordance with 5566 the updated value of the permissions bits within the mode. This must 5567 happen even if the value of the permission bits within the mode is 5568 the same after the mode is set as before. 5570 In cases in which the permissions bits are subject to change, the ACL 5571 attribute MUST be modified such that the mode computed via the method 5572 in Section 6.3.2 yields the low-order nine bits (MODE4_R*, MODE4_W*, 5573 MODE4_X*) of the mode attribute as modified by the attribute change. 5574 The ACL SHOULD also be modified such that: 5576 1. If MODE4_RGRP is not set, entities explicitly listed in the ACL 5577 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 5578 ACE4_READ_DATA. 5580 2. If MODE4_WGRP is not set, entities explicitly listed in the ACL 5581 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 5582 ACE4_WRITE_DATA or ACE4_APPEND_DATA. 5584 3. If MODE4_XGRP is not set, entities explicitly listed in the ACL 5585 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 5586 ACE4_EXECUTE. 5588 Access mask bits other than those listed above, appearing in ALLOW ACEs, 5589 MAY also be disabled. 5591 Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect 5592 the permissions of the ACL itself, nor do ACEs of the type AUDIT and 5593 ALARM.
As such, it is desirable to leave these ACEs unmodified when 5594 modifying the ACL attribute. 5596 Also note that the requirement may be met by discarding the ACL, in 5597 favor of an ACL that represents the mode and only the mode. This is 5598 permitted, but it is preferable for a server to preserve as much of 5599 the ACL as possible without violating the above requirements. 5600 Discarding the ACL makes it effectively impossible for a file created 5601 with a mode attribute to inherit an ACL (see Section 6.4.3). 5603 6.4.1.2. Setting ACL and not mode 5605 When setting an ACL attribute and not setting the mode or 5606 mode_set_masked attributes, the permission bits of the mode need to 5607 be derived from the ACL. In this case, the ACL attribute SHOULD be 5608 set as given. The nine low-order bits of the mode attribute 5609 (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result 5610 of the method in Section 6.3.2. The three high-order bits of the mode 5611 (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged. 5613 6.4.1.3. Setting both ACL and mode 5615 When setting both the mode (using either the mode attribute 5616 or the mode_set_masked attribute) and the ACL attribute in the same 5617 operation, the attributes MUST be applied in this order: mode (or 5618 mode_set_masked), then ACL. The mode-related attribute is set as 5619 given, then the ACL attribute is set as given, possibly changing the 5620 final mode, as described above in Section 6.4.1.2. 5622 6.4.2. Retrieving the mode and/or ACL Attributes 5624 This section applies only to servers that support both the mode and 5625 the ACL attribute. 5627 Some server implementations may have a concept of "objects without 5628 ACLs", meaning that all permissions are granted and denied according 5629 to the mode attribute, and that no ACL attribute is stored for that 5630 object.
If an ACL attribute is requested of such a server, the 5631 server SHOULD return an ACL that does not conflict with the mode; 5632 that is to say, the ACL returned SHOULD represent the nine low-order 5633 bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as 5634 described in Section 6.3.2. 5636 For other server implementations, the ACL attribute is always present 5637 for every object. Such servers SHOULD store at least the three high- 5638 order bits of the mode attribute (MODE4_SUID, MODE4_SGID, 5639 MODE4_SVTX). The server SHOULD return a mode attribute if one is 5640 requested, and the low-order nine bits of the mode (MODE4_R*, 5641 MODE4_W*, MODE4_X*) MUST match the result of applying the method in 5642 Section 6.3.2 to the ACL attribute. 5644 6.4.3. Creating New Objects 5646 If a server supports the ACL attribute, it may use the ACL attribute 5647 on the parent directory to compute an initial ACL attribute for a 5648 newly created object. This will be referred to as the inherited ACL 5649 within this section. The act of adding one or more ACEs to the 5650 inherited ACL that are based upon ACEs in the parent directory's ACL 5651 will be referred to as inheriting an ACE within this section. 5653 Implementors should standardize on what the behavior of CREATE and 5654 OPEN must be depending on the presence or absence of the mode and ACL 5655 attributes. 5657 1. If just mode is given: 5659 In this case, inheritance SHOULD take place, but the mode MUST be 5660 applied to the inherited ACL as described in Section 6.4.1.1, 5661 thereby modifying the ACL. 5663 2. If just ACL is given: 5665 In this case, inheritance SHOULD NOT take place, and the ACL as 5666 defined in the CREATE or OPEN will be set without modification, 5667 and the mode modified as in Section 6.4.1.2 5669 3. If both mode and ACL are given: 5671 In this case, inheritance SHOULD NOT take place, and both 5672 attributes will be set as described in Section 6.4.1.3. 5674 4. 
If neither mode nor ACL is given: 5676 In the case where an object is being created without any initial 5677 attributes at all, e.g., an OPEN operation with an opentype4 of 5678 OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD 5679 NOT take place. Instead, the server SHOULD set permissions to 5680 deny all access to the newly created object. It is expected that 5681 the appropriate client will set the desired attributes in a 5682 subsequent SETATTR operation, and the server SHOULD allow that 5683 operation to succeed, regardless of what permissions the object 5684 is created with. For example, an empty ACL denies all 5685 permissions, but the server should allow the owner's SETATTR to 5686 succeed even though WRITE_ACL is implicitly denied. 5688 In other cases, inheritance SHOULD take place, and no 5689 modifications to the ACL will happen. The mode attribute, if 5690 supported, MUST be as computed in Section 6.3.2, with the 5691 MODE4_SUID, MODE4_SGID, and MODE4_SVTX bits clear. It is worth 5692 noting that if no inheritable ACEs exist on the parent directory, 5693 the file will be created with an empty ACL, thus granting no 5694 access. 5696 6.4.3.1. The Inherited ACL 5698 If the object being created is not a directory, the inherited ACL 5699 SHOULD NOT inherit ACEs from the parent directory ACL unless the 5700 ACE4_FILE_INHERIT_ACE flag is set. 5702 If the object being created is a directory, the inherited ACL should 5703 inherit all inheritable ACEs from the parent directory, i.e., those that 5704 have the ACE4_FILE_INHERIT_ACE or ACE4_DIRECTORY_INHERIT_ACE flag set. 5705 If the inheritable ACE has ACE4_FILE_INHERIT_ACE set, but 5706 ACE4_DIRECTORY_INHERIT_ACE is clear, the inherited ACE on the newly 5707 created directory MUST have the ACE4_INHERIT_ONLY_ACE flag set to 5708 prevent the directory from being affected by ACEs meant for 5709 non-directories.
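These inheritance rules can be sketched as follows. This is an illustrative, simplified model in which ACEs are represented as dictionaries; the optional effective/inherit-only ACE pair that a server MAY create on new directories is not shown:

```python
# Sketch of computing the inherited ACL for a new object (Section 6.4.3.1).
ACE4_FILE_INHERIT_ACE = 0x00000001
ACE4_DIRECTORY_INHERIT_ACE = 0x00000002
ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004
ACE4_INHERIT_ONLY_ACE = 0x00000008

INHERITANCE_FLAGS = (ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE |
                     ACE4_NO_PROPAGATE_INHERIT_ACE | ACE4_INHERIT_ONLY_ACE)

def inherited_acl(parent_acl, is_directory):
    """Return the inherited ACL for a newly created object (sketch)."""
    result = []
    for ace in parent_acl:
        flag = ace["flag"]
        if not (flag & (ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE)):
            continue  # not an inheritable ACE
        if not is_directory:
            # Non-directories inherit only ACEs with ACE4_FILE_INHERIT_ACE;
            # inheritance flags have no meaning on the new file itself.
            if flag & ACE4_FILE_INHERIT_ACE:
                result.append(dict(ace, flag=flag & ~INHERITANCE_FLAGS))
        elif (flag & ACE4_FILE_INHERIT_ACE) and not (flag & ACE4_DIRECTORY_INHERIT_ACE):
            # Meant for non-directories only: keep it inheritable, but
            # shield the new directory itself from its effect.
            result.append(dict(ace, flag=flag | ACE4_INHERIT_ONLY_ACE))
        elif flag & ACE4_NO_PROPAGATE_INHERIT_ACE:
            # Effective on this directory, but inheritance stops here.
            result.append(dict(ace, flag=flag & ~INHERITANCE_FLAGS))
        else:
            result.append(dict(ace, flag=flag))
    return result
```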
5711 When a new directory is created and inherits ACEs from its 5712 parent, then for each inheritable ACE that affects the directory's 5713 permissions, the server MAY create two ACEs on the directory being 5714 created: one effective and one that is only inheritable (i.e., has the 5715 ACE4_INHERIT_ONLY_ACE flag set). This gives the user and the server, 5716 in cases where certain permissions must be masked upon creation, 5717 the ability to modify the effective permissions without modifying the 5718 ACE that is to be inherited by the new directory's children. 5720 When a new object is created with attributes, and those 5721 attributes contain an ACL attribute and/or a mode attribute, the 5722 server MUST apply those attributes to the newly created object, as 5723 described in Section 6.4.1. 5725 6.4.3.2. Automatic Inheritance 5727 Unlike the acl attribute, the sacl and dacl (see Section 6.2.2) 5728 attributes both have an additional flag field. The flag field 5729 applies to the entire sacl or dacl; three flag values are defined: 5731 const ACL4_AUTO_INHERIT = 0x00000001; 5732 const ACL4_PROTECTED = 0x00000002; 5733 const ACL4_DEFAULTED = 0x00000004; 5735 and all other bits must be cleared. The ACE4_INHERITED_ACE flag may 5736 be set in the ACEs of the sacl or dacl (whereas it must always be 5737 cleared in the acl). 5739 Together these features allow a server to support automatic 5740 inheritance, which we now explain in more detail. 5742 Inheritable ACEs are normally inherited by child objects only at the 5743 time that the child objects are created; later modifications to 5744 inheritable ACEs do not result in modifications to inherited ACEs on 5745 descendants. 5747 However, the dacl and sacl provide an optional mechanism which allows 5748 a client application to propagate changes to inheritable ACEs to an 5749 entire directory hierarchy.
5751 A server that supports this performs inheritance at object creation 5752 time in the normal way, but also sets the ACE4_INHERITED_ACE flag on 5753 any inherited ACEs as they are added to the new object. 5755 A client application such as an ACL editor may then propagate changes 5756 to inheritable ACEs on a directory by recursively traversing that 5757 directory's descendants and modifying each ACL encountered to remove 5758 any ACEs with the ACE4_INHERITED_ACE flag and to replace them with 5759 the new inheritable ACEs (also with the ACE4_INHERITED_ACE flag set). 5760 It uses the existing ACE inheritance flags in the obvious way to decide 5761 which ACEs to propagate. (Note that it may encounter further 5762 inheritable ACEs when descending the directory hierarchy, and that 5763 those will also need to be taken into account when propagating 5764 inheritable ACEs to further descendants.) 5766 The reach of this propagation may be limited in two ways: first, 5767 automatic inheritance is not performed from any directory ACL that 5768 has the ACL4_AUTO_INHERIT flag cleared; and second, automatic 5769 inheritance stops wherever an ACL with the ACL4_PROTECTED flag is 5770 set, preventing modification of that ACL and also (if the ACL is set 5771 on a directory) of the ACL on any of the object's descendants. 5773 This propagation is performed independently for the sacl and the dacl 5774 attributes; thus the ACL4_AUTO_INHERIT and ACL4_PROTECTED flags may 5775 be independently set for the sacl and the dacl, and propagation of 5776 one type of acl may continue down a hierarchy even where propagation 5777 of the other acl has stopped. 5779 New objects should be created with a dacl and a sacl that both have 5780 the ACL4_PROTECTED flag cleared and the ACL4_AUTO_INHERIT flag set to 5781 the same value as that on, respectively, the dacl or sacl of the 5782 parent object.
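A client-side propagation pass of the kind described above might look like the following sketch. It is illustrative only: a real client would perform the traversal with READDIR and GETATTR/SETATTR of the dacl, and would apply the per-object inheritance rules of Section 6.4.3.1 rather than copying ACEs verbatim:

```python
# Sketch of client-driven automatic inheritance (Section 6.4.3.2).
# A directory is modeled as {"acl": {"flag": ..., "aces": [...]},
# "children": {name: directory, ...}}; field names are invented.
ACL4_AUTO_INHERIT = 0x00000001
ACL4_PROTECTED = 0x00000002
ACE4_INHERITED_ACE = 0x00000080

def propagate(directory, new_inheritable_aces):
    """Recursively push new inheritable ACEs down a directory tree,
    as an ACL-editing application might."""
    if not (directory["acl"]["flag"] & ACL4_AUTO_INHERIT):
        return  # this directory's ACL does not participate
    for child in directory["children"].values():
        acl = child["acl"]
        if acl["flag"] & ACL4_PROTECTED:
            continue  # propagation stops here; descendants stay untouched
        # Drop previously inherited ACEs, keep explicitly set ones ...
        acl["aces"] = [a for a in acl["aces"]
                       if not (a["flag"] & ACE4_INHERITED_ACE)]
        # ... and append the new ACEs, marked as inherited.
        acl["aces"] += [dict(a, flag=a["flag"] | ACE4_INHERITED_ACE)
                        for a in new_inheritable_aces]
        propagate(child, new_inheritable_aces)
```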
5784 Both the dacl and sacl attributes are RECOMMENDED, and a server may 5785 support one without supporting the other. 5787 A server that supports both the old acl attribute and one or both of 5788 the new dacl or sacl attributes must do so in such a way as to keep 5789 all three attributes consistent with each other. Thus the ACEs 5790 reported in the acl attribute should be the union of the ACEs 5791 reported in the dacl and sacl attributes, except that the 5792 ACE4_INHERITED_ACE flag must be cleared from the ACEs in the acl. 5793 And of course a client that queries only the acl will be unable to 5794 determine the values of the sacl or dacl flag fields. 5796 When a client performs a SETATTR for the acl attribute, the server 5797 SHOULD set the ACL4_PROTECTED flag to true on both the sacl and the 5798 dacl. By using the acl attribute, as opposed to the dacl or sacl 5799 attributes, the client signals that it may not understand automatic 5800 inheritance, and thus cannot be trusted to set an ACL for which 5801 automatic inheritance would make sense. 5803 When a client application queries an ACL, modifies it, and sets it 5804 again, it should leave any ACEs marked with ACE4_INHERITED_ACE 5805 unchanged, in their original order, at the end of the ACL. If the 5806 application is unable to do this, it should set the ACL4_PROTECTED 5807 flag. This behavior is not enforced by servers, but violations of 5808 this rule may lead to unexpected results when applications perform 5809 automatic inheritance. 5811 If a server also supports the mode attribute, it SHOULD set the mode 5812 in such a way that leaves inherited ACEs unchanged, in their original 5813 order, at the end of the ACL. If it is unable to do so, it SHOULD 5814 set the ACL4_PROTECTED flag on the file's dacl. 
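The consistency requirement between acl, dacl, and sacl, and the SETATTR behavior just described, can be sketched as follows (illustrative only; ACEs are modeled as dictionaries with invented field names):

```python
# Sketch of acl/dacl/sacl consistency (Section 6.4.3.2).
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0
ACE4_ACCESS_DENIED_ACE_TYPE = 1
ACE4_SYSTEM_AUDIT_ACE_TYPE = 2
ACE4_SYSTEM_ALARM_ACE_TYPE = 3
ACE4_INHERITED_ACE = 0x00000080
ACL4_PROTECTED = 0x00000002

def acl_attribute(dacl_aces, sacl_aces):
    """The acl attribute reports the union of the dacl and sacl ACEs,
    with ACE4_INHERITED_ACE cleared (the flag is valid only in dacl/sacl)."""
    return [dict(a, flag=a["flag"] & ~ACE4_INHERITED_ACE)
            for a in list(dacl_aces) + list(sacl_aces)]

def setattr_acl(obj, new_aces):
    """SETATTR of the legacy acl attribute: ALLOW/DENY ACEs go to the dacl,
    AUDIT/ALARM ACEs to the sacl, and both become PROTECTED, since the
    client has signaled it may not understand automatic inheritance."""
    obj["dacl"]["aces"] = [a for a in new_aces if a["type"] in
                           (ACE4_ACCESS_ALLOWED_ACE_TYPE,
                            ACE4_ACCESS_DENIED_ACE_TYPE)]
    obj["sacl"]["aces"] = [a for a in new_aces if a["type"] in
                           (ACE4_SYSTEM_AUDIT_ACE_TYPE,
                            ACE4_SYSTEM_ALARM_ACE_TYPE)]
    obj["dacl"]["flag"] |= ACL4_PROTECTED
    obj["sacl"]["flag"] |= ACL4_PROTECTED
```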
5816 Finally, in the case where the request that creates a new file or 5817 directory does not also set permissions for that file or directory, 5818 and there are also no ACEs to inherit from the parent directory, 5819 then the server's choice of ACL for the new object is implementation-dependent. 5820 In this case, the server SHOULD set the ACL4_DEFAULTED 5821 flag on the ACL it chooses for the new object. An application 5822 performing automatic inheritance takes the ACL4_DEFAULTED flag as a 5823 sign that the ACL should be completely replaced by one generated 5824 using the automatic inheritance rules. 5826 7. Single-server Name Space 5828 This chapter describes the NFSv4 single-server name space. Single-server 5829 namespaces may be presented directly to clients, or they may 5830 be used as a basis to form larger multi-server namespaces (e.g., site-wide 5831 or organization-wide) to be presented to clients, as described 5832 in Section 10. 5834 7.1. Server Exports 5836 On a UNIX server, the name space describes all the files reachable by 5837 pathnames under the root directory or "/". On a Windows NT server, 5838 the name space constitutes all the files on disks named by mapped 5839 disk letters. NFS server administrators rarely make the entire 5840 server's file system name space available to NFS clients. More often, 5841 portions of the name space are made available via an "export" 5842 feature. In previous versions of the NFS protocol, the root 5843 filehandle for each export is obtained through the MOUNT protocol; 5844 the client sends a string that identifies the exported portion of the name space 5845 and the server returns the root filehandle for it. The MOUNT 5846 protocol supports an EXPORTS procedure that will enumerate the 5847 server's exports. 5849 7.2.
Browsing Exports 5851 The NFS version 4 protocol provides a root filehandle that clients 5852 can use to obtain filehandles for the exports of a particular server, 5853 via a series of LOOKUP operations within a COMPOUND, to traverse a 5854 path. A common user experience is to use a graphical user interface 5855 (perhaps a file "Open" dialog window) to find a file via progressive 5856 browsing through a directory tree. The client must be able to move 5857 from one export to another export via single-component, progressive 5858 LOOKUP operations. 5860 This style of browsing is not well supported by the NFS version 2 and 5861 3 protocols. The client expects all LOOKUP operations to remain 5862 within a single server file system. For example, the device 5863 attribute will not change. This prevents a client from taking name 5864 space paths that span exports. 5866 An automounter on the client can obtain a snapshot of the server's 5867 name space using the EXPORTS procedure of the MOUNT protocol. If it 5868 understands the server's pathname syntax, it can create an image of 5869 the server's name space on the client. The parts of the name space 5870 that are not exported by the server are filled in with a "pseudo file 5871 system" that allows the user to browse from one mounted file system 5872 to another. There is a drawback to this representation of the 5873 server's name space on the client: it is static. If the server 5874 administrator adds a new export the client will be unaware of it. 5876 7.3. Server Pseudo File System 5878 NFS version 4 servers avoid this name space inconsistency by 5879 presenting all the exports for a given server within the framework of 5880 a single namespace, for that server. An NFS version 4 client uses 5881 LOOKUP and READDIR operations to browse seamlessly from one export to 5882 another. 
Portions of the server name space that are not exported are 5883 bridged via a "pseudo file system" that provides a view of exported 5884 directories only. A pseudo file system has a unique fsid and behaves 5885 like a normal, read-only file system. 5887 Based on the construction of the server's name space, it is possible 5888 that multiple pseudo file systems may exist. For example, 5890 /a pseudo file system 5891 /a/b real file system 5892 /a/b/c pseudo file system 5893 /a/b/c/d real file system 5895 Each of the pseudo file systems is considered a separate entity and 5896 therefore will have its own unique fsid. 5898 7.4. Multiple Roots 5900 The DOS and Windows operating environments are sometimes described as 5901 having "multiple roots". File systems are commonly represented as 5902 disk letters. MacOS represents file systems as top level names. NFS 5903 version 4 servers for these platforms can construct a pseudo file 5904 system above these root names so that disk letters or volume names 5905 are simply directory names in the pseudo root. 5907 7.5. Filehandle Volatility 5909 The nature of the server's pseudo file system is that it is a logical 5910 representation of file system(s) available from the server. 5911 Therefore, the pseudo file system is most likely constructed 5912 dynamically when the server is first instantiated. It is expected 5913 that the pseudo file system may not have an on-disk counterpart from 5914 which persistent filehandles could be constructed. Even though it is 5915 preferable that the server provide persistent filehandles for the 5916 pseudo file system, the NFS client should expect that pseudo file 5917 system filehandles are volatile. This can be confirmed by checking 5918 the associated "fh_expire_type" attribute for those filehandles in 5919 question. If the filehandles are volatile, the NFS client must be 5920 prepared to recover a filehandle value (e.g.
with a series of LOOKUP 5921 operations) when receiving an error of NFS4ERR_FHEXPIRED. 5923 7.6. Exported Root 5925 If the server's root file system is exported, one might conclude that 5926 a pseudo-file system is unneeded. This is not necessarily so. Assume 5927 the following file systems on a server: 5929 / disk1 (exported) 5930 /a disk2 (not exported) 5931 /a/b disk3 (exported) 5933 Because disk2 is not exported, disk3 cannot be reached with simple 5934 LOOKUPs. The server must bridge the gap with a pseudo-file system. 5936 7.7. Mount Point Crossing 5938 The server file system environment may be constructed in such a way 5939 that one file system contains a directory which is 'covered' or 5940 mounted upon by a second file system. For example: 5942 /a/b (file system 1) 5943 /a/b/c/d (file system 2) 5945 The pseudo file system for this server may be constructed to look 5946 like: 5948 / (place holder/not exported) 5949 /a/b (file system 1) 5950 /a/b/c/d (file system 2) 5952 It is the server's responsibility to present the pseudo file system 5953 that is complete to the client. If the client sends a lookup request 5954 for the path "/a/b/c/d", the server's response is the filehandle of 5955 the file system "/a/b/c/d". In previous versions of the NFS 5956 protocol, the server would respond with the filehandle of directory 5957 "/a/b/c/d" within the file system "/a/b". 5959 The NFS client will be able to determine if it crosses a server mount 5960 point by a change in the value of the "fsid" attribute. 5962 7.8. Security Policy and Name Space Presentation 5964 The application of the server's security policy needs to be carefully 5965 considered by the implementor. One may choose to limit the 5966 viewability of portions of the pseudo file system based on the 5967 server's perception of the client's ability to authenticate itself 5968 properly.
However, with the support of multiple security mechanisms 5969 and the ability to negotiate the appropriate use of these mechanisms, 5970 the server is unable to properly determine if a client will be able 5971 to authenticate itself. If, based on its policies, the server 5972 chooses to limit the contents of the pseudo file system, the server 5973 may effectively hide file systems from a client that may otherwise 5974 have legitimate access. 5976 As suggested practice, the server should apply the security policy of 5977 a shared resource in the server's namespace to the components of the 5978 resource's ancestors. For example: 5980 / 5981 /a/b 5982 /a/b/c 5984 The /a/b/c directory is a real file system and is the shared 5985 resource. The security policy for /a/b/c is Kerberos with integrity. 5986 The server should apply the same security policy to /, /a, and /a/b. 5987 This allows for the extension of the protection of the server's 5988 namespace to the ancestors of the real shared resource. 5990 For the case of the use of multiple, disjoint security mechanisms in 5991 the server's resources, the security for a particular object in the 5992 server's namespace should be the union of all security mechanisms of 5993 all direct descendants. 5995 8. File Locking and Share Reservations 5997 Integrating locking into the NFS protocol necessarily causes it to be 5998 stateful. With the inclusion of such features as share reservations, 5999 file and directory delegations, recallable layouts, and support for 6000 mandatory record locking, the protocol becomes substantially more 6001 dependent on state than the traditional combination of NFS and NLM 6002 [XNFS]. There are three components to making this state manageable: 6004 o Clear division between client and server 6006 o Ability to reliably detect inconsistency in state between client 6007 and server 6009 o Simple and robust recovery mechanisms 6011 In this model, the server owns the state information.
The client 6012 requests changes in locks and the server responds with the changes 6013 made. Non-client-initiated changes in locking state are infrequent 6014 and the client receives prompt notification of them and can adjust 6015 its view of the locking state to reflect the server's changes. 6017 To support Win32 share reservations it is necessary to provide 6018 operations which atomically OPEN or CREATE files. Having a separate 6019 share/unshare operation would not allow correct implementation of the 6020 Win32 OpenFile API. In order to correctly implement share semantics, 6021 the previous NFS protocol mechanisms used when a file is opened or 6022 created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS 6023 version 4.1 protocol defines an OPEN operation that looks up or creates 6024 a file and establishes locking state on the server. 6026 8.1. Locking 6028 It is assumed that manipulating a lock is rare when compared to READ 6029 and WRITE operations. It is also assumed that crashes and network 6030 partitions are relatively rare. Therefore it is important that the 6031 READ and WRITE operations have a lightweight mechanism to indicate if 6032 they possess a held lock. A lock request contains the heavyweight 6033 information required to establish a lock and uniquely define the lock 6034 owner. 6036 The following sections describe the transition from the heavyweight 6037 information to the eventual lightweight stateid used for most client 6038 and server locking interactions. 6040 8.1.1. Client and Session ID 6042 A client must establish a client ID (see Section 2.4) and then one or 6043 more sessionids (see Section 2.10) before performing any operations 6044 to open, lock, or delegate a file object. The sessionid serves as 6045 a shorthand reference to an NFSv4.1 client. 6047 8.1.2. State-owner Definition 6049 When opening a file or requesting a record lock, the client must 6050 specify an identifier which represents the owner of the requested 6051 lock.
This identifier is in the form of a state-owner, represented 6052 in the protocol by a state_owner4, a variable-length opaque array 6053 which, when concatenated with the current client ID, uniquely defines 6054 the owner of a lock managed by the client. This may be a thread id, 6055 process id, or other unique value. 6057 Owners of opens and owners of record locks are separate entities and 6058 remain separate even if the same opaque arrays are used to designate 6059 owners of each. The protocol distinguishes between open-owners 6060 (represented by open_owner4 structures) and lock-owners (represented 6061 by lock_owner4 structures). 6063 Each open is associated with a specific open-owner while each record 6064 lock is associated with a lock-owner and an open-owner, the latter 6065 being the open-owner associated with the open file under which the 6066 LOCK operation was done. Delegations and layouts, on the other hand, 6067 are not associated with a specific owner but are associated with the 6068 client as a whole. 6070 8.1.3. Stateid Definition 6072 When the server grants a lock of any type (including opens, record 6073 locks, delegations, and layouts) it responds with a unique stateid, 6074 that represents a set of locks (often a single lock) for the same 6075 file, of the same type, and sharing the same ownership 6076 characteristics. Thus opens of the same file by different open- 6077 owners each have an identifying stateid. Similarly, each set of 6078 record locks on a file owned by a specific lock-owner and obtained via 6079 an open for a specific open-owner, has its own identifying stateid. 6080 Delegations and layouts also have associated stateids by which they 6081 may be referenced. The stateid is used as a shorthand reference to a 6082 lock or set of locks and given a stateid the client can determine the 6083 associated state-owner or state-owners (in the case of an open-owner/ 6084 lock-owner pair) and the associated filehandle.
When stateids are 6085 used the current filehandle must be the one associated with that 6086 stateid. 6088 The server may assign stateids independently for different clients 6089 and a stateid with the same bit pattern for one client may designate 6090 an entirely different set of locks for a different client. The 6091 stateid is always interpreted with respect to the client ID 6092 associated with the current session. Stateids apply to all sessions 6093 associated with the given client ID and the client may use a stateid 6094 obtained from one session on another session associated with the same 6095 client ID. 6097 8.1.3.1. Stateid Structure 6099 Stateids are divided into two fields, a 96-bit "other" field 6100 identifying the specific set of locks and a 32-bit "seqid" sequence 6101 value. Except in the case of special stateids, to be discussed 6102 below, the purpose of the sequence value within NFSv4.1 is to allow 6103 the server to communicate to the client the order in which operations 6104 that modified locking state associated with a stateid have been 6105 processed. 6107 In the case of stateids associated with opens, i.e. the stateids 6108 returned by OPEN (the state for the open, rather than that for the 6109 delegation), OPEN_DOWNGRADE, or CLOSE, the server MUST provide a 6110 "seqid" value starting at one for the first use of a given "other" 6111 value and incremented by one with each subsequent operation returning 6112 a stateid. 6114 In the case of other sorts of stateids (i.e. stateids associated with 6115 record locks and delegations), the server MAY provide an incrementing 6116 sequence value on successive stateids returned with the same identifying 6117 field, or it may return the value zero. If it does return a non-zero 6118 "seqid" value it MUST start at one and be incremented by one with 6119 each subsequent operation returning a stateid with the same "other" 6120 value, just as is done with open state.
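The per-"other" sequencing rule above can be illustrated with a short sketch. This is not part of the protocol specification; the class and its method names are assumptions made only for illustration, and the full 96-bit "other" encoding is abstracted as an opaque key.

```python
class StateidTable:
    """Illustrative sketch: track the last "seqid" returned for each
    "other" value, per the rule that the first stateid for a given
    "other" carries seqid one and later ones increment by one."""

    def __init__(self):
        self._seqids = {}  # maps "other" value -> last returned seqid

    def next_stateid(self, other):
        # First use of a given "other" value yields seqid == 1; each
        # subsequent operation returning a stateid increments it.
        seqid = self._seqids.get(other, 0) + 1
        self._seqids[other] = seqid
        return (seqid, other)
```

Used this way, two opens sharing an "other" value see seqids 1 and 2, while a different "other" value starts again at 1.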
6122 The client, when using a stateid as a parameter to an operation, must, 6123 except in the case of a special stateid, set the sequence value to 6124 zero. If the value is non-zero, the server MUST return the error 6125 NFS4ERR_BAD_STATEID. 6127 8.1.3.2. Special Stateids 6129 Stateid values whose "other" field is either all zeros or all ones 6130 are reserved. They may not be assigned by the server but have 6131 special meanings defined by the protocol. The particular meaning 6132 depends on whether the "other" field is all zeros or all ones and the 6133 specific value of the "seqid" field. 6135 The following combinations of "other" and "seqid" are defined in 6136 NFSv4.1: 6138 o When "other" and "seqid" are both zero, the stateid is treated as 6139 a special anonymous stateid, which can be used in READ, WRITE, and 6140 SETATTR requests to indicate the absence of any open state 6141 associated with the request. When an anonymous stateid value is 6142 used, and an existing open denies the form of access requested, 6143 then access will be denied to the request. 6145 o When "other" and "seqid" are both all ones, the stateid is a 6146 special read bypass stateid. When this value is used in WRITE or 6147 SETATTR, it is treated like the anonymous value. When used in 6148 READ, the server MAY grant access, even if access would normally 6149 be denied to READ requests. 6151 o When "other" is zero and "seqid" is one, the stateid represents 6152 the current stateid, which is whatever value is the last stateid 6153 returned by an operation within the COMPOUND. In the case of an 6154 OPEN, the stateid returned for the open file, and not the 6155 delegation, is used. The stateid passed to the operation in place 6156 of the special value has its "seqid" value set to zero. If there 6157 is no operation in the COMPOUND which has returned a stateid 6158 value, the server MUST return the error NFS4ERR_BAD_STATEID.
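The special-stateid combinations can be sketched as a classification routine. This is illustrative only and not protocol text; the byte-level encoding of the 96-bit "other" field and the function name are assumptions. Reserved combinations other than those defined are rejected, matching the rule that they draw NFS4ERR_BAD_STATEID.

```python
ALL_ZEROS = b"\x00" * 12   # 96-bit "other" field of all zeros
ALL_ONES = b"\xff" * 12    # 96-bit "other" field of all ones

def classify_special(other, seqid):
    """Illustrative classification of a stateid per the special-stateid
    rules: returns "anonymous", "read-bypass", "current", or "normal",
    and raises ValueError for reserved combinations the server must
    reject with NFS4ERR_BAD_STATEID."""
    if other == ALL_ZEROS and seqid == 0:
        return "anonymous"            # no open state for this request
    if other == ALL_ONES and seqid == 0xFFFFFFFF:
        return "read-bypass"          # READ may bypass locking checks
    if other == ALL_ZEROS and seqid == 1:
        return "current"              # last stateid returned in COMPOUND
    if other in (ALL_ZEROS, ALL_ONES):
        raise ValueError("NFS4ERR_BAD_STATEID")
    return "normal"
```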
6160 If a stateid value is used which has all zero or all ones in the 6161 "other" field, but does not match one of the cases above, the server 6162 MUST return the error NFS4ERR_BAD_STATEID. 6164 Special stateids, unlike other stateids, are not associated with 6165 individual client IDs or filehandles and can be used with all valid 6166 client IDs and filehandles. In the case of a special stateid 6167 designating the current stateid, the current stateid value 6168 substituted for the special stateid is associated with a particular 6169 client ID and filehandle. 6171 8.1.3.3. Stateid Lifetime and Validation 6173 Stateids must remain valid until either a client reboot or a server 6174 reboot or until the client returns all of the locks associated with 6175 the stateid by means of an operation such as CLOSE or DELEGRETURN. 6176 If the locks are lost due to revocation the stateid remains a valid 6177 designation of that revoked state until the client frees it by using 6178 FREE_STATEID. Stateids associated with record locks are an 6179 exception. They remain valid even if a LOCKU frees all remaining 6180 locks, so long as the open file with which they are associated 6181 remains open, unless the client does a FREE_STATEID to cause the 6182 stateid to be freed. 6184 An "other" value must never be reused for a different purpose (i.e. 6185 different filehandle, owner, or type of locks) within the context of 6186 a single client ID. A server may retain the "other" value for the 6187 same purpose beyond the point where it may otherwise be freed but if 6188 it does so, it must maintain "seqid" continuity with previous values, 6189 in all cases in which it is required to return incrementing "seqid" 6190 values. 6192 One mechanism that may be used to satisfy the requirement that the 6193 server recognize invalid and out-of-date stateids is for the server 6194 to divide the "other" field of the stateid into two fields.
6196 o An index into a table of locking-state structures. 6198 o A generation number which is incremented on each allocation of a 6199 table entry for a particular use. 6201 And then store in each table entry, 6203 o The current generation number. 6205 o The client ID with which the stateid is associated. 6207 o The filehandle of the file on which the locks are taken. 6209 o An indication of the type of stateid (open, record lock, file 6210 delegation, directory delegation, layout). 6212 o The last "seqid" value returned corresponding to the current 6213 "other" value. 6215 With this information, the following procedure would be used to 6216 validate an incoming stateid and return an appropriate error, when 6217 necessary: 6219 o If the server has restarted resulting in loss of all leased state 6220 but the sessionid and client ID are still valid, return 6221 NFS4ERR_STALE_STATEID. (If server restart has resulted in an 6222 invalid client ID or an invalid sessionid, SEQUENCE will return an 6223 error - not NFS4ERR_STALE_STATEID - and the operation that takes a 6224 stateid as an argument will never be processed.) 6226 o If the "other" field is all zeros or all ones, check that the 6227 "other" and "seqid" match a defined combination for a special 6228 stateid and that that stateid can be used in the current context. 6229 If not, then return NFS4ERR_BAD_STATEID. 6231 o If the "seqid" field is not zero, return NFS4ERR_BAD_STATEID. 6233 o Otherwise divide the "other" into a table index and an entry 6234 generation. 6236 o If the table index field is outside the range of the associated 6237 table, return NFS4ERR_BAD_STATEID. 6239 o If the selected table entry is of a different generation than that 6240 specified in the incoming stateid, return NFS4ERR_BAD_STATEID. 6242 o If the selected table entry does not match the current file 6243 handle, return NFS4ERR_BAD_STATEID.
6245 o If the client ID in the table entry does not match the client ID 6246 associated with the current session, return NFS4ERR_BAD_STATEID. 6248 o If the stateid type is not valid for the context in which the 6249 stateid appears, return NFS4ERR_BAD_STATEID. 6251 o Otherwise, the stateid is valid and the table entry should contain 6252 any additional information about the associated set of locks, such 6253 as open-owner and lock-owner information, as well as information 6254 on the specific locks, such as open modes and octet ranges. 6256 8.1.4. Use of the Stateid and Locking 6258 All READ, WRITE and SETATTR operations contain a stateid. For the 6259 purposes of this section, SETATTR operations which change the size 6260 attribute of a file are treated as if they are writing the area 6261 between the old and new size (i.e. the range truncated or added to 6262 the file by means of the SETATTR), even where SETATTR is not 6263 explicitly mentioned in the text. 6265 If the state-owner performs a READ or WRITE in a situation in which 6266 it has established a lock or share reservation on the server (any 6267 OPEN constitutes a share reservation) the stateid (previously 6268 returned by the server) must be used to indicate what locks, 6269 including both record locks and share reservations, are held by the 6270 state-owner. If no state is established by the client, either a record 6271 lock or a share reservation, a special stateid for anonymous state 6272 (zero as "other" and "seqid") is used. Regardless of whether a stateid 6273 for anonymous state or a stateid returned by the server is used, if 6274 there is a conflicting share reservation or mandatory record lock 6275 held on the file, the server MUST refuse to service the READ or WRITE 6276 operation.
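The table-based validation procedure of Section 8.1.3.3 can be sketched as follows. This is an illustrative server-side outline, not protocol text: the split of the "other" field into index and generation, the table layout, and all names are assumptions, and special-stateid handling is assumed to have been performed beforehand.

```python
def validate_stateid(stateid, table, current_fh, session_client_id,
                     context_types, server_restarted):
    """Illustrative validation of a non-special stateid, following the
    checks of Section 8.1.3.3 in order.  Returns the matching table
    entry, or the name of the NFS4ERR_* error to return."""
    other, seqid = stateid
    if server_restarted:
        # Leased state lost, but sessionid/client ID still valid.
        return "NFS4ERR_STALE_STATEID"
    if seqid != 0:
        # Clients must present "seqid" == 0 for non-special stateids.
        return "NFS4ERR_BAD_STATEID"
    # Assumed encoding: high bits index the table, low 32 bits are the
    # entry generation number.
    index, generation = other >> 32, other & 0xFFFFFFFF
    if index >= len(table):
        return "NFS4ERR_BAD_STATEID"
    entry = table[index]
    if (entry["generation"] != generation
            or entry["filehandle"] != current_fh
            or entry["client_id"] != session_client_id
            or entry["type"] not in context_types):
        return "NFS4ERR_BAD_STATEID"
    return entry
```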
6278 Share reservations are established by OPEN operations and by their 6279 nature are mandatory in that when the OPEN denies READ or WRITE 6280 operations, that denial results in such operations being rejected 6281 with error NFS4ERR_LOCKED. Record locks may be implemented by the 6282 server as either mandatory or advisory, or the choice of mandatory or 6283 advisory behavior may be determined by the server on the basis of the 6284 file being accessed (for example, some UNIX-based servers support a 6285 "mandatory lock bit" on the mode attribute such that if set, record 6286 locks are required on the file before I/O is possible). When record 6287 locks are advisory, they only prevent the granting of conflicting 6288 lock requests and have no effect on READs or WRITEs. Mandatory 6289 record locks, however, prevent conflicting I/O operations. When such 6290 operations are attempted, they are rejected with NFS4ERR_LOCKED. When the 6291 client gets NFS4ERR_LOCKED on a file it knows it has the proper share 6292 reservation for, it will need to issue a LOCK request on the region 6293 of the file that includes the region the I/O was to be performed on, 6294 with an appropriate locktype (i.e. READ*_LT for a READ operation, 6295 WRITE*_LT for a WRITE operation). 6297 Note that for UNIX environments that support mandatory file locking, 6298 the distinction between advisory and mandatory locking is subtle. In 6299 fact, advisory and mandatory record locks are exactly the same insofar 6300 as the APIs and requirements on implementation are concerned. If the mandatory 6301 lock attribute is set on the file, the server checks to see if the 6302 lock-owner has an appropriate shared (read) or exclusive (write) 6303 record lock on the region it wishes to read or write to.
If there is 6304 no appropriate lock, the server checks if there is a conflicting lock 6305 (which can be done by attempting to acquire the conflicting lock on 6306 behalf of the lock-owner and, if successful, releasing the lock 6307 after the READ or WRITE is done), and if there is, the server returns 6308 NFS4ERR_LOCKED. 6310 For Windows environments, there are no advisory record locks, so the 6311 server always checks for record locks during I/O requests. 6313 Thus, the NFS version 4 LOCK operation does not need to distinguish 6314 between advisory and mandatory record locks. It is the NFS version 4 6315 server's processing of the READ and WRITE operations that introduces 6316 the distinction. 6318 Every stateid, with the exception of special stateid values, whether 6319 returned by an OPEN-type operation (i.e. OPEN, OPEN_DOWNGRADE), or 6320 by a LOCK-type operation (i.e. LOCK or LOCKU), defines an access 6321 mode for the file (i.e. READ, WRITE, or READ-WRITE) as established 6322 by the original OPEN which caused the allocation of the open stateid 6323 and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the same 6324 open-owner/file pair. Stateids returned by record lock operations 6325 imply the access mode for the open stateid associated with the lock 6326 set represented by the stateid. Delegation stateids have an access 6327 mode based on the type of delegation. When a READ, WRITE, or SETATTR 6328 which specifies the size attribute, is done, the operation is subject 6329 to checking against the access mode to verify that the operation is 6330 appropriate given the OPEN with which the operation is associated. 6332 In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which 6333 set size), the server must verify that the access mode allows writing 6334 and return an NFS4ERR_OPENMODE error if it does not.
In the case of 6335 READ, the server may perform the corresponding check on the access 6336 mode, or it may choose to allow READ on opens for WRITE only, to 6337 accommodate clients whose write implementation may unavoidably do 6338 reads (e.g. due to buffer cache constraints). However, even if READs 6339 are allowed in these circumstances, the server MUST still check for 6340 locks that conflict with the READ (e.g. another open specifying denial 6341 of READs). Note that a server which does enforce the access mode 6342 check on READs need not explicitly check for conflicting share 6343 reservations since the existence of an OPEN for read access guarantees 6344 that no conflicting share reservation can exist. 6346 The read bypass special stateid (all bits of "other" and "seqid" set 6347 to one) indicates a desire to bypass locking checks. The 6348 server MAY allow READ operations to bypass locking checks at the 6349 server, when this special stateid is used. However, WRITE operations 6350 with this special stateid value MUST NOT bypass locking checks and 6351 are treated exactly the same as if a special stateid for anonymous 6352 state were used. 6354 A lock may not be granted while a READ or WRITE operation using one 6355 of the special stateids is being performed and the range of the lock 6356 request conflicts with the range of the READ or WRITE operation. For 6357 the purposes of this paragraph, a conflict occurs when a shared lock 6358 is requested and a WRITE operation is being performed, or an 6359 exclusive lock is requested and either a READ or a WRITE operation is 6360 being performed. A SETATTR that sets size is treated similarly to a 6361 WRITE as discussed above. 6363 8.2. Lock Ranges 6365 The protocol allows a lock owner to request a lock with an octet 6366 range and then either upgrade, downgrade, or unlock a sub-range of 6367 the initial lock. It is expected that this will be an uncommon type 6368 of request.
In any case, servers or server filesystems may not be 6369 able to support sub-range lock semantics. In the event that a server 6370 receives a locking request that represents a sub-range of current 6371 locking state for the lock owner, the server is allowed to return the 6372 error NFS4ERR_LOCK_RANGE to signify that it does not support sub- 6373 range lock operations. Therefore, the client should be prepared to 6374 receive this error and, if appropriate, report the error to the 6375 requesting application. 6377 The client is discouraged from combining multiple independent locking 6378 ranges that happen to be adjacent into a single request since the 6379 server may not support sub-range requests and for reasons related to 6380 the recovery of file locking state in the event of server failure. 6381 As discussed in the section "Server Failure and Recovery" below, the 6382 server may employ certain optimizations during recovery that work 6383 effectively only when the client's behavior during lock recovery is 6384 similar to the client's locking behavior prior to server failure. 6386 8.3. Upgrading and Downgrading Locks 6388 If a client has a write lock on a record, it can request an atomic 6389 downgrade of the lock to a read lock via the LOCK request, by setting 6390 the type to READ_LT. If the server supports atomic downgrade, the 6391 request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP. 6392 The client should be prepared to receive this error, and if 6393 appropriate, report the error to the requesting application. 6395 If a client has a read lock on a record, it can request an atomic 6396 upgrade of the lock to a write lock via the LOCK request by setting 6397 the type to WRITE_LT or WRITEW_LT. If the server does not support 6398 atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade 6399 can be achieved without an existing conflict, the request will 6400 succeed. 
Otherwise, the server will return either NFS4ERR_DENIED or 6401 NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the 6402 client issued the LOCK request with the type set to WRITEW_LT and the 6403 server has detected a deadlock. The client should be prepared to 6404 receive such errors and if appropriate, report the error to the 6405 requesting application. 6407 8.4. Blocking Locks 6409 Some clients require the support of blocking locks. NFSv4.1 does not 6410 provide a callback when a previously unavailable lock becomes 6411 available. Clients thus have no choice but to continually poll for 6412 the lock. This presents a fairness problem. Two new lock types are 6413 added, READW and WRITEW, and are used to indicate to the server that 6414 the client is requesting a blocking lock. The server should maintain 6415 an ordered list of pending blocking locks. When the conflicting lock 6416 is released, the server may wait the lease period for the first 6417 waiting client to re-request the lock. After the lease period 6418 expires the next waiting client request is allowed the lock. Clients 6419 are required to poll at an interval sufficiently small that the client is 6420 likely to acquire the lock in a timely manner. The server is not 6421 required to maintain a list of pending blocked locks as the list is used to 6422 increase fairness and not for correctness of operation. Because of the 6423 unordered nature of crash recovery, storing of lock state to stable 6424 storage would be required to guarantee ordered granting of blocking 6425 locks. 6427 Servers may also note the lock types and delay returning denial of 6428 the request to allow extra time for a conflicting lock to be 6429 released, allowing a successful return. In this way, clients can 6430 avoid the burden of needlessly frequent polling for blocking locks. 6431 The server should take care in the length of delay in the event the 6432 client retransmits the request.
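The client-side polling that blocking locks require can be sketched as follows. This is illustrative only, not protocol text: `try_lock` stands in for issuing a LOCK request (READW/WRITEW while polling), and the final nonblocking probe reflects the courtesy, described below, of telling the server the client will stop polling.

```python
import time

def acquire_blocking_lock(try_lock, poll_interval, timeout):
    """Illustrative client poll loop for a blocking lock.

    try_lock() is an assumed stand-in for a LOCK request; it returns
    "GRANTED" or "NFS4ERR_DENIED".  poll_interval should be small
    enough that the client is likely to re-request the lock within
    the lease period after a conflicting lock is released."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if try_lock() == "GRANTED":
            return True
        time.sleep(poll_interval)
    # One last (nonblocking) request: it may still succeed, and if it
    # is denied it signals the server that polling has ended.
    return try_lock() == "GRANTED"
```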
6434 If a server receives a blocking lock request, denies it, and then 6435 later receives a nonblocking request for the same lock, which is also 6436 denied, then it should remove the lock in question from its list of 6437 pending blocking locks. Clients should use such a nonblocking 6438 request to indicate to the server that this is the last time they 6439 intend to poll for the lock, as may happen when the process 6440 requesting the lock is interrupted. This is a courtesy to the 6441 server, to prevent it from unnecessarily waiting a lease period 6442 before granting other lock requests. However, clients are not 6443 required to perform this courtesy, and servers must not depend on 6444 them doing so. Also, clients must be prepared for the possibility 6445 that this final locking request will be accepted. 6447 8.5. Lease Renewal 6449 The purpose of a lease is to allow a server to remove stale locks 6450 that are held by a client that has crashed or is otherwise 6451 unreachable. It is not a mechanism for cache consistency and lease 6452 renewals may not be denied if the lease interval has not expired. 6454 Since each session is associated with a specific client, any 6455 operation issued on that session is an indication that the associated 6456 client is reachable. When a request is issued for a given session, 6457 execution of a SEQUENCE operation will result in all leases for the 6458 associated client being implicitly renewed. This approach allows for 6459 low overhead lease renewal which scales well. In the typical case no 6460 extra RPC calls are required for lease renewal and in the worst case 6461 one RPC is required every lease period, via a COMPOUND that consists 6462 solely of a single SEQUENCE operation. The number of locks held by 6463 the client is not a factor since all state for the client is involved 6464 with the lease renewal action.
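The implicit renewal scheme above, in which every SEQUENCE operation renews all of a client's leases at once, can be sketched with a single shared expiration time per client ID. This is an illustrative server-side sketch, not protocol text; the class and method names are assumptions.

```python
import time

class LeaseTable:
    """Illustrative sketch: one lease expiration time per client ID,
    pushed forward implicitly whenever that client issues SEQUENCE."""

    def __init__(self, lease_period):
        self.lease_period = lease_period
        self._expiry = {}  # client_id -> expiration time (seconds)

    def sequence(self, client_id, now=None):
        # A SEQUENCE on any session of the client renews every lease
        # the client holds, by updating the single shared expiry.
        now = time.monotonic() if now is None else now
        self._expiry[client_id] = now + self.lease_period

    def expired(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        return now > self._expiry.get(client_id, 0.0)
```

Because only one timestamp is updated per renewal, the cost is independent of the number of locks held, matching the scaling argument above.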
6466 Since all operations that create a new lease also renew existing 6467 leases, the server must maintain a common lease expiration time for 6468 all valid leases for a given client. This lease time can then be 6469 easily updated upon implicit lease renewal actions. 6471 8.6. Crash Recovery 6473 The important requirement in crash recovery is that both the client 6474 and the server know when the other has failed. Additionally, it is 6475 required that a client see a consistent view of data across server 6476 restarts or reboots. All READ and WRITE operations that may have 6477 been queued within the client or network buffers must wait until the 6478 client has successfully recovered the locks protecting the READ and 6479 WRITE operations. 6481 8.6.1. Client Failure and Recovery 6483 In the event that a client fails, the server may release the client's 6484 locks when the associated leases have expired. Conflicting locks 6485 from another client may only be granted after this lease expiration. 6486 When a client has not failed and re-establishes its lease before 6487 expiration occurs, requests for conflicting locks will not be 6488 granted. 6490 To minimize client delay upon restart, lock requests are associated 6491 with an instance of the client by a client supplied verifier. This 6492 verifier is part of the initial EXCHANGE_ID call made by the client. 6493 The server returns a client ID as a result of the EXCHANGE_ID 6494 operation. The client then confirms the use of the client ID by 6495 establishing a session associated with that client ID. All locks, 6496 including opens, record locks, delegations, and layouts obtained by 6497 sessions using that client ID are associated with that client ID. 6499 Since the verifier will be changed by the client upon each 6500 initialization, the server can compare a new verifier to the verifier 6501 associated with currently held locks and determine that they do not 6502 match.
This signifies the client's new instantiation and subsequent 6503 loss of locking state. As a result, the server is free to release 6504 all locks held which are associated with the old client ID which was 6505 derived from the old verifier. At this point conflicting locks from 6506 other clients, kept waiting while the lease had not yet expired, can 6507 be granted. 6509 Note that the verifier must have the same uniqueness properties as 6510 the verifier for the COMMIT operation. 6512 8.6.2. Server Failure and Recovery 6514 If the server loses locking state (usually as a result of a restart 6515 or reboot), it must allow clients time to discover this fact and re- 6516 establish the lost locking state. The client must be able to re- 6517 establish the locking state without having the server deny valid 6518 requests because the server has granted conflicting access to another 6519 client. Likewise, if there is a possibility that clients have not 6520 yet re-established their locking state for a file, the server must 6521 disallow READ and WRITE operations for that file. 6523 A client can determine that loss of locking state has occurred via 6524 several methods. 6526 1. When a SEQUENCE succeeds, but sr_status_flags in the reply to 6527 SEQUENCE indicates SEQ4_STATUS_RESTART_RECLAIM_NEEDED (see 6528 Section 17.46.4). The client's client ID and session are valid 6529 (have persisted through server restart) and the client can now 6530 re-establish its lock state (Section 8.6.2.1). 6532 2. When an operation returns NFS4ERR_STALE_STATEID, this indicates a 6533 stateid invalidated by a server reboot or restart. Since the 6534 operation that returned NFS4ERR_STALE_STATEID MUST have been 6535 preceded by SEQUENCE, and SEQUENCE did not return an error, this 6536 means the client ID and session are valid. The client can now 6537 re-establish its lock state as described in Section 8.6.2.1.
Note 6538 that the server MUST have set 6539 SEQ4_STATUS_RESTART_RECLAIM_NEEDED in the sr_status_flags of the 6540 results of the SEQUENCE operation, and thus this situation should 6541 be the same as that described above. 6543 3. When a SEQUENCE operation returns NFS4ERR_STALE_CLIENTID, this 6544 means both the session ID SEQUENCE refers to (field sa_sessionid) 6545 and the implied client ID are now invalid, where the client ID was 6546 invalidated by server reboot or restart or by lease expiration. 6547 When SEQUENCE returns NFS4ERR_STALE_CLIENTID, the client must 6548 establish a new client ID (see Section 8.1.1) and re-establish 6549 its lock state (Section 8.6.2.1). 6551 4. When a SEQUENCE operation returns NFS4ERR_BADSESSION, this may 6552 mean the session has been destroyed, but the client ID is still 6553 valid. The client issues a CREATE_SESSION request with the 6554 client ID to re-establish the session. If CREATE_SESSION fails 6555 with NFS4ERR_STALE_CLIENTID, the client must establish a new 6556 client ID (see Section 8.1.1) and re-establish its lock state 6557 (Section 8.6.2.1). If CREATE_SESSION succeeds, the client must 6558 then re-establish its lock state (Section 8.6.2.1). 6560 5. When an operation that is neither SEQUENCE nor preceded by 6561 SEQUENCE (for example, CREATE_SESSION or DESTROY_SESSION) returns 6562 NFS4ERR_STALE_CLIENTID, the client MUST establish a new client 6563 ID (Section 8.1.1) and re-establish its lock state 6564 (Section 8.6.2.1). 6566 8.6.2.1. State Reclaim 6568 Once a session is established using the new client ID, the client 6569 will use reclaim-type locking requests (i.e. LOCK requests with 6570 reclaim set to true and OPEN operations with a claim type of 6571 CLAIM_PREVIOUS) to re-establish its locking state. Once this is 6572 done, or if there is no such locking state to reclaim, the client 6573 does a RECLAIM_COMPLETE operation to indicate that it has reclaimed 6574 all of the locking state that it will reclaim.
Once a client does a 6575 RECLAIM_COMPLETE operation, it may attempt non-reclaim locking 6576 operations, although it may get NFS4ERR_GRACE errors on these until 6577 the period of special handling is over. 6579 The period of special handling of locking and of READs and WRITEs is 6580 referred to as the "grace period". During the grace period, clients 6581 recover locks and the associated state using reclaim-type locking 6582 requests. During this period, the server must reject READ and WRITE 6583 operations and non-reclaim locking requests (i.e. other LOCK and OPEN 6584 operations) with an error of NFS4ERR_GRACE, unless it is able to 6585 guarantee that these may be done safely, as described below. 6587 The grace period may last until all clients who are known to possibly 6588 have had locks have done a RECLAIM_COMPLETE operation, indicating 6589 that they have finished reclaiming the locks they held before the 6590 server reboot. The server is assumed to maintain in stable storage a 6591 list of clients who may have such locks. The server may also 6592 terminate the grace period before all clients have done 6593 RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace period 6594 before a time equal to the lease period in order to give clients an 6595 opportunity to find out about the server reboot. Some additional 6596 time may be added to allow clients to establish a new client ID and 6597 session and to effect lock reclaims. 6599 If the server can reliably determine that granting a non-reclaim 6600 request will not conflict with reclamation of locks by other clients, 6601 the NFS4ERR_GRACE error does not have to be returned even within the 6602 grace period, although NFS4ERR_GRACE must always be returned to 6603 clients attempting a non-reclaim lock request before doing their own 6604 RECLAIM_COMPLETE.
For the server to be able to service READ and 6605 WRITE operations during the grace period, it must again be able to 6606 guarantee that no possible conflict could arise between a potential 6607 reclaim locking request and the READ or WRITE operation. If the 6608 server is unable to offer that guarantee, the NFS4ERR_GRACE error 6609 must be returned to the client. 6611 For a server to provide simple, valid handling during the grace 6612 period, the easiest method is to simply reject all non-reclaim 6613 locking requests and READ and WRITE operations by returning the 6614 NFS4ERR_GRACE error. However, a server may keep information about 6615 granted locks in stable storage. With this information, the server 6616 could determine if a regular lock or READ or WRITE operation can be 6617 safely processed. 6619 For example, if the server maintained, on stable storage, summary 6620 information on whether mandatory locks exist, either mandatory record 6621 locks or share reservations specifying deny modes, many requests 6622 could be allowed during the grace period. If it is known that no 6623 such share reservations exist, OPEN requests that do not specify deny 6624 modes may be safely granted. If, in addition, it is known that no 6625 mandatory record locks exist, either through information stored on 6626 stable storage or simply because the server does not support such 6627 locks, READ and WRITE requests may be safely processed during the 6628 grace period. 6630 To reiterate, for a server that allows non-reclaim lock and I/O 6631 requests to be processed during the grace period, it MUST determine 6632 that no lock subsequently reclaimed will be rejected and that no lock 6633 subsequently reclaimed would have prevented any I/O operation 6634 processed during the grace period. 6636 Clients should be prepared for the return of NFS4ERR_GRACE errors for 6637 non-reclaim lock and I/O requests. In this case the client should 6638 employ a retry mechanism for the request.
A delay (on the order of 6639 several seconds) between retries should be used to avoid overwhelming 6640 the server. Further discussion of the general issue is included in 6641 [Floyd]. The client must account for servers that are able to 6642 perform I/O and non-reclaim locking requests within the grace period 6643 as well as those that cannot do so. 6645 A reclaim-type locking request outside the server's grace period can 6646 only succeed if the server can guarantee that no conflicting lock or 6647 I/O request has been granted since reboot or restart. 6649 A server may, upon restart, establish a new value for the lease 6650 period. Therefore, clients should, once a new client ID is 6651 established, refetch the lease_time attribute and use it as the basis 6652 for lease renewal for the lease associated with that server. 6653 However, the server must establish, for this restart event, a grace 6654 period at least as long as the lease period for the previous server 6655 instantiation. This allows the client state obtained during the 6656 previous server instance to be reliably re-established. 6658 8.6.3. Network Partitions and Recovery 6660 If the duration of a network partition is greater than the lease 6661 period provided by the server, the server will not have received a 6662 lease renewal from the client. If this occurs, the server may free 6663 all locks held for the client, or it may allow the lock state to 6664 remain for a considerable period, subject to the constraint that if a 6665 request for a conflicting lock is made, locks associated with expired 6666 leases do not prevent such a conflicting lock from being granted but 6667 are revoked as necessary so as not to interfere with such conflicting 6668 requests.
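The constraint above, that locks under an expired lease do not block a conflicting request but are instead revoked as needed, can be sketched as follows. This is a hedged illustration only: `LockTable`, the `(client, byte_range)` representation, and the `lease_expired` predicate are hypothetical names standing in for real server-side state.

```python
class LockTable:
    """Illustrative sketch of the expired-lease conflict policy described
    above (hypothetical structure, not a specified server data model)."""

    def __init__(self):
        self.locks = []  # list of (client, byte_range) tuples

    @staticmethod
    def conflicts(r1, r2):
        # Byte ranges modeled as (start, end) half-open intervals.
        return r1[0] < r2[1] and r2[0] < r1[1]

    def request_lock(self, client, rng, lease_expired):
        """lease_expired(holder) models the server's knowledge of whether
        the holder's lease has expired."""
        revoked = []
        for holder, held in list(self.locks):
            if holder != client and self.conflicts(rng, held):
                if lease_expired(holder):
                    # Expired lease: revoke the conflicting lock rather
                    # than deny the new request.
                    self.locks.remove((holder, held))
                    revoked.append((holder, held))
                else:
                    # Valid lease: the conflicting request is denied.
                    return "NFS4ERR_DENIED", revoked
        self.locks.append((client, rng))
        return "NFS4_OK", revoked
```

Note the policy choice this sketch encodes: lock state for an intermittently unreachable client survives lease expiration until an actual conflict forces revocation, which is the "delay freeing of lock state" option discussed in the following paragraphs.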
6670 If the server chooses to delay freeing of lock state until there is a 6671 conflict, it may either free all of the client's locks once there is a 6672 conflict, or it may only revoke the minimum set of locks necessary to 6673 allow conflicting requests. When it adopts the finer-grained 6674 approach, it must revoke all locks associated with a given stateid 6675 whenever it revokes any such lock. 6677 When the server chooses to free all of a client's lock state, either 6678 immediately upon lease expiration, or as a result of the first attempt 6679 to get a lock, all stateids held by the client will become invalid or 6680 stale. Once the client is able to reach the server after such a 6681 network partition, the status returned by the SEQUENCE operation will 6682 indicate a loss of locking state. In addition, all I/O submitted by 6683 the client with the now invalid stateids will fail with the server 6684 returning the error NFS4ERR_EXPIRED. Once the client learns of the 6685 loss of locking state, it will suitably notify the applications that 6686 held the invalidated locks. The client should then take action to 6687 free invalidated stateids, either by establishing a new client ID 6688 using a new verifier or by doing a FREE_STATEID operation to release 6689 each of the invalidated stateids. 6691 When the server adopts a finer-grained approach to revocation of 6692 locks when leases have expired, only a subset of stateids will 6693 normally become invalid during a network partition. When the client 6694 is able to communicate with the server after such a network 6695 partition, the status returned by the SEQUENCE operation will 6696 indicate a partial loss of locking state. In addition, operations, 6697 including I/O submitted by the client with the now invalid stateids, 6698 will fail with the server returning the error NFS4ERR_EXPIRED.
Once 6699 the client learns of the loss of locking state, it will use the 6700 TEST_STATEID operation on all of its stateids to determine which 6701 locks have been lost and then suitably notify the applications that 6702 held the invalidated locks. The client can then release the 6703 invalidated locking state and acknowledge the revocation of the 6704 associated locks by doing a FREE_STATEID operation on each of the 6705 invalidated stateids. 6707 When a network partition is combined with a server reboot, there are 6708 edge conditions that place requirements on the server in order to 6709 avoid silent data corruption following the server reboot. Two of 6710 these edge conditions are known, and are discussed below. 6712 The first edge condition arises as a result of scenarios such as 6713 the following: 6715 1. Client A acquires a lock. 6717 2. Client A and server experience mutual network partition, such 6718 that client A is unable to renew its lease. 6720 3. Client A's lease expires, and the server releases the lock. 6722 4. Client B acquires a lock that would have conflicted with that of 6723 Client A. 6725 5. Client B releases its lock. 6727 6. Server reboots. 6729 7. Network partition between client A and server heals. 6731 8. Client A connects to new server instance and finds out about 6732 server reboot. 6734 9. Client A reclaims its lock within the server's grace period. 6736 Thus, at the final step, the server has erroneously granted client 6737 A's lock reclaim. If client B modified the object the lock was 6738 protecting, client A will experience object corruption. 6740 The second known edge condition arises in situations such as the 6741 following: 6743 1. Client A acquires one or more locks. 6745 2. Server reboots. 6747 3. Client A and server experience mutual network partition, such 6748 that client A is unable to reclaim all of its locks within the 6749 grace period. 6751 4. Server's reclaim grace period ends.
Client A has either no 6752 locks or an incomplete set of locks known to the server. 6754 5. Client B acquires a lock that would have conflicted with a lock 6755 of client A that was not reclaimed. 6757 6. Client B releases the lock. 6759 7. Server reboots a second time. 6761 8. Network partition between client A and server heals. 6763 9. Client A connects to new server instance and finds out about 6764 server reboot. 6766 10. Client A reclaims its lock within the server's grace period. 6768 As with the first edge condition, the final step of the scenario of 6769 the second edge condition has the server erroneously granting client 6770 A's lock reclaim. 6772 Solving the first and second edge conditions requires that the server 6773 either always assume after it reboots that some edge condition 6774 occurs, and thus return NFS4ERR_NO_GRACE for all reclaim attempts, or 6775 that the server record some information in stable storage. The 6776 amount of information the server records in stable storage is in 6777 inverse proportion to how harsh the server intends to be whenever 6778 edge conditions arise. A server that is completely tolerant of all 6779 edge conditions will record in stable storage every lock that is 6780 acquired, removing the lock record from stable storage only when the 6781 lock is released. For the two edge conditions discussed above, the 6782 harshest a server can be, and still support a grace period for 6783 reclaims, requires that the server record some minimal information 6784 in stable storage. For example, a server 6785 implementation could, for each client, save in stable storage a 6786 record containing: 6788 o the client's id string 6789 o a boolean that indicates if the client's lease expired or if there 6790 was administrative intervention (see Section 8.7) to revoke a 6791 record lock, share reservation, or delegation and there has been 6792 no acknowledgement (via FREE_STATEID) of such revocation.
6794 o a boolean that indicates whether the client may have locks that it 6795 believes to be reclaimable in situations in which the grace period 6796 was terminated, making the server's view of lock reclaimability 6797 suspect. The server will set this for any client record in stable 6798 storage where the client has not done a RECLAIM_COMPLETE, before 6799 it grants any new (i.e. not reclaimed) lock to any client. 6801 Assuming the above record keeping, for the first edge condition, 6802 after the server reboots, the record that client A's lease expired 6803 means that another client could have acquired a conflicting record 6804 lock, share reservation, or delegation. Hence the server must reject 6805 a reclaim from client A with the error NFS4ERR_NO_GRACE. 6807 For the second edge condition, after the server reboots for a second 6808 time, the indication that the client had not completed its reclaims 6809 at the time at which the grace period ended means that the server 6810 must reject a reclaim from client A with the error NFS4ERR_NO_GRACE. 6812 When either edge condition occurs, the client's attempt to reclaim 6813 locks will result in the error NFS4ERR_NO_GRACE. When this is 6814 received, or after the client reboots with no lock state, the client 6815 will issue a RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is 6816 received, the server and client are again in agreement regarding 6817 reclaimable locks and both booleans in persistent storage can be 6818 reset, to be set again only when there is a subsequent event that 6819 causes lock reclaim operations to be questionable. 6821 Regardless of the level and approach to record keeping, the server 6822 MUST implement one of the following strategies (which apply to 6823 reclaims of share reservations, record locks, and delegations): 6825 1. Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely 6826 unforgiving, but necessary if the server does not record lock 6827 state in stable storage. 6829 2.
Record sufficient state in stable storage such that all known 6830 edge conditions involving server reboot, including the two noted 6831 in this section, are detected. False positives are acceptable. 6832 Note that at this time, it is not known if there are other edge 6833 conditions. 6835 In the event that, after a server reboot, the server determines 6836 that there is unrecoverable damage or corruption to the 6837 information in stable storage, then for all clients and/or locks 6838 which may be affected, the server MUST return NFS4ERR_NO_GRACE. 6840 A mandate for the client's handling of the NFS4ERR_NO_GRACE error is 6841 outside the scope of this specification, since the strategies for 6842 such handling are very dependent on the client's operating 6843 environment. However, one potential approach is described below. 6845 When the client receives NFS4ERR_NO_GRACE, it could examine the 6846 change attribute of the objects the client is trying to reclaim state 6847 for, and use that to determine whether to re-establish the state via 6848 normal OPEN or LOCK requests. This is acceptable provided the 6849 client's operating environment allows it. In other words, the client 6850 implementor is advised to document the behavior for its users. The 6851 client could also inform the application that its record lock or 6852 share reservations (whether they were delegated or not) have been 6853 lost, such as via a UNIX signal, a GUI pop-up window, etc. See the 6854 section, "Data Caching and Revocation" for a discussion of what the 6855 client should do for dealing with unreclaimed delegations on client 6856 state. 6858 For further discussion of revocation of locks see Section 8.7. 6860 8.7. Server Revocation of Locks 6862 At any point, the server can revoke locks held by a client and the 6863 client must be prepared for this event.
When the client detects that 6864 its locks have been or may have been revoked, the client is 6865 responsible for validating the state information between itself and 6866 the server. Validating locking state for the client means that it 6867 must verify or reclaim state for each lock currently held. 6869 The first occasion of lock revocation is upon server reboot or 6870 restart. In this instance the client will receive an error 6871 (NFS4ERR_STALE_STATEID on an operation that takes a stateid as an 6872 argument or NFS4ERR_STALE_CLIENTID on an operation that takes a 6873 sessionid or client ID) and the client will proceed with normal crash 6874 recovery as described in Section 8.6.2.1. 6876 The second occasion of lock revocation is the inability to renew the 6877 lease before expiration, as discussed above. While this is 6878 considered a rare or unusual event, the client must be prepared to 6879 recover. The server is responsible for determining lease expiration, 6880 and deciding exactly how to deal with it, informing the client of the 6881 scope of the lock revocation. The client then uses the status 6882 information provided by the server in the SEQUENCE results (field 6883 sr_status_flags, see Section 17.46.4) to synchronize its locking 6884 state with that of the server, in order to recover. 6886 The third occasion of lock revocation can occur as a result of 6887 revocation of locks within the lease period, either because of 6888 administrative intervention, or because a recallable lock (a 6889 delegation or layout) was not returned within the lease period after 6890 having been recalled. While these are considered rare events, they 6891 are possible and the client must be prepared to deal with them. When 6892 either of these events occurs, the client finds out about the 6893 situation through the status returned by the SEQUENCE operation.
Any 6894 use of stateids associated with revoked locks will receive the error 6895 NFS4ERR_ADMIN_REVOKED or NFS4ERR_DELEG_REVOKED, as appropriate. 6897 In all situations in which a subset of locking state may have been 6898 revoked, which include all cases in which locking state is revoked 6899 within the lease period, it is up to the client to determine which 6900 locks have been revoked and which have not. It does this by using 6901 the TEST_STATEID operation on the appropriate set of stateids. Once 6902 the set of revoked locks has been determined, the applications can be 6903 notified, and the invalidated stateids can be freed and lock 6904 revocation acknowledged by using FREE_STATEID. 6906 8.8. Share Reservations 6908 A share reservation is a mechanism to control access to a file. It 6909 is a separate and independent mechanism from record locking. When a 6910 client opens a file, it issues an OPEN operation to the server 6911 specifying the type of access required (READ, WRITE, or BOTH) and the 6912 type of access to deny others (deny NONE, READ, WRITE, or BOTH). If 6913 the OPEN fails, the client will fail the application's open request. 6915 Pseudo-code definition of the semantics: 6917 if (request.access == 0) 6918 return (NFS4ERR_INVAL) 6919 else 6920 if ((request.access & file_state.deny) || 6921 (request.deny & file_state.access)) 6922 return (NFS4ERR_DENIED) 6924 This checking of share reservations on OPEN is done with no exception 6925 for an existing OPEN for the same open-owner. 6927 The constants used for the OPEN and OPEN_DOWNGRADE operations for the 6928 access and deny fields are as follows: 6930 const OPEN4_SHARE_ACCESS_READ = 0x00000001; 6931 const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; 6932 const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; 6934 const OPEN4_SHARE_DENY_NONE = 0x00000000; 6935 const OPEN4_SHARE_DENY_READ = 0x00000001; 6936 const OPEN4_SHARE_DENY_WRITE = 0x00000002; 6937 const OPEN4_SHARE_DENY_BOTH = 0x00000003; 6939 8.9.
OPEN/CLOSE Operations 6941 To provide correct share semantics, a client MUST use the OPEN 6942 operation to obtain the initial filehandle and indicate the desired 6943 access and what access, if any, to deny. Even if the client intends 6944 to use a special stateid for anonymous state or read bypass, it must 6945 still obtain the filehandle for the regular file with the OPEN 6946 operation so the appropriate share semantics can be applied. For 6947 clients that do not have a deny mode built into their open 6948 programming interfaces, deny equal to NONE should be used. 6950 The OPEN operation with the CREATE flag also subsumes the CREATE 6951 operation for regular files as used in previous versions of the NFS 6952 protocol. This allows a create with a share to be done atomically. 6954 The CLOSE operation removes all share reservations held by the open- 6955 owner on that file. If record locks are held, the client SHOULD 6956 release all locks before issuing a CLOSE. The server MAY free all 6957 outstanding locks on CLOSE but some servers may not support the CLOSE 6958 of a file that still has record locks held. The server MUST return 6959 failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the 6960 CLOSE. 6962 The LOOKUP operation will return a filehandle without establishing 6963 any lock state on the server. Without a valid stateid, the server 6964 will assume the client has the least access. For example, a file 6965 opened with deny READ/WRITE using a filehandle obtained through 6966 LOOKUP could only be read using the special read bypass stateid and 6967 could not be written at all because it would not have a valid stateid 6968 and the special anonymous stateid would not be allowed access. 6970 8.10.
Open Upgrade and Downgrade 6972 When an OPEN is done for a file and the open-owner for which the open 6973 is being done already has the file open, the result is to upgrade the 6974 open file status maintained on the server to include the access and 6975 deny bits specified by the new OPEN as well as those for the existing 6976 OPEN. The result is that there is one open file, as far as the 6977 protocol is concerned, and it includes the union of the access and 6978 deny bits for all of the OPEN requests completed. Only a single 6979 CLOSE will be done to reset the effects of both OPENs. Note that the 6980 client, when issuing the OPEN, may not know that the same file is in 6981 fact being opened. The above only applies if both OPENs result in 6982 the OPENed object being designated by the same filehandle. 6984 When the server chooses to export multiple filehandles corresponding 6985 to the same file object and returns different filehandles on two 6986 different OPENs of the same file object, the server MUST NOT "OR" 6987 together the access and deny bits and coalesce the two open files. 6988 Instead the server must maintain separate OPENs with separate 6989 stateids and will require separate CLOSEs to free them. 6991 When multiple open files on the client are merged into a single open 6992 file object on the server, the close of one of the open files (on the 6993 client) may necessitate change of the access and deny status of the 6994 open file on the server. This is because the union of the access and 6995 deny bits for the remaining opens may be smaller (i.e. a proper 6996 subset) than previously. The OPEN_DOWNGRADE operation is used to 6997 make the necessary change and the client should use it to update the 6998 server so that share reservation requests by other clients are 6999 handled properly. 7001 8.11. Short and Long Leases 7003 When determining the time period for the server lease, the usual 7004 lease tradeoffs apply. 
Short leases are good for fast server 7005 recovery at a cost of increased operations to effect lease renewal 7006 (when there are no other operations during the period to effect lease 7007 renewal as a side-effect). Long leases are certainly kinder and 7008 gentler to servers trying to handle very large numbers of clients. 7009 The number of extra requests to effect lock renewal drops in inverse 7010 proportion to the lease time. The disadvantages of long leases 7011 include the possibility of slower recovery after certain failures. 7012 After server failure, a longer grace period may be required when some 7013 clients do not promptly reclaim their locks and do a 7014 RECLAIM_COMPLETE. In the event of client failure, it can take a longer 7015 period for leases to expire, thus forcing conflicting requests to 7016 wait. 7018 Long leases are usable if the server is able to store lease state in 7019 non-volatile memory. Upon recovery, the server can reconstruct the 7020 lease state from its non-volatile memory and continue operation with 7021 its clients; therefore, long leases would not be an issue. 7023 8.12. Clocks, Propagation Delay, and Calculating Lease Expiration 7025 To avoid the need for synchronized clocks, lease times are granted by 7026 the server as a time delta. However, there is a requirement that the 7027 client and server clocks do not drift excessively over the duration 7028 of the lock. There is also the issue of propagation delay across the 7029 network which could easily be several hundred milliseconds as well as 7030 the possibility that requests will be lost and need to be 7031 retransmitted. 7033 To take propagation delay into account, the client should subtract it 7034 from lease times (e.g. if the client estimates the one-way 7035 propagation delay as 200 msec, then it can assume that the lease is 7036 already 200 msec old when it gets it). In addition, it will take 7037 another 200 msec to get a response back to the server.
So the client 7038 must send a lock renewal or write data back to the server 400 msec 7039 before the lease would expire. 7041 The server's lease period configuration should take into account the 7042 network distance of the clients that will be accessing the server's 7043 resources. It is expected that the lease period will take into 7044 account the network propagation delays and other network delay 7045 factors for the client population. Since the protocol does not allow 7046 for an automatic method to determine an appropriate lease period, the 7047 server's administrator may have to tune the lease period. 7049 8.13. Vestigial Locking Infrastructure From V4.0 7051 There are a number of operations and fields within existing 7052 operations that no longer have a function in minor version one. In 7053 one way or another, these changes are all due to the implementation 7054 of sessions, which provides client context and replay protection as a 7055 base feature of the protocol, separate from locking itself. 7057 The following operations have become mandatory-to-not-implement. The 7058 server should return NFS4ERR_NOTSUPP if these operations are found in 7059 an NFSv4.1 COMPOUND. 7061 o SETCLIENTID since its function has been replaced by EXCHANGE_ID. 7063 o SETCLIENTID_CONFIRM since client ID confirmation now happens by 7064 means of CREATE_SESSION. 7066 o OPEN_CONFIRM because OPENs no longer require confirmation to 7067 establish an owner-based sequence value. 7069 o RELEASE_LOCKOWNER because lock-owners with no associated locks 7070 no longer have any sequence-related state and so can be deleted by 7071 the server at will. 7073 o RENEW because every SEQUENCE operation for a session causes lease 7074 renewal, making a separate operation useless. 7076 Also, there are a number of fields, present in existing operations 7077 related to locking, that have no use in minor version one.
They were 7078 used in minor version zero to perform functions now provided in a 7079 different fashion. 7081 o Sequence IDs, used to sequence requests for a given state-owner 7082 and to provide replay protection, which is now provided via sessions. 7084 o Client IDs, used to identify the client associated with a given 7085 request. Client identification is now available using the client 7086 ID associated with the current session, without needing an 7087 explicit client ID field. 7089 Such vestigial fields in existing operations should be set by the 7090 client to zero. When they are not, the server MUST return an 7091 NFS4ERR_INVAL error. 7093 9. Client-Side Caching 7095 Client-side caching of data, of file attributes, and of file names is 7096 essential to providing good performance with the NFS protocol. 7097 Providing distributed cache coherence is a difficult problem and 7098 previous versions of the NFS protocol have not attempted it. 7099 Instead, several NFS client implementation techniques have been used 7100 to reduce the problems that a lack of coherence poses for users. 7101 These techniques have not been clearly defined by earlier protocol 7102 specifications and it is often unclear what is valid or invalid 7103 client behavior. 7105 The NFS version 4 protocol uses many techniques similar to those that 7106 have been used in previous protocol versions. The NFS version 4 7107 protocol does not provide distributed cache coherence. However, it 7108 defines a more limited set of caching guarantees to allow locks and 7109 share reservations to be used without destructive interference from 7110 client side caching. 7112 In addition, the NFS version 4 protocol introduces a delegation 7113 mechanism which allows many decisions normally made by the server to 7114 be made locally by clients. This mechanism provides efficient 7115 support of the common cases where sharing is infrequent or where 7116 sharing is read-only. 7118 9.1.
Performance Challenges for Client-Side Caching 7120 Caching techniques used in previous versions of the NFS protocol have 7121 been successful in providing good performance. However, several 7122 scalability challenges can arise when those techniques are used with 7123 very large numbers of clients. This is particularly true when 7124 clients are geographically distributed, which classically increases 7125 the latency for cache revalidation requests. 7127 The previous versions of the NFS protocol repeat their file data 7128 cache validation requests at the time the file is opened. This 7129 behavior can have serious performance drawbacks. A common case is 7130 one in which a file is only accessed by a single client. Therefore, 7131 sharing is infrequent. 7133 In this case, repeated reference to the server to find that no 7134 conflicts exist is expensive. A better option with regard to 7135 performance is to allow a client that repeatedly opens a file to do 7136 so without reference to the server. This is done until potentially 7137 conflicting operations from another client actually occur. 7139 A similar situation arises in connection with file locking. Sending 7140 file lock and unlock requests to the server as well as the read and 7141 write requests necessary to make data caching consistent with the 7142 locking semantics (see the section "Data Caching and File Locking") 7143 can severely limit performance. When locking is used to provide 7144 protection against infrequent conflicts, a large penalty is incurred. 7145 This penalty may discourage the use of file locking by applications. 7147 The NFS version 4 protocol provides more aggressive caching 7148 strategies with the following design goals: 7150 o Compatibility with a large range of server semantics. 7151 o Provide the same caching benefits as previous versions of the NFS 7152 protocol when unable to provide the more aggressive model.
o 7153 Requirements for aggressive caching are organized so that a large 7154 portion of the benefit can be obtained even when not all of the 7155 requirements can be met. The appropriate requirements for the 7156 server are discussed in later sections in which specific forms of 7157 caching are covered (see the section "Open Delegation"). 7159 9.2. Delegation and Callbacks 7161 Recallable delegation of server responsibilities for a file to a 7162 client improves performance by avoiding repeated requests to the 7163 server in the absence of inter-client conflict. With the use of a 7164 "callback" RPC from server to client, a server recalls delegated 7165 responsibilities when another client engages in sharing of a 7166 delegated file. 7168 A delegation is passed from the server to the client, specifying the 7169 object of the delegation and the type of delegation. There are 7170 different types of delegations, but each type contains a stateid to be 7171 used to represent the delegation when performing operations that 7172 depend on the delegation. This stateid is similar to those 7173 associated with locks and share reservations but differs in that the 7174 stateid for a delegation is associated with a client ID and may be 7175 used on behalf of all the open_owners for the given client. A 7176 delegation is made to the client as a whole and not to any specific 7177 process or thread of control within it. 7179 The callback path or backchannel is established by CREATE_SESSION and 7180 BIND_CONN_TO_SESSION, and the client is required to maintain it. 7181 Because the backchannel may be down, even temporarily, correct 7182 protocol operation does not depend on it. Preliminary testing of 7183 callback functionality by means of a CB_COMPOUND procedure with a 7184 single operation, CB_SEQUENCE, can be used to check the continuity of 7185 the backchannel. A server avoids delegating responsibilities until 7186 it has determined that the backchannel exists.
Because the granting 7187 of a delegation is always conditional upon the absence of conflicting 7188 access, clients must not assume that a delegation will be granted and 7189 they must always be prepared for OPENs, WANT_DELEGATIONs, and 7190 GET_DIR_DELEGATIONs to be processed without any delegations being 7191 granted. 7193 Once granted, a delegation behaves in many ways like a lock. There 7194 is an associated lease that is subject to renewal together with all 7195 of the other leases held by that client. 7197 Unlike locks, an operation by a second client to a delegated file 7198 will cause the server to recall a delegation through a callback. 7200 On recall, the client holding the delegation must flush modified 7201 state (such as modified data) to the server and return the 7202 delegation. The conflicting request will not receive a response 7203 until the recall is complete. The recall is considered complete when 7204 the client returns the delegation or the server times out on the 7205 recall and revokes the delegation as a result of the timeout. 7206 Following the resolution of the recall, the server has the 7207 information necessary to grant or deny the second client's request. 7209 At the time the client receives a delegation recall, it may have 7210 substantial state that needs to be flushed to the server. Therefore, 7211 the server should allow sufficient time for the delegation to be 7212 returned since it may involve numerous RPCs to the server. If the 7213 server is able to determine that the client is diligently flushing 7214 state to the server as a result of the recall, the server may extend 7215 the usual time allowed for a recall. However, the time allowed for 7216 recall completion should not be unbounded. 7218 An example of this is when responsibility to mediate opens on a given 7219 file is delegated to a client (see the section "Open Delegation"). 7220 The server will not know what opens are in effect on the client. 
7221 Without this knowledge the server will be unable to determine if the 7222 access and deny state for the file allows any particular open until 7223 the delegation for the file has been returned. 7225 A client failure or a network partition can result in failure to 7226 respond to a recall callback. In this case, the server will revoke 7227 the delegation, which in turn will render useless any modified state 7228 still on the client. 7230 9.2.1. Delegation Recovery 7232 There are three situations that delegation recovery must deal with: 7234 o Client reboot or restart 7236 o Server reboot or restart 7238 o Network partition (full or callback-only) 7240 In the event the client reboots or restarts, the failure to renew 7241 leases will result in the revocation of record locks and share 7242 reservations. Delegations, however, may be treated a bit 7243 differently. 7245 There will be situations in which delegations will need to be 7246 reestablished after a client reboots or restarts. The reason for 7247 this is that the client may have file data stored locally and this data 7248 was associated with the previously held delegations. The client will 7249 need to reestablish the appropriate file state on the server. 7251 To allow for this type of client recovery, the server MAY extend the 7252 period for delegation recovery beyond the typical lease expiration 7253 period. This implies that requests from other clients that conflict 7254 with these delegations will need to wait. Because the normal recall 7255 process may require significant time for the client to flush changed 7256 state to the server, other clients need to be prepared for delays that 7257 occur because of a conflicting delegation. This longer interval 7258 would increase the window for clients to reboot and consult stable 7259 storage so that the delegations can be reclaimed. For open 7260 delegations, such delegations are reclaimed using OPEN with a claim 7261 type of CLAIM_DELEGATE_PREV.
(See the sections on "Data Caching and 7262 Revocation" and "Operation 18: OPEN" for discussion of open 7263 delegation and the details of OPEN respectively). 7265 A server MAY support a claim type of CLAIM_DELEGATE_PREV, but if it 7266 does, it MUST NOT remove delegations upon a CREATE_SESSION that 7267 confirms a client ID created by EXCHANGE_ID, and instead MUST, for a 7268 period of time no less than that of the value of the lease_time 7269 attribute, maintain the client's delegations to allow time for the 7270 client to issue CLAIM_DELEGATE_PREV requests. The server that 7271 supports CLAIM_DELEGATE_PREV MUST support the DELEGPURGE operation. 7273 When the server reboots or restarts, delegations are reclaimed (using 7274 the OPEN operation with CLAIM_PREVIOUS) in a similar fashion to 7275 record locks and share reservations. However, there is a slight 7276 semantic difference. In the normal case if the server decides that a 7277 delegation should not be granted, it performs the requested action 7278 (e.g. OPEN) without granting any delegation. For reclaim, the 7279 server grants the delegation but a special designation is applied so 7280 that the client treats the delegation as having been granted but 7281 recalled by the server. Because of this, the client has the duty to 7282 write all modified state to the server and then return the 7283 delegation. This process of handling delegation reclaim reconciles 7284 three principles of the NFS version 4 protocol: 7286 o Upon reclaim, a client reporting resources assigned to it by an 7287 earlier server instance must be granted those resources. 7289 o The server has unquestionable authority to determine whether 7290 delegations are to be granted and, once granted, whether they are 7291 to be continued. 7293 o The use of callbacks is not to be depended upon until the client 7294 has proven its ability to receive them. 
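The client side of this reclaim process can be sketched as follows. All names here are illustrative (the protocol defines only OPEN with CLAIM_PREVIOUS and DELEGRETURN, not these helpers): a delegation reclaimed after a server restart carries the special granted-but-recalled designation, so the client writes back any modified state and then returns the delegation.

```python
OPEN_DELEGATE_READ = 1
OPEN_DELEGATE_WRITE = 2

class ReclaimedDelegation:
    """State returned by an OPEN with claim type CLAIM_PREVIOUS (illustrative)."""
    def __init__(self, deleg_type, recalled):
        self.deleg_type = deleg_type
        self.recalled = recalled  # special designation: granted but already recalled

def complete_reclaim(delegation, dirty_ranges, write_back, deleg_return):
    """Finish reclaiming a delegation after a server restart.

    write_back(offset, data) flushes one modified range to the server;
    deleg_return() issues DELEGRETURN for the delegation's stateid.
    Returns the ranges that were flushed.
    """
    flushed = []
    if delegation.recalled:
        # Treat the delegation as having been recalled by the server:
        # write all modified state back, then return the delegation.
        if delegation.deleg_type == OPEN_DELEGATE_WRITE:
            for offset, data in dirty_ranges:
                write_back(offset, data)
                flushed.append((offset, data))
        deleg_return()
    return flushed
```

This sketch captures only the ordering requirement (flush before return); a real client would also retire the delegation stateid and convert any locally cached opens into server-visible state.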
7296 When a network partition occurs, delegations are subject to freeing 7297 by the server when the lease renewal period expires. This is similar 7298 to the behavior for locks and share reservations. For delegations, 7299 however, the server may extend the period in which conflicting 7300 requests are held off. Eventually the occurrence of a conflicting 7301 request from another client will cause revocation of the delegation. 7302 A loss of the callback path (e.g. by later network configuration 7303 change) will have the same effect. A recall request will fail and 7304 revocation of the delegation will result. 7306 A client normally finds out about revocation of a delegation when it 7307 uses a stateid associated with a delegation and receives the error 7308 NFS4ERR_EXPIRED. It also may find out about delegation revocation 7309 after a client reboot when it attempts to reclaim a delegation and 7310 receives that same error. Note that in the case of a revoked write 7311 open delegation, there are issues because data may have been modified 7312 by the client whose delegation is revoked and separately by other 7313 clients. See the section "Revocation Recovery for Write Open 7314 Delegation" for a discussion of such issues. Note also that when 7315 delegations are revoked, information about the revoked delegation 7316 will be written by the server to stable storage (as described in the 7317 section "Crash Recovery"). This is done to deal with the case in 7318 which a server reboots after revoking a delegation but before the 7319 client holding the revoked delegation is notified about the 7320 revocation. 7322 9.3. Data Caching 7324 When applications share access to a set of files, they need to be 7325 implemented so as to take account of the possibility of conflicting 7326 access by another application. This is true whether the applications 7327 in question execute on different clients or reside on the same 7328 client. 
7330 Share reservations and record locks are the facilities the NFS 7331 version 4 protocol provides to allow applications to coordinate 7332 access by providing mutual exclusion facilities. The NFS version 4 7333 protocol's data caching must be implemented such that it does not 7334 invalidate the assumptions that those using these facilities depend 7335 upon. 7337 9.3.1. Data Caching and OPENs 7339 In order to avoid invalidating the sharing assumptions that 7340 applications rely on, NFS version 4 clients should not provide cached 7341 data to applications or modify it on behalf of an application when it 7342 would not be valid to obtain or modify that same data via a READ or 7343 WRITE operation. 7345 Furthermore, in the absence of open delegation (see the section "Open 7346 Delegation") two additional rules apply. Note that these rules are 7347 obeyed in practice by many NFS version 2 and version 3 clients. 7349 o First, cached data present on a client must be revalidated after 7350 doing an OPEN. Revalidating means that the client fetches the 7351 change attribute from the server, compares it with the cached 7352 change attribute, and if different, declares the cached data (as 7353 well as the cached attributes) as invalid. This is to ensure that 7354 the data for the OPENed file is still correctly reflected in the 7355 client's cache. This validation must be done at least when the 7356 client's OPEN operation includes DENY=WRITE or BOTH thus 7357 terminating a period in which other clients may have had the 7358 opportunity to open the file with WRITE access. Clients may 7359 choose to do the revalidation more often (i.e. at OPENs specifying 7360 DENY=NONE) to parallel the NFS version 3 protocol's practice for 7361 the benefit of users assuming this degree of cache revalidation. 
7363 Since the change attribute is updated for data and metadata 7364 modifications, some client implementors may be tempted to use the 7365 time_modify attribute and not change to validate cached data, so 7366 that metadata changes do not spuriously invalidate clean data. 7367 The implementor is cautioned against this approach. The change 7368 attribute is guaranteed to change for each update to the file, 7369 whereas time_modify is guaranteed to change only at the 7370 granularity of the time_delta attribute. Use by the client's data 7371 cache validation logic of time_modify and not change runs the risk 7372 of the client incorrectly marking stale data as valid. 7374 o Second, modified data must be flushed to the server before closing 7375 a file OPENed for write. This is complementary to the first rule. 7376 If the data is not flushed at CLOSE, the revalidation done after 7377 the client OPENs a file is unable to achieve its purpose. The other 7378 aspect to flushing the data before close is that the data must be 7379 committed to stable storage, at the server, before the CLOSE 7380 operation is requested by the client. In the case of a server 7381 reboot or restart and a CLOSEd file, it may not be possible to 7382 retransmit the data to be written to the file. Hence, this 7383 requirement. 7385 9.3.2. Data Caching and File Locking 7387 For those applications that choose to use file locking instead of 7388 share reservations to exclude inconsistent file access, there is an 7389 analogous set of constraints that apply to client side data caching. 7390 These rules are effective only if the file locking is used in a way 7391 that matches in an equivalent way the actual READ and WRITE 7392 operations executed. This is as opposed to file locking that is 7393 based on pure convention.
For example, it is possible to manipulate 7394 a two-megabyte file by dividing the file into two one-megabyte 7395 regions and protecting access to the two regions by file locks on 7396 octets zero and one. A lock for write on octet zero of the file 7397 would represent the right to do READ and WRITE operations on the 7398 first region. A lock for write on octet one of the file would 7399 represent the right to do READ and WRITE operations on the second 7400 region. As long as all applications manipulating the file obey this 7401 convention, they will work on a local file system. However, they may 7402 not work with the NFS version 4 protocol unless clients refrain from 7403 data caching. 7405 The rules for data caching in the file locking environment are: 7407 o First, when a client obtains a file lock for a particular region, 7408 the data cache corresponding to that region (if any cache data 7409 exists) must be revalidated. If the change attribute indicates 7410 that the file may have been updated since the cached data was 7411 obtained, the client must flush or invalidate the cached data for 7412 the newly locked region. A client might choose to invalidate all 7413 of the non-modified cached data that it has for the file, but the only 7414 requirement for correct operation is to invalidate all of the data 7415 in the newly locked region. 7417 o Second, before releasing a write lock for a region, all modified 7418 data for that region must be flushed to the server. The modified 7419 data must also be written to stable storage. 7421 Note that flushing data to the server and the invalidation of cached 7422 data must reflect the actual octet ranges locked or unlocked. 7423 Rounding these up or down to reflect client cache block boundaries 7424 will cause problems if not carefully done. For example, writing a 7425 modified block when only half of that block is within an area being 7426 unlocked may cause invalid modification to the region outside the 7427 unlocked area.
This, in turn, may be part of a region locked by 7428 another client. Clients can avoid this situation by synchronously 7429 performing portions of write operations that overlap that portion 7430 (initial or final) that is not a full block. Similarly, invalidating 7431 a locked area which is not an integral number of full buffer blocks 7432 would require the client to read one or two partial blocks from the 7433 server if the revalidation procedure shows that the data which the 7434 client possesses may not be valid. 7436 The data that is written to the server as a prerequisite to the 7437 unlocking of a region must be written, at the server, to stable 7438 storage. The client may accomplish this either with synchronous 7439 writes or by following asynchronous writes with a COMMIT operation. 7440 This is required because retransmission of the modified data after a 7441 server reboot might conflict with a lock held by another client. 7443 A client implementation may choose to accommodate applications which 7444 use record locking in non-standard ways (e.g. using a record lock as 7445 a global semaphore) by flushing to the server more data upon a LOCKU 7446 than is covered by the locked range. This may include modified data 7447 within files other than the one for which the unlocks are being done. 7448 In such cases, the client must not interfere with applications whose 7449 READs and WRITEs are being done only within the bounds of record 7450 locks which the application holds. For example, an application locks 7451 a single octet of a file and proceeds to write that single octet. A 7452 client that chose to handle a LOCKU by flushing all modified data to 7453 the server could validly write that single octet in response to an 7454 unrelated unlock. However, it would not be valid to write the entire 7455 block in which that single written octet was located since it 7456 includes an area that is not locked and might be locked by another 7457 client.
Client implementations can avoid this problem by dividing 7458 files with modified data into those for which all modifications are 7459 done to areas covered by an appropriate record lock and those for 7460 which there are modifications not covered by a record lock. Any 7461 writes done for the former class of files must not include areas not 7462 locked and thus not modified on the client. 7464 9.3.3. Data Caching and Mandatory File Locking 7466 Client side data caching needs to respect mandatory file locking when 7467 it is in effect. The presence of mandatory file locking for a given 7468 file is indicated when the client gets back NFS4ERR_LOCKED from a 7469 READ or WRITE on a file it has an appropriate share reservation for. 7470 When mandatory locking is in effect for a file, the client must check 7471 for an appropriate file lock for data being read or written. If a 7472 lock exists for the range being read or written, the client may 7473 satisfy the request using the client's validated cache. If an 7474 appropriate file lock is not held for the range of the read or write, 7475 the read or write request must not be satisfied by the client's cache 7476 and the request must be sent to the server for processing. When a 7477 read or write request partially overlaps a locked region, the request 7478 should be subdivided into multiple pieces with each region (locked or 7479 not) treated appropriately. 7481 9.3.4. Data Caching and File Identity 7483 When clients cache data, the file data needs to be organized 7484 according to the file system object to which the data belongs. For 7485 NFS version 3 clients, the typical practice has been to assume for 7486 the purpose of caching that distinct filehandles represent distinct 7487 file system objects. The client then has the choice to organize and 7488 maintain the data cache on this basis. 
7490 In the NFS version 4 protocol, there is now the possibility to have 7491 significant deviations from a "one filehandle per object" model 7492 because a filehandle may be constructed on the basis of the object's 7493 pathname. Therefore, clients need a reliable method to determine if 7494 two filehandles designate the same file system object. If clients 7495 were simply to assume that all distinct filehandles denote distinct 7496 objects and proceed to do data caching on this basis, caching 7497 inconsistencies would arise between the distinct client side objects 7498 which mapped to the same server side object. 7500 By providing a method to differentiate filehandles, the NFS version 4 7501 protocol alleviates a potential functional regression in comparison 7502 with the NFS version 3 protocol. Without this method, caching 7503 inconsistencies within the same client could occur and this has not 7504 been present in previous versions of the NFS protocol. Note that it 7505 is possible to have such inconsistencies with applications executing 7506 on multiple clients but that is not the issue being addressed here. 7508 For the purposes of data caching, the following steps allow an NFS 7509 version 4 client to determine whether two distinct filehandles denote 7510 the same server side object: 7512 o If GETATTR directed to two filehandles returns different values of 7513 the fsid attribute, then the filehandles represent distinct 7514 objects. 7516 o If GETATTR for any file with an fsid that matches the fsid of the 7517 two filehandles in question returns a unique_handles attribute 7518 with a value of TRUE, then the two objects are distinct. 7520 o If GETATTR directed to the two filehandles does not return the 7521 fileid attribute for both of the handles, then it cannot be 7522 determined whether the two objects are the same. Therefore, 7523 operations which depend on that knowledge (e.g. client side data 7524 caching) cannot be done reliably. 
Note that if GETATTR does not 7525 return the fileid attribute for both filehandles, it will return 7526 it for neither of the filehandles, since the fsid for both 7527 filehandles is the same. 7529 o If GETATTR directed to the two filehandles returns different 7530 values for the fileid attribute, then they are distinct objects. 7532 o Otherwise they are the same object. 7534 9.4. Open Delegation 7536 When a file is being OPENed, the server may delegate further handling 7537 of opens and closes for that file to the opening client. Any such 7538 delegation is recallable, since the circumstances that allowed for 7539 the delegation are subject to change. In particular, if the server 7540 receives a conflicting OPEN from another client, the server must 7541 recall the delegation before deciding whether the OPEN from the other 7542 client may be granted. Making a delegation is up to the server, and 7543 clients should not assume that any particular OPEN either will or 7544 will not result in an open delegation. The following is a typical 7545 set of conditions that servers might use in deciding whether OPEN 7546 should be delegated: 7548 o The client must be able to respond to the server's callback 7549 requests. The server will use the CB_NULL procedure for a test of 7550 callback ability. 7552 o The client must have responded properly to previous recalls. 7554 o There must be no current open conflicting with the requested 7555 delegation. 7557 o There should be no current delegation that conflicts with the 7558 delegation being requested. 7560 o The probability of future conflicting open requests should be low 7561 based on the recent history of the file. 7563 o The existence of any server-specific semantics of OPEN/CLOSE that 7564 would make the required handling incompatible with the prescribed 7565 handling that the delegated client would apply (see below). 7567 There are two types of open delegations, read and write.
A read open 7568 delegation allows a client to handle, on its own, requests to open a 7569 file for reading that do not deny read access to others. Multiple 7570 read open delegations may be outstanding simultaneously and do not 7571 conflict. A write open delegation allows the client to handle, on 7572 its own, all opens. Only one write open delegation may exist for a 7573 given file at a given time and it is inconsistent with any read open 7574 delegations. 7576 When a client has a read open delegation, it may not make any changes 7577 to the contents or attributes of the file but it is assured that no 7578 other client may do so. When a client has a write open delegation, 7579 it may modify the file data since no other client will be accessing 7580 the file's data. The client holding a write delegation may only 7581 affect file attributes which are intimately connected with the file 7582 data: size, time_modify, change. 7584 When a client has an open delegation, it does not send OPENs or 7585 CLOSEs to the server but updates the appropriate status internally. 7586 For a read open delegation, opens that cannot be handled locally 7587 (opens for write or that deny read access) must be sent to the 7588 server. 7590 When an open delegation is made, the response to the OPEN contains an 7591 open delegation structure which specifies the following: 7593 o the type of delegation (read or write) 7595 o space limitation information to control flushing of data on close 7596 (write open delegation only, see the section "Open Delegation and 7597 Data Caching") 7599 o an nfsace4 specifying read and write permissions 7601 o a stateid to represent the delegation for READ and WRITE 7603 The delegation stateid is separate and distinct from the stateid for 7604 the OPEN proper. The standard stateid, unlike the delegation 7605 stateid, is associated with a particular lock_owner and will continue 7606 to be valid after the delegation is recalled and the file remains 7607 open. 
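The open delegation structure described above might be modelled as follows. The field and type names are illustrative, not the protocol's XDR identifiers, and the sketch collapses the write delegation's space limitation (which may alternatively be expressed as a modified-block count plus block size) into a single byte bound.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Nfsace4:
    """Simplified stand-in for the nfsace4 carried in the delegation response."""
    allow_read: bool
    allow_write: bool  # the server may return permissions more restrictive
                       # than the file's actual ACL, including deny-all

@dataclass
class OpenDelegation:
    deleg_type: str             # "read" or "write"
    stateid: bytes              # delegation stateid, distinct from the OPEN stateid
    permissions: Nfsace4        # used to avoid frequent ACCESS calls
    space_limit: Optional[int]  # write delegations only: bound on unflushed data
```

A client holding such a structure would consult `permissions` before granting a local open and `space_limit` before accumulating modified data.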
7609 When a request internal to the client is made to open a file and open 7610 delegation is in effect, it will be accepted or rejected solely on 7611 the basis of the following conditions. Any requirement for other 7612 checks to be made by the delegate should result in open delegation 7613 being denied so that the checks can be made by the server itself. 7615 o The access and deny bits for the request and the file as described 7616 in the section "Share Reservations". 7618 o The read and write permissions as determined below. 7620 The nfsace4 passed with delegation can be used to avoid frequent 7621 ACCESS calls. The permission check should be as follows: 7623 o If the nfsace4 indicates that the open may be done, then it should 7624 be granted without reference to the server. 7626 o If the nfsace4 indicates that the open may not be done, then an 7627 ACCESS request must be sent to the server to obtain the definitive 7628 answer. 7630 The server may return an nfsace4 that is more restrictive than the 7631 actual ACL of the file. This includes an nfsace4 that specifies 7632 denial of all access. Note that some common practices such as 7633 mapping the traditional user "root" to the user "nobody" may make it 7634 incorrect to return the actual ACL of the file in the delegation 7635 response. 7637 The use of delegation together with various other forms of caching 7638 creates the possibility that no server authentication will ever be 7639 performed for a given user since all of the user's requests might be 7640 satisfied locally. Where the client is depending on the server for 7641 authentication, the client should be sure authentication occurs for 7642 each user by use of the ACCESS operation. This should be the case 7643 even if an ACCESS operation would not be required otherwise. As 7644 mentioned before, the server may enforce frequent authentication by 7645 returning an nfsace4 denying all access with every open delegation. 7647 9.4.1. 
Open Delegation and Data Caching 7649 OPEN delegation allows much of the message overhead associated with 7650 the opening and closing of files to be eliminated. An open when an open 7651 delegation is in effect does not require that a validation message be 7652 sent to the server. The continued endurance of the "read open 7653 delegation" provides a guarantee that no OPEN for write and thus no 7654 write has occurred. Similarly, when closing a file opened for write 7655 while a write open delegation is in effect, the data written does not 7656 have to be flushed to the server until the open delegation is 7657 recalled. The continued endurance of the open delegation provides a 7658 guarantee that no open and thus no read or write has been done by 7659 another client. 7661 For the purposes of open delegation, READs and WRITEs done without an 7662 OPEN are treated as the functional equivalents of a corresponding 7663 type of OPEN. This refers to the READs and WRITEs that use the 7664 special stateids consisting of all zero bits or all one bits. 7665 Therefore, READs or WRITEs with a special stateid done by another 7666 client will force the server to recall a write open delegation. A 7667 WRITE with a special stateid done by another client will force a 7668 recall of read open delegations. 7670 With delegations, a client is able to avoid writing data to the 7671 server when the CLOSE of a file is serviced. The file close system 7672 call is the usual point at which the client is notified of a lack of 7673 stable storage for the modified file data generated by the 7674 application. At the close, file data is written to the server and 7675 through normal accounting the server is able to determine if the 7676 available file system space for the data has been exceeded (i.e. 7677 server returns NFS4ERR_NOSPC or NFS4ERR_DQUOT). This accounting 7678 includes quotas.
The introduction of delegations requires that an 7679 alternative method be in place for the same type of communication to 7680 occur between client and server. 7682 In the delegation response, the server provides either the limit of 7683 the size of the file or the number of modified blocks and associated 7684 block size. The server must ensure that the client will be able to 7685 flush data to the server of a size equal to that provided in the 7686 original delegation. The server must make this assurance for all 7687 outstanding delegations. Therefore, the server must be careful in 7688 its management of available space for new or modified data, taking 7689 into account available file system space and any applicable quotas. 7690 The server can recall delegations as a result of managing the 7691 available file system space. The client should abide by the server's 7692 stated space limits for delegations. If the client exceeds the stated 7693 limits for the delegation, the server's behavior is undefined. 7695 Based on server conditions, quotas or available file system space, 7696 the server may grant write open delegations with very restrictive 7697 space limitations. The limitations may be defined in a way that will 7698 always force modified data to be flushed to the server on close. 7700 With respect to authentication, flushing modified data to the server 7701 after a CLOSE has occurred may be problematic. For example, the user 7702 of the application may have logged off the client and unexpired 7703 authentication credentials may not be present. In this case, the 7704 client may need to take special care to ensure that local unexpired 7705 credentials will in fact be available. This may be accomplished by 7706 tracking the expiration time of credentials and flushing data well in 7707 advance of their expiration or by making private copies of 7708 credentials to assure their availability when needed. 7710 9.4.2.
Open Delegation and File Locks

When a client holds a write open delegation, lock operations are performed locally.  This includes those required for mandatory file locking.  This can be done since the delegation implies that there can be no conflicting locks.  Similarly, all of the revalidations that would normally be associated with obtaining locks and the flushing of data associated with the releasing of locks need not be done.

When a client holds a read open delegation, lock operations are not performed locally.  All lock operations, including those requesting non-exclusive locks, are sent to the server for resolution.

9.4.3.  Handling of CB_GETATTR

The server needs to employ special handling for a GETATTR where the target is a file that has a write open delegation in effect.  The reason for this is that the client holding the write delegation may have modified the data, and the server needs to reflect this change to the second client that submitted the GETATTR.  Therefore, the client holding the write delegation needs to be interrogated.  The server will use the CB_GETATTR operation.  The only attributes that the server can reliably query via CB_GETATTR are size and change.

Since CB_GETATTR is being used to satisfy another client's GETATTR request, the server only needs to know if the client holding the delegation has a modified version of the file.  If the client's copy of the delegated file is not modified (data or size), the server can satisfy the second client's GETATTR request from the attributes stored locally at the server.  If the file is modified, the server only needs to know about this modified state.  If the server determines that the file is currently modified, it will respond to the second client's GETATTR as if the file had been modified locally at the server.
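At a high level, the server's decision can be sketched as follows.  This is an illustrative Python sketch, not protocol text; the attribute dictionary and delegation record layouts are invented for the example, and the detailed change-attribute handling is developed in the remainder of this section.

```python
def serve_getattr(file_attrs, write_delegation, cb_getattr):
    """Satisfy a second client's GETATTR on a possibly delegated file.

    file_attrs:         the server's locally stored attributes (a dict).
    write_delegation:   None, or {"client": ..., "sc": change value cached
                        when the delegation was granted}.
    cb_getattr(client): queries the delegate for the only attributes the
                        server can reliably obtain this way: size, change.
    """
    if write_delegation is None:
        # No write delegation: local attributes are authoritative.
        return dict(file_attrs)
    size, change = cb_getattr(write_delegation["client"])
    if change == write_delegation["sc"] and size == file_attrs["size"]:
        # The delegate holds no modified data.
        return dict(file_attrs)
    # The delegate has modified the file: respond as if the file had
    # been modified locally at the server, with the size it reported.
    modified = dict(file_attrs)
    modified["size"] = size
    return modified
```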
Since the form of the change attribute is determined by the server and is opaque to the client, the client and server need to agree on a method of communicating the modified state of the file.  For the size attribute, the client will report its current view of the file size.  For the change attribute, the handling is more involved.

For the client, the following steps will be taken when receiving a write delegation:

o  The value of the change attribute will be obtained from the server and cached.  Let this value be represented by c.

o  The client will create a value greater than c that will be used to communicate that modified data is held at the client.  Let this value be represented by d.

o  When the client is queried via CB_GETATTR for the change attribute, it checks to see if it holds modified data.  If the file is modified, the value d is returned for the change attribute value.  If this file is not currently modified, the client returns the value c for the change attribute.

For simplicity of implementation, the client MAY for each CB_GETATTR return the same value d.  This is true even if, between successive CB_GETATTR operations, the client again modifies the file's data or metadata in its cache.  The client can return the same value because the only requirement is that the client be able to indicate to the server that the client holds modified data.  Therefore, the value of d may always be c + 1.

While the change attribute is opaque to the client in the sense that it has no idea what units of time, if any, the server is counting change with, it is not opaque in that the client has to treat it as an unsigned integer, and the server has to be able to see the results of the client's changes to that integer.  Therefore, the server MUST encode the change attribute in network order when sending it to the client.
The client MUST decode it from network order to its native order when receiving it, and the client MUST encode it in network order when sending it to the server.  For this reason, change is defined as an unsigned integer rather than an opaque array of octets.

For the server, the following steps will be taken when providing a write delegation:

o  Upon providing a write delegation, the server will cache a copy of the change attribute in the data structure it uses to record the delegation.  Let this value be represented by sc.

o  When a second client sends a GETATTR operation on the same file to the server, the server obtains the change attribute from the first client.  Let this value be cc.

o  If the value cc is equal to sc, the file is not modified and the server returns the current values for change, time_metadata, and time_modify (for example) to the second client.

o  If the value cc is NOT equal to sc, the file is currently modified at the first client and most likely will be modified at the server at a future time.  The server then uses its current time to construct attribute values for time_metadata and time_modify.  A new value of sc, which we will call nsc, is computed by the server, such that nsc >= sc + 1.  The server then returns the constructed time_metadata, time_modify, and nsc values to the requester.  The server replaces sc in the delegation record with nsc.  To prevent the possibility of time_modify, time_metadata, and change from appearing to go backward (which would happen if the client holding the delegation fails to write its modified data to the server before the delegation is revoked or returned), the server SHOULD update the file's metadata record with the constructed attribute values.  For reasons of reasonable performance, committing the constructed attribute values to stable storage is OPTIONAL.
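The client-side steps described earlier in this section (cache c when the delegation is granted; answer CB_GETATTR with d while modified data is held) can be rendered as a small sketch.  This is illustrative Python; the class and method names are invented for the example.

```python
class ClientDelegationState:
    """A client's change-attribute state for one write-delegated file."""

    def __init__(self, change_from_server):
        self.c = change_from_server      # cached at delegation grant
        self.d = change_from_server + 1  # any value > c works; c + 1 is simplest
        self.dirty = False               # does the cache hold modified data?

    def local_modify(self):
        """Record a write satisfied entirely from the client's cache."""
        self.dirty = True

    def cb_getattr_change(self):
        """Value returned for the change attribute in a CB_GETATTR reply."""
        return self.d if self.dirty else self.c
```

Note that cb_getattr_change() may legitimately return the same d across successive CB_GETATTRs even if the cache was modified again in between; the only requirement is to signal that modified data is held.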
As discussed earlier in this section, the client MAY return the same cc value on subsequent CB_GETATTR calls, even if the file was modified in the client's cache yet again between successive CB_GETATTR calls.  Therefore, the server must assume that the file has been modified yet again, and MUST take care to ensure that the new nsc it constructs and returns is greater than the previous nsc it returned.  An example implementation's delegation record would satisfy this mandate by including a boolean field (let us call it "modified") that is set to false when the delegation is granted, and an sc value set at the time of grant to the change attribute value.  The modified field would be set to true the first time cc != sc, and would stay true until the delegation is returned or revoked.  The processing for constructing nsc, time_modify, and time_metadata would use this pseudo code:

    if (!modified) {
        do CB_GETATTR for change and size;

        if (cc != sc)
            modified = TRUE;
    } else {
        do CB_GETATTR for size;
    }

    if (modified) {
        sc = sc + 1;
        time_modify = time_metadata = current_time;
        update sc, time_modify, time_metadata into file's metadata;
    }

    return to client (that sent GETATTR) the attributes
    it requested, but make sure size comes from what
    CB_GETATTR returned.  Do not update the file's metadata
    with the client's modified size.

In the case that the file attribute size is different from the server's current value, the server treats this as a modification regardless of the value of the change attribute retrieved via CB_GETATTR and responds to the second client as in the last step.

This methodology resolves issues of clock differences between client and server and other scenarios where the use of CB_GETATTR breaks down.
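The pseudocode above can be turned into a runnable sketch, shown here in Python.  The record layout and the cb_getattr helper signature are illustrative assumptions, not part of the protocol.

```python
def handle_getattr_on_delegated_file(record, cb_getattr, current_time):
    """Construct change/time attributes for a second client's GETATTR.

    record holds per-delegation state: "sc" (change value cached at
    grant) and "modified" (False at grant).  cb_getattr(want_change)
    stands in for the callback; it returns (cc, size), with cc set to
    None when only size was requested.
    """
    if not record["modified"]:
        cc, size = cb_getattr(True)           # CB_GETATTR for change and size
        if cc != record["sc"]:
            record["modified"] = True
    else:
        _, size = cb_getattr(False)           # CB_GETATTR for size only
    if record["modified"]:
        record["sc"] += 1                     # nsc: strictly greater each time
        record["time_modify"] = record["time_metadata"] = current_time
        # here the file's metadata would be updated with sc and the times
    # The reply uses the size from CB_GETATTR, but the client's modified
    # size is not written into the file's metadata.
    return {"change": record["sc"], "size": size,
            "time_modify": record.get("time_modify"),
            "time_metadata": record.get("time_metadata")}
```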
It should be noted that the server is under no obligation to use CB_GETATTR, and therefore the server MAY simply recall the delegation to avoid its use.

9.4.4.  Recall of Open Delegation

The following events necessitate recall of an open delegation:

o  Potentially conflicting OPEN request (or READ/WRITE done with "special" stateid)

o  SETATTR issued by another client

o  REMOVE request for the file

o  RENAME request for the file as either source or target of the RENAME

Whether a RENAME of a directory in the path leading to the file results in recall of an open delegation depends on the semantics of the server file system.  If that file system denies such RENAMEs when a file is open, the recall must be performed to determine whether the file in question is, in fact, open.

In addition to the situations above, the server may choose to recall open delegations at any time if resource constraints make it advisable to do so.  Clients should always be prepared for the possibility of recall.

When a client receives a recall for an open delegation, it needs to update state on the server before returning the delegation.  These same updates must be done whenever a client chooses to return a delegation voluntarily.  The following items of state need to be dealt with:

o  If the file associated with the delegation is no longer open and no previous CLOSE operation has been sent to the server, a CLOSE operation must be sent to the server.

o  If a file has other open references at the client, then OPEN operations must be sent to the server.  The appropriate stateids will be provided by the server for subsequent use by the client since the delegation stateid will no longer be valid.  These OPEN requests are done with the claim type of CLAIM_DELEGATE_CUR.
This will allow the presentation of the delegation stateid so that the client can establish the appropriate rights to perform the OPEN.  (See the section "Operation 18: OPEN" for details.)

o  If there are granted file locks, the corresponding LOCK operations need to be performed.  This applies to the write open delegation case only.

o  For a write open delegation, if at the time of recall the file is not open for write, all modified data for the file must be flushed to the server.  If the delegation had not existed, the client would have done this data flush before the CLOSE operation.

o  For a write open delegation when a file is still open at the time of recall, any modified data for the file needs to be flushed to the server.

o  With the write open delegation in place, it is possible that the file was truncated during the duration of the delegation.  For example, the truncation could have occurred as a result of an OPEN UNCHECKED with a size attribute value of zero.  Therefore, if a truncation of the file has occurred and this operation has not been propagated to the server, the truncation must occur before any modified data is written to the server.

In the case of write open delegation, file locking imposes some additional requirements.  To precisely maintain the associated invariant, it is required to flush any modified data in any region for which a write lock was released while the write delegation was in effect.  However, because the write open delegation implies no other locking by other clients, a simpler implementation is to flush all modified data for the file (as described just above) if any write lock has been released while the write open delegation was in effect.
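The items above can be sketched as a delegation-return routine.  This is a minimal illustrative Python sketch: the DelegatedState fields are invented names for the client-held state, and send() is a stand-in for issuing an NFS request (in practice these could be operations of a single COMPOUND).

```python
from dataclasses import dataclass, field

@dataclass
class DelegatedState:
    """Client-held state covered by one write delegation (illustrative)."""
    truncated: bool = False                      # truncation not yet at server
    dirty: list = field(default_factory=list)    # (offset, data) to flush
    opens: list = field(default_factory=list)    # opens needing real stateids
    locks: list = field(default_factory=list)    # locally granted locks
    close_needed: bool = False                   # closed locally, CLOSE unsent

def return_write_delegation(st, send):
    """Issue the required requests, in a valid order, then DELEGRETURN."""
    if st.truncated:
        send("SETATTR size=0")                   # truncation precedes writes
    for off, data in st.dirty:
        send(f"WRITE {off}")                     # flush all modified data
    for o in st.opens:
        send(f"OPEN({o}) CLAIM_DELEGATE_CUR")    # obtain real stateids
    for l in st.locks:
        send(f"LOCK {l}")                        # re-establish local locks
    if st.close_needed and not st.opens:
        send("CLOSE")                            # file no longer open locally
    send("DELEGRETURN")
```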
An implementation need not wait until delegation recall (or deciding to voluntarily return a delegation) to perform any of the above actions, if implementation considerations (e.g., resource availability constraints) make that desirable.  Generally, however, the fact that the actual open state of the file may continue to change makes it not worthwhile to send information about opens and closes to the server, except as part of delegation return.  Only in the case of closing the open that resulted in obtaining the delegation would clients be likely to do this early, since, in that case, the close once done will not be undone.  Regardless of the client's choices on scheduling these actions, all must be performed before the delegation is returned, including (when applicable) the close that corresponds to the open that resulted in the delegation.  These actions can be performed either in previous requests or in previous operations in the same COMPOUND request.

9.4.5.  Clients that Fail to Honor Delegation Recalls

A client may fail to respond to a recall for various reasons, such as a failure of the callback path from the server to the client.  The client may be unaware of a failure in the callback path.  This lack of awareness could result in the client finding out long after the failure that its delegation has been revoked, and another client has modified the data for which the client had a delegation.  This is especially a problem for the client that held a write delegation.

The server also has a dilemma in that the client that fails to respond to the recall might also be sending other NFS requests, including those that renew the lease before the lease expires.  Without returning an error for those lease-renewing operations, the server leads the client to believe that the delegation it has is in force.
This difficulty is solved by the following rules:

o  When the callback path is down, the server MUST NOT revoke the delegation if one of the following occurs:

   *  The client has issued a RENEW operation and the server has returned an NFS4ERR_CB_PATH_DOWN error.  The server MUST renew the lease for any record locks and share reservations the client has that the server has known about (as opposed to those locks and share reservations the client has established but not yet sent to the server, due to the delegation).  The server SHOULD give the client a reasonable time to return its delegations to the server before revoking the client's delegations.

   *  The client has not issued a RENEW operation for some period of time after the server attempted to recall the delegation.  This period of time MUST NOT be less than the value of the lease_time attribute.

o  When the client holds a delegation, it cannot rely on operations, except for RENEW, that take a stateid, to renew delegation leases across callback path failures.  The client that wants to keep delegations in force across callback path failures must use RENEW to do so.

9.4.6.  Delegation Revocation

At the point a delegation is revoked, if there are associated opens on the client, the applications holding these opens need to be notified.  This notification usually occurs by returning errors for READ/WRITE operations or when a close is attempted for the open file.

If no opens exist for the file at the point the delegation is revoked, then notification of the revocation is unnecessary.  However, if there is modified data present at the client for the file, the user of the application should be notified.  Unfortunately, it may not be possible to notify the user since active applications may not be present at the client.
See the section "Revocation Recovery for Write Open Delegation" for additional details.

9.5.  Data Caching and Revocation

When locks and delegations are revoked, the assumptions upon which successful caching depends are no longer guaranteed.  For any locks or share reservations that have been revoked, the corresponding owner needs to be notified.  This notification includes applications with a file open that has a corresponding delegation which has been revoked.  Cached data associated with the revocation must be removed from the client.  In the case of modified data existing in the client's cache, that data must be removed from the client without it being written to the server.  As mentioned, the assumptions made by the client are no longer valid at the point when a lock or delegation has been revoked.

For example, another client may have been granted a conflicting lock after the revocation of the lock at the first client.  Therefore, the data within the lock range may have been modified by the other client.  Obviously, the first client is unable to guarantee to the application what has occurred to the file in the case of revocation.

Notification to a lock owner will in many cases consist of simply returning an error on the next and all subsequent READs/WRITEs to the open file or on the close.  Where the methods available to a client make such notification impossible because errors for certain operations may not be returned, more drastic action such as signals or process termination may be appropriate.  The justification for this is that an invariant on which an application depends may be violated.  Depending on how errors are typically treated for the client operating environment, further levels of notification including logging, console messages, and GUI pop-ups may be appropriate.

9.5.1.
Revocation Recovery for Write Open Delegation

Revocation recovery for a write open delegation poses the special issue of modified data in the client cache while the file is not open.  In this situation, any client that does not flush modified data to the server on each close must ensure that the user receives appropriate notification of the failure as a result of the revocation.  Since such situations may require human action to correct problems, notification schemes in which the appropriate user or administrator is notified may be necessary.  Logging and console messages are typical examples.

If there is modified data on the client, it must not be flushed normally to the server.  A client may attempt to provide a copy of the file data as modified during the delegation under a different name in the file system name space to ease recovery.  Note that when the client can determine that the file has not been modified by any other client, or when the client has a complete cached copy of the file in question, such a saved copy of the client's view of the file may be of particular value for recovery.  In other cases, recovery using a copy of the file based partially on the client's cached data and partially on the server's copy as modified by other clients will be anything but straightforward, so clients may avoid saving file contents in these situations or mark the results specially to warn users of possible problems.

Saving of such modified data in delegation revocation situations may be limited to files of a certain size or might be used only when sufficient disk space is available within the target file system.  Such saving may also be restricted to situations when the client has sufficient buffering resources to keep the cached copy available until it is properly stored to the target file system.

9.6.
Attribute Caching

The attributes discussed in this section do not include named attributes.  Individual named attributes are analogous to files, and caching of the data for these needs to be handled just as data caching is for ordinary files.  Similarly, LOOKUP results from an OPENATTR directory are to be cached on the same basis as any other pathnames and similarly for directory contents.

Clients may cache file attributes obtained from the server and use them to avoid subsequent GETATTR requests.  Such caching is write through in that modification to file attributes is always done by means of requests to the server and should not be done locally and cached.  The exceptions to this are modifications to attributes that are intimately connected with data caching.  Therefore, extending a file by writing data to the local data cache is reflected immediately in the size as seen on the client without this change being immediately reflected on the server.  Normally such changes are not propagated directly to the server, but when the modified data is flushed to the server, analogous attribute changes are made on the server.  When open delegation is in effect, the modified attributes may be returned to the server in the response to a CB_RECALL call.

The result of local caching of attributes is that the attribute caches maintained on individual clients will not be coherent.  Changes made in one order on the server may be seen in a different order on one client and in a third order on a different client.

The typical file system application programming interfaces do not provide means to atomically modify or interrogate attributes for multiple files at the same time.  The following rules provide an environment where the potential incoherences mentioned above can be reasonably managed.
These rules are derived from the practice of previous NFS protocols.

o  All attributes for a given file (per-fsid attributes excepted) are cached as a unit at the client so that no non-serializability can arise within the context of a single file.

o  An upper time boundary is maintained on how long a client cache entry can be kept without being refreshed from the server.

o  When operations are performed that change attributes at the server, the updated attribute set is requested as part of the containing RPC.  This includes directory operations that update attributes indirectly.  This is accomplished by following the modifying operation with a GETATTR operation and then using the results of the GETATTR to update the client's cached attributes.

Note that if the full set of attributes to be cached is requested by READDIR, the results can be cached by the client on the same basis as attributes obtained via GETATTR.

A client may validate its cached version of attributes for a file by fetching just the change and time_access attributes and assuming that if the change attribute has the same value as it did when the attributes were cached, then no attributes other than time_access have changed.  The reason why time_access is also fetched is because many servers operate in environments where the operation that updates change does not update time_access.  For example, POSIX file semantics do not update access time when a file is modified by the write system call.  Therefore, the client that wants a current time_access value should fetch it with change during the attribute cache validation processing and update its cached time_access.

The client may maintain a cache of modified attributes for those attributes intimately connected with data of modified regular files (size, time_modify, and change).
Other than those three attributes, the client MUST NOT maintain a cache of modified attributes.  Instead, attribute changes are immediately sent to the server.

In some operating environments, the equivalent to time_access is expected to be implicitly updated by each read of the content of the file object.  If an NFS client is caching the content of a file object, whether it is a regular file, directory, or symbolic link, the client SHOULD NOT update the time_access attribute (via SETATTR or a small READ or READDIR request) on the server with each read that is satisfied from cache.  The reason is that this can defeat the performance benefits of caching content, especially since an explicit SETATTR of time_access may alter the change attribute on the server.  If the change attribute changes, clients that are caching the content will think the content has changed, and will re-read unmodified data from the server.  Nor is the client encouraged to maintain a modified version of time_access in its cache, since this would mean that the client will either eventually have to write the access time to the server with bad performance effects, or it would never update the server's time_access, thereby resulting in a situation where an application that caches access time between a close and open of the same file observes the access time oscillating between the past and present.  The time_access attribute always means the time of last access to a file by a read that was satisfied by the server.  This way clients will tend to see only time_access changes that go forward in time.

9.7.  Data and Metadata Caching and Memory Mapped Files

Some operating environments include the capability for an application to map a file's content into the application's address space.
Each time the application accesses a memory location that corresponds to a block that has not been loaded into the address space, a page fault occurs and the file is read (or if the block does not exist in the file, the block is allocated and then instantiated in the application's address space).

As long as each memory mapped access to the file requires a page fault, the relevant attributes of the file that are used to detect access and modification (time_access, time_metadata, time_modify, and change) will be updated.  However, in many operating environments, when page faults are not required, these attributes will not be updated on reads or updates to the file via memory access (regardless of whether the file is a local file or is being accessed remotely).  A client or server MAY fail to update attributes of a file that is being accessed via memory mapped I/O.  This has several implications:

o  If there is an application on the server that has memory mapped a file that a client is also accessing, the client may not be able to get a consistent value of the change attribute to determine whether its cache is stale or not.  A server that knows that the file is memory mapped could always pessimistically return updated values for change so as to force the application to always get the most up to date data and metadata for the file.  However, due to the negative performance implications of this, such behavior is OPTIONAL.

o  If the memory mapped file is not being modified on the server, and instead is just being read by an application via the memory mapped interface, the client will not see an updated time_access attribute.  However, in many operating environments, neither will any process running on the server.  Thus NFS clients are at no disadvantage with respect to local processes.
o  If there is another client that is memory mapping the file, and if that client is holding a write delegation, the same set of issues as discussed in the previous two bullet items applies.  So, when a server does a CB_GETATTR to a file that the client has modified in its cache, the response from CB_GETATTR will not necessarily be accurate.  As discussed earlier, the client's obligation is to report that the file has been modified since the delegation was granted, not whether it has been modified again between successive CB_GETATTR calls, and the server MUST assume that any file the client has modified in cache has been modified again between successive CB_GETATTR calls.  Depending on the nature of the client's memory management system, meeting even this weak obligation may not be possible.  A client MAY return stale information in CB_GETATTR whenever the file is memory mapped.

o  The mixture of memory mapping and file locking on the same file is problematic.  Consider the following scenario, where the page size on each client is 8192 octets.

   *  Client A memory maps the first page (8192 octets) of file X.

   *  Client B memory maps the first page (8192 octets) of file X.

   *  Client A write locks the first 4096 octets.

   *  Client B write locks the second 4096 octets.

   *  Client A, via a STORE instruction, modifies part of its locked region.

   *  Simultaneously to client A, client B issues a STORE on part of its locked region.

Here the challenge is for each client to resynchronize to get a correct view of the first page.  In many operating environments, the virtual memory management systems on each client only know a page is modified, not that a subset of the page corresponding to the respective lock regions has been modified.  So it is not possible for each client to do the right thing, which is to only write to the server that portion of the page that is locked.
For example, if client A simply writes out the page, and then client B writes out the page, client A's data is lost.

Moreover, if mandatory locking is enabled on the file, then we have a different problem.  When clients A and B issue the STORE instructions, the resulting page faults require a record lock on the entire page.  Each client then tries to extend its locked range to the entire page, which results in a deadlock.  Communicating the NFS4ERR_DEADLOCK error to a STORE instruction is difficult at best.

If a client is locking the entire memory mapped file, there is no problem with advisory or mandatory record locking, at least until the client unlocks a region in the middle of the file.

Given the above issues, the following are permitted:

o  Clients and servers MAY deny memory mapping a file for which they know there are record locks.

o  Clients and servers MAY deny a record lock on a file they know is memory mapped.

o  A client MAY deny memory mapping a file that it knows requires mandatory locking for I/O.  If mandatory locking is enabled after the file is opened and mapped, the client MAY deny the application further access to its mapped file.

9.8.  Name Caching

The results of LOOKUP and READDIR operations may be cached to avoid the cost of subsequent LOOKUP operations.  Just as in the case of attribute caching, inconsistencies may arise among the various client caches.  To mitigate the effects of these inconsistencies and given the context of typical file system APIs, an upper time boundary is maintained on how long a client name cache entry can be kept without verifying that the entry has not been made invalid by a directory change operation performed by another client.
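The staleness-bound scheme above can be sketched as follows.  This is an illustrative Python sketch; the bound value, entry layout, and fetch_change callback (standing in for a GETATTR of the directory's change attribute) are assumptions for the example.

```python
class DirNameCache:
    """Name cache for one directory with an upper staleness bound."""

    def __init__(self, staleness_bound):
        self.bound = staleness_bound
        self.entries = {}       # name -> filehandle, from LOOKUP/READDIR
        self.change = None      # directory change attr when entries cached
        self.expires = 0.0      # cached names trusted until this time

    def valid(self, fetch_change, now):
        """Return True if the cached names may still be used."""
        if now < self.expires:
            return True                       # within the staleness bound
        current = fetch_change()              # GETATTR of the directory
        if current == self.change:
            self.expires = now + self.bound   # unchanged: extend lifetime
            return True
        self.entries.clear()                  # changed elsewhere: purge
        self.change = current
        self.expires = now + self.bound
        return False
```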
When a client is not making changes to a directory for which there exist name cache entries, the client needs to periodically fetch attributes for that directory to ensure that it is not being modified.  After determining that no modification has occurred, the expiration time for the associated name cache entries may be updated to be the current time plus the name cache staleness bound.

When a client is making changes to a given directory, it needs to determine whether there have been changes made to the directory by other clients.  It does this by using the change attribute as reported before and after the directory operation in the associated change_info4 value returned for the operation.  The server is able to communicate to the client whether the change_info4 data is provided atomically with respect to the directory operation.  If the change values are provided atomically, the client is then able to compare the pre-operation change value with the change value in the client's name cache.  If the comparison indicates that the directory was updated by another client, the name cache associated with the modified directory is purged from the client.  If the comparison indicates no modification, the name cache can be updated on the client to reflect the directory operation and the associated timeout extended.  The post-operation change value needs to be saved as the basis for future change_info4 comparisons.

As demonstrated by the scenario above, name caching requires that the client revalidate name cache data by inspecting the change attribute of a directory at the point when the name cache item was cached.  This requires that the server update the change attribute for directories when the contents of the corresponding directory are modified.
For a client to use the change_info4 information appropriately and correctly, the server must report the pre- and post-operation change attribute values atomically.  When the server is unable to report the before and after values atomically with respect to the directory operation, the server must indicate that fact in the change_info4 return value.  When the information is not atomically reported, the client should not assume that other clients have not changed the directory.

9.9.  Directory Caching

The results of READDIR operations may be used to avoid subsequent READDIR operations.  Just as in the cases of attribute and name caching, inconsistencies may arise among the various client caches.  To mitigate the effects of these inconsistencies, and given the context of typical file system APIs, the following rules should be followed:

o  Cached READDIR information for a directory which is not obtained in a single READDIR operation must always be a consistent snapshot of directory contents.  This is determined by using a GETATTR before the first READDIR and after the last READDIR that contributes to the cache.

o  An upper time boundary is maintained to indicate the length of time a directory cache entry is considered valid before the client must revalidate the cached information.

The revalidation technique parallels that discussed in the case of name caching.  When the client is not changing the directory in question, checking the change attribute of the directory with GETATTR is adequate.  The lifetime of the cache entry can be extended at these checkpoints.  When a client is modifying the directory, the client needs to use the change_info4 data to determine whether there are other clients modifying the directory.
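The change_info4 comparison used by both name and directory caching can be sketched as follows (Python; the function name is a hypothetical illustration, and the namedtuple stands in for the XDR change_info4 structure):

```python
from collections import namedtuple

# change_info4 as returned by a directory-modifying operation:
# 'atomic' says whether 'before' and 'after' were captured
# atomically with respect to the operation.
ChangeInfo4 = namedtuple("ChangeInfo4", ["atomic", "before", "after"])

def revalidate_cache(cached_change, info):
    """Return (keep_cache, new_cached_change) per the rules above.

    The cache is kept only when the atomically reported
    pre-operation value matches the cached change value, i.e. no
    other client modified the directory; the post-operation value
    then becomes the basis for future comparisons.
    """
    if not info.atomic:
        # Non-atomic report: the client should not assume other
        # clients have not changed the directory, so purge.
        return (False, None)
    if info.before != cached_change:
        return (False, info.after)   # another client intervened: purge
    return (True, info.after)        # update the cache in place
```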
If it is determined that no other client modifications are occurring, the client may update its directory cache to reflect its own changes.

As demonstrated previously, directory caching requires that the client revalidate directory cache data by inspecting the change attribute of a directory at the point when the directory was cached.  This requires that the server update the change attribute for directories when the contents of the corresponding directory are modified.  For a client to use the change_info4 information appropriately and correctly, the server must report the pre- and post-operation change attribute values atomically.  When the server is unable to report the before and after values atomically with respect to the directory operation, the server must indicate that fact in the change_info4 return value.  When the information is not atomically reported, the client should not assume that other clients have not changed the directory.

10.  Multi-Server Name Space

NFSv4.1 supports attributes that allow a namespace to extend beyond the boundaries of a single server.  Use of such multi-server namespaces is optional, and for many purposes, single-server namespaces are perfectly acceptable.  Use of multi-server namespaces can provide many advantages, however, by separating a file system's logical position in a namespace from the (possibly changing) logistical and administrative considerations that result in particular file systems being located on particular servers.

10.1.  Location attributes

NFSv4 contains recommended attributes that allow file systems on one server to be associated with one or more instances of that file system on other servers.
These attributes specify such file systems by specifying a server name (either a DNS name or an IP address) together with the path of that file system within that server's single-server namespace.

The fs_locations_info recommended attribute allows specification of one or more file system instance locations where the data corresponding to a given file system may be found.  This attribute provides to the client, in addition to information about file system instance locations, extensive information about the various file system instance choices (e.g. priority for use, writability, currency, etc.) as well as information to help the client efficiently effect as seamless a transition as possible among multiple file system instances, when and if that should be necessary.

The fs_locations recommended attribute is inherited from NFSv4.0 and only allows specification of the file system locations where the data corresponding to a given file system may be found.  Servers should make this attribute available whenever fs_locations_info is supported, but client use of fs_locations_info is to be preferred.

10.2.  File System Presence or Absence

A given location in an NFSv4 namespace (typically but not necessarily a multi-server namespace) can have a number of file system instance locations associated with it (via the fs_locations or fs_locations_info attribute).  There may also be an actual current file system at that location, accessible via normal namespace operations (e.g. LOOKUP).  In this case, the file system is said to be "present" at that position in the namespace and clients will typically use it, reserving use of additional locations specified via the location-related attributes to situations in which the principal location is no longer available.
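As a rough illustration of the shape of such a location (Python; the names and the string form are hypothetical conveniences, not the XDR encoding actually carried by these attributes):

```python
from collections import namedtuple

# One file system instance location: a server (DNS name or IP
# address) plus the path of the file system within that server's
# single-server namespace, held as a list of components as in the
# NFSv4 pathname4 type.
Location = namedtuple("Location", ["server", "path"])

def parse_location(text):
    """Parse a 'server:/a/b/c' style string into a Location.

    Purely illustrative: the real attributes carry XDR-encoded
    structures, not strings.
    """
    server, _, path = text.partition(":")
    components = [c for c in path.split("/") if c]
    return Location(server, components)
```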
When there is no actual file system at the namespace location in question, the file system is said to be "absent".  An absent file system contains no files or directories other than the root, and any reference to it, except to access a small set of attributes useful in determining alternate locations, will result in an error, NFS4ERR_MOVED.  Note that if the server ever returns NFS4ERR_MOVED (i.e. file systems may be absent), it MUST support the fs_locations attribute and SHOULD support the fs_locations_info and fs_absent attributes.

While the error name suggests that we have a case of a file system which once was present, and has only become absent later, this is only one possibility.  A position in the namespace may be permanently absent with the file system(s) designated by the location attributes the only realization.  The name NFS4ERR_MOVED reflects an earlier, more limited conception of its function, but this error will be returned whenever the referenced file system is absent, whether it has moved or not.

Except in the case of GETATTR-type operations (to be discussed later), when the current filehandle at the start of an operation is within an absent file system, that operation is not performed and the error NFS4ERR_MOVED is returned, to indicate that the file system is absent on the current server.

Because a GETFH cannot succeed if the current filehandle is within an absent file system, filehandles within an absent file system cannot be transferred to the client.  When a client does have filehandles within an absent file system, it is the result of obtaining them when the file system was present, and having the file system become absent subsequently.
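The check-at-start-of-operation semantics described here (and elaborated in the next paragraph) can be sketched as follows (Python; the dispatch framing and helper names are hypothetical, though NFS4ERR_MOVED and the operation names are from the protocol):

```python
NFS4ERR_MOVED = 10019  # NFSv4.1 error code

def run_compound(ops, fs_absent):
    """Run a sequence of (op_name, new_fh) steps; fs_absent(fh)
    says whether a filehandle lies within an absent file system.

    The absence check is made at the START of each operation, so an
    operation such as PUTFH or LOOKUP that leaves the current
    filehandle inside an absent file system does not itself fail;
    the NEXT operation does, unless it is a GETATTR (whose
    attribute-mask condition is not modelled here).
    """
    current_fh = None
    for op_name, new_fh in ops:
        if current_fh is not None and fs_absent(current_fh):
            if op_name != "GETATTR":  # simplified GETATTR exception
                return (op_name, NFS4ERR_MOVED)
        current_fh = new_fh if new_fh is not None else current_fh
    return (None, 0)
```

This is why PUTFH-GETATTR and LOOKUP-GETATTR can still fetch location attributes from an absent file system, while a PUTFH-READDIR cannot.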
It should be noted that because the check for the current filehandle being within an absent file system happens at the start of every operation, operations which change the current filehandle so that it is within an absent file system will not result in an error.  This allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be used to get attribute information, particularly location attribute information, as discussed below.

The recommended file system attribute fs_absent can be used to interrogate the present/absent status of a given file system.

10.3.  Getting Attributes for an Absent File System

When a file system is absent, most attributes are not available, but it is necessary to allow the client access to the small set of attributes that are available, and most particularly those that give information about the correct current locations for this file system, fs_locations and fs_locations_info.

10.3.1.  GETATTR Within an Absent File System

As mentioned above, an exception is made for GETATTR in that attributes may be obtained for a filehandle within an absent file system.  This exception only applies if the attribute mask contains at least one attribute bit that indicates the client is interested in a result regarding an absent file system: fs_locations, fs_locations_info, or fs_absent.  If none of these attributes is requested, GETATTR will result in an NFS4ERR_MOVED error.

When a GETATTR is done on an absent file system, the set of supported attributes is very limited.  Many attributes, including those that are normally mandatory, will not be available on an absent file system.
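The attribute-mask handling for GETATTR on an absent file system (including the "return the mask actually supported" behavior described below) can be sketched like this (Python; set-based masks stand in for the real bitmap4 encoding, and the function name is hypothetical):

```python
NFS4ERR_MOVED = 10019  # NFSv4.1 error code

# Attributes that justify the GETATTR exception on an absent file
# system, plus the few others such a file system supports.
LOCATION_ATTRS = {"fs_locations", "fs_locations_info", "fs_absent"}
ABSENT_FS_SUPPORTED = LOCATION_ATTRS | {"change", "fsid", "mounted_on_fileid"}

def getattr_on_absent_fs(requested):
    """Return (error, attrs_returned) for a GETATTR in an absent fs.

    If no location-related attribute is requested, the operation
    fails with NFS4ERR_MOVED.  Otherwise unsupported attributes are
    silently dropped: GETATTR does not fail, but reports the mask
    of attributes it actually returned.
    """
    if not (requested & LOCATION_ATTRS):
        return (NFS4ERR_MOVED, set())
    return (0, requested & ABSENT_FS_SUPPORTED)
```

Note that VERIFY/NVERIFY are stricter, as described below: any unsupported attribute in the mask causes NFS4ERR_MOVED rather than being dropped.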
In addition to the attributes mentioned above (fs_locations, fs_locations_info, fs_absent), the following attributes SHOULD be available on absent file systems, in the case of recommended attributes at least to the same degree that they are available on present file systems.

change:  This attribute is useful for absent file systems and can be helpful in summarizing to the client when any of the location-related attributes changes.

fsid:  This attribute should be provided so that the client can determine file system boundaries, including, in particular, the boundary between present and absent file systems.

mounted_on_fileid:  For objects at the top of an absent file system this attribute needs to be available.  Since the fileid is one which is within the present parent file system, there should be no need to reference the absent file system to provide this information.

Other attributes SHOULD NOT be made available for absent file systems, even when it is possible to provide them.  The server should not assume that more information is always better and should avoid gratuitously providing additional information.

When a GETATTR operation includes a bit mask for one of the attributes fs_locations, fs_locations_info, or fs_absent, but where the bit mask includes attributes which are not supported, GETATTR will not return an error, but will return the mask of the actual attributes supported with the results.

Handling of VERIFY/NVERIFY is similar to GETATTR in that if the attribute mask does not include fs_locations, fs_locations_info, or fs_absent, the error NFS4ERR_MOVED will result.  It differs in that any appearance in the attribute mask of an attribute not supported for an absent file system (and note that this will include some normally mandatory attributes) will also cause an NFS4ERR_MOVED result.

10.3.2.  READDIR and Absent File Systems

A READDIR performed when the current filehandle is within an absent file system will result in an NFS4ERR_MOVED error, since, unlike the case of GETATTR, no such exception is made for READDIR.

Attributes for an absent file system may be fetched via a READDIR for a directory in a present file system, when that directory contains the root directories of one or more absent file systems.  In this case, the handling is as follows:

o  If the attribute set requested includes one of the attributes fs_locations, fs_locations_info, or fs_absent, then fetching of attributes proceeds normally and no NFS4ERR_MOVED indication is returned, even when the rdattr_error attribute is requested.

o  If the attribute set requested does not include one of the attributes fs_locations, fs_locations_info, or fs_absent, then if the rdattr_error attribute is requested, each directory entry for the root of an absent file system will report NFS4ERR_MOVED as the value of the rdattr_error attribute.

o  If the attribute set requested does not include any of the attributes fs_locations, fs_locations_info, fs_absent, or rdattr_error, then the occurrence of the root of an absent file system within the directory will result in the READDIR failing with an NFS4ERR_MOVED error.

o  The unavailability of an attribute because of a file system's absence, even one that is ordinarily mandatory, does not result in any error indication.  The set of attributes returned for the root directory of the absent file system in that case is simply restricted to those actually available.

10.4.  Uses of Location Information

The location-bearing attributes (fs_locations and fs_locations_info) provide, together with the possibility of absent file systems, a number of important facilities in providing reliable, manageable, and scalable data access.
When a file system is present, these attributes can provide alternative locations, to be used to access the same data, in the event that server failures, communications problems, or other difficulties make continued access to the current file system impossible or otherwise impractical.  Under some circumstances multiple alternative locations may be used simultaneously to provide higher performance access to the file system in question.  Provision of such alternate locations is referred to as "replication" although there are cases in which replicated sets of data are not in fact present, and the replicas are instead different paths to the same data.

When a file system is present and becomes absent, clients can be given the opportunity to have continued access to their data, at an alternate location.  In this case, a continued attempt to use the data in the now-absent file system will result in an NFS4ERR_MOVED error and at that point the successor locations (typically only one but multiple choices are possible) can be fetched and used to continue access.  Transfer of the file system contents to the new location is referred to as "migration", but it should be kept in mind that there are cases in which this term can be used, like "replication", when there is no actual data migration per se.

Where a file system was not previously present, specification of file system location provides a means by which file systems located on one server can be associated with a namespace defined by another server, thus allowing a general multi-server namespace facility.  Designation of such a location, in place of an absent file system, is called a "referral".

10.4.1.  File System Replication

The fs_locations and fs_locations_info attributes provide alternative locations, to be used to access data in place of or in addition to the current file system instance.  On first access to a file system, the client should obtain the value of the set of alternate locations by interrogating the fs_locations or fs_locations_info attribute, with the latter being preferred.

In the event that server failures, communications problems, or other difficulties make continued access to the current file system impossible or otherwise impractical, the client can use the alternate locations as a way to get continued access to its data.  Depending on specific attributes of these alternate locations, as indicated within the fs_locations_info attribute, multiple locations may be used simultaneously, to provide higher performance through the exploitation of multiple paths between client and target file system.

The alternate locations may be physical replicas of the (typically read-only) file system data, or they may reflect alternate paths to the same server or provide for the use of various forms of server clustering in which multiple servers provide alternate ways of accessing the same physical file system.  How these different modes of file system transition are represented within the fs_locations and fs_locations_info attributes and how the client deals with file system transition issues will be discussed in detail below.
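A minimal client-side failover loop over such alternate locations might look like this (Python; the exception type and both function names are hypothetical placeholders for the client's RPC machinery):

```python
class AccessError(Exception):
    """Raised by try_access when a location is unreachable."""

def access_with_failover(locations, try_access):
    """Try each alternate location in order until one succeeds.

    'locations' is the list obtained from fs_locations or
    fs_locations_info (the latter can additionally order the list
    by server-assigned priority); 'try_access' performs the actual
    access and raises AccessError on failure.
    """
    last_error = None
    for loc in locations:
        try:
            return try_access(loc)
        except AccessError as e:
            last_error = e       # fall through to the next replica
    raise last_error or AccessError("no locations available")
```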
When multiple server addresses correspond to the same actual server, as shown by a common so_major_id field within the eir_server_owner field returned by EXCHANGE_ID, the client may assume that for each file system in the namespace of a given server network address, there exist file systems at corresponding namespace locations for each of the other server network addresses, even in the absence of explicit listing in fs_locations and fs_locations_info.  Such corresponding file system locations can be used as alternate locations, just as those explicitly specified via the fs_locations and fs_locations_info attributes.  Where these specific locations are designated in the fs_locations_info attribute, the conditions of use specified in this attribute (e.g. priorities, specification of simultaneous use) may limit the client's use of these alternate locations.

When multiple replicas exist and are used simultaneously or in succession by a client, they must designate the same data (with metadata being the same to the degree indicated by the fs_locations_info attribute).  Where file systems are writable, a change made on one instance must be visible on all instances, immediately upon the earlier of the return of the modifying request or the visibility of that change on any of the associated replicas.  Where a file system is not writable but represents a read-only copy (possibly periodically updated) of a writable file system, similar requirements apply to the propagation of updates.  It must be guaranteed that any change visible on the original file system instance is immediately visible on any replica before the client transitions access to that replica, to avoid any possibility that a client, in effecting a transition to a replica, will see any reversion in file system state.
The specific means by which this will be prevented varies based on the fs4_status_type reported as part of the fs_status attribute (see Section 10.11).

10.4.2.  File System Migration

When a file system is present and becomes absent, clients can be given the opportunity to have continued access to their data, at an alternate location, as specified by the fs_locations or fs_locations_info attribute.  Typically, a client will be accessing the file system in question, get an NFS4ERR_MOVED error, and then use the fs_locations or fs_locations_info attribute to determine the new location of the data.  When fs_locations_info is used, additional information will be available which will define the nature of the client's handling of the transition to a new server.

Such migration can be helpful in providing load balancing or general resource reallocation.  The protocol does not specify how the file system will be moved between servers.  It is anticipated that a number of different server-to-server transfer mechanisms might be used with the choice left to the server implementer.  The NFSv4.1 protocol specifies the method used to communicate the migration event between client and server.

The new location may be an alternate communication path to the same server, or, in the case of various forms of server clustering, another server providing access to the same physical file system.  The client's responsibilities in dealing with this transition depend on the specific nature of the new access path and how and whether data was in fact migrated.  These issues will be discussed in detail below.
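The typical client reaction to a migration event described above can be sketched as follows (Python; `nfs_call` and `fetch_locations` are hypothetical stand-ins for the client's COMPOUND machinery, though NFS4ERR_MOVED is the protocol's error code):

```python
NFS4ERR_MOVED = 10019  # NFSv4.1 error code

def call_with_migration(nfs_call, fetch_locations, op):
    """Issue an operation, following a migration event if needed.

    nfs_call(location, op) returns (status, result), with location
    None meaning the currently used server; fetch_locations(op)
    retrieves fs_locations/fs_locations_info for the file system
    (permitted even when it is absent, via the GETATTR exception).
    On NFS4ERR_MOVED the client fetches the successor locations and
    retries at the first of them.
    """
    status, result = nfs_call(None, op)
    if status != NFS4ERR_MOVED:
        return result
    locations = fetch_locations(op)   # typically a single successor
    status, result = nfs_call(locations[0], op)
    if status != 0:
        raise IOError("migration target failed with status %d" % status)
    return result
```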
When multiple server addresses correspond to the same actual server, as shown by a common value for the so_major_id field of the eir_server_owner field returned by EXCHANGE_ID, the location or locations may designate alternate server addresses in the form of specific server network addresses, when the file system in question is available at those addresses, and no longer accessible at the original address.

Although a single successor location is typical, multiple locations may be provided, together with information that allows priority among the choices to be indicated, via information in the fs_locations_info attribute.  Where suitable clustering mechanisms make it possible to provide multiple identical file systems or paths to them, this allows the client the opportunity to deal with any resource or communications issues that might limit data availability.

When an alternate location is designated as the target for migration, it must designate the same data (with metadata being the same to the degree indicated by the fs_locations_info attribute).  Where file systems are writable, a change made on the original file system must be visible on all migration targets.  Where a file system is not writable but represents a read-only copy (possibly periodically updated) of a writable file system, similar requirements apply to the propagation of updates.  Any change visible in the original file system must already be effected on all migration targets, to avoid any possibility that a client, in effecting a transition to the migration target, will see any reversion in file system state.

10.4.3.  Referrals

Referrals provide a way of placing a file system in a location essentially without respect to its physical location on a given server.
This allows a single server or a set of servers to present a multi-server namespace that encompasses file systems located on multiple servers.  Some likely uses of this include establishment of site-wide or organization-wide namespaces, or even knitting such together into a truly global namespace.

Referrals occur when a client determines, upon first referencing a position in the current namespace, that it is part of a new file system and that that file system is absent.  When this occurs, typically by receiving the error NFS4ERR_MOVED, the actual location or locations of the file system can be determined by fetching the fs_locations or fs_locations_info attribute.

The location-related attribute may designate a single file system location or multiple file system locations, to be selected based on the needs of the client.  The server, in the fs_locations_info attribute, may specify priorities to be associated with various file system location choices.  The server may assign different priorities to different locations as reported to individual clients, in order to adapt to client physical location or to effect load balancing.  When both read-only and read-write file systems are present, some of the read-only locations may not be absolutely up-to-date (as they would have to be in the case of replication and migration).  Servers may also specify file system locations that include client-substituted variables so that different clients are referred to different file systems (with different data contents) based on client attributes such as CPU architecture.

Use of multi-server namespaces is enabled by NFSv4 but is not required.  The use of multi-server namespaces and their scope will depend on the applications used and system administration preferences.
Multi-server namespaces can be established by a single server providing a large set of referrals to all of the included file systems.  Alternatively, a single multi-server namespace may be administratively segmented with separate referral file systems (on separate servers) for each separately-administered section of the namespace.  Any segment or the top-level referral file system may use replicated referral file systems for higher availability.

Generally, multi-server namespaces are for the most part uniform, in that the same data made available to one client at a given location in the namespace is made available to all clients at that location.  There are, however, facilities provided which allow different clients to be directed to different sets of data, so as to adapt to such client characteristics as CPU architecture.

10.5.  Additional Client-side Considerations

When clients make use of servers that implement referrals, replication, and migration, care should be taken so that a user who mounts a given file system that includes a referral or a relocated file system continues to see a coherent picture of that user-side file system despite the fact that it contains a number of server-side file systems which may be on different servers.

One important issue is upward navigation from the root of a server-side file system to its parent (specified as ".." in UNIX).  The client needs to determine when it hits an fsid root going up the file tree.  When at such a point, and needing to ascend to the parent, it must do so locally instead of sending a LOOKUPP call to the server.  The LOOKUPP would normally return the ancestor of the target file system on the target server, which may not be part of the space that the client mounted.

A related issue is upward navigation from named attribute directories.
The named attribute directories are essentially detached from the namespace and this property should be safely represented in the client operating environment.  LOOKUPP on a named attribute directory may return the filehandle of the associated file, and conveying this to applications might be unsafe as many applications expect the parent of a directory to itself be a directory.  Therefore the client may want to hide the parent of named attribute directories (represented as ".." in UNIX) or represent the named attribute directory as its own parent (as typically done for the file system root directory in UNIX).

Another issue concerns refresh of referral locations.  When referrals are used extensively, they may change as server configurations change.  It is expected that clients will cache information related to traversing referrals so that future client-side requests are resolved locally without server communication.  This is usually rooted in client-side name lookup caching.  Clients should periodically purge this data for referral points in order to detect changes in location information.  When the change attribute changes for directories that hold referral entries or for the referral entries themselves, clients should consider any associated cached referral information to be out of date.

10.6.  Effecting File System Transitions

Transitions between file system instances, whether due to switching between replicas upon server unavailability, or in response to server-initiated migration events, are best dealt with together.
Even though the prototypical use cases of replication and migration contain distinctive sets of features, when all possibilities for these operations are considered, the underlying unity of these operations, from the client's point of view, is clear, even though for the server pragmatic considerations will normally force different implementation strategies for planned and unplanned transitions.

A number of methods are possible for servers to replicate data and to track client state in order to allow clients to transition between file system instances with a minimum of disruption.  Such methods vary between those that use inter-server clustering techniques to limit the changes seen by the client, to those that are less aggressive, use more standard methods of replicating data, and impose a greater burden on the client to adapt to the transition.

The NFSv4.1 protocol does not impose choices on clients and servers with regard to that spectrum of transition methods.  In fact, there are many valid choices, depending on client and application requirements and their interaction with server implementation choices.  The NFSv4.1 protocol does define the specific choices that can be made, how these choices are communicated to the client, and how the client is to deal with any discontinuities.

In the sections below, references will be made to various possible server implementation choices as a way of illustrating the transition scenarios that clients may deal with.  The intent here is not to define or limit server implementations but rather to illustrate the range of issues that clients may face.

In the discussion below, references will be made to a file system having a particular property or of two file systems (typically the source and destination) belonging to a common class of any of several types.
Two file systems that belong to such a class share some important aspect of file system behavior that clients may depend upon, when present, to easily effect a seamless transition between file system instances.  Conversely, where the file systems do not belong to such a common class, the client has to deal with various sorts of implementation discontinuities which may cause performance or other issues in effecting a transition.

Where the fs_locations_info attribute is available, such file system classification data will be made directly available to the client.  See Section 10.10 for details.  When only fs_locations is available, default assumptions with regard to such classifications have to be inferred.  See Section 10.9 for details.

In cases in which one server is expected to accept opaque values from the client that originated from another server, it is a wise implementation practice for the servers to encode the "opaque" values in big-endian octet order.  If this is done, servers acting as replicas or immigrating file systems will be able to parse values like stateids, directory cookies, filehandles, etc. even if their native octet order is different from that of other servers cooperating in the replication and migration of the file system.

10.6.1.  File System Transitions and Simultaneous Access

When a single file system may be accessed at multiple locations, whether this is because of an indication of file system identity as reported by the fs_locations or fs_locations_info attributes or because two file system instances have corresponding locations on server addresses which connect to the same server, as indicated by a common so_major_id field in the eir_server_owner field returned by EXCHANGE_ID, the client will, depending on specific circumstances as discussed below, either:

o  Access multiple instances simultaneously, as representing alternate paths to the same data and metadata.

o  Access one instance (or set of instances) and then transition to an alternative instance (or set of instances) as a result of network issues, server unresponsiveness, or server-directed migration.  The transition may involve changes in filehandles, fileids, the change attribute, and/or locking state, depending on the attributes of the source and destination file system instances, as specified in the fs_locations_info attribute.

Which of these choices is possible, and how a transition is effected, is governed by equivalence classes of file system instances as reported by the fs_locations_info attribute, and, for file system instances in the same location within multiple single-server namespaces, by the so_major_id field in the eir_server_owner field returned by EXCHANGE_ID.

10.6.2.  Simultaneous Use and Transparent Transitions

When two file system instances have the same location within their respective single-server namespaces and those two server IP addresses return the same so_major_id value in the eir_server_owner value returned in response to EXCHANGE_ID, those file system instances can be treated as the same, and either used together simultaneously or serially with no transition activity required on the part of the client.

Whether simultaneous use of the two file system instances is valid is controlled by whether the fs_locations_info attribute shows the two instances as having the same _simultaneous-use_ class.

Note that for two such file systems, any information within the fs_locations_info attribute that indicates the need for special transition activity, i.e. the appearance of the two file system instances with different _handle_, _fileid_, _verifier_, or _change_ classes, MUST be ignored by the client.  The server SHOULD NOT indicate that these instances belong to different _handle_, _fileid_, _verifier_, or _change_ classes, whether the two instances are shown as belonging to the same _simultaneous-use_ class or not.

Where these conditions do not apply, a non-transparent file system instance transition is required, with the details depending on the respective _handle_, _fileid_, _verifier_, and _change_ classes of the two file system instances and whether the two servers in question have the same eir_server_scope value as reported by EXCHANGE_ID.

10.6.2.1.  Simultaneous Use of File System Instances

When the conditions above hold, in either of the following two cases, the client may use the two file system instances simultaneously.

o  The fs_locations_info attribute does not contain separate per-IP-address entries for file system instances at the distinct IP addresses.
This includes the case in which the fs_locations_info 8926 attribute is unavailable. 8928 o The fs_locations_info attribute indicates that two file system 8929 instances belong to the same _simultaneous-use_ class. 8931 In this case, the client may use both file system instances 8932 simultaneously, as representations of the same file system, whether 8933 that happens because the two IP addresses connect to the same 8934 physical server or because different servers connect to clustered 8935 file systems and export their data in common. When simultaneous use 8936 is in effect, any change made to one file system instance must be 8937 immediately reflected in the other file system instance(s). Locks 8938 are treated as part of a common lease, associated with a common 8939 client ID. Depending on the details of the eir_server_owner returned 8940 by EXCHANGE_ID, the two server instances may be accessed by different 8941 sessions or a single session in common. 8943 10.6.2.2. Transparent File System Transitions 8945 When the conditions above hold and the fs_locations_info attribute 8946 explicitly shows the file system instances for these distinct IP 8947 addresses as belonging to different _simultaneous-use_ classes, the 8948 file system instances should not be used by the client 8949 simultaneously, but rather serially with one being used unless and 8950 until communication difficulties, lack of responsiveness, or an 8951 explicit migration event causes another file system instance (or set 8952 of file system instances sharing a common _simultaneous-use_ class to 8953 be used. 8955 When a change in file system instance is to be done, the client will 8956 use the same client ID already in effect. If it already has 8957 connections to the new server address, these will be used. Otherwise 8958 new connections to existing sessions or new sessions associated with 8959 the existing client ID are established as indicated by the 8960 eir_server_owner returned by EXCHANGE_ID. 
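The simultaneous-use decision described above can be sketched as follows. This is an illustrative client-side check, not protocol-defined code; the record layout and names (FsInstance, simultaneous_use_class) are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FsInstance:
    address: str
    # None models the absence of a separate per-IP-address entry
    # (including the case where fs_locations_info is unavailable).
    simultaneous_use_class: Optional[int]

def may_use_simultaneously(a: FsInstance, b: FsInstance) -> bool:
    # No separate per-address entries: treat the instances as the
    # same file system and allow simultaneous use.
    if a.simultaneous_use_class is None or b.simultaneous_use_class is None:
        return True
    # Otherwise both entries must share a simultaneous-use class.
    return a.simultaneous_use_class == b.simultaneous_use_class
```

When this returns False, the instances are used serially, with a transparent transition between them as described in the following section.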
In all such transparent transition cases, the following apply:

o Filehandles stay the same if persistent, and if volatile, are only subject to expiration if they would have been in the absence of the file system transition.

o Fileid values do not change across the transition.

o The file system will have the same fsid in both the old and new locations.

o Change attribute values are consistent across the transition and do not have to be refetched. When change attributes indicate that a cached object is still valid, it can remain cached.

o Client and state identifiers retain their validity across the transition, except where their staleness is recognized and reported by the new server. Except where such staleness requires it, no lock reclamation is needed.

o Write verifiers are presumed to retain their validity and can be presented to COMMIT, with the expectation that if COMMIT on the new server accepts them as valid, then that server has all of the data unstably written to the original server and has committed it to stable storage as requested.

10.6.3. Filehandles and File System Transitions

There are a number of ways in which filehandles can be handled across a file system transition. These can be divided into two broad classes depending upon whether the two file systems across which the transition happens share sufficient state to effect some sort of continuity of file system handling.

When there is no such cooperation in filehandle assignment, the two file systems are reported as being in different _handle_ classes. In this case, all filehandles are assumed to expire as part of the file system transition. Note that this behavior does not depend on the fh_expire_type attribute and supersedes the specification of the FH4_VOL_MIGRATION bit, which only affects behavior when fs_locations_info is not available.
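The handle-class rules of this section, together with the same-_handle_-class case treated next, can be summarized in a small sketch. The function and parameter names are illustrative, not from the protocol.

```python
def filehandle_survives(same_handle_class: bool,
                        persistent: bool,
                        volatile_only_due_to_migration: bool) -> bool:
    # Different handle classes: every filehandle is assumed to expire
    # as part of the file system transition.
    if not same_handle_class:
        return False
    # Same handle class: persistent filehandles remain valid, and so do
    # filehandles that are volatile only because of FH4_VOL_MIGRATION;
    # other volatile filehandles are subject to expiration on the target.
    return persistent or volatile_only_due_to_migration
```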
When there is cooperation in filehandle assignment, the two file systems are reported as being in the same _handle_ class. In this case, persistent filehandles remain valid after the file system transition, while volatile filehandles (excluding those which are only volatile due to the FH4_VOL_MIGRATION bit) are subject to expiration on the target server.

10.6.4. Fileids and File System Transitions

In NFSv4.0, the issue of continuity of fileids in the event of a file system transition was not addressed. The general expectation had been that in situations in which the two file system instances are created by a single vendor using some sort of file system image copy, fileids will be consistent across the transition, while in the analogous multi-vendor transitions they will not. This poses difficulties, especially for a client without special knowledge of the transition mechanisms adopted by the server.

It is important to note that while clients themselves may have no trouble with a fileid changing as a result of a file system transition event, applications do typically have access to the fileid (e.g., via stat), and the result is that an application may work perfectly well if there is no file system instance transition, or if any such transition is among instances created by a single vendor, yet be unable to deal with the situation in which a multi-vendor transition occurs at the wrong time.

Providing the same fileids in a multi-vendor (multiple server vendors) environment has generally been held to be quite difficult.

While there is work to be done, it needs to be pointed out that this difficulty is partly self-imposed. Servers have typically identified fileid with inode number, i.e., with a quantity used to find the file in question. This identification poses special difficulties for migration of a file system between vendors, where assigning the same index to a given file may not be possible. Note that a fileid is not required to be useful for finding the file in question, only to be unique within the given file system. Servers prepared to accept a fileid as a single piece of metadata and store it apart from the value used to index the file information can relatively easily maintain a fileid value across a migration event, allowing a truly transparent migration event.

In any case, where servers can provide continuity of fileids, they should, and the client should be able to find out that such continuity is available and take appropriate action. Information about the continuity (or lack thereof) of fileids across a file system transition is represented by specifying whether the file systems in question are of the same _fileid_ class.

10.6.5. Fsids and File System Transitions

Since fsids are only unique on a per-server basis, it is to be expected that they will change during a file system transition. Clients should not make the fsids received from the server visible to applications, since they may not be globally unique and may change during a file system transition event. Applications are best served if they are isolated from such transitions to the extent possible.

When a file system transition is made and the fs_locations_info indicates that the file system in question may be split into multiple file systems (via the FSLI4F_MULTI_FS flag), the client should do GETATTRs on all known objects within the file system undergoing transition to determine the new file system boundaries. Clients may maintain the fsids passed to existing applications by mapping all of the fsids for the descendent file systems to the common fsid used for the original file system.

10.6.6. The Change Attribute and File System Transitions

Since the change attribute is defined as a server-specific one, change attributes fetched from one server are normally presumed to be invalid on another server. Such a presumption is troublesome since it would invalidate all cached change attributes, requiring refetching. Even more disruptive, the absence of any assured continuity for the change attribute means that even if the same value is obtained on refetch, no conclusions can be drawn as to whether the object in question has changed. The identical change attribute could be merely an artifact of a modified file with a different change attribute construction algorithm, with that new algorithm just happening to result in an identical change value.

When the two file systems have consistent change attribute formats, and this fact is communicated to the client by reporting them as being in the same _change_ class, the client may assume a continuity of change attribute construction and handle this situation just as it would be handled without any file system transition.

10.6.7. Lock State and File System Transitions

In a file system transition, the client needs to handle cases in which the two servers have cooperated in state management and those in which they have not. Cooperation by two servers in state management requires coordination of client IDs. Before the client attempts to use a client ID associated with one server in a request to the server of the other file system, it must eliminate the possibility that two non-cooperating servers have assigned the same client ID by accident.
The client needs to compare the eir_server_scope values returned by each server. If the scope values do not match, then the servers have not cooperated in state management. If the scope values match, then this indicates that the servers have cooperated in assigning client IDs to the point that they will reject client IDs that refer to state they do not know about.

In the case of migration, the servers involved in the migration of a file system SHOULD transfer all server state from the original to the new server. When this is done, it must be done in a way that is transparent to the client. With replication, such a degree of common state is typically not the case. Clients, however, should use the information provided by the eir_server_scope returned by EXCHANGE_ID to determine whether such sharing may be in effect, rather than making assumptions based on the reason for the transition.

This state transfer will reduce disruption to the client when a file system transition occurs. If the servers are successful in transferring all state, the client can attempt to establish sessions associated with the client ID used for the source file system instance. If the server accepts that as a valid client ID, then the client may use the existing stateids associated with that client ID for the old file system instance in connection with that same client ID on the new file system instance.

When the two servers belong to the same server scope, it does not necessarily mean that, when dealing with the transition, the client will not have to reclaim state. However, it does mean that the client may proceed using its current client ID when establishing communication with the new server, and that the new server will either recognize that client ID as valid or reject it, in which case locks must be reclaimed by the client.
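The client's choices above can be condensed into a small decision table. This is an illustrative sketch; the action names are assumptions for the example, not protocol terms.

```python
def transition_action(same_server_scope: bool, clientid_accepted: bool) -> str:
    """Decide how to handle lock state across a file system transition."""
    if not same_server_scope:
        # No cooperation in state management: establish a new client ID
        # on the destination and do not present old stateids there.
        return "new-clientid-and-reclaim"
    if clientid_accepted:
        # Same scope and the client ID was accepted: existing stateids
        # remain usable against the new file system instance.
        return "continue-with-existing-state"
    # Same scope but the client ID was rejected: reclaim locks.
    return "reclaim-locks"
```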
File systems cooperating in state management may actually share state or simply divide the ID space so as to recognize (and reject as stale) each other's state and client IDs. Servers which do share state may not do so under all conditions or at all times. The requirement for the server is that if it cannot be sure in accepting a client ID that it reflects the locks the client was given, it must treat all associated state as stale and report it as such to the client.

When the two file system instances are on servers that do not share a server scope value, the client must establish a new client ID on the destination, if it does not have one already, and reclaim locks if possible. In this case, old stateids and client IDs should not be presented to the new server since there is no assurance that they will not conflict with IDs valid on that server.

In either case, when actual locks are not known to be maintained, the destination server may establish a grace period specific to the given file system, with non-reclaim locks being rejected for that file system, even though normal locks are being granted for other file systems. Clients should not infer the absence of a grace period for file systems being transitioned to a server from responses to requests for other file systems.

In the case of lock reclamation for a given file system after a file system transition, edge conditions can arise similar to those for reclaim after server reboot (although in the case of the planned state transfer associated with migration, these can be avoided by securely recording lock state as part of state migration). Where the destination server cannot guarantee that locks will not be incorrectly granted, the destination server should not establish a file-system-specific grace period.
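A file-system-specific grace period on the destination server, as described above, could be modeled as below. This is a minimal sketch under assumed names (PerFsGrace, allows_non_reclaim_lock); a real server would tie this into its lock-granting path.

```python
class PerFsGrace:
    """Track a grace period per file system on the destination server.

    While a file system's grace period runs, non-reclaim lock requests for
    that file system are refused; other file systems are unaffected."""

    def __init__(self):
        self._deadline = {}  # fsid -> time at which grace ends

    def start(self, fsid: str, grace_seconds: float, now: float) -> None:
        # Begin a grace period for one transitioned file system.
        self._deadline[fsid] = now + grace_seconds

    def allows_non_reclaim_lock(self, fsid: str, now: float) -> bool:
        # File systems with no recorded grace period grant locks normally.
        return now >= self._deadline.get(fsid, 0.0)
```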
In place of a file-system-specific version of RECLAIM_COMPLETE, servers may assume that an attempt to obtain a new lock, other than by reclaim, indicates the end of the client's attempt to reclaim locks for that file system. [NOTE: The alternative would be to adapt RECLAIM_COMPLETE to this task.]

Information about client identity may be propagated between servers in the form of client_owner4 and associated verifiers, under the assumption that the client presents the same values to all the servers with which it deals.

Servers are encouraged to provide facilities to allow locks to be reclaimed on the new server after a file system transition. Often, however, in cases in which the two servers do not share a server scope value, such facilities may not be available, and the client should be prepared to re-obtain locks, even though it is possible that the client may have its LOCK or OPEN request denied due to a conflicting lock. In some environments, such as the transition between read-only file systems, such denial of locks should not pose large difficulties in practice. When an attempt to re-establish a lock on a new server is denied, the client should treat the situation as if its original lock had been revoked. In all cases in which the lock is granted, the client cannot assume that no conflicting lock could have been granted in the interim. Where change attribute continuity is present, the client may check the change attribute to check for unwanted file modifications. Where even this is not available, and the file system is not read-only, a client may reasonably treat all pending locks as having been revoked.

10.6.7.1. Leases and File System Transitions

In the case of lease renewal, the client may not be submitting requests for a file system that has been transferred to another server. This can occur because of the lease renewal mechanism. The client renews leases for all file systems when submitting a request on an associated session, regardless of the specific file system being referenced.

In order for the client to schedule renewal of leases that may have been relocated to the new server, the client must find out about lease relocation before those leases expire. To accomplish this, the SEQUENCE operation will return the status bit SEQ4_STATUS_LEASE_MOVED if responsibility for any of the leases to be renewed has been transferred to a new server. This condition will continue until the client receives an NFS4ERR_MOVED error and the server receives the subsequent GETATTR for the fs_locations or fs_locations_info attribute for an access to each file system for which a lease has been moved to a new server.

When a client receives a SEQ4_STATUS_LEASE_MOVED indication, it should perform an operation on each file system associated with the server in question. When the client receives an NFS4ERR_MOVED error, the client can follow the normal process to obtain the new server information (through the fs_locations and fs_locations_info attributes) and perform renewal of those leases on the new server, unless information in the fs_locations_info attribute shows that no state could have been transferred. If the server has not had state transferred to it transparently, the client will receive NFS4ERR_STALE_CLIENTID from the new server, as described above, and the client can then reclaim locks as is done in the event of server failure.

10.6.7.2. Transitions and the Lease_time Attribute

In order that the client may appropriately manage its leases in the case of a file system transition, the destination server must establish proper values for the lease_time attribute.
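The client-side probing triggered by SEQ4_STATUS_LEASE_MOVED, described in the section above, can be sketched as follows. The `probe` callback stands in for a simple per-file-system operation such as a GETATTR; the names here are assumptions for the example.

```python
def handle_lease_moved(filesystems, probe):
    """After a SEQUENCE reply carrying SEQ4_STATUS_LEASE_MOVED, perform an
    operation on each file system associated with the server and collect
    those that report NFS4ERR_MOVED, so their new locations can be fetched
    and leases renewed there."""
    moved = []
    for fs in filesystems:
        if probe(fs) == "NFS4ERR_MOVED":
            moved.append(fs)
    return moved
```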
When state is transferred transparently, that state should include the correct value of the lease_time attribute. The lease_time attribute on the destination server must never be less than that on the source, since this would result in premature expiration of leases granted by the source server. Upon transitions in which state is transferred transparently, the client is under no obligation to refetch the lease_time attribute and may continue to use the value previously fetched (on the source server).

If state has not been transferred transparently, either because the associated servers are shown as having different eir_server_scope strings or because the client ID is rejected when presented to the new server, the client should fetch the value of lease_time on the new (i.e., destination) server and use it for subsequent locking requests. However, the server must respect a grace period at least as long as the lease_time on the source server, in order to ensure that clients have ample time to reclaim their locks before potentially conflicting non-reclaimed locks are granted.

10.6.8. Write Verifiers and File System Transitions

In a file system transition, the two file systems may be clustered in the handling of unstably written data. When this is the case, and the two file systems belong to the same _verifier_ class, valid verifiers from one system may be recognized by the other and superfluous writes avoided. There is no requirement that all valid verifiers be recognized, but it cannot be the case that a verifier is recognized as valid when it is not. [NOTE: We need to resolve the issue of proper verifier scope.]
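The verifier-class rule above, together with the different-class case that follows, amounts to a simple client decision about whether unstably written data must be rewritten. A minimal sketch, with assumed names:

```python
def rewrite_needed(same_verifier_class: bool,
                   verifier_from_source: bytes,
                   recognized_verifiers: set) -> bool:
    """Must the client re-send unstable WRITEs after a transition?"""
    # Different verifier classes: all unstable writes must be presumed
    # lost and the data rewritten.
    if not same_verifier_class:
        return True
    # Same class: a verifier the destination recognizes as valid means
    # the unstably written data reached it and was committed; since not
    # all valid verifiers need be recognized, an unrecognized verifier
    # still forces a rewrite.
    return verifier_from_source not in recognized_verifiers
```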
When two file systems belong to different _verifier_ classes, the client must assume that all unstable writes in existence at the time of the file system transition have been lost, since there is no way the old verifier can be recognized as valid (or not) on the target server.

10.7. Effecting File System Referrals

Referrals are effected when an absent file system is encountered and one or more alternate locations are made available by the fs_locations or fs_locations_info attributes. The client will typically get an NFS4ERR_MOVED error, fetch the appropriate location information, and proceed to access the file system on a different server, even though it retains its logical position within the original namespace.

The examples given in the sections below are somewhat artificial in that an actual client will not typically do a multi-component lookup, but will have cached information regarding the upper levels of the name hierarchy. However, these examples are chosen to make the required behavior clear and easy to put within the scope of a small number of requests, without getting unduly into details of how specific clients might choose to cache things.

10.7.1. Referral Example (LOOKUP)

Let us suppose that the following COMPOUND is issued in an environment in which /this/is/the/path is absent from the target server. This may be for a number of reasons. It may be the case that the file system has moved, or it may be the case that the target server is functioning mainly, or solely, to refer clients to the servers on which various file systems are located.

o PUTROOTFH

o LOOKUP "this"

o LOOKUP "is"

o LOOKUP "the"

o LOOKUP "path"

o GETFH

o GETATTR fsid,fileid,size,ctime

Under the given circumstances, the following will be the result.

o PUTROOTFH --> NFS_OK. The current fh is now the root of the pseudo-fs.

o LOOKUP "this" --> NFS_OK. The current fh is for /this and is within the pseudo-fs.

o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is within the pseudo-fs.

o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and is within the pseudo-fs.

o LOOKUP "path" --> NFS_OK. The current fh is for /this/is/the/path and is within a new, absent fs, but ... the client will never see the value of that fh.

o GETFH --> NFS4ERR_MOVED. Fails because the current fh is in an absent fs at the start of the operation and the spec makes no exception for GETFH.

o GETATTR fsid,fileid,size,ctime. Not executed because the failure of the GETFH stops processing of the COMPOUND.

Given the failure of the GETFH, the client has the job of determining the root of the absent file system and where to find that file system, i.e., the server and the path relative to that server's root fh. Note that in this example, the client did not obtain filehandles and attribute information (e.g., fsid) for the intermediate directories, so it cannot be sure where the absent file system starts. It could be the case, for example, that /this/is/the is the root of the moved file system and that the reason the lookup of "path" succeeded is that the file system was not absent on that op but was moved between the last LOOKUP and the GETFH (since COMPOUND is not atomic). Even if we had the fsids for all of the intermediate directories, we would have no way of knowing that /this/is/the/path was the root of a new fs, since we don't yet have its fsid.

In order to get the necessary information, let us re-issue the chain of LOOKUPs with GETFHs and GETATTRs to at least get the fsids so we can be sure where the appropriate fs boundaries are.
The client could choose to get fs_locations_info at the same time, but in most cases the client will have a good guess as to where the fs boundaries are (because of where NFS4ERR_MOVED was gotten and where not), making fetching of fs_locations_info unnecessary.

OP01: PUTROOTFH --> NFS_OK

- Current fh is root of pseudo-fs.

OP02: GETATTR(fsid) --> NFS_OK

- Just for completeness. Normally, clients will know the fsid of the pseudo-fs as soon as they establish communication with a server.

OP03: LOOKUP "this" --> NFS_OK

OP04: GETATTR(fsid) --> NFS_OK

- Get current fsid to see where fs boundaries are. The fsid will be that for the pseudo-fs in this example, so no boundary.

OP05: GETFH --> NFS_OK

- Current fh is for /this and is within pseudo-fs.

OP06: LOOKUP "is" --> NFS_OK

- Current fh is for /this/is and is within pseudo-fs.

OP07: GETATTR(fsid) --> NFS_OK

- Get current fsid to see where fs boundaries are. The fsid will be that for the pseudo-fs in this example, so no boundary.

OP08: GETFH --> NFS_OK

- Current fh is for /this/is and is within pseudo-fs.

OP09: LOOKUP "the" --> NFS_OK

- Current fh is for /this/is/the and is within pseudo-fs.

OP10: GETATTR(fsid) --> NFS_OK

- Get current fsid to see where fs boundaries are. The fsid will be that for the pseudo-fs in this example, so no boundary.

OP11: GETFH --> NFS_OK

- Current fh is for /this/is/the and is within pseudo-fs.

OP12: LOOKUP "path" --> NFS_OK

- Current fh is for /this/is/the/path and is within a new, absent fs, but ...

- The client will never see the value of that fh.

OP13: GETATTR(fsid, fs_locations_info) --> NFS_OK

- We are getting the fsid to know where the fs boundaries are. Note that the fsid we are given will not necessarily be preserved at the new location. That fsid might be different, and in fact the fsid we have for this fs might be a valid fsid of a different fs on that new server.

- In this particular case, we are pretty sure anyway that what has moved is /this/is/the/path rather than /this/is/the, since we have the fsid of the latter and it is that of the pseudo-fs, which presumably cannot move. However, in other examples, we might not have this kind of information to rely on (e.g., /this/is/the might be a non-pseudo file system separate from /this/is/the/path), so we need to have another reliable source of information on the boundary of the fs which is moved. If, for example, the file system "/this/is" had moved, we would have a case of migration rather than referral, and once the boundaries of the migrated file system were clear, we could fetch fs_locations_info.

- We are fetching fs_locations_info because the fact that we got an NFS4ERR_MOVED at this point means that it is most likely that this is a referral, and we need the destination. Even if it is the case that "/this/is/the" is a file system which has migrated, we will still need the location information for that file system.

OP14: GETFH --> NFS4ERR_MOVED

- Fails because the current fh is in an absent fs at the start of the operation and the spec makes no exception for GETFH. Note that this has the happy consequence that we don't have to worry about the volatility or lack thereof of the fh. If the root of the fs at the new location is a persistent fh, then we can assume that this fh, which we never saw, is a persistent fh, which, if we could see it, would exactly match the new fh. At least, there is no evidence to disprove that. On the other hand, if we find a volatile root at the new location, then the filehandle which we never saw must have been volatile, or at least nobody can prove otherwise.
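The boundary-finding logic of the OP01-OP14 chain above can be sketched as a walk down the path that compares each prefix's fsid with its parent's. The `fsid_at` callback stands in for GETATTR(fsid) on a path prefix; the function name is an assumption for the example.

```python
def find_fs_root(path_components, fsid_at):
    """Return the path at which the fsid first changes along a LOOKUP
    chain from the root; that component is the root of a different
    (possibly absent) file system.  Returns None if no boundary is seen."""
    prev = fsid_at(())  # fsid of the root (the pseudo-fs)
    for i in range(len(path_components)):
        prefix = tuple(path_components[:i + 1])
        cur = fsid_at(prefix)
        if cur != prev:
            # fs boundary: this prefix is the root of a new fs.
            return "/" + "/".join(prefix)
        prev = cur
    return None
```

In the example above, the fsid stays that of the pseudo-fs through /this/is/the and changes at /this/is/the/path, identifying the latter as the root of the absent file system.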
Given the above, the client knows where the root of the absent file system is, by noting where the change of fsid occurred. The fs_locations_info attribute also gives the client the actual location of the absent file system, so that the referral can proceed. The server gives the client the bare minimum of information about the absent file system so that there will be very little scope for problems of conflict between information sent by the referring server and information at the file system's home. No filehandles and very few attributes are present on the referring server, and the client can treat those it receives as basically transient information with the function of enabling the referral.

10.7.2. Referral Example (READDIR)

Another context in which a client may encounter referrals is when it does a READDIR on a directory in which some of the sub-directories are the roots of absent file systems.

Suppose such a directory is read as follows:

o PUTROOTFH

o LOOKUP "this"

o LOOKUP "is"

o LOOKUP "the"

o READDIR (fsid, size, ctime, mounted_on_fileid)

In this case, because rdattr_error is not requested, fs_locations_info is not requested, and some of the attributes cannot be provided, the result will be an NFS4ERR_MOVED error on the READDIR, with the detailed results as follows:

o PUTROOTFH --> NFS_OK. The current fh is at the root of the pseudo-fs.

o LOOKUP "this" --> NFS_OK. The current fh is for /this and is within the pseudo-fs.

o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is within the pseudo-fs.

o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and is within the pseudo-fs.

o READDIR (fsid, size, ctime, mounted_on_fileid) --> NFS4ERR_MOVED. Note that the same error would have been returned if /this/is/the had migrated, but in fact it is because the directory contains the root of an absent fs.

So now suppose that we reissue with rdattr_error:

o PUTROOTFH

o LOOKUP "this"

o LOOKUP "is"

o LOOKUP "the"

o READDIR (rdattr_error, fsid, size, ctime, mounted_on_fileid)

The results will be:

o PUTROOTFH --> NFS_OK. The current fh is at the root of the pseudo-fs.

o LOOKUP "this" --> NFS_OK. The current fh is for /this and is within the pseudo-fs.

o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is within the pseudo-fs.

o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and is within the pseudo-fs.

o READDIR (rdattr_error, fsid, size, ctime, mounted_on_fileid) --> NFS_OK. The attributes for "path" will only contain rdattr_error, whose value will be NFS4ERR_MOVED, together with an fsid value and a value for mounted_on_fileid.

So suppose we do another READDIR to get fs_locations_info, although we could have used a GETATTR directly, as in the previous section.

o PUTROOTFH

o LOOKUP "this"

o LOOKUP "is"

o LOOKUP "the"

o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, size, ctime)

The results would be:

o PUTROOTFH --> NFS_OK. The current fh is at the root of the pseudo-fs.

o LOOKUP "this" --> NFS_OK. The current fh is for /this and is within the pseudo-fs.

o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is within the pseudo-fs.

o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and is within the pseudo-fs.

o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, size, ctime) --> NFS_OK. The attributes will be as shown below.
The attributes for "path" will only contain:

o  rdattr_error (value: NFS4ERR_MOVED)

o  fs_locations_info

o  mounted_on_fileid (value: unique fileid within referring fs)

o  fsid (value: unique value within referring server)

The attribute entry for "latest" will not contain size or ctime.

10.8.  The Attribute fs_absent

In order to provide the client with information about whether the
current file system is present or absent, the fs_absent attribute may
be interrogated.

As noted above, this attribute, when supported, may be requested of
absent file systems without causing NFS4ERR_MOVED to be returned, and
it should always be available.  Servers are strongly urged to support
this attribute on all file systems if they support it on any file
system.

10.9.  The Attribute fs_locations

The fs_locations attribute is structured in the following way:

struct fs_location {
        utf8str_cis        server<>;
        pathname4          rootpath;
};

struct fs_locations {
        pathname4          fs_root;
        fs_location        locations<>;
};

The fs_location struct is used to represent the location of a file
system by providing a server name and the path to the root of the
file system within that server's namespace.  When a set of servers
have corresponding file systems at the same path within their
namespaces, an array of server names may be provided.  An entry in
the server array is a UTF-8 string and represents one of a
traditional DNS host name, IPv4 address, or IPv6 address.  It is not
a requirement that all servers that share the same rootpath be listed
in one fs_location struct.  The array of server names is provided for
convenience.  Servers that share the same rootpath may also be listed
in separate fs_location entries in the fs_locations attribute.

The fs_locations struct and attribute contain an array of such
locations.
Since the namespace of each server may be constructed differently,
the "fs_root" field is provided.  The path represented by fs_root
represents the location of the file system in the current server's
namespace, i.e., that of the server from which the fs_locations
attribute was obtained.  The fs_root path is meant to aid the client
by clearly referencing the root of the file system whose locations
are being reported, no matter what object within the current file
system the current filehandle designates.

As an example, suppose there is a replicated file system located at
two servers (servA and servB).  At servA, the file system is located
at path "/a/b/c".  At servB, the file system is located at path
"/x/y/z".  If the client were to obtain the fs_locations value for
the directory at "/a/b/c/d", it might not necessarily know that the
file system's root is located in servA's namespace at "/a/b/c".
When the client switches to servB, it will need to determine that the
directory it first referenced at servA is now represented by the path
"/x/y/z/d" on servB.  To facilitate this, the fs_locations attribute
provided by servA would have an fs_root value of "/a/b/c" and two
entries in fs_locations.  One entry in fs_locations will be for
itself (servA) and the other will be for servB with a path of
"/x/y/z".  With this information, the client is able to substitute
"/x/y/z" for the "/a/b/c" at the beginning of its access path and
construct "/x/y/z/d" to use for the new server.

Since the fs_locations attribute lacks information defining various
attributes of the file system choices presented, it should only be
interrogated and used when fs_locations_info is not available.  When
fs_locations is used, information about the specific locations should
be assumed based on the following rules.
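The pathname substitution in the servA/servB example above can be
sketched as follows.  This is an illustrative sketch, not protocol
text; the function name and the list-of-components representation of
pathname4 are assumptions made for the example.

```python
def translate_path(path, fs_root, new_rootpath):
    """Given a path on the current server, the fs_root reported in
    fs_locations, and the rootpath of a replica, construct the
    corresponding path on the replica server.

    Pathnames are represented as lists of components, mirroring the
    component arrays used by pathname4.
    """
    if path[:len(fs_root)] != fs_root:
        raise ValueError("path is not within the reported file system")
    # Replace the fs_root prefix with the replica's rootpath and keep
    # the remainder of the path unchanged.
    return new_rootpath + path[len(fs_root):]

# The example from the text: fs_root "/a/b/c" on servA, replica at
# "/x/y/z" on servB; the directory "/a/b/c/d" becomes "/x/y/z/d".
print(translate_path(["a", "b", "c", "d"], ["a", "b", "c"], ["x", "y", "z"]))
```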
The following rules are general and apply irrespective of the
context.

o  All listed file system instances should be considered as of the
   same _handle_ class if and only if the current fh_expire_type
   attribute does not include the FH4_VOL_MIGRATION bit.  Note that
   in the case of referral, filehandle issues do not apply since
   there can be no filehandles known within the current file system,
   nor is there any access to the fh_expire_type attribute on the
   referring (absent) file system.

o  All listed file system instances should be considered as of the
   same _fileid_ class if and only if the fh_expire_type attribute
   indicates persistent filehandles and does not include the
   FH4_VOL_MIGRATION bit.  Note that in the case of referral, fileid
   issues do not apply since there can be no fileids known within the
   referring (absent) file system, nor is there any access to the
   fh_expire_type attribute.

o  All listed file system instances should be considered as of
   different _change_ classes.

For other class assignments, handling of file system transitions
depends on the reason for the transition:

o  When the transition is due to migration, the target should be
   treated as being of the same _verifier_ class as the source.

o  When the transition is due to failover to another replica, the
   target should be treated as being of a different _verifier_ class
   from the source.

The specific choices reflect typical implementation patterns for
controlled migration and failover, respectively.  Since other choices
are possible and useful, this information is better obtained by using
fs_locations_info.

See the section "Security Considerations" for a discussion of the
recommendations for the security flavor to be used by any GETATTR
operation that requests the "fs_locations" attribute.

10.10.
The Attribute fs_locations_info

The fs_locations_info attribute is intended as a more functional
replacement for fs_locations, which will continue to exist and be
supported.  Clients can use it to get a more complete set of
information about alternative file system locations.  When the server
does not support fs_locations_info, fs_locations can be used to get a
subset of the information.  A server that supports fs_locations_info
MUST support fs_locations as well.

There is additional information present in fs_locations_info that is
not available in fs_locations:

o  Attribute continuity information, to allow a client to select a
   location which meets the transparency requirements of the
   applications accessing the data and to take advantage of
   optimizations that server guarantees as to attribute continuity
   may provide (e.g., the change attribute).

o  File system identity information, which indicates when multiple
   replicas, from the client's point of view, correspond to the same
   target file system, allowing them to be used interchangeably,
   without disruption, as multiple paths to the same thing.

o  Information which will bear on the suitability of various
   replicas, depending on the use that the client intends.  For
   example, many applications need an absolutely up-to-date copy
   (e.g., those that write), while others may only need access to the
   most up-to-date copy reasonably available.

o  Server-derived preference information for replicas, which can be
   used to implement load-balancing while giving the client the
   entire fs list to be used in case the primary fails.

The fs_locations_info attribute consists of a root pathname (just
like fs_locations), together with an array of fs_locations_item4
structures.
struct fs_locations_server4 {
        int32_t         fls_currency;
        opaque          fls_info<>;
        utf8str_cis     fls_server;
};

const FSLI4BX_GFLAGS            = 0;
const FSLI4BX_TFLAGS            = 1;

const FSLI4BX_CLSIMUL           = 2;
const FSLI4BX_CLHANDLE          = 3;
const FSLI4BX_CLFILEID          = 4;
const FSLI4BX_CLVERIFIER        = 5;
const FSLI4BX_CLCHANGE          = 6;

const FSLI4BX_READRANK          = 7;
const FSLI4BX_WRITERANK         = 8;
const FSLI4BX_READORDER         = 9;
const FSLI4BX_WRITEORDER        = 10;

const FSLI4GF_WRITABLE          = 0x01;
const FSLI4GF_CUR_REQ           = 0x02;
const FSLI4GF_ABSENT            = 0x04;
const FSLI4GF_GOING             = 0x08;
const FSLI4GF_SPLIT             = 0x10;

const FSLI4TF_RDMA              = 0x01;

struct fs_locations_item4 {
        fs_locations_server4    fli_entries<>;
        pathname4               fli_rootpath;
};

struct fs_locations_info4 {
        uint32_t                fli_flags;
        int32_t                 fli_valid_for;
        pathname4               fli_fs_root;
        fs_locations_item4      fli_items<>;
};

const FSLI4IF_VAR_SUB = 0x00000001;

typedef fs_locations_info4 fattr4_fs_locations_info;

The fs_locations_info attribute is structured similarly to the
fs_locations attribute.  A top-level structure (fs_locations_info4)
contains the entire attribute, including the root pathname of the fs
and an array of lower-level structures that define replicas that
share a common root path on their respective servers.  Each lower-
level structure (fs_locations_item4) in turn contains a specific
pathname and information on one or more individual server replicas.
At the lowest level, each fs_locations_server4 structure contains
per-server-replica information in addition to the server name.
As noted above, the fs_locations_info attribute, when supported, may
be requested of absent file systems without causing NFS4ERR_MOVED to
be returned.  It is generally expected that it will be available for
both present and absent file systems, even if only a single
fs_locations_server4 entry is present, designating the current
(present) file system, or two fs_locations_server4 entries are
present, designating the current (and now previous) location of an
absent file system and its successor location.  Servers are strongly
urged to support this attribute on all file systems if they support
it on any file system.

10.10.1.  The fs_locations_server4 Structure

The fs_locations_server4 structure consists of the following items:

o  An indication of file system up-to-date-ness (fls_currency) in
   terms of approximate seconds before the present.  A negative value
   indicates that the server is unable to give any reasonably useful
   value here.  A zero indicates that the file system is the actual
   writable data or a reliably coherent and fully up-to-date copy.
   Positive values indicate how out-of-date this copy can normally be
   before it is considered for update.  Such a value is not a
   guarantee that such updates will always be performed on the
   required schedule but instead serves as a hint about how far
   behind the most up-to-date copy of the data this copy would
   normally be expected to be.

o  A counted array of one-octet values (fls_info) containing
   information about the particular file system instance.  This data
   includes general flags, transport capability flags, file system
   equivalence class information, and selection priority information.
   The encoding will be discussed below.

o  The server string (fls_server).
For the case of the replica currently being accessed (via GETATTR),
a null string may be used to indicate the current address being used
for the RPC call.

Data within the fls_info array is in the form of 8-bit data items
with constants giving the offsets within the array of various values
describing this particular file system instance.  This style of
definition was chosen, in preference to explicit XDR structure
definitions for these values, for a number of reasons:

o  The kinds of data in the fls_info array (flags, file system
   classes, and priorities among a set of file systems representing
   the same data) are such that eight bits provide a quite acceptable
   range of values.  Even where there might be more than 256 such
   file system instances, having more than 256 distinct classes or
   priorities is unlikely.

o  Explicit definition of the various specific data items within XDR
   would limit expandability in that any extension within a
   subsequent minor version would require yet another attribute,
   leading to specification and implementation clumsiness.

o  Such explicit definitions would also make it impossible to propose
   standards-track extensions apart from a full minor version.

This encoding scheme can be adapted to the specification of multi-
octet numeric values, even though none are currently defined.  If
extensions are made via standards-track RFCs, multi-octet quantities
will be encoded as a range of octets with consecutive indices, with
the octets interpreted in big-endian order.

The set of fls_info data is subject to expansion in a future minor
version, or in a standards-track RFC, within the context of a single
minor version.  The server SHOULD NOT send, and the client MUST NOT
use, indices within the fls_info array that are not defined in
standards-track RFCs.
The fls_info array contains within it:

o  Two 8-bit flag fields, one devoted to general file-system
   characteristics and a second reserved for transport-related
   capabilities.

o  Four 8-bit class values which define various file system
   equivalence classes as explained below.

o  Four 8-bit priority values which govern file system selection as
   explained below.

The general file system characteristics flag field (at octet index
FSLI4BX_GFLAGS) has the following bits defined within it:

o  FSLI4GF_WRITABLE indicates that this fs target is writable,
   allowing it to be selected by clients which may need to write on
   this file system.  When the current file system instance is
   writable, then any other file system to which the client might
   switch must incorporate within its data any committed write made
   on the current file system instance.  See the section on verifier
   class for issues related to uncommitted writes.  While there is no
   harm in not setting this flag for a file system that turns out to
   be writable, turning the flag on for a read-only file system can
   cause problems for clients that select a migration or replication
   target based on it and then find themselves unable to write.

o  FSLI4GF_CUR_REQ indicates that this replica is the one on which
   the request is being made.  Only a single server entry may have
   this flag set and, in the case of a referral, no entry will have
   it.

o  FSLI4GF_ABSENT indicates that this entry corresponds to an absent
   file system replica.  It can only be set if FSLI4GF_CUR_REQ is
   set.  When both such bits are set, it indicates that a file system
   instance is not usable but that the information in the entry can
   be used to determine the sorts of continuity available when
   switching from this replica to other possible replicas.
Since this bit can only be true if FSLI4GF_CUR_REQ is true, the
value could be determined using the fs_absent attribute, but the
information is also made available here for the convenience of the
client.  An entry with this bit set, since it represents a true
file system (albeit absent), does not appear in the event of a
referral, but only where a file system has been accessed at this
location and subsequently been migrated.

o  FSLI4GF_GOING indicates that a replica, while still available,
   should not be used further.  The client, if using it, should make
   an orderly transfer to another file system instance as
   expeditiously as possible.  It is expected that file systems going
   out of service will be announced as FSLI4GF_GOING some time before
   the actual loss of service and that the valid_for value will be
   sufficiently small to allow clients to detect and act on scheduled
   events, while large enough that the cost of the requests to fetch
   the fs_locations_info values will not be excessive.  Values on the
   order of ten minutes seem reasonable.

o  FSLI4GF_SPLIT indicates that when a transition occurs from the
   current file system instance to this one, the replacement may
   consist of multiple file systems.  In this case, the client has to
   be prepared for the possibility that objects on the same fs before
   migration will be on different ones after.  Note that
   FSLI4GF_SPLIT is not incompatible with the file systems belonging
   to the same _fileid_ class since, if one has a set of fileids that
   are unique within an fs, each subset assigned to a smaller fs
   after migration would not have any conflicts internal to that fs.

   A client, in the case of a split file system, will interrogate
   existing files with which it has a continuing connection (it is
   free simply to forget cached filehandles).
If the client remembers the directory filehandle associated with
each open file, it may proceed upward using LOOKUPP to find the new
fs boundaries.

Once the client recognizes that one file system has been split into
two, it could maintain applications running without disruption by
presenting the two file systems as a single one until a convenient
point at which to recognize the transition, such as a reboot.  This
would require a mapping from the server's fsids to fsids as seen by
the client, but this is already necessary for other reasons.  As
noted above, existing fileids within the two descendant fs's will
not conflict.  Creation of new files in the two descendant fs's may
require some amount of fileid mapping, which can be performed very
simply in many important cases.

The transport-flag field (at octet index FSLI4BX_TFLAGS) contains the
following bits related to the transport capabilities of the specific
file system.

o  FSLI4TF_RDMA indicates that this file system provides NFSv4.1 file
   system access using an RDMA-capable transport.

Attribute continuity and file system identity information are
expressed by defining equivalence relations on the sets of file
systems presented to the client.  Each such relation is expressed as
a set of file system equivalence classes.  For each relation, a file
system has an 8-bit class number.  Two file systems belong to the
same class if both have identical non-zero class numbers.  Zero is
treated as non-matching.  Most often, the relevant question for the
client will be whether a given replica is identical to, or
continuous with, the current one in a given respect, but the
information should also be available as to whether two other replicas
match in that respect.
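The octet-indexed layout of fls_info and the non-zero class-matching
rule can be sketched as follows.  This is a non-normative
illustration; the helper name and the example fls_info contents are
invented for the sketch.

```python
# Octet indices within fls_info, as defined by the FSLI4BX_* constants.
FSLI4BX_GFLAGS, FSLI4BX_TFLAGS = 0, 1
FSLI4BX_CLSIMUL, FSLI4BX_CLHANDLE = 2, 3
FSLI4BX_CLFILEID, FSLI4BX_CLVERIFIER, FSLI4BX_CLCHANGE = 4, 5, 6
FSLI4BX_READRANK, FSLI4BX_WRITERANK = 7, 8
FSLI4BX_READORDER, FSLI4BX_WRITEORDER = 9, 10

FSLI4GF_WRITABLE = 0x01

def same_class(info_a, info_b, index):
    """Two file systems are in the same equivalence class for a given
    relation if and only if both class numbers are non-zero and
    identical; a class number of zero is treated as non-matching."""
    a, b = info_a[index], info_b[index]
    return a != 0 and a == b

# Hypothetical fls_info arrays for two replicas.
replica1 = bytes([FSLI4GF_WRITABLE, 0, 7, 3, 3, 1, 2, 0, 0, 1, 1])
replica2 = bytes([FSLI4GF_WRITABLE, 0, 0, 3, 3, 1, 2, 1, 1, 0, 0])

print(same_class(replica1, replica2, FSLI4BX_CLHANDLE))  # True: 3 == 3
print(same_class(replica1, replica2, FSLI4BX_CLSIMUL))   # False: zero never matches
```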
The following fields specify the file system's class numbers for the
equivalence relations used in determining the nature of file system
transitions.  See Section 10.6 for details about how this information
is to be used.

o  The field with octet index FSLI4BX_CLSIMUL defines the
   simultaneous-use class for the file system.

o  The field with octet index FSLI4BX_CLHANDLE defines the handle
   class for the file system.

o  The field with octet index FSLI4BX_CLFILEID defines the fileid
   class for the file system.

o  The field with octet index FSLI4BX_CLVERIFIER defines the verifier
   class for the file system.

o  The field with octet index FSLI4BX_CLCHANGE defines the change
   class for the file system.

Server-specified preference information is also provided via 8-bit
values within the fls_info array.  The values provide a rank and an
order (see below), with separate values specifiable for the cases of
read-only and writable file systems.  These values are compared for
different file systems to establish the server-specified preference,
with lower values indicating "more preferred".

Rank is used to express a strict server-imposed ordering on clients,
with lower values indicating "more preferred".  Clients should
attempt to use all replicas with a given rank before they use one
with a higher rank.  Only if all of those file systems are
unavailable should the client proceed to those of the next higher
rank.

Within a rank, the order value is used to specify the server's
preference to guide the client's selection when the client's own
preferences are not controlling, with lower values of order
indicating "more preferred".  If replicas are approximately equal in
all respects, clients should defer to the order specified by the
server.
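The rank-then-order preference rules above can be sketched as a
simple sort.  The (rank, order, name) tuple representation of a
replica is an assumption of this sketch, not protocol data.

```python
def order_replicas(replicas):
    """Sort replicas by server-specified preference: strict rank
    first, then order within a rank, lower values being more
    preferred.  Each replica is a (rank, order, name) tuple."""
    return sorted(replicas, key=lambda r: (r[0], r[1]))

replicas = [(2, 1, "servC"), (1, 2, "servB"), (1, 1, "servA")]
# All rank-1 replicas come before any rank-2 replica; within rank 1,
# the lower order value wins.
print([name for _, _, name in order_replicas(replicas)])
```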
When clients look at server latency as part of their selection, they
are free to use this criterion, but it is suggested that when latency
differences are not significant, the server-specified order should
guide selection.

o  The field at octet index FSLI4BX_READRANK gives the rank value to
   be used for read-only access.

o  The field at octet index FSLI4BX_READORDER gives the order value
   to be used for read-only access.

o  The field at octet index FSLI4BX_WRITERANK gives the rank value to
   be used for writable access.

o  The field at octet index FSLI4BX_WRITEORDER gives the order value
   to be used for writable access.

Depending on the potential need for write access by a given client,
one of the pairs of rank and order values is used.  The read rank and
order should only be used if the client knows that only reading will
ever be done or if it is prepared to switch to a different replica in
the event that any write access capability is required in the future.

10.10.2.  The fs_locations_info4 Structure

The fs_locations_info4 structure, encoding the fs_locations_info
attribute, contains the following:

o  The fli_flags field, which contains general flags that affect the
   interpretation of this fs_locations_info4 structure and all
   fs_locations_item4 structures within it.  The only flag currently
   defined is FSLI4IF_VAR_SUB.  All bits in the fli_flags field which
   are not defined should always be returned as zero.

o  The fli_fs_root field, which contains the pathname of the root of
   the current file system on the current server, just as it does in
   the fs_locations structure.

o  An array called fli_items of fs_locations_item4 structures, which
   contain information about replicas of the current file system.
   Where the current file system is actually present, or has been
   present, i.e.,
this is not a referral situation, one of the fs_locations_item4
structures will contain an fs_locations_server4 entry for the
current server.  This entry will have FSLI4GF_ABSENT set if the
current file system is absent, i.e., normal access to it will
return NFS4ERR_MOVED.

o  The fli_valid_for field specifies a time in seconds for which it
   is reasonable for a client to use the fs_locations_info attribute
   without refetch.  The fli_valid_for value does not provide a
   guarantee of validity since servers can unexpectedly go out of
   service or become inaccessible for any number of reasons.  Clients
   are well advised to refetch this information for actively accessed
   file systems every fli_valid_for seconds.  This is particularly
   important when file system replicas may go out of service in a
   controlled way using the FSLI4GF_GOING flag to communicate an
   ongoing change.  The server should set fli_valid_for to a value
   which allows well-behaved clients to notice the FSLI4GF_GOING flag
   and make an orderly switch before the loss of service becomes
   effective.  If this value is zero, then no refetch interval is
   appropriate and the client need not refetch this data on any
   particular schedule.  In the event of a transition to a new file
   system instance, a new value of the fs_locations_info attribute
   will be fetched at the destination, and it is to be expected that
   this may have a different valid_for value, which the client should
   then use in the same fashion as the previous value.

The FSLI4IF_VAR_SUB flag within fli_flags controls whether variable
substitution is to be enabled.

10.10.3.
The fs_locations_item4 Structure

The fs_locations_item4 structure contains a pathname (in the field
fli_rootpath) which encodes the path of the target file system
replicas on the set of servers designated by the included
fs_locations_server4 entries.  The precise manner in which this
target location is specified depends on the value of the
FSLI4IF_VAR_SUB flag within the associated fs_locations_info4
structure.

If this flag is not set, then fli_rootpath simply designates the
location of the target file system within each server's single-server
namespace, just as the rootpath does within the fs_location
structure.  When this bit is set, however, component entries of a
certain form are subject to client-specific variable substitution, so
as to allow a degree of namespace non-uniformity in order to
accommodate the selection of client-specific file system targets to
adapt to different client architectures or other characteristics.

When such substitution is in effect, a variable beginning with the
string "${", ending with the string "}", and containing a colon is to
be replaced by the client-specific value associated with that
variable.  The string "unknown" should be used by the client when it
has no value for such a variable.  The pathname resulting from such
substitutions is used to designate the target file system, so that
different clients may have different file systems corresponding to
that location in the multi-server namespace.

As mentioned above, such substituted pathname variables contain a
colon.  The part before the colon is to be a DNS domain name, with
the part after being a case-insensitive alphanumeric string.

Where the domain is "ietf.org", only variable names defined in this
document or subsequent standards-track RFCs are subject to such
substitution.
Organizations are free to use their domain names to create their own
sets of client-specific variables, to be subject to such
substitution.  In cases where such variables are intended to be used
more broadly than within a single organization, publication of an
informational RFC defining such variables is recommended.

The variable ${ietf.org:CPU_ARCH} is used to denote the CPU
architecture for which object files are compiled.  This specification
does not limit the acceptable values (except that they must be valid
UTF-8 strings), but such values as "x86", "x86_64", and "sparc" would
be expected to be used in line with industry practice.

The variable ${ietf.org:OS_TYPE} is used to denote the operating
system and thus the kernel and library APIs for which code might be
compiled.  This specification does not limit the acceptable values
(except that they must be valid UTF-8 strings), but such values as
"linux" and "freebsd" would be expected to be used in line with
industry practice.

The variable ${ietf.org:OS_VERSION} is used to denote the operating
system version and thus the specific details of versioned interfaces
for which code might be compiled.  This specification does not limit
the acceptable values (except that they must be valid UTF-8 strings),
but combinations of numbers and letters with interspersed dots would
be expected to be used in line with industry practice, with the
details of the version format depending on the specific value of the
variable ${ietf.org:OS_TYPE} with which it is used.

Use of these variables could result in the direction of different
clients to different file systems on the same server, as appropriate
to particular clients.
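The substitution rule can be sketched as follows: variables of the
form ${domain:NAME} within a pathname component are replaced by the
client-specific value, with "unknown" used when the client has no
value.  The regular expression and the dictionary interface for
client values are assumptions of this sketch.

```python
import re

# Matches ${domain:NAME}: a DNS domain name before the colon and a
# case-insensitive alphanumeric string after it.
VAR_RE = re.compile(r"\$\{([^:}]+):([^}]+)\}")

def substitute(component, client_values):
    """Replace each ${domain:NAME} variable in a pathname component
    with the client's value for it, or "unknown" when no value is
    known.  Variable names are case-insensitive, so keys are stored
    uppercased."""
    def repl(match):
        key = (match.group(1), match.group(2).upper())
        return client_values.get(key, "unknown")
    return VAR_RE.sub(repl, component)

values = {("ietf.org", "CPU_ARCH"): "x86_64",
          ("ietf.org", "OS_TYPE"): "linux"}
print(substitute("${ietf.org:OS_TYPE}", values))     # linux
print(substitute("${ietf.org:OS_VERSION}", values))  # unknown
```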
In cases in which the target file systems are located on different
servers, a single server could serve as a referral point so that each
valid combination of variable values would designate a referral
hosted on a single server, with the targets of those referrals on a
number of different servers.

Although variable substitution is most suitable for use in the
context of referrals, it may be used in the contexts of replication
and migration.  If it is used in these contexts, the server must
ensure that no matter what values the client presents for the
substituted variables, the result is always a valid successor file
system instance to that from which a transition is occurring, i.e.,
that the data is identical or represents a later image of a writable
file system.

Note that when fli_rootpath is a null pathname (that is, one with
zero components), the file system designated is at the root of the
specified server, whether or not the FSLI4IF_VAR_SUB flag within the
associated fs_locations_info4 structure is set.

10.11.  The Attribute fs_status

In an environment in which multiple copies of the same basic set of
data are available, information regarding the particular source of
such data and the relationships among different copies can be very
helpful in providing consistent data to applications.

enum fs4_status_type {
        STATUS4_FIXED = 1,
        STATUS4_VERSIONED = 2,
        STATUS4_UPDATED = 3,
        STATUS4_WRITABLE = 4,
        STATUS4_ABSENT = 5
};

struct fs4_status {
        fs4_status_type         fsstat_type;
        utf8str_cs              fsstat_source;
        utf8str_cs              fsstat_current;
        int32_t                 fsstat_age;
        nfstime4                fsstat_version;
};

The type value indicates the kind of file system image represented.
This is of particular importance when using the version values to
determine the appropriate succession of file system images.  Five
types are distinguished:

o  STATUS4_FIXED, which indicates a read-only image in the sense that
   it will never change.  The possibility is allowed that, as a
   result of migration or a switch to a different image, changed data
   can be accessed, but within the confines of this instance no
   change is allowed.  The client can use this fact to cache
   aggressively.

o  STATUS4_VERSIONED, which indicates that the image, like the
   STATUS4_UPDATED case, is updated exogenously, but it provides a
   guarantee that the server will carefully update an associated
   version value so that the client can protect itself from a
   situation in which it reads data from one version of the file
   system and then later reads data from an earlier version of the
   same file system.  See below for a discussion of how this can be
   done.

o  STATUS4_UPDATED, which indicates an image that cannot be updated
   by the user writing to it but may be changed exogenously,
   typically because it is a periodically updated copy of another
   writable file system somewhere else.  In this case, version
   information is not provided, and the client does not have the
   responsibility of making sure that this version only advances upon
   a file system instance transition.  In this case, it is the
   responsibility of the server to make sure that the data presented
   after a file system instance transition is a proper successor
   image and includes all changes seen by the client and any change
   made before all such changes.

o  STATUS4_WRITABLE, which indicates that the file system is an
   actual writable one.
   The client need not, of course, actually write to the file system,
   but once it does, it should not accept a transition to anything
   other than a writable instance of that same file system.

o  STATUS4_ABSENT, which indicates that the information is the last
   valid for a file system which is no longer present.

The opaque strings 'source' and 'current' provide a way of presenting
information about the source of the file system image being
presented.  It is not intended that the client do anything with this
information other than make it available to administrative tools.  It
is intended that this information be helpful when researching
possible problems with a file system image that might arise when it
is unclear whether the correct image is being accessed and, if not,
how that image came to be made.  This kind of debugging information
will be helpful if, as seems likely, copies of file systems are made
in many different ways (e.g., simple user-level copies, file-system-
level point-in-time copies, cloning of the underlying storage), under
a variety of administrative arrangements.  In such environments,
determining how a given set of data was constructed can be very
helpful in resolving problems.

The opaque string 'source' is used to indicate the source of a given
file system with the expectation that tools capable of creating a
file system image propagate this information, when that is possible.
It is understood that this may not always be possible, since a user-
level copy may be thought of as creating a new data set and the tools
used may have no mechanism to propagate this data.  When a file
system is initially created, information regarding how the file
system was created, where it was created, by whom, etc., can be put
in this attribute in a human-readable string form so that it will be
available when propagated to subsequent copies of this data.

The opaque string 'current' should provide whatever information is
available about the source of the current copy: for example, the tool
that created it, any relevant parameters to that tool, the time at
which the copy was done, the user making the change, the server on
which the change was made, etc.  All such information should be in a
human-readable string form.

The age provides an indication of how out-of-date the file system
currently is with respect to its ultimate data source (in the case of
cascading data updates).  This complements the fls_currency field of
fs_locations_server4 (see Section 10.10) in the following way: the
information in fls_currency gives a bound for how out of date the
data in a file system might typically get, while the age gives a
bound on how out of date that data actually is.  Negative values
imply that no information is available.  A zero means that this data
is known to be current.  A positive value means that this data is
known to be no older than that number of seconds with respect to the
ultimate data source.

The version field provides a version identification, in the form of a
time value, such that successive versions always have later time
values.  When the file system type is anything other than
STATUS4_VERSIONED, the server may provide such a value, but there is
no guarantee as to its validity, and clients will not use it except
to provide additional information to add to 'source' and 'current'.

When the type is STATUS4_VERSIONED, servers should provide a value of
version which progresses monotonically whenever any new version of
the data is established.
This allows the client, if reliable image progression is important to
it, to fetch this attribute as part of each COMPOUND where data or
metadata from the file system is used.

When it is important to the client to make sure that only valid
successor images are accepted, it must make sure that it does not
read data or metadata from the file system without updating its sense
of the current state of the image.  This is to avoid the possibility
that the fs_status which the client holds will be one for an earlier
image, causing the client to accept a new file system instance which
is later than that but still earlier than the updated data read by
the client.

In order to do this reliably, the client must do a GETATTR of
fs_status that follows any interrogation of data or metadata within
the file system in question.  Often this is most conveniently done by
appending such a GETATTR after all other operations that reference a
given file system.  When errors occur between reading file system
data and performing such a GETATTR, care must be exercised to make
sure that the data in question is not used before obtaining the
proper fs_status value.  In this connection, when an OPEN is done
within such a versioned file system and the associated GETATTR of
fs_status is not successfully completed, the open file in question
must not be accessed until that fs_status is fetched.

The procedure above will ensure that, before using any data from the
file system, the client has in hand a newly fetched current version
of the file system image.  Multiple values from multiple requests in
flight can be resolved by assembling them into the required partial
order (the elements should form a total order within it) and using
the last.
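The version-tracking discipline described above can be sketched as a
small client-side state machine.  The following is a minimal,
hypothetical sketch (the FsStatusTracker name and its methods are not
part of the protocol), assuming nfstime4 values are modeled as
comparable (seconds, nseconds) pairs:

```python
# Hypothetical client-side sketch of fs_status version tracking for a
# STATUS4_VERSIONED file system.  Illustrative only.

STATUS4_VERSIONED = 2

class FsStatusTracker:
    """Tracks the latest fs_status version seen for one file system."""

    def __init__(self):
        self.latest_version = None  # nfstime4 modeled as (sec, nsec)

    def record(self, fsstat_type, version):
        # Resolve multiple in-flight responses by keeping the maximum:
        # within one VERSIONED file system the versions form a total
        # order, so "assemble and use the last" reduces to max().
        if fsstat_type != STATUS4_VERSIONED:
            return
        if self.latest_version is None or version > self.latest_version:
            self.latest_version = version

    def acceptable_successor(self, fsstat_type, version):
        # Decline any instance that is not VERSIONED or whose version
        # is earlier than the last one obtained from the predecessor.
        if fsstat_type != STATUS4_VERSIONED:
            return False
        if self.latest_version is not None and version < self.latest_version:
            return False
        return True
```

A client following this sketch would call record() with the fs_status
returned by the trailing GETATTR of each COMPOUND, and consult
acceptable_successor() when switching among file system instances.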
The client may then, when switching among file system instances,
decline to use an instance which is not of type STATUS4_VERSIONED or
whose version field is earlier than the last one obtained from the
predecessor file system instance.

11.  Directory Delegations

11.1.  Introduction to Directory Delegations

Directory caching for the NFSv4.1 protocol is similar to that of
previous versions.  Clients typically cache directory information for
a duration determined by the client.  At the end of a predefined
timeout, the client will query the server to see if the directory has
been updated.  By caching attributes, clients reduce the number of
GETATTR calls made to the server to validate attributes.
Furthermore, frequently accessed files and directories, such as the
current working directory, have their attributes cached on the client
so that some NFS operations can be performed without having to make
an RPC call.  By caching name and inode information about most
recently looked up entries in the Directory Name Lookup Cache (DNLC),
clients do not need to send LOOKUP calls to the server every time
these files are accessed.

This caching approach works reasonably well at reducing network
traffic in many environments.  However, it does not address
environments where there are numerous queries for files that do not
exist.  In these cases of "misses", the client must make RPC calls to
the server in order to provide reasonable application semantics and
promptly detect the creation of new directory entries.  An example of
high miss activity is compilation in software development
environments.  The current behavior of NFS limits its potential
scalability and wide-area sharing effectiveness in these types of
environments.
Other distributed stateful file system architectures such as AFS and
DFS have proven that adding state around directory contents can
greatly reduce network traffic in high-miss environments.

Delegation of directory contents is a RECOMMENDED feature of NFSv4.1.
Directory delegations provide traffic reduction benefits similar to
those of file delegations.  By allowing clients to cache directory
contents (in a read-only fashion) while being notified of changes,
the client can avoid making frequent requests to interrogate the
contents of slowly-changing directories, reducing network traffic and
improving client performance.

Directory delegations allow improved namespace cache consistency to
be achieved through delegations and synchronous recalls alone,
without asking for notifications.  In addition, if time-based
consistency is sufficient, asynchronous notifications can provide
performance benefits for the client, and possibly the server, under
some common operating conditions such as slowly-changing and/or very
large directories.

11.2.  Directory Delegation Design

NFSv4.1 introduces the GET_DIR_DELEGATION (Section 17.39) operation
to allow the client to ask for a directory delegation.  The
delegation covers directory attributes and all entries in the
directory.  If either of these changes, the delegation will be
recalled synchronously.  The operation causing the recall will have
to wait until the recall is complete.  Changes to the attributes of
individual directory entries will not cause the delegation to be
recalled.

In addition to asking for delegations, a client can also ask for
notifications for certain events.  These events include changes to
directory attributes and/or its contents.  If a client asks for
notification for a certain event, the server will notify the client
when that event occurs.
This will not result in the delegation being recalled for that
client.  The notifications are asynchronous and provide a way of
avoiding recalls in situations where a directory is changing enough
that the pure recall model may not be effective, while still allowing
the client to get substantial benefit.  In the absence of
notifications, once the delegation is recalled the client has to
refresh its directory cache, which might not be very efficient for
very large directories.

The delegation is read-only, and the client may not make changes to
the directory other than by performing NFSv4 operations that modify
the directory or the associated file attributes, so that the server
has knowledge of these changes.  In order to keep the client's
namespace synchronized with the server's, the server will notify the
client holding the delegation of the changes made as a result.  This
is to avoid subsequent GETATTR or READDIR calls to the server.  If a
single client is holding the delegation and that client makes any
changes to the directory, the delegation will not be recalled.
Multiple clients may hold a delegation on the same directory, but if
any such client modifies the directory, the server MUST recall the
delegation from the other clients.

Delegations can be recalled by the server at any time.  Normally, the
server will recall the delegation when the directory changes in a way
that is not covered by the notification, or when the directory
changes and notifications have not been requested.

Also, if the server notices that handing out a delegation for a
directory is causing too many notifications or recalls to be sent
out, it may decide not to hand out delegations for that directory or
to recall existing delegations.
If another client removes the directory for which a delegation has
been granted, the server will recall the delegation.

11.3.  Attributes in Support of Directory Notifications

See Section 5.12 for a description of the attributes associated with
directory notifications.

11.4.  Delegation Recall

The server will recall the directory delegation by sending a callback
to the client, using the same callback procedure as used for
recalling file delegations.  The server will recall the delegation
when the directory changes in a way that is not covered by the
notification.  However, the server will not recall the delegation if
attributes of an entry within the directory change.  Also, if the
server notices that handing out a delegation for a directory is
causing too many notifications to be sent out, it may decide not to
hand out a delegation for that directory.  If another client tries to
remove the directory for which a delegation has been granted, the
server will recall the delegation.

The server will recall the delegation by sending a CB_RECALL callback
to the client.  If the recall is done because of a directory-changing
event, the request making that change will need to wait while the
client returns the delegation.

11.5.  Directory Delegation Recovery

Crash recovery for state on regular files has two main goals:
avoiding the necessity of breaking application guarantees with
respect to locked files, and delivery of updates cached at the
client.  Neither of these applies to directories protected by read
delegations and notifications.  Thus, the client is required to
establish a new delegation on a server or client reboot.
[[Comment.15: we have special reclaim types that allow clients to
recover delegations through client reboot.
Do we really want EXCHANGE_ID/CREATE_SESSION to destroy directory
delegation state?]]

12.  Parallel NFS (pNFS)

12.1.  Introduction

pNFS is a set of OPTIONAL features of NFSv4.1 which allow direct
client access to the storage devices containing the file data.  When
file data for a single NFSv4 server is stored on multiple and/or
higher-throughput storage devices (by comparison to the server's
throughput capability), the result can be significantly better file
access performance.  The relationship among multiple clients, a
single server, and multiple storage devices for pNFS (server and
clients have access to all storage devices) is shown in this diagram:

    +-----------+
    |+-----------+                                 +-----------+
    ||+-----------+                                |           |
    |||           |        NFSv4 + pNFS            |           |
    +||  Clients  |<------------------------------>|   Server  |
     +|           |                                |           |
      +-----------+                                |           |
           |||                                     +-----------+
           |||                                           |
           |||                                           |
           ||| Storage        +-----------+              |
           ||| Protocol       |+-----------+             |
           ||+----------------||+-----------+  Control   |
           |+-----------------|||           |  Protocol  |
           +------------------+||  Storage  |------------+
                               +|  Devices  |
                                +-----------+

                               Figure 62

In this structure, the responsibility for coordination of file access
by multiple clients is shared among the server, clients, and storage
devices.  This is in contrast to NFSv4 without pNFS, in which this is
primarily the server's responsibility, some of which can be delegated
to clients under strictly specified conditions.

pNFS takes the form of OPTIONAL operations that manage data location
information called a layout.  Layouts are managed in a fashion
similar to NFSv4 data delegations (e.g., they are recallable and
revocable).  However, they are distinct abstractions and are
manipulated with new operations.
When a client holds a layout, it has rights to access the data
directly using the location information in the layout.

This document specifies the use of NFSv4.1 as a storage protocol.
pNFS allows other storage protocols, and these protocols are
deliberately not specified here.  These might include:

o  Block/volume protocols such as iSCSI ([29]) and FCP ([30]).  The
   block/volume protocol support can be independent of the addressing
   structure of the block/volume protocol used, allowing more than
   one protocol to access the same file data and enabling
   extensibility to other block/volume protocols.

o  Object protocols such as OSD over iSCSI or Fibre Channel [31].

o  Other storage protocols, including PVFS and other file systems
   that are in use in HPC environments.

With some storage protocols, the storage devices cannot perform fine-
grained access checks to ensure that clients are only performing
accesses within the bounds permitted to them by the pNFS operations
with the server (e.g., the checks may only be possible at file system
granularity rather than file granularity).  In situations where this
added responsibility placed on clients creates unacceptable security
risks, pNFS configurations in which storage devices cannot perform
fine-grained access checks SHOULD NOT be used.  All pNFS server
implementations MUST support NFSv4.1 access to any file accessible
via pNFS in order to provide an interoperable means of file access in
such situations.  See Section 12.9 on Security for further
discussion.

There are issues about how layouts interact with the existing NFSv4
abstractions of data delegations and record locking.  Delegation
issues are discussed in Section 12.5.4.  Byte-range locking issues
are discussed in Section 12.2.10 and Section 12.5.1.

12.2.  pNFS Definitions

pNFS partitions the NFSv4.1 file system protocol into two parts, the
metadata path and the data path.  The metadata path is implemented by
a metadata server that supports pNFS and the operations described in
this document (Section 17).  The data path is implemented by a
storage device that supports the storage protocol.  A subset (defined
in Section 13.7) of NFSv4.1 is one such storage protocol.  This leads
to new terms used to describe the protocol extension and some
clarifications of existing terms.

12.2.1.  Metadata

This is information about a file, such as its name, its owner, where
it is stored, and so forth.  Metadata also includes lower-level
information like block addresses and indirect block pointers.

12.2.2.  Metadata Server

A pNFS metadata server is an NFSv4.1 server which supports pNFS
operations and features.  When supporting pNFS, the metadata server
might hold only the metadata associated with a file, while the data
can be stored on the storage devices.  However, data may also be
written through the metadata server, which in turn ensures that the
data is written to the storage devices.

12.2.3.  Client

A pNFS client is an NFSv4.1 client, as defined by this document,
which supports pNFS operations and features and supports at least one
storage protocol for performing I/O directly to storage devices.

12.2.4.  Storage Device

A storage device controls a regular file's data, but leaves other
metadata management up to the metadata server.  A storage device
could be another NFSv4.1 server, an object storage device (OSD), a
block device accessed over a SAN (e.g., either a Fibre Channel or
iSCSI SAN), or some other entity.

12.2.5.  Data Server

A data server is a storage device that is implemented by a server of
a higher-level storage access protocol, such as NFSv4.1.

12.2.6.  Storage Protocol or Data Protocol

A storage protocol or data protocol is the protocol used between the
pNFS client and the storage device to access the file data.  Three
layout types have been described: file protocols (i.e., NFSv4.1),
object protocols (e.g., OSD), and block/volume protocols (e.g., based
on SCSI block commands).  These protocols are in turn realizable over
a variety of transport stacks.

Depending on the storage protocol, block-level metadata may or may
not be managed by the metadata server; it may instead be managed by
object storage devices or other servers acting as storage devices.

12.2.7.  Control Protocol

The control protocol is used by the exported file system between the
metadata server and the storage devices.  Specification of such
protocols is outside the scope of this document.  Such control
protocols would be used to control such activities as the allocation
and deallocation of storage and the management of state required by
the data servers to perform client access control.

While pNFS allows for any control protocol, in practice the control
protocol is closely related to the storage protocol.  For example, if
the data servers are NFSv4.1 servers, then the protocol between the
metadata server and the data servers is likely to involve NFSv4.1
operations.  Similarly, when object storage devices are used, the
pNFS metadata server will likely use iSCSI/OSD commands to manipulate
storage.

Regardless, this document does not mandate any particular control
protocol.  Instead, it just describes the requirements on the control
protocol for maintaining attributes like modify time, the change
attribute, and the end-of-file (EOF) position.

12.2.8.  Layout

A layout defines how a file's data is organized on one or more
storage devices.  There are many possible layout types.
They vary in the storage protocol used to access the data and in the
aggregation scheme that lays out the file data on the underlying
storage devices.  A layout is more precisely identified by the
following tuple: <client ID, filehandle, layout type>, where the
filehandle refers to the filehandle of the file on the metadata
server.  Layouts describe a file, not an octet-range of a file;
Section 12.2.11 describes layout segments, which do pertain to a
range.

12.2.9.  Layout Types

A layout describes the mapping of a file's data to the storage
devices that hold the data.  A layout is said to belong to a specific
layout type (data type layouttype4, see Section 3.2.15).  The layout
type allows for variants to handle different storage protocols, such
as the block/volume [24], object [23], and file (Section 13) layout
types.  A metadata server, along with its control protocol, MUST
support at least one layout type.  A private sub-range of the layout
type name space is also defined.  Values from the private layout type
range can be used for internal testing or experimentation.

As an example, a file layout type could be an array of tuples (e.g.,
<deviceID, file_handle>), along with a definition of how the data is
stored across the devices (e.g., striping).  A block/volume layout
might be an array of tuples that store <deviceID, block_number, block
count>, along with information about block size and the file offset
of the first block.  An object layout might be an array of tuples
<deviceID, objectID> and an additional structure (i.e., the
aggregation map) that defines how the logical octet sequence of the
file data is serialized into the different objects.  Note that the
actual layouts are more complex than these simple expository
examples.

12.2.10.  Layout Iomode

The layout iomode (data type layoutiomode4, see Section 3.2.23)
indicates to the metadata server the client's intent to perform
either just READ operations (Section 17.22) or a mixture of I/O
possibly containing WRITE (Section 17.32) and READ operations.  For
certain layout types, it is useful for a client to specify this
intent at LAYOUTGET (Section 17.43) time.  For example, for block/
volume-based protocols, block allocation could occur when a READ/
WRITE iomode is specified.  A special LAYOUTIOMODE4_ANY iomode is
defined; it can only be used for LAYOUTRETURN and LAYOUTRECALL, not
for LAYOUTGET.  It specifies that layouts pertaining to both READ and
READ/WRITE iomodes are being returned or recalled, respectively.

A storage device may validate I/O with regard to the iomode; this is
dependent upon storage device implementation and layout type.  Thus,
if the client's layout iomode differs from the I/O being performed,
the storage device may reject the client's I/O with an error
indicating that a new layout with the correct iomode should be
fetched.  For example, if a client gets a layout with a READ iomode
and performs a WRITE to a storage device, the storage device is
allowed to reject that WRITE.

The iomode does not conflict with OPEN share modes or lock requests;
open mode checks and lock enforcement are always enforced and are
logically separate from the pNFS layout level.  As well, open modes
and locks are the preferred method for restricting user access to
data files.  For example, an OPEN of read, deny-write does not
conflict with a LAYOUTGET containing an iomode of READ/WRITE
performed by another client.  Applications that depend on writing
into the same file concurrently may use record locking to serialize
their accesses.

12.2.11.  Layout Segment

Since a layout that describes an entire file may be very large, there
is a desire to manage layouts in smaller chunks that correspond to
octet-ranges of the file.  For example, the entire layout need not be
returned, recalled, or committed.  These chunks are called layout
segments and are further identified by the octet-range and iomode
they represent, yielding a layout segment identifier consisting of
<client ID, filehandle, layout type, iomode, octet range>.  The
concepts of a layout and its layout segments allow clients and
metadata servers to aggregate the results of layout operations into a
singly maintained layout.

It is important to define when layout segments overlap and/or
conflict with each other.  For two layout segments with overlapping
octet ranges to actually overlap each other, both segments must be of
the same layout type, correspond to the same filehandle, and have the
same iomode.  Layout segments conflict when they overlap and differ
in the content of the layout (i.e., the storage device/file mapping
parameters differ).  Note that differing iomodes do not lead to
conflicting layouts.  It is permissible for layout segments with
different iomodes, pertaining to the same octet range, to be held by
the same client.

12.2.12.  Device IDs

The device ID (data type deviceid4, see Section 3.2.16) names a
storage device.  In practice, a significant amount of information may
be required to fully address a storage device.  Instead of embedding
all that information in a layout, layouts embed device IDs.  The
NFSv4.1 operation GETDEVICEINFO (Section 17.40) is used to retrieve
the complete address information about the storage device according
to its layout type.  For example, the address of an NFSv4.1 data
server or of an object storage device could be an IP address and
port.  The address of a block storage device could be a volume label.
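As a minimal illustration of how a client might cache the results of
GETDEVICEINFO, the following hypothetical sketch keys each cached
address by layout type, FSID, and device ID; the class and method
names are assumptions for exposition, not protocol elements:

```python
# Illustrative client-side cache of device-ID-to-address mappings.
# A device ID is meaningful only together with its layout type and
# FSID, so the cache key combines all three.

class DeviceCache:
    def __init__(self):
        self._map = {}

    def store(self, layout_type, fsid, device_id, address):
        # Qualifying the key by layout type and FSID lets independent
        # layout drivers reuse numeric device IDs without coordination.
        self._map[(layout_type, fsid, device_id)] = address

    def lookup(self, layout_type, fsid, device_id):
        # A miss means the client must issue GETDEVICEINFO.
        return self._map.get((layout_type, fsid, device_id))

    def invalidate_all(self):
        # Mappings cannot be assumed to persist across metadata server
        # restart; drop everything and re-fetch as needed.
        self._map.clear()
```

Note that identical device IDs under different layout types or FSIDs
deliberately resolve to different cache entries.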
The device ID is qualified by the layout type and is unique per file
system identifier (FSID, see Section 3.2.5).  This allows different
layout drivers to generate device IDs without the need for
coordination.

Clients cannot expect the mapping between device ID and storage
device address to persist across metadata server restart.  See
Section 12.7.4 for a description of how recovery works in that
situation.

12.3.  pNFS Operations

NFSv4.1 has several operations that are needed for pNFS servers,
regardless of layout type or storage protocol.  These operations are
all issued to a metadata server and are summarized here.

GETDEVICEINFO.  As noted previously (Section 12.2.12), GETDEVICEINFO
   (Section 17.40) returns the mapping of device ID to storage device
   address.

GETDEVICELIST (Section 17.41) allows clients to fetch all of the
   device-ID-to-storage-device-address mappings of a particular file
   system.

LAYOUTGET (Section 17.43) is used by a client to get a layout
   segment for a file.

LAYOUTCOMMIT (Section 17.42) is used to inform the metadata server
   that the client wants to commit data it wrote to the storage
   device (as indicated in the layout segment returned by LAYOUTGET).

LAYOUTRETURN (Section 17.44) is used to return a layout segment, or
   all layouts belonging to a file system, to a metadata server.

The following pNFS-related operations are callback operations a
metadata server might issue to a pNFS client.

CB_LAYOUTRECALL (Section 19.3) recalls a layout segment, all layouts
   belonging to a file system, or all layouts belonging to a client
   ID.

CB_RECALL_ANY (Section 19.6) tells a client that it needs to return
   some number of recallable objects, including layouts, to the
   metadata server.
CB_RECALLABLE_OBJ_AVAIL (Section 19.7) tells a client that a
   recallable object that it was denied (in the case of pNFS, a
   layout denied by LAYOUTGET) due to resource exhaustion is now
   available.

12.4.  pNFS Attributes

A number of attributes specific to pNFS are listed and described in
Section 5.13.

12.5.  Layout Semantics

12.5.1.  Guarantees Provided by Layouts

Layouts delegate to the client the ability to access data out of
band.  The layout guarantees the holder that the layout will be
recalled when the state encapsulated by the layout becomes invalid
(e.g., through some operation that directly or indirectly modifies
the layout) or, possibly, when a conflicting layout is requested, as
determined by the layout's iomode.  When a layout is recalled and
then returned by the client, the client retains the ability to access
file data with normal NFSv4.1 I/O operations through the metadata
server.  Only the right to do I/O out-of-band is affected.

Holding a layout does not guarantee that a user of the layout has the
rights to access the data represented by the layout.  All user access
rights MUST be obtained through the appropriate open, lock, and
access operations (i.e., those that would be used in the absence of
pNFS).  However, if a valid layout for a file is not held by the
client, the storage device should reject all I/Os to that file's
octet range that originate from that client.  In summary, layouts and
ordinary file access controls are independent.  The act of modifying
a file for which a layout is held does not necessarily conflict with
the holding of the layout that describes the file being modified.
However, with certain layout types (e.g., block/volume layouts), the
layout's iomode must agree with the type of I/O being performed.
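The storage-device-side check implied above can be sketched as
follows.  This is a hypothetical illustration (the function name and
segment-tuple shape are assumptions, and real enforcement is layout-
type dependent), not a definitive implementation: I/O is honored only
if the client holds a valid layout segment covering the range, and a
WRITE additionally requires a READ/WRITE iomode.

```python
# Iomode values as defined for layoutiomode4.
LAYOUTIOMODE4_READ = 1
LAYOUTIOMODE4_RW = 2

def io_permitted(held_segments, offset, length, is_write):
    """Decide whether a client's I/O should be honored.

    held_segments: list of (seg_offset, seg_length, iomode) tuples
    describing the layout segments this client holds for the file.
    """
    end = offset + length
    for seg_off, seg_len, iomode in held_segments:
        if seg_off <= offset and end <= seg_off + seg_len:
            if is_write and iomode != LAYOUTIOMODE4_RW:
                # A READ-iomode segment cannot authorize a WRITE; the
                # storage device is allowed to reject it.
                continue
            return True
    # No valid covering layout: reject, forcing the client back to
    # normal I/O through the metadata server (or a fresh LAYOUTGET).
    return False
```

The check is deliberately separate from open-mode and lock
enforcement, mirroring the text's point that layouts and ordinary
file access controls are independent.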
Depending upon the layout type and storage protocol in use, storage
device access permissions may be granted by LAYOUTGET and may be
encoded within the type-specific layout.  If access permissions are
encoded within the layout, the metadata server must recall the layout
when those permissions become invalid for any reason; for example,
when a file becomes unwritable or inaccessible to a client.  Note
that clients are still required to perform the appropriate access
operations as described above (e.g., open and lock operations).  The
degree to which it is possible for the client to circumvent these
access operations must be clearly addressed by the individual layout
type documents, as well as the consequences of doing so.  In
addition, these documents must be clear about the requirements and
non-requirements for the checking performed by the server.

If the pNFS metadata server supports mandatory record locks, then
record locks must behave as specified by the NFSv4.1 protocol, as
observed by users of files.  If a storage device is unable to
restrict access by a pNFS client which does not hold a required
mandatory record lock, then the metadata server must not grant
layouts, for that storage device, that permit any access that
conflicts with a mandatory record lock held by another client.  In
this scenario, it is also necessary for the metadata server to ensure
that record locks are not granted to a client if any other client
holds a conflicting layout (a layout that overlaps the range and has
an iomode that conflicts with the lock type); in this case, all
conflicting layouts must be recalled and returned before the lock
request can be granted.  This requires the metadata server to
understand the capabilities of its storage devices.

12.5.2.  Getting a Layout

A client obtains a layout through a new operation, LAYOUTGET.  The
metadata server will give out layouts of a particular type (e.g.,
block/volume, object, or file) and aggregation as requested by the
client.  The client selects an appropriate layout type which the
server supports and the client is prepared to use.  The layout
returned to the client may not line up exactly with the requested
octet range.  A field within the LAYOUTGET request, loga_minlength,
specifies the minimum overlap that MUST exist between the requested
layout and the layout returned by the metadata server.  The
loga_minlength field should be at least one.  A metadata server may
give out multiple overlapping, non-conflicting layout segments to the
same client in response to a LAYOUTGET.

There is no implied ordering between getting a layout and performing
a file OPEN.  For example, a layout may first be retrieved by placing
a LAYOUTGET operation in the same COMPOUND as the initial file OPEN.
Once the layout has been retrieved, it can be held across multiple
OPEN and CLOSE sequences.

The storage protocol used by the client to access the data on the
storage device is determined by the layout's type.  The client needs
to select a layout driver that understands how to interpret and use
that layout.  The method for layout driver selection used by the
client is outside the scope of the pNFS extension.

Although the metadata server is in control of the layout for a file,
the pNFS client can provide hints to the server when a file is opened
or created about the preferred layout type and aggregation schemes.
pNFS introduces a layout_hint (Section 5.13.4) attribute that the
client can set at file creation time to provide a hint to the server
for new files.
Setting this attribute separately, after the file has been created, could make it difficult or impossible for the server implementation to comply. This in turn further complicates exclusive file creation via OPEN, which, when done via the EXCLUSIVE4 createmode, does not allow the setting of attributes at file creation time. However, as noted in Section 17.16.4, if the server supports a persistent reply cache, the EXCLUSIVE4 createmode is not needed. Therefore, a metadata server that supports the layout_hint attribute MUST support a persistent session reply cache, and a pNFS client that wants to set layout_hint at file creation (OPEN) time MUST NOT use the EXCLUSIVE4 createmode, and instead MUST use GUARDED for an exclusive regular file creation.

12.5.3. Committing a Layout

Due to the nature of the protocol, the file attributes and data location mapping (e.g., which offsets store data versus store holes; see Section 13.5) information that exists on the metadata server may become inconsistent in relation to the data stored on the storage devices; e.g., when WRITEs occur before a layout has been committed (e.g., between a LAYOUTGET and a LAYOUTCOMMIT). Thus, it is necessary to occasionally re-synchronize this state and make it visible to other clients through the metadata server.

The LAYOUTCOMMIT operation is responsible for committing a modified layout segment to the metadata server. Note: the data should be written and committed to the appropriate storage devices before the LAYOUTCOMMIT occurs. Note, if the data is being written asynchronously (i.e., if using NFSv4.1 as the storage protocol, the field committed in WRITE4resok is UNSTABLE4) through the metadata server, a COMMIT to the metadata server is required to synchronize the data and make it visible on the storage devices (see Section 12.5.5 for more details). The scope of this operation depends on the storage protocol in use. For block/volume-based layouts, it may require updating the block list that comprises the file and committing this layout to stable storage. For file-based layouts, it requires some synchronization of attributes between the metadata and storage devices (i.e., mainly the size attribute: EOF). It is important to note that the level of synchronization is from the point of view of the client that issued the LAYOUTCOMMIT. The updated state on the metadata server need only reflect the state as of the client's last operation previous to the LAYOUTCOMMIT; it need not reflect a globally synchronized state (e.g., other clients may be performing, or may have performed, I/O since the client's last operation and the LAYOUTCOMMIT).

The control protocol is free to synchronize the attributes before it receives a LAYOUTCOMMIT; however, upon successful completion of a LAYOUTCOMMIT, state that exists on the metadata server that describes the file MUST be in sync with the state existing on the storage devices that comprise that file as of the issuing client's last operation. Thus, a client that queries the size of a file between a WRITE to a storage device and the LAYOUTCOMMIT may observe a size that does not reflect the actual data written.

12.5.3.1.
LAYOUTCOMMIT and mtime/atime/change

The change attribute and the modify/access times may be updated, by the server, at LAYOUTCOMMIT time, since for some layout types the change attribute and atime/mtime cannot be updated by the appropriate I/O operation performed at a storage device. The arguments to LAYOUTCOMMIT allow the client to provide suggested access and modify time values to the server. Again, depending upon the layout type, these client-provided values may or may not be used. The server should sanity-check the client-provided values before they are used. For example, the server should ensure that time does not flow backwards. According to the NFSv4 specification, the client always has the option to set these attributes through an explicit SETATTR operation.

As mentioned, for some layout protocols the change attribute and mtime/atime may be updated at or after the time the I/O occurred (e.g., if the storage device is able to communicate these attributes to the metadata server). If, upon receiving a LAYOUTCOMMIT, the server implementation is able to determine that the file did not change since the last time the change attribute was updated (e.g., no WRITEs or over-writes occurred), the implementation need not update the change attribute; file-based protocols may have enough state to make this determination or may update the change attribute upon each file modification. This also applies to mtime and atime; if the server implementation is able to determine that the file has not been modified since the last mtime update, the server need not update mtime at LAYOUTCOMMIT time. Once LAYOUTCOMMIT completes, the new change attribute and mtime/atime should be visible if that file was modified since the latest previous LAYOUTCOMMIT or LAYOUTGET.

12.5.3.2.
LAYOUTCOMMIT and size

The file's size may be updated at LAYOUTCOMMIT time as well. The LAYOUTCOMMIT argument contains a field, loca_last_write_offset, that indicates the highest octet offset written but not yet committed via LAYOUTCOMMIT. Note: this argument is switched on a boolean value (field no_newoffset) indicating whether or not a previous write occurred. If no_newoffset is FALSE, no loca_last_write_offset is given. A loca_last_write_offset specifying an offset of 0 means that octet 0 was the highest octet written.

The metadata server may do one of the following:

1. It may update the file's size based on the last write offset. However, to the extent possible, the metadata server should sanity-check any value to which the file's size is going to be set. E.g., it must not truncate the file based on the client presenting a smaller last write offset than the file's current size.

2. If it has sufficient other knowledge of file size (e.g., by querying the storage devices through the control protocol), it may ignore the client-provided argument and use the query-derived value.

3. It may use the last write offset as a hint, subject to correction when other information is available as above.

The method chosen to update the file's size will depend on the storage device's and/or the control protocol's implementation. For example, if the storage devices are block devices with no knowledge of file size, the metadata server must rely on the client to set the size appropriately. A new size flag and length are also returned in the results of a LAYOUTCOMMIT. This union indicates whether a new size was set, and to what length it was set. If a new size is set as a result of LAYOUTCOMMIT, then the metadata server must reply with the new size.
As well, if the size is updated, the metadata server in conjunction with the control protocol SHOULD ensure that the new size is reflected by the storage devices immediately upon return of the LAYOUTCOMMIT operation; e.g., a READ up to the new file size should succeed on the storage devices (assuming no intervening truncations). Again, if the client wants to explicitly zero-extend or truncate a file, SETATTR must be used; it need not be used when simply writing past EOF via WRITE.

12.5.3.3. LAYOUTCOMMIT and layoutupdate

The LAYOUTCOMMIT argument contains a loca_layoutupdate field (Section 17.42.2) of data type layoutupdate4 (Section 3.2.21). This argument is a layout type-specific structure. The structure can be used to pass arbitrary layout type-specific information from the client to the metadata server at LAYOUTCOMMIT time. For example, if using a block/volume layout, the client can indicate to the metadata server which reserved or allocated blocks the client used and did not use. The content of loca_layoutupdate (field lou_body) need not be the same as the layout type-specific content returned by LAYOUTGET (Section 17.43.3) in the loc_body field of the lo_content field of the logr_layout field. The content of loca_layoutupdate is defined by the layout type specification and is opaque to LAYOUTCOMMIT.

12.5.4. Recalling a Layout

Since a layout protects a client's access to a file via a direct client-storage-device path, a layout need only be recalled when it is semantically unable to serve this function. Typically, this occurs when the layout no longer encapsulates the true location of the file over the octet range it represents. Any operation or action (e.g., server-driven restriping or load balancing) that changes the layout will result in a recall of the layout.
A layout is recalled by the CB_LAYOUTRECALL callback operation (see Section 19.3). This callback can recall a layout segment identified by an octet range, all the layouts associated with a file system (FSID), or all layouts. Recalling all layouts or all the layouts associated with a file system also invalidates the client's device cache for the affected file systems. Multiple layout segments may be returned in a single compound operation. Section 12.5.4.2 discusses sequencing issues surrounding the getting, returning, and recalling of layouts.

The iomode is also specified when recalling a layout or layout segment. Generally, the iomode in the recall request must match the layout, or segment, being returned; e.g., a recall with an iomode of LAYOUTIOMODE4_RW should cause the client to only return LAYOUTIOMODE4_RW layout segments (not LAYOUTIOMODE4_READ segments). However, a special LAYOUTIOMODE4_ANY enumeration is defined to enable recalling a layout of any type (i.e., the client must return both read-only and read/write layouts).

A REMOVE operation may cause the metadata server to recall the layout to prevent the client from accessing a non-existent file and to reclaim state stored on the client. Since a REMOVE may be delayed until the last close of the file has occurred, the recall may also be delayed until this time. As well, once the file has been removed, after the last reference, the client SHOULD no longer be able to perform I/O using the layout (e.g., with file-based layouts, an error such as ESTALE could be returned).

Although pNFS does not alter the caching capabilities of clients, or their semantics, it recognizes that some clients may perform more aggressive write-behind caching to optimize the benefits provided by pNFS. However, write-behind caching may impact the latency in returning a layout in response to a CB_LAYOUTRECALL, just as caching impacts DELEGRETURN with regard to data delegations. Client implementations should limit the amount of unwritten data they have outstanding at any one time. Server implementations may fence clients from performing direct I/O to the storage devices if they perceive that the client is taking too long to return a layout once recalled. A server may be able to monitor client progress by watching client I/Os or by observing LAYOUTRETURNs of sub-portions of the recalled layout. The server can also limit the amount of dirty data to be flushed to storage devices by limiting the octet ranges covered in the layouts it gives out.

Once a layout has been returned, the client MUST NOT issue I/Os to the storage devices for the file, octet range, and iomode represented by the returned layout. If a client does issue an I/O to a storage device for which it does not hold a layout, the storage device SHOULD reject the I/O.

12.5.4.1. Recall Callback Robustness

It has been assumed thus far that pNFS client state for a file exactly matches the pNFS server state for that file and client regarding layout ranges and permissions. This assumption leads to the implication that any callback results in a LAYOUTRETURN or set of LAYOUTRETURNs that exactly match the range in the callback, since both client and server agree about the state being maintained. However, it can be useful if this assumption does not always hold. For example:

o It may be useful for clients to be able to discard layout information without calling LAYOUTRETURN.
If conflicts that require callbacks are very rare, and a server can use a multi-file callback to recover per-client resources (e.g., via an FSID recall, or a multi-file recall within a single compound), the result may be significantly less client-server pNFS traffic.

o It may be similarly useful for servers to maintain information about what ranges are held by a client on a coarse-grained basis, leading to the server's layout ranges being beyond those actually held by the client. In the extreme, a server could manage conflicts on a per-file basis, only issuing whole-file callbacks even though clients may request and be granted sub-file ranges.

o In order to avoid errors, it is vital that a client not assign itself layout permissions beyond what the server has granted and that the server not forget layout permissions that have been granted. On the other hand, if a server believes that a client holds a layout segment that the client does not know about, it is useful for the client to cleanly indicate completion of the requested recall either by issuing a LAYOUTRETURN for the entire requested range or by returning an NFS4ERR_NOMATCHING_LAYOUT error to the layout recall callback.

Thus, in light of the above, it is useful for a server to be able to issue callbacks for layout ranges it has not granted to a client, and for a client to return ranges it does not hold. A pNFS client must always return layout segments that comprise the full range specified by the recall. Note, the full recalled layout range need not be returned as part of a single operation, but may be returned in segments. This allows the client to stage the flushing of dirty data, layout commits, and returns. Also, it indicates to the metadata server that the client is making progress.
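A client's staged response to a recall can be sketched as follows. This is an illustrative model, not a protocol definition; the function name and segment representation are invented, and the staging policy (one partial return per held piece) is just one possible choice.

```python
# Hypothetical sketch of a client planning its response to a
# CB_LAYOUTRECALL over [r_off, r_off + r_len).  Partial ranges may be
# returned in stages, but the final LAYOUTRETURN echoes the entire
# recalled range; if nothing held overlaps the range, the client
# answers NFS4ERR_NOMATCHING_LAYOUT.

def plan_recall_response(held, r_off, r_len):
    """held: list of (offset, length) layout segments the client holds.
    Returns (status, staged_returns)."""
    r_end = r_off + r_len
    pieces = sorted((max(o, r_off), min(o + l, r_end))
                    for (o, l) in held if o < r_end and o + l > r_off)
    if not pieces:
        return 'NFS4ERR_NOMATCHING_LAYOUT', []
    # Stage a partial return per held piece (letting dirty data be
    # flushed and committed incrementally), then echo the full range.
    staged = [('LAYOUTRETURN', start, end) for (start, end) in pieces[:-1]]
    staged.append(('LAYOUTRETURN', r_off, r_end))
    return 'NFS4_OK', staged
```

The key invariant, per the text above, is that the last staged return always covers the whole recalled range, even when earlier returns covered only sub-ranges.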
It is possible that write requests may be presented to a storage device that is no longer allowed to perform them. This behavior is limited by requiring that a client MUST wait for completion of all writes covered by a layout range before returning a layout that covers that range. Since the server has no control as to when the client will return the layout, the server may later decide to unilaterally revoke the client's access provided by the layout in question. Upon doing so, the server must deal with the possibility of lingering writes: outstanding writes still in flight to data servers identified by the revoked layout. Each layout specification MUST define whether unilateral layout revocation by the metadata server is supported, and if so, the specification must also outline how lingering writes are to be dealt with; e.g., storage devices identified by the revoked layout in question could be fenced off from the appropriate client. If unilateral revocation is not supported, there MUST be no possibility that the client has outstanding write requests when a layout is returned.

In order to ensure client/server convergence on the layout state, the final LAYOUTRETURN operation in a sequence of LAYOUTRETURN operations for a particular recall MUST specify the entire range being recalled, echoing the recalled layout type, iomode, recall/return type (FILE, FSID, or ALL), and octet range, even if layout segments pertaining to partial ranges were previously returned. In addition, if the client holds no layout segment that overlaps the range being recalled, the client should return the NFS4ERR_NOMATCHING_LAYOUT error code. This allows the server to update its view of the client's layout state.

12.5.4.2.
Serialization of Layout Operations

As with other stateful operations, pNFS requires the correct sequencing of layout operations. pNFS uses the sessions feature of NFSv4.1 to provide the correct sequencing between regular operations and callbacks. It is the server's responsibility to avoid inconsistencies regarding the layouts it hands out, and the client's responsibility to properly serialize its layout requests and layout returns.

12.5.4.2.1. Get/Return Serialization

The protocol allows the client to send concurrent LAYOUTGET and LAYOUTRETURN operations to the server. However, the protocol does not provide any means for the server to process the requests in the same order in which they were created, nor does it provide a way for the client to determine the order in which parallel outstanding operations were processed by the server. Thus, when a layout segment retrieved by an outstanding LAYOUTGET operation intersects with a layout segment returned by an outstanding LAYOUTRETURN, the order in which the two conflicting operations are processed determines the final state of the overlapping segment. To disambiguate between the two cases, the client MUST serialize LAYOUTGET operations and voluntary LAYOUTRETURN operations for the same file.

It is permissible for the client to send in parallel multiple LAYOUTGET operations for the same file or multiple LAYOUTRETURN operations for the same file, but never a mix of both. It is also permissible for the client to combine LAYOUTRETURN and LAYOUTGET operations for the same file in the same COMPOUND request, as the server MUST process these in order.
If a client does issue such requests, it MUST NOT have more than one outstanding for the same file at the same time, and MUST NOT have other LAYOUTGET or LAYOUTRETURN operations outstanding at the same time for that same file.

12.5.4.2.2. Recall/Return Sequencing

One critical issue with operation sequencing concerns callbacks. The protocol must defend against races between the reply to a LAYOUTGET operation and a subsequent CB_LAYOUTRECALL. A client MUST NOT process a CB_LAYOUTRECALL that identifies an outstanding LAYOUTGET operation to which the client has not yet received a reply. Conflicting LAYOUTGET operations are identified in the CB_SEQUENCE preceding the CB_LAYOUTRECALL.

The callback races section (Section 2.10.4.3) describes the sessions mechanism for allowing the client to detect such situations in order to not process such a CB_LAYOUTRECALL. The server MUST reference all conflicting LAYOUTGET operations in the CB_SEQUENCE that precedes the CB_LAYOUTRECALL. A zero-length array of referenced operations is used by the server to tell the client that the server does not know of any LAYOUTGET operations that conflict with the recall.

12.5.4.2.2.1. Client Side Considerations

Consider a pNFS client that has issued a LAYOUTGET and then receives an overlapping recall callback for the same file. There are two possibilities, which the client would be unable to distinguish without additional information provided by the sessions implementation.

1. The server processed the LAYOUTGET before issuing the recall, so the LAYOUTGET response is in flight, and must be waited for because it may be carrying layout info that will need to be returned to deal with the recall callback.

2. The server issued the callback before receiving the LAYOUTGET.
The server will not respond to the LAYOUTGET until the recall callback is processed.

These possibilities could cause deadlock, as the client must wait for the LAYOUTGET response before processing the recall in the first case, but that response will not arrive until after the recall is processed in the second case. Via the CB_SEQUENCE operation, the server provides the client with the { slotid, sequenceid } of any earlier LAYOUTGET operations that remain unconfirmed at the server by the session slot usage rules. This allows the client to disambiguate between the two cases: in case 1, the server will provide the operation reference(s), whereas in case 2 it will not (because there are no dependent client operations). Therefore, the action at the client will only require waiting in the case that the client has not yet seen the server's earlier responses to the LAYOUTGET operation(s).

This deadlock is avoided by adhering to the following requirements:

o A LAYOUTGET MUST be rejected with the error NFS4ERR_RECALLCONFLICT if there is an overlapping outstanding recall callback to the same client.

o When processing a recall, the client MUST wait for a response to all conflicting outstanding LAYOUTGETs that are referenced in the CB_SEQUENCE for the recall before performing any RETURN that could be affected by any such response.

o The client SHOULD wait for responses to all operations required to complete a recall before sending any LAYOUTGETs that would conflict with the recall, because the server is likely to return errors for them.

o Before sending a new LAYOUTGET for a range covered by a layout recall, the client SHOULD wait for responses to any outstanding LAYOUTGET that overlaps any portion of the new LAYOUTGET's range.
This is because it is possible (although unlikely) that the prior operation may have arrived at the server after the recall completed and hence will succeed.

o The recall process can be considered done by the client when the final LAYOUTRETURN operation for the recalled range is issued.

12.5.4.2.2.2. Server Side Considerations

Consider a related situation from the metadata server's point of view. The metadata server has issued a recall layout callback and receives an overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s) that respond to the recall callback. Again, there are two cases:

1. The client issued the LAYOUTGET before processing the recall callback.

2. The client issued the LAYOUTGET after processing the recall callback, but it arrived before the LAYOUTRETURN that completed that processing.

The metadata server MUST reject the overlapping LAYOUTGET. The client has two ways to avoid this result: it can issue the LAYOUTGET as a subsequent element of a COMPOUND containing the LAYOUTRETURN that completes the recall callback, or it can wait for the response to that LAYOUTRETURN.

There is little the session sequence logic can do to disambiguate between these two cases, because both operations are independent of one another. They are simply asynchronous events that crossed. The situation can even occur if the session is configured to use a single connection for both operations and callbacks.

12.5.5. Metadata Server Write Propagation

Asynchronous writes written through the metadata server may be propagated lazily to the storage devices. For data written asynchronously through the metadata server, a client performing a read at the appropriate storage device is not guaranteed to see the newly written data until a COMMIT occurs at the metadata server.
While the write is pending, reads to the storage device can give out either the old data, the new data, or a mixture thereof. After either a synchronous write completes, or a COMMIT is received (for asynchronously written data), the metadata server must ensure that storage devices give out the new data and that the data has been written to stable storage. If the server implements its storage in any way such that it cannot obey these constraints, then it must recall the layouts to prevent reads being done that cannot be handled correctly.

12.6. PNFS Mechanics

This section describes the operations flow taken by a pNFS client to a metadata server and storage device.

When a pNFS client encounters a new FSID, it issues a GETATTR to the NFSv4.1 server for the fs_layout_type (Section 5.13.1) attribute. If the attribute returns at least one layout type, and the layout type(s) returned is(are) among the set supported by the client, the client knows that pNFS is a possibility for the filesystem. If, from the server that returned the new FSID, the client does not have a client ID that came from an EXCHANGE_ID result that returned EXCHGID4_FLAG_USE_PNFS_MDS, it must send an EXCHANGE_ID to the server with the EXCHGID4_FLAG_USE_PNFS_MDS bit set. If the server's response does not have EXCHGID4_FLAG_USE_PNFS_MDS, then contrary to what the fs_layout_type attribute said, the server does not support pNFS, and the client will not be able to use pNFS to that server.

Once the client has a client ID that supports pNFS, it creates a session over the client ID, requesting a persistent session reply cache.
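The discovery decision above can be sketched as a small predicate. This is an illustrative sketch; the function name is invented, and the flag value shown should be checked against the draft's XDR definitions rather than taken as authoritative.

```python
# Sketch of the client-side decision in the flow above: pNFS is usable
# for a new FSID only if (a) the fs_layout_type attribute lists a layout
# type the client supports and (b) the EXCHANGE_ID result carries
# EXCHGID4_FLAG_USE_PNFS_MDS.  The flag value here is illustrative.

EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000

def pnfs_usable(fs_layout_types, client_layout_types, eir_flags):
    """fs_layout_types: from GETATTR of fs_layout_type on the new FSID.
    eir_flags: eir_flags from the EXCHANGE_ID result."""
    if not set(fs_layout_types) & set(client_layout_types):
        return False                 # no mutually supported layout type
    return bool(eir_flags & EXCHGID4_FLAG_USE_PNFS_MDS)
```

If the predicate is false because the server's EXCHANGE_ID reply lacks the flag, the client falls back to ordinary NFSv4.1 I/O through the metadata server, regardless of what fs_layout_type advertised.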
If the client wants to create a file on the file system identified by the FSID that supports pNFS, it issues an OPEN with a create type of GUARDED4 (if it wants an exclusive create) or UNCHECKED4 (if it does not want an exclusive create). Among the various attributes it sets in createattrs, it includes layout_hint and fills it with information pertinent to the layout type it wants to use. The COMPOUND procedure that the OPEN is sent with should include a GETATTR operation (on the filehandle OPEN sets) that retrieves the layout_type attribute. This is so the client can determine what layout type the server will in fact support, and thus what storage protocol the client must use.

If the client wants to open an existing file, then it also includes a GETATTR to determine what layout type the file supports.

The GETATTR in either the file creation or plain file open case can also include the layout_blksize and layout_alignment attributes so that the client can determine optimal offsets and lengths for I/O on the file.

Assuming the client supports the layout type returned by GETATTR, it then issues LAYOUTGET using the filehandle returned by OPEN, specifying the range it wants to do I/O on. The response is a layout segment, which may be a subset of the range the client asked for. It also includes device IDs and a description of how data is organized (or, in the case of writing, how data is to be organized) across the devices. The device IDs and data description are encoded in a format that is specific to the layout type, but that the client is expected to understand.

When the client wants to issue an I/O, it determines which device ID it needs to send the I/O command to by examining the data description in the layout. It then issues a GETDEVICEINFO to find the device address of the device ID.
The client then sends the I/O command to the device address, using the storage protocol defined for the layout type.

If the I/O was an input request, then at some point the client may want to commit the access time to the metadata server. It uses the LAYOUTCOMMIT operation. If the I/O was an output request, then at some point the client may want to commit the modification time and the new size of the file (if it believes it lengthened the file) to the metadata server, and the modified data to the filesystem. Again, it uses LAYOUTCOMMIT.

12.7. Recovery

Recovery is complicated due to the distributed nature of the pNFS protocol. In general, crash recovery for layouts is similar to crash recovery for delegations in the base NFSv4 protocol. However, the client's ability to perform I/O without contacting the metadata server, and the fact that, unlike delegations, layouts are not bound to stateids, introduce subtleties that must be handled correctly if file system corruption is to be avoided.

12.7.1. Client Recovery

Client recovery for layouts is similar to client recovery for other lock/delegation state. When a pNFS client reboots, it will lose all information about the layouts that it previously owned. There are two methods by which the server can reclaim these resources and allow otherwise conflicting layouts to be provided to other clients.

The first is through the expiry of the client's lease. If the client recovery time is longer than the lease period, the client's lease will expire and the server will know that state may be released. For layouts, the server may release the state immediately upon lease expiry, or it may allow the layout to persist, awaiting possible lease revival, as long as there are no conflicting requests.
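The second method, described in the next paragraph, relies on the server noticing a reused co_ownerid with a fresh verifier. A hedged server-side sketch, with invented data structures and names, might look like this:

```python
# Hypothetical server-side sketch: on EXCHANGE_ID, a co_ownerid that
# matches a previous client invocation but carries a different verifier
# signals a client restart, so layout state from the previous
# invocation is released.  Data written but not committed via
# LAYOUTCOMMIT by the old invocation may be lost at this point.

def handle_exchange_id(clients, co_ownerid, co_verifier):
    """clients: dict mapping co_ownerid -> {'verifier': ..., 'layouts': set()}.
    Returns True if prior layout state was released (client restarted)."""
    prev = clients.get(co_ownerid)
    restarted = prev is not None and prev['verifier'] != co_verifier
    if restarted:
        prev['layouts'].clear()      # release the old invocation's layouts
    clients[co_ownerid] = {'verifier': co_verifier, 'layouts': set()}
    return restarted
```

A real server would also release opens, locks, and delegations tied to the old invocation; only the layout aspect is modeled here.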
On the other hand, the client may restart in less time than it takes for the lease period to expire.  In such a case, the client will contact the server through the standard EXCHANGE_ID protocol.  The server will find that the client's co_ownerid matches the co_ownerid of the previous client invocation, but that the verifier is different.  The server uses this as a signal to release all layout state associated with the client's previous invocation.  It is possible that all data written by the client to storage devices but not committed via LAYOUTCOMMIT is lost.

12.7.2.  Dealing with Lease Expiration on the Client

The mappings between device IDs and device addresses are what allow a pNFS client to safely write data to and read data from a storage device.  These mappings are leased (just like locking state) from the metadata server, and as long as the lease is valid, the client has a right to issue I/O to the storage devices.  The lease on device ID to device address mappings is renewed when the metadata server receives a SEQUENCE operation from the pNFS client.  The same is not specified to be true for the data server receiving a SEQUENCE operation, and the client MUST NOT assume that a SEQUENCE sent to a data server will renew its lease.

The loss of the lease leads to the loss of the device ID to device address mappings.  If a mapping is used for I/O after lease expiration, the consequences could be data corruption.  To avoid losing its lease, the client should start its lease timer based on the time that it issued the operation to the metadata server rather than based on the time the response was received.  It is also necessary to take propagation delay into account as described in Section 8.12.
Thus, the client must be aware of the one-way propagation delay and should issue renewals well in advance of lease expiration.

If a client believes its lease has expired, it MUST NOT issue I/O to the storage device until it has validated its lease.  The client can issue a SEQUENCE operation to the metadata server.  If the SEQUENCE operation is successful, but sr_status_flags has SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client must recover by deleting all its records of layouts and device ID to device address mappings, then writing any modified but uncommitted data in its memory directly to the metadata server with the stable argument to WRITE set to FILE_SYNC4, and finally reacquiring any layouts it needs via LAYOUTGET.

If sr_status_flags from the metadata server has SEQ4_STATUS_RESTART_RECLAIM_NEEDED set (or SEQUENCE returns NFS4ERR_STALE_CLIENTID, or SEQUENCE returns NFS4ERR_BAD_SESSION and CREATE_SESSION returns NFS4ERR_STALE_CLIENTID), then the metadata server has restarted, and the client must recover using the methods described in Section 12.7.4.

If sr_status_flags from the metadata server has SEQ4_STATUS_LEASE_MOVED set, then the client recovers by following the procedure described in Section 10.6.7.1.  After that, the client may get an indication that the layout state was not moved with the file system.  The client is then required to recover per the other applicable situations discussed in Paragraph 3 or Paragraph 4 of this section.

If sr_status_flags reports no loss of state, then the lease the client has with the metadata server is valid and renewed, and the client can re-commence I/O to the storage devices.
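The decision procedure above can be sketched as follows.  This is an illustrative sketch only, not part of the protocol: the flag constants below are placeholder values (the assigned on-the-wire values are defined with the SEQUENCE operation), and the returned strings merely name the recovery paths described in the preceding paragraphs.

```python
# Placeholder bit values for illustration; not the assigned constants.
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED  = 0x01
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x02
SEQ4_STATUS_ADMIN_STATE_REVOKED        = 0x04
SEQ4_STATUS_RESTART_RECLAIM_NEEDED     = 0x08
SEQ4_STATUS_LEASE_MOVED                = 0x10

def layout_recovery_action(sr_status_flags):
    """Map SEQUENCE result flags to the pNFS client's recovery step."""
    if sr_status_flags & (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED |
                          SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED |
                          SEQ4_STATUS_ADMIN_STATE_REVOKED):
        # Delete layouts and device mappings, write dirty data through
        # the metadata server with FILE_SYNC4, then reacquire layouts.
        return "flush-through-mds-and-reacquire"
    if sr_status_flags & SEQ4_STATUS_RESTART_RECLAIM_NEEDED:
        # Metadata server restarted; recover per Section 12.7.4.
        return "mds-restart-recovery"
    if sr_status_flags & SEQ4_STATUS_LEASE_MOVED:
        # Follow Section 10.6.7.1, then re-check for layout state loss.
        return "lease-moved-recovery"
    # No loss of state: the lease is valid and renewed; resume I/O.
    return "resume-io"
```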
While clients should not issue I/Os to storage devices that may extend past the lease expiration time period, this is not always possible (e.g. an extended network partition that starts after the I/O is sent and does not heal until the I/O request is received by the data server).  Thus the metadata server and/or storage device are responsible for protecting the pNFS server from I/Os that are sent before the lease expires, but arrive after the lease expires.  See Section 12.7.3.

12.7.3.  Dealing with Loss of Layout State on the Metadata Server

This section describes recovery from the situation where all of the following are true: the metadata server has not restarted; a pNFS client's device ID to device address mappings and/or layouts have been discarded (usually because the client's lease expired) and are invalid; and an I/O from the pNFS client arrives at the storage device.  The metadata server and its storage devices may solve this by fencing the client, i.e. preventing the execution of I/O operations from the client to the storage devices after layout state loss.  The details of how fencing is done are specific to the layout type.  The solution for NFSv4.1 file-based layouts is described in this document (Section 13.13), and for other layout types in their respective external specification documents.

12.7.4.  Recovery from Metadata Server Restart

The pNFS client will discover that the metadata server has restarted (e.g. rebooted) via the methods described in Section 8.6.2, and discussed in a pNFS-specific context in Paragraph 4 of Section 12.7.2.  The client MUST stop using, and delete, the device ID to device address mappings it previously received from the metadata server.
Having done that, if the client wrote data to the storage device without committing the layout segment(s) via LAYOUTCOMMIT, then the client has additional work to do in order to get the client, metadata server, and storage device(s) all synchronized on the state of the data.

o  If the client has data still modified and unwritten in the client's memory, the client has only two choices.

   1.  The client can obtain a layout segment via LAYOUTGET after the server's grace period and write the data to the storage devices.

   2.  The client can write that data through the metadata server using the WRITE (Section 17.32) operation, and then obtain layout segments as needed.

   As noted in Paragraph 2 of Section 8.6.2.1, and in Section 17.43.4, LAYOUTGET and WRITE may not be allowed until the grace period expires.  Under some conditions, as described in Section 12.7.5, LAYOUTGET and/or WRITE may be permitted during the metadata server's grace period.

o  If the client synchronously wrote data to the storage device, but still has a copy of that data in its memory, then it has available to it the recovery options listed above in the previous bullet point.  If the metadata server is also in its grace period, the client has available to it the options below in the next bullet item.

o  The client does not have a copy of the data in its memory and the metadata server is still in its grace period.  The client cannot use LAYOUTGET (within or outside the grace period) to reclaim a layout segment, because the contents of the response from LAYOUTGET may not match what it had previously.  The range might be different, or it might get the same range but the content of the layout might be different.
Even if the content of the layout appears to be the same, the device IDs may map to different device addresses, and even if the device addresses are the same, the device addresses could have been assigned to a different storage device.  The option of retrieving the data from the storage device and writing it to the metadata server per the recovery scenario described above in the previous two bullets is not available because, again, the mappings of range to device ID, device ID to device address, and device address to physical device are stale, and new mappings via a new LAYOUTGET do not solve the problem.

The only recovery option for this scenario is to issue a LAYOUTCOMMIT in reclaim mode, which the metadata server will accept as long as it is in its grace period.  The use of LAYOUTCOMMIT in reclaim mode informs the metadata server that the layout segment has changed.  It is critical that the metadata server receive this information before its grace period ends, and thus before it starts allowing updates to the file system.

To issue LAYOUTCOMMIT in reclaim mode, the client sets the loca_reclaim field of the operation's arguments (Section 17.42.2) to TRUE.  During the metadata server's recovery grace period (and only during the recovery grace period) the metadata server is prepared to accept LAYOUTCOMMIT requests with the loca_reclaim field set to TRUE.

When loca_reclaim is TRUE, the client is attempting to commit changes to the layout segment that occurred prior to the restart of the metadata server.  The metadata server applies some consistency checks on the loca_layoutupdate field of the arguments to determine whether the client can commit the data written to the data server to the file system.
The loca_layoutupdate field is of data type layoutupdate4, and contains layout type-specific content (in the lou_body field of loca_layoutupdate).  The layout type-specific information that loca_layoutupdate might have is discussed in Section 12.5.3.3.  If the metadata server's consistency checks on loca_layoutupdate succeed, then the metadata server MUST commit the data (as described by the loca_offset, loca_length, and loca_layoutupdate fields of the arguments) that was written to the storage device.  If the metadata server's consistency checks on loca_layoutupdate fail, the metadata server rejects the LAYOUTCOMMIT operation and makes no changes to the file system.  However, any time LAYOUTCOMMIT with loca_reclaim TRUE fails, the pNFS client has lost all the data in the range defined by <loca_offset, loca_length>.  A client can defend against this risk by caching all data, whether written synchronously or asynchronously, in its memory, and by not releasing the cached data until a successful LAYOUTCOMMIT.

o  The client does not have a copy of the data in its memory and the metadata server is no longer in its grace period; i.e. the metadata server returns NFS4ERR_NO_GRACE.  As with the scenario in the above bullet item, the failure of LAYOUTCOMMIT means the data in the range <loca_offset, loca_length> is lost.  The defense against the risk is the same: cache all written data on the client until a successful LAYOUTCOMMIT.

12.7.5.  Operations During Metadata Server Grace Period

Some of the recovery scenarios thus far noted that some operations, namely WRITE and LAYOUTGET, might be permitted during the metadata server's grace period.
The metadata server may allow these operations during its grace period if it can reliably determine that servicing the request will not conflict with an impending LAYOUTCOMMIT reclaim request (or, in the case of WRITE, with an impending reclaim of an OPEN, or of a LOCK on a file with mandatory record locking enabled).  As mentioned previously, WRITE and LAYOUTGET are likely to be rejected during the metadata server's grace period, because the easiest way to provide simple, valid handling during the grace period is to reject all non-reclaim pNFS requests and WRITE operations by returning the NFS4ERR_GRACE error.  However, depending on the storage protocol (which is specific to the layout type) and the metadata server implementation, the metadata server may be able to determine that a particular request is safe.  For example, a metadata server may save provisional allocation mappings for each file to stable storage, as well as information about potentially conflicting OPEN share modes and mandatory record locks that might have been in effect at the time of restart, and use this information during the recovery grace period to determine that a WRITE request is safe.

12.7.6.  Storage Device Recovery

Recovery from storage device restart is mostly dependent upon the layout type in use.  However, there are a few general techniques a client can use if it discovers a storage device has crashed while the client holds modified, uncommitted data that was asynchronously written.  First and foremost, it is important to realize that the client is the only one that has the information necessary to recover non-committed data, since it holds the modified data and most probably nobody else does.
Second, the best solution is for the client to err on the side of caution and attempt to re-write the modified data through another path.

The client should immediately write the data to the metadata server, with the stable field in the WRITE4args set to FILE_SYNC4.  Once it does this, there is no need to wait for the original storage device.

12.8.  Metadata and Storage Device Roles

If the same physical hardware is used to implement both a metadata server and storage device, then the same hardware entity is to be understood to be implementing two distinct roles, and it is important that it be clearly understood on behalf of which role the hardware is executing at any given time.

Various sub-cases can be distinguished.

1.  The storage device uses NFSv4.1 as the storage protocol.  The same physical hardware is used to implement both a metadata and data server.  If an EXCHANGE_ID operation issued to the metadata server has EXCHGID4_FLAG_USE_PNFS_MDS set and EXCHGID4_FLAG_USE_PNFS_DS not set, the role of all sessions derived from the client ID is metadata server-only.  If an EXCHANGE_ID operation issued to the data server has EXCHGID4_FLAG_USE_PNFS_DS set and EXCHGID4_FLAG_USE_PNFS_MDS not set, the role of all sessions derived from the client ID is data server-only.  These assertions are true regardless of whether the network addresses of the metadata server and data server are the same or not.

    The client will use the same client owner for both the metadata server EXCHANGE_ID and the data server EXCHANGE_ID.  Since the client issues one with EXCHGID4_FLAG_USE_PNFS_MDS set, and the other with EXCHGID4_FLAG_USE_PNFS_DS set, the server will need to return unique client IDs, as well as server_owners, which will eliminate ambiguity about the dual roles the same physical entity serves.

2.
The metadata and data server each return EXCHANGE_ID results with EXCHGID4_FLAG_USE_PNFS_DS and EXCHGID4_FLAG_USE_PNFS_MDS both set, the server_owner and server_scope results are the same, the client IDs are the same, and, if RPCSEC_GSS is used, the server principals are the same.  As noted in Section 2.10.3.4.1, the two servers are the same, whether they have the same network address or not.  If the pNFS server is ambiguous in its EXCHANGE_ID results as to what role a client ID may be used for, yet still requires the NFSv4.1 request to be directed in a manner specific to a role (e.g. a READ request for a particular offset directed to the metadata server role might use a different offset if the READ was intended for the data server role, if the file is using STRIPE4_DENSE packing; see Section 13.5), the pNFS server may mark the metadata filehandle differently from the data filehandle so that operations addressed to the metadata server can be distinguished from those directed to the data servers.  Marking the metadata and data server filehandles differently (and this is RECOMMENDED) is possible because the former are derived from OPEN operations, and the latter are derived from LAYOUTGET operations.

    Note that it may be the case that while the metadata server and the storage device are distinct from one client's point of view, the roles may be reversed according to another client's point of view.  For example, in the cluster file system model a metadata server to one client may be a data server to another client.  If NFSv4.1 is being used as the storage protocol, then pNFS servers need to mark filehandles according to their specific roles.

    If a current filehandle is set that is inconsistent with the role to which it is directed, then the error NFS4ERR_BADHANDLE should result.
For example, if a request is directed at the data server, because the first current filehandle is from a layout, any attempt to set the current filehandle to a value not from a layout should be rejected.  Similarly, if the first current filehandle was for a value not from a layout, a subsequent attempt to set the current filehandle to a value obtained from a layout should be rejected.

3.  The storage device does not use NFSv4.1 as the storage protocol, and the same physical hardware is used to implement both a metadata server and storage device.  Whether distinct network addresses are used to access the metadata server and storage device is immaterial, because it is always clear to the pNFS client and server, from the upper-layer protocol being used (NFSv4.1 or non-NFSv4.1), to which role the request to the common server network address is directed.

12.9.  Security Considerations

pNFS has a metadata path and a data path (i.e., storage protocol).  The metadata path includes the pNFS-specific operations (listed in Section 12.3); all existing NFSv4.1 conventional (non-pNFS) security mechanisms and features apply to the metadata path.  The combination of components in a pNFS system (see Figure 62) is required to preserve the security properties of NFSv4.1 with respect to an entity accessing a storage device from a client, including security countermeasures to defend against threats that NFSv4.1 provides defenses for in environments where these threats are considered significant.

In some cases, the security countermeasures for connections to storage devices may take the form of physical isolation or a recommendation not to use pNFS in an environment.
For example, it may be impractical to provide confidentiality protection for some storage protocols to protect against eavesdropping; in environments where eavesdropping on such protocols is of sufficient concern to require countermeasures, physical isolation of the communication channel (e.g., via direct connection from client(s) to storage device(s)) and/or a decision to forego use of pNFS (e.g., and fall back to conventional NFSv4.1) may be appropriate courses of action.

Where communication with storage devices is subject to the same threats as client to metadata server communication, the protocols used for that communication need to provide security mechanisms comparable to those available via RPCSEC_GSS for NFSv4.1.  Many situations in which pNFS is likely to be used will not be subject to the overall threat profile for which NFSv4.1 is required to provide countermeasures.

pNFS implementations MUST NOT remove NFSv4's access controls.  The combination of clients, storage devices, and the metadata server are responsible for ensuring that all client to storage device file data access respects NFSv4.1's ACLs and file open modes.  This entails performing both of these checks on every access in the client, the storage device, or both (as applicable; when the storage device is an NFSv4.1 server, the storage device is ultimately responsible for controlling access).  If a pNFS configuration performs these checks only in the client, the risk of a misbehaving client obtaining unauthorized access is an important consideration in determining when it is appropriate to use such a pNFS configuration.  Such configurations SHOULD NOT be used when client-only access checks do not provide sufficient assurance that NFSv4.1 access control is being applied correctly.

13.
pNFS: NFSv4.1 File Layout Type

This section describes the semantics and format of NFSv4.1 file-based layouts for pNFS.  NFSv4.1 file-based layouts use the LAYOUT4_NFSV4_1_FILES layout type.  The LAYOUT4_NFSV4_1_FILES type defines striping of data across multiple NFSv4.1 data servers.

13.1.  Session Considerations

Sessions are a mandatory feature of NFSv4.1, and this extends to both the metadata server and file-based (NFSv4.1-based) data servers.  If data is served by both the metadata server and an NFSv4.1-based data server, the metadata and data server MUST have separate client IDs (unless the EXCHANGE_ID results indicate the server will allow the client ID to support both metadata and data pNFS operations).

When creating a client ID to access a pNFS metadata server, the pNFS metadata client sends an EXCHANGE_ID operation that has EXCHGID4_FLAG_USE_PNFS_MDS set (EXCHGID4_FLAG_USE_NON_PNFS and EXCHGID4_FLAG_USE_PNFS_DS MAY be set as well).  If the server's EXCHANGE_ID results have EXCHGID4_FLAG_USE_PNFS_MDS set, then the client may use the client ID to create sessions that will exchange pNFS metadata operations.

If the pNFS metadata client gets a layout that refers it to an NFSv4.1 data server, it needs a client ID on that data server.  If it does not yet have a client ID from that server with the EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID results, then the client must send an EXCHANGE_ID to the data server, using the same co_ownerid as it sent to the metadata server, with the EXCHGID4_FLAG_USE_PNFS_DS flag set in the arguments.  If the server's EXCHANGE_ID results have EXCHGID4_FLAG_USE_PNFS_DS set, then the client may use the client ID to create sessions that will exchange pNFS data operations.
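The flag checks just described can be sketched as follows.  This is an illustrative sketch, not normative text: the flag values below are placeholders (the assigned values appear with the EXCHANGE_ID definition), and only the decision logic follows the paragraphs above.

```python
# Placeholder bit values for illustration; not the assigned constants.
EXCHGID4_FLAG_USE_NON_PNFS = 0x1
EXCHGID4_FLAG_USE_PNFS_MDS = 0x2
EXCHGID4_FLAG_USE_PNFS_DS  = 0x4

def usable_for_metadata_sessions(eir_flags):
    # The client may create pNFS metadata sessions only if the server's
    # EXCHANGE_ID results echoed EXCHGID4_FLAG_USE_PNFS_MDS.
    return bool(eir_flags & EXCHGID4_FLAG_USE_PNFS_MDS)

def needs_data_server_exchange_id(ds_eir_flags):
    # A layout referral requires a client ID on the data server; a new
    # EXCHANGE_ID (same co_ownerid, USE_PNFS_DS set) is needed unless
    # the client already holds results with USE_PNFS_DS from that server.
    return ds_eir_flags is None or \
        not (ds_eir_flags & EXCHGID4_FLAG_USE_PNFS_DS)
```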
The client ID returned by a metadata server has no required association to the client ID returned by a data server that the metadata server's layouts referred the client to, although a server implementation is free to construct such an association (e.g. via a private data server/metadata server protocol and client ID table).  Similarly, the EXCHANGE_ID/CREATE_SESSION sequenceid state used by the pNFS metadata client and server has no association with the EXCHANGE_ID/CREATE_SESSION sequenceid state used by the data client/server (and the pNFS server and the pNFS client MUST NOT make this association).  By decoupling the client IDs of metadata and data servers from each other, implementation of the session on pNFS servers is potentially simpler.

In a non-pNFS server or in a metadata server, the sessionid in the SEQUENCE operation implies the client ID, which in turn might be used by the server to map the stateid to the right client/server pair.  However, when a data server is presented with a READ or WRITE operation with a stateid, because the stateid is associated with a client ID on a metadata server, and because the sessionid in the preceding SEQUENCE operation is tied to the potentially unrelated data server client ID, the data server has no obvious way to determine the metadata server from the COMPOUND procedure, and thus has no way to validate the stateid.  One recommended approach is for pNFS servers to encode metadata server routing and/or identity information in the data server filehandles as returned in the layout.  If the metadata server identity or location changes, requiring the data server filehandles to become invalid (stale), the metadata server must first recall the layouts.

Invalidating data server filehandles does not render the pNFS data cache invalid.
If the metadata server filehandle of a file is persistent, the client can map the metadata server filehandle to cached data, and when granted data server filehandles, map the data server filehandles to their metadata server filehandle.

13.2.  File Layout Definitions

The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout type, and may be applicable to other layout types.

Unit.  A unit is a set of data written to a data server.

Pattern.  A pattern is a method of distributing fixed-size units across a set of data servers.  A pattern is iterated one or more times.  A pattern has one or more units.  Each unit in each iteration of a pattern MUST be the same size.

Stripe.  A stripe is a set of data distributed across a set of data servers in a pattern before that pattern repeats.

Stripe Width.  A stripe width is the size of a stripe in octets.

Hereafter, this document will refer to a unit that is written in a pattern as a "stripe unit".

A pattern may have more stripe units than data servers.  If so, some data servers will have more than one stripe unit per stripe.  A data server that has multiple stripe units per stripe MAY store each unit in a different data file.

13.3.  File Layout Data Types

The high level NFSv4.1 layout types are nfsv4_1_file_layout_ds_addr4, nfsv4_1_file_layouthint4, and nfsv4_1_file_layout4.

When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in the loc_type field of the lo_content field), the loc_body field of the lo_content field contains a value of data type nfsv4_1_file_layout4.  Among other content, nfsv4_1_file_layout4 has storage device IDs (within the nfl_ds_fh_list array) of data type deviceid4.

The GETDEVICEINFO operation maps a device ID to a storage device address (type device_addr4).
When GETDEVICEINFO returns a device address with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type field), the da_addr_body field contains a value of data type nfsv4_1_file_layout_ds_addr4.

The SETATTR operation supports a layout hint attribute (Section 5.13.4).  When the client sets a layout hint (data type layouthint4) with a layout type of LAYOUT4_NFSV4_1_FILES (the loh_type field), the loh_body field contains a value of data type nfsv4_1_file_layouthint4.

The top level and lower level NFSv4.1 layout data types have the following XDR descriptions.

   enum file_layout_ds_type4 {
           FILEDS4_SIMPLE  = 1,
           FILEDS4_COMPLEX = 2
   };

   %/* Encoded in the da_addr_body field of type device_addr4: */
   union nfsv4_1_file_layout_ds_addr4
    switch (file_layout_ds_type4 nflda_type) {
    case FILEDS4_SIMPLE:
           netaddr4        nflda_simp_ds_list<>;
    case FILEDS4_COMPLEX:
           deviceid4       nflda_comp_ds_list<>;
    default:
           void;
   };

   enum stripetype4 {
           STRIPE4_SPARSE = 1,
           STRIPE4_DENSE  = 2
   };

   %/* Encoded in the loh_body field of type layouthint4: */
   struct nfsv4_1_file_layouthint4 {
           stripetype4     nflh_stripe_type;
           length4         nflh_stripe_unit_size;
           uint32_t        nflh_stripe_width;
   };

   struct nfsv4_1_file_layout_ds_fh4 {
           deviceid4       nfldf_ds_id;
           uint32_t        nfldf_ds_index;
           nfs_fh4         nfldf_fh;
   };

   %/* Encoded in the loc_body field of type layout_content4: */
   struct nfsv4_1_file_layout4 {
           stripetype4     nfl_stripe_type;
           bool            nfl_commit_through_mds;
           length4         nfl_stripe_unit_size;
           length4         nfl_file_size;
           uint32_t        nfl_stripe_indices<>;
           nfsv4_1_file_layout_ds_fh4 nfl_ds_fh_list<>;
   };

   %/*
    % * Encoded in the lou_body field of type layoutupdate4:
    % * Nothing.  lou_body is a zero length array of octets.
    % */

The nfsv4_1_file_layout_ds_addr4 data server address is composed of a FILEDS4_SIMPLE or a FILEDS4_COMPLEX data server address.  A FILEDS4_SIMPLE data server address is composed of an array of network addresses (data type netaddr4).  All data servers in a FILEDS4_SIMPLE list (field nflda_simp_ds_list) must be equivalent and are used for data server multipathing; see Section 13.6 for more details on equivalent data servers.  FILEDS4_SIMPLE data servers always refer to actual data servers.  On the other hand, a FILEDS4_COMPLEX data server address is constructed of a list of device IDs (field nflda_comp_ds_list).  Each device ID in nflda_comp_ds_list corresponds to the device ID of a data server address of type FILEDS4_SIMPLE.  A FILEDS4_COMPLEX data server list MUST NOT contain device IDs of other FILEDS4_COMPLEX data servers; only device IDs of FILEDS4_SIMPLE data servers are to be referenced.  This enables multiple equivalent data servers to be identified through a single device ID and provides a space efficient mechanism by which to identify multiple data servers within a layout.  FILEDS4_COMPLEX and FILEDS4_SIMPLE data servers share the same device ID space and should be cached similarly by the client.

The nfsv4_1_file_layout4 data type specifies an ordered array of <device ID, data server index, filehandle> tuples, as well as the stripe unit size, the type of stripe layout (discussed later in this section and in Section 13.4), and the file's current size as of LAYOUTGET (Section 17.43) time.

The nfl_ds_fh_list array within the nfsv4_1_file_layout4 data type contains a list of nfsv4_1_file_layout_ds_fh4 structures.  Each of these structures describes one or more FILEDS4_SIMPLE or FILEDS4_COMPLEX data servers that contribute to a stripe of the file.
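To make the striping arithmetic from Section 13.2 concrete, the following sketch (an illustration under assumptions, not part of the protocol) maps a file offset to the entry of a flattened data server list that serves it, given the stripe unit size from nfl_stripe_unit_size.  The flat list and the example values are hypothetical; a simple round-robin pattern is assumed.

```python
def stripe_target(offset, stripe_unit_size, flat_list):
    """Pick the <device ID, filehandle> pair serving a file offset.
    flat_list is a flattened striping list (as constructed in
    Section 13.4); the stripe width is implicitly
    stripe_unit_size * len(flat_list)."""
    unit_number = offset // stripe_unit_size        # which stripe unit
    return flat_list[unit_number % len(flat_list)]  # slot in pattern

# Hypothetical example: 4 data servers, 64 KiB stripe units.
flat = [(1, 0x10), (2, 0x11), (3, 0x12), (4, 0x13)]
```

With these assumed values, offset 0 falls on device 1, offset 65536 on device 2, and the pattern wraps so that offset 262144 falls back on device 1.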
The nfl_stripe_indices array contains a list of indices into the nfl_ds_fh_list array; an index of zero specifies the first entry in nfl_ds_fh_list.  Each successive index selects the nfl_ds_fh_list entry to be used next in sequence for the stripe.  This allows an arbitrary sequencing through the possible data servers to be encoded compactly.  The value of every element in nfl_stripe_indices must be less than the number of elements in the nfl_ds_fh_list array.

When the nfl_stripe_indices array is of zero length, the elements of the nfl_ds_fh_list array are simply used in order, so that the portion of the stripe held by the corresponding entry is determined by its position within the data server list.

If the nfl_stripe_indices array is of non-zero length, there is no requirement that the nfl_stripe_indices and nfl_ds_fh_list arrays have the same number of entries.  If the nfl_stripe_indices array has fewer entries than the nfl_ds_fh_list array, this simply means not all entries of nfl_ds_fh_list are in the striping pattern.

Even if nfl_stripe_indices has the same number of entries as the nfl_ds_fh_list array, this does not necessarily mean all entries of nfl_ds_fh_list are used, because nothing prevents an index value from appearing in multiple entries of nfl_stripe_indices.

If the nfl_stripe_indices array has more entries than the nfl_ds_fh_list array, then this simply means index values in nfl_stripe_indices appear more than once.

Each nfl_ds_fh_list entry contains a device ID, a data server index, and a filehandle.  The device ID (field nfldf_ds_id) identifies the data server.  The GETDEVICEINFO operation is used to map nfldf_ds_id to a data server address, which will be either a FILEDS4_COMPLEX or FILEDS4_SIMPLE data server address.
When the device ID maps to a FILEDS4_COMPLEX data server address, the data server index (field nfldf_ds_index) indicates the starting element to use from the list of device IDs (nflda_comp_ds_list) of the FILEDS4_COMPLEX address. (As discussed in Section 13.4, the nfldf_ds_index field plays a critical role in the flattening of a FILEDS4_COMPLEX device.) If the nfldf_ds_id field maps to a FILEDS4_SIMPLE device, the nfldf_ds_index field has no meaning and should be zero. The filehandle, nfldf_fh, identifies the file on the data server identified by the device ID.

The generic layout hint structure is described in Section 3.2.22. The client uses the layout hint in the layout_hint (Section 5.13.4) attribute to specify the type of layout to be used for a newly created file. The LAYOUT4_NFSV4_1_FILES layout type-specific content for the layout hint is composed of the preferred stripe packing type (field nflh_stripe_type, discussed in Section 13.5), the size of the stripe unit (field nflh_stripe_unit_size), and the width of the stripe (field nflh_stripe_width).

13.4. Interpreting the File Layout

The client is expected to construct a flat list of <data server, filehandle> pairs over which the file is striped. A flat data server list contains no FILEDS4_COMPLEX data servers, and is constructed by concatenating each data server encountered while traversing nfl_stripe_indices (or nfl_ds_fh_list in the case of a zero-length nfl_stripe_indices array), while expanding each FILEDS4_COMPLEX data server address. The client must expand the FILEDS4_COMPLEX data server address's device ID list by starting at the entry of the nflda_comp_ds_list array indexed by nfldf_ds_index, and ending with the device ID prior to nfldf_ds_index (or with the last entry of the nflda_comp_ds_list array if nfldf_ds_index is zero).
All device IDs in the nflda_comp_ds_list must be consumed; this may require wrapping around the end of the array if nfldf_ds_index is non-zero. The stripe width is determined by the stripe unit size multiplied by the number of data server entries within the flattened stripe.

Consider the following example:

Given a set of data servers with the following device IDs:

   1->{simple}; 2->{complex, ds_list=<3, 4>}; 3->{simple};
   4->{simple}; 5->{simple}; 6->{complex, ds_list=<4, 1, 5>};

Device IDs 1, 3, 4, and 5 identify FILEDS4_SIMPLE data servers. Device ID 2 is a FILEDS4_COMPLEX data server constructed of FILEDS4_SIMPLE data servers 3 and 4. Device ID 6 is a FILEDS4_COMPLEX data server constructed of FILEDS4_SIMPLE data servers 4, 1, and 5.

Within an instance of nfsv4_1_file_layout4, imagine an nfl_ds_fh_list constructed of <device ID, data server index, filehandle> tuples:

   ds_fh_list = [<6, 1, 0x17>, <1, 0, 0x12>, <5, 0, 0x22>,
                 <2, 0, 0x13>, <3, 0, 0x14>, <4, 0, 0x15>]

And an nfl_stripe_indices array containing the following indices:

   nfl_stripe_indices = [5, 2, 3, 0, 1, 2]

Using nfl_stripe_indices as indices into the nfl_ds_fh_list, we get the following re-ordered list of nfsv4_1_file_layout_devfh4 values:

   [<4, 0, 0x15>, <5, 0, 0x22>, <2, 0, 0x13>,
    <6, 1, 0x17>, <1, 0, 0x12>, <5, 0, 0x22>]

Converting the FILEDS4_COMPLEX devices to FILEDS4_SIMPLE devices gives us the following list of 9 FILEDS4_SIMPLE <device ID, filehandle> tuples:

   [<4, 0x15>, <5, 0x22>, <3, 0x13>, <4, 0x13>,
    <1, 0x17>, <5, 0x17>, <4, 0x17>, <1, 0x12>,
    <5, 0x22>]

The above list of tuples fully describes the striping pattern. We observe several things. First, the tuples are not 3-tuples; they do not have an index value because FILEDS4_SIMPLE devices do not use the index. Second, each tuple in the sequence represents a destination for each stripe unit in the pattern.
Third, device 2 is a FILEDS4_COMPLEX device that gets replaced with devices 3 and 4. Fourth, device 6 is a FILEDS4_COMPLEX device that gets replaced with devices 1, 5, 4 (and not in the order 4, 1, 5, because the nfl_ds_fh_list entry for device 6 has a non-zero index value of 1; we start with the second simple device that device 6 maps to and wrap around to the first simple device after processing the third simple device that device 6 maps to). Fifth, when converting from FILEDS4_COMPLEX to FILEDS4_SIMPLE, the filehandle in the FILEDS4_SIMPLE entries that replace a FILEDS4_COMPLEX entry comes from the replaced FILEDS4_COMPLEX entry. As a result, the striping pattern can have the same device ID appear multiple times, and with different filehandles.

The flattened data server list specifies the pattern of data servers over which the file is striped and to which data is written (in increments of the stripe unit size). It also specifies the filehandle to be used for each stripe unit of the pattern. A data server that stores more than one stripe unit of a pattern may store those stripe units in different files, but to do so it needs unique filehandles in the data server list, as the previous example showed. While data servers may be repeated multiple times within the flattened data server list, if the STRIPE4_DENSE stripe type is used (see Section 13.5), the same filehandle MUST NOT be used on the same data server for different stripe units of the same file.

A data file stored on a data server MUST map to a single file as defined by the metadata server; i.e., data from two files as viewed by the metadata server MUST NOT be stored within the same data file on any data server.

13.5.
Sparse and Dense Stripe Unit Packing

The nfl_stripe_type field specifies how the data is packed within the data file on a data server. It allows for two different data packings: STRIPE4_SPARSE and STRIPE4_DENSE. The stripe type determines the calculation that must be made to map the client-visible file offset to the offset within the data file located on the data server.

STRIPE4_SPARSE means that the logical offsets of the file as viewed by a client issuing READs and WRITEs directly to the metadata server are the same offsets each data server uses when storing a stripe unit. The effect then, for striping patterns consisting of at least two stripe units, is for each data server file to be sparse, or "holey". For example, suppose a pattern of 3 stripe units, where the stripe unit size is a 4-kilobyte block and there are 3 data servers in the pattern. Then the file on data server 1 will have blocks 0, 3, 6, 9, ... filled, data server 2's file will have blocks 1, 4, 7, 10, ... filled, and data server 3's file will have blocks 2, 5, 8, 11, ... filled. The unfilled blocks of each file will be holes; hence the files on each data server are sparse. Logical blocks 0, 3, 6, ... of the file would exist as physical blocks 0, 3, 6, ... on data server 1; logical blocks 1, 4, 7, ... would exist as physical blocks 1, 4, 7, ... on data server 2; and logical blocks 2, 5, 8, ... would exist as physical blocks 2, 5, 8, ... on data server 3.

The STRIPE4_SPARSE stripe type has holes for the octet ranges not exported by that data server, thereby allowing pNFS clients to use the real offset into the data server's file, regardless of the data server's position within the pattern. However, if a client attempts I/O to one of the holes, then an error MUST be returned by the data server.
Using the above example, if data server 2 received a READ or WRITE request for block 3 (a block held by data server 1, and thus a hole in data server 2's file), the data server would return NFS4ERR_PNFS_IO_HOLE. Thus data servers need to understand the striping pattern in order to support STRIPE4_SPARSE layouts.

STRIPE4_DENSE means that the data server files have no holes. STRIPE4_DENSE might be selected because the data server does not (efficiently) support holey files, e.g., the data server's file system allocates storage in the gaps, making STRIPE4_SPARSE a waste of space. If the STRIPE4_DENSE stripe type is indicated in the layout, the data files must be packed. Using the example striping pattern and stripe unit size that were used for the STRIPE4_SPARSE example, the STRIPE4_DENSE example would have blocks 0, 1, 2, 3, 4, ... of every data server's data file filled. Logical blocks 0, 3, 6, ... of the file would live on blocks 0, 1, 2, ... of the file on data server 1; logical blocks 1, 4, 7, ... of the file would live on blocks 0, 1, 2, ... of the file on data server 2; and logical blocks 2, 5, 8, ... of the file would live on blocks 0, 1, 2, ... of the file on data server 3.

Since the STRIPE4_DENSE layout does not leave holes on the data servers, the pNFS client is allowed to write to any offset of any data file of any data server in the stripe. Thus the data servers need not know the file's striping pattern.
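The flattening of Section 13.4 and the packing arithmetic of this section can be sketched in Python. The device table, filehandles, and helper names below are hypothetical illustrations, not protocol elements:

```python
# Sketch of client-side layout flattening (Section 13.4) and the
# dense-packing offset arithmetic (Section 13.5). All names and the
# example device table are hypothetical.

def expand_complex(comp_ds_list, start):
    # Consume every device ID in nflda_comp_ds_list, starting at
    # nfldf_ds_index and wrapping around the end of the array.
    n = len(comp_ds_list)
    return [comp_ds_list[(start + i) % n] for i in range(n)]

def flatten(devices, ds_fh_list, stripe_indices):
    # A zero-length nfl_stripe_indices means "use nfl_ds_fh_list in
    # order"; otherwise each index selects an nfl_ds_fh_list entry.
    order = stripe_indices or range(len(ds_fh_list))
    flat = []
    for i in order:
        ds_id, ds_index, fh = ds_fh_list[i]
        kind, comp_ds_list = devices[ds_id]
        if kind == "SIMPLE":
            flat.append((ds_id, fh))
        else:
            # FILEDS4_COMPLEX: replace with its simple devices, keeping
            # the filehandle of the replaced complex entry.
            flat.extend((s, fh) for s in expand_complex(comp_ds_list, ds_index))
    return flat

def dense_data_file_offset(file_offset, stripe_unit_size, n):
    # Offset within the data file under STRIPE4_DENSE packing.
    stripe_width = stripe_unit_size * n
    return ((file_offset // stripe_width) * stripe_unit_size
            + file_offset % stripe_unit_size)

def data_server_idx(file_offset, stripe_unit_size, n):
    # Index into the flattened data server list (either packing).
    return (file_offset // stripe_unit_size) % n

# Hypothetical device table: 1, 3, 4 are FILEDS4_SIMPLE; 2 is a
# FILEDS4_COMPLEX device over simple devices 3 and 4.
devices = {1: ("SIMPLE", None), 2: ("COMPLEX", [3, 4]),
           3: ("SIMPLE", None), 4: ("SIMPLE", None)}
ds_fh_list = [(1, 0, 0x12), (2, 1, 0x13), (4, 0, 0x15)]

flat = flatten(devices, ds_fh_list, [2, 0, 1])
# Entry <2, 1, 0x13> starts at the second element of [3, 4] and wraps:
assert flat == [(4, 0x15), (1, 0x12), (4, 0x13), (3, 0x13)]

# Dense packing, 4 KiB stripe unit: logical block 5 of the file lands on
# data server index 1, at block 1 of its data file.
assert data_server_idx(5 * 4096, 4096, len(flat)) == 1
assert dense_data_file_offset(5 * 4096, 4096, len(flat)) == 4096
```

The assertions mirror the rules stated in the text: complex expansion starts at nfldf_ds_index and wraps, the expansion inherits the complex entry's filehandle, and the dense mapping packs each data file with no holes.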
The calculation to determine the octet offset within the data file for dense data server layouts is:

   stripe_width = stripe_unit_size * N;
       where N = number of entries in the flattened nfl_ds_fh_list

   data_file_offset = floor(file_offset / stripe_width)
                      * stripe_unit_size
                      + file_offset % stripe_unit_size

Regardless of the data server layout, the calculation to determine the index into the device array is the same:

   data_server_idx = floor(file_offset / stripe_unit_size) mod N

Section 13.12 describes the semantics for dealing with reads to holes within the striped file. This is of particular concern, since each individual component stripe file (i.e., the component of the striped file that lives on a particular data server) may be of different length. Thus, clients may experience "short" reads when reading off the end of one of these component files.

13.6. Data Server Multipathing

The NFSv4.1 file layout supports multipathing to "equivalent" (defined later in this section) data servers. Data server-level multipathing is primarily of use in the case of a data server failure; it allows the client to switch to another data server that is exporting the same data stripe unit, without having to contact the metadata server for a new layout.

To support data server multipathing, there is an array of data server network addresses (nflda_simp_ds_list) within the FILEDS4_SIMPLE case of the nfsv4_1_file_layout_ds_addr4 switched union. This array represents an ordered list of data servers (each identified by a network address) where the first element has the highest priority. Each data server in the list MUST be equivalent to every other data server in the list, and each data server MUST be attempted in the order specified.
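The ordered failover just described might be sketched as follows; the transport function and the addresses (drawn from the 192.0.2.0/24 documentation range) are hypothetical:

```python
# Sketch of ordered failover across "equivalent" data servers in
# nflda_simp_ds_list; send_io and the addresses are hypothetical.

def io_with_failover(simp_ds_list, request, send_io):
    # The first element has the highest priority; each data server MUST
    # be attempted in the order given before the I/O is failed.
    last_err = None
    for addr in simp_ds_list:
        try:
            return send_io(addr, request)
        except ConnectionError as err:
            last_err = err  # fall through to the next equivalent server
    raise last_err

def send_io(addr, request):
    # Stand-in transport: pretend the first data server is unreachable.
    if addr == "192.0.2.1":
        raise ConnectionError("data server unreachable")
    return ("ok", addr)

result = io_with_failover(["192.0.2.1", "192.0.2.2"], b"READ", send_io)
# The I/O succeeds against the second, equivalent data server.
```

Because the list members are required to be equivalent, retrying the identical request against the next address is safe without metadata server involvement.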
Two data servers are equivalent if they export the same system image (e.g., the stateids and filehandles that they use are the same) and provide the same consistency guarantees. Two equivalent data servers must also have sufficient connections to the storage, such that writing to one data server is equivalent to writing to another; this also applies to reading. Also, if multiple copies of the same data exist, reading from one must provide access to all existing copies. As such, it is unlikely that multipathing will provide additional benefit in the case of an I/O error.

[[Comment.16: [NOTE: the error cases in which a client is expected to attempt an equivalent data server should be specified.]]]

13.7. Operations Issued to NFSv4.1 Data Servers

Clients MUST use the filehandle described within the layout when accessing data on NFSv4.1 data servers. When using the layout's filehandle, the client MUST only issue the NULL procedure and the COMPOUND procedure's BACKCHANNEL_CTL, BIND_CONN_TO_SESSION, CREATE_SESSION, COMMIT, DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID, READ, WRITE, PUTFH, SECINFO_NO_NAME, SET_SSV, and SEQUENCE operations to the NFSv4.1 data server associated with that data server filehandle. If a client issues an operation other than those specified above to the data server, using the filehandle and data server listed in the file's layout, that data server MUST return an error to the client (unless the pNFS server has chosen not to disambiguate the data server filehandle from the metadata server filehandle, and/or the pNFS server has chosen not to disambiguate the metadata server client ID from the data server client ID). The client MUST follow the instructions implied by the layout (i.e., which filehandles to use on which data servers).
As described in Section 12.5.1, a client MUST NOT issue I/Os to data servers for which it does not hold a valid layout. The data servers MAY reject such requests.

GETATTR and SETATTR MUST be directed to the metadata server. In the case of a SETATTR of the size attribute, the control protocol is responsible for propagating size updates/truncations to the data servers. In the case of extending WRITEs to the data servers, the new size must be visible on the metadata server once a LAYOUTCOMMIT has completed (see Section 12.5.3.2). Section 13.12 describes the mechanism by which the client is to handle data server files that do not reflect the metadata server's size.

13.8. COMMIT Through Metadata Server

The nfl_commit_through_mds field in the file layout (data type nfsv4_1_file_layout4) lets the metadata server indicate its preferred way for the client to perform COMMIT. If this field is TRUE, the client SHOULD send COMMIT to the metadata server instead of sending it to the same data server to which the associated WRITEs were sent. In order to maintain the current NFSv4.1 commit and recovery model, all the data servers MUST return a common writeverf verifier in all WRITE responses for a given file layout. The value of the writeverf verifier MUST be changed at the metadata server or any data server that is referenced in the layout, whenever there is a server event that can possibly lead to loss of uncommitted data. The scope of the verifier can be for a file or for the entire pNFS server. It might be more difficult for the server to maintain the verifier at the file level, but the benefit is that only events that impact a given file will require recovery action.
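The common-writeverf requirement lets the client detect possibly lost writes with a simple comparison; a hedged sketch, with a hypothetical record type and verifier values:

```python
# Sketch: compare the verifier returned by a COMMIT sent to the metadata
# server against the verifiers from earlier WRITEs to the data servers.
# WriteResult and the verifier values below are hypothetical.
from dataclasses import dataclass

@dataclass
class WriteResult:
    offset: int
    length: int
    writeverf: bytes  # 8-octet verifier from the WRITE reply

def needs_rewrite(write_results, commit_verf):
    # Any WRITE whose verifier differs from the COMMIT verifier may have
    # been lost across a server event and must be reissued.
    return [w for w in write_results if w.writeverf != commit_verf]

writes = [WriteResult(0, 4096, b"\x00" * 8),
          WriteResult(4096, 4096, b"\x00" * 7 + b"\x01")]
stale = needs_rewrite(writes, commit_verf=b"\x00" * 8)
# Only the second WRITE carries a mismatched verifier and is reissued.
```

The sketch only covers the comparison step; the recovery choices (reissuing WRITEs, getting a new layout, or rewriting through the metadata server) are as described in this section.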
The single COMMIT to the metadata server will return a verifier, and the client should compare it to all the verifiers from the WRITEs and fail the COMMIT if any of the verifiers mismatch. If COMMIT to the metadata server fails, the client should reissue WRITEs for all the modified data in the file. The client should treat modified data with a mismatched verifier as a WRITE failure and try to recover by reissuing the WRITEs to the original data server, or by using another path to that data if the layout has not been recalled. Other options the client has are to get a new layout, or simply to rewrite the data through the metadata server. If the flag nfl_commit_through_mds is FALSE, the client should not send COMMIT to the metadata server. Although it is valid to send COMMIT to the metadata server, it should be used only to commit data that was written through the metadata server. See Section 12.7.6 for recovery options.

13.9. Global Stateid Requirements

Note that there are no stateids embedded within the layout returned by the metadata server to the pNFS client. The client uses a stateid returned previously by the metadata server (including results from OPEN, for which a delegation stateid is acceptable as well as a non-delegation stateid, from lock operations, from WANT_DELEGATION, and also from the CB_PUSH_DELEG callback operation) or a special stateid to perform I/O on the data servers, as in regular NFSv4.1. Special stateid usage for I/O is subject to the NFSv4.1 protocol specification. The stateid used for I/O MUST have the same effect and be subject to the same validation on a data server as it would if the I/O was being performed on the metadata server itself in the absence of pNFS. This has the implication that stateids are globally valid on both the metadata and data servers.
This requires the metadata server to propagate changes in lock and open state to the data servers, so that the data servers can validate I/O accesses. This is discussed further in Section 13.11. Depending on when stateids are propagated, the existence of a valid stateid on the data server may act as proof of a valid layout.

13.10. The Layout Iomode

The layout iomode need not be used by the metadata server when servicing NFSv4.1 file-based layouts, although in some circumstances it may be useful. For example, if the server implementation supports reading from read-only replicas or mirrors, it would be useful for the server to return a layout enabling the client to do so. As such, the client SHOULD set the iomode based on its intent to read or write the data. The client may default to an iomode of LAYOUTIOMODE4_RW. The iomode need not be checked by the data servers when clients perform I/O. However, the data servers SHOULD still validate that the client holds a valid layout and return an error if the client does not.

13.11. Data Server State Propagation

Since the metadata server, which handles lock and open-mode state changes as well as ACLs, may not be co-located with the data servers where I/O accesses are validated, the server implementation MUST take care of propagating changes of this state to the data servers. Once the propagation to the data servers is complete, the full effect of those changes must be in effect at the data servers. However, some state changes need not be propagated immediately, although all changes SHOULD be propagated promptly. These state propagations have an impact on the design of the control protocol, even though the control protocol is outside of the scope of this specification.
Immediate propagation refers to the synchronous propagation of state from the metadata server to the data server(s); the propagation must be complete before returning to the client.

13.11.1. Lock State Propagation

If the pNFS server supports mandatory locking, any mandatory locks on a file MUST be made effective at the data servers before the request that establishes them returns to the caller. Thus, mandatory lock state MUST be synchronously propagated to the data servers. On the other hand, since advisory lock state is not used for checking I/O accesses at the data servers, there is no semantic reason for propagating advisory lock state to the data servers. However, since all lock, unlock, open downgrade, and open upgrade operations MAY affect the "seqid" stored within the stateid (see Section 8.1.3.1), the stateid changes may cause difficulty if this state is not propagated. Thus, when a client uses a stateid on a data server for I/O with a newer "seqid" number than the one the data server has, the data server may need to query the metadata server and get any pending updates to that stateid. This allows stateid sequence number changes to be propagated lazily, on demand.

Since updates to advisory locks neither confer nor remove privileges, these changes need not be propagated immediately, and may not need to be propagated promptly. The updates to advisory locks need only be propagated when the data server needs to resolve a question about a stateid. In fact, if record locking is not mandatory (i.e., is advisory), the clients are advised not to use the lock-based stateids for I/O at all. The stateids returned by OPEN are sufficient and eliminate overhead for this kind of state propagation.

13.11.2. Open-mode Validation

Open-mode validation MUST be performed against the open mode(s) held by the data servers.
However, the server implementation may not always require the immediate propagation of changes. Reduction in access because of CLOSEs or DOWNGRADEs does not have to be propagated immediately, but SHOULD be propagated promptly; whereas changes due to revocation MUST be propagated immediately. On the other hand, changes that expand access (e.g., new OPENs and upgrades) do not have to be propagated immediately, but the data server SHOULD NOT reject a request because of open mode issues without making sure that the upgrade is not in flight.

13.11.3. File Attributes

Since the SETATTR operation has the ability to modify state that is visible on both the metadata and data servers (e.g., the size), care must be taken to ensure that the resultant state across the set of data servers is consistent, especially when truncating or growing the file.

As described earlier, the LAYOUTCOMMIT operation is used to ensure that the metadata is synchronized with changes made to the data servers. For the NFSv4.1-based data storage protocol, it is necessary to re-synchronize state such as the size attribute, and the setting of mtime/change/atime. See Section 12.5.3 for a full description of the semantics regarding LAYOUTCOMMIT and attribute synchronization. It should be noted that, by using an NFSv4.1-based layout type, it is possible to synchronize this state before LAYOUTCOMMIT occurs. For example, the control protocol can be used to query the attributes present on the data servers.

Any changes to file attributes that control authorization or access, as reflected by ACCESS calls or READs and WRITEs on the metadata server, MUST be propagated to the data servers for enforcement on READ and WRITE I/O calls.
If the changes made on the metadata server result in more restrictive access permissions for any user, those changes MUST be propagated to the data servers synchronously.

The OPEN operation (Section 17.16.5) does not impose any requirement that I/O operations on an open file have the same credentials as the OPEN itself, and so requires the server's READ and WRITE operations to perform appropriate access checking. Changes to ACLs also require new access checking by READ and WRITE on the server. The propagation of access right changes due to changes in ACLs may be asynchronous only if the server implementation is able to determine that the updated ACL is not more restrictive for any user specified in the old ACL. Due to the relative infrequency of ACL updates, it is suggested that all changes be propagated synchronously.

13.12. Data Server Component File Size

A potential problem exists when a component data file on a particular data server is grown past EOF; the problem exists for both dense and sparse layouts. Imagine the following scenario: a client creates a new file (size == 0) and writes to octet 131072; the client then seeks to the beginning of the file and reads octet 100. The client should receive 0s back as a result of the READ. However, if the READ falls on a data server other than the one that received the client's original WRITE, the data server servicing the READ may still believe that the file's size is 0 and return no data with the EOF flag set. The data server can only return 0s if it knows that the file's size has been extended. This would require the immediate propagation of the file's size to all data servers, which is potentially very costly.
Therefore, the client that has initiated the extension of the file's size MUST be prepared to deal with these EOF conditions; READs that return EOF or short data will be treated as a hole in the file, and the NFS client will substitute 0s for the data when the offset is less than the client's view of the file size.

The NFSv4.1 protocol only provides close-to-open file data cache semantics, meaning that when the file is closed all modified data is written to the server. When a subsequent OPEN of the file is done, the change attribute is inspected for a difference from its cached value. For the case above, this means that a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and will update the file's size and change attribute. Access from another client after that point will result in the appropriate size being returned.

13.13. Recovery Considerations

As described in Section 12.7, the layout type-specific storage protocol is responsible for handling the effects of I/Os started before lease expiration and extending through lease expiration. The NFSv4.1 file layout type prevents all I/Os from being executed after lease expiration, without relying on a precise client lease timer and without requiring data servers to maintain lease timers.

It works as follows. As described in Section 13.1, in COMPOUND procedure requests to the data server, the data filehandle provided by the PUTFH operation and the stateid in the READ or WRITE operation are used to validate that the client has a valid layout for the I/O being performed; if it does not, the I/O is rejected. Before the metadata server takes any action to invalidate a layout given out by a previous instance, it must make sure that all layouts from that previous instance are invalidated at the data servers.
This means that a metadata server may not restripe a file until it has contacted all of the data servers to invalidate the layouts from the previous instance, nor may it give out mandatory locks that conflict with layouts from the previous instance without either doing a specific invalidation (as it would have to do anyway) or doing a global data server invalidation.

13.14. Security Considerations for the File Layout Type

The NFSv4.1 file layout type MUST adhere to the security considerations outlined in Section 12.9. NFSv4.1 data servers must make all of the required access checks on each READ or WRITE I/O as determined by the NFSv4.1 protocol. If the metadata server would deny a READ or WRITE operation on a given file due to its ACL, mode attribute, open mode, open deny mode, mandatory lock state, or any other attributes and state, the data server MUST also deny the READ or WRITE operation. This impacts the control protocol and the propagation of state from the metadata server to the data servers; see Section 13.11 for more details.

The methods for authentication, integrity, and privacy for file layout-based data servers are the same as those used for metadata servers. Metadata and data servers use ONC RPC security flavors to authenticate, and SECINFO and SECINFO_NO_NAME to negotiate the security mechanism and services to be used.

For a given file object, a metadata server MAY require different security parameters (secinfo4 value) than the data server. For a given file object with multiple data servers, the secinfo4 value SHOULD be the same across all data servers.

If an NFSv4.1 implementation supports pNFS and supports NFSv4.1 file layouts, then the implementation MUST support the SECINFO_NO_NAME operation on both the metadata and data servers.

14.
Internationalization

The primary area in which NFS version 4 needs to deal with internationalization, or I18N, is file names and other strings as used within the protocol. The choice of string representation must allow reasonable name/string access to clients that use various languages. The UTF-8 encoding of the UCS as defined by ISO10646 [10] allows for this type of access and follows the policy described in "IETF Policy on Character Sets and Languages", RFC2277 [11].

RFC3454 [12], otherwise known as "stringprep", documents a framework for using Unicode/UTF-8 in networking protocols, so as "to increase the likelihood that string input and string comparison work in ways that make sense for typical users throughout the world." A protocol must define a profile of stringprep "in order to fully specify the processing options." The remainder of this Internationalization section defines the NFS version 4 stringprep profiles. Much of the terminology used for the remainder of this section comes from stringprep.

There are three UTF-8 string types defined for NFS version 4: utf8str_cs, utf8str_cis, and utf8str_mixed. Separate profiles are defined for each.
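As a small illustration of why such profiles are needed, two byte-distinct UTF-8 strings can become identical after the kind of processing stringprep specifies. Python's unicodedata module provides the Unicode normalization forms; the normative behavior is of course defined by the stringprep tables, not by this sketch:

```python
# Two distinct UTF-8 strings that are equal after normalization form KC
# (the form specified by the case-insensitive profile defined below).
import unicodedata

a = "\ufb01le"  # "file" spelled with U+FB01 LATIN SMALL LIGATURE FI
b = "file"
assert a != b                                 # distinct as UTF-8 strings
assert unicodedata.normalize("NFKC", a) == b  # identical after NFKC
```

A server therefore cannot rely on byte comparison alone when a profile applies normalization; the collision-handling choices are discussed under the nfs4_cs_prep profile.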
Each profile defines the following, as required by stringprep:

o  The intended applicability of the profile

o  The character repertoire that is the input and output to stringprep (which is Unicode 3.2 for the referenced version of stringprep)

o  The mapping tables from stringprep used (as described in section 3 of stringprep)

o  Any additional mapping tables specific to the profile

o  The Unicode normalization used, if any (as described in section 4 of stringprep)

o  The tables from stringprep listing characters that are prohibited as output (as described in section 5 of stringprep)

o  The bidirectional string testing used, if any (as described in section 6 of stringprep)

o  Any additional characters that are prohibited as output specific to the profile

Stringprep discusses Unicode characters, whereas NFS version 4 renders UTF-8 characters. Since there is a one-to-one mapping from UTF-8 to Unicode, when the remainder of this document refers to Unicode, the reader should assume UTF-8.

Much of the text for the profiles comes from RFC3491 [13].

14.1. Stringprep profile for the utf8str_cs type

Every use of the utf8str_cs type definition in the NFS version 4 protocol specification follows the profile named nfs4_cs_prep.

14.1.1. Intended applicability of the nfs4_cs_prep profile

The utf8str_cs type is a case-sensitive string of UTF-8 characters. Its primary use in NFS Version 4 is for naming components and pathnames. Components and pathnames are stored on the server's file system. Two valid distinct UTF-8 strings might be the same after processing via the utf8str_cs profile.
If the strings are two names inside a directory, the NFS version 4 server will need to either:

o  disallow the creation of a second name if its post-processed form collides with that of an existing name, or

o  allow the creation of the second name, but arrange so that after post-processing, the second name is different from the post-processed form of the first name.

14.1.2. Character repertoire of nfs4_cs_prep

The nfs4_cs_prep profile uses Unicode 3.2, as defined in stringprep's Appendix A.1.

14.1.3. Mapping used by nfs4_cs_prep

The nfs4_cs_prep profile specifies mapping using the following tables from stringprep:

   Table B.1

Table B.2 is normally not part of the nfs4_cs_prep profile, as it is primarily for dealing with case-insensitive comparisons. However, if the NFS version 4 file server supports the case_insensitive file system attribute, and if case_insensitive is true, the NFS version 4 server MUST use Table B.2 (in addition to Table B.1) when processing utf8str_cs strings, and the NFS version 4 client MUST assume that Table B.2 (in addition to Table B.1) is being used.

If the case_preserving attribute is present and set to false, then the NFS version 4 server MUST use Table B.2 to map case when processing utf8str_cs strings. Whether the server maps from lower to upper case or from upper to lower case is an implementation dependency.

14.1.4. Normalization used by nfs4_cs_prep

The nfs4_cs_prep profile does not specify a normalization form. A later revision of this specification may specify a particular normalization form. Therefore, the server and client can expect that they may receive unnormalized characters within protocol requests and responses.
If the operating environment requires normalization, then the
implementation must normalize utf8str_cs strings within the protocol
before presenting the information to an application (at the client)
or local file system (at the server).

14.1.5.  Prohibited output for nfs4_cs_prep

The nfs4_cs_prep profile specifies prohibiting using the following
tables from stringprep:

   Table C.3
   Table C.4
   Table C.5
   Table C.6
   Table C.7
   Table C.8
   Table C.9

14.1.6.  Bidirectional output for nfs4_cs_prep

The nfs4_cs_prep profile does not specify any checking of
bidirectional strings.

14.2.  Stringprep profile for the utf8str_cis type

Every use of the utf8str_cis type definition in the NFS version 4
protocol specification follows the profile named nfs4_cis_prep.

14.2.1.  Intended applicability of the nfs4_cis_prep profile

The utf8str_cis type is a case insensitive string of UTF-8
characters.  Its primary use in NFS Version 4 is for naming NFS
servers.

14.2.2.  Character repertoire of nfs4_cis_prep

The nfs4_cis_prep profile uses Unicode 3.2, as defined in
stringprep's Appendix A.1.

14.2.3.  Mapping used by nfs4_cis_prep

The nfs4_cis_prep profile specifies mapping using the following
tables from stringprep:

   Table B.1
   Table B.2

14.2.4.  Normalization used by nfs4_cis_prep

The nfs4_cis_prep profile specifies using Unicode normalization form
KC, as described in stringprep.

14.2.5.  Prohibited output for nfs4_cis_prep

The nfs4_cis_prep profile specifies prohibiting using the following
tables from stringprep:

   Table C.1.2
   Table C.2.2
   Table C.3
   Table C.4
   Table C.5
   Table C.6
   Table C.7
   Table C.8
   Table C.9

14.2.6.  Bidirectional output for nfs4_cis_prep

The nfs4_cis_prep profile specifies checking bidirectional strings as
described in stringprep's section 6.

14.3.  Stringprep profile for the utf8str_mixed type

Every use of the utf8str_mixed type definition in the NFS version 4
protocol specification follows the profile named nfs4_mixed_prep.

14.3.1.  Intended applicability of the nfs4_mixed_prep profile

The utf8str_mixed type is a string of UTF-8 characters, with a prefix
that is case sensitive, a separator equal to '@', and a suffix that
is a fully qualified domain name.  Its primary use in NFS Version 4
is for naming principals identified in an Access Control Entry.

14.3.2.  Character repertoire of nfs4_mixed_prep

The nfs4_mixed_prep profile uses Unicode 3.2, as defined in
stringprep's Appendix A.1.

14.3.3.  Mapping used by nfs4_mixed_prep

For the prefix and the separator of a utf8str_mixed string, the
nfs4_mixed_prep profile specifies mapping using the following table
from stringprep:

   Table B.1

For the suffix of a utf8str_mixed string, the nfs4_mixed_prep profile
specifies mapping using the following tables from stringprep:

   Table B.1
   Table B.2

14.3.4.  Normalization used by nfs4_mixed_prep

The nfs4_mixed_prep profile specifies using Unicode normalization
form KC, as described in stringprep.

14.3.5.  Prohibited output for nfs4_mixed_prep

The nfs4_mixed_prep profile specifies prohibiting using the following
tables from stringprep:

   Table C.1.2
   Table C.2.2
   Table C.3
   Table C.4
   Table C.5
   Table C.6
   Table C.7
   Table C.8
   Table C.9

14.3.6.  Bidirectional output for nfs4_mixed_prep

The nfs4_mixed_prep profile specifies checking bidirectional strings
as described in stringprep's section 6.

14.4.  UTF-8 Related Errors

Where the client sends an invalid UTF-8 string, the server should
return an NFS4ERR_INVAL (Table 8) error.  This includes cases in
which inappropriate prefixes are detected and where the count
includes trailing bytes that do not constitute a full UCS character.

Where the client-supplied string is valid UTF-8 but contains
characters that are not supported by the server as a value for that
string (e.g., names containing characters that have more than two
octets on a file system that supports Unicode characters only), the
server should return an NFS4ERR_BADCHAR (Table 8) error.

Where a UTF-8 string is used as a file name, and the file system,
while supporting all of the characters within the name, does not
allow that particular name to be used, the server should return the
error NFS4ERR_BADNAME (Table 8).  This includes situations in which
the server file system imposes a normalization constraint on name
strings, but will also include such situations as file system
prohibitions of "." and ".." as file names for certain operations,
and other such constraints.

15.  Error Values

NFS error numbers are assigned to failed operations within a
compound request.  A compound request contains a number of NFS
operations that have their results encoded in sequence in a compound
reply.  The results of successful operations will consist of an
NFS4_OK status followed by the encoded results of the operation.  If
an NFS operation fails, an error status will be entered in the reply
and the compound request will be terminated.

15.1.  Error Definitions

Protocol Error Definitions

   NFS4_OK (0):  Indicates the operation completed successfully.

   NFS4ERR_ACCESS (13):  Permission denied.  The caller does not have
      the correct permission to perform the requested operation.
      Contrast this with NFS4ERR_PERM, which restricts itself to
      owner or privileged user permission failures.

   NFS4ERR_ATTRNOTSUPP (10032):  An attribute specified is not
      supported by the server.  Does not apply to the GETATTR
      operation.

   NFS4ERR_ADMIN_REVOKED (10047):  Due to administrator intervention,
      the lockowner's record locks, share reservations, and
      delegations have been revoked by the server.

   NFS4ERR_BACK_CHAN_BUSY (10057):  The session cannot be destroyed
      because the server has callback requests outstanding.

   NFS4ERR_BADCHAR (10040):  A UTF-8 string contains a character
      which is not supported by the server in the context in which it
      is being used.

   NFS4ERR_BAD_COOKIE (10003):  READDIR cookie is stale.

   NFS4ERR_BADHANDLE (10001):  Illegal NFS filehandle.  The
      filehandle failed internal consistency checks.

   NFS4ERR_BADIOMODE (10049):  Layout iomode is invalid.
   NFS4ERR_BADLAYOUT (10050):  Layout specified is invalid.

   NFS4ERR_BADNAME (10041):  A name string in a request consists of
      valid UTF-8 characters supported by the server, but the name is
      not supported by the server as a valid name for the current
      operation.

   NFS4ERR_BADOWNER (10039):  An owner, owner_group, or ACL attribute
      value cannot be translated to a local representation.

   NFS4ERR_BAD_SESSION_DIGEST (10051):  The digest used in a SET_SSV
      or BIND_CONN_TO_SESSION request is not valid.

   NFS4ERR_BADTYPE (10007):  An attempt was made to create an object
      of a type not supported by the server.

   NFS4ERR_BAD_RANGE (10042):  The range for a LOCK, LOCKT, or LOCKU
      operation is not appropriate to the allowable range of offsets
      for the server.

   NFS4ERR_BAD_SEQID (10026):  The sequence number in a locking
      request is neither the next expected number nor the last number
      processed.  This error does not apply to and should never be
      generated in NFSv4.1.

   NFS4ERR_BADSESSION (10052):  TBD

   NFS4ERR_BADSLOT (10053):  TBD

   NFS4ERR_BAD_STATEID (10025):  A stateid generated by the current
      server instance, but which does not designate any locking state
      (either current or superseded) for a current lockowner-file
      pair, was used.
   NFS4ERR_BADXDR (10036):  The server encountered an XDR decoding
      error while processing an operation.

   NFS4ERR_CLID_INUSE (10017):  The EXCHANGE_ID operation has found
      that a client ID is already in use by another client.

   NFS4ERR_CLIENTID_BUSY (10074):  The DESTROY_CLIENTID operation has
      found that there are sessions and/or stateids bound to the
      client ID.

   NFS4ERR_COMPLETE_ALREADY (10054):  A RECLAIM_COMPLETE operation
      was done by a client which had already performed one.

   NFS4ERR_CONN_NOT_BOUND_TO_SESSION (10055):  The connection is not
      bound to the specified session.

   NFS4ERR_CONN_BINDING_NOT_ENFORCED (10073):  The client is trying
      to use enforced connection binding, but it disabled enforcement
      when the session was created.

   NFS4ERR_DEADLOCK (10045):  The server has been able to determine a
      file locking deadlock condition for a blocking lock request.

   NFS4ERR_DELAY (10008):  The server initiated the request, but was
      not able to complete it in a timely fashion.  The client should
      wait and then try the request with a new RPC transaction ID.
      For example, this error should be returned from a server that
      supports hierarchical storage and receives a request to process
      a file that has been migrated.
      In this case, the server should start the immigration process
      and respond to the client with this error.  This error may also
      occur when a necessary delegation recall makes processing a
      request in a timely fashion impossible.

   NFS4ERR_DELEG_ALREADY_WANTED (10056):  The client has already
      registered that it wants a delegation.

   NFS4ERR_DENIED (10010):  An attempt to lock a file is denied.
      Since this may be a temporary condition, the client is
      encouraged to retry the lock request until the lock is
      accepted.

   NFS4ERR_DQUOT (69):  Resource (quota) hard limit exceeded.  The
      user's resource limit on the server has been exceeded.

   NFS4ERR_EXIST (17):  File exists.  The file specified already
      exists.

   NFS4ERR_EXPIRED (10011):  A lease has expired that is being used
      in the current operation.

   NFS4ERR_FBIG (27):  File too large.  The operation would have
      caused a file to grow beyond the server's limit.

   NFS4ERR_FHEXPIRED (10014):  The filehandle provided is volatile
      and has expired at the server.

   NFS4ERR_FILE_OPEN (10046):  The operation cannot be successfully
      processed because a file involved in the operation is currently
      open.
   NFS4ERR_GRACE (10013):  The server is in its recovery or grace
      period, which should match the lease period of the server.

   NFS4ERR_INVAL (22):  Invalid argument or unsupported argument for
      an operation.  Two examples are attempting a READLINK on an
      object other than a symbolic link or specifying a value for an
      enum field that is not defined in the protocol (e.g.,
      nfs_ftype4).

   NFS4ERR_IO (5):  I/O error.  A hard error (for example, a disk
      error) occurred while processing the requested operation.

   NFS4ERR_ISDIR (21):  Is a directory.  The caller specified a
      directory in a non-directory operation.

   NFS4ERR_LAYOUTTRYLATER (10058):  Layouts are temporarily
      unavailable for the file; the client should retry later.

   NFS4ERR_LAYOUTUNAVAILABLE (10059):  Layouts are not available for
      the file or its containing file system.

   NFS4ERR_LEASE_MOVED (10031):  A lease being renewed is associated
      with a file system that has been migrated to a new server.

   NFS4ERR_LOCKED (10012):  A read or write operation was attempted
      on a locked file.

   NFS4ERR_LOCK_NOTSUPP (10043):  Server does not support atomic
      upgrade or downgrade of locks.
   NFS4ERR_LOCK_RANGE (10028):  A lock request is operating on a
      sub-range of a current lock for the lock owner and the server
      does not support this type of request.

   NFS4ERR_LOCKS_HELD (10037):  A CLOSE was attempted and file locks
      would exist after the CLOSE.

   NFS4ERR_MINOR_VERS_MISMATCH (10021):  The server has received a
      request that specifies an unsupported minor version.  The
      server must return a COMPOUND4res with a zero length operations
      result array.

   NFS4ERR_SEQ_MISORDERED (10063):  The requester sent a SEQUENCE or
      CB_SEQUENCE operation with an invalid sequenceid.

   NFS4ERR_SEQUENCE_POS (10064):  The requester sent a COMPOUND or
      CB_COMPOUND with a SEQUENCE or CB_SEQUENCE operation that was
      not the first operation.

   NFS4ERR_MLINK (31):  Too many hard links.

   NFS4ERR_MOVED (10019):  The file system which contains the current
      filehandle object is not present at the server.  It may have
      been relocated, migrated to another server, or may have never
      been present.  The client may obtain the new file system
      location by obtaining the "fs_locations" attribute for the
      current filehandle.  For further discussion, refer to the
      section "Multi-server Name Space".
   NFS4ERR_NAMETOOLONG (63):  The filename in an operation was too
      long.

   NFS4ERR_NOENT (2):  No such file or directory.  The file or
      directory name specified does not exist.

   NFS4ERR_NOFILEHANDLE (10020):  The logical current filehandle
      value (or, in the case of RESTOREFH, the saved filehandle
      value) has not been set properly.  This may be a result of a
      malformed COMPOUND operation (i.e., no PUTFH or PUTROOTFH
      before an operation that requires the current filehandle be
      set).

   NFS4ERR_NO_GRACE (10033):  A reclaim of client state was attempted
      in circumstances in which the server cannot guarantee that
      conflicting state has not been provided to another client.
      This can occur because the reclaim has been done outside of the
      grace period of the server, after the client has done a
      RECLAIM_COMPLETE operation, or because previous operations have
      created a situation in which the server is not able to
      determine that a reclaim-interfering edge condition does not
      exist.

   NFS4ERR_NOMATCHING_LAYOUT (10060):  Client has no matching layout
      (segment) to return.

   NFS4ERR_NOSPC (28):  No space left on device.  The operation would
      have caused the server's file system to exceed its limit.
   NFS4ERR_NOTDIR (20):  Not a directory.  The caller specified a
      non-directory in a directory operation.

   NFS4ERR_NOTEMPTY (66):  An attempt was made to remove a directory
      that was not empty.

   NFS4ERR_NOTSUPP (10004):  Operation is not supported.

   NFS4ERR_NOT_SAME (10027):  This error is returned by the VERIFY
      operation to signify that the attributes compared were not the
      same as provided in the client's request.

   NFS4ERR_NXIO (6):  I/O error.  No such device or address.

   NFS4ERR_OLD_STATEID (10024):  A stateid which designates the
      locking state for a lockowner-file at an earlier time was used.
      This error does not apply to and should never be generated in
      NFSv4.1.

   NFS4ERR_OPENMODE (10038):  The client attempted a READ, WRITE,
      LOCK, or SETATTR operation not sanctioned by the stateid passed
      (e.g., writing to a file opened only for read).

   NFS4ERR_OP_ILLEGAL (10044):  An illegal operation value has been
      specified in the argop field of a COMPOUND or CB_COMPOUND
      procedure.

   NFS4ERR_OP_NOT_IN_SESSION (10070):  The COMPOUND or CB_COMPOUND
      contains an operation that requires a SEQUENCE or CB_SEQUENCE
      operation to precede it in order to establish a session.

   NFS4ERR_PERM (1):  Not owner.
      The operation was not allowed because the caller is either not
      a privileged user (root) or not the owner of the target of the
      operation.

   NFS4ERR_PNFS_IO_HOLE (10075):  The pNFS client has attempted to
      read from or write to an illegal hole of a file of a data
      server that is using the STRIPE4_SPARSE stripe type.  See
      Section 13.5.

   NFS4ERR_RECALLCONFLICT (10061):  Layout is unavailable due to a
      conflicting LAYOUTRECALL that is in progress.

   NFS4ERR_RECLAIM_BAD (10034):  The reclaim provided by the client
      does not match any of the server's state consistency checks and
      is bad.

   NFS4ERR_RECLAIM_CONFLICT (10035):  The reclaim provided by the
      client has encountered a conflict and cannot be provided.
      Potentially indicates a misbehaving client.

   NFS4ERR_REP_TOO_BIG (10066):  The reply to a COMPOUND or
      CB_COMPOUND would exceed the channel's negotiated maximum
      response size.

   NFS4ERR_REP_TOO_BIG_TO_CACHE (10067):  The reply to a COMPOUND or
      CB_COMPOUND would exceed the channel's negotiated maximum size
      for replies cached in the reply cache.

   NFS4ERR_REQ_TOO_BIG (10065):  The COMPOUND or CB_COMPOUND request
      exceeds the channel's negotiated maximum size for requests.
   NFS4ERR_RESTOREFH (10030):  The RESTOREFH operation does not have
      a saved filehandle (identified by SAVEFH) to operate upon.

   NFS4ERR_RETRY_UNCACHED_REP (10068):  The requester has attempted a
      retry of a COMPOUND or CB_COMPOUND which it previously
      requested not be placed in the reply cache.

   NFS4ERR_ROFS (30):  Read-only file system.  A modifying operation
      was attempted on a read-only file system.

   NFS4ERR_SAME (10009):  This error is returned by the NVERIFY
      operation to signify that the attributes compared were the same
      as provided in the client's request.

   NFS4ERR_SERVERFAULT (10006):  An error occurred on the server
      which does not map to any of the legal NFS version 4 protocol
      error values.  The client should translate this into an
      appropriate error.  UNIX clients may choose to translate this
      to EIO.

   NFS4ERR_SHARE_DENIED (10015):  An attempt to OPEN a file with a
      share reservation has failed because of a share conflict.

   NFS4ERR_STALE (70):  Invalid filehandle.  The filehandle given in
      the arguments was invalid.  The file referred to by that
      filehandle no longer exists or access to it has been revoked.
   NFS4ERR_STALE_CLIENTID (10022):  A client ID not recognized by the
      server was used in a locking or CREATE_SESSION request.

   NFS4ERR_STALE_STATEID (10023):  A stateid generated by an earlier
      server instance was used.

   NFS4ERR_SYMLINK (10029):  The current filehandle provided for a
      LOOKUP is not a directory but a symbolic link.  Also used if
      the final component of the OPEN path is a symbolic link.

   NFS4ERR_TOOSMALL (10005):  The encoded response to a READDIR
      request exceeds the size limit set by the initial request.

   NFS4ERR_TOO_MANY_OPS (10070):  The COMPOUND or CB_COMPOUND request
      has too many operations.

   NFS4ERR_UNKNOWN_LAYOUTTYPE (10062):  Layout type is unknown.

   NFS4ERR_UNSAFE_COMPOUND (10069):  The client has sent a COMPOUND
      request with an unsafe mix of operations.

   NFS4ERR_WRONGSEC (10016):  The security mechanism being used by
      the client for the operation does not match the server's
      security policy.  The client should change the security
      mechanism being used and retry the operation.

   NFS4ERR_XDEV (18):  Attempt to do an operation between different
      fsids.

                                 Table 8

15.2.  Operations and their valid errors

Mappings of valid error returns for each protocol operation

   ACCESS:  NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,
      NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, NFS4ERR_IO,
      NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,
      NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_REQ_TOO_BIG,
      NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG,
      NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_UNSAFE_COMPOUND,
      NFS4ERR_SERVERFAULT, NFS4ERR_STALE

   BIND_CONN_TO_SESSION:  NFS4ERR_BADSESSION,
      NFS4ERR_BAD_SESSION_DIGEST, NFS4ERR_CONN_BINDING_NOT_ENFORCED

   CLOSE:  NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADHANDLE,
      NFS4ERR_BAD_STATEID, NFS4ERR_BADXDR, NFS4ERR_DELAY,
      NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,
      NFS4ERR_ISDIR, NFS4ERR_LEASE_MOVED, NFS4ERR_LOCKS_HELD,
      NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,
      NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_REQ_TOO_BIG,
      NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG,
      NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_UNSAFE_COMPOUND,
      NFS4ERR_SERVERFAULT, NFS4ERR_STALE, NFS4ERR_STALE_STATEID

   COMMIT:  NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,
      NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR,
      NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,
      NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_REQ_TOO_BIG,
      NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG,
      NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_UNSAFE_COMPOUND,
      NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, NFS4ERR_STALE

   CREATE:
      NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR,
      NFS4ERR_BADHANDLE, NFS4ERR_BADNAME, NFS4ERR_BADOWNER,
      NFS4ERR_BADTYPE, NFS4ERR_BADXDR, NFS4ERR_DELAY, NFS4ERR_DQUOT,
      NFS4ERR_EXIST, NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, NFS4ERR_IO,
      NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, NFS4ERR_NOFILEHANDLE,
      NFS4ERR_NOSPC, NFS4ERR_NOTDIR, NFS4ERR_OP_NOT_IN_SESSION,
      NFS4ERR_PERM, NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS,
      NFS4ERR_REP_TOO_BIG, NFS4ERR_REP_TOO_BIG_TO_CACHE,
      NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, NFS4ERR_SERVERFAULT,
      NFS4ERR_STALE

   EXCHANGE_ID:

   CREATE_SESSION:  NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE,
      NFS4ERR_SERVERFAULT, NFS4ERR_STALE_CLIENTID

   DELEGPURGE:  NFS4ERR_BADXDR, NFS4ERR_NOTSUPP,
      NFS4ERR_LEASE_MOVED, NFS4ERR_MOVED, NFS4ERR_OP_NOT_IN_SESSION,
      NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG,
      NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_UNSAFE_COMPOUND,
      NFS4ERR_SERVERFAULT, NFS4ERR_STALE_CLIENTID

   DELEGRETURN:  NFS4ERR_ADMIN_REVOKED, NFS4ERR_BAD_STATEID,
      NFS4ERR_BADXDR, NFS4ERR_EXPIRED, NFS4ERR_INVAL,
      NFS4ERR_LEASE_MOVED, NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,
      NFS4ERR_NOTSUPP, NFS4ERR_OP_NOT_IN_SESSION,
      NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG,
      NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_UNSAFE_COMPOUND,
      NFS4ERR_SERVERFAULT, NFS4ERR_STALE, NFS4ERR_STALE_STATEID

   DESTROY_CLIENTID:  NFS4ERR_CLIENTID_BUSY, NFS4ERR_STALE_CLIENTID

   DESTROY_SESSION:  NFS4ERR_BACK_CHAN_BUSY, NFS4ERR_BADSESSION,
      NFS4ERR_STALE_CLIENTID

   GET_DIR_DELEGATION:  NFS4ERR_ACCESS,
      NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, NFS4ERR_FHEXPIRED,
      NFS4ERR_INVAL, NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,
      NFS4ERR_NOTDIR, NFS4ERR_OP_NOT_IN_SESSION,
      NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG,
      NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_UNSAFE_COMPOUND,
      NFS4ERR_SERVERFAULT, NFS4ERR_STALE, NFS4ERR_WRONGSEC,
      NFS4ERR_IO, NFS4ERR_NOTSUPP

   GETATTR:  NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,
      NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, NFS4ERR_IO,
      NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,
      NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_REQ_TOO_BIG,
      NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG,
      NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_UNSAFE_COMPOUND,
      NFS4ERR_SERVERFAULT, NFS4ERR_STALE

   GETDEVICEINFO:  NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,
      NFS4ERR_TOOSMALL, NFS4ERR_UNKNOWN_LAYOUTTYPE

   GETDEVICELIST:  NFS4ERR_BAD_COOKIE, NFS4ERR_FHEXPIRED,
      NFS4ERR_INVAL, NFS4ERR_TOOSMALL, NFS4ERR_UNKNOWN_LAYOUTTYPE

   GETFH:  NFS4ERR_BADHANDLE, NFS4ERR_FHEXPIRED, NFS4ERR_MOVED,
      NFS4ERR_NOFILEHANDLE, NFS4ERR_OP_NOT_IN_SESSION,
      NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG,
      NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_UNSAFE_COMPOUND,
      NFS4ERR_SERVERFAULT, NFS4ERR_STALE

   ILLEGAL:  NFS4ERR_OP_ILLEGAL

   LAYOUTCOMMIT:  NFS4ERR_BADLAYOUT, NFS4ERR_BADIOMODE,
      NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, NFS4ERR_NOFILEHANDLE,
      NFS4ERR_NO_GRACE, NFS4ERR_RECLAIM_BAD, NFS4ERR_STALE,
      NFS4ERR_STALE_CLIENTID, NFS4ERR_UNKNOWN_LAYOUTTYPE

   LAYOUTGET:  NFS4ERR_BADLAYOUT, NFS4ERR_BADIOMODE,
NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 13476 | | NFS4ERR_INVAL, NFS4ERR_LAYOUTUNAVAILABLE, | 13477 | | NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LOCKED, | 13478 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 13479 | | NFS4ERR_RECALLCONFLICT, NFS4ERR_STALE, | 13480 | | NFS4ERR_STALE_CLIENTID, NFS4ERR_TOOSMALL, | 13481 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 13482 | LAYOUTRETURN | NFS4ERR_BADLAYOUT, NFS4ERR_BADIOMODE, | 13483 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 13484 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NO_GRACE, | 13485 | | NFS4ERR_STALE, NFS4ERR_STALE_CLIENTID, | 13486 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 13487 | LINK | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 13488 | | NFS4ERR_BADHANDLE, NFS4ERR_BADNAME, | 13489 | | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 13490 | | NFS4ERR_DQUOT, NFS4ERR_EXIST, | 13491 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 13492 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR, | 13493 | | NFS4ERR_MLINK, NFS4ERR_MOVED, | 13494 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 13495 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 13496 | | NFS4ERR_NOTDIR, NFS4ERR_NOTSUPP, | 13497 | | NFS4ERR_OP_NOT_IN_SESSION, | 13498 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13499 | | NFS4ERR_REP_TOO_BIG, | 13500 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13501 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | 13502 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13503 | | NFS4ERR_WRONGSEC, NFS4ERR_XDEV | 13504 | LOCK | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 13505 | | NFS4ERR_BADHANDLE, NFS4ERR_BAD_RANGE, | 13506 | | NFS4ERR_BAD_STATEID, NFS4ERR_BADXDR, | 13507 | | NFS4ERR_DEADLOCK, NFS4ERR_DELAY, | 13508 | | NFS4ERR_DENIED, NFS4ERR_EXPIRED, | 13509 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 13510 | | NFS4ERR_INVAL, NFS4ERR_ISDIR, | 13511 | | NFS4ERR_LEASE_MOVED, NFS4ERR_LOCK_NOTSUPP, | 13512 | | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED, | 13513 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NO_GRACE, | 13514 | | NFS4ERR_OPENMODE, | 13515 | | NFS4ERR_OP_NOT_IN_SESSION, | 13516 | | NFS4ERR_RECLAIM_BAD, | 13517 | | NFS4ERR_RECLAIM_CONFLICT, | 
13518 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13519 | | NFS4ERR_REP_TOO_BIG, | 13520 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13521 | | NFS4ERR_UNSAFE_COMPOUND, | 13522 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13523 | | NFS4ERR_STALE_CLIENTID, | 13524 | | NFS4ERR_STALE_STATEID | 13525 | LOCKT | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, | 13526 | | NFS4ERR_BAD_RANGE, NFS4ERR_BADXDR, | 13527 | | NFS4ERR_DELAY, NFS4ERR_DENIED, | 13528 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 13529 | | NFS4ERR_INVAL, NFS4ERR_ISDIR, | 13530 | | NFS4ERR_LEASE_MOVED, NFS4ERR_LOCK_RANGE, | 13531 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 13532 | | NFS4ERR_OP_NOT_IN_SESSION, | 13533 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13534 | | NFS4ERR_REP_TOO_BIG, | 13535 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13536 | | NFS4ERR_UNSAFE_COMPOUND, | 13537 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13538 | | NFS4ERR_STALE_CLIENTID | 13539 | LOCKU | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 13540 | | NFS4ERR_BADHANDLE, NFS4ERR_BAD_RANGE, | 13541 | | NFS4ERR_BAD_STATEID, NFS4ERR_BADXDR, | 13542 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 13543 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 13544 | | NFS4ERR_ISDIR, NFS4ERR_LEASE_MOVED, | 13545 | | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED, | 13546 | | NFS4ERR_NOFILEHANDLE, | 13547 | | NFS4ERR_OP_NOT_IN_SESSION, | 13548 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13549 | | NFS4ERR_REP_TOO_BIG, | 13550 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13551 | | NFS4ERR_UNSAFE_COMPOUND, | 13552 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13553 | | NFS4ERR_STALE_STATEID | 13554 | LOOKUP | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 13555 | | NFS4ERR_BADHANDLE, NFS4ERR_BADNAME, | 13556 | | NFS4ERR_BADXDR, NFS4ERR_FHEXPIRED, | 13557 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 13558 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 13559 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 13560 | | NFS4ERR_OP_NOT_IN_SESSION, | 13561 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13562 | | NFS4ERR_REP_TOO_BIG, | 
13563 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13564 | | NFS4ERR_UNSAFE_COMPOUND, | 13565 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13566 | | NFS4ERR_SYMLINK, NFS4ERR_WRONGSEC | 13567 | LOOKUPP | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, | 13568 | | NFS4ERR_FHEXPIRED, NFS4ERR_IO, | 13569 | | NFS4ERR_MOVED, NFS4ERR_NOENT, | 13570 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 13571 | | NFS4ERR_OP_NOT_IN_SESSION, | 13572 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13573 | | NFS4ERR_REP_TOO_BIG, | 13574 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13575 | | NFS4ERR_UNSAFE_COMPOUND, | 13576 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13577 | | NFS4ERR_WRONGSEC | 13578 | NVERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 13579 | | NFS4ERR_BADCHAR, NFS4ERR_BADHANDLE, | 13580 | | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 13581 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 13582 | | NFS4ERR_IO, NFS4ERR_MOVED, | 13583 | | NFS4ERR_NOFILEHANDLE, | 13584 | | NFS4ERR_OP_NOT_IN_SESSION, | 13585 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13586 | | NFS4ERR_REP_TOO_BIG, | 13587 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13588 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_SAME, | 13589 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE | 13590 | OPEN | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 13591 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | 13592 | | NFS4ERR_BADHANDLE, NFS4ERR_BADNAME, | 13593 | | NFS4ERR_BADOWNER, NFS4ERR_BADXDR, | 13594 | | NFS4ERR_DELAY, NFS4ERR_DQUOT, | 13595 | | NFS4ERR_EXIST, NFS4ERR_EXPIRED, | 13596 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 13597 | | NFS4ERR_IO, NFS4ERR_INVAL, NFS4ERR_ISDIR, | 13598 | | NFS4ERR_LEASE_MOVED, NFS4ERR_MOVED, | 13599 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 13600 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 13601 | | NFS4ERR_NOTDIR, NFS4ERR_NO_GRACE, | 13602 | | NFS4ERR_PERM, NFS4ERR_RECLAIM_BAD, | 13603 | | NFS4ERR_RECLAIM_CONFLICT, | 13604 | | NFS4ERR_OP_NOT_IN_SESSION, | 13605 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13606 | | NFS4ERR_REP_TOO_BIG, | 13607 | | 
NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13608 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | 13609 | | NFS4ERR_SERVERFAULT, NFS4ERR_SHARE_DENIED, | 13610 | | NFS4ERR_STALE, NFS4ERR_STALE_CLIENTID, | 13611 | | NFS4ERR_SYMLINK, NFS4ERR_WRONGSEC | 13612 | OPEN_DOWNGRADE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADHANDLE, | 13613 | | NFS4ERR_BAD_STATEID, NFS4ERR_BADXDR, | 13614 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 13615 | | NFS4ERR_INVAL, NFS4ERR_MOVED, | 13616 | | NFS4ERR_NOFILEHANDLE, | 13617 | | NFS4ERR_OP_NOT_IN_SESSION, | 13618 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13619 | | NFS4ERR_REP_TOO_BIG, | 13620 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13621 | | NFS4ERR_UNSAFE_COMPOUND, | 13622 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13623 | | NFS4ERR_STALE_STATEID | 13624 | OPENATTR | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, | 13625 | | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 13626 | | NFS4ERR_DQUOT, NFS4ERR_FHEXPIRED, | 13627 | | NFS4ERR_IO, NFS4ERR_MOVED, NFS4ERR_NOENT, | 13628 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 13629 | | NFS4ERR_NOTSUPP, | 13630 | | NFS4ERR_OP_NOT_IN_SESSION, | 13631 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13632 | | NFS4ERR_REP_TOO_BIG, | 13633 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13634 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | 13635 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE | 13636 | PUTFH | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 13637 | | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | 13638 | | NFS4ERR_OP_NOT_IN_SESSION, | 13639 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13640 | | NFS4ERR_REP_TOO_BIG, | 13641 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13642 | | NFS4ERR_UNSAFE_COMPOUND, | 13643 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13644 | | NFS4ERR_WRONGSEC | 13645 | PUTPUBFH | NFS4ERR_OP_NOT_IN_SESSION, | 13646 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13647 | | NFS4ERR_REP_TOO_BIG, | 13648 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13649 | | NFS4ERR_UNSAFE_COMPOUND, | 13650 | | NFS4ERR_SERVERFAULT, NFS4ERR_WRONGSEC | 13651 | PUTROOTFH | 
NFS4ERR_OP_NOT_IN_SESSION, | 13652 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13653 | | NFS4ERR_REP_TOO_BIG, | 13654 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13655 | | NFS4ERR_UNSAFE_COMPOUND, | 13656 | | NFS4ERR_SERVERFAULT, NFS4ERR_WRONGSEC | 13657 | READ | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 13658 | | NFS4ERR_BADHANDLE, NFS4ERR_BAD_STATEID, | 13659 | | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 13660 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 13661 | | NFS4ERR_GRACE, NFS4ERR_IO, NFS4ERR_INVAL, | 13662 | | NFS4ERR_ISDIR, NFS4ERR_LEASE_MOVED, | 13663 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 13664 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NXIO, | 13665 | | NFS4ERR_OP_NOT_IN_SESSION, | 13666 | | NFS4ERR_OPENMODE, NFS4ERR_PNFS_IO_HOLE, | 13667 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13668 | | NFS4ERR_REP_TOO_BIG, | 13669 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13670 | | NFS4ERR_UNSAFE_COMPOUND, | 13671 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13672 | | NFS4ERR_STALE_STATEID | 13673 | READDIR | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, | 13674 | | NFS4ERR_BAD_COOKIE, NFS4ERR_BADXDR, | 13675 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 13676 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 13677 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 13678 | | NFS4ERR_NOT_SAME, | 13679 | | NFS4ERR_OP_NOT_IN_SESSION, | 13680 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13681 | | NFS4ERR_REP_TOO_BIG, | 13682 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13683 | | NFS4ERR_UNSAFE_COMPOUND, | 13684 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13685 | | NFS4ERR_TOOSMALL | 13686 | READLINK | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, | 13687 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 13688 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR, | 13689 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 13690 | | NFS4ERR_NOTSUPP, | 13691 | | NFS4ERR_OP_NOT_IN_SESSION, | 13692 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13693 | | NFS4ERR_REP_TOO_BIG, | 13694 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13695 | | NFS4ERR_UNSAFE_COMPOUND, | 13696 | | 
NFS4ERR_SERVERFAULT, NFS4ERR_STALE | 13697 | RECLAIM_COMPLETE | NFS4ERR_COMPLETE_ALREADY | 13698 | RELEASE_LOCKOWNER | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 13699 | | NFS4ERR_EXPIRED, NFS4ERR_LEASE_MOVED, | 13700 | | NFS4ERR_LOCKS_HELD, | 13701 | | NFS4ERR_OP_NOT_IN_SESSION, | 13702 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13703 | | NFS4ERR_REP_TOO_BIG, | 13704 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13705 | | NFS4ERR_UNSAFE_COMPOUND, | 13706 | | NFS4ERR_SERVERFAULT, | 13707 | | NFS4ERR_STALE_CLIENTID | 13708 | REMOVE | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 13709 | | NFS4ERR_BADHANDLE, NFS4ERR_BADNAME, | 13710 | | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 13711 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 13712 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 13713 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 13714 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 13715 | | NFS4ERR_NOTEMPTY, | 13716 | | NFS4ERR_OP_NOT_IN_SESSION, | 13717 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13718 | | NFS4ERR_REP_TOO_BIG, | 13719 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13720 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | 13721 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE | 13722 | RENAME | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 13723 | | NFS4ERR_BADHANDLE, NFS4ERR_BADNAME, | 13724 | | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 13725 | | NFS4ERR_DQUOT, NFS4ERR_EXIST, | 13726 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 13727 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 13728 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 13729 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 13730 | | NFS4ERR_NOTDIR, NFS4ERR_NOTEMPTY, | 13731 | | NFS4ERR_OP_NOT_IN_SESSION, | 13732 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13733 | | NFS4ERR_REP_TOO_BIG, | 13734 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13735 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | 13736 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13737 | | NFS4ERR_WRONGSEC, NFS4ERR_XDEV | 13738 | RESTOREFH | NFS4ERR_BADHANDLE, NFS4ERR_FHEXPIRED, | 13739 | | NFS4ERR_MOVED, 
NFS4ERR_OP_NOT_IN_SESSION, | 13740 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13741 | | NFS4ERR_REP_TOO_BIG, | 13742 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13743 | | NFS4ERR_UNSAFE_COMPOUND, | 13744 | | NFS4ERR_RESTOREFH, NFS4ERR_SERVERFAULT, | 13745 | | NFS4ERR_STALE, NFS4ERR_WRONGSEC | 13746 | SAVEFH | NFS4ERR_BADHANDLE, NFS4ERR_FHEXPIRED, | 13747 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 13748 | | NFS4ERR_OP_NOT_IN_SESSION, | 13749 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13750 | | NFS4ERR_REP_TOO_BIG, | 13751 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13752 | | NFS4ERR_UNSAFE_COMPOUND, | 13753 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE | 13754 | SECINFO | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 13755 | | NFS4ERR_BADHANDLE, NFS4ERR_BADNAME, | 13756 | | NFS4ERR_BADXDR, NFS4ERR_FHEXPIRED, | 13757 | | NFS4ERR_INVAL, NFS4ERR_MOVED, | 13758 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 13759 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 13760 | | NFS4ERR_OP_NOT_IN_SESSION, | 13761 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13762 | | NFS4ERR_REP_TOO_BIG, | 13763 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13764 | | NFS4ERR_UNSAFE_COMPOUND, | 13765 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE | 13766 | SECINFO_NO_NAME | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 13767 | | NFS4ERR_BADHANDLE, NFS4ERR_BADNAME, | 13768 | | NFS4ERR_BADXDR, NFS4ERR_FHEXPIRED, | 13769 | | NFS4ERR_INVAL, NFS4ERR_MOVED, | 13770 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 13771 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 13772 | | NFS4ERR_OP_NOT_IN_SESSION, | 13773 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13774 | | NFS4ERR_REP_TOO_BIG, | 13775 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13776 | | NFS4ERR_UNSAFE_COMPOUND, | 13777 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE | 13778 | SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | 13779 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 13780 | | NFS4ERR_SEQ_MISORDERED, | 13781 | | NFS4ERR_SEQUENCE_POS, NFS4ERR_REQ_TOO_BIG, | 13782 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_REP_TOO_BIG, | 13783 | 
| NFS4ERR_REP_TOO_BIG_TO_CACHE | 13784 | SET_SSV | NFS4ERR_BAD_SESSION_DIGEST, | 13785 | | NFS4ERR_CONN_BINDING_NOT_ENFORCED | 13786 | SETATTR | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 13787 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | 13788 | | NFS4ERR_BADHANDLE, NFS4ERR_BADOWNER, | 13789 | | NFS4ERR_BAD_STATEID, NFS4ERR_BADXDR, | 13790 | | NFS4ERR_DELAY, NFS4ERR_DQUOT, | 13791 | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | 13792 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 13793 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR, | 13794 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 13795 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 13796 | | NFS4ERR_OPENMODE, NFS4ERR_PERM, | 13797 | | NFS4ERR_OP_NOT_IN_SESSION, | 13798 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13799 | | NFS4ERR_REP_TOO_BIG, | 13800 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13801 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | 13802 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13803 | | NFS4ERR_STALE_STATEID | 13804 | EXCHANGE_ID | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | 13805 | | NFS4ERR_INVAL, NFS4ERR_SERVERFAULT | 13806 | CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | 13807 | | NFS4ERR_DELAY, NFS4ERR_SERVERFAULT, | 13808 | | NFS4ERR_STALE_CLIENTID | 13809 | VERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 13810 | | NFS4ERR_BADCHAR, NFS4ERR_BADHANDLE, | 13811 | | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 13812 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 13813 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 13814 | | NFS4ERR_NOT_SAME, | 13815 | | NFS4ERR_OP_NOT_IN_SESSION, | 13816 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13817 | | NFS4ERR_REP_TOO_BIG, | 13818 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13819 | | NFS4ERR_UNSAFE_COMPOUND, | 13820 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE | 13821 | WANT_DELEGATION | | 13822 | WRITE | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 13823 | | NFS4ERR_BADHANDLE, NFS4ERR_BAD_STATEID, | 13824 | | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 13825 | | NFS4ERR_DQUOT, NFS4ERR_EXPIRED, | 13826 | | NFS4ERR_FBIG, 
NFS4ERR_FHEXPIRED, | 13827 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 13828 | | NFS4ERR_ISDIR, NFS4ERR_LEASE_MOVED, | 13829 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 13830 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 13831 | | NFS4ERR_NXIO, NFS4ERR_OP_NOT_IN_SESSION, | 13832 | | NFS4ERR_OPENMODE, NFS4ERR_PNFS_IO_HOLE, | 13833 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | 13834 | | NFS4ERR_REP_TOO_BIG, | 13835 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13836 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | 13837 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 13838 | | NFS4ERR_STALE_STATEID | 13839 +----------------------+--------------------------------------------+ 13841 Table 9 13843 15.3. Callback operations and their valid errors 13845 Mappings of valid error returns for each protocol callback operation 13847 +-------------------------+-----------------------------------------+ 13848 | Callback Operation | Errors | 13849 +-------------------------+-----------------------------------------+ 13850 | CB_GETATTR | NFS4ERR_BADHANDLE NFS4ERR_BADXDR | 13851 | | NFS4ERR_OP_NOT_IN_SESSION, | 13852 | | NFS4ERR_REQ_TOO_BIG, | 13853 | | NFS4ERR_TOO_MANY_OPS, | 13854 | | NFS4ERR_REP_TOO_BIG, | 13855 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13856 | | NFS4ERR_UNSAFE_COMPOUND, | 13857 | | NFS4ERR_SERVERFAULT | 13858 | CB_ILLEGAL | NFS4ERR_OP_ILLEGAL | 13859 | CB_LAYOUTRECALL | NFS4ERR_NOMATCHING_LAYOUT | 13860 | CB_NOTIFY | NFS4ERR_BAD_STATEID NFS4ERR_INVAL | 13861 | | NFS4ERR_BADXDR NFS4ERR_SERVERFAULT | 13862 | CB_PUSH_DELEG | | 13863 | CB_RECALL | NFS4ERR_BADHANDLE NFS4ERR_BAD_STATEID | 13864 | | NFS4ERR_BADXDR | 13865 | | NFS4ERR_OP_NOT_IN_SESSION, | 13866 | | NFS4ERR_REQ_TOO_BIG, | 13867 | | NFS4ERR_TOO_MANY_OPS, | 13868 | | NFS4ERR_REP_TOO_BIG, | 13869 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13870 | | NFS4ERR_SERVERFAULT | 13871 | CB_RECALL_ANY | NFS4ERR_OP_NOT_IN_SESSION, | 13872 | | NFS4ERR_REQ_TOO_BIG, | 13873 | | NFS4ERR_TOO_MANY_OPS, | 13874 | | NFS4ERR_REP_TOO_BIG, | 13875 | | 
NFS4ERR_REP_TOO_BIG_TO_CACHE, | 13876 | | NFS4ERR_INVAL | 13877 | CB_RECALLABLE_OBJ_AVAIL | | 13878 | CB_RECALL_CREDIT | | 13879 | CB_SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | 13880 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 13881 | | NFS4ERR_SEQ_MISORDERED, | 13882 | | NFS4ERR_SEQUENCE_POS, | 13883 | | NFS4ERR_REQ_TOO_BIG, | 13884 | | NFS4ERR_TOO_MANY_OPS, | 13885 | | NFS4ERR_REP_TOO_BIG, | 13886 | | NFS4ERR_REP_TOO_BIG_TO_CACHE | 13887 +-------------------------+-----------------------------------------+ 13889 Table 10 13891 15.4. Errors and the operations that use them 13893 +-----------------------------------+-------------------------------+ 13894 | Error | Operations | 13895 +-----------------------------------+-------------------------------+ 13896 | NFS4ERR_ACCESS | ACCESS, COMMIT, CREATE, | 13897 | | GETATTR, GET_DIR_DELEGATION, | 13898 | | LINK, LOCK, LOCKT, LOCKU, | 13899 | | LOOKUP, LOOKUPP, NVERIFY, | 13900 | | OPEN, OPENATTR, READ, | 13901 | | READDIR, READLINK, REMOVE, | 13902 | | RENAME, SECINFO, | 13903 | | SECINFO_NO_NAME, SETATTR, | 13904 | | VERIFY, WRITE | 13905 | NFS4ERR_ADMIN_REVOKED | CLOSE, DELEGRETURN, LOCK, | 13906 | | LOCKU, OPEN, OPEN_DOWNGRADE, | 13907 | | READ, RELEASE_LOCKOWNER, | 13908 | | SETATTR, WRITE | 13909 | NFS4ERR_ATTRNOTSUPP | CREATE, NVERIFY, OPEN, | 13910 | | SETATTR, VERIFY | 13911 | NFS4ERR_BACK_CHAN_BUSY | DESTROY_SESSION | 13912 | NFS4ERR_BADCHAR | CREATE, LINK, LOOKUP, | 13913 | | NVERIFY, OPEN, REMOVE, | 13914 | | RENAME, SECINFO, | 13915 | | SECINFO_NO_NAME, SETATTR, | 13916 | | VERIFY | 13917 | NFS4ERR_BADHANDLE | ACCESS, CB_GETATTR, | 13918 | | CB_RECALL, CLOSE, COMMIT, | 13919 | | CREATE, GETATTR, GETFH, | 13920 | | GET_DIR_DELEGATION, LINK, | 13921 | | LOCK, LOCKT, LOCKU, LOOKUP, | 13922 | | LOOKUPP, NVERIFY, OPEN, | 13923 | | OPENATTR, OPEN_DOWNGRADE, | 13924 | | PUTFH, READ, READDIR, | 13925 | | READLINK, REMOVE, RENAME, | 13926 | | RESTOREFH, SAVEFH, SECINFO, | 13927 | | SECINFO_NO_NAME, SETATTR, | 
13928 | | VERIFY, WRITE | 13929 | NFS4ERR_BADIOMODE | LAYOUTCOMMIT, LAYOUTGET, | 13930 | | LAYOUTRETURN | 13931 | NFS4ERR_BADLAYOUT | LAYOUTCOMMIT, LAYOUTGET, | 13932 | | LAYOUTRETURN | 13933 | NFS4ERR_BADNAME | CREATE, LINK, LOOKUP, OPEN, | 13934 | | REMOVE, RENAME, SECINFO, | 13935 | | SECINFO_NO_NAME | 13936 | NFS4ERR_BADOWNER | CREATE, OPEN, SETATTR | 13937 | NFS4ERR_BADSESSION | BIND_CONN_TO_SESSION, | 13938 | | CB_SEQUENCE, DESTROY_SESSION, | 13939 | | SEQUENCE | 13940 | NFS4ERR_BADSLOT | CB_SEQUENCE, SEQUENCE | 13941 | NFS4ERR_BADTYPE | CREATE | 13942 | NFS4ERR_BADXDR | ACCESS, CB_GETATTR, | 13943 | | CB_NOTIFY, CB_RECALL, CLOSE, | 13944 | | COMMIT, CREATE, | 13945 | | CREATE_SESSION, DELEGPURGE, | 13946 | | DELEGRETURN, EXCHANGE_ID, | 13947 | | GETATTR, GET_DIR_DELEGATION, | 13948 | | LINK, LOCK, LOCKT, LOCKU, | 13949 | | LOOKUP, NVERIFY, OPEN, | 13950 | | OPENATTR, OPEN_DOWNGRADE, | 13951 | | PUTFH, READ, READDIR, | 13952 | | RELEASE_LOCKOWNER, REMOVE, | 13953 | | RENAME, SECINFO, | 13954 | | SECINFO_NO_NAME, SETATTR, | 13955 | | VERIFY, WRITE | 13956 | NFS4ERR_BAD_COOKIE | GETDEVICELIST, READDIR | 13957 | NFS4ERR_BAD_RANGE | LOCK, LOCKT, LOCKU | 13958 | NFS4ERR_BAD_SESSION_DIGEST | BIND_CONN_TO_SESSION, SET_SSV | 13959 | NFS4ERR_BAD_STATEID | CB_NOTIFY, CB_RECALL, CLOSE, | 13960 | | DELEGRETURN, LOCK, LOCKU, | 13961 | | OPEN_DOWNGRADE, READ, | 13962 | | SETATTR, WRITE | 13963 | NFS4ERR_CLID_INUSE | CREATE_SESSION, EXCHANGE_ID | 13964 | NFS4ERR_CLIENTID_BUSY | DESTROY_CLIENTID | 13965 | NFS4ERR_COMPLETE_ALREADY | RECLAIM_COMPLETE | 13966 | NFS4ERR_CONN_BINDING_NOT_ENFORCED | BIND_CONN_TO_SESSION, SET_SSV | 13967 | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | CB_SEQUENCE, SEQUENCE | 13968 | NFS4ERR_DEADLOCK | LOCK | 13969 | NFS4ERR_DELAY | ACCESS, CLOSE, CREATE, | 13970 | | CREATE_SESSION, GETATTR, | 13971 | | LINK, LOCK, LOCKT, NVERIFY, | 13972 | | OPEN, OPENATTR, READ, | 13973 | | READDIR, READLINK, REMOVE, | 13974 | | RENAME, SETATTR, VERIFY, | 13975 | | WRITE | 
13976 | NFS4ERR_DENIED | LOCK, LOCKT | 13977 | NFS4ERR_DQUOT | CREATE, LINK, OPEN, OPENATTR, | 13978 | | RENAME, SETATTR, WRITE | 13979 | NFS4ERR_EIO | GET_DIR_DELEGATION | 13980 | NFS4ERR_EXIST | CREATE, LINK, OPEN, RENAME | 13981 | NFS4ERR_EXPIRED | CLOSE, DELEGRETURN, LOCK, | 13982 | | LOCKU, OPEN, OPEN_DOWNGRADE, | 13983 | | READ, RELEASE_LOCKOWNER, | 13984 | | SETATTR, WRITE | 13985 | NFS4ERR_FBIG | SETATTR, WRITE | 13986 | NFS4ERR_FHEXPIRED | ACCESS, CLOSE, COMMIT, | 13987 | | CREATE, GETATTR, | 13988 | | GETDEVICEINFO, GETDEVICELIST, | 13989 | | GETFH, GET_DIR_DELEGATION, | 13990 | | LAYOUTCOMMIT, LAYOUTGET, | 13991 | | LAYOUTRETURN, LINK, LOCK, | 13992 | | LOCKT, LOCKU, LOOKUP, | 13993 | | LOOKUPP, NVERIFY, OPEN, | 13994 | | OPENATTR, OPEN_DOWNGRADE, | 13995 | | PUTFH, READ, READDIR, | 13996 | | READLINK, REMOVE, RENAME, | 13997 | | RESTOREFH, SAVEFH, SECINFO, | 13998 | | SECINFO_NO_NAME, SETATTR, | 13999 | | VERIFY, WRITE | 14000 | NFS4ERR_FILE_OPEN | LINK, REMOVE, RENAME | 14001 | NFS4ERR_GRACE | LAYOUTGET, LOCK, LOCKT, | 14002 | | LOCKU, OPEN, READ, SETATTR, | 14003 | | WRITE | 14004 | NFS4ERR_INVAL | ACCESS, CB_NOTIFY, | 14005 | | CB_RECALL_ANY, CLOSE, COMMIT, | 14006 | | CREATE, DELEGRETURN, | 14007 | | EXCHANGE_ID, GETATTR, | 14008 | | GETDEVICEINFO, GETDEVICELIST, | 14009 | | GET_DIR_DELEGATION, | 14010 | | LAYOUTCOMMIT, LAYOUTGET, | 14011 | | LAYOUTRETURN, LINK, LOCK, | 14012 | | LOCKT, LOCKU, LOOKUP, | 14013 | | NVERIFY, OPEN, | 14014 | | OPEN_DOWNGRADE, READ, | 14015 | | READDIR, READLINK, REMOVE, | 14016 | | RENAME, SECINFO, | 14017 | | SECINFO_NO_NAME, SETATTR, | 14018 | | VERIFY, WRITE | 14019 | NFS4ERR_IO | ACCESS, COMMIT, CREATE, | 14020 | | GETATTR, LINK, LOOKUP, | 14021 | | LOOKUPP, NVERIFY, OPEN, | 14022 | | OPENATTR, READ, READDIR, | 14023 | | READLINK, REMOVE, RENAME, | 14024 | | SETATTR, WRITE | 14025 | NFS4ERR_ISDIR | CLOSE, COMMIT, LINK, LOCK, | 14026 | | LOCKT, LOCKU, OPEN, READ, | 14027 | | READLINK, SETATTR, WRITE | 14028 | 
NFS4ERR_LAYOUTTRYLATER | LAYOUTGET | 14029 | NFS4ERR_LAYOUTUNAVAILABLE | LAYOUTGET | 14030 | NFS4ERR_LEASE_MOVED | CLOSE, DELEGPURGE, | 14031 | | DELEGRETURN, LOCK, LOCKT, | 14032 | | LOCKU, OPEN, READ, | 14033 | | RELEASE_LOCKOWNER, WRITE | 14034 | NFS4ERR_LOCKED | LAYOUTGET, READ, SETATTR, | 14035 | | WRITE | 14036 | NFS4ERR_LOCKS_HELD | CLOSE, RELEASE_LOCKOWNER | 14037 | NFS4ERR_LOCK_NOTSUPP | LOCK | 14038 | NFS4ERR_LOCK_RANGE | LOCK, LOCKT, LOCKU | 14039 | NFS4ERR_MLINK | LINK | 14040 | NFS4ERR_MOVED | ACCESS, CLOSE, COMMIT, | 14041 | | CREATE, DELEGPURGE, | 14042 | | DELEGRETURN, GETATTR, GETFH, | 14043 | | GET_DIR_DELEGATION, LINK, | 14044 | | LOCK, LOCKT, LOCKU, LOOKUP, | 14045 | | LOOKUPP, NVERIFY, OPEN, | 14046 | | OPENATTR, OPEN_DOWNGRADE, | 14047 | | PUTFH, READ, READDIR, | 14048 | | READLINK, REMOVE, RENAME, | 14049 | | RESTOREFH, SAVEFH, SECINFO, | 14050 | | SECINFO_NO_NAME, SETATTR, | 14051 | | VERIFY, WRITE | 14052 | NFS4ERR_NAMETOOLONG | CREATE, LINK, LOOKUP, OPEN, | 14053 | | REMOVE, RENAME, SECINFO, | 14054 | | SECINFO_NO_NAME | 14055 | NFS4ERR_NOENT | LINK, LOOKUP, LOOKUPP, OPEN, | 14056 | | OPENATTR, REMOVE, RENAME, | 14057 | | SECINFO, SECINFO_NO_NAME | 14058 | NFS4ERR_NOFILEHANDLE | ACCESS, CLOSE, COMMIT, | 14059 | | CREATE, DELEGRETURN, GETATTR, | 14060 | | GETFH, GET_DIR_DELEGATION, | 14061 | | LAYOUTCOMMIT, LAYOUTGET, | 14062 | | LAYOUTRETURN, LINK, LOCK, | 14063 | | LOCKT, LOCKU, LOOKUP, | 14064 | | LOOKUPP, NVERIFY, OPEN, | 14065 | | OPENATTR, OPEN_DOWNGRADE, | 14066 | | READ, READDIR, READLINK, | 14067 | | REMOVE, RENAME, SAVEFH, | 14068 | | SECINFO, SECINFO_NO_NAME, | 14069 | | SETATTR, VERIFY, WRITE | 14070 | NFS4ERR_NOMATCHING_LAYOUT | CB_LAYOUTRECALL | 14071 | NFS4ERR_NOSPC | CREATE, LINK, OPEN, OPENATTR, | 14072 | | RENAME, SETATTR, WRITE | 14073 | NFS4ERR_NOTDIR | CREATE, GET_DIR_DELEGATION, | 14074 | | LINK, LOOKUP, LOOKUPP, OPEN, | 14075 | | READDIR, REMOVE, RENAME, | 14076 | | SECINFO, SECINFO_NO_NAME | 14077 | NFS4ERR_NOTEMPTY 
| REMOVE, RENAME | 14078 | NFS4ERR_NOTSUPP | DELEGPURGE, DELEGRETURN, | 14079 | | GET_DIR_DELEGATION, | 14080 | | LAYOUTGET, LINK, OPENATTR, | 14081 | | READLINK | 14082 | NFS4ERR_NOT_SAME | READDIR, VERIFY | 14083 | NFS4ERR_NO_GRACE | LAYOUTCOMMIT, LAYOUTRETURN, | 14084 | | LOCK, OPEN | 14085 | NFS4ERR_NXIO | READ, WRITE | 14086 | NFS4ERR_OPENMODE | LOCK, READ, SETATTR, WRITE | 14087 | NFS4ERR_OP_ILLEGAL | CB_ILLEGAL, ILLEGAL | 14088 | NFS4ERR_OP_NOT_IN_SESSION | ACCESS, CB_GETATTR, | 14089 | | CB_RECALL, CB_RECALL_ANY, | 14090 | | CLOSE, COMMIT, CREATE, | 14091 | | DELEGPURGE, DELEGRETURN, | 14092 | | GETATTR, GETFH, | 14093 | | GET_DIR_DELEGATION, LINK, | 14094 | | LOCK, LOCKT, LOCKU, LOOKUP, | 14095 | | LOOKUPP, NVERIFY, OPEN, | 14096 | | OPENATTR, OPEN_DOWNGRADE, | 14097 | | PUTFH, PUTPUBFH, PUTROOTFH, | 14098 | | READ, READDIR, READLINK, | 14099 | | RELEASE_LOCKOWNER, REMOVE, | 14100 | | RENAME, RESTOREFH, SAVEFH, | 14101 | | SECINFO, SECINFO_NO_NAME, | 14102 | | SETATTR, VERIFY, WRITE | 14103 | NFS4ERR_PERM | CREATE, OPEN, SETATTR | 14104 | NFS4ERR_PNFS_IO_HOLE | READ, WRITE | 14105 | NFS4ERR_RECALLCONFLICT | LAYOUTGET | 14106 | NFS4ERR_RECLAIM_BAD | LAYOUTCOMMIT, LOCK, OPEN | 14107 | NFS4ERR_RECLAIM_CONFLICT | LOCK, OPEN | 14108 | NFS4ERR_REP_TOO_BIG | ACCESS, CB_GETATTR, | 14109 | | CB_RECALL, CB_RECALL_ANY, | 14110 | | CB_SEQUENCE, CLOSE, COMMIT, | 14111 | | CREATE, DELEGPURGE, | 14112 | | DELEGRETURN, GETATTR, GETFH, | 14113 | | GET_DIR_DELEGATION, LINK, | 14114 | | LOCK, LOCKT, LOCKU, LOOKUP, | 14115 | | LOOKUPP, NVERIFY, OPEN, | 14116 | | OPENATTR, OPEN_DOWNGRADE, | 14117 | | PUTFH, PUTPUBFH, PUTROOTFH, | 14118 | | READ, READDIR, READLINK, | 14119 | | RELEASE_LOCKOWNER, REMOVE, | 14120 | | RENAME, RESTOREFH, SAVEFH, | 14121 | | SECINFO, SECINFO_NO_NAME, | 14122 | | SEQUENCE, SETATTR, VERIFY, | 14123 | | WRITE | 14124 | NFS4ERR_REP_TOO_BIG_TO_CACHE | ACCESS, CB_GETATTR, | 14125 | | CB_RECALL, CB_RECALL_ANY, | 14126 | | CB_SEQUENCE, CLOSE, COMMIT, | 
14127 | | CREATE, DELEGPURGE, | 14128 | | DELEGRETURN, GETATTR, GETFH, | 14129 | | GET_DIR_DELEGATION, LINK, | 14130 | | LOCK, LOCKT, LOCKU, LOOKUP, | 14131 | | LOOKUPP, NVERIFY, OPEN, | 14132 | | OPENATTR, OPEN_DOWNGRADE, | 14133 | | PUTFH, PUTPUBFH, PUTROOTFH, | 14134 | | READ, READDIR, READLINK, | 14135 | | RELEASE_LOCKOWNER, REMOVE, | 14136 | | RENAME, RESTOREFH, SAVEFH, | 14137 | | SECINFO, SECINFO_NO_NAME, | 14138 | | SEQUENCE, SETATTR, VERIFY, | 14139 | | WRITE | 14140 | NFS4ERR_REQ_TOO_BIG | ACCESS, CB_GETATTR, | 14141 | | CB_RECALL, CB_RECALL_ANY, | 14142 | | CB_SEQUENCE, CLOSE, COMMIT, | 14143 | | CREATE, DELEGPURGE, | 14144 | | DELEGRETURN, GETATTR, GETFH, | 14145 | | GET_DIR_DELEGATION, LINK, | 14146 | | LOCK, LOCKT, LOCKU, LOOKUP, | 14147 | | LOOKUPP, NVERIFY, OPEN, | 14148 | | OPENATTR, OPEN_DOWNGRADE, | 14149 | | PUTFH, PUTPUBFH, PUTROOTFH, | 14150 | | READ, READDIR, READLINK, | 14151 | | RELEASE_LOCKOWNER, REMOVE, | 14152 | | RENAME, RESTOREFH, SAVEFH, | 14153 | | SECINFO, SECINFO_NO_NAME, | 14154 | | SEQUENCE, SETATTR, VERIFY, | 14155 | | WRITE | 14156 | NFS4ERR_RESTOREFH | RESTOREFH | 14157 | NFS4ERR_ROFS | COMMIT, CREATE, LINK, OPEN, | 14158 | | OPENATTR, REMOVE, RENAME, | 14159 | | SETATTR, WRITE | 14160 | NFS4ERR_SAME | NVERIFY | 14161 | NFS4ERR_SEQUENCE_POS | CB_SEQUENCE, SEQUENCE | 14162 | NFS4ERR_SEQ_MISORDERED | CB_SEQUENCE, SEQUENCE | 14163 | NFS4ERR_SERVERFAULT | ACCESS, CB_GETATTR, | 14164 | | CB_NOTIFY, CB_RECALL, CLOSE, | 14165 | | COMMIT, CREATE, | 14166 | | CREATE_SESSION, DELEGPURGE, | 14167 | | DELEGRETURN, EXCHANGE_ID, | 14168 | | GETATTR, GETFH, | 14169 | | GET_DIR_DELEGATION, LINK, | 14170 | | LOCK, LOCKT, LOCKU, LOOKUP, | 14171 | | LOOKUPP, NVERIFY, OPEN, | 14172 | | OPENATTR, OPEN_DOWNGRADE, | 14173 | | PUTFH, PUTPUBFH, PUTROOTFH, | 14174 | | READ, READDIR, READLINK, | 14175 | | RELEASE_LOCKOWNER, REMOVE, | 14176 | | RENAME, RESTOREFH, SAVEFH, | 14177 | | SECINFO, SECINFO_NO_NAME, | 14178 | | SETATTR, VERIFY, WRITE | 14179 | 
NFS4ERR_SHARE_DENIED | OPEN | 14180 | NFS4ERR_STALE | ACCESS, CLOSE, COMMIT, | 14181 | | CREATE, DELEGRETURN, GETATTR, | 14182 | | GETFH, GET_DIR_DELEGATION, | 14183 | | LAYOUTCOMMIT, LAYOUTGET, | 14184 | | LAYOUTRETURN, LINK, LOCK, | 14185 | | LOCKT, LOCKU, LOOKUP, | 14186 | | LOOKUPP, NVERIFY, OPEN, | 14187 | | OPENATTR, OPEN_DOWNGRADE, | 14188 | | PUTFH, READ, READDIR, | 14189 | | READLINK, REMOVE, RENAME, | 14190 | | RESTOREFH, SAVEFH, SECINFO, | 14191 | | SECINFO_NO_NAME, SETATTR, | 14192 | | VERIFY, WRITE | 14193 | NFS4ERR_STALE_CLIENTID | CREATE_SESSION, DELEGPURGE, | 14194 | | DESTROY_CLIENTID, | 14195 | | DESTROY_SESSION, | 14196 | | LAYOUTCOMMIT, LAYOUTGET, | 14197 | | LAYOUTRETURN, LOCK, LOCKT, | 14198 | | OPEN, RELEASE_LOCKOWNER | 14199 | NFS4ERR_STALE_STATEID | CLOSE, DELEGRETURN, LOCK, | 14200 | | LOCKU, OPEN_DOWNGRADE, READ, | 14201 | | SETATTR, WRITE | 14202 | NFS4ERR_SYMLINK | LOOKUP, OPEN | 14203 | NFS4ERR_TOOSMALL | GETDEVICEINFO, GETDEVICELIST, | 14204 | | LAYOUTGET, READDIR | 14205 | NFS4ERR_TOO_MANY_OPS | ACCESS, CB_GETATTR, | 14206 | | CB_RECALL, CB_RECALL_ANY, | 14207 | | CB_SEQUENCE, CLOSE, COMMIT, | 14208 | | CREATE, DELEGPURGE, | 14209 | | DELEGRETURN, GETATTR, GETFH, | 14210 | | GET_DIR_DELEGATION, LINK, | 14211 | | LOCK, LOCKT, LOCKU, LOOKUP, | 14212 | | LOOKUPP, NVERIFY, OPEN, | 14213 | | OPENATTR, OPEN_DOWNGRADE, | 14214 | | PUTFH, PUTPUBFH, PUTROOTFH, | 14215 | | READ, READDIR, READLINK, | 14216 | | RELEASE_LOCKOWNER, REMOVE, | 14217 | | RENAME, RESTOREFH, SAVEFH, | 14218 | | SECINFO, SECINFO_NO_NAME, | 14219 | | SEQUENCE, SETATTR, VERIFY, | 14220 | | WRITE | 14221 | NFS4ERR_UNKNOWN_LAYOUTTYPE | GETDEVICEINFO, GETDEVICELIST, | 14222 | | LAYOUTCOMMIT, LAYOUTGET, | 14223 | | LAYOUTRETURN | 14224 | NFS4ERR_UNSAFE_COMPOUND | ACCESS, CB_GETATTR, CLOSE, | 14225 | | COMMIT, CREATE, DELEGPURGE, | 14226 | | DELEGRETURN, GETATTR, GETFH, | 14227 | | GET_DIR_DELEGATION, LINK, | 14228 | | LOCK, LOCKT, LOCKU, LOOKUP, | 14229 | | LOOKUPP, NVERIFY, 
OPEN, | 14230 | | OPENATTR, OPEN_DOWNGRADE, | 14231 | | PUTFH, PUTPUBFH, PUTROOTFH, | 14232 | | READ, READDIR, READLINK, | 14233 | | RELEASE_LOCKOWNER, REMOVE, | 14234 | | RENAME, RESTOREFH, SAVEFH, | 14235 | | SECINFO, SECINFO_NO_NAME, | 14236 | | SETATTR, VERIFY, WRITE | 14237 | NFS4ERR_WRONGSEC | GET_DIR_DELEGATION, LINK, | 14238 | | LOOKUP, LOOKUPP, OPEN, PUTFH, | 14239 | | PUTPUBFH, PUTROOTFH, RENAME, | 14240 | | RESTOREFH | 14241 | NFS4ERR_XDEV | LINK, RENAME | 14242 +-----------------------------------+-------------------------------+ 14244 Table 11 14246 16. NFS version 4.1 Procedures 14248 16.1. Procedure 0: NULL - No Operation 14249 16.1.1. SYNOPSIS 14251 16.1.2. ARGUMENTS 14253 void; 14255 16.1.3. RESULTS 14257 void; 14259 16.1.4. DESCRIPTION 14261 Standard NULL procedure. Void argument, void response. This 14262 procedure has no functionality associated with it. Because of this 14263 it is sometimes used to measure the overhead of processing a service 14264 request. Therefore, the server should ensure that no unnecessary 14265 work is done in servicing this procedure. 14267 16.1.5. ERRORS 14269 None. 14271 16.2. Procedure 1: COMPOUND - Compound Operations 14273 16.2.1. SYNOPSIS 14275 compoundargs -> compoundres 14277 16.2.2. ARGUMENTS 14279 union nfs_argop4 switch (nfs_opnum4 argop) { 14280 case : ; 14281 ... 14282 }; 14284 struct COMPOUND4args { 14285 utf8str_cs tag; 14286 uint32_t minorversion; 14287 nfs_argop4 argarray<>; 14288 }; 14290 16.2.3. RESULTS 14292 union nfs_resop4 switch (nfs_opnum4 resop){ 14293 case : ; 14294 ... 14295 }; 14297 struct COMPOUND4res { 14298 nfsstat4 status; 14299 utf8str_cs tag; 14300 nfs_resop4 resarray<>; 14301 }; 14303 16.2.4. DESCRIPTION 14305 The COMPOUND procedure is used to combine one or more of the NFS 14306 operations into a single RPC request. The main NFS RPC program has 14307 two main procedures: NULL and COMPOUND. All other operations use the 14308 COMPOUND procedure as a wrapper. 
14310 The COMPOUND procedure is used to combine individual operations into 14311 a single RPC request. The server interprets each of the operations 14312 in turn. If an operation is executed by the server and the status of 14313 that operation is NFS4_OK, then the next operation in the COMPOUND 14314 procedure is executed. The server continues this process until there 14315 are no more operations to be executed or one of the operations has a 14316 status value other than NFS4_OK. 14318 In the processing of the COMPOUND procedure, the server may find that 14319 it does not have the available resources to execute any or all of the 14320 operations within the COMPOUND sequence. See Section 2.10.4.4 for a 14321 more detailed discussion. 14323 The server will generally choose between two methods of decoding the 14324 client's request. The first would be the traditional one-pass XDR 14325 decode; if there is an XDR decoding error in this case, the RPC XDR 14326 decode error would be returned. The second method would be to make 14327 an initial pass to decode the basic COMPOUND request and then to XDR 14328 decode the individual operations, the most interesting case being the 14329 decoding of attributes. If the server encounters an XDR decode 14330 error during this second pass, it would return 14331 the error NFS4ERR_BADXDR to signify the decode error. 14333 The COMPOUND arguments contain a "minorversion" field. For NFSv4.1, 14334 the value for this field is 1. If the server receives a COMPOUND 14335 procedure with a minorversion field value that it does not support, 14336 the server MUST return an error of NFS4ERR_MINOR_VERS_MISMATCH and a 14337 zero-length resultdata array. 14339 Contained within the COMPOUND results is a "status" field. If the 14340 results array length is non-zero, this status must be equivalent to 14341 the status of the last operation that was executed within the 14342 COMPOUND procedure.
Therefore, if an operation incurred an error 14343 then the "status" value will be the same error value as is being 14344 returned for the operation that failed. 14346 Note that operations 0 (zero) and 1 (one) are not defined for the 14347 COMPOUND procedure. Operation 2 is not defined but reserved for 14348 future definition and use with minor versioning. If the server 14349 receives an operation array that contains operation 2 and the 14350 minorversion field has a value of 0 (zero), an error of 14351 NFS4ERR_OP_ILLEGAL, as described in the next paragraph, is returned 14352 to the client. If an operation array contains an operation 2 and the 14353 minorversion field is non-zero and the server does not support the 14354 minor version, the server returns an error of 14355 NFS4ERR_MINOR_VERS_MISMATCH. Therefore, the 14356 NFS4ERR_MINOR_VERS_MISMATCH error takes precedence over all other 14357 errors. 14359 It is possible that the server receives a request that contains an 14360 operation that is less than the first legal operation (OP_ACCESS) or 14361 greater than the last legal operation (OP_RELEASE_LOCKOWNER). In 14362 this case, the server's response will encode the opcode OP_ILLEGAL 14363 rather than the illegal opcode of the request. The status field in 14364 the ILLEGAL return results will be set to NFS4ERR_OP_ILLEGAL. The 14365 COMPOUND procedure's return results will also be NFS4ERR_OP_ILLEGAL. 14367 The definition of the "tag" in the request is left to the 14368 implementor. It may be used to summarize the content of the compound 14369 request for the benefit of packet sniffers and engineers debugging 14370 implementations. However, the value of "tag" in the response SHOULD 14371 be the same value as provided in the request. This applies to the 14372 tag field of the CB_COMPOUND procedure as well. 14374 16.2.4.1.
Current File Handle and Stateid 14376 The COMPOUND procedure offers a simple environment for the execution 14377 of the operations specified by the client. Four items of state are 14378 maintained across the operations of a COMPOUND: the current and saved file handles, and the current and saved stateids. The first two relate to the file handle while the second two relate to the stateid. 14380 16.2.4.1.1. Current File Handle 14382 The current and saved file handle are used throughout the protocol. 14383 Most operations implicitly use the current file handle as an argument 14384 and many set the current file handle as part of the results. The 14385 combination of client-specified sequences of operations and current 14386 and saved file handle arguments and results allows for greater 14387 protocol flexibility. The simplest example of current file 14388 handle usage is a sequence like the following: 14390 PUTFH fh1 {fh1} 14391 LOOKUP "compA" {fh2} 14392 GETATTR {fh2} 14393 LOOKUP "compB" {fh3} 14394 GETATTR {fh3} 14395 LOOKUP "compC" {fh4} 14396 GETATTR {fh4} 14397 GETFH 14399 Figure 75 14401 In this example, the PUTFH operation explicitly sets the current file 14402 handle value while the result of each LOOKUP operation sets the 14403 current file handle value to the resultant file system object. Also, 14404 the client is able to insert GETATTR operations using the current 14405 file handle as an argument. 14407 Along with the current file handle, there is a saved file handle. 14408 While the current file handle is set as the result of operations like 14409 LOOKUP, the saved file handle must be set directly with the use of 14410 the SAVEFH operation. The SAVEFH operation copies the current file 14411 handle value to the saved value. The saved file handle value is used 14412 in combination with the current file handle value for the LINK and 14413 RENAME operations. The RESTOREFH operation will copy the saved file 14414 handle value to the current file handle value; as a result, the saved 14415 file handle value may be used as a sort of "scratch" area for the 14416 client's series of operations.
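The movement of filehandles through SAVEFH and RESTOREFH can be sketched as a small state machine. This is an illustrative Python model only, not the protocol's XDR; the class and filehandle strings are hypothetical:

```python
# Illustrative model of the COMPOUND filehandle environment:
# LOOKUP replaces the current filehandle, SAVEFH copies current to
# saved, and RESTOREFH copies saved back to current.

class CompoundEnv:
    def __init__(self):
        self.cfh = None  # current filehandle
        self.sfh = None  # saved filehandle

    def putfh(self, fh):
        # PUTFH explicitly sets the current filehandle.
        self.cfh = fh

    def lookup(self, child_fh):
        # LOOKUP replaces the current filehandle with the child's.
        self.cfh = child_fh

    def savefh(self):
        # SAVEFH copies the current filehandle to the saved slot.
        self.sfh = self.cfh

    def restorefh(self):
        # RESTOREFH copies the saved filehandle back to current.
        self.cfh = self.sfh

env = CompoundEnv()
env.putfh("fh-dir")
env.savefh()            # remember the directory
env.lookup("fh-compA")  # current filehandle is now the child
env.restorefh()         # back to the directory for another LOOKUP
```

This "scratch area" pattern is what allows a single COMPOUND to look up several siblings of one directory without repeating PUTFH.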
14418 16.2.4.1.2. Current Stateid 14420 With NFSv4.1, additions of a current stateid and a saved stateid have 14421 been made to the COMPOUND processing environment; this allows for the 14422 passing of stateids between operations. There are no changes to the 14423 syntax of the protocol, only changes to the semantics of a few 14424 operations. 14426 A "current stateid" is the stateid that is associated with the 14427 current file handle. The current stateid may only be changed by an 14428 operation that modifies the current file handle or returns a stateid. 14429 If an operation returns a stateid it MUST set the current stateid to 14430 the returned value. If an operation sets the current file handle but 14431 does not return a stateid, the current stateid MUST be set to the 14432 all-zeros special stateid. As an example, PUTFH will change the 14433 current server state from {ocfh, osid} to {cfh, 0} while LOCK will 14434 change the current state from {cfh, osid} to {cfh, nsid}. The SAVEFH 14435 and RESTOREFH operations will save and restore both the file handle 14436 and the stateid as a set. 14438 Any operation which takes as an argument a stateid that is not the 14439 special all-zeros stateid MUST set the current stateid to the all- 14440 zeros value before evaluating the operation. If the argument is the 14441 special all-zeros stateid, the operation is evaluated using the 14442 current stateid. 14444 The following example is the common case of a simple READ operation 14445 with a supplied stateid showing that the PUTFH initializes the 14446 current stateid to zero. The subsequent READ with stateid sid1 14447 replaces the current stateid before evaluating the operation. 14449 PUTFH fh1 - -> {fh1, 0} 14450 READ sid1,0,1024 {fh1, sid1} -> {fh1, sid1} 14452 Figure 76 14454 This next example performs an OPEN with the client provided stateid 14455 sid1 and as a result generates stateid sid2. 
The next operation 14456 specifies the READ with the special all-zero stateid but the current 14457 stateid set by the previous operation is actually used when the 14458 operation is evaluated, allowing correct interaction with any 14459 existing, potentially conflicting, locks. 14461 PUTFH fh1 - -> {fh1, 0} 14462 OPEN R,sid1,"compA" {fh1, sid1} -> {fh2, sid2} 14463 READ 0,0,1024 {fh2, sid2} -> {fh2, sid2} 14464 CLOSE 0 {fh2, sid2} -> {fh2, sid3} 14466 Figure 77 14468 The final example is similar to the second in how it passes the 14469 stateid sid2 generated by the LOCK operation to the next READ 14470 operation. This allows the client to explicitly surround a single 14471 I/O operation with a lock and its appropriate stateid to guarantee 14472 correctness with other client locks. 14474 PUTFH fh1 - -> {fh1, 0} 14475 LOCK W,0,1024,sid1 {fh1, sid1} -> {fh1, sid2} 14476 READ 0,0,1024 {fh1, sid2} -> {fh1, sid2} 14477 LOCKU W,0,1024,0 {fh1, sid2} -> {fh1, sid3} 14479 Figure 78 14481 16.2.5. IMPLEMENTATION 14483 16.2.6. ERRORS 14485 All errors defined in the protocol 14487 17. NFS version 4.1 Operations 14489 17.1. Operation 3: ACCESS - Check Access Rights 14491 17.1.1. SYNOPSIS 14493 (cfh), accessreq -> supported, accessrights 14495 17.1.2. ARGUMENTS 14497 /* 14498 * ACCESS: Check access permission 14499 */ 14500 const ACCESS4_READ = 0x00000001; 14501 const ACCESS4_LOOKUP = 0x00000002; 14502 const ACCESS4_MODIFY = 0x00000004; 14503 const ACCESS4_EXTEND = 0x00000008; 14504 const ACCESS4_DELETE = 0x00000010; 14505 const ACCESS4_EXECUTE = 0x00000020; 14507 struct ACCESS4args { 14508 /* CURRENT_FH: object */ 14509 uint32_t access; 14510 }; 14512 17.1.3. RESULTS 14514 struct ACCESS4resok { 14515 uint32_t supported; 14516 uint32_t access; 14517 }; 14519 union ACCESS4res switch (nfsstat4 status) { 14520 case NFS4_OK: 14521 ACCESS4resok resok4; 14522 default: 14523 void; 14524 }; 14526 17.1.4. 
DESCRIPTION 14528 ACCESS determines the access rights that a user, as identified by the 14529 credentials in the RPC request, has with respect to the file system 14530 object specified by the current filehandle. The client encodes the 14531 set of access rights that are to be checked in the bit mask "access". 14532 The server checks the permissions encoded in the bit mask. If a 14533 status of NFS4_OK is returned, two bit masks are included in the 14534 response. The first, "supported", represents the access rights 14535 that the server can verify reliably. The second, "access", 14536 represents the access rights available to the user for the filehandle 14537 provided. On success, the current filehandle retains its value. 14539 Note that the supported field will contain only as many values as were 14540 originally sent in the arguments. For example, if the client sends 14541 an ACCESS operation with only the ACCESS4_READ value set and the 14542 server supports this value, the server will return only ACCESS4_READ 14543 even if it could have reliably checked other values. 14545 The results of this operation are necessarily advisory in nature. A 14546 return status of NFS4_OK and the appropriate bit set in the bit mask 14547 does not imply that such access will be allowed to the file system 14548 object in the future. This is because access rights can be revoked 14549 by the server at any time. 14551 The following access permissions may be requested: 14553 ACCESS4_READ Read data from file or read a directory. 14555 ACCESS4_LOOKUP Look up a name in a directory (no meaning for non- 14556 directory objects). 14558 ACCESS4_MODIFY Rewrite existing file data or modify existing 14559 directory entries. 14561 ACCESS4_EXTEND Write new data or add directory entries. 14563 ACCESS4_DELETE Delete an existing directory entry. 14565 ACCESS4_EXECUTE Execute file (no meaning for a directory). 14567 On success, the current filehandle retains its value. 14569 17.1.5.
IMPLEMENTATION 14571 In general, it is not sufficient for the client to attempt to deduce 14572 access permissions by inspecting the uid, gid, and mode fields in the 14573 file attributes or by attempting to interpret the contents of the ACL 14574 attribute. This is because the server may perform uid or gid mapping 14575 or enforce additional access control restrictions. It is also 14576 possible that the server may not be in the same ID space as the 14577 client. In these cases (and perhaps others), the client cannot 14578 reliably perform an access check with only current file attributes. 14580 In the NFS version 2 protocol, the only reliable way to determine 14581 whether an operation was allowed was to try it and see if it 14582 succeeded or failed. Using the ACCESS operation in the NFS version 4 14583 protocol, the client can ask the server to indicate whether or not 14584 one or more classes of operations are permitted. The ACCESS 14585 operation is provided to allow clients to check before doing a series 14586 of operations that will result in an access failure. The OPEN 14587 operation provides a point where the server can verify access to the 14588 file object and a method to return that information to the client. The 14589 ACCESS operation is still useful for directory operations or in 14590 cases where the UNIX "access" API is used on the client. 14592 The information returned by the server in response to an ACCESS call 14593 is not permanent. It was correct at the exact time that the server 14594 performed the checks, but not necessarily afterwards. The server can 14595 revoke access permission at any time. 14597 The client should use the effective credentials of the user to build 14598 the authentication information in the ACCESS request used to 14599 determine access rights. It is the effective user and group 14600 credentials that are used in subsequent read and write operations.
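The relationship between the "access", "supported", and returned access bit masks can be sketched as two mask intersections. This is an illustrative Python sketch, not server code; the verifiable/granted inputs are hypothetical stand-ins for the server's actual permission machinery:

```python
# ACCESS bit values from the ARGUMENTS section above.
ACCESS4_READ    = 0x00000001
ACCESS4_LOOKUP  = 0x00000002
ACCESS4_MODIFY  = 0x00000004
ACCESS4_EXTEND  = 0x00000008
ACCESS4_DELETE  = 0x00000010
ACCESS4_EXECUTE = 0x00000020

def access_check(requested, verifiable, granted):
    """Return (supported, access) per the ACCESS semantics above.

    'supported' is limited to the bits the client actually asked
    about and the server can reliably verify; 'access' is further
    limited to what the user is actually permitted to do.
    """
    supported = requested & verifiable
    access = supported & granted
    return supported, access

# A UNIX-like server that cannot verify ACCESS4_DELETE on files:
verifiable = (ACCESS4_READ | ACCESS4_MODIFY |
              ACCESS4_EXTEND | ACCESS4_EXECUTE)
granted = ACCESS4_READ
supported, access = access_check(ACCESS4_READ | ACCESS4_DELETE,
                                 verifiable, granted)
# supported omits ACCESS4_DELETE, telling the client that bit
# could not be checked; access grants only READ.
```

The example mirrors the ACCESS4_DELETE discussion below: the server clears the bit in "supported" rather than guessing, and the client ignores it.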
14602 Many implementations do not directly support the ACCESS4_DELETE 14603 permission. Operating systems like UNIX will ignore the 14604 ACCESS4_DELETE bit if set on an access request on a non-directory 14605 object. In these systems, delete permission on a file is determined 14606 by the access permissions on the directory in which the file resides, 14607 instead of being determined by the permissions of the file itself. 14608 Therefore, the mask returned enumerating which access rights can be 14609 determined will have the ACCESS4_DELETE value set to 0. This 14610 indicates to the client that the server was unable to check that 14611 particular access right. The ACCESS4_DELETE bit in the access mask 14612 returned will then be ignored by the client. 14614 17.2. Operation 4: CLOSE - Close File 14616 17.2.1. SYNOPSIS 14618 (cfh), seqid, open_stateid -> open_stateid 14620 17.2.2. ARGUMENTS 14622 /* 14623 * CLOSE: Close a file and release share reservations 14624 */ 14625 struct CLOSE4args { 14626 /* CURRENT_FH: object */ 14627 seqid4 seqid; 14628 stateid4 open_stateid; 14629 }; 14631 17.2.3. RESULTS 14633 union CLOSE4res switch (nfsstat4 status) { 14634 case NFS4_OK: 14635 stateid4 open_stateid; 14636 default: 14637 void; 14638 }; 14640 17.2.4. DESCRIPTION 14642 The CLOSE operation releases share reservations for the regular or 14643 named attribute file as specified by the current filehandle. The 14644 share reservations and other state information released at the server 14645 as a result of this CLOSE is only that associated with the supplied 14646 stateid. State associated with other OPENs is not affected. 14648 If record locks are held, the client SHOULD release all locks before 14649 issuing a CLOSE. The server MAY free all outstanding locks on CLOSE 14650 but some servers may not support the CLOSE of a file that still has 14651 record locks held. The server MUST return failure if any locks would 14652 exist after the CLOSE. 
14654 The seqid value argument must have the value zero. If any other 14655 value is specified the server MUST return the error NFS4ERR_INVAL. 14657 On success, the current filehandle retains its value. 14659 17.2.5. IMPLEMENTATION 14661 Even though CLOSE returns a stateid, this stateid is not useful to 14662 the client and should be treated as deprecated. CLOSE "shuts down" 14663 the state associated with all OPENs for the file by a single 14664 open_owner. As noted above, CLOSE will either release all file 14665 locking state or return an error. Therefore, the stateid returned by 14666 CLOSE is not useful for operations that follow. 14668 17.3. Operation 5: COMMIT - Commit Cached Data 14670 17.3.1. SYNOPSIS 14672 (cfh), offset, count -> verifier 14674 17.3.2. ARGUMENTS 14676 /* 14677 * COMMIT: Commit cached data on server to stable storage 14678 */ 14679 struct COMMIT4args { 14680 /* CURRENT_FH: file */ 14681 offset4 offset; 14682 count4 count; 14683 }; 14685 17.3.3. RESULTS 14687 struct COMMIT4resok { 14688 verifier4 writeverf; 14689 }; 14691 union COMMIT4res switch (nfsstat4 status) { 14692 case NFS4_OK: 14693 COMMIT4resok resok4; 14694 default: 14695 void; 14696 }; 14698 17.3.4. DESCRIPTION 14700 The COMMIT operation forces or flushes data to stable storage for the 14701 file specified by the current filehandle. The flushed data is that 14702 which was previously written with a WRITE operation which had the 14703 stable field set to UNSTABLE4. 14705 The offset specifies the position within the file where the flush is 14706 to begin. An offset value of 0 (zero) means to flush data starting 14707 at the beginning of the file. The count specifies the number of 14708 bytes of data to flush. If count is 0 (zero), a flush from offset to 14709 the end of the file is done. 14711 The server returns a write verifier upon successful completion of the 14712 COMMIT. 
The write verifier is used by the client to determine if the 14713 server has restarted or rebooted between the initial WRITE(s) and the 14714 COMMIT. The client does this by comparing the write verifier 14715 returned from the initial writes and the verifier returned by the 14716 COMMIT operation. The server must vary the value of the write 14717 verifier at each server event or instantiation that may lead to a 14718 loss of uncommitted data. Most commonly this occurs when the server 14719 is rebooted; however, other events at the server may result in 14720 uncommitted data loss as well. 14722 On success, the current filehandle retains its value. 14724 17.3.5. IMPLEMENTATION 14726 The COMMIT operation is similar in operation and semantics to the 14727 POSIX fsync(2) system call that synchronizes a file's state with the 14728 disk (file data and metadata is flushed to disk or stable storage). 14729 COMMIT performs the same operation for a client, flushing any 14730 unsynchronized data and metadata on the server to the server's disk 14731 or stable storage for the specified file. Like fsync(2), it may be 14732 that there is some modified data or no modified data to synchronize. 14733 The data may have been synchronized by the server's normal periodic 14734 buffer synchronization activity. COMMIT should return NFS4_OK, 14735 unless there has been an unexpected error. 14737 COMMIT differs from fsync(2) in that it is possible for the client to 14738 flush a range of the file (most likely triggered by a buffer- 14739 reclamation scheme on the client before the file has been completely 14740 written). 14742 The server implementation of COMMIT is reasonably simple. If the 14743 server receives a full file COMMIT request, that is, starting at 14744 offset 0 and count 0, it should do the equivalent of fsync()'ing the 14745 file. Otherwise, it should arrange to have the cached data in the 14746 range specified by offset and count flushed to stable storage.
14747 In both cases, any metadata associated with the file must be flushed 14748 to stable storage before returning. It is not an error for there to 14749 be nothing to flush on the server. This means that the data and 14750 metadata that needed to be flushed have already been flushed or lost 14751 during the last server failure. 14753 The client implementation of COMMIT is a little more complex. There 14754 are two reasons for wanting to commit a client buffer to stable 14755 storage. The first is that the client wants to reuse a buffer. In 14756 this case, the offset and count of the buffer are sent to the server 14757 in the COMMIT request. The server then flushes any cached data based 14758 on the offset and count, and flushes any metadata associated with the 14759 file. It then returns the status of the flush and the write 14760 verifier. The other reason for the client to generate a COMMIT is 14761 for a full file flush, such as may be done at close. In this case, 14762 the client would gather all of the buffers for this file that contain 14763 uncommitted data, do the COMMIT operation with an offset of 0 and 14764 count of 0, and then free all of those buffers. Any other dirty 14765 buffers would be sent to the server in the normal fashion. 14767 After a buffer is written by the client with the stable parameter set 14768 to UNSTABLE4, the buffer must be considered as modified by the client 14769 until the buffer has either been flushed via a COMMIT operation or 14770 written via a WRITE operation with stable parameter set to FILE_SYNC4 14771 or DATA_SYNC4. This is done to prevent the buffer from being freed 14772 and reused before the data can be flushed to stable storage on the 14773 server. 
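The client-side write-verifier check described in the DESCRIPTION above can be sketched as follows. This is an illustrative Python model, not client code; the class, the verifier byte strings, and the resend callback are hypothetical:

```python
# Sketch of the client's verifier bookkeeping for UNSTABLE4 writes:
# if WRITE/COMMIT returns a verifier different from the one saved
# with the uncommitted buffers, those buffers must be retransmitted.

def needs_retransmit(saved_verifier, returned_verifier):
    # A changed verifier means the server may have lost uncommitted
    # data (e.g., it rebooted between the WRITEs and the COMMIT).
    return saved_verifier != returned_verifier

class UncommittedCache:
    def __init__(self, verifier):
        self.verifier = verifier
        self.dirty = []  # (offset, count) ranges written UNSTABLE4

    def on_commit_reply(self, returned_verifier, resend):
        if needs_retransmit(self.verifier, returned_verifier):
            # Resend every uncommitted buffer; a later COMMIT will
            # confirm them against the new verifier.
            for offset, count in self.dirty:
                resend(offset, count)
            self.verifier = returned_verifier
        else:
            self.dirty.clear()  # data reached stable storage

sent = []
cache = UncommittedCache(verifier=b"boot-1")
cache.dirty = [(0, 4096), (8192, 4096)]
# Server rebooted: COMMIT returns a different verifier.
cache.on_commit_reply(b"boot-2", lambda off, cnt: sent.append((off, cnt)))
```

Whether the retransmissions use UNSTABLE4 followed by a new COMMIT, or FILE_SYNC4/DATA_SYNC4 directly, is left to the implementor, as the surrounding text notes.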
14775 When a response is returned from either a WRITE or a COMMIT operation 14776 and it contains a write verifier that is different than previously 14777 returned by the server, the client will need to retransmit all of the 14778 buffers containing uncommitted cached data to the server. How this 14779 is to be done is up to the implementor. If there is only one buffer 14780 of interest, then it should probably be sent back over in a WRITE 14781 request with the appropriate stable parameter. If there is more than 14782 one buffer, it might be worthwhile retransmitting all of the buffers 14783 in WRITE requests with the stable parameter set to UNSTABLE4 and then 14784 retransmitting the COMMIT operation to flush all of the data on the 14785 server to stable storage. The timing of these retransmissions is 14786 left to the implementor. 14788 The above description applies to page-cache-based systems as well as 14789 buffer-cache-based systems. In those systems, the virtual memory 14790 system will need to be modified instead of the buffer cache. 14792 17.4. Operation 6: CREATE - Create a Non-Regular File Object 14794 17.4.1. SYNOPSIS 14796 (cfh), name, type, attrs -> (cfh), change_info, attrs_set 14798 17.4.2. ARGUMENTS 14800 /* 14801 * CREATE: Create a non-regular file 14802 */ 14803 union createtype4 switch (nfs_ftype4 type) { 14804 case NF4LNK: 14805 linktext4 linkdata; 14806 case NF4BLK: 14807 case NF4CHR: 14808 specdata4 devdata; 14809 case NF4SOCK: 14810 case NF4FIFO: 14811 case NF4DIR: 14812 void; 14813 default: 14814 void; /* server should return NFS4ERR_BADTYPE */ 14815 }; 14817 struct CREATE4args { 14818 /* CURRENT_FH: directory for creation */ 14819 createtype4 objtype; 14820 component4 objname; 14821 fattr4 createattrs; 14822 }; 14824 17.4.3. 
RESULTS 14826 struct CREATE4resok { 14827 change_info4 cinfo; 14828 bitmap4 attrset; /* attributes set */ 14829 }; 14831 union CREATE4res switch (nfsstat4 status) { 14832 case NFS4_OK: 14833 CREATE4resok resok4; 14834 default: 14835 void; 14836 }; 14838 17.4.4. DESCRIPTION 14840 The CREATE operation creates a non-regular file object in a directory 14841 with a given name. The OPEN operation MUST be used to create a 14842 regular file. 14844 The objname specifies the name for the new object. The objtype 14845 determines the type of object to be created: directory, symlink, etc. 14847 If an object of the same name already exists in the directory, the 14848 server will return the error NFS4ERR_EXIST. 14850 For the directory where the new file object was created, the server 14851 returns change_info4 information in cinfo. With the atomic field of 14852 the change_info4 struct, the server will indicate if the before and 14853 after change attributes were obtained atomically with respect to the 14854 file object creation. 14856 If the objname has a length of 0 (zero), or if objname does not obey 14857 the UTF-8 definition, the error NFS4ERR_INVAL will be returned. 14859 The current filehandle is replaced by that of the new object. 14861 The createattrs specifies the initial set of attributes for the 14862 object. The set of attributes may include any writable attribute 14863 valid for the object type. When the operation is successful, the 14864 server will return to the client an attribute mask signifying which 14865 attributes were successfully set for the object. 14867 If createattrs includes neither the owner attribute nor an ACL with 14868 an ACE for the owner, and if the server's file system both supports 14869 and requires an owner attribute (or an owner ACE) then the server 14870 MUST derive the owner (or the owner ACE). 
This would typically be 14871 from the principal indicated in the RPC credentials of the call, but 14872 the server's operating environment or file system semantics may 14873 dictate other methods of derivation. Similarly, if createattrs 14874 includes neither the group attribute nor a group ACE, and if the 14875 server's file system both supports and requires the notion of a group 14876 attribute (or group ACE), the server MUST derive the group attribute 14877 (or the corresponding group ACE) for the file. This could be from 14878 the RPC call's credentials, such as the group principal if the 14879 credentials include it (such as with AUTH_SYS), from the group 14880 identifier associated with the principal in the credentials (e.g., 14881 POSIX systems have a passwd database that has the group 14882 identifier for every user identifier), inherited from the directory 14883 the object is created in, or whatever else the server's operating 14884 environment or file system semantics dictate. This applies to the 14885 OPEN operation too. 14887 Conversely, it is possible the client will specify in createattrs an 14888 owner attribute, group attribute, or ACL for which the principal 14889 indicated in the RPC call's credentials does not have permission to 14890 create files. The error to be returned in this instance is 14891 NFS4ERR_PERM. This applies to the OPEN operation too. 14893 17.4.5. IMPLEMENTATION 14895 If the client desires to set attribute values after the create, a 14896 SETATTR operation can be added to the COMPOUND request so that the 14897 appropriate attributes will be set. 14899 17.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery 14901 17.5.1. SYNOPSIS 14903 client ID -> 14905 17.5.2. ARGUMENTS 14907 /* 14908 * DELEGPURGE: Purge Delegations Awaiting Recovery 14909 */ 14910 struct DELEGPURGE4args { 14911 clientid4 clientid; 14912 }; 14914 17.5.3. RESULTS 14916 struct DELEGPURGE4res { 14917 nfsstat4 status; 14918 }; 14920 17.5.4.
DESCRIPTION 14922 Purges all of the delegations awaiting recovery for a given client. 14923 This is useful for clients which do not commit delegation information 14924 to stable storage to indicate that conflicting requests need not be 14925 delayed by the server awaiting recovery of delegation information. 14927 This operation should be used by clients that record delegation 14928 information on stable storage on the client. In this case, 14929 DELEGPURGE should be issued immediately after doing delegation 14930 recovery on all delegations known to the client. Doing so will 14931 notify the server that no additional delegations for the client will 14932 be recovered allowing it to free resources, and avoid delaying other 14933 clients who make requests that conflict with the unrecovered 14934 delegations. The set of delegations known to the server and the 14935 client may be different. The reason for this is that a client may 14936 fail after making a request which resulted in delegation but before 14937 it received the results and committed them to the client's stable 14938 storage. 14940 The server MAY support DELEGPURGE, but if it does not, it MUST NOT 14941 support CLAIM_DELEGATE_PREV. 14943 17.6. Operation 8: DELEGRETURN - Return Delegation 14945 17.6.1. SYNOPSIS 14947 (cfh), stateid -> 14949 17.6.2. ARGUMENTS 14951 /* 14952 * DELEGRETURN: Return a delegation 14953 */ 14954 struct DELEGRETURN4args { 14955 /* CURRENT_FH: delegated file */ 14956 stateid4 deleg_stateid; 14957 }; 14959 17.6.3. RESULTS 14961 struct DELEGRETURN4res { 14962 nfsstat4 status; 14963 }; 14965 17.6.4. DESCRIPTION 14967 Returns the delegation represented by the current filehandle and 14968 stateid. 14970 Delegations may be returned when recalled or voluntarily (i.e. before 14971 the server has recalled them). In either case the client must 14972 properly propagate state changed under the context of the delegation 14973 to the server before returning the delegation. 14975 17.7. 
Operation 9: GETATTR - Get Attributes 14977 17.7.1. SYNOPSIS 14979 (cfh), attrbits -> attrbits, attrvals 14981 17.7.2. ARGUMENTS 14983 /* 14984 * GETATTR: Get file attributes 14985 */ 14986 struct GETATTR4args { 14987 /* CURRENT_FH: directory or file */ 14988 bitmap4 attr_request; 14989 }; 14991 17.7.3. RESULTS 14993 struct GETATTR4resok { 14994 fattr4 obj_attributes; 14995 }; 14997 union GETATTR4res switch (nfsstat4 status) { 14998 case NFS4_OK: 14999 GETATTR4resok resok4; 15000 default: 15001 void; 15002 }; 15004 17.7.4. DESCRIPTION 15006 The GETATTR operation will obtain attributes for the file system 15007 object specified by the current filehandle. The client sets a bit in 15008 the bitmap argument for each attribute value that it would like the 15009 server to return. The server returns an attribute bitmap that 15010 indicates the attribute values that it was able to return; this 15011 will include all attributes requested by the client that are 15012 supported by the server for the target file system. This 15013 bitmap is followed by the attribute values ordered lowest attribute 15014 number first. 15016 The server must return a value for each attribute that the client 15017 requests if the attribute is supported by the server for the target 15018 file system. If the server does not support a particular attribute 15019 on the target file system then it must not return the attribute value 15020 and must not set the attribute bit in the result bitmap. The server 15021 must return an error if it supports an attribute on the target but 15022 cannot obtain its value. In that case, no attribute values will be 15023 returned. 15025 File systems which are absent should be treated as having support for 15026 a very small set of attributes as described in GETATTR Within an 15027 Absent File System (Section 5), even if previously, when the file 15028 system was present, more attributes were supported.
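The bitmap intersection and value ordering described above can be sketched as follows. This is an illustrative Python model, not part of the protocol; the attribute numbers and values in the example are hypothetical, and the real bitmap4 is a variable-length word array rather than a single integer:

```python
# Sketch of GETATTR response construction: return only attributes
# that were both requested and supported, with values ordered
# lowest attribute number first.

def getattr_response(requested_bitmap, supported_bitmap, values):
    """Return (bitmap, attrvals) for a GETATTR reply.

    'values' maps attribute number -> value; the server must have a
    value for every supported attribute it reports, or fail the
    operation entirely.
    """
    returned = requested_bitmap & supported_bitmap
    attrvals = []
    bit = 0
    mask = returned
    while mask:
        if mask & 1:
            attrvals.append(values[bit])  # lowest number first
        mask >>= 1
        bit += 1
    return returned, attrvals

# Client asks for attributes 0, 1, and 3; the file system supports
# only 0 and 1 (attribute names/values here are made up).
values = {0: "NF4DIR", 1: "fh-volatile", 3: 4096}
bitmap, vals = getattr_response(0b1011, 0b0011, values)
```

Unsupported attributes are simply absent from both the bitmap and the value list; they do not cause an error.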
15030 All servers must support the mandatory attributes as specified in 15031 File Attributes (Section 10.3.1), for all file systems, with the 15032 exception of absent file systems. 15034 On success, the current filehandle retains its value. 15036 17.7.5. IMPLEMENTATION 15038 17.8. Operation 10: GETFH - Get Current Filehandle 15040 17.8.1. SYNOPSIS 15042 (cfh) -> filehandle 15044 17.8.2. ARGUMENTS 15046 /* CURRENT_FH: */ 15047 void; 15049 17.8.3. RESULTS 15051 /* 15052 * GETFH: Get current filehandle 15053 */ 15054 struct GETFH4resok { 15055 nfs_fh4 object; 15056 }; 15058 union GETFH4res switch (nfsstat4 status) { 15059 case NFS4_OK: 15060 GETFH4resok resok4; 15061 default: 15062 void; 15063 }; 15065 17.8.4. DESCRIPTION 15067 This operation returns the current filehandle value. 15069 On success, the current filehandle retains its value. 15071 17.8.5. IMPLEMENTATION 15073 Operations that change the current filehandle like LOOKUP or CREATE 15074 do not automatically return the new filehandle as a result. For 15075 instance, if a client needs to look up a directory entry and obtain 15076 its filehandle, then the following request is needed. 15078 PUTFH (directory filehandle) 15080 LOOKUP (entry name) 15082 GETFH 15084 17.9. Operation 11: LINK - Create Link to a File 15086 17.9.1. SYNOPSIS 15088 (sfh), (cfh), newname -> (cfh), change_info 15090 17.9.2. ARGUMENTS 15092 /* 15093 * LINK: Create link to an object 15094 */ 15095 struct LINK4args { 15096 /* SAVED_FH: source object */ 15097 /* CURRENT_FH: target directory */ 15098 component4 newname; 15099 }; 15101 17.9.3. RESULTS 15103 struct LINK4resok { 15104 change_info4 cinfo; 15105 }; 15107 union LINK4res switch (nfsstat4 status) { 15108 case NFS4_OK: 15109 LINK4resok resok4; 15110 default: 15111 void; 15112 }; 15114 17.9.4.
DESCRIPTION 15116 The LINK operation creates an additional newname for the file 15117 represented by the saved filehandle, as set by the SAVEFH operation, 15118 in the directory represented by the current filehandle. The existing 15119 file and the target directory must reside within the same file system 15120 on the server. On success, the current filehandle will continue to 15121 be the target directory. If an object exists in the target directory 15122 with the same name as newname, the server must return NFS4ERR_EXIST. 15124 For the target directory, the server returns change_info4 information 15125 in cinfo. With the atomic field of the change_info4 struct, the 15126 server will indicate if the before and after change attributes were 15127 obtained atomically with respect to the link creation. 15129 If the newname has a length of 0 (zero), or if newname does not obey 15130 the UTF-8 definition, the error NFS4ERR_INVAL will be returned. 15132 17.9.5. IMPLEMENTATION 15134 Changes to any property of the "hard" linked files are reflected in 15135 all of the linked files. When a link is made to a file, the 15136 attributes for the file should have a value for numlinks that is one 15137 greater than the value before the LINK operation. 15139 The statement "file and the target directory must reside within the 15140 same file system on the server" means that the fsid fields in the 15141 attributes for the objects are the same. If they reside on different 15142 file systems, the error, NFS4ERR_XDEV, is returned. On some servers, 15143 the filenames, "." and "..", are illegal as newname. 15145 In the case that newname is already linked to the file represented by 15146 the saved filehandle, the server will return NFS4ERR_EXIST. 15148 Note that symbolic links are created with the CREATE operation. 15150 17.10. Operation 12: LOCK - Create Lock 15152 17.10.1. SYNOPSIS 15154 (cfh) locktype, reclaim, offset, length, locker -> stateid 15156 17.10.2. 
ARGUMENTS 15158 /* 15159 * For LOCK, transition from open_stateid and lock_owner 15160 * to a lock stateid. 15161 */ 15162 struct open_to_lock_owner4 { 15163 seqid4 open_seqid; 15164 stateid4 open_stateid; 15165 seqid4 lock_seqid; 15166 lock_owner4 lock_owner; 15167 }; 15169 /* 15170 * For LOCK, existing lock stateid continues to request new 15171 * file lock for the same lock_owner and open_stateid. 15172 */ 15173 struct exist_lock_owner4 { 15174 stateid4 lock_stateid; 15175 seqid4 lock_seqid; 15176 }; 15178 union locker4 switch (bool new_lock_owner) { 15179 case TRUE: 15180 open_to_lock_owner4 open_owner; 15181 case FALSE: 15182 exist_lock_owner4 lock_owner; 15183 }; 15185 /* 15186 * LOCK/LOCKT/LOCKU: Record lock management 15187 */ 15188 struct LOCK4args { 15189 /* CURRENT_FH: file */ 15190 nfs_lock_type4 locktype; 15191 bool reclaim; 15192 offset4 offset; 15193 length4 length; 15194 locker4 locker; 15195 }; 15197 17.10.3. RESULTS 15199 struct LOCK4denied { 15200 offset4 offset; 15201 length4 length; 15202 nfs_lock_type4 locktype; 15203 lock_owner4 owner; 15204 }; 15206 struct LOCK4resok { 15207 stateid4 lock_stateid; 15208 }; 15210 union LOCK4res switch (nfsstat4 status) { 15211 case NFS4_OK: 15212 LOCK4resok resok4; 15213 case NFS4ERR_DENIED: 15214 LOCK4denied denied; 15215 default: 15216 void; 15217 }; 15219 17.10.4. DESCRIPTION 15221 The LOCK operation requests a record lock for the octet range 15222 specified by the offset and length parameters. The lock type is also 15223 specified to be one of the values of nfs_lock_type4. If this is a reclaim 15224 request, the reclaim parameter will be TRUE. 15226 Bytes in a file may be locked even if those bytes are not currently 15227 allocated to the file. To lock the file from a specific offset 15228 through the end-of-file (no matter how long the file actually is) use 15229 a length field with all bits set to 1 (one).
If the length is zero, 15230 or if a length that is not all bits set to one is specified and the 15231 length, when added to the offset, exceeds the maximum 64-bit unsigned 15232 integer value, the error NFS4ERR_INVAL will result. 15234 Some servers may only support locking for octet offsets that fit 15235 within 32 bits. If the client specifies a range that includes an 15236 octet beyond the last octet offset of the 32-bit range, but does not 15237 include the last octet offset of the 32-bit range and all of the octet 15238 offsets beyond it, up to the end of the valid 64-bit range, such a 15239 32-bit server MUST return the error NFS4ERR_BAD_RANGE. 15241 In the case that the lock is denied, the owner, offset, and length of 15242 a conflicting lock are returned. 15244 The locker argument specifies the lock_owner that is associated with 15245 the LOCK request. The locker4 structure is a switched union that 15246 indicates whether the client has already created record locking state 15247 associated with the current open file and lock owner. In the case in 15248 which it has, the argument is just a stateid for the set of locks 15249 associated with that open file and lock owner, together with a 15250 lock_seqid value that must be zero. In the case where no such state 15251 has been established, or the client does not have the stateid 15252 available, the argument contains the stateid of the open file with 15253 which this lock is to be associated, together with the lock_owner 15254 with which the lock is to be associated. The open_to_lock_owner 15255 case covers the very first lock done by a lock_owner for a given open 15256 file and offers a method to use the established state of the 15257 open_stateid to transition to the use of a lock stateid. 15259 The client field of the lock owner and all seqid values in the 15260 arguments have zero as the only valid value. When any of these are 15261 specified as other than zero, the server MUST return an 15262 NFS4ERR_INVAL.
The client ID with which all owners and stateids are 15263 associated is the client ID associated with the session on which the 15264 request was issued. The client ID appearing in a LOCK4denied 15265 structure is the actual client ID associated with the conflicting lock, 15266 whether this is the client ID associated with the current session, or 15267 a different one. 15269 On success, the current filehandle retains its value. 15271 17.10.5. IMPLEMENTATION 15273 If the server is unable to determine the exact offset and length of 15274 the conflicting lock, the same offset and length that were provided 15275 in the arguments should be returned in the denied results. The File 15276 Locking section contains a full description of this and the other 15277 file locking operations. 15279 LOCK operations are subject to permission checks and to checks 15280 against the access type of the associated file. However, the 15281 specific rights and modes required for various types of locks reflect 15282 the semantics of the server-exported file system and are not 15283 specified by the protocol. For example, Windows 2000 allows a write 15284 lock of a file open for READ, while a POSIX-compliant system does 15285 not. 15287 When the client makes a lock request that corresponds to a range that 15288 the lockowner has locked already (with the same or different lock 15289 type), or to a sub-region of such a range, or to a region which 15290 includes multiple locks already granted to that lockowner, in whole 15291 or in part, and the server does not support such locking operations 15292 (i.e. does not support POSIX locking semantics), the server will 15293 return the error NFS4ERR_LOCK_RANGE. In that case, the client may 15294 return an error, or it may emulate the required operations, using 15295 only LOCK for ranges that do not include any octets already locked by 15296 that lock_owner and LOCKU of locks held by that lock_owner 15297 (specifying an exactly-matching range and type).
Similarly, when the 15298 client makes a lock request that amounts to upgrading (changing from 15299 a read lock to a write lock) or downgrading (changing from write lock 15300 to a read lock) an existing record lock, and the server does not 15301 support such a lock, the server will return NFS4ERR_LOCK_NOTSUPP. 15302 Such operations may not perfectly reflect the required semantics in 15303 the face of conflicting lock requests from other clients. 15305 17.11. Operation 13: LOCKT - Test For Lock 15307 17.11.1. SYNOPSIS 15309 (cfh) locktype, offset, length, owner -> {void, NFS4ERR_DENIED -> 15310 owner} 15312 17.11.2. ARGUMENTS 15314 struct LOCKT4args { 15315 /* CURRENT_FH: file */ 15316 nfs_lock_type4 locktype; 15317 offset4 offset; 15318 length4 length; 15319 lock_owner4 owner; 15320 }; 15322 17.11.3. RESULTS 15324 union LOCKT4res switch (nfsstat4 status) { 15325 case NFS4ERR_DENIED: 15326 LOCK4denied denied; 15327 case NFS4_OK: 15328 void; 15329 default: 15330 void; 15331 }; 15333 17.11.4. DESCRIPTION 15335 The LOCKT operation tests the lock as specified in the arguments. If 15336 a conflicting lock exists, the owner, offset, length, and type of the 15337 conflicting lock are returned. The owner field in the results 15338 includes the client ID of the owner of the conflicting lock, whether this 15339 is the client ID associated with the current session or a different 15340 client ID. If no lock is held, nothing other than NFS4_OK is 15341 returned. Lock types READ_LT and READW_LT are processed in the same 15342 way in that a conflicting lock test is done without regard to 15343 blocking or non-blocking. The same is true for WRITE_LT and 15344 WRITEW_LT. 15346 The ranges are specified as for LOCK. The NFS4ERR_INVAL and 15347 NFS4ERR_BAD_RANGE errors are returned under the same circumstances as 15348 for LOCK. 15350 The client ID field of the owner should be specified as zero.
The 15351 client ID used for ownership comparisons is that associated with the 15352 session on which the request is issued. If the client ID field is 15353 other than zero, the server MUST return the error NFS4ERR_INVAL. 15355 On success, the current filehandle retains its value. 15357 17.11.5. IMPLEMENTATION 15359 If the server is unable to determine the exact offset and length of 15360 the conflicting lock, the same offset and length that were provided 15361 in the arguments should be returned in the denied results. The File 15362 Locking section contains further discussion of the file locking 15363 mechanisms. 15365 LOCKT uses a lock_owner4 rather than a stateid4, as is used in LOCK, to 15366 identify the owner. This is because the client does not have to open 15367 the file to test for the existence of a lock, so a stateid may not be 15368 available. 15370 The test for conflicting locks should exclude locks for the current 15371 lockowner. Note that since such locks are not examined, the possible 15372 existence of overlapping ranges may not affect the results of LOCKT. 15373 If the server does examine locks that match the lockowner for the 15374 purpose of range checking, NFS4ERR_LOCK_RANGE may be returned. In 15375 the event that it returns NFS4_OK, clients may do a LOCK and receive 15376 NFS4ERR_LOCK_RANGE on the LOCK request because of the flexibility 15377 provided to the server. 15379 17.12. Operation 14: LOCKU - Unlock File 15381 17.12.1. SYNOPSIS 15383 (cfh) type, seqid, stateid, offset, length -> stateid 15385 17.12.2. ARGUMENTS 15387 struct LOCKU4args { 15388 /* CURRENT_FH: file */ 15389 nfs_lock_type4 locktype; 15390 seqid4 seqid; 15391 stateid4 lock_stateid; 15392 offset4 offset; 15393 length4 length; 15394 }; 15396 17.12.3. RESULTS 15398 union LOCKU4res switch (nfsstat4 status) { 15399 case NFS4_OK: 15400 stateid4 lock_stateid; 15401 default: 15402 void; 15403 }; 15405 17.12.4.
DESCRIPTION 15407 The LOCKU operation unlocks the record lock specified by the 15408 parameters. The client may set the locktype field to any value that 15409 is legal for the nfs_lock_type4 enumerated type, and the server MUST 15410 accept any legal value for locktype. Any legal value for locktype 15411 has no effect on the success or failure of the LOCKU operation. 15413 The ranges are specified as for LOCK. The NFS4ERR_INVAL and 15414 NFS4ERR_BAD_RANGE errors are returned under the same circumstances as 15415 for LOCK. 15417 The seqid parameter should be specified as zero. If any other value 15418 is specified, the server must return an NFS4ERR_INVAL error. 15420 On success, the current filehandle retains its value. 15422 17.12.5. IMPLEMENTATION 15424 If the area to be unlocked does not correspond exactly to a lock 15425 actually held by the lockowner, the server may return the error 15426 NFS4ERR_LOCK_RANGE. This includes the cases in which the area is not 15427 locked, the area is a sub-range of the area locked, the area 15428 overlaps the area locked without matching exactly, or the area 15429 specified includes multiple locks held by the lockowner. In all of 15430 these cases, which are allowed by POSIX locking semantics, a client receiving 15431 this error should, if it desires support for such operations, 15432 simulate the operation using LOCKU on ranges corresponding to locks 15433 it actually holds, possibly followed by LOCK requests for the sub- 15434 ranges not being unlocked. 15436 17.13. Operation 15: LOOKUP - Lookup Filename 15438 17.13.1. SYNOPSIS 15440 (cfh), component -> (cfh) 15442 17.13.2. ARGUMENTS 15444 /* 15445 * LOOKUP: Lookup filename 15446 */ 15447 struct LOOKUP4args { 15448 /* CURRENT_FH: directory */ 15449 component4 objname; 15450 }; 15452 17.13.3. RESULTS 15454 struct LOOKUP4res { 15455 /* CURRENT_FH: object */ 15456 nfsstat4 status; 15457 }; 15459 17.13.4.
DESCRIPTION 15461 This operation looks up or finds a file system object using the 15462 directory specified by the current filehandle. LOOKUP evaluates the 15463 component and if the object exists the current filehandle is replaced 15464 with the component's filehandle. 15466 If the component cannot be evaluated either because it does not exist 15467 or because the client does not have permission to evaluate the 15468 component, then an error will be returned and the current filehandle 15469 will be unchanged. 15471 If the component is a zero-length string or if any component does not 15472 obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned. 15474 17.13.5. IMPLEMENTATION 15476 If the client wants to achieve the effect of a multi-component 15477 lookup, it may construct a COMPOUND request such as (and obtain each 15478 filehandle): 15480 PUTFH (directory filehandle) 15481 LOOKUP "pub" 15482 GETFH 15483 LOOKUP "foo" 15484 GETFH 15485 LOOKUP "bar" 15486 GETFH 15488 NFS version 4 servers depart from the semantics of previous NFS 15489 versions in allowing LOOKUP requests to cross mountpoints on the 15490 server. The client can detect a mountpoint crossing by comparing the 15491 fsid attribute of the directory with the fsid attribute of the 15492 directory looked up. If the fsids are different then the new 15493 directory is a server mountpoint. UNIX clients that detect a 15494 mountpoint crossing will need to mount the server's file system. 15495 This needs to be done to maintain the file object identity checking 15496 mechanisms common to UNIX clients. 15498 Servers that limit NFS access to "shares" or "exported" file systems 15499 should provide a pseudo file system into which the exported file 15500 systems can be integrated, so that clients can browse the server's 15501 name space. The client's view of a pseudo file system will be limited 15502 to paths that lead to exported file systems.
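The multi-component COMPOUND shown above can be sketched as a simple request builder. This is an illustrative Python sketch only; the tuple encoding of operations is hypothetical and not part of the protocol's XDR representation.

```python
# Sketch of building the multi-component COMPOUND shown above:
# PUTFH to set the starting directory, then a LOOKUP/GETFH pair per
# path component so the client obtains each intermediate filehandle.

def multi_lookup_compound(dir_fh, path_components):
    ops = [("PUTFH", dir_fh)]          # establish the current filehandle
    for name in path_components:
        ops.append(("LOOKUP", name))   # replace (cfh) with the child
        ops.append(("GETFH",))         # return the new (cfh) to the client
    return ops

# multi_lookup_compound("dfh", ["pub", "foo", "bar"]) yields the
# seven-operation COMPOUND from the example above.
```

A client that only needs the final filehandle could omit the intermediate GETFH operations; the GETFH after each LOOKUP is what lets the client compare fsid attributes along the path to detect mountpoint crossings.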
15504 Note: previous versions of the protocol assigned special semantics to 15505 the names "." and "..". NFS version 4 assigns no special semantics 15506 to these names. The LOOKUPP operation must be used to look up a parent 15507 directory. 15509 Note that this operation does not follow symbolic links. The client 15510 is responsible for all parsing of filenames including filenames that 15511 are modified by symbolic links encountered during the lookup process. 15513 If the current filehandle supplied is not a directory but a symbolic 15514 link, the error NFS4ERR_SYMLINK is returned. For all 15515 other non-directory file types, the error NFS4ERR_NOTDIR is returned. 15517 17.14. Operation 16: LOOKUPP - Lookup Parent Directory 15518 17.14.1. SYNOPSIS 15520 (cfh) -> (cfh) 15522 17.14.2. ARGUMENTS 15524 /* CURRENT_FH: object */ 15525 void; 15527 17.14.3. RESULTS 15529 /* 15530 * LOOKUPP: Lookup parent directory 15531 */ 15532 struct LOOKUPP4res { 15533 /* CURRENT_FH: directory */ 15534 nfsstat4 status; 15535 }; 15537 17.14.4. DESCRIPTION 15539 The current filehandle is assumed to refer to a regular directory or 15540 a named attribute directory. LOOKUPP assigns the filehandle for its 15541 parent directory to be the current filehandle. If there is no parent 15542 directory, an NFS4ERR_NOENT error must be returned. Therefore, 15543 NFS4ERR_NOENT will be returned by the server when the current 15544 filehandle is at the root or top of the server's file tree. 15546 As for LOOKUP, LOOKUPP will also cross mountpoints. 15548 If the current filehandle is not a directory or named attribute 15549 directory, the error NFS4ERR_NOTDIR is returned. 15551 If the requester's security flavor does not match that configured for 15552 the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC 15553 (a future minor revision of NFSv4 may upgrade this to MUST) in the 15554 LOOKUPP response.
However, if the server does so, it MUST support 15555 the new SECINFO_NO_NAME operation, so that the client can gracefully 15556 determine the correct security flavor. See the discussion of the 15557 SECINFO_NO_NAME operation for a description. 15559 If the current filehandle is a named attribute directory that is 15560 associated with a file system object via OPENATTR (i.e. not a sub- 15561 directory of a named attribute directory) LOOKUPP SHOULD return the 15562 filehandle of the associated file system object. 15564 17.14.5. IMPLEMENTATION 15566 17.15. Operation 17: NVERIFY - Verify Difference in Attributes 15568 17.15.1. SYNOPSIS 15570 (cfh), fattr -> - 15572 17.15.2. ARGUMENTS 15574 /* 15575 * NVERIFY: Verify attributes different 15576 */ 15577 struct NVERIFY4args { 15578 /* CURRENT_FH: object */ 15579 fattr4 obj_attributes; 15580 }; 15582 17.15.3. RESULTS 15584 struct NVERIFY4res { 15585 nfsstat4 status; 15586 }; 15588 17.15.4. DESCRIPTION 15590 This operation is used to prefix a sequence of operations to be 15591 performed if one or more attributes have changed on some file system 15592 object. If all the attributes match then the error NFS4ERR_SAME must 15593 be returned. 15595 On success, the current filehandle retains its value. 15597 17.15.5. IMPLEMENTATION 15599 This operation is useful as a cache validation operator. If the 15600 object to which the attributes belong has changed then the following 15601 operations may obtain new data associated with that object. For 15602 instance, to check if a file has been changed and obtain new data if 15603 it has: 15605 PUTFH (public) 15606 LOOKUP "foobar" 15607 NVERIFY attrbits attrs 15608 READ 0 32767 15610 In the case that a recommended attribute is specified in the NVERIFY 15611 operation and the server does not support that attribute for the file 15612 system object, the error NFS4ERR_ATTRNOTSUPP is returned to the 15613 client. 15615 When the attribute rdattr_error or any write-only attribute (e.g. 
15616 time_modify_set) is specified, the error NFS4ERR_INVAL is returned to 15617 the client. 15619 17.16. Operation 18: OPEN - Open a Regular File 15621 17.16.1. SYNOPSIS 15623 , share_access, share_deny, owner, openhow, claim 15624 -> (cfh), stateid, cinfo, rflags, attrset, delegation 15626 17.16.2. ARGUMENTS 15628 /* 15629 * Various definitions for OPEN 15630 */ 15631 enum createmode4 { 15632 UNCHECKED4 = 0, 15633 GUARDED4 = 1, 15634 EXCLUSIVE4 = 2 15635 }; 15637 union createhow4 switch (createmode4 mode) { 15638 case UNCHECKED4: 15639 case GUARDED4: 15640 fattr4 createattrs; 15641 case EXCLUSIVE4: 15642 verifier4 createverf; 15643 }; 15645 enum opentype4 { 15646 OPEN4_NOCREATE = 0, 15647 OPEN4_CREATE = 1 15648 }; 15650 union openflag4 switch (opentype4 opentype) { 15651 case OPEN4_CREATE: 15652 createhow4 how; 15653 default: 15654 void; 15655 }; 15657 /* Next definitions used for OPEN delegation */ 15658 enum limit_by4 { 15659 NFS_LIMIT_SIZE = 1, 15660 NFS_LIMIT_BLOCKS = 2 15661 /* others as needed */ 15662 }; 15664 struct nfs_modified_limit4 { 15665 uint32_t num_blocks; 15666 uint32_t bytes_per_block; 15667 }; 15669 union nfs_space_limit4 switch (limit_by4 limitby) { 15670 /* limit specified as file size */ 15671 case NFS_LIMIT_SIZE: 15672 uint64_t filesize; 15673 /* limit specified by number of blocks */ 15674 case NFS_LIMIT_BLOCKS: 15675 nfs_modified_limit4 mod_blocks; 15676 } ; 15678 /* 15679 * Share Access and Deny constants for open argument 15680 */ 15681 const OPEN4_SHARE_ACCESS_READ = 0x00000001; 15682 const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; 15683 const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; 15685 const OPEN4_SHARE_DENY_NONE = 0x00000000; 15686 const OPEN4_SHARE_DENY_READ = 0x00000001; 15687 const OPEN4_SHARE_DENY_WRITE = 0x00000002; 15688 const OPEN4_SHARE_DENY_BOTH = 0x00000003; 15690 /* new flags for share_access field of OPEN4args */ 15691 const OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00; 15692 const OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE = 
0x0000; 15693 const OPEN4_SHARE_ACCESS_WANT_READ_DELEG = 0x0100; 15694 const OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200; 15695 const OPEN4_SHARE_ACCESS_WANT_ANY_DELEG = 0x0300; 15696 const OPEN4_SHARE_ACCESS_WANT_NO_DELEG = 0x0400; 15697 const OPEN4_SHARE_ACCESS_WANT_CANCEL = 0x0500; 15699 const OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL = 0x10000; 15700 const OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED = 0x20000; 15702 enum open_delegation_type4 { 15703 OPEN_DELEGATE_NONE = 0, 15704 OPEN_DELEGATE_READ = 1, 15705 OPEN_DELEGATE_WRITE = 2, 15706 OPEN_DELEGATE_NONE_EXT = 3 /* new to v4.1 */ 15707 }; 15709 enum open_claim_type4 { 15710 CLAIM_NULL = 0, 15711 CLAIM_PREVIOUS = 1, 15712 CLAIM_DELEGATE_CUR = 2, 15713 CLAIM_DELEGATE_PREV = 3, 15715 /* 15716 * Like CLAIM_NULL, but object identified 15717 * by the current filehandle. 15718 */ 15719 CLAIM_FH = 4, /* new to v4.1 */ 15721 /* 15722 * Like CLAIM_DELEGATE_CUR, but object identified 15723 * by current filehandle. 15724 */ 15725 CLAIM_DELEG_CUR_FH = 5, /* new to v4.1 */ 15727 /* 15728 * Like CLAIM_DELEGATE_PREV, but object identified 15729 * by current filehandle. 15730 */ 15731 CLAIM_DELEG_PREV_FH = 6 /* new to v4.1 */ 15732 }; 15734 struct open_claim_delegate_cur4 { 15735 stateid4 delegate_stateid; 15736 component4 file; 15737 }; 15739 union open_claim4 switch (open_claim_type4 claim) { 15740 /* 15741 * No special rights to file. Ordinary OPEN of the specified file. 15742 */ 15743 case CLAIM_NULL: 15744 /* CURRENT_FH: directory */ 15745 component4 file; 15747 /* 15748 * Right to the file established by an open previous to server 15749 * reboot. File identified by filehandle obtained at that time 15750 * rather than by name. 15751 */ 15752 case CLAIM_PREVIOUS: 15753 /* CURRENT_FH: file being reclaimed */ 15754 open_delegation_type4 delegate_type; 15756 /* 15757 * Right to file based on a delegation granted by the server. 15758 * File is specified by name. 
15759 */ 15760 case CLAIM_DELEGATE_CUR: 15761 /* CURRENT_FH: directory */ 15762 open_claim_delegate_cur4 delegate_cur_info; 15764 /* Right to file based on a delegation granted to a previous boot 15765 * instance of the client. File is specified by name. 15766 */ 15767 case CLAIM_DELEGATE_PREV: 15768 /* CURRENT_FH: directory */ 15769 component4 file_delegate_prev; 15770 }; 15772 /* 15773 * OPEN: Open a file, potentially receiving an open delegation 15774 */ 15775 struct OPEN4args { 15776 seqid4 seqid; 15777 uint32_t share_access; 15778 uint32_t share_deny; 15779 open_owner4 owner; 15780 openflag4 openhow; 15781 open_claim4 claim; 15782 }; 15784 17.16.3. RESULTS 15786 struct open_read_delegation4 { 15787 stateid4 stateid; /* Stateid for delegation*/ 15788 bool recall; /* Pre-recalled flag for 15789 delegations obtained 15790 by reclaim 15791 (CLAIM_PREVIOUS) */ 15792 nfsace4 permissions; /* Defines users who don't 15793 need an ACCESS call to 15794 open for read */ 15795 }; 15797 struct open_write_delegation4 { 15798 stateid4 stateid; /* Stateid for delegation */ 15799 bool recall; /* Pre-recalled flag for 15800 delegations obtained 15801 by reclaim 15802 (CLAIM_PREVIOUS) */ 15803 nfs_space_limit4 space_limit; /* Defines condition that 15804 the client must check to 15805 determine whether the 15806 file needs to be flushed 15807 to the server on close. 15808 */ 15809 nfsace4 permissions; /* Defines users who don't 15810 need an ACCESS call as 15811 part of a delegated 15812 open. 
*/ 15813 }; 15815 enum why_no_delegation4 { /* new to v4.1 */ 15816 WND4_NOT_WANTED = 0, 15817 WND4_CONTENTION = 1, 15818 WND4_RESOURCE = 2, 15819 WND4_NOT_SUPP_FTYPE = 3, 15820 WND4_WRITE_DELEG_NOT_SUPP_FTYPE = 4, 15821 WND4_NOT_SUPP_UPGRADE = 5, 15822 WND4_NOT_SUPP_DOWNGRADE = 6, 15823 WND4_CANCELED = 7, 15824 WND4_IS_DIR = 8 15825 }; 15827 union open_none_delegation4 /* new to v4.1 */ 15828 switch (why_no_delegation4 ond_why) { 15829 case WND4_CONTENTION: 15830 bool ond_server_will_push_deleg; 15831 case WND4_RESOURCE: 15832 bool ond_server_will_signal_avail; 15833 default: 15834 void; 15835 }; 15837 union open_delegation4 15838 switch (open_delegation_type4 delegation_type) { 15839 case OPEN_DELEGATE_NONE: 15840 void; 15841 case OPEN_DELEGATE_READ: 15842 open_read_delegation4 read; 15843 case OPEN_DELEGATE_WRITE: 15844 open_write_delegation4 write; 15845 case OPEN_DELEGATE_NONE_EXT: /* new to v4.1 */ 15846 open_none_delegation4 od_whynone; 15847 }; 15848 /* 15849 * Result flags 15850 */ 15851 /* Client must confirm open */ 15852 const OPEN4_RESULT_CONFIRM = 0x00000002; 15853 /* Type of file locking behavior at the server */ 15854 const OPEN4_RESULT_LOCKTYPE_POSIX = 0x00000004; 15855 /* Server will preserve file if removed while open */ 15856 const OPEN4_RESULT_PRESERVE_UNLINKED = 0x00000008; 15857 /* Server may use CB_NOTIFY_LOCK on locks derived from this open */ 15858 const OPEN4_RESULT_MAY_NOTIFY_LOCK = 0x00000020; 15860 struct OPEN4resok { 15861 stateid4 stateid; /* Stateid for open */ 15862 change_info4 cinfo; /* Directory Change Info */ 15863 uint32_t rflags; /* Result flags */ 15864 bitmap4 attrset; /* attribute set for create*/ 15865 open_delegation4 delegation; /* Info on any open 15866 delegation */ 15867 }; 15869 union OPEN4res switch (nfsstat4 status) { 15870 case NFS4_OK: 15871 /* CURRENT_FH: opened file */ 15872 OPEN4resok resok4; 15873 default: 15874 void; 15875 }; 15877 17.16.4. 
DESCRIPTION 15879 The OPEN operation creates and/or opens a regular file in a directory 15880 with the provided name. If the file does not exist at the server and 15881 creation is desired, specification of the method of creation is 15882 provided by the openhow parameter. The client has the choice of 15883 three creation methods: UNCHECKED, GUARDED, or EXCLUSIVE. 15885 If the current filehandle is a named attribute directory, OPEN will 15886 then create or open a named attribute file. Note that exclusive 15887 create of a named attribute is not supported. If the createmode is 15888 EXCLUSIVE4 and the current filehandle is a named attribute directory, 15889 the server will return NFS4ERR_INVAL. 15891 UNCHECKED means that the file should be created if a file of that 15892 name does not exist and encountering an existing regular file of that 15893 name is not an error. For this type of create, createattrs specifies 15894 the initial set of attributes for the file. The set of attributes 15895 may include any writable attribute valid for regular files. When an 15896 UNCHECKED create encounters an existing file, the attributes 15897 specified by createattrs are not used, except that when a size of 15898 zero is specified, the existing file is truncated. If GUARDED is 15899 specified, the server checks for the presence of a duplicate object 15900 by name before performing the create. If a duplicate exists, an 15901 error of NFS4ERR_EXIST is returned as the status. If the object does 15902 not exist, the request is performed as described for UNCHECKED. For 15903 each of these cases (UNCHECKED and GUARDED) where the operation is 15904 successful, the server will return to the client an attribute mask 15905 signifying which attributes were successfully set for the object. 15907 EXCLUSIVE specifies that the server is to follow exclusive creation 15908 semantics, using the verifier to ensure exclusive creation of the 15909 target.
The server should check for the presence of a duplicate 15910 object by name. If the object does not exist, the server creates the 15911 object and stores the verifier with the object. If the object does 15912 exist and the stored verifier matches the client-provided verifier, 15913 the server uses the existing object as the newly created object. If 15914 the stored verifier does not match, then an error of NFS4ERR_EXIST is 15915 returned. No attributes may be provided in this case, since the 15916 server may use an attribute of the target object to store the 15917 verifier. If the server uses an attribute to store the exclusive 15918 create verifier, it will signify which attribute by setting the 15919 appropriate bit in the attribute mask that is returned in the 15920 results. 15922 For the target directory, the server returns change_info4 information 15923 in cinfo. With the atomic field of the change_info4 struct, the 15924 server will indicate if the before and after change attributes were 15925 obtained atomically with respect to the link creation. 15927 Upon successful creation, the current filehandle is replaced by that 15928 of the new object. 15930 The OPEN operation provides for Windows share reservation capability 15931 with the use of the share_access and share_deny fields of the OPEN 15932 arguments. The client specifies at OPEN the required share_access 15933 and share_deny modes. For clients that do not directly support 15934 SHAREs (i.e., UNIX clients), the expected deny value is DENY_NONE. In the 15935 case that there is an existing SHARE reservation that conflicts with 15936 the OPEN request, the server returns the error NFS4ERR_SHARE_DENIED. 15937 For each OPEN, the client must provide a value for the owner field 15938 of the OPEN argument. The client ID associated with the owner is 15939 not derived from the client field of the owner parameter but is 15940 instead the client ID associated with the session on which the 15941 request is issued.
If the client ID field of the owner parameter is 15942 not zero, the server MUST return an NFS4ERR_INVAL error. For 15943 additional discussion of SHARE semantics see Section 8.8. 15945 The seqid value is not used in NFSv4.1. If the value passed is not 15946 zero, the server MUST return an NFS4ERR_INVAL error. 15948 In the case that the client is recovering state from a server 15949 failure, the claim field of the OPEN argument is used to signify that 15950 the request is meant to reclaim state previously held. 15952 The "claim" field of the OPEN argument is used to specify the file to 15953 be opened and the state information which the client claims to 15954 possess. There are seven claim types as follows: 15956 +---------------------+---------------------------------------------+ 15957 | open type | description | 15958 +---------------------+---------------------------------------------+ 15959 | CLAIM_NULL CLAIM_FH | For the client, this is a new OPEN request | 15960 | | and there is no previous state associated | 15961 | | with the file for the client. With | 15962 | | CLAIM_NULL the file is identified by the | 15963 | | current filehandle and the specified | 15964 | | component name. With CLAIM_FH (new to | 15965 | | v4.1) the file is identified by just the | 15966 | | current filehandle. | 15967 | CLAIM_PREVIOUS | The client is claiming basic OPEN state for | 15968 | | a file that was held previous to a server | 15969 | | reboot. Generally used when a server is | 15970 | | returning persistent filehandles; the | 15971 | | client may not have the file name to | 15972 | | reclaim the OPEN. | 15973 | CLAIM_DELEGATE_CUR | The client is claiming a delegation for | 15974 | CLAIM_DELEG_CUR_FH | OPEN as granted by the server. Generally | 15975 | | this is done as part of recalling a | 15976 | | delegation. With CLAIM_DELEGATE_CUR, the | 15977 | | file is identified by the current | 15978 | | filehandle and the specified component | 15979 | | name. With CLAIM_DELEG_CUR_FH (new to | 15980 | | v4.1), the file is identified by just the | 15981 | | current filehandle. | 15982 | CLAIM_DELEGATE_PREV | The client is claiming a delegation granted | 15983 | CLAIM_DELEG_PREV_FH | to a previous client instance; used after | 15984 | | the client reboots. The server MAY support | 15985 | | CLAIM_DELEGATE_PREV or CLAIM_DELEG_PREV_FH. | 15986 | | If it does support either open type, | 15987 | | CREATE_SESSION MUST NOT remove the client's | 15988 | | delegation state, and the server MUST | 15989 | | support the DELEGPURGE operation. | 15990 +---------------------+---------------------------------------------+ 15991 For OPEN requests whose claim type is other than CLAIM_PREVIOUS (i.e. 15992 requests other than those devoted to reclaiming opens after a server 15993 reboot) that reach the server during its grace or lease expiration 15994 period, the server returns an error of NFS4ERR_GRACE. 15996 For any OPEN request, the server may return an open delegation, which 15997 allows further opens and closes to be handled locally on the client 15998 as described in the section Open Delegation. Note that delegation is 15999 up to the server to decide. The client should never assume that 16000 delegation will or will not be granted in a particular instance. It 16001 should always be prepared for either case. A partial exception is 16002 the reclaim (CLAIM_PREVIOUS) case, in which a delegation type is 16003 claimed. In this case, delegation will always be granted, although 16004 the server may specify an immediate recall in the delegation 16005 structure. 16007 The rflags returned by a successful OPEN allow the server to return 16008 information governing how the open file is to be handled. 16010 o OPEN4_RESULT_CONFIRM is deprecated and MUST NOT be returned by an 16011 NFSv4.1 server. 16013 o OPEN4_RESULT_LOCKTYPE_POSIX indicates the server's file locking 16014 behavior supports the complete set of POSIX locking techniques.
   From this, the client can choose to manage file locking state in a way that handles a mismatch of file locking management.

o  OPEN4_RESULT_PRESERVE_UNLINKED indicates that the server will preserve the open file if the client (or any other client) removes the file as long as it is open.  Furthermore, the server promises to preserve the file through the grace period after server reboot, thereby giving the client the opportunity to reclaim its open.

o  OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt CB_NOTIFY_LOCK callbacks for locks on this file.  This flag is a hint only, and may be safely ignored by the client.

If the component is of zero length, NFS4ERR_INVAL will be returned.  The component is also subject to the normal UTF-8, character support, and name checks.  See the section "UTF-8 Related Errors" for further discussion.

When an OPEN is done and the specified open-owner already has the resulting filehandle open, the result is to "OR" together the new share and deny status with the existing status.  In this case, only a single CLOSE need be done, even though multiple OPENs were completed.  When such an OPEN is done, checking of share reservations for the new OPEN proceeds normally, with no exception for the existing OPEN held by the same open-owner.

If the underlying file system at the server is only accessible in a read-only mode and the OPEN request has specified ACCESS_WRITE or ACCESS_BOTH, the server will return NFS4ERR_ROFS to indicate a read-only file system.

As with the CREATE operation, the server MUST derive the owner, owner ACE, group, or group ACE if any of the four attributes are required and supported by the server's file system.
For an OPEN with the EXCLUSIVE4 createmode, the server has no choice, since such OPEN calls do not include the createattrs field.  Conversely, if createattrs is specified and includes an owner or group (or corresponding ACEs) for which the principal in the RPC call's credentials does not have authorization to create files, then the server may return NFS4ERR_PERM.

In the case of an OPEN that specifies a size of zero (e.g., truncation) and the file has named attributes, the named attributes are left as is.  They are not removed.

NFSv4.1 gives more precise control to clients over acquisition of delegations via the following new flags for the share_access field of OPEN4args:

   OPEN4_SHARE_ACCESS_WANT_READ_DELEG

   OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG

   OPEN4_SHARE_ACCESS_WANT_ANY_DELEG

   OPEN4_SHARE_ACCESS_WANT_NO_DELEG

   OPEN4_SHARE_ACCESS_WANT_CANCEL

   OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL

   OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED

If (share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is not zero, then the client will have specified one and only one of:

   OPEN4_SHARE_ACCESS_WANT_READ_DELEG

   OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG

   OPEN4_SHARE_ACCESS_WANT_ANY_DELEG

   OPEN4_SHARE_ACCESS_WANT_NO_DELEG

   OPEN4_SHARE_ACCESS_WANT_CANCEL

Otherwise, the client is indicating no desire for a delegation, and it is up to the server whether to return a delegation in the OPEN response.

If the server supports the new _WANT_ flags and the client issues one or more of the new flags, then in the event the server does not return a delegation, it MUST return a delegation type of OPEN_DELEGATE_NONE_EXT.  od_whynone indicates why no delegation was returned and will be one of:

   WND4_NOT_WANTED  The client specified OPEN4_SHARE_ACCESS_WANT_NO_DELEG.
   WND4_CONTENTION  There is a conflicting delegation or open on the file.

   WND4_RESOURCE  Resource limitations prevent the server from granting a delegation.

   WND4_NOT_SUPP_FTYPE  The server does not support delegations on this file type.

   WND4_WRITE_DELEG_NOT_SUPP_FTYPE  The server does not support write delegations on this file type.

   WND4_NOT_SUPP_UPGRADE  The server does not support atomic upgrade of a read delegation to a write delegation.

   WND4_NOT_SUPP_DOWNGRADE  The server does not support atomic downgrade of a write delegation to a read delegation.

   WND4_CANCELED  The client specified OPEN4_SHARE_ACCESS_WANT_CANCEL and now any "want" for this file object is cancelled.

   WND4_IS_DIR  The specified file object is a directory, and the operation is OPEN or WANT_DELEGATION, neither of which supports delegations on directories.

OPEN4_SHARE_ACCESS_WANT_READ_DELEG, OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG, and OPEN4_SHARE_ACCESS_WANT_ANY_DELEG mean, respectively, that the client wants a read, write, or any delegation regardless of which of OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH is set.  If the client has a read delegation on a file and requests a write delegation, then the client is requesting atomic upgrade of its read delegation to a write delegation.  If the client has a write delegation on a file and requests a read delegation, then the client is requesting atomic downgrade to a read delegation.  A server MAY support atomic upgrade or downgrade.  If it does, then a returned delegation_type of OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE that is different from the delegation type the client currently has indicates a successful upgrade or downgrade.
If the server does not support atomic delegation upgrade or downgrade, then od_whynone will be WND4_NOT_SUPP_UPGRADE or WND4_NOT_SUPP_DOWNGRADE.

OPEN4_SHARE_ACCESS_WANT_NO_DELEG means the client wants no delegation.

OPEN4_SHARE_ACCESS_WANT_CANCEL means the client wants no delegation and wants to cancel any previously registered "want" for a delegation.

The client may set one or both of OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED.  However, they will have no effect unless one of the following is set:

   o  OPEN4_SHARE_ACCESS_WANT_READ_DELEG

   o  OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG

   o  OPEN4_SHARE_ACCESS_WANT_ANY_DELEG

If the client specifies OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL, then it wishes to register a "want" for a delegation in the event the OPEN results do not include a delegation.  If so, and the server denies the delegation due to insufficient resources, the server MAY later inform the client, via the CB_RECALLABLE_OBJ_AVAIL operation, that the resource limitation condition has eased.  The server will tell the client that it intends to send a future CB_RECALLABLE_OBJ_AVAIL operation by setting delegation_type in the results to OPEN_DELEGATE_NONE_EXT, ond_why to WND4_RESOURCE, and ond_server_will_signal_avail to TRUE.  If ond_server_will_signal_avail is set to TRUE, the server MUST later send a CB_RECALLABLE_OBJ_AVAIL operation.

If the client specifies OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED, then it wishes to register a "want" for a delegation in the event the OPEN results do not include a delegation.
If so, and the server denies the delegation due to contention, the server MAY later inform the client, via the CB_PUSH_DELEG operation, that the contention condition has eased.  The server will tell the client that it intends to send a future CB_PUSH_DELEG operation by setting delegation_type in the results to OPEN_DELEGATE_NONE_EXT, ond_why to WND4_CONTENTION, and ond_server_will_push_deleg to TRUE.  If ond_server_will_push_deleg is TRUE, the server MUST later send a CB_PUSH_DELEG operation.

If the client has previously registered a want for a delegation on a file, and then sends a request to register a want for a delegation on the same file, the server MUST return a new error: NFS4ERR_DELEG_ALREADY_WANTED.  If the client wishes to register a different type of delegation want for the same file, it MUST first cancel the existing delegation want.

17.16.5. IMPLEMENTATION

The OPEN operation contains support for EXCLUSIVE4 create.  The mechanism is similar to the support in NFS version 3 [18].  However, this mechanism is not needed if a server stores its reply cache in stable storage.  If the server indicates (via the csr_persist field in the response to CREATE_SESSION) that its reply cache is persistent, the client SHOULD NOT use OPEN's approach to exclusive create.

In the absence of csr_persist being TRUE, the client invokes exclusive create by setting the how parameter to EXCLUSIVE4.  In this case, the client provides a verifier that can reasonably be expected to be unique.  A combination of a client identifier, perhaps the client network address, and a unique number generated by the client, perhaps the RPC transaction identifier, may be appropriate.  This mechanism allows reliable exclusive create semantics even when the server does not support storing session reply information in stable storage.
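The verifier construction suggested above (client network address combined with an RPC transaction identifier) can be sketched as follows.  This is only one possible combination, and the byte layout shown here is purely illustrative, not a protocol requirement:

```python
import struct

def make_verifier(client_ipv4: str, xid: int) -> bytes:
    """Build an 8-byte exclusive-create verifier from the client's
    IPv4 address (4 bytes) and an RPC transaction id (4 bytes).
    Any reasonably unique 8-byte value would serve equally well."""
    addr = bytes(int(octet) for octet in client_ipv4.split("."))
    return addr + struct.pack(">I", xid & 0xFFFFFFFF)

# Distinct transaction ids yield distinct verifiers for the same client.
v1 = make_verifier("192.0.2.7", 0x1001)
v2 = make_verifier("192.0.2.7", 0x1002)
```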
If the object does not exist, the server creates the object and stores the verifier in stable storage.  For file systems that do not provide a mechanism for the storage of arbitrary file attributes, the server may use one or more elements of the object metadata to store the verifier.  The verifier must be stored in stable storage to prevent erroneous failure on retransmission of the request.  It is assumed that an exclusive create is being performed because exclusive semantics are critical to the application.  Because of the expected usage, exclusive CREATE does not rely solely on the normally volatile duplicate request cache for storage of the verifier.  The duplicate request cache in volatile storage does not survive a crash and may actually flush on a long network partition, opening failure windows.  In the UNIX local file system environment, the expected storage location for the verifier on creation is the metadata (time stamps) of the object.  For this reason, an exclusive object create may not include initial attributes because the server would have nowhere to store the verifier.

If the server cannot support these exclusive create semantics, possibly because of the requirement to commit the verifier to stable storage, it should fail the OPEN request with the error NFS4ERR_NOTSUPP.

During an exclusive CREATE request, if the object already exists, the server reconstructs the object's verifier and compares it with the verifier in the request.  If they match, the server treats the request as a success.  The request is presumed to be a duplicate of an earlier, successful request for which the reply was lost and that the server duplicate request cache mechanism did not detect.  If the verifiers do not match, the request is rejected with the status NFS4ERR_EXIST.
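The server-side decision procedure just described (create and store the verifier, treat a matching verifier as a retransmission, otherwise fail) can be modeled compactly.  In this sketch, `stable_store` is a hypothetical stand-in for the server's stable storage of name-to-verifier bindings:

```python
def exclusive_create(stable_store: dict, name: str, verifier: bytes) -> str:
    """Outcome of an EXCLUSIVE4 create on the server (sketch)."""
    if name not in stable_store:
        stable_store[name] = verifier   # verifier goes to stable storage
        return "NFS4_OK"
    if stable_store[name] == verifier:
        return "NFS4_OK"                # presumed retransmission of an
                                        # earlier, successful request
    return "NFS4ERR_EXIST"              # a different creator got there first

store = {}
exclusive_create(store, "f", b"A" * 8)  # creates the object
exclusive_create(store, "f", b"A" * 8)  # duplicate request: still NFS4_OK
exclusive_create(store, "f", b"B" * 8)  # conflicting create: NFS4ERR_EXIST
```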
Once the client has performed a successful exclusive create, it must issue a SETATTR to set the correct object attributes.  Until it does so, it should not rely upon any of the object attributes, since the server implementation may need to overload object metadata to store the verifier.  The subsequent SETATTR must not occur in the same COMPOUND request as the OPEN.  This separation will guarantee that the exclusive create mechanism will continue to function properly in the face of retransmission of the request.

Use of the GUARDED attribute does not provide exactly-once semantics.  In particular, if a reply is lost and the server does not detect the retransmission of the request, the operation can fail with NFS4ERR_EXIST, even though the create was performed successfully.  The client would use this behavior in the case that the application has not requested an exclusive create but has asked to have the file truncated when the file is opened.  In the case of the client timing out and retransmitting the create request, the client can use GUARDED to guard against a sequence such as create, write, create (retransmitted).

For SHARE reservations, the client must specify a value for share_access that is one of READ, WRITE, or BOTH.  For share_deny, the client must specify one of NONE, READ, WRITE, or BOTH.  If the client fails to do this, the server must return NFS4ERR_INVAL.

Based on the share_access value (READ, WRITE, or BOTH), the client should check that the requester has the proper access rights to perform the specified operation.  This would generally be the result of applying the ACL access rules to the file for the current requester.
However, just as with the ACCESS operation, the client should not attempt to second-guess the server's decisions, as access rights may change and may be subject to server administrative controls outside the ACL framework.  If the requester is not authorized to READ or WRITE (depending on the share_access value), the server must return NFS4ERR_ACCESS.  Note that since the NFS version 4 protocol does not impose any requirement that READs and WRITEs issued for an open file have the same credentials as the OPEN itself, the server still must do appropriate access checking on the READs and WRITEs themselves.

If the component provided to OPEN is a symbolic link, the error NFS4ERR_SYMLINK will be returned to the client.  If the current filehandle is not a directory, the error NFS4ERR_NOTDIR will be returned.

The use of the OPEN4_RESULT_PRESERVE_UNLINKED result flag allows a client to avoid the common implementation practice of renaming an open file to ".nfs" after it removes the file.  After the server returns OPEN4_RESULT_PRESERVE_UNLINKED, if a client issues a REMOVE operation that would reduce the file's link count to zero, the server SHOULD report a value of zero for the FATTR4_NUMLINKS attribute on the file.

17.16.5.1. WARNING TO CLIENT IMPLEMENTORS

OPEN resembles LOOKUP in that it generates a filehandle for the client to use.  Unlike LOOKUP though, OPEN creates server state on the filehandle.  In normal circumstances, the client can only release this state with a CLOSE operation.  CLOSE uses the current filehandle to determine which file to close.  Therefore, the client MUST follow every OPEN operation with a GETFH operation in the same COMPOUND procedure.  This will supply the client with the filehandle such that CLOSE can be used appropriately.
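The rule above can be illustrated with a sketch of COMPOUND construction.  The tuples below model only the order of operations, not real XDR encoding:

```python
def open_compound(dir_fh: bytes, name: str, share_access: int) -> list:
    """Client-side COMPOUND for opening a file: set the directory as the
    current filehandle, OPEN by component name, then GETFH so the
    resulting filehandle is available for a later CLOSE (sketch)."""
    return [
        ("PUTFH", dir_fh),             # current filehandle := directory
        ("OPEN", name, share_access),  # replaces the cfh with the opened file
        ("GETFH",),                    # hand that filehandle back to the client
    ]

ops = open_compound(b"\x0a\x0b", "report.txt", 1)
```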
Simply waiting for the lease on the file to expire is insufficient because the server may maintain the state indefinitely as long as another client does not attempt to make a conflicting access to the same file.

17.17. Operation 19: OPENATTR - Open Named Attribute Directory

17.17.1. SYNOPSIS

   (cfh) createdir -> (cfh)

17.17.2. ARGUMENTS

   /*
    * OPENATTR: open named attributes directory
    */
   struct OPENATTR4args {
           /* CURRENT_FH: object */
           bool    createdir;
   };

17.17.3. RESULTS

   struct OPENATTR4res {
           /* CURRENT_FH: named attr directory */
           nfsstat4        status;
   };

17.17.4. DESCRIPTION

The OPENATTR operation is used to obtain the filehandle of the named attribute directory associated with the current filehandle.  The result of the OPENATTR will be a filehandle to an object of type NF4ATTRDIR.  From this filehandle, READDIR and LOOKUP operations can be used to obtain filehandles for the various named attributes associated with the original file system object.  Filehandles returned within the named attribute directory will have a type of NF4NAMEDATTR.

The createdir argument allows the client to signify if a named attribute directory should be created as a result of the OPENATTR operation.  Some clients may use the OPENATTR operation with a value of FALSE for createdir to determine if any named attributes exist for the object.  If none exist, then NFS4ERR_NOENT will be returned.  If createdir has a value of TRUE and no named attribute directory exists, one is created.  The creation of a named attribute directory assumes that the server has implemented named attribute support in this fashion and is not required to do so by this definition.

17.17.5.
IMPLEMENTATION

If the server does not support named attributes for the current filehandle, an error of NFS4ERR_NOTSUPP will be returned to the client.

17.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access

17.18.1. SYNOPSIS

   (cfh), stateid, seqid, access, deny -> stateid

17.18.2. ARGUMENTS

   /*
    * OPEN_DOWNGRADE: downgrade the access/deny for a file
    */
   struct OPEN_DOWNGRADE4args {
           /* CURRENT_FH: opened file */
           stateid4        open_stateid;
           seqid4          seqid;
           uint32_t        share_access;
           uint32_t        share_deny;
   };

17.18.3. RESULTS

   struct OPEN_DOWNGRADE4resok {
           stateid4        open_stateid;
   };

   union OPEN_DOWNGRADE4res switch(nfsstat4 status) {
    case NFS4_OK:
           OPEN_DOWNGRADE4resok    resok4;
    default:
           void;
   };

17.18.4. DESCRIPTION

This operation is used to adjust the share_access and share_deny bits for a given open.  This is necessary when a given open-owner opens the same file multiple times with different share_access and share_deny flags.  In this situation, a close of one of the opens may change the appropriate share_access and share_deny flags to remove bits associated with opens no longer in effect.

The share_access and share_deny bits specified in this operation replace the current ones for the specified open file.  The share_access and share_deny bits specified must be exactly equal to the union of the share_access and share_deny bits specified for some subset of the OPENs in effect for the current open-owner on the current file.  If that constraint is not respected, the error NFS4ERR_INVAL should be returned.  Since share_access and share_deny bits are subsets of those already granted, it is not possible for this request to be denied because of conflicting share reservations.

The seqid value is not used in NFSv4.1.
If the value passed is not zero, the server MUST return an NFS4ERR_INVAL error.

On success, the current filehandle retains its value.

17.19. Operation 22: PUTFH - Set Current Filehandle

17.19.1. SYNOPSIS

   filehandle -> (cfh)

17.19.2. ARGUMENTS

   /*
    * PUTFH: Set current filehandle
    */
   struct PUTFH4args {
           nfs_fh4         object;
   };

17.19.3. RESULTS

   struct PUTFH4res {
           /* CURRENT_FH: */
           nfsstat4        status;
   };

17.19.4. DESCRIPTION

Replaces the current filehandle with the filehandle provided as an argument.

If the security mechanism used by the requester does not meet the requirements of the filehandle provided to this operation, the server MUST return NFS4ERR_WRONGSEC.

17.19.5. IMPLEMENTATION

Commonly used as the first operation in an NFS request to set the context for following operations.

17.20. Operation 23: PUTPUBFH - Set Public Filehandle

17.20.1. SYNOPSIS

   - -> (cfh)

17.20.2. ARGUMENT

   void;

17.20.3. RESULT

   /*
    * PUTPUBFH: Set public filehandle
    */
   struct PUTPUBFH4res {
           /* CURRENT_FH: public fh */
           nfsstat4        status;
   };

17.20.4. DESCRIPTION

Replaces the current filehandle with the filehandle that represents the public filehandle of the server's name space.  This filehandle may be different from the "root" filehandle, which may be associated with some other directory on the server.

The public filehandle represents the concepts embodied in RFC2054 [25], RFC2055 [26], and RFC2224 [32].  The intent for NFS version 4 is that the public filehandle (represented by the PUTPUBFH operation) be used as a method of providing WebNFS server compatibility with NFS versions 2 and 3.

The public filehandle and the root filehandle (represented by the PUTROOTFH operation) should be equivalent.
If the public and root filehandles are not equivalent, then the public filehandle MUST be a descendant of the root filehandle.

17.20.5. IMPLEMENTATION

Used as the first operation in an NFS request to set the context for following operations.

With the NFS version 2 and 3 public filehandle, the client is able to specify whether the path name provided in the LOOKUP should be evaluated as either an absolute path relative to the server's root or relative to the public filehandle.  RFC2224 [32] contains further discussion of the functionality.  With NFS version 4, that type of specification is not directly available in the LOOKUP operation.  The reason for this is that the component separators needed to specify absolute vs. relative are not allowed in NFS version 4.  Therefore, the client is responsible for constructing its request such that either PUTROOTFH or PUTPUBFH is used to signify absolute or relative evaluation of an NFS URL, respectively.

Note that there are warnings mentioned in RFC2224 [32] with respect to the use of absolute evaluation and the restrictions the server may place on that evaluation with respect to how much of its namespace has been made available.  These same warnings apply to NFS version 4.  It is likely, therefore, that because of server implementation details, an NFS version 3 absolute public filehandle lookup may behave differently than an NFS version 4 absolute resolution.

There is a form of security negotiation as described in RFC2755 [33] that uses the public filehandle as a method of employing SNEGO.  This method is not available with NFS version 4, as filehandles are not overloaded with special meaning and therefore do not provide the same framework as NFS versions 2 and 3.  Clients should therefore use the security negotiation mechanisms described in this RFC.
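A client following the paragraph above might choose its initial operation as sketched below.  The convention that a leading "/" in the residual URL path marks absolute evaluation is an assumption adopted here for illustration (in the spirit of the WebNFS conventions), not something this specification mandates:

```python
def first_op_for_nfs_url(residual_path: str) -> str:
    """Pick the first COMPOUND operation for evaluating an NFS URL:
    PUTROOTFH for absolute evaluation, PUTPUBFH for evaluation
    relative to the public filehandle (sketch)."""
    return "PUTROOTFH" if residual_path.startswith("/") else "PUTPUBFH"

first_op_for_nfs_url("/export/data")  # absolute -> PUTROOTFH
first_op_for_nfs_url("data")          # relative -> PUTPUBFH
```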
17.20.6. ERRORS

17.21. Operation 24: PUTROOTFH - Set Root Filehandle

17.21.1. SYNOPSIS

   - -> (cfh)

17.21.2. ARGUMENTS

   void;

17.21.3. RESULTS

   /*
    * PUTROOTFH: Set root filehandle
    */
   struct PUTROOTFH4res {
           /* CURRENT_FH: root fh */
           nfsstat4        status;
   };

17.21.4. DESCRIPTION

Replaces the current filehandle with the filehandle that represents the root of the server's name space.  From this filehandle, a LOOKUP operation can locate any other filehandle on the server.  This filehandle may be different from the "public" filehandle, which may be associated with some other directory on the server.

17.21.5. IMPLEMENTATION

Commonly used as the first operation in an NFS request to set the context for following operations.

17.22. Operation 25: READ - Read from File

17.22.1. SYNOPSIS

   (cfh), stateid, offset, count -> eof, data

17.22.2. ARGUMENTS

   /*
    * READ: Read from file
    */
   struct READ4args {
           /* CURRENT_FH: file */
           stateid4        stateid;
           offset4         offset;
           count4          count;
   };

17.22.3. RESULTS

   struct READ4resok {
           bool            eof;
           opaque          data<>;
   };

   union READ4res switch (nfsstat4 status) {
    case NFS4_OK:
           READ4resok      resok4;
    default:
           void;
   };

17.22.4. DESCRIPTION

The READ operation reads data from the regular file identified by the current filehandle.

The client provides an offset of where the READ is to start and a count of how many bytes are to be read.  An offset of 0 (zero) means to read data starting at the beginning of the file.  If offset is greater than or equal to the size of the file, the status NFS4_OK is returned with a data length set to 0 (zero) and eof set to TRUE.  The READ is subject to access permissions checking.
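The offset and count rules for READ can be summarized in a short model.  This sketch assumes the server returns as many bytes as are available; the server is always permitted to return fewer:

```python
def read_reply(file_size: int, offset: int, count: int):
    """Return (data_length, eof) for a READ, assuming the server
    returns as much data as it can (sketch)."""
    if offset >= file_size:
        return 0, True                      # reading at or past end-of-file
    n = min(count, file_size - offset)
    return n, (offset + n == file_size)     # eof TRUE when the read ends
                                            # exactly at end-of-file
```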
If the client specifies a count value of 0 (zero), the READ succeeds and returns 0 (zero) bytes of data, again subject to access permissions checking.  The server may choose to return fewer bytes than specified by the client.  The client needs to check for this condition and handle the condition appropriately.

The stateid value for a READ request represents a value returned from a previous record lock or share reservation request.  The stateid is used by the server to verify that the associated share reservation and any record locks are still valid and to update lease timeouts for the client.

If the read ended at the end-of-file (formally, in a correctly formed READ request, if offset + count is equal to the size of the file), or the read request extends beyond the size of the file (if offset + count is greater than the size of the file), eof is returned as TRUE; otherwise, it is FALSE.  A successful READ of an empty file will always return eof as TRUE.

If the current filehandle is not a regular file, an error will be returned to the client.  In the case the current filehandle represents a directory, NFS4ERR_ISDIR is returned; otherwise, NFS4ERR_INVAL is returned.

For a READ with a stateid value of all bits 0, the server MAY allow the READ to be serviced subject to mandatory file locks or the current share deny modes for the file.  For a READ with a stateid value of all bits 1, the server MAY allow READ operations to bypass locking checks at the server.

On success, the current filehandle retains its value.

17.22.5. IMPLEMENTATION

It is possible for the server to return fewer than count bytes of data.  If the server returns less than the count requested and eof is set to FALSE, the client should issue another READ to get the remaining data.
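The client loop that this implies can be sketched as follows; `read_op` is a hypothetical stand-in for sending one READ and returning its (data, eof) result:

```python
def read_fully(read_op, offset: int, count: int):
    """Re-issue READs after short, non-eof returns, as recommended
    above (sketch)."""
    buf, eof = b"", False
    while count > 0 and not eof:
        data, eof = read_op(offset, count)
        if not data and not eof:
            break                  # defensive: avoid looping forever on
                                   # a zero-byte, non-eof return
        buf += data
        offset += len(data)
        count -= len(data)
    return buf, eof
```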
A server may return less data than requested under several circumstances.  The file may have been truncated by another client or perhaps on the server itself, changing the file size from what the requesting client believes to be the case.  This would reduce the actual amount of data available to the client.  It is possible that the server may back off the transfer size and reduce the read request return.  Server resource exhaustion may also occur, necessitating a smaller read return.

If mandatory file locking is on for the file, and if the region corresponding to the data to be read from the file is write locked by an owner not associated with the stateid, the server will return the NFS4ERR_LOCKED error.  The client should try to get the appropriate read record lock via the LOCK operation before re-attempting the READ.  When the READ completes, the client should release the record lock via LOCKU.

17.23. Operation 26: READDIR - Read Directory

17.23.1. SYNOPSIS

   (cfh), cookie, cookieverf, dircount, maxcount, attr_request ->
   cookieverf { cookie, name, attrs }

17.23.2. ARGUMENTS

   /*
    * READDIR: Read directory
    */
   struct READDIR4args {
           /* CURRENT_FH: directory */
           nfs_cookie4     cookie;
           verifier4       cookieverf;
           count4          dircount;
           count4          maxcount;
           bitmap4         attr_request;
   };

17.23.3. RESULTS

   struct entry4 {
           nfs_cookie4     cookie;
           component4      name;
           fattr4          attrs;
           entry4          *nextentry;
   };

   struct dirlist4 {
           entry4          *entries;
           bool            eof;
   };

   struct READDIR4resok {
           verifier4       cookieverf;
           dirlist4        reply;
   };

   union READDIR4res switch (nfsstat4 status) {
    case NFS4_OK:
           READDIR4resok   resok4;
    default:
           void;
   };

17.23.4.
DESCRIPTION

The READDIR operation retrieves a variable number of entries from a file system directory and returns client-requested attributes for each entry, along with information to allow the client to request additional directory entries in a subsequent READDIR.

The arguments contain a cookie value that represents where the READDIR should start within the directory.  A value of 0 (zero) for the cookie is used to start reading at the beginning of the directory.  For subsequent READDIR requests, the client specifies a cookie value that is provided by the server on a previous READDIR request.

The cookieverf value should be set to 0 (zero) when the cookie value is 0 (zero) (first directory read).  On subsequent requests, it should be a cookieverf as returned by the server.  The cookieverf must match that returned by the READDIR in which the cookie was acquired.  If the server determines that the cookieverf is no longer valid for the directory, the error NFS4ERR_NOT_SAME must be returned.

The dircount portion of the argument is a hint of the maximum number of bytes of directory information that should be returned.  This value represents the length of the names of the directory entries and the cookie value for these entries.  This length represents the XDR encoding of the data (names and cookies) and not the length in the native format of the server.

The maxcount value of the argument is the maximum number of bytes for the result.  This maximum size represents all of the data being returned within the READDIR4resok structure and includes the XDR overhead.  The server may return less data.  If the server is unable to return a single directory entry within the maxcount limit, the error NFS4ERR_TOOSMALL will be returned to the client.
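A server's entry-selection logic under these two limits can be sketched as below.  The per-entry and per-reply overhead constants are illustrative only, not real XDR sizes; an empty result for a non-empty directory corresponds to the NFS4ERR_TOOSMALL case:

```python
def pack_entries(entries, dircount: int, maxcount: int, overhead: int = 16):
    """Decide how many directory entries fit in one READDIR reply
    (sketch).  `entries` is a list of (name, cookie, attr_len) tuples."""
    picked, name_bytes, reply_bytes = [], 0, overhead
    for name, cookie, attr_len in entries:
        e_names = len(name) + 8                  # name + cookie (hint budget)
        e_total = e_names + attr_len + overhead  # whole encoded entry
        if reply_bytes + e_total > maxcount:
            break                                # hard limit on the reply
        if dircount and name_bytes + e_names > dircount:
            break                                # advisory hint exhausted
        picked.append(name)
        name_bytes += e_names
        reply_bytes += e_total
    return picked
```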
   Finally, attr_request represents the list of attributes to be
   returned for each directory entry supplied by the server.

   On successful return, the server's response will provide a list of
   directory entries.  Each of these entries contains the name of the
   directory entry, a cookie value for that entry, and the associated
   attributes as requested.  The "eof" flag has a value of TRUE if there
   are no more entries in the directory.

   The cookie value is only meaningful to the server and is used as a
   "bookmark" for the directory entry.  As mentioned, this cookie is
   used by the client for subsequent READDIR operations so that it may
   continue reading a directory.  The cookie is similar in concept to a
   READ offset but should not be interpreted as such by the client.
   Ideally, the cookie value should not change if the directory is
   modified, since the client may be caching these values.

   In some cases, the server may encounter an error while obtaining the
   attributes for a directory entry.  Instead of returning an error for
   the entire READDIR operation, the server can instead return the
   attribute 'fattr4_rdattr_error'.  With this, the server is able to
   communicate the failure to the client and not fail the entire
   operation in the instance of what might be a transient failure.
   Obviously, the client must request the fattr4_rdattr_error attribute
   for this method to work properly.  If the client does not request the
   attribute, the server has no choice but to return failure for the
   entire READDIR operation.

   For some file system environments, the directory entries "." and ".."
   have special meaning, and in other environments, they may not.  If
   the server supports these special entries within a directory, they
   should not be returned to the client as part of the READDIR response.
   To enable some client environments, the cookie values of 0, 1, and 2
   are to be considered reserved.  Note that the UNIX client will use
   these values when combining the server's response and local
   representations to enable a fully formed UNIX directory presentation
   to the application.

   For READDIR arguments, cookie values of 1 and 2 should not be used,
   and for READDIR results, cookie values of 0, 1, and 2 should not be
   returned.

   On success, the current filehandle retains its value.

17.23.5.  IMPLEMENTATION

   The server's file system directory representations can differ
   greatly.  A client's programming interfaces may also be bound to the
   local operating environment in a way that does not translate well
   into the NFS protocol.  Therefore, the dircount and maxcount fields
   are provided to allow the client to give guidelines to the server.
   If the client is aggressive about attribute collection during a
   READDIR, the server has an idea of how to limit the encoded response.
   The dircount field provides a hint of the number of entries based
   solely on the names of the directory entries.  Since it is a hint, a
   dircount value of zero is possible.  In this case, the server is free
   to ignore the dircount value and return directory information based
   on the specified maxcount value.

   The cookieverf may be used by the server to help manage cookie values
   that may become stale.  It should be a rare occurrence that a server
   is unable to continue properly reading a directory with the provided
   cookie/cookieverf pair.  The server should make every effort to avoid
   this condition since the application at the client may not be able to
   properly handle this type of failure.

   The use of the cookieverf will also protect the client from using
   READDIR cookie values that may be stale.
   For example, if the file system has been migrated, the server may or
   may not be able to use the same cookie values to service READDIR as
   the previous server used.  With the client providing the cookieverf,
   the server is able to provide the appropriate response to the client.
   This prevents the case where the server may accept a cookie value but
   the underlying directory has changed and the response is invalid from
   the client's context of its previous READDIR.

   Since some servers will not be returning "." and ".." entries as has
   been done with previous versions of the NFS protocol, a client that
   requires these entries to be present in READDIR responses must
   fabricate them.

17.24.  Operation 27: READLINK - Read Symbolic Link

17.24.1.  SYNOPSIS

   (cfh) -> linktext

17.24.2.  ARGUMENTS

   /* CURRENT_FH: symlink */
   void;

17.24.3.  RESULTS

   /*
    * READLINK: Read symbolic link
    */
   struct READLINK4resok {
           linktext4       link;
   };

   union READLINK4res switch (nfsstat4 status) {
    case NFS4_OK:
           READLINK4resok  resok4;
    default:
           void;
   };

17.24.4.  DESCRIPTION

   READLINK reads the data associated with a symbolic link.  The data is
   a UTF-8 string that is opaque to the server.  That is, whether
   created by an NFS client or created locally on the server, the data
   in a symbolic link is not interpreted when created, but is simply
   stored.

   On success, the current filehandle retains its value.

17.24.5.  IMPLEMENTATION

   A symbolic link is nominally a pointer to another file.  The data is
   not necessarily interpreted by the server, just stored in the file.
   It is possible for a client implementation to store a path name that
   is not meaningful to the server operating system in a symbolic link.
   A READLINK operation returns the data to the client for
   interpretation.
   If different implementations want to share access to symbolic links,
   then they must agree on the interpretation of the data in the
   symbolic link.

   The READLINK operation is only allowed on objects of type NF4LNK.
   The server should return the error NFS4ERR_INVAL if the object is not
   of type NF4LNK.

17.25.  Operation 28: REMOVE - Remove File System Object

17.25.1.  SYNOPSIS

   (cfh), filename -> change_info

17.25.2.  ARGUMENTS

   /*
    * REMOVE: Remove filesystem object
    */
   struct REMOVE4args {
           /* CURRENT_FH: directory */
           component4      target;
   };

17.25.3.  RESULTS

   struct REMOVE4resok {
           change_info4    cinfo;
   };

   union REMOVE4res switch (nfsstat4 status) {
    case NFS4_OK:
           REMOVE4resok    resok4;
    default:
           void;
   };

17.25.4.  DESCRIPTION

   The REMOVE operation removes (deletes) a directory entry named by
   filename from the directory corresponding to the current filehandle.
   If the entry in the directory was the last reference to the
   corresponding file system object, the object may be destroyed.

   For the directory where the filename was removed, the server returns
   change_info4 information in cinfo.  With the atomic field of the
   change_info4 struct, the server will indicate if the before and after
   change attributes were obtained atomically with respect to the
   removal.

   If the target has a length of 0 (zero), or if target does not obey
   the UTF-8 definition, the error NFS4ERR_INVAL will be returned.

   On success, the current filehandle retains its value.

17.25.5.  IMPLEMENTATION

   NFS versions 2 and 3 required a different operation, RMDIR, for
   directory removal, and REMOVE for non-directory removal.  This
   allowed clients to skip checking the file type when being passed a
   non-directory delete system call (e.g.
   unlink() in POSIX) to remove a directory, as well as the converse
   (e.g. a rmdir() on a non-directory), because they knew the server
   would check the file type.  NFS version 4 REMOVE can be used to
   delete any directory entry independent of its file type.  The
   implementor of an NFS version 4 client's entry points from the
   unlink() and rmdir() system calls should first check the file type
   against the types the system call is allowed to remove before issuing
   a REMOVE.  Alternatively, the implementor can produce a COMPOUND call
   that includes a LOOKUP/VERIFY sequence to verify the file type before
   a REMOVE operation in the same COMPOUND call.

   The concept of last reference is server specific.  However, if the
   numlinks field in the previous attributes of the object had the value
   1, the client should not rely on referring to the object via a
   filehandle.  Likewise, the client should not rely on the resources
   (disk space, directory entry, and so on) formerly associated with the
   object becoming immediately available.  Thus, if a client needs to be
   able to continue to access a file after using REMOVE to remove it,
   the client should take steps to make sure that the file will still be
   accessible.  The usual mechanism used is to RENAME the file from its
   old name to a new hidden name.

   If the server finds that the file is still open when the REMOVE
   arrives:

   o  The server SHOULD NOT delete the file's directory entry if the
      file was opened with OPEN4_SHARE_DENY_WRITE or
      OPEN4_SHARE_DENY_BOTH.

   o  If the file was not opened with OPEN4_SHARE_DENY_WRITE or
      OPEN4_SHARE_DENY_BOTH, the server SHOULD delete the file's
      directory entry.  However, until the last CLOSE of the file, the
      server MAY continue to allow access to the file via its
      filehandle.

17.26.  Operation 29: RENAME - Rename Directory Entry

17.26.1.
SYNOPSIS

   (sfh), oldname, (cfh), newname -> source_change_info,
   target_change_info

17.26.2.  ARGUMENTS

   /*
    * RENAME: Rename directory entry
    */
   struct RENAME4args {
           /* SAVED_FH: source directory */
           component4      oldname;
           /* CURRENT_FH: target directory */
           component4      newname;
   };

17.26.3.  RESULTS

   struct RENAME4resok {
           change_info4    source_cinfo;
           change_info4    target_cinfo;
   };

   union RENAME4res switch (nfsstat4 status) {
    case NFS4_OK:
           RENAME4resok    resok4;
    default:
           void;
   };

17.26.4.  DESCRIPTION

   The RENAME operation renames the object identified by oldname in the
   source directory corresponding to the saved filehandle, as set by the
   SAVEFH operation, to newname in the target directory corresponding to
   the current filehandle.  The operation is required to be atomic to
   the client.  Source and target directories must reside on the same
   file system on the server.  On success, the current filehandle will
   continue to be the target directory.

   If the target directory already contains an entry with the name
   newname, the source object must be compatible with the target: either
   both are non-directories, or both are directories and the target must
   be empty.  If compatible, the existing target is removed before the
   rename occurs (see the IMPLEMENTATION subsection of the section
   "Operation 28: REMOVE - Remove File System Object" for client and
   server actions whenever a target is removed).  If they are not
   compatible, or if the target is a directory but not empty, the server
   will return the error NFS4ERR_EXIST.

   If oldname and newname both refer to the same file (they might be
   hard links of each other), then RENAME should perform no action and
   return success.
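   The target-compatibility rules in the paragraphs above can be
   condensed into a small decision function.  This is a sketch over a
   hypothetical in-memory directory model (dicts mapping names to
   objects carrying a 'type', an identity 'id', and, for directories,
   'entries'); a real server would operate on its own file system
   structures.

```python
# Sketch of the RENAME target-compatibility rules; the directory model
# (plain dicts) is a hypothetical stand-in for server data structures.

def nfs4_rename(src_dir, oldname, dst_dir, newname):
    """Return 'NFS4_OK' or an NFS4ERR_* name, applying the rules:
    same-object rename is a no-op; an existing target must be
    compatible (both directories with the target empty, or both
    non-directories), and a compatible target is removed first."""
    if not oldname or not newname:
        return "NFS4ERR_INVAL"          # zero-length names are invalid
    if oldname not in src_dir:
        return "NFS4ERR_NOENT"
    src = src_dir[oldname]
    if newname in dst_dir:
        dst = dst_dir[newname]
        if src["id"] == dst["id"]:
            return "NFS4_OK"            # same file (e.g. hard links): no action
        if (src["type"] == "dir") != (dst["type"] == "dir"):
            return "NFS4ERR_EXIST"      # directory vs. non-directory
        if dst["type"] == "dir" and dst["entries"]:
            return "NFS4ERR_EXIST"      # non-empty target directory
        del dst_dir[newname]            # compatible target removed first
    dst_dir[newname] = src_dir.pop(oldname)
    return "NFS4_OK"
```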
   For both directories involved in the RENAME, the server returns
   change_info4 information.  With the atomic field of the change_info4
   struct, the server will indicate if the before and after change
   attributes were obtained atomically with respect to the rename.

   If the oldname refers to a named attribute and the saved and current
   filehandles refer to different file system objects, the server will
   return NFS4ERR_XDEV, just as if the saved and current filehandles
   represented directories on different file systems.

   If the oldname or newname has a length of 0 (zero), or if oldname or
   newname does not obey the UTF-8 definition, the error NFS4ERR_INVAL
   will be returned.

17.26.5.  IMPLEMENTATION

   The RENAME operation must be atomic to the client.  The statement
   "source and target directories must reside on the same file system on
   the server" means that the fsid fields in the attributes for the
   directories are the same.  If they reside on different file systems,
   the error NFS4ERR_XDEV is returned.

   Based on the value of the fh_expire_type attribute for the object,
   the filehandle may or may not expire on a RENAME.  However, server
   implementors are strongly encouraged to attempt to keep filehandles
   from expiring in this fashion.

   On some servers, the file names "." and ".." are illegal as either
   oldname or newname, and will result in the error NFS4ERR_BADNAME.  In
   addition, on many servers the case of oldname or newname being an
   alias for the source directory will be checked for.  Such servers
   will return the error NFS4ERR_INVAL in these cases.

   If either of the source or target filehandles are not directories,
   the server will return NFS4ERR_NOTDIR.

17.27.  Operation 31: RESTOREFH - Restore Saved Filehandle

17.27.1.  SYNOPSIS

   (sfh) -> (cfh)

17.27.2.
ARGUMENTS

   /* SAVED_FH: */
   void;

17.27.3.  RESULTS

   /*
    * RESTOREFH: Restore saved filehandle
    */
   struct RESTOREFH4res {
           /* CURRENT_FH: value of saved fh */
           nfsstat4        status;
   };

17.27.4.  DESCRIPTION

   Set the current filehandle to the value in the saved filehandle.  If
   there is no saved filehandle, then return the error NFS4ERR_RESTOREFH.

17.27.5.  IMPLEMENTATION

   Operations like OPEN and LOOKUP use the current filehandle to
   represent a directory and replace it with a new filehandle.  Assuming
   the previous filehandle was saved with a SAVEFH operator, the
   previous filehandle can be restored as the current filehandle.  This
   is commonly used to obtain post-operation attributes for the
   directory, e.g.

           PUTFH (directory filehandle)
           SAVEFH
           GETATTR attrbits     (pre-op dir attrs)
           CREATE optbits "foo" attrs
           GETATTR attrbits     (file attributes)
           RESTOREFH
           GETATTR attrbits     (post-op dir attrs)

17.27.6.  ERRORS

17.28.  Operation 32: SAVEFH - Save Current Filehandle

17.28.1.  SYNOPSIS

   (cfh) -> (sfh)

17.28.2.  ARGUMENTS

   /* CURRENT_FH: */
   void;

17.28.3.  RESULTS

   /*
    * SAVEFH: Save current filehandle
    */
   struct SAVEFH4res {
           /* SAVED_FH: value of current fh */
           nfsstat4        status;
   };

17.28.4.  DESCRIPTION

   Save the current filehandle.  If a previous filehandle was saved,
   then it is no longer accessible.  The saved filehandle can be
   restored as the current filehandle with the RESTOREFH operator.

   On success, the current filehandle retains its value.

17.28.5.  IMPLEMENTATION

17.29.  Operation 33: SECINFO - Obtain Available Security

17.29.1.  SYNOPSIS

   (cfh), name -> { secinfo }

17.29.2.
ARGUMENTS

   /*
    * SECINFO: Obtain Available Security Mechanisms
    */
   struct SECINFO4args {
           /* CURRENT_FH: directory */
           component4      name;
   };

17.29.3.  RESULTS

   /*
    * From RFC 2203
    */
   enum rpc_gss_svc_t {
           RPC_GSS_SVC_NONE        = 1,
           RPC_GSS_SVC_INTEGRITY   = 2,
           RPC_GSS_SVC_PRIVACY     = 3
   };

   struct rpcsec_gss_info {
           sec_oid4        oid;
           qop4            qop;
           rpc_gss_svc_t   service;
   };

   /* RPCSEC_GSS has a value of '6' - See RFC 2203 */
   union secinfo4 switch (uint32_t flavor) {
    case RPCSEC_GSS:
           rpcsec_gss_info flavor_info;
    default:
           void;
   };

   typedef secinfo4 SECINFO4resok<>;

   union SECINFO4res switch (nfsstat4 status) {
    case NFS4_OK:
           SECINFO4resok   resok4;
    default:
           void;
   };

17.29.4.  DESCRIPTION

   The SECINFO operation is used by the client to obtain a list of valid
   RPC authentication flavors for a specific directory filehandle, file
   name pair.  SECINFO should apply the same access methodology used for
   LOOKUP when evaluating the name.  Therefore, if the requester does
   not have the appropriate access to LOOKUP the name, then SECINFO must
   behave the same way and return NFS4ERR_ACCESS.

   The result will contain an array that represents the security
   mechanisms available, with an order corresponding to the server's
   preferences, the most preferred being first in the array.  The client
   is free to pick whatever security mechanism it both desires and
   supports, or to pick in the server's preference order the first one
   it supports.  The array entries are represented by the secinfo4
   structure.  The field 'flavor' will contain a value of AUTH_NONE,
   AUTH_SYS (as defined in RFC1831 [4]), or RPCSEC_GSS (as defined in
   RFC2203 [5]).  The field flavor can also be any other security flavor
   registered with IANA.
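   A client processing a SECINFO reply can simply walk the array in the
   server's preference order and take the first mechanism it supports.
   The sketch below is illustrative: the tuple layout and the "krb5"
   string are stand-ins for the XDR secinfo4 array and a real GSS
   mechanism OID.

```python
# Sketch of client-side flavor selection from a SECINFO reply.
# Each reply entry is modeled as (flavor, flavor_info_or_None),
# already in the server's preference order.
AUTH_NONE, AUTH_SYS, RPCSEC_GSS = 0, 1, 6  # flavor numbers, RFC 1831/2203

def choose_flavor(secinfo_reply, client_supported):
    """Return the first (flavor, info) the client supports, honoring the
    server's ordering; for RPCSEC_GSS the mechanism OID must also be
    supported.  Returns None if there is no common mechanism."""
    for flavor, info in secinfo_reply:
        if flavor == RPCSEC_GSS:
            if info is not None and \
               info["oid"] in client_supported.get(RPCSEC_GSS, ()):
                return flavor, info
        elif flavor in client_supported:
            return flavor, None
    return None

# Hypothetical server reply: prefers an RPCSEC_GSS triple, then AUTH_SYS.
reply = [(RPCSEC_GSS, {"oid": "krb5", "qop": 0, "service": "integrity"}),
         (AUTH_SYS, None)]
```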
   For the flavors AUTH_NONE and AUTH_SYS, no additional security
   information is returned.  The same is true of many (if not most)
   other security flavors, including AUTH_DH.  For a return value of
   RPCSEC_GSS, a security triple is returned that contains the mechanism
   object id (as defined in RFC2743 [8]), the quality of protection (as
   defined in RFC2743 [8]), and the service type (as defined in RFC2203
   [5]).  It is possible for SECINFO to return multiple entries with
   flavor equal to RPCSEC_GSS with different security triple values.

   On success, the current filehandle retains its value.

   If the name has a length of 0 (zero), or if name does not obey the
   UTF-8 definition, the error NFS4ERR_INVAL will be returned.

17.29.5.  IMPLEMENTATION

   The SECINFO operation is expected to be used by the NFS client when
   the error value of NFS4ERR_WRONGSEC is returned from another NFS
   operation.  This signifies to the client that the server's security
   policy is different from what the client is currently using.  At this
   point, the client is expected to obtain a list of possible security
   flavors and choose what best suits its policies.

   As mentioned, the server's security policies will determine when a
   client request receives NFS4ERR_WRONGSEC.  The operations that may
   receive this error are: LINK, LOOKUP, LOOKUPP, OPEN, PUTFH, PUTPUBFH,
   PUTROOTFH, RESTOREFH, RENAME, and indirectly READDIR.  LINK and
   RENAME will only receive this error if the security used for the
   operation is inappropriate for the saved filehandle.  With the
   exception of READDIR, these operations represent the point at which
   the client can instantiate a filehandle into the "current filehandle"
   at the server.  The filehandle is either provided by the client
   (PUTFH, PUTPUBFH, PUTROOTFH) or generated as a result of a name to
   filehandle translation (LOOKUP and OPEN).
   RESTOREFH is different because the filehandle is a result of a
   previous SAVEFH.  Even though the filehandle, for RESTOREFH, might
   have previously passed the server's inspection for a security match,
   the server will check it again on RESTOREFH to ensure that the
   security policy has not changed.

   If the client wants to resolve an error return of NFS4ERR_WRONGSEC,
   the following will occur:

   o  For LOOKUP and OPEN, the client will use SECINFO with the same
      current filehandle and name as provided in the original LOOKUP or
      OPEN to enumerate the available security triples.

   o  For LINK, PUTFH, PUTROOTFH, PUTPUBFH, RENAME, and RESTOREFH, the
      client will use SECINFO_NO_NAME { style =
      SECINFO_STYLE4_CURRENT_FH }.  The client will prefix the
      SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or
      PUTROOTFH operation that provides the filehandle originally
      provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH, or, for
      the failed LINK or RENAME, the SAVEFH.

   o  NOTE: In NFSv4.0, the client was required to use SECINFO, and had
      to reconstruct the parent of the original filehandle and the
      component name of the original filehandle.

   o  For LOOKUPP, the client will use SECINFO_NO_NAME { style =
      SECINFO_STYLE4_PARENT } and provide the filehandle that equals the
      filehandle originally provided to LOOKUPP.

   The READDIR operation will not directly return the NFS4ERR_WRONGSEC
   error.  However, if the READDIR request included a request for
   attributes, it is possible that the READDIR request's security triple
   did not match that of a directory entry.  If this is the case and the
   client has requested the rdattr_error attribute, the server will
   return the NFS4ERR_WRONGSEC error in rdattr_error for the entry.
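   The recovery choices in the list above amount to a simple mapping
   from the failed operation to the query the client issues next.  The
   sketch below is illustrative only; the returned strings are
   descriptive, not protocol constants.

```python
# Sketch of the NFS4ERR_WRONGSEC recovery mapping described above.
def wrongsec_recovery(failed_op):
    """Map a failed operation to the SECINFO variant used to recover."""
    if failed_op in ("LOOKUP", "OPEN"):
        # Reuse the same current filehandle and name.
        return "SECINFO(cfh, name)"
    if failed_op in ("LINK", "PUTFH", "PUTROOTFH", "PUTPUBFH",
                     "RENAME", "RESTOREFH"):
        # Query the filehandle itself, re-established by a PUTFH-class op.
        return "SECINFO_NO_NAME(style=SECINFO_STYLE4_CURRENT_FH)"
    if failed_op == "LOOKUPP":
        # Query relative to the parent of the provided filehandle.
        return "SECINFO_NO_NAME(style=SECINFO_STYLE4_PARENT)"
    # READDIR, for instance, reports the problem per entry via
    # rdattr_error rather than failing with NFS4ERR_WRONGSEC.
    raise ValueError("operation does not return NFS4ERR_WRONGSEC directly")
```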
   See the section "Security Considerations" for a discussion of the
   recommendations for the security flavor used by SECINFO and
   SECINFO_NO_NAME.

17.30.  Operation 34: SETATTR - Set Attributes

17.30.1.  SYNOPSIS

   (cfh), stateid, attrmask, attr_vals -> attrsset

17.30.2.  ARGUMENTS

   /*
    * SETATTR: Set attributes
    */
   struct SETATTR4args {
           /* CURRENT_FH: target object */
           stateid4        stateid;
           fattr4          obj_attributes;
   };

17.30.3.  RESULTS

   struct SETATTR4res {
           nfsstat4        status;
           bitmap4         attrsset;
   };

17.30.4.  DESCRIPTION

   The SETATTR operation changes one or more of the attributes of a file
   system object.  The new attributes are specified with a bitmap and
   the attributes that follow the bitmap in bit order.

   The stateid argument for SETATTR is used to provide file locking
   context that is necessary for SETATTR requests that set the size
   attribute.  Since setting the size attribute modifies the file's
   data, it has the same locking requirements as a corresponding WRITE.
   Any SETATTR that sets the size attribute is incompatible with a share
   reservation that specifies DENY_WRITE.  The area between the old end-
   of-file and the new end-of-file is considered to be modified just as
   would have been the case had the area in question been specified as
   the target of WRITE, for the purpose of checking conflicts with
   record locks, for those cases in which a server is implementing
   mandatory record locking behavior.  A valid stateid should always be
   specified.  When the file size attribute is not set, the special
   stateid consisting of all bits zero should be passed.

   On either success or failure of the operation, the server will return
   the attrsset bitmask to represent what (if any) attributes were
   successfully set.
   The attrsset in the response is a subset of the bitmap4 that is part
   of the obj_attributes in the argument.

   On success, the current filehandle retains its value.

17.30.5.  IMPLEMENTATION

   If the request specifies the owner attribute to be set, the server
   should allow the operation to succeed if the current owner of the
   object matches the value specified in the request.  Some servers may
   be implemented in such a way as to prohibit the setting of the owner
   attribute unless the requester has privilege to do so.  If the server
   is lenient in this one case of matching owner values, the client
   implementation may be simplified in cases of creation of an object
   followed by a SETATTR.

   The file size attribute is used to request changes to the size of a
   file.  A value of 0 (zero) causes the file to be truncated, a value
   less than the current size of the file causes data from the new size
   to the end of the file to be discarded, and a size greater than the
   current size of the file causes logically zeroed data bytes to be
   added to the end of the file.  Servers are free to implement this
   using holes or actual zero data bytes.  Clients should not make any
   assumptions regarding a server's implementation of this feature,
   beyond that the bytes returned will be zeroed.  Servers must support
   extending the file size via SETATTR.

   SETATTR is not guaranteed to be atomic.  A failed SETATTR may
   partially change a file's attributes.

   Changing the size of a file with SETATTR indirectly changes the
   time_modify.  A client must account for this, as size changes can
   result in data deletion.

   The attributes time_access_set and time_modify_set are write-only
   attributes constructed as a switched union so the client can direct
   the server in setting the time values.
   If the switched union specifies SET_TO_CLIENT_TIME4, the client has
   provided an nfstime4 to be used for the operation.  If the switched
   union does not specify SET_TO_CLIENT_TIME4, the server is to use its
   current time for the SETATTR operation.

   If server and client times differ, programs that compare client time
   to file times can break.  A time maintenance protocol should be used
   to limit client/server time skew.

   Use of a COMPOUND containing a VERIFY operation specifying only the
   change attribute, immediately followed by a SETATTR, provides a means
   whereby a client may specify a request that emulates the
   functionality of the SETATTR guard mechanism of NFS version 3.  Since
   the function of the guard mechanism is to avoid changes to the file
   attributes based on stale information, delays between the checking of
   the guard condition and the setting of the attributes have the
   potential to compromise this function, as would the corresponding
   delay in the NFS version 4 emulation.  Therefore, NFS version 4
   servers should take care to avoid such delays, to the degree
   possible, when executing such a request.

   If the server does not support an attribute as requested by the
   client, the server should return NFS4ERR_ATTRNOTSUPP.

   A mask of the attributes actually set is returned by SETATTR in all
   cases.  That mask must not include attribute bits not requested to be
   set by the client, and must be equal to the mask of attributes
   requested to be set only if the SETATTR completes without error.

17.31.  Operation 37: VERIFY - Verify Same Attributes

17.31.1.  SYNOPSIS

   (cfh), fattr -> -

17.31.2.  ARGUMENTS

   /*
    * VERIFY: Verify attributes same
    */
   struct VERIFY4args {
           /* CURRENT_FH: object */
           fattr4          obj_attributes;
   };

17.31.3.
RESULTS

   struct VERIFY4res {
           nfsstat4        status;
   };

17.31.4.  DESCRIPTION

   The VERIFY operation is used to verify that attributes have a value
   assumed by the client before proceeding with the following operations
   in the compound request.  If any of the attributes do not match, then
   the error NFS4ERR_NOT_SAME must be returned.  The current filehandle
   retains its value after successful completion of the operation.

17.31.5.  IMPLEMENTATION

   One possible use of the VERIFY operation is the following compound
   sequence.  With this, the client is attempting to verify that the
   file being removed will match what the client expects to be removed.
   This sequence can help prevent the unintended deletion of a file.

           PUTFH (directory filehandle)
           LOOKUP (file name)
           VERIFY (filehandle == fh)
           PUTFH (directory filehandle)
           REMOVE (file name)

   This sequence does not prevent a second client from removing and
   creating a new file in the middle of this sequence, but it does help
   avoid the unintended result.

   In the case that a recommended attribute is specified in the VERIFY
   operation and the server does not support that attribute for the file
   system object, the error NFS4ERR_ATTRNOTSUPP is returned to the
   client.

   When the attribute rdattr_error or any write-only attribute (e.g.
   time_modify_set) is specified, the error NFS4ERR_INVAL is returned to
   the client.

17.32.  Operation 38: WRITE - Write to File

17.32.1.  SYNOPSIS

   (cfh), stateid, offset, stable, data -> count, committed, writeverf

17.32.2.  ARGUMENTS

   /*
    * WRITE: Write to file
    */
   enum stable_how4 {
           UNSTABLE4       = 0,
           DATA_SYNC4      = 1,
           FILE_SYNC4      = 2
   };

   struct WRITE4args {
           /* CURRENT_FH: file */
           stateid4        stateid;
           offset4         offset;
           stable_how4     stable;
           opaque          data<>;
   };

17.32.3.
RESULTS

   struct WRITE4resok {
           count4          count;
           stable_how4     committed;
           verifier4       writeverf;
   };

   union WRITE4res switch (nfsstat4 status) {
    case NFS4_OK:
           WRITE4resok     resok4;
    default:
           void;
   };

17.32.4.  DESCRIPTION

   The WRITE operation is used to write data to a regular file.  The
   target file is specified by the current filehandle.  The offset
   specifies the offset where the data should be written.  An offset of
   0 (zero) specifies that the write should start at the beginning of
   the file.  The count, as encoded as part of the opaque data
   parameter, represents the number of bytes of data that are to be
   written.  If the count is 0 (zero), the WRITE will succeed and return
   a count of 0 (zero), subject to permissions checking.  The server may
   choose to write fewer bytes than requested by the client.

   Part of the write request is a specification of how the write is to
   be performed.  The client specifies with the stable parameter the
   method of how the data is to be processed by the server.  If stable
   is FILE_SYNC4, the server must commit the data written plus all file
   system metadata to stable storage before returning results.  This
   corresponds to the NFS version 2 protocol semantics.  Any other
   behavior constitutes a protocol violation.  If stable is DATA_SYNC4,
   then the server must commit all of the data to stable storage and
   enough of the metadata to retrieve the data before returning.  The
   server implementor is free to implement DATA_SYNC4 in the same
   fashion as FILE_SYNC4, but with a possible performance drop.  If
   stable is UNSTABLE4, the server is free to commit any part of the
   data and the metadata to stable storage, including all or none,
   before returning a reply to the client.
   There is no guarantee whether or when any uncommitted data will
   subsequently be committed to stable storage.  The only guarantees
   made by the server are that it will not destroy any data without
   changing the value of verf and that it will not commit the data and
   metadata at a level less than that requested by the client.

   The stateid value for a WRITE request represents a value returned
   from a previous record lock or share reservation request.  The
   stateid is used by the server to verify that the associated share
   reservation and any record locks are still valid and to update lease
   timeouts for the client.

   Upon successful completion, the following results are returned.  The
   count result is the number of bytes of data written to the file.  The
   server may write fewer bytes than requested.  If so, the actual
   number of bytes written starting at location offset is returned.

   The server also returns an indication of the level of commitment of
   the data and metadata via committed.  If the server committed all
   data and metadata to stable storage, committed should be set to
   FILE_SYNC4.  If the level of commitment was at least as strong as
   DATA_SYNC4, then committed should be set to DATA_SYNC4.  Otherwise,
   committed must be returned as UNSTABLE4.  If stable was FILE_SYNC4,
   then committed must also be FILE_SYNC4: anything else constitutes a
   protocol violation.  If stable was DATA_SYNC4, then committed may be
   FILE_SYNC4 or DATA_SYNC4: anything else constitutes a protocol
   violation.  If stable was UNSTABLE4, then committed may be either
   FILE_SYNC4, DATA_SYNC4, or UNSTABLE4.

   The final portion of the result is the write verifier.  The write
   verifier is a cookie that the client can use to determine whether the
   server has changed instance (boot) state between a call to WRITE and
   a subsequent call to either WRITE or COMMIT.
This cookie must be 17545 consistent during a single instance of the NFS version 4 protocol 17546 service and must be unique between instances of the NFS version 4 17547 protocol server, where uncommitted data may be lost. 17549 If a client writes data to the server with the stable argument set to 17550 UNSTABLE4 and the reply yields a committed response of DATA_SYNC4 or 17551 UNSTABLE4, the client will follow up some time in the future with a 17552 COMMIT operation to synchronize outstanding asynchronous data and 17553 metadata with the server's stable storage, barring client error. It 17554 is possible that, due to client crash or other error, a subsequent 17555 COMMIT will not be received by the server. 17557 For a WRITE with a stateid value of all bits 0, the server MAY allow 17558 the WRITE to be serviced subject to mandatory file locks or the 17559 current share deny modes for the file. For a WRITE with a stateid 17560 value of all bits 1, the server MUST NOT allow the WRITE operation to 17561 bypass locking checks at the server; such a WRITE is treated exactly the same 17562 as if a stateid of all bits 0 were used. 17564 On success, the current filehandle retains its value. 17566 17.32.5. IMPLEMENTATION 17568 It is possible for the server to write fewer bytes of data than 17569 requested by the client. In this case, the server should not return 17570 an error unless no data was written at all. If the server writes 17571 less than the number of bytes specified, the client should issue 17572 another WRITE to write the remaining data. 17574 It is assumed that the act of writing data to a file will cause the 17575 time_modified of the file to be updated. However, the time_modified 17576 of the file should not be changed unless the contents of the file are 17577 changed. Thus, a WRITE request with count set to 0 should not cause 17578 the time_modified of the file to be updated. 17580 The definition of stable storage has historically been a point of 17581 contention.
The following expected properties of stable storage may 17582 help in resolving design issues in the implementation. Stable 17583 storage is persistent storage that survives: 17585 1. Repeated power failures. 17587 2. Hardware failures (of any board, power supply, etc.). 17589 3. Repeated software crashes, including reboot cycle. 17591 This definition does not address failure of the stable storage module 17592 itself. 17594 The verifier is defined to allow a client to detect different 17595 instances of an NFS version 4 protocol server over which cached, 17596 uncommitted data may be lost. In the most likely case, the verifier 17597 allows the client to detect server reboots. This information is 17598 required so that the client can safely determine whether the server 17599 could have lost cached data. If the server fails unexpectedly and 17600 the client has uncommitted data from previous WRITE requests (done 17601 with the stable argument set to UNSTABLE4 and in which the result 17602 committed was returned as UNSTABLE4 as well) it may not have flushed 17603 cached data to stable storage. The burden of recovery is on the 17604 client and the client will need to retransmit the data to the server. 17606 A suggested verifier would be to use the time that the server was 17607 booted or the time the server was last started (if restarting the 17608 server without a reboot results in lost buffers). 17610 The committed field in the results allows the client to do more 17611 effective caching. If the server is committing all WRITE requests to 17612 stable storage, then it should return with committed set to 17613 FILE_SYNC4, regardless of the value of the stable field in the 17614 arguments. A server that uses an NVRAM accelerator may choose to 17615 implement this policy. The client can use this to increase the 17616 effectiveness of the cache by discarding cached data that has already 17617 been committed on the server. 
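The verifier logic described above can be sketched as follows. The buffer-tracking list and the boot-time-derived verifier are illustrative assumptions (the text only suggests deriving the verifier from boot or start time); the comparison rule itself comes from the text:

```python
import struct

def make_write_verifier(boot_time_usec):
    # One suggested verifier: an 8-byte verifier4 derived from the
    # server's boot (or last start) time, here in microseconds.
    return struct.pack(">Q", boot_time_usec)

def data_to_retransmit(unstable_writes, commit_verf):
    # unstable_writes: list of (offset, data, verf_at_write_time)
    # for WRITEs done with stable = UNSTABLE4 whose committed result
    # was also UNSTABLE4.  A verifier that differs from the one
    # returned by COMMIT means the server instance changed and the
    # cached data may have been lost, so those writes must be
    # retransmitted by the client.
    return [(off, data) for (off, data, verf) in unstable_writes
            if verf != commit_verf]
```

This mirrors the recovery burden placed on the client: discard nothing until the verifier confirms the server instance is unchanged.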
17619 Some implementations may return NFS4ERR_NOSPC instead of 17620 NFS4ERR_DQUOT when a user's quota is exceeded. In the case that the 17621 current filehandle is a directory, the server will return 17622 NFS4ERR_ISDIR. If the current filehandle is not a regular file or a 17623 directory, the server will return NFS4ERR_INVAL. 17625 If mandatory file locking is on for the file, and the corresponding 17626 record of the file into which the data is to be written is read or write locked by an 17627 owner that is not associated with the stateid, the server will return 17628 NFS4ERR_LOCKED. If so, the client must check if the owner 17629 corresponding to the stateid used with the WRITE operation has a 17630 conflicting read lock that overlaps with the region that was to be 17631 written. If the stateid's owner has no conflicting read lock, then 17632 the client should try to get the appropriate write record lock via 17633 the LOCK operation before re-attempting the WRITE. When the WRITE 17634 completes, the client should release the record lock via LOCKU. 17636 If the stateid's owner had a conflicting read lock, then the client 17637 has no choice but to return an error to the application that 17638 attempted the WRITE. The reason is that since the stateid's owner 17639 had a read lock, the server either attempted to temporarily 17640 effectively upgrade this read lock to a write lock, or the server has 17641 no upgrade capability. If the server attempted to upgrade the read 17642 lock and failed, it is pointless for the client to re-attempt the 17643 upgrade via the LOCK operation, because there might be another client 17644 also trying to upgrade. If two clients are blocked trying to upgrade 17645 the same lock, the clients deadlock. If the server has no upgrade 17646 capability, then it is pointless to try a LOCK operation to upgrade. 17648 17.33. Operation 40: BACKCHANNEL_CTL - Backchannel control 17650 Control aspects of the backchannel 17652 17.33.1.
SYNOPSIS 17654 callback program number, credentials -> - 17656 17.33.2. ARGUMENT 17658 /* 17659 * NFSv4.1 arguments and results 17660 */ 17661 struct gss_cb_handles4 { 17662 rpc_gss_svc_t gcbp_service; /* RFC 2203 */ 17663 opaque gcbp_handle_from_server<>; 17664 opaque gcbp_handle_from_client<>; 17665 }; 17667 union callback_sec_parms4 switch (uint32_t cb_secflavor) { 17668 case AUTH_NONE: 17669 void; 17670 case AUTH_SYS: 17671 authsys_parms cbsp_sys_cred; /* RFC 1831 */ 17672 case RPCSEC_GSS: 17673 gss_cb_handles4 cbsp_gss_handles; 17674 }; 17676 struct BACKCHANNEL_CTL4args { 17677 uint32_t bca_cb_program; 17678 callback_sec_parms4 bca_sec_parms<>; 17679 }; 17681 17.33.3. RESULT 17683 struct BACKCHANNEL_CTL4res { 17684 nfsstat4 bcr_status; 17685 }; 17687 17.33.4. DESCRIPTION 17689 The BACKCHANNEL_CTL operation replaces the backchannel's callback 17690 program number and adds (not replaces) RPCSEC_GSS contexts for use by 17691 the callback path. 17693 The arguments and results of the BACKCHANNEL_CTL call are a subset of 17694 the CREATE_SESSION parameters and have the same meaning. See the 17695 descriptions of csa_cb_program and csa_cb_sec_parms in 17696 Section 17.36.5. 17698 BACKCHANNEL_CTL MUST appear in a COMPOUND that starts with SEQUENCE. 17700 17.33.5. ERRORS 17702 TBD 17704 17.34. Operation 41: BIND_CONN_TO_SESSION 17706 17.34.1. SYNOPSIS 17708 sessionid, nonce, digest -> nonce, digest 17710 17.34.2. ARGUMENT 17712 struct bctsa_digest_input4 { 17713 sessionid4 bdai_sessid; 17714 uint64_t bdai_nonce1; 17715 uint64_t bdai_nonce2; 17716 }; 17718 enum channel_dir_from_client4 { 17719 CDFC4_FORE = 0x1, 17720 CDFC4_BACK = 0x2, 17721 CDFC4_FORE_OR_BOTH = 0x3, 17722 CDFC4_BACK_OR_BOTH = 0x7 17723 }; 17725 struct BIND_CONN_TO_SESSION4args { 17726 sessionid4 bctsa_sessid; 17727 bool bctsa_step1; 17728 channel_dir_from_client4 bctsa_dir; 17729 bool bctsa_use_conn_in_rdma_mode; 17730 uint64_t bctsa_nonce; 17731 opaque bctsa_digest<>; 17732 }; 17734 17.34.3. 
RESULT 17736 struct bctsr_digest_input4 { 17737 sessionid4 bdri_sessid; 17738 uint64_t bdri_nonce1; 17739 uint64_t bdri_nonce2; 17740 }; 17742 enum channel_dir_from_server4 { 17743 CDFS4_FORE = 0x1, 17744 CDFS4_BACK = 0x2, 17745 CDFS4_BOTH = 0x3 17746 }; 17748 struct BIND_CONN_TO_SESSION4resok { 17749 sessionid4 bctsr_sessid; 17750 bool bctsr_challenge; 17751 channel_dir_from_server4 bctsr_dir; 17752 bool bctsr_use_conn_in_rdma_mode; 17753 uint64_t bctsr_nonce; 17754 opaque bctsr_digest<>; 17755 }; 17757 union BIND_CONN_TO_SESSION4res switch (nfsstat4 bctsr_status) { 17758 case NFS4_OK: 17759 BIND_CONN_TO_SESSION4resok bctsr_resok4; 17760 default: 17761 void; 17762 }; 17764 17.34.4. DESCRIPTION 17766 BIND_CONN_TO_SESSION is used to bind additional connections to a 17767 session. It MUST be used on the connection being bound. It MUST be 17768 the only operation in the COMPOUND procedure. Any principal, 17769 security flavor, or RPCSEC_GSS context can invoke the operation. 17771 If, when the session was created, the client opted not to enable 17772 enforcement of connection binding (see Section 17.36), the client is 17773 not required to use BIND_CONN_TO_SESSION, unless the client wishes to 17774 bind the connection to the backchannel. In that case, because 17775 the client did not enable connection binding enforcement, it selected 17776 no hash algorithms for digest computation. Thus bctsa_digest and 17777 bctsr_digest will be zero length, and neither the client nor the 17778 server verifies either digest. 17780 If the client enabled enforcement of connection binding, then to 17781 prevent replay attacks, BIND_CONN_TO_SESSION implements a challenge- 17782 response protocol. This means that the client may be directed to 17783 issue BIND_CONN_TO_SESSION a second time on the same connection 17784 before the connection is bound to the session.
The client is first 17785 returned a challenge value in bctsr_nonce, and the client must then 17786 calculate a digest using SSV as the key, and the challenge value and 17787 other information as the input text. Since the server is free to 17788 generate nonce values that are unlikely to be re-used, this prevents 17789 attackers from engaging in replay attacks to bind rogue connections 17790 to the session. 17792 bctsa_sessid identifies the session the connection is to be bound to. 17794 If bctsa_step1 is TRUE, then the client is trying to initiate a 17795 binding of a connection to a session. 17797 bctsa_nonce is a nonce used to deter replay attacks on the server. 17798 If bctsa_step1 is FALSE, bctsa_nonce MUST be different from the 17799 bctsa_nonce value for a previous BIND_CONN_TO_SESSION operation that 17800 had bctsa_step1 set to TRUE. 17802 bctsa_digest is computed as the output of the HMAC (RFC 2104 [14]) using 17803 the current SSV as the key, and the XDR encoded value of data of type 17804 bctsa_digest_input4 as the input text. 17806 bdai_sessid is the same as bctsa_sessid. bdai_nonce1 is the same as 17807 bctsa_nonce. If bctsa_step1 was TRUE, then bdai_nonce2 is zero. 17808 Otherwise, bdai_nonce2 is the same as bctsr_nonce from the previous 17809 response to BIND_CONN_TO_SESSION on the same connection and 17810 sessionid. 17812 In the response, bctsr_challenge is set to TRUE if the server is 17813 challenging the client to prove it is not attempting a replay attack. 17814 If it is set to TRUE, the client MUST follow up with a 17815 BIND_CONN_TO_SESSION request with bctsa_step1 set to FALSE. If 17816 bctsr_challenge is set to FALSE, the server is either not 17817 challenging the client, or the response is in response to a 17818 challenge. 17820 bctsr_nonce MUST NOT be equal to bctsa_nonce and is a nonce used to 17821 deter replay attacks on the client and server.
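The digest computation above can be sketched with a generic keyed HMAC. In this sketch the hash algorithm (SHA-1) stands in for whichever algorithm the client negotiated at session creation, and the byte layout is an illustrative stand-in for the proper XDR encoding of bctsa_digest_input4; neither is mandated by the text:

```python
import hashlib
import hmac
import struct

def digest_input(sessid, nonce1, nonce2):
    # Stand-in for the XDR encoding of bctsa_digest_input4: the
    # 16-byte sessionid followed by two 64-bit nonces.
    assert len(sessid) == 16
    return sessid + struct.pack(">QQ", nonce1, nonce2)

def bctsa_digest(ssv, sessid, bctsa_nonce, bctsr_nonce_or_zero):
    # HMAC (RFC 2104) keyed with the current SSV.  In step 1 the
    # second nonce is zero; in step 2 it is the server's challenge
    # nonce from the step 1 reply.
    text = digest_input(sessid, bctsa_nonce, bctsr_nonce_or_zero)
    return hmac.new(ssv, text, hashlib.sha1).digest()

def server_verifies(ssv, sessid, nonce1, nonce2, client_digest):
    # Constant-time comparison; on mismatch the server returns
    # NFS4ERR_BAD_SESSION_DIGEST.
    expected = bctsa_digest(ssv, sessid, nonce1, nonce2)
    return hmac.compare_digest(expected, client_digest)
```

Because the SSV keys the HMAC and the server controls the challenge nonce, an attacker who cannot read the SSV cannot forge a valid step 2 digest.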
17823 bctsr_digest is the output of the HMAC using the SSV as the key, and 17824 the XDR encoded value of data of type bctsr_digest_input4 as the input 17825 text. 17827 bdri_sessid is the same as bctsr_sessid, which in turn should be the 17828 same as bctsa_sessid. bdri_nonce1 is the same as bctsr_nonce. 17830 If bctsr_challenge is TRUE, 17831 bdri_nonce2 is zero. Otherwise, bdri_nonce2 is equal to the value of 17832 bctsa_nonce as sent in the preceding BIND_CONN_TO_SESSION that had 17833 bctsa_step1 set to TRUE. 17835 If the server's computation of bctsa_digest does not match that in the 17836 arguments, the server MUST return NFS4ERR_BAD_SESSION_DIGEST. 17838 bctsa_dir indicates whether the client wants to bind the connection 17839 to the fore (operations) channel or back channel or both channels. 17840 The value CDFC4_FORE_OR_BOTH indicates the client wants to bind to 17841 both the fore and back channels, but will accept the connection 17842 being bound to just the fore channel. The value CDFC4_BACK_OR_BOTH 17843 indicates the client wants to bind to both the fore and back 17844 channels, but will accept the connection being bound to the back 17845 channel. The server replies in bctsr_dir which channel(s) the 17846 connection is bound to (but bctsr_dir is only meaningful if 17847 bctsr_challenge is FALSE). If the client specified CDFC4_FORE, the 17848 server MUST return CDFS4_FORE. If the client specified CDFC4_BACK, 17849 the server MUST return CDFS4_BACK. If the client specified 17850 CDFC4_FORE_OR_BOTH, the server MUST return CDFS4_FORE or CDFS4_BOTH. If the 17851 client specified CDFC4_BACK_OR_BOTH, the server MUST return 17852 CDFS4_BACK or CDFS4_BOTH. Note that if BIND_CONN_TO_SESSION has to 17853 be called in two steps, the server only processes the bctsa_dir value 17854 from the second step, and the client only processes the bctsr_dir 17855 from the second step.
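The channel-direction negotiation rules above reduce to a small table mapping each client request value to the replies a server is permitted to give. A sketch using the XDR constant values:

```python
# channel_dir_from_client4 values.
CDFC4_FORE, CDFC4_BACK = 0x1, 0x2
CDFC4_FORE_OR_BOTH, CDFC4_BACK_OR_BOTH = 0x3, 0x7

# channel_dir_from_server4 values.
CDFS4_FORE, CDFS4_BACK, CDFS4_BOTH = 0x1, 0x2, 0x3

# Replies the server is permitted to give for each client request,
# per the MUST rules above.
VALID_REPLIES = {
    CDFC4_FORE:         {CDFS4_FORE},
    CDFC4_BACK:         {CDFS4_BACK},
    CDFC4_FORE_OR_BOTH: {CDFS4_FORE, CDFS4_BOTH},
    CDFC4_BACK_OR_BOTH: {CDFS4_BACK, CDFS4_BOTH},
}

def bctsr_dir_is_valid(bctsa_dir, bctsr_dir):
    # A client could use this to reject a protocol-violating reply.
    return bctsr_dir in VALID_REPLIES[bctsa_dir]
```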
17857 See the CREATE_SESSION operation (Section 17.36), and the description 17858 of the argument csa_use_conn_in_rdma_mode to understand 17859 bctsa_use_conn_in_rdma_mode, and the description of 17860 csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode. 17862 17.34.5. IMPLEMENTATION 17864 If the client's computation of bctsr_digest does not match that in 17865 the results, the client SHOULD NOT accept successful 17866 BIND_CONN_TO_SESSION results, and SHOULD assume there has been an 17867 attack. Possibilities include: 17869 o The attacker has managed to change the SSV, by binding another 17870 connection. 17872 o The attacker has not managed to change the SSV. 17874 The client recovers from a possible attack as follows. 17876 The client can issue SET_SSV to attempt to change the SSV. If SSV is 17877 changed successfully, including verification of the digest in the 17878 response to SET_SSV, then this means the attacker did not change the 17879 SSV. Thus the attacker has managed to hijack the connection. The 17880 client's only recourse is to disconnect, and bind a new connection. 17881 Using IPsec to protect the connection will prevent connection 17882 hijacking. 17884 If SET_SSV fails, or the verification of the digest in the response 17885 fails, the attacker has changed the SSV. The client's only recourse 17886 is to recreate the session. 17888 If the client loses all connections, it needs to use 17889 BIND_CONN_TO_SESSION to bind a new connection. The server will not 17890 have the SSV if the server has rebooted and the server doesn't keep 17891 the replay cache in stable storage. In that event, the preceding 17892 SEQUENCE op in the same compound will have returned 17893 NFS4ERR_BADSESSION, so the client's state machine goes back to 17894 CREATE_SESSION. 17896 There is an issue if SET_SSV is sent, no response is returned, and 17897 the last bound connection disconnects. The client, per the sessions 17898 model, needs to retry the SET_SSV. 
But it needs a new connection to 17899 do so, and needs to bind that connection to the session. The problem 17900 is that the digest calculation for BIND_CONN_TO_SESSION uses the SSV 17901 as the key, and the SSV may have changed. While there are multiple 17902 recovery strategies, a single, general strategy is described here. 17903 First, the client reconnects. The client issues BIND_CONN_TO_SESSION 17904 with the new SSV used as the digest key. If the server returns 17905 NFS4ERR_BAD_SESSION_DIGEST then this means the server's current SSV 17906 was not changed, and the SET_SSV was not executed. The client then 17907 tries BIND_CONN_TO_SESSION with the old SSV as the digest key. This 17908 should not return NFS4ERR_BAD_SESSION_DIGEST. If it does, an 17909 implementation error has occurred on either the client or server, and 17910 the client has to create a new session. 17912 17.34.6. ERRORS 17914 error list 17916 17.35. Operation 42: EXCHANGE_ID - Instantiate Client ID 17918 Exchange long-hand client and server identifiers (owners), and create 17919 a client ID 17921 17.35.1. SYNOPSIS 17923 client owner -> client ID, server owner 17925 17.35.2. ARGUMENT 17927 const EXCHGID4_FLAG_SUPP_MOVED_REFER = 0x00000001; 17928 const EXCHGID4_FLAG_SUPP_MOVED_MIGR = 0x00000002; 17930 const EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000; 17931 const EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000; 17932 const EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000; 17934 struct EXCHANGE_ID4args { 17935 client_owner4 eia_clientowner; 17936 uint32_t eia_flags; 17937 nfs_impl_id4 eia_client_impl_id<1>; 17938 }; 17940 17.35.3.
RESULT 17942 struct server_owner4 { 17943 uint64_t so_minor_id; 17944 opaque so_major_id<>; 17945 }; 17947 struct EXCHANGE_ID4resok { 17948 clientid4 eir_clientid; 17949 sequenceid4 eir_sequenceid; 17950 uint32_t eir_flags; 17951 server_owner4 eir_server_owner; 17952 opaque eir_server_scope<>; 17953 nfs_impl_id4 eir_server_impl_id<1>; 17954 }; 17956 union EXCHANGE_ID4res switch (nfsstat4 eir_status) { 17957 case NFS4_OK: 17958 EXCHANGE_ID4resok eir_resok4; 17959 default: 17960 void; 17961 }; 17963 17.35.4. DESCRIPTION 17965 The client uses the EXCHANGE_ID operation to register a particular 17966 client owner with the server. The client ID returned from this 17967 operation will be necessary for requests that create state on the 17968 server and will serve as a parent object to sessions created by the 17969 client. In order to confirm the client ID, it and the returned 17970 sequenceid must first be used as arguments to CREATE_SESSION. 17972 The flags passed as part of the arguments and results to the 17973 EXCHANGE_ID operation allow the client and server to inform each other 17974 of their capabilities as well as indicate how the client ID will be 17975 used. Whether a bit is set or cleared on the arguments' flags does 17976 not force the server to set or clear the same bit on the results' 17977 side. Bits not defined above should not be set in the eia_flags 17978 field. If they are, the server MUST reject the operation with 17979 NFS4ERR_INVAL. 17981 When the EXCHGID4_FLAG_SUPP_MOVED_REFER is set, the client indicates 17982 that it is capable of dealing with an NFS4ERR_MOVED error as part of 17983 a referral sequence. When this bit is not set, it is still legal for 17984 the server to perform a referral sequence. However, a server may use 17985 the fact that the client is incapable of correctly responding to a 17986 referral by avoiding it for that particular client.
It may, for 17987 instance, act as a proxy for that particular file system, at some 17988 cost in performance, although it is not obligated to do so. If the 17989 server will potentially perform a referral, it MUST set 17990 EXCHGID4_FLAG_SUPP_MOVED_REFER in eir_flags. 17992 When the EXCHGID4_FLAG_SUPP_MOVED_MIGR is set, the client indicates 17993 that it is capable of dealing with an NFS4ERR_MOVED error as part of 17994 a file system migration sequence. When this bit is not set, it is 17995 still legal for the server to indicate that a file system has moved, 17996 when this in fact happens. However, a server may use the fact that 17997 the client is incapable of correctly responding to a migration in its 17998 scheduling of file systems to migrate so as to avoid migration of 17999 file systems being actively used. It may also hide actual migrations 18000 from clients unable to deal with them by acting as a proxy for a 18001 migrated file system for particular clients, at some cost in 18002 performance, although it is not obligated to do so. If the server 18003 will potentially perform a migration, it MUST set 18004 EXCHGID4_FLAG_SUPP_MOVED_MIGR in eir_flags. 18006 When EXCHGID4_FLAG_USE_NON_PNFS is set in eia_flags, the client 18007 indicates it wants to use the server in a conventional, non-parallel 18008 NFS mode of operation. When EXCHGID4_FLAG_USE_NON_PNFS is set in 18009 eir_flags, the server is indicating it supports a conventional mode 18010 of operation. 18012 When EXCHGID4_FLAG_USE_PNFS_MDS is set in eia_flags, the client 18013 indicates it wants to use the server as a metadata server of a 18014 parallel NFS cluster. When EXCHGID4_FLAG_USE_PNFS_MDS is set in 18015 eir_flags, the server is indicating it supports a metadata server. 18017 When EXCHGID4_FLAG_USE_PNFS_DS is set in eia_flags, the client 18018 indicates it wants to use the server as a data server of a parallel 18019 NFS cluster. 
When EXCHGID4_FLAG_USE_PNFS_DS is set in eir_flags, the 18020 server is indicating it supports a data server. 18022 A client SHOULD indicate at least one of EXCHGID4_FLAG_USE_NON_PNFS, 18023 EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS so that a server 18024 willing to meet the client's desires can indicate it is doing so. A 18025 server MUST return at least one of the three bits, even if the bit is 18026 not among the flag bits sent from the client. 18028 The capabilities indicated in the flags word apply to all sessions 18029 created for the resulting client ID and are presumed by the server to 18030 remain valid until a new client instance with the same client 18031 instance string does an EXCHANGE_ID. The server may update its view 18032 of such capabilities when a new EXCHANGE_ID is done by the same 18033 client instance, but clients should not depend upon such an update 18034 being effective until the server receives an EXCHANGE_ID for a new 18035 client instance. 18037 The arguments include an array of up to one element in length called 18038 eia_client_impl_id. If eia_client_impl_id is present, it contains the 18039 information identifying the implementation of the client. Similarly, 18040 the results include an array of up to one element in length called 18041 eir_server_impl_id that identifies the implementation of the server. 18042 Servers MUST allow a zero-length eia_client_impl_id array, and 18043 clients MUST allow a zero-length eir_server_impl_id array. Being 18044 able to identify specific implementations can help in planning by 18045 administrators or implementors. For example, diagnostic software may 18046 extract this information in an attempt to identify interoperability 18047 problems, performance workload behaviors, or general usage statistics.
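The flag rules stated in this description (reject undefined eia_flags bits with NFS4ERR_INVAL; the server returns at least one of the three USE_ bits) can be condensed into a short sketch; the function names and the string return value are illustrative, not part of the protocol:

```python
# Flag constants from the EXCHANGE_ID XDR definition.
EXCHGID4_FLAG_SUPP_MOVED_REFER = 0x00000001
EXCHGID4_FLAG_SUPP_MOVED_MIGR = 0x00000002
EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000

DEFINED_ARG_FLAGS = (EXCHGID4_FLAG_SUPP_MOVED_REFER |
                     EXCHGID4_FLAG_SUPP_MOVED_MIGR |
                     EXCHGID4_FLAG_USE_NON_PNFS |
                     EXCHGID4_FLAG_USE_PNFS_MDS |
                     EXCHGID4_FLAG_USE_PNFS_DS)
USE_FLAGS = (EXCHGID4_FLAG_USE_NON_PNFS |
             EXCHGID4_FLAG_USE_PNFS_MDS |
             EXCHGID4_FLAG_USE_PNFS_DS)

def check_eia_flags(eia_flags):
    # Bits not defined above MUST be rejected with NFS4ERR_INVAL.
    if eia_flags & ~DEFINED_ARG_FLAGS:
        return "NFS4ERR_INVAL"
    return None

def eir_flags_ok(eir_flags):
    # A server MUST return at least one of the three USE_ bits.
    return bool(eir_flags & USE_FLAGS)
```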
18048 Since the intent of having access to this information is for planning 18049 or general diagnosis only, the client and server MUST NOT interpret 18050 this implementation identity information in a way that affects 18051 interoperational behavior of the implementation. The reason is that 18052 if clients and servers did such a thing, they might use fewer 18053 capabilities of the protocol than the peer can support, or the client 18054 and server might refuse to interoperate. 18056 Because it is likely that some implementations will violate the protocol 18057 specification and interpret the identity information, implementations 18058 MUST allow the users of the NFSv4 client and server to set the 18059 contents of the sent nfs_impl_id structure to any value. 18061 17.35.5. IMPLEMENTATION 18063 A server's client record is a 5-tuple: 18065 1. co_ownerid: 18066 The client identifier string, from the eia_clientowner 18067 structure of the EXCHANGE_ID4args structure 18069 2. co_verifier: 18071 A client-specific value used to indicate reboots, from the 18072 eia_clientowner structure of the EXCHANGE_ID4args structure 18074 3. principal: 18076 The RPCSEC_GSS principal sent via the RPC headers 18078 4. client ID: 18080 The shorthand client identifier, generated by the server and 18081 returned via the eir_clientid field in the EXCHANGE_ID4resok 18082 structure 18084 5. confirmed: 18086 A private field on the server indicating whether or not a 18087 client record has been confirmed. A client record is 18088 confirmed if there has been a successful CREATE_SESSION 18089 operation to confirm it. Otherwise it is unconfirmed. An 18090 unconfirmed record is established by an EXCHANGE_ID call. Any 18091 unconfirmed record that is not confirmed within a lease period 18092 may be removed. 18094 The following identifiers represent special values for the fields in 18095 the records.
18097 ownerid_arg: 18099 The value of the eia_clientowner.co_ownerid subfield of the 18100 EXCHANGE_ID4args structure of the current request. 18102 verifier_arg: 18104 The value of the eia_clientowner.co_verifier subfield of the 18105 EXCHANGE_ID4args structure of the current request. 18107 old_verifier_arg: 18109 A value of the eia_clientowner.co_verifier field of a client 18110 record received in a previous request; this is distinct from 18111 verifier_arg. 18113 principal_arg: 18115 The value of the RPCSEC_GSS principal for the current request. 18117 old_principal_arg: 18119 A value of the RPCSEC_GSS principal received for a previous 18120 request. This is distinct from principal_arg. 18122 clientid_ret: 18124 The value of the eir_clientid field the server will return in the 18125 EXCHANGE_ID4resok structure for the current request. 18127 old_clientid_ret: 18129 The value of the eir_clientid field the server returned in the 18130 EXCHANGE_ID4resok structure for a previous request. This is 18131 distinct from clientid_ret. 18133 Since EXCHANGE_ID is a non-idempotent operation, we must consider the 18134 possibility that replays might occur as a result of a client reboot, 18135 network partition, malfunctioning router, etc. Replays are 18136 identified by the value of the eia_clientowner field of EXCHANGE_ID4args, and 18137 the method for dealing with them is outlined in the scenarios below. 18139 The scenarios are described in terms of which client records whose 18140 eia_clientowner.co_ownerid subfield has a value equal to ownerid_arg 18141 exist in the server's set of client records. Any cases in which 18142 there is more than one record with identical values for ownerid_arg 18143 represent a server implementation error. Operation in the potentially 18144 valid cases is summarized as follows. 18146 1.
Common case 18148 If no client records with eia_clientowner.co_ownerid matching 18149 ownerid_arg exist, a new shorthand client identifier 18150 clientid_ret is generated, and the following unconfirmed 18151 record is added to the server's state. 18153 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 18154 FALSE } 18156 Subsequently, the server returns clientid_ret. 18158 2. Router Replay 18159 If the server has the following confirmed record, then this 18160 request is likely the result of a replayed request due to a 18161 faulty router or lost connection. 18163 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, TRUE 18164 } 18166 Since the record has been confirmed, the client must have 18167 received the server's reply from the initial EXCHANGE_ID 18168 request. Since this is simply a spurious request, there is no 18169 modification to the server's state, and the server makes no 18170 reply to the client. 18172 3. Client Collision 18174 If the server has the following confirmed record, then this 18175 request is likely the result of a chance collision between the 18176 values of the eia_clientowner.co_ownerid subfield of 18177 EXCHANGE_ID4args for two different clients. 18179 { ownerid_arg, *, old_principal_arg, clientid_ret, TRUE } 18181 Since the value of the eia_clientowner.co_ownerid subfield of 18182 each client record must be unique, there is no modification of 18183 the server's state. The server either returns 18184 NFS4ERR_CLID_INUSE to indicate that the client should retry with 18185 a different value for the eia_clientowner.co_ownerid subfield 18186 of EXCHANGE_ID4args, or the server considers the principal and 18187 ownerid together as the client owner, and treats the 18188 EXCHANGE_ID as coming from a unique client owner. 18190 This scenario may also represent a malicious attempt to 18191 destroy a client's state on the server. For security reasons, 18192 the server MUST NOT remove the client's state when there is a 18193 principal mismatch.
18195 4. Replay 18197 If the server has the following unconfirmed record then this 18198 request is likely the result of a client replay due to a 18199 network partition or some other connection failure. 18201 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 18202 FALSE } 18204 Since the response to the EXCHANGE_ID request that created 18205 this record may have been lost, it is not acceptable to drop 18206 this replayed request. However, rather than processing it 18207 normally, the existing record is left unchanged and 18208 clientid_ret, which was generated for the previous request, is 18209 returned. 18211 5. Change of Principal 18213 If the server has the following unconfirmed record then this 18214 request is likely the result of a client which has for 18215 whatever reasons changed principals (possibly to change 18216 security flavor) after calling EXCHANGE_ID, but before calling 18217 CREATE_SESSION. 18219 { ownerid_arg, verifier_arg, old_principal_arg, clientid_ret, 18220 FALSE} 18222 Since the client has not changed, the principal field of the 18223 unconfirmed record is updated to principal_arg and 18224 clientid_ret is again returned. There is a small possibility 18225 that this is merely a collision on the client field of 18226 EXCHANGE_ID4args between unrelated clients, but since that is 18227 unlikely, and an unconfirmed record does not generally have 18228 any file system pertinent state, we can assume it is the same 18229 client without risking loss of any important state. 18231 After processing, the following record will exist on the 18232 server. 18234 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 18235 FALSE} 18237 6. Client Reboot 18239 If the server has the following confirmed client record, then 18240 this request is likely from a previously confirmed client 18241 which has rebooted. 
18243 { ownerid_arg, old_verifier_arg, principal_arg, clientid_ret, 18244 TRUE } 18246 Since the previous incarnation of the same client will no 18247 longer be making requests, lock and share reservations should 18248 be released immediately rather than forcing the new 18249 incarnation to wait for the lease time on the previous 18250 incarnation to expire. Furthermore, session state should be 18251 removed since if the client had maintained that information 18252 across reboot, this request would not have been issued. If 18253 the server does not support the CLAIM_DELEGATE_PREV claim 18254 type, associated delegations should be purged as well; 18255 otherwise, delegations are retained and recovery proceeds 18256 according to the section Delegation Recovery (Section 9.2.1). 18257 The client record is updated with the new verifier and its 18258 status is changed to unconfirmed. 18260 After processing, clientid_ret is returned to the client and 18261 the following record will exist on the server. 18263 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 18264 FALSE } 18266 7. Reboot before confirmation 18268 If the server has the following unconfirmed record, then this 18269 request is likely from a client which rebooted before sending 18270 a CREATE_SESSION request. 18272 { ownerid_arg, old_verifier_arg, *, clientid_ret, FALSE } 18274 Since this is believed to be a request from a new incarnation 18275 of the original client, the server updates the value of 18276 eia_clientowner.co_verifier and returns the original 18277 clientid_ret. After processing, the following state exists on 18278 the server. 18280 { ownerid_arg, verifier_arg, *, clientid_ret, FALSE } 18282 In addition to the client ID and sequenceid, the server returns a 18283 server owner (eir_server_owner) and eir_server_scope. The former 18284 field is used for network trunking as described in 18285 Section 2.10.3.4.1. 
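The seven scenarios above reduce to a match on the fields of the existing client record (if any) with the same co_ownerid. A server-side sketch of that dispatch follows; the ClientRec type and the returned labels are illustrative helpers, not protocol elements:

```python
from dataclasses import dataclass

@dataclass
class ClientRec:
    """One entry of the server's 5-tuple client record (illustrative)."""
    ownerid: str
    verifier: bytes
    principal: str
    clientid: int
    confirmed: bool

def exchange_id_case(rec, verifier_arg, principal_arg):
    # rec: the server's existing record with matching co_ownerid, or
    # None.  Returns a label naming the scenario that applies.
    if rec is None:
        return "1: new unconfirmed record"
    same_verf = rec.verifier == verifier_arg
    same_prin = rec.principal == principal_arg
    if rec.confirmed:
        if same_verf and same_prin:
            return "2: router replay, no state change"
        if not same_prin:
            return "3: client collision, state MUST NOT be removed"
        return "6: client reboot, release state, unconfirm record"
    if same_verf and same_prin:
        return "4: replay, return existing clientid"
    if same_verf:
        return "5: change of principal, update principal"
    return "7: reboot before confirmation, update verifier"
```

Note that cases 3 and 7 match regardless of the remaining field, mirroring the `*` wildcard in the record tuples above.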
The latter field is used to allow clients to 18286 determine when clientids issued by one server may be recognized by 18287 another in the event of file system migration (see Section 10.6.7). 18289 17.36. Operation 43: CREATE_SESSION - Create New Session and Confirm 18290 Client ID 18292 Start up session and confirm client ID. 18294 17.36.1. SYNOPSIS 18296 client ID, session_args -> sessionid, session_args 18298 17.36.2. ARGUMENT 18300 struct channel_attrs4 { 18301 count4 ca_maxrequestsize; 18302 count4 ca_maxresponsesize; 18303 count4 ca_maxresponsesize_cached; 18304 count4 ca_maxoperations; 18305 count4 ca_maxrequests; 18306 uint32_t ca_rdma_ird<1>; 18307 }; 18309 union conn_binding4args switch (bool cba_enforce) { 18310 case TRUE: 18311 sec_oid4 cba_hash_algs<>; 18312 case FALSE: 18313 void; 18314 }; 18316 const CREATE_SESSION4_FLAG_PERSIST = 0x00000001; 18317 const CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002; 18318 const CREATE_SESSION4_FLAG_CONN_RDMA = 0x00000004; 18320 struct CREATE_SESSION4args { 18321 clientid4 csa_clientid; 18322 sequenceid4 csa_sequence; 18323 uint32_t csa_flags; 18325 count4 csa_headerpadsize; 18327 conn_binding4args csa_conn_binding_opts; 18329 channel_attrs4 csa_fore_chan_attrs; 18330 channel_attrs4 csa_back_chan_attrs; 18332 uint32_t csa_cb_program; 18333 callback_sec_parms4 csa_cb_sec_parms<>; 18334 }; 18336 17.36.3. 
RESULT 18338 struct hash_alg_info4 { 18339 uint32_t hai_hash_alg; 18340 uint32_t hai_ssv_len; 18341 }; 18343 union conn_binding4res switch (bool cbr_enforce) { 18344 case TRUE: 18345 hash_alg_info4 cbr_hash_alg_info; 18346 case FALSE: 18347 void; 18348 }; 18350 struct CREATE_SESSION4resok { 18351 sessionid4 csr_sessionid; 18352 sequenceid4 csr_sequence; 18354 uint32_t csr_flags; 18355 count4 csr_headerpadsize; 18357 conn_binding4res csr_conn_binding_opts; 18359 channel_attrs4 csr_fore_chan_attrs; 18360 channel_attrs4 csr_back_chan_attrs; 18361 }; 18363 union CREATE_SESSION4res switch (nfsstat4 csr_status) { 18364 case NFS4_OK: 18365 CREATE_SESSION4resok csr_resok4; 18366 default: 18367 void; 18368 }; 18370 17.36.4. DESCRIPTION 18372 This operation is used by the client to create new session objects on 18373 the server. The server MUST accept a CREATE_SESSION operation with 18374 no preceding SEQUENCE operation in the COMPOUND procedure. A client 18375 MAY precede CREATE_SESSION with SEQUENCE in a COMPOUND procedure; if 18376 so, any session created by CREATE_SESSION has no direct relation to 18377 the session specified in the SEQUENCE operation. 18379 In addition to creating a session, CREATE_SESSION has the following 18380 effects: 18382 o The first session created with a new shorthand client identifier 18383 (client ID) serves to confirm the creation of that client's state 18384 on the server. The server returns the parameter values for the 18385 new session. 18387 o The connection CREATE_SESSION is issued over is bound to the 18388 session and to the session's forward channel. 18390 17.36.5. IMPLEMENTATION 18392 To describe the implementation, the same notation for client records 18393 introduced in the description of EXCHANGE_ID is used with the 18394 following addition: 18396 clientid_arg: The value of the csa_clientid field of the 18397 CREATE_SESSION4args structure of the current request. 
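As an informal illustration of the single-slot, wraparound-aware sequenceid handling that the implementation discussion of this section describes, the per-client-ID slot can be sketched in C. All names here are hypothetical; only the unsigned wraparound arithmetic mirrors the specification.

```c
#include <stdint.h>

/* Hypothetical sketch of the per-client-ID single-slot replay cache
 * for CREATE_SESSION; names are illustrative, not from the XDR. */

enum slot_outcome { SLOT_REPLAY, SLOT_MISORDERED, SLOT_NEW_REQUEST };

struct cs_slot {
    uint32_t seqid;   /* sequenceid currently recorded in the slot */
};

/* After EXCHANGE_ID, the slot is initialized to eir_sequenceid - 1,
 * letting unsigned arithmetic handle the underflow case. */
void cs_slot_init(struct cs_slot *s, uint32_t eir_sequenceid)
{
    s->seqid = eir_sequenceid - 1u;  /* wraps if eir_sequenceid == 0 */
}

/* Classify an incoming csa_sequence against the slot. */
enum slot_outcome cs_slot_check(struct cs_slot *s, uint32_t csa_sequence)
{
    if (csa_sequence == s->seqid)
        return SLOT_REPLAY;               /* return the cached result */
    if (csa_sequence == s->seqid + 1u) {  /* wraparound-safe successor */
        s->seqid = csa_sequence;
        return SLOT_NEW_REQUEST;          /* proceed to confirmation */
    }
    return SLOT_MISORDERED;               /* NFS4ERR_SEQ_MISORDERED */
}
```

Because the slot starts at eir_sequenceid - 1 with a contrived misordered result cached, the first CREATE_SESSION that uses csa_sequence equal to eir_sequenceid is classified as a new request.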
18399 Since CREATE_SESSION is a non-idempotent operation, we must consider 18400 the possibility that replays may occur as a result of a client 18401 reboot, network partition, malfunctioning router, etc. For each 18402 client ID created by EXCHANGE_ID, the server maintains a separate 18403 replay cache similar to the session replay cache used for SEQUENCE 18404 operations, with two distinctions. First, this is a replay cache just 18405 for detecting and processing CREATE_SESSION requests for a given 18406 client ID. Second, the size of the client ID replay cache is one 18407 slot (and as a result, the CREATE_SESSION request does not carry a 18408 slot number). This means that at most one CREATE_SESSION request for 18409 a given client ID can be outstanding. When a client issues a 18410 successful EXCHANGE_ID, it is returned eir_sequenceid, and the client 18411 is expected to set the value of csa_sequenceid in the next 18412 CREATE_SESSION it sends with that client ID to the value of 18413 eir_sequenceid. After EXCHANGE_ID, the server initializes the client 18414 ID slot to be equal to eir_sequenceid - 1 (accounting for underflow), 18415 and records a contrived CREATE_SESSION result with a "cached" result 18416 of NFS4ERR_SEQ_MISORDERED. With the slot thus initialized, the 18417 processing of the CREATE_SESSION operation is divided into four 18418 phases: 18420 1. Replay cache lookup. The server verifies it has a replay cache 18421 for the client ID. If the server contains no records with client 18422 ID equal to clientid_arg, then most likely the client's state has 18423 been purged during a period of inactivity, possibly due to a loss 18424 of connectivity. NFS4ERR_STALE_CLIENTID is returned, and no 18425 changes are made to any client records on the server. 18427 2. Sequenceid processing.
If csa_sequenceid is equal to the 18428 sequenceid in the client's slot, then this is a replay of the 18429 previous CREATE_SESSION request, and the server returns the 18430 cached result. If csa_sequenceid is neither equal to the sequenceid 18431 in the slot nor equal to the slot's sequenceid + 1 (accounting for 18432 wraparound), then the server returns the error 18433 NFS4ERR_SEQ_MISORDERED, and does not change the slot. If 18434 csa_sequenceid is equal to the slot's sequenceid + 1 (accounting 18435 for wraparound), then the slot's sequenceid is set to 18436 csa_sequenceid, and the CREATE_SESSION processing goes to the 18437 next phase. A subsequent new CREATE_SESSION call MUST use a 18438 csa_sequence that is one greater than that recorded in the slot. 18440 3. Client ID confirmation. If the state for the provided 18441 client ID has not yet been confirmed, it is confirmed before the 18442 session is created. Otherwise the client ID confirmation phase 18443 is skipped and only the session creation phase occurs. The 18444 operational cases are described in terms of what client records 18445 whose client ID field has a value equal to clientid_arg exist in 18446 the server's set of client records. Any case in which there is 18447 more than one record with identical values for client ID 18448 represents a server implementation error. Operation in the 18449 potentially valid cases is summarized as follows. 18451 * Common Case 18453 If the server has the following unconfirmed record, then 18454 this is the expected confirmation of an unconfirmed record. 18456 { *, *, principal_arg, clientid_arg, FALSE } 18458 The confirmed field of the record is set to TRUE. 18460 The processing of the operation continues to session 18461 creation.
18463 * Principal Change or Collision 18465 If the server has the following record, then the client has 18466 changed principals after the previous EXCHANGE_ID request, 18467 or there has been a chance collision between shorthand 18468 client identifiers. 18470 { *, *, old_principal_arg, clientid_arg, *, sequence_arg } 18472 Neither of these cases is permissible. Processing stops 18473 and NFS4ERR_CLID_INUSE is returned to the client. No 18474 changes are made to any client records on the server. 18476 4. Session creation. The server has confirmed the client ID, either 18477 in this CREATE_SESSION operation or in a previous CREATE_SESSION 18478 operation. The server examines the remaining fields of the 18479 arguments. For each argument field, if the value is acceptable 18480 to the server, it is recommended that the server use the provided 18481 value to create the new session. If it is not acceptable, the 18482 server may use a different value, but must return the value used 18483 to the client. These parameters have the following 18484 interpretation. 18486 csa_flags: 18488 The csa_flags field contains a list of the following flag 18489 bits: 18491 CREATE_SESSION4_FLAG_PERSIST: 18493 If CREATE_SESSION4_FLAG_PERSIST is set, the client desires 18494 server support for "reliable" semantics. For sessions in 18495 which only idempotent operations will be used (e.g. a read- 18496 only session), clients should not set 18497 CREATE_SESSION4_FLAG_PERSIST. If the server does not or 18498 cannot provide "reliable" semantics, the result field 18499 csr_flags MUST NOT have CREATE_SESSION4_FLAG_PERSIST set. 18501 If the server is a pNFS metadata server, for reasons 18502 described in Section 12.5.2 it MUST support 18503 CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint 18504 (Section 5.13.4) attribute.
18506 CREATE_SESSION4_FLAG_CONN_BACK_CHAN: 18508 If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is set in csa_flags, 18509 the client is requesting that the server use the connection 18510 CREATE_SESSION is called over for the back channel as well 18511 as the forward channel. The server sets 18512 CREATE_SESSION4_FLAG_CONN_BACK_CHAN in the result field 18513 csr_flags if it agrees. If 18514 CREATE_SESSION4_FLAG_CONN_BACK_CHAN is not set in 18515 csa_flags, then CREATE_SESSION4_FLAG_CONN_BACK_CHAN MUST 18516 NOT be set in csr_flags. 18518 CREATE_SESSION4_FLAG_CONN_RDMA: 18520 If CREATE_SESSION4_FLAG_CONN_RDMA is set in csa_flags, the 18521 connection CREATE_SESSION is called over is currently in 18522 non-RDMA mode, but has the capability to operate in RDMA 18523 mode, and the client is requesting the server agree to 18524 "step up" to RDMA mode on the connection. The server sets 18525 CREATE_SESSION4_FLAG_CONN_RDMA in the result field 18526 csr_flags if it agrees. If CREATE_SESSION4_FLAG_CONN_RDMA 18527 is not set in csa_flags, then 18528 CREATE_SESSION4_FLAG_CONN_RDMA MUST NOT be set in 18529 csr_flags. Note that once the server agrees to step up, it 18530 and the client MUST exchange all future traffic on the 18531 connection with RPC RDMA framing and not Record Marking. 18532 [[Comment.18: add xref]] 18534 csa_headerpadsize: 18536 The maximum amount of padding the client is willing to apply 18537 to ensure that write payloads are aligned on some boundary at 18538 the server. The server should reply in csr_headerpadsize with 18539 its preferred value, or zero if padding is not in use. The 18540 server may decrease this value but MUST NOT increase it. 18542 csa_conn_binding_opts: 18544 This argument indicates whether the client wants the server to 18545 enforce connection binding (see Section 2.10.6.3) and, if so, 18546 which one-way hash algorithms to use. The corresponding 18547 result is csr_conn_binding_opts. The argument contains the 18548 following fields.
18550 cba_enforce: 18552 Clients SHOULD set cba_enforce to TRUE so that servers 18553 reject the use of connections that are not explicitly bound 18554 to the session. If TRUE, the server MUST require the 18555 client to issue BIND_CONN_TO_SESSION before using a 18556 connection on a channel. If FALSE, then the digests used 18557 in SET_SSV and BIND_CONN_TO_SESSION MUST be zero length. 18559 The corresponding result is cbr_enforce which MUST be equal 18560 to cba_enforce. 18562 cba_hash_algs: 18564 This is the set of algorithms the client supports for the 18565 purpose of computing the digests needed for the SET_SSV and 18566 BIND_CONN_TO_SESSION operations. Each algorithm is 18567 specified as an object identifier (OID). The REQUIRED 18568 algorithms for a server are id-sha1, id-sha224, id-sha256, 18569 id-sha384, and id-sha512 RFC4055 [15]. 18571 If the server does not support any of the offered hash 18572 algorithms, CREATE_SESSION fails with error status 18573 NFS4ERR_OP_HASH_ALG_UNSUPP. Otherwise, the corresponding 18574 result is cbr_hash_alg_info, which contains two fields, 18575 hai_hash_alg and hai_ssv_len. The former is the index into 18576 the algorithm list cba_hash_algs of the algorithm the server 18577 has selected and that the client MUST use for SET_SSV and 18578 BIND_CONN_TO_SESSION. The latter is the length in octets 18579 of the SSV the client MUST use in SET_SSV. The result 18580 hai_ssv_len MUST be greater than or equal to the length of 18581 the hash produced by the selected algorithm. 18583 csa_fore_chan_attrs 18585 csa_back_chan_attrs 18587 These two fields apply to attributes of the fore channel (aka 18588 the operations channel, which conveys requests originating 18589 from the client to the server), and the back channel (the 18590 channel that conveys callback requests originating from the 18591 server to the client). The results are in corresponding 18592 structures called csr_fore_chan_attrs and csr_back_chan_attrs.
18593 Each structure has the following fields: 18595 ca_maxrequestsize: 18597 The maximum size of a COMPOUND or CB_COMPOUND request that 18598 will be sent. This size represents the XDR encoded size of 18599 the request, including the RPC headers (including security 18600 flavor credentials and verifiers) but excludes any 32 bit 18601 Record Marking headers. If a request were preceded by a single 18602 Record Marking header, the maximum allowable 18603 count encoded in that header would be ca_maxrequestsize. If 18604 a sender sends a request that exceeds ca_maxrequestsize, 18605 the error NFS4ERR_REQ_TOO_BIG will be returned per the 18606 description in Section 2.10.4.4. 18608 ca_maxresponsesize: 18610 The maximum size of a COMPOUND or CB_COMPOUND reply that 18611 the receiver will accept from the sender including RPC 18612 headers (see the ca_maxrequestsize definition). The 18613 NFSv4.1 server MUST NOT increase the value of this 18614 parameter in the CREATE_SESSION results. If a sender sends 18615 a request for which the size of the reply would exceed this 18616 value, the receiver will return NFS4ERR_REP_TOO_BIG, per 18617 the description in Section 2.10.4.4. 18619 ca_maxresponsesize_cached: 18621 Like ca_maxresponsesize, but the maximum size of a reply 18622 that will be stored in the reply cache (Section 2.10.4.1). 18623 If ca_maxresponsesize_cached is less than 18624 ca_maxresponsesize, then this is an indication to the 18625 client that it needs to be selective about which replies it 18626 tells the server to cache; large replies (e.g. READ 18627 results) should not be cached. The client can decide 18628 which replies to cache via the SEQUENCE (Section 17.46) or 18629 CB_SEQUENCE (Section 19.9) operations. If a sender sends a 18630 request for which the size of the reply would exceed this 18631 value, the receiver will return 18632 NFS4ERR_REP_TOO_BIG_TO_CACHE, per the description in 18633 Section 2.10.4.4.
18635 ca_maxoperations: 18637 The maximum number of operations the receiver will 18638 accept in a COMPOUND or CB_COMPOUND. If the client or server 18639 does not have a limit, it will set ca_maxoperations to 18640 0xFFFFFFFF. The server MUST NOT increase ca_maxoperations 18641 in the reply to CREATE_SESSION. If the requester issues a 18642 COMPOUND or CB_COMPOUND with more operations than 18643 ca_maxoperations, the replier MUST return 18644 NFS4ERR_TOO_MANY_OPS. 18646 ca_maxrequests: 18648 The maximum number of concurrent COMPOUND or CB_COMPOUND 18649 requests the sender will issue on the session. Subsequent 18650 requests will each be assigned a slot identifier by the 18651 client within the range 0 to ca_maxrequests - 1 inclusive. 18653 ca_rdma_ird: 18655 This array has a maximum of one element. If this array has 18656 one element, then the element contains the inbound RDMA 18657 read queue depth (IRD). 18659 csa_cb_program 18661 This is the program number the server must use in any 18662 callbacks sent through the back channel to the client. 18664 csa_cb_sec_parms 18666 This is an array of acceptable security credentials. Three 18667 security flavors are supported: AUTH_NONE, AUTH_SYS, and 18668 RPCSEC_GSS. If AUTH_NONE is specified for a credential, then 18669 the client is allowed to use AUTH_NONE on all 18670 callbacks for the session. If AUTH_SYS is specified, then the 18671 client is allowed to use AUTH_SYS on all callbacks, using the 18672 credential specified in cbsp_sys_cred. If RPCSEC_GSS is 18673 specified, then the server is allowed to use the RPCSEC_GSS 18674 context specified in cbsp_gss_parms as the RPCSEC_GSS context 18675 in the credential of the RPC header of callbacks to the 18676 client. 18678 The RPCSEC_GSS context is specified with two RPCSEC_GSS 18679 handles. The first handle, gcbp_handle_from_server, is the 18680 fore handle the server returned to the client when the 18681 RPCSEC_GSS context was created on the server.
The second 18682 handle, gcbp_handle_from_client, is the back handle that the client 18683 will map to the RPCSEC_GSS context. The server can 18684 immediately use the RPCSEC_GSS context, using 18685 gcbp_handle_from_client as the value of the "handle" field in the 18686 rpc_gss_cred_vers_1_t structure, with 18687 gss_proc set to RPCSEC_GSS_DATA. Note that while the GSS 18688 context state is shared between the fore and back RPCSEC_GSS 18689 contexts, the fore and back RPCSEC_GSS context state are 18690 independent of each other as far as the RPCSEC_GSS sequence 18691 number is concerned. 18693 Implementing RPCSEC_GSS callback support requires that the client 18694 and server change their RPCSEC_GSS implementations. One 18695 possible set of changes includes: 18697 + Adding a data structure that wraps the GSS-API context with 18698 a reference count. 18700 + Adding functions to increment and decrement the reference 18701 count. If the reference count is decremented to zero, the 18702 wrapper data structure and the GSS-API context it refers to 18703 would be freed. 18705 + Changing RPCSEC_GSS to create the wrapper data structure upon 18706 receiving a GSS-API context from gss_accept_sec_context() and 18707 gss_init_sec_context(). The reference count would be 18708 initialized to 1. 18710 + Adding a function to map an existing RPCSEC_GSS handle to a 18711 pointer to the wrapper data structure. The reference count 18712 would be incremented. 18714 + Adding a function to create a new RPCSEC_GSS handle from a 18715 pointer to the wrapper data structure. The reference count 18716 would be incremented. 18718 + Replacing calls from RPCSEC_GSS that free GSS-API contexts 18719 with calls to decrement the reference count on the wrapper 18720 data structure. 18722 5.
The server creates the session by recording the parameter values 18723 used (including whether the CREATE_SESSION4_FLAG_PERSIST flag is 18724 set and has been accepted by the server) and allocating space for 18725 the session replay cache. For each slot in the replay cache, the 18726 server sets the sequenceid to zero (0), and records a result 18727 containing a result for a COMPOUND with a single SEQUENCE 18728 operation, with the cached error of NFS4ERR_SEQ_MISORDERED. 18729 Thus, the first SEQUENCE operation a client issues on a slot 18730 after the session is created MUST start with a sequenceid of one 18731 (1). The client initializes its replay cache for receiving 18732 callbacks in the same way, and similarly, the first CB_SEQUENCE 18733 operation on a slot after session creation must have a sequenceid 18734 of one. 18736 6. If the session state is created successfully, the server 18737 associates the session with the client ID provided by the client. 18739 17.37. Operation 44: DESTROY_SESSION - Destroy existing session 18741 Destroy existing session. 18743 17.37.1. SYNOPSIS 18745 sessionid -> status 18747 17.37.2. ARGUMENT 18749 struct DESTROY_SESSION4args { 18750 sessionid4 dsa_sessionid; 18751 }; 18753 17.37.3. RESULT 18755 struct DESTROY_SESSION4res { 18756 nfsstat4 dsr_status; 18757 }; 18759 17.37.4. DESCRIPTION 18761 The DESTROY_SESSION operation closes the session and discards the 18762 replay cache. Any remaining connections bound to the session are 18763 immediately unbound and may additionally be closed by the server. 18764 Locks, delegations, layouts, wants, and the lease, which are all tied 18765 to the client ID, are not affected by DESTROY_SESSION. 18767 If the COMPOUND request starts with SEQUENCE, then DESTROY_SESSION 18768 MUST be the final, or only operation, unless the sessionid specified 18769 in SEQUENCE is different from the sessionid specified in 18770 DESTROY_SESSION. 
DESTROY_SESSION MAY be the only operation in a 18771 COMPOUND request. Because the operation results in destruction of 18772 the session, any reply caching for this request, as well as 18773 previously completed requests, will be lost. For this reason, it is 18774 advisable not to place this operation in a COMPOUND request with 18775 other state-modifying operations (unless those operations are for a 18776 different session, as specified by SEQUENCE). 18778 Because the session is destroyed, a client that retransmits the 18779 request may receive an error in response, even though the original 18780 request was successful. 18782 If there is a backchannel on the session and the server has 18783 outstanding CB_SEQUENCE operations, then the server MAY refuse to 18784 destroy the session and return NFS4ERR_BACK_CHAN_BUSY. The client 18785 SHOULD respond to all outstanding CB_COMPOUNDs before re-issuing 18786 DESTROY_SESSION. 18788 17.37.5. IMPLEMENTATION 18790 No discussion at this time. 18792 17.38. Operation 45: FREE_STATEID - Free stateid with no locks 18794 Free a stateid which has no associated locks. 18796 17.38.1. SYNOPSIS 18798 stateid -> 18800 17.38.2. ARGUMENT 18802 struct FREE_STATEID4args { 18803 stateid4 fsa_stateid; 18804 }; 18806 17.38.3. RESULT 18808 struct FREE_STATEID4res { 18809 nfsstat4 fsr_status; 18810 }; 18812 17.38.4. DESCRIPTION 18814 The FREE_STATEID operation is used to free a stateid which no longer 18815 has any associated locks (including opens, record locks, delegations, 18816 and layouts). This may be because of client unlock operations or because 18817 of server revocation. If there are valid locks (of any kind) 18818 associated with the stateid in question, the error NFS4ERR_LOCKS_HELD 18819 will be returned, and the associated stateid will not be freed.
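The FREE_STATEID outcome described above can be sketched as a small decision function. The stateid bookkeeping structure and status names below are hypothetical, with comments noting the protocol errors they stand in for.

```c
#include <stdbool.h>

/* A minimal sketch of the FREE_STATEID decision; the bookkeeping
 * structure is hypothetical, not part of the protocol's XDR. */

enum fs_status {
    FS_OK,           /* NFS4_OK: stateid freed */
    FS_LOCKS_HELD,   /* NFS4ERR_LOCKS_HELD: valid locks remain */
    FS_BAD_STATEID   /* NFS4ERR_BAD_STATEID: stateid already freed */
};

struct stateid_entry {
    bool freed;       /* set once FREE_STATEID succeeds */
    int  lock_count;  /* opens, record locks, delegations, layouts */
};

enum fs_status free_stateid(struct stateid_entry *st)
{
    if (st->freed)
        return FS_BAD_STATEID;   /* any subsequent use of it fails */
    if (st->lock_count > 0)
        return FS_LOCKS_HELD;    /* the stateid is not freed */
    st->freed = true;
    return FS_OK;
}
```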
18821 When a stateid which had been associated with revoked locks is freed, 18822 the client, by doing the FREE_STATEID, acknowledges the loss of those 18823 locks. Once all such revoked state is 18824 acknowledged, the server can again allow that client to reclaim locks without 18825 encountering the edge conditions discussed in Section 8.6.2. 18827 Once a successful FREE_STATEID is done for a given stateid, any 18828 subsequent use of that stateid will result in an NFS4ERR_BAD_STATEID 18829 error. 18831 17.38.5. IMPLEMENTATION 18833 No discussion at this time. 18835 17.39. Operation 46: GET_DIR_DELEGATION - Get a directory delegation 18837 Obtain a directory delegation. 18839 17.39.1. SYNOPSIS 18841 (cfh), requested notification -> 18842 (cfh), cookieverf, stateid, supported notification 18844 17.39.2. ARGUMENT 18846 /* 18847 * Notification types. 18848 */ 18849 const DIR_NOTIFICATION4_NONE = 0x00000000; 18850 const DIR_NOTIFICATION4_CHANGE_CHILD_ATTRIBUTES = 0x00000001; 18851 const DIR_NOTIFICATION4_CHANGE_DIR_ATTRIBUTES = 0x00000002; 18852 const DIR_NOTIFICATION4_REMOVE_ENTRY = 0x00000004; 18853 const DIR_NOTIFICATION4_ADD_ENTRY = 0x00000008; 18854 const DIR_NOTIFICATION4_RENAME_ENTRY = 0x00000010; 18855 const DIR_NOTIFICATION4_CHANGE_COOKIE_VERIFIER = 0x00000020; 18857 typedef uint32_t dir_notification_type4; 18859 typedef nfstime4 attr_notice4; 18861 struct GET_DIR_DELEGATION4args { 18862 bool gdda_signal_deleg_avail; 18863 dir_notification_type4 gdda_notification_type; 18864 attr_notice4 gdda_child_attr_delay; 18865 attr_notice4 gdda_dir_attr_delay; 18866 bitmap4 gdda_child_attributes; 18867 bitmap4 gdda_dir_attributes; 18868 }; 18870 17.39.3.
RESULT 18872 struct GET_DIR_DELEGATION4resok { 18873 verifier4 gddr_cookieverf; 18874 /* Stateid for get_dir_delegation */ 18875 stateid4 gddr_stateid; 18876 /* Which notifications can the server support */ 18877 dir_notification_type4 gddr_notification; 18878 bitmap4 gddr_child_attributes; 18879 bitmap4 gddr_dir_attributes; 18880 }; 18882 enum gddrnf4_status { 18883 GDD4_OK = 0, 18884 GDD4_UNAVAIL = 1 18885 }; 18887 union GET_DIR_DELEGATION4res_non_fatal 18888 switch (gddrnf4_status gddrnf_status) { 18889 case GDD4_OK: 18890 GET_DIR_DELEGATION4resok gddrnf_resok4; 18891 case GDD4_UNAVAIL: 18892 bool gddrnf_will_signal_deleg_avail; 18893 }; 18895 union GET_DIR_DELEGATION4res 18896 switch (nfsstat4 gddr_status) { 18897 case NFS4_OK: 18898 /* CURRENT_FH: delegated dir */ 18899 GET_DIR_DELEGATION4res_non_fatal gddr_res_non_fatal4; 18900 default: 18901 void; 18902 }; 18904 17.39.4. DESCRIPTION 18906 The GET_DIR_DELEGATION operation is used by a client to request a 18907 directory delegation. The directory is represented by the current 18908 filehandle. The client also specifies whether it wants the server to 18909 notify it when the directory changes in certain ways by setting one 18910 or more bits in a bitmap. The server may also choose not to grant 18911 the delegation. In that case the server will return 18912 NFS4ERR_DIRDELEG_UNAVAIL. If the server decides to hand out the 18913 delegation, it will return a cookie verifier for that directory. If 18914 the cookie verifier changes when the client is holding the 18915 delegation, the delegation will be recalled unless the client has 18916 asked for notification for this event. In that case a notification 18917 will be sent to the client. 18919 The server will also return a directory delegation stateid in 18920 addition to the cookie verifier as a result of the GET_DIR_DELEGATION 18921 operation. This stateid will appear in callback messages related to 18922 the delegation, such as notifications and delegation recalls. 
The 18923 client will use this stateid to return the delegation voluntarily or 18924 upon recall. A delegation is returned by calling the DELEGRETURN 18925 operation. 18927 The server may not be able to support notifications of certain 18928 events. If the client asks for such notifications, the server must 18929 inform the client of its inability to do so as part of the 18930 GET_DIR_DELEGATION reply by not setting the appropriate bits in the 18931 supported notifications bitmask contained in the reply. 18933 The GET_DIR_DELEGATION operation can be used for both normal and 18934 named attribute directories. It covers all the entries in the 18935 directory except the ".." entry. That means if a directory and its 18936 parent both hold directory delegations, any changes to the parent 18937 will not cause a notification to be sent for the child even though 18938 the child's ".." entry points to the parent. 18940 If the client sets gdda_signal_deleg_avail to TRUE, then it is 18941 registering with the server a "want" for a directory delegation. If 18942 the server supports and will honor the "want", the results will have 18943 gddrnf_will_signal_deleg_avail set to TRUE. If so, the client should 18944 expect a future CB_RECALLABLE_OBJ_AVAIL operation to indicate that a 18945 directory delegation is available. 18947 17.39.5. IMPLEMENTATION 18949 Directory delegation provides the benefit of improving cache 18950 consistency of namespace information. This is done through 18951 synchronous callbacks. A server must support synchronous callbacks 18952 in order to support directory delegations. In addition to that, 18953 asynchronous notifications provide a way to reduce network traffic as 18954 well as improve client performance in certain conditions. 18955 Notifications would not be requested when the goal is just cache 18956 consistency. 18958 Notifications are specified in terms of potential changes to the 18959 directory.
A client can ask to be notified of events by setting one 18960 or more flags in gdda_notification_type. The client can ask for 18961 notifications on addition of entries to a directory 18962 (DIR_NOTIFICATION4_ADD_ENTRY), 18963 entry removal (DIR_NOTIFICATION4_REMOVE_ENTRY), renames 18964 (DIR_NOTIFICATION4_RENAME_ENTRY), directory attribute changes 18965 (DIR_NOTIFICATION4_CHANGE_DIR_ATTRIBUTES), and cookie verifier 18966 changes (DIR_NOTIFICATION4_CHANGE_COOKIE_VERIFIER) by setting one or 18967 more of the corresponding flags in the gdda_notification_type field. 18969 The client can also ask for notifications of changes to attributes of 18970 directory entries (DIR_NOTIFICATION4_CHANGE_CHILD_ATTRIBUTES) in 18971 order to keep its attribute cache up to date. However, any changes 18972 made to child attributes do not cause the delegation to be recalled. 18973 If a client is interested in directory entry caching, or negative 18974 name caching, it can set the gdda_notification_type appropriately and 18975 the server will notify it of all changes that would otherwise 18976 invalidate its name cache. The kind of notification a client asks 18977 for may depend on the directory size, its rate of change, and the 18978 applications being used to access that directory. However, the 18979 conditions under which a client might ask for a notification are 18980 outside the scope of this specification. 18982 For attribute notifications, the client will set bits in the 18983 gdda_dir_attributes bitmap to indicate which attributes it wants to 18984 be notified of. If the server does not support notifications for 18985 changes to a certain attribute, it should not set that attribute in 18986 the supported attribute bitmap specified in the reply 18987 (gddr_dir_attributes).
The client will also set in the 18988 gdda_child_attributes bitmap the attributes of directory entries it 18989 wants to be notified of, and the server will indicate in 18990 gddr_child_attributes which attributes of directory entries it will 18991 notify the client of. 18993 The client will also let the server know if it wants to get the 18994 notification as soon as the attribute change occurs or after a 18995 certain delay by setting a delay factor; gdda_child_attr_delay is for 18996 attribute changes to directory entries and gdda_dir_attr_delay is for 18997 attribute changes to the directory. If this delay factor is set to 18998 zero, that indicates to the server that the client wants to be 18999 notified of any attribute changes as soon as they occur. If the 19000 delay factor is set to N seconds, the server will make a best-effort 19001 guarantee that attribute updates are not out of sync by more than 19002 that. If the client asks for a delay factor that the server does not 19003 support or that may cause significant resource consumption on the 19004 server by causing the server to send a lot of notifications, the 19005 server should not commit to sending out notifications for attributes 19006 and therefore must not set the corresponding bits in the 19007 gddr_child_attributes and gddr_dir_attributes bitmaps in the 19008 response. 19010 The client should use a security flavor that the file system is 19011 exported with. If it uses a different flavor, the server should 19012 return NFS4ERR_WRONGSEC to the operation that precedes 19013 GET_DIR_DELEGATION and sets the current filehandle. 19015 17.40. Operation 47: GETDEVICEINFO - Get Device Information 19017 17.40.1. SYNOPSIS 19019 (cfh), device_id, layout_type, maxcount -> device_addr 19021 17.40.2. ARGUMENT 19023 struct GETDEVICEINFO4args { 19024 /* CURRENT_FH: file */ 19025 deviceid4 gdia_device_id; 19026 layouttype4 gdia_layout_type; 19027 count4 gdia_maxcount; 19028 }; 19030 17.40.3.
RESULT 19032 struct GETDEVICEINFO4resok { 19033 device_addr4 gdir_device_addr; 19034 }; 19036 union GETDEVICEINFO4res switch (nfsstat4 gdir_status) { 19037 case NFS4_OK: 19038 GETDEVICEINFO4resok gdir_resok4; 19039 default: 19040 void; 19041 }; 19043 17.40.4. DESCRIPTION 19045 Returns device address information for a specified device. The 19046 device address MUST correspond to the layout type specified by the 19047 GETDEVICEINFO4args. The current filehandle (cfh) is used to identify 19048 the file system; device IDs are unique per file system (FSID) and are 19049 qualified by the layout type. 19051 See Section 12.2.12 for more details on device ID assignment. 19053 If the size of the device address exceeds gdia_maxcount bytes, the 19054 metadata server will return the error NFS4ERR_TOOSMALL. If an 19055 invalid device ID is given, the metadata server will respond with 19056 NFS4ERR_INVAL. 19058 17.40.5. IMPLEMENTATION 19060 17.41. Operation 48: GETDEVICELIST 19062 17.41.1. SYNOPSIS 19064 (cfh), layout_type, maxcount, cookie, cookieverf -> 19065 cookie, cookieverf, device info list<> 19067 17.41.2. ARGUMENT 19069 struct GETDEVICELIST4args { 19070 /* CURRENT_FH: file */ 19071 layouttype4 gdla_layout_type; 19072 count4 gdla_maxcount; 19073 nfs_cookie4 gdla_cookie; 19074 verifier4 gdla_cookieverf; 19075 }; 19077 17.41.3. RESULT 19079 struct GETDEVICELIST4resok { 19080 nfs_cookie4 gdlr_cookie; 19081 verifier4 gdlr_cookieverf; 19082 devlist_item4 gdlr_devinfo_list<>; 19083 bool gdlr_eof; 19084 }; 19086 union GETDEVICELIST4res switch (nfsstat4 gdlr_status) { 19087 case NFS4_OK: 19088 GETDEVICELIST4resok gdlr_resok4; 19089 default: 19090 void; 19091 }; 19093 17.41.4. DESCRIPTION 19095 In some applications, especially SAN environments, it is convenient 19096 to find out about all the devices associated with a file system. 19097 This lets a client determine if it has access to these devices, e.g., 19098 at mount time. 
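As a sketch of how a client might use the READDIR-like, cookie-based iteration that GETDEVICELIST provides, the following loop drains the device list in batches. The RPC is replaced here by an in-memory stub, and all names are hypothetical illustrations, not protocol API.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t nfs_cookie4;

struct gdl_result {
    nfs_cookie4 cookie;   /* like gdlr_cookie: where to resume */
    int         count;    /* entries returned in this batch */
    bool        eof;      /* like gdlr_eof: no more entries */
};

/* Stub server with NDEV devices, returning at most `max` per call. */
#define NDEV 7
static struct gdl_result getdevicelist_stub(nfs_cookie4 cookie, int max)
{
    struct gdl_result r;
    int remaining = NDEV - (int)cookie;
    r.count = remaining < max ? remaining : max;
    r.cookie = cookie + (nfs_cookie4)r.count;
    r.eof = (r.cookie >= NDEV);
    return r;
}

/* Fetch the whole list in batches, as a client might at mount time. */
static int fetch_all_devices(int batch)
{
    int total = 0;
    nfs_cookie4 cookie = 0;   /* the initial call uses a zero cookie */
    struct gdl_result r;
    do {
        r = getdevicelist_stub(cookie, batch);
        total += r.count;
        cookie = r.cookie;    /* resume where the last reply stopped */
    } while (!r.eof);
    return total;
}
```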
This operation returns an array of items (devlist_item4) that establish the association between the short deviceid4 and the addressing information for that device, for a particular layout type. This operation may not be able to fetch all device information at once; thus it uses a cookie-based approach, similar to READDIR, to fetch additional device information (see Section 17.23). The "eof" flag has a value of TRUE if there are no more entries to fetch. As in GETDEVICEINFO, the current filehandle (cfh) is used to identify the file system.

As in GETDEVICEINFO, gdla_maxcount specifies the maximum number of bytes to return. If the metadata server is unable to return a single device address, it will return the error NFS4ERR_TOOSMALL. If an invalid device ID is given, the metadata server will respond with NFS4ERR_INVAL.

17.41.5.  IMPLEMENTATION

17.42.  Operation 49: LAYOUTCOMMIT - Commit writes made using a layout

17.42.1.  SYNOPSIS

   (client ID), (cfh), offset, length, reclaim, last_write_offset,
   time_modify, time_access, layoutupdate -> newsize

17.42.2.  ARGUMENT

   union newtime4 switch (bool nt_timechanged) {
   case TRUE:
           nfstime4        nt_time;
   case FALSE:
           void;
   };

   union newoffset4 switch (bool no_newoffset) {
   case TRUE:
           offset4         no_offset;
   case FALSE:
           void;
   };

   struct LAYOUTCOMMIT4args {
           /* CURRENT_FH: file */
           offset4         loca_offset;
           length4         loca_length;
           bool            loca_reclaim;
           newoffset4      loca_last_write_offset;
           newtime4        loca_time_modify;
           newtime4        loca_time_access;
           layoutupdate4   loca_layoutupdate;
   };

17.42.3.
RESULT

   union newsize4 switch (bool ns_sizechanged) {
   case TRUE:
           length4         ns_size;
   case FALSE:
           void;
   };

   struct LAYOUTCOMMIT4resok {
           newsize4        locr_newsize;
   };

   union LAYOUTCOMMIT4res switch (nfsstat4 locr_status) {
   case NFS4_OK:
           LAYOUTCOMMIT4resok      locr_resok4;
   default:
           void;
   };

17.42.4.  DESCRIPTION

Commits changes in the layout segment represented by the current filehandle, client ID (derived from the sessionid in the preceding SEQUENCE operation), and octet range. Since layout segments are subdividable, a smaller portion of a layout segment, retrieved via LAYOUTGET, may be committed. The region being committed is specified through the octet range (loca_offset and loca_length).

The LAYOUTCOMMIT operation indicates that the client has completed writes using a layout obtained by a previous LAYOUTGET. The client may have written only a subset of the data range it previously requested. LAYOUTCOMMIT allows it to commit or discard provisionally allocated space and to update the server with a new end of file. The layout segment referenced by LAYOUTCOMMIT is still valid after the operation completes and can continue to be referenced by the client ID, filehandle, octet range, and layout type.

If the loca_reclaim field is set to TRUE, the client is attempting to commit changes to a layout after a reboot of the metadata server, during the metadata server's recovery grace period. This type of request may be necessary when the client has uncommitted writes to provisionally allocated regions of a file which were sent to the storage devices before the reboot of the metadata server.
In this case, the layout provided by the client MUST be a subset of a writable layout that the client held immediately before the reboot of the metadata server. The metadata server is free to accept or reject this request based on its own internal metadata consistency checks. If the metadata server finds that the layout provided by the client does not pass its consistency checks, it MUST reject the request with the status NFS4ERR_RECLAIM_BAD. The successful completion of a LAYOUTCOMMIT request with loca_reclaim set to TRUE does NOT provide the client with a layout segment for the file. It simply commits the changes to the layout segment specified in the loca_layoutupdate field. To obtain a layout segment for the file, the client must issue a LAYOUTGET request to the server after the server's grace period has expired. If the metadata server receives a LAYOUTCOMMIT request with loca_reclaim set to TRUE when the metadata server is not in its recovery grace period, it MUST reject the request with the status NFS4ERR_NO_GRACE.

Setting the loca_reclaim field to TRUE is required if and only if the committed layout was acquired before the metadata server reboot. If the client is committing a layout segment that was acquired during the metadata server's grace period, it MUST set the loca_reclaim field to FALSE.

The loca_last_write_offset field specifies the offset of the last octet written by the client previous to the LAYOUTCOMMIT. Note that this value is never equal to the file's size (at most it is one octet less than the file's size). The metadata server may use this information to determine whether the file's size needs to be updated. If the metadata server updates the file's size as the result of the LAYOUTCOMMIT operation, it must return the new size (locr_newsize.ns_size) as part of the results.
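The size-update decision described above can be made concrete in a few lines. This is a sketch of one plausible server policy, not a normative algorithm: the spec only says the server "may use this information", so the function name and return convention here are illustrative.

```python
def updated_size(current_size, last_write_offset):
    """Decide whether loca_last_write_offset implies a file-size update.

    last_write_offset is the offset of the last octet written, so the
    file must extend to at least last_write_offset + 1 octets (which is
    why the offset is always at most one less than the resulting size).
    Returns the new size, or None if no update is needed
    (ns_sizechanged would be FALSE).
    """
    candidate = last_write_offset + 1
    if candidate > current_size:
        return candidate      # server returns this in locr_newsize.ns_size
    return None               # existing size already covers the write
```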
The loca_time_modify and loca_time_access [[Comment.19: If LAYOUTCOMMIT is only for writes, then why update access time?]] fields allow the client to suggest times it would like the metadata server to set. The metadata server may use these time values, or it may use the time of the LAYOUTCOMMIT operation to set these time values. If the metadata server uses the client-provided times, it should ensure that time does not flow backwards. If the client wants to force the metadata server to set an exact time, the client should use a SETATTR operation in a COMPOUND right after LAYOUTCOMMIT. See Section 12.5.3 for more details. If the client desires the resultant mtime or atime, it should construct the COMPOUND so that a GETATTR follows the LAYOUTCOMMIT.

The loca_layoutupdate argument to LAYOUTCOMMIT provides a mechanism for a client to provide layout-specific updates to the metadata server. For example, the layout update can describe what regions of the original layout have been used and what regions can be deallocated. There is no NFSv4.1 file layout-specific layoutupdate4 structure.

The layout information is more verbose for block devices than for objects and files because the latter two hide the details of block allocation behind their storage protocols. At a minimum, the client needs to communicate changes to the end-of-file location back to the server, and, if desired, its view of the file modify and access times. For block/volume layouts, it needs to specify precisely which blocks have been used.

If the layout segment identified in the arguments does not exist, the error NFS4ERR_BADLAYOUT is returned. The layout segment being committed may also be rejected if it does not correspond to an existing layout with an iomode of LAYOUTIOMODE4_RW.
On success, the current filehandle retains its value.

17.42.5.  IMPLEMENTATION

Optionally, the client can also use LAYOUTCOMMIT with the loca_reclaim field set to TRUE to convey hints about modified file attributes or to report layout-type specific information such as I/O errors for object-based storage layouts, as is done during normal operation. Doing so may help the metadata server to recover files more efficiently after reboot. For example, some file system implementations may require expansive recovery of file system objects if the metadata server does not get a positive indication from all clients holding a write layout that they have successfully completed all their writes. Sending a LAYOUTCOMMIT (if required) and then following with LAYOUTRETURN can provide such an indication and allow for graceful and efficient recovery.

17.43.  Operation 50: LAYOUTGET - Get Layout Information

17.43.1.  SYNOPSIS

   (cfh), signal_avail, layout_type, iomode, offset,
   length, minlength, maxcount -> layout

17.43.2.  ARGUMENT

   struct LAYOUTGET4args {
           /* CURRENT_FH: file */
           bool                    loga_signal_layout_avail;
           layouttype4             loga_layout_type;
           layoutiomode4           loga_iomode;
           offset4                 loga_offset;
           length4                 loga_length;
           length4                 loga_minlength;
           count4                  loga_maxcount;
   };

17.43.3.  RESULT

   struct LAYOUTGET4resok {
           bool            logr_return_on_close;
           layout4         logr_layout;
   };

   union LAYOUTGET4res switch (nfsstat4 logr_status) {
   case NFS4_OK:
           LAYOUTGET4resok logr_resok4;
   case NFS4ERR_LAYOUTTRYLATER:
           bool            logr_will_signal_layout_avail;
   default:
           void;
   };

17.43.4.
DESCRIPTION

Requests a layout segment from the metadata server for reading or writing (and reading) the file given by the filehandle at the octet range specified by offset and length. Layouts are identified by the client ID (derived from the sessionid in the preceding SEQUENCE operation), current filehandle, and layout type (loga_layout_type). The use of loga_iomode depends upon the layout type, but it should reflect the client's data access intent.

If the metadata server is in a grace period, and does not persist layout segments and device ID to device address mappings, then it MUST return NFS4ERR_GRACE (see Section 8.6.2.1).

The LAYOUTGET operation returns layout information for the specified octet range: a layout segment. To get a layout segment from a specific offset through the end-of-file, regardless of the file's length, a loga_length field with all bits set to 1 (one) should be used. If loga_length is zero, or if a loga_length that is not all bits set to one is specified and loga_length added to loga_offset exceeds the maximum 64-bit unsigned integer value, the error NFS4ERR_INVAL will result.

The loga_minlength field specifies the minimum size of the overlap with the requested offset and length that is to be returned. If this requirement cannot be met, no layout must be returned; the error NFS4ERR_LAYOUTTRYLATER can be returned.

The loga_maxcount field specifies the maximum layout size (in octets) that the client can handle. If the size of the layout structure exceeds the size specified by maxcount, the metadata server will return the NFS4ERR_TOOSMALL error.

As well, the metadata server may adjust the range of the returned layout segment based on striping patterns and usage implied by the loga_iomode.
The client must be prepared to get a layout segment that does not line up exactly with its request; there MUST be at least an overlap of loga_minlength between the layout returned by the server and the client's request, or the server SHOULD reject the request. See Section 12.5.2 for more details.

The metadata server may also return a layout segment with an lo_iomode other than that requested by the client. If it does so, it must ensure that the lo_iomode is more permissive than the loga_iomode requested. For example, this allows an implementation to upgrade read-only requests to read/write requests at its discretion, within the limits of the layout type specific protocol. A lo_iomode of either LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW must be returned.

The logr_return_on_close result field is a directive to return the layout before closing the file. When the server sets this return value to TRUE, it must be prepared to recall the layout in the case that the client fails to return the layout before close. For a server that knows a layout must be returned before a close of the file, this return value can be used to communicate the desired behavior to the client and thus remove one extra step from the client's and server's interaction.

The format of the returned layout (lo_content) is specific to the underlying file system. Layout types other than the NFSv4.1 file layout type are specified outside this document.

If layouts are not supported for the requested file or its containing file system, the server SHOULD return NFS4ERR_LAYOUTUNAVAILABLE. If the layout type is not supported, the metadata server should return NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout matches the client-provided layout identification, the server should return NFS4ERR_BADLAYOUT.
If an invalid loga_iomode is specified, or a loga_iomode of LAYOUTIOMODE4_ANY is specified, the server should return NFS4ERR_BADIOMODE.

If the layout for the file is unavailable due to transient conditions, e.g., file sharing prohibits layouts, the server must return NFS4ERR_LAYOUTTRYLATER.

If the layout request is rejected due to an overlapping layout recall, the server must return NFS4ERR_RECALLCONFLICT. See Section 12.5.4.2 for details.

If the layout conflicts with a mandatory octet range lock held on the file, and if the storage devices have no method of enforcing mandatory locks other than through the restriction of layouts, the metadata server should return NFS4ERR_LOCKED.

If the client sets loga_signal_layout_avail to TRUE, then it is registering with the server a "want" for a layout in the event the layout cannot be obtained due to resource exhaustion. If the server supports and will honor the "want", the results will have logr_will_signal_layout_avail set to TRUE. If so, the client should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a layout is available.

On success, the current filehandle retains its value.

17.43.5.  IMPLEMENTATION

Typically, LAYOUTGET will be called as part of a COMPOUND RPC after an OPEN operation, and it results in the client having location information for the file; a client may also hold a layout across multiple OPENs. The client specifies a layout type that limits what kind of layout the server will return. This prevents servers from issuing layouts that are unusable by the client.

17.44.  Operation 51: LAYOUTRETURN - Release Layout Information

17.44.1.  SYNOPSIS

   (cfh), layout_type, iomode, layoutreturn, reclaim -> -

17.44.2.
ARGUMENT

   struct LAYOUTRETURN4args {
           /* CURRENT_FH: file */
           bool            lora_reclaim;
           layouttype4     lora_layout_type;
           layoutiomode4   lora_iomode;
           layoutreturn4   lora_layoutreturn;
   };

17.44.3.  RESULT

   struct LAYOUTRETURN4res {
           nfsstat4        lorr_status;
   };

17.44.4.  DESCRIPTION

Returns one or more layouts or layout segments represented by the client ID (derived from the sessionid in the preceding SEQUENCE operation), lora_layout_type, and lora_iomode. When layoutreturn is LAYOUTRETURN4_FILE, the returned layout segment is further identified by the current filehandle, lrf_offset, and lrf_length. When layoutreturn is LAYOUTRETURN4_FSID, the current filehandle is used to identify the file system, and all layouts or layout segments matching the client ID, lora_layout_type, and lora_iomode are returned. When layoutreturn is LAYOUTRETURN4_ALL, all layouts or layout segments matching the client ID, lora_layout_type, and lora_iomode are returned, and the current filehandle is not used. After this call, the client MUST NOT use the returned layout segment(s) or layout(s) and the associated storage protocol to access the file data. A layout segment being returned may be a subdivision of a layout segment previously fetched through LAYOUTGET. As well, it may be a subset or superset of a layout segment specified by CB_LAYOUTRECALL. However, if it is a subset, the recall is not complete until the full recalled scope (LAYOUTRETURN4_FILE octet range, LAYOUTRETURN4_FSID, or LAYOUTRETURN4_ALL) has been returned. It is also permissible, and no error should result, for a client to return an octet range covering a layout it does not hold. If the lrf_length is all 1s, the layout covers the range from lrf_offset to EOF.
An iomode of LAYOUTIOMODE4_ANY specifies that all layouts that match the other arguments to LAYOUTRETURN (i.e., client ID, lora_layout_type, and one of: current filehandle and range; fsid derived from the current filehandle; or LAYOUTRETURN4_ALL) are being returned.

When lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, the client also invalidates all the storage device ID to storage device address mappings in the affected file system(s). Any device ID returned by a subsequent LAYOUTGET in the affected file system(s) will have to be resolved using either GETDEVICEINFO or GETDEVICELIST.

The lora_reclaim field set to TRUE in a LAYOUTRETURN request specifies that the client is attempting to return a layout that was acquired before the reboot of the metadata server, during the metadata server's grace period. When returning layouts that were acquired during the metadata server's grace period, the client MUST set the lora_reclaim field to FALSE. The lora_reclaim field MUST also be set to FALSE when lr_returntype is LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See LAYOUTCOMMIT (Section 17.42) for more details.

Layouts may be returned when recalled or voluntarily (i.e., before the server has recalled them). In either case, the client must properly propagate state changed under the context of the layout to the storage device(s) or to the metadata server before returning the layout.

If a client fails to return a layout in a timely manner, then the metadata server should use its control protocol with the storage devices to fence the client from accessing the data referenced by the layout. See Section 12.5.4 for more details.

If the layout identified in the arguments does not exist, the error NFS4ERR_BADLAYOUT is returned. If a layout exists, but the iomode does not match, NFS4ERR_BADIOMODE is returned.
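The matching rules above (returntype, layout type, iomode with the LAYOUTIOMODE4_ANY wildcard, and filehandle versus fsid scoping) can be sketched as a predicate over held layout segments. This is an illustrative model only: the segment record and its field names (fh, fsid, iomode, layout_type) are invented for the example, the enum values are arbitrary stand-ins, and octet-range narrowing for LAYOUTRETURN4_FILE is omitted.

```python
# Stand-in enum values, not the on-the-wire XDR values.
LAYOUTRETURN4_FILE, LAYOUTRETURN4_FSID, LAYOUTRETURN4_ALL = 1, 2, 3
LAYOUTIOMODE4_READ, LAYOUTIOMODE4_RW, LAYOUTIOMODE4_ANY = 1, 2, 3

def matches(seg, returntype, layout_type, iomode, fh=None, fsid=None):
    """Does held segment `seg` fall within a LAYOUTRETURN's scope?"""
    if seg["layout_type"] != layout_type:
        return False
    # LAYOUTIOMODE4_ANY matches every iomode; otherwise require equality.
    if iomode != LAYOUTIOMODE4_ANY and seg["iomode"] != iomode:
        return False
    if returntype == LAYOUTRETURN4_FILE:
        return seg["fh"] == fh      # lrf_offset/lrf_length check omitted
    if returntype == LAYOUTRETURN4_FSID:
        return seg["fsid"] == fsid  # cfh identifies the file system
    return True                     # LAYOUTRETURN4_ALL: no fh involved
```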
If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after the metadata server's grace period, NFS4ERR_NO_GRACE is returned.

If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, NFS4ERR_INVAL is returned.

On success, the current filehandle retains its value.

[[Comment.20: Should LAYOUTRETURN be modified to handle FSID callbacks?]]

17.44.5.  IMPLEMENTATION

The final LAYOUTRETURN operation in response to a CB_LAYOUTRECALL callback MUST be serialized with any outstanding, intersecting LAYOUTRETURN operations. Note that it is possible that while a client is returning the layout for some recalled range, the server may recall a superset of that range (e.g., LAYOUTRECALL4_ALL); the final return operation for the latter must block until the former recall is complete, i.e., until its corresponding final return operation has been replied to.

Returning all layouts in a file system using LAYOUTRETURN4_FSID is typically done in response to a CB_LAYOUTRECALL for that file system as the final return operation. Similarly, LAYOUTRETURN4_ALL is used in response to a recall callback for all layouts. It is possible that the client already returned some outstanding layouts via individual LAYOUTRETURN calls, and the call for LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL marks the end of the LAYOUTRETURN sequence. See Section 12.5.4.1 for more details.

17.45.  Operation 52: SECINFO_NO_NAME - Get Security on Unnamed Object

Obtain available security mechanisms using either the parent of an object or the current filehandle.

17.45.1.  SYNOPSIS

   (cfh), secinfo_style -> { secinfo }

17.45.2.
ARGUMENT

   enum secinfo_style4 {
           SECINFO_STYLE4_CURRENT_FH       = 0,
           SECINFO_STYLE4_PARENT           = 1
   };

   typedef secinfo_style4 SECINFO_NO_NAME4args;

17.45.3.  RESULT

   typedef SECINFO4res SECINFO_NO_NAME4res;

17.45.4.  DESCRIPTION

Like the SECINFO operation, SECINFO_NO_NAME is used by the client to obtain a list of valid RPC authentication flavors for a specific file object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that are accessed by filehandle.

There are two styles of SECINFO_NO_NAME, as determined by the value of the secinfo_style4 enumeration. If SECINFO_STYLE4_CURRENT_FH is passed, then SECINFO_NO_NAME is querying for the required security for the current filehandle. If SECINFO_STYLE4_PARENT is passed, then SECINFO_NO_NAME is querying for the required security of the current filehandle's parent. If the style selected is SECINFO_STYLE4_PARENT, then SECINFO should apply the same access methodology used for LOOKUPP when evaluating the traversal to the parent directory. Therefore, if the requester does not have the appropriate access to LOOKUPP the parent, then SECINFO_NO_NAME must behave the same way and return NFS4ERR_ACCESS.

Note that if PUTFH, PUTPUBFH, or PUTROOTFH return NFS4ERR_WRONGSEC, this is tantamount to the server asserting that the client will have to guess what the required security is, because there is no way to query. Therefore, the client must iterate through the security triples available at the client and reattempt the PUTFH, PUTROOTFH, or PUTPUBFH operation. In the unfortunate event none of the MANDATORY security triples are supported by the client and server, the client SHOULD try using others that support integrity. Failing that, the client can try using other forms (e.g.,
AUTH_SYS and AUTH_NONE), but because such forms lack integrity checks, this puts the client at risk.

The server implementor should pay particular attention to Section 2.6 for instructions on avoiding NFS4ERR_WRONGSEC error returns from PUTFH, PUTROOTFH, PUTPUBFH, or RESTOREFH.

Everything else about SECINFO_NO_NAME is the same as SECINFO. See the discussion on SECINFO (Section 17.29.4).

17.45.5.  IMPLEMENTATION

See the discussion on SECINFO (Section 17.29.5).

17.46.  Operation 53: SEQUENCE - Supply per-procedure sequencing and control

Supply per-procedure sequencing and control.

17.46.1.  SYNOPSIS

   control -> control

17.46.2.  ARGUMENT

   struct SEQUENCE4args {
           sessionid4      sa_sessionid;
           sequenceid4     sa_sequenceid;
           slotid4         sa_slotid;
           slotid4         sa_highest_slotid;
           bool            sa_cachethis;
   };

17.46.3.  RESULT

   const SEQ4_STATUS_CB_PATH_DOWN                  = 0x00000001;
   const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING      = 0x00000002;
   const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED       = 0x00000004;
   const SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED     = 0x00000008;
   const SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED    = 0x00000010;
   const SEQ4_STATUS_ADMIN_STATE_REVOKED           = 0x00000020;
   const SEQ4_STATUS_RECALLABLE_STATE_REVOKED      = 0x00000040;
   const SEQ4_STATUS_LEASE_MOVED                   = 0x00000080;
   const SEQ4_STATUS_RESTART_RECLAIM_NEEDED        = 0x00000100;

   struct SEQUENCE4resok {
           sessionid4      sr_sessionid;
           sequenceid4     sr_sequenceid;
           slotid4         sr_slotid;
           slotid4         sr_highest_slotid;
           slotid4         sr_target_highest_slotid;
           uint32_t        sr_status_flags;
   };

   union SEQUENCE4res switch (nfsstat4 sr_status) {
   case NFS4_OK:
           SEQUENCE4resok  sr_resok4;
   default:
           void;
   };

17.46.4.  DESCRIPTION

The SEQUENCE operation is used to manage operational accounting for the session on which the operation is sent.
The contents include the client and session to which this request belongs, the slotid and sequenceid used by the server to implement session request control and the duplicate reply cache semantics, and exchanged slot counts which are used to adjust these values.

This operation MUST appear as the first operation of any COMPOUND in which it appears. The error NFS4ERR_SEQUENCE_POS will be returned if it is found in any position in a COMPOUND beyond the first. Operations other than SEQUENCE, BIND_CONN_TO_SESSION, EXCHANGE_ID, CREATE_SESSION, and DESTROY_SESSION may not appear as the first operation in a COMPOUND. Such operations will get the error NFS4ERR_OP_NOT_IN_SESSION if they do appear at the start of a COMPOUND.

If SEQUENCE is received on a connection not bound to the session via CREATE_SESSION or BIND_CONN_TO_SESSION, and the client specified connection binding enforcement when the session was created (see Section 17.36), then the server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION.

If sa_cachethis is TRUE, then the client is requesting that the server cache the reply in the server's reply cache. The server MUST cache the reply (see Section 2.10.4.1.2).

The response to the SEQUENCE operation contains a word of status flags (sr_status_flags) that can provide to the client information related to the status of the client's lock state and communications paths. Note that any status bits relating to lock state MAY be reset when lock state is lost due to a server reboot or the establishment of a new client instance.
Note that if the client ID implied by sa_sessionid was established with

   (eir_flags &
    (EXCHGID4_FLAG_USE_PNFS_DS |
     EXCHGID4_FLAG_USE_PNFS_MDS |
     EXCHGID4_FLAG_USE_NON_PNFS)) == EXCHGID4_FLAG_USE_PNFS_DS

in the EXCHANGE_ID results (i.e., the client ID is only for data servers), then sr_status_flags MUST always be zero.

SEQ4_STATUS_CB_PATH_DOWN
   When set, indicates that the client has no operational callback path, making it necessary for the client to re-establish one, return its recallable locks, or both. This bit remains set until the callback path is again available.

SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING
   When set, indicates that the GSS contexts to be used for callbacks are expected to expire within a period equal to the lease time. This bit remains set until the expiration time of the contexts is beyond the lease period from the current time.

SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED
   When set, indicates that the GSS contexts to be used for callbacks have expired. This bit remains set until new non-expired contexts are provided.

SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED
   When set, indicates that the lease has expired and as a result the server released all of the client's locking state. This status bit remains set until the loss of all such locks has been acknowledged by use of FREE_BADLOCK, or by establishing a new client instance by destroying all sessions (via DESTROY_SESSION) and the client ID (via DESTROY_CLIENT), and then invoking EXCHANGE_ID and CREATE_SESSION to establish a new client ID.

SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
   When set, indicates that some subset of the client's locks have been revoked due to expiration of the lease period followed by another client's conflicting lock request.
This status bit remains set until the loss of all such locks has been acknowledged by use of FREE_BADLOCK.

SEQ4_STATUS_ADMIN_STATE_REVOKED
   When set, indicates that one or more locks have been revoked without expiration of the lease period, due to administrative action. This status bit remains set until the loss of all such locks has been acknowledged by use of FREE_BADLOCK.

SEQ4_STATUS_RECALLABLE_STATE_REVOKED
   When set, indicates that one or more recallable locks have been revoked without expiration of the lease period, due to the client's failure to return them when recalled. This status bit remains set until the loss of all such locks has been acknowledged by use of FREE_BADLOCK.

SEQ4_STATUS_LEASE_MOVED
   When set, indicates that responsibility for lease renewal has been transferred to one or more new servers. This condition will continue until the client receives an NFS4ERR_MOVED error and the server receives the subsequent GETATTR for the fs_locations or fs_locations_info attribute for an access to each file system for which a lease has been moved to a new server.

SEQ4_STATUS_RESTART_RECLAIM_NEEDED
   When set, indicates that after a server restart or reboot, although the session and client ID have persisted (usually due to the CREATE_SESSION result having returned the CREATE_SESSION4_FLAG_PERSIST flag in csr_flags), all other leased state has been lost; this is why the flag is not reset by the restart itself. The client must reclaim the lost state via the procedure described in Section 8.6.2, although re-establishing a client ID and session is neither necessary nor recommended.

If the difference between sa_sequenceid and the sequenceid the server has for the slot is two (2) or more, then the server MUST return NFS4ERR_SEQ_MISORDERED.
If sa_sequenceid is less than the server's cached sequenceid (accounting for wraparound of the unsigned sequenceid value), then the server MUST return NFS4ERR_SEQ_MISORDERED. If sa_sequenceid and the cached sequenceid are the same, this is a replay, and the server returns the cached response to the COMPOUND. Otherwise, if sa_sequenceid is one greater (accounting for wraparound) than the cached sequenceid, then this is a new request, and the slot's sequenceid is incremented. The operations subsequent to SEQUENCE, if any, are processed. If there are no other operations, the only other effects are to cache the SEQUENCE reply in the slot, maintain the session's activity, and renew the lease of state related to the client ID.

If SEQUENCE returns an error, then the state of the slot (sequenceid, cached reply) is not changed, nor is the associated lease renewed.

If SEQUENCE returns NFS4_OK, then the associated lease is renewed, except if SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in the status word.

The server returns two "highest_slotid" values: sr_highest_slotid and sr_target_highest_slotid. The former is the highest slotid the server will accept in a future SEQUENCE operation, and it must not be less than the value of sa_highest_slotid. The latter is the highest slotid the server would prefer the client use on a future SEQUENCE operation.

17.46.5.  IMPLEMENTATION

The server MUST maintain a mapping of sessionid to client ID in order to validate any operations that follow SEQUENCE that take a stateid as an argument and/or result.

If the client establishes a persistent session, then the server MUST also persist the client ID, such that it is valid through server reboot or restart.
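The per-slot sequenceid rules in the DESCRIPTION (replay on equality, new request on exactly one greater, NFS4ERR_SEQ_MISORDERED otherwise, all modulo 32-bit wraparound) reduce to a single modular difference. The following is a hedged sketch of that check only; the slot dictionary is a simplified stand-in for a real duplicate reply cache entry, and the string dispositions are illustrative, not protocol values.

```python
MOD = 2**32   # sequenceid4 is an unsigned 32-bit value

def sequence_check(slot, sa_sequenceid):
    """slot: dict with 'seqid' (cached sequenceid) and 'reply' (cached
    COMPOUND reply). Returns a (disposition, cached_reply) pair."""
    diff = (sa_sequenceid - slot["seqid"]) % MOD
    if diff == 0:
        # Same sequenceid: a replay; return the cached response.
        return ("replay", slot["reply"])
    if diff == 1:
        # One greater (with wraparound): a new request; bump the slot.
        slot["seqid"] = sa_sequenceid
        slot["reply"] = None      # replaced once the new reply is built
        return ("new", None)
    # Two or more ahead, or behind the cached value: misordered.
    return ("NFS4ERR_SEQ_MISORDERED", None)
```

Note that "less than the cached sequenceid" and "two or more greater" collapse into the same branch: both leave the modular difference outside {0, 1}.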
If the session and client ID are not persistent, 19792 then after a server reboot or restart the client ID is no 19793 longer valid; upon encountering an sa_sessionid that maps to a stale 19794 client ID, the server SHOULD return NFS4ERR_STALE_CLIENTID, which 19795 indicates that both the client ID and sessionid are stale. 19797 The server's implementation constraints may require constructing a 19798 sessionid such that it is impossible to discern a sessionid that is 19799 invalid due to malformation from one that is invalid due to server 19800 restart. In that event, when the client receives NFS4ERR_BADSESSION, 19801 it may check for a stale client ID by issuing a CREATE_SESSION with the 19802 client ID. If CREATE_SESSION succeeds, the client has a session to 19803 use, and it MAY retry the original COMPOUND with the new sessionid 19804 (unless SEQ4_STATUS_RESTART_RECLAIM_NEEDED is returned in 19805 sr_status_flags; in which case the client MUST first reclaim state as 19806 described in Section 8.6.2.1). 19808 17.47. Operation 54: SET_SSV 19810 17.47.1. SYNOPSIS 19812 ssv, digest -> digest 19814 17.47.2. ARGUMENT 19816 struct ssa_digest_input4 { 19817 SEQUENCE4args sdi_seqargs; 19818 }; 19820 struct SET_SSV4args { 19821 opaque ssa_ssv<>; 19822 opaque ssa_digest<>; 19823 }; 19825 17.47.3. RESULT 19827 struct ssr_digest_input4 { 19828 SEQUENCE4res sdi_seqres; 19829 }; 19831 struct SET_SSV4resok { 19832 opaque ssr_digest<>; 19833 }; 19835 union SET_SSV4res switch (nfsstat4 ssr_status) { 19836 case NFS4_OK: 19837 SET_SSV4resok ssr_resok4; 19838 default: 19839 void; 19840 }; 19842 17.47.4. DESCRIPTION 19844 This operation is used to set or update the SSV for a session. It 19845 MUST be preceded by SEQUENCE in the same COMPOUND. It MUST be 19846 invoked only on a connection bound to the session.
It MUST NOT be 19847 used if the client did not enable connection binding enforcement when 19848 the session was created (see Section 17.36); the server returns 19849 NFS4ERR_OP_CONN_BINDING_NOT_ENFORCED in that case. If the client 19850 enabled connection binding enforcement, then SET_SSV MUST be invoked 19851 at least once prior to a BIND_CONN_TO_SESSION operation. 19853 ssa_digest is computed as the output of the HMAC RFC2104 [14] using 19854 the current SSV as the key, and an XDR encoded value of data type 19855 ssa_digest_input4. The field sdi_seqargs is equal to the arguments 19856 of the SEQUENCE operation for the COMPOUND procedure that SET_SSV is 19857 within. 19859 The ssa_ssv is XORed with the current SSV to produce the new SSV. 19861 In the response, ssr_digest is the output of the HMAC using the new 19862 SSV as the key, and an XDR encoded value of data type 19863 ssr_digest_input4. The field sdi_seqres is equal to the results of 19864 the SEQUENCE operation for the COMPOUND procedure that SET_SSV is 19865 within. 19867 17.47.5. IMPLEMENTATION 19869 When the server receives ssa_digest, it MUST verify the digest by 19870 computing the digest the same way the client did and comparing it 19871 with ssa_digest. If the server gets a different result, this is an 19872 error, NFS4ERR_BAD_SESSION_DIGEST. Generally, when that error is 19873 returned, the client has no recourse for changing the SSV or binding 19874 new connections to the session but to recreate the session with CREATE_SESSION. However, 19875 the IMPLEMENTATION section of BIND_CONN_TO_SESSION describes a scenario 19876 where a client can legitimately get NFS4ERR_BAD_SESSION_DIGEST for a 19877 SET_SSV, and how to recover from it. 19879 Clients SHOULD NOT send an ssa_ssv that is equal to a previous 19880 ssa_ssv, nor equal to a previous SSV. 19882 Clients SHOULD issue SET_SSV with RPCSEC_GSS privacy. Servers MUST 19883 support RPCSEC_GSS with privacy for any COMPOUND that has { SEQUENCE, 19884 SET_SSV }. 19886 17.48.
Operation 55: TEST_STATEID - Test stateids for validity 19888 Test a series of stateids for validity. 19890 17.48.1. SYNOPSIS 19892 stateids<> -> error_codes<> 19894 17.48.2. ARGUMENT 19896 struct TEST_STATEID4args { 19897 stateid4 ts_stateids<>; 19898 }; 19900 17.48.3. RESULT 19902 struct TEST_STATEID4resok { 19903 nfsstat4 tsr_status_codes<>; 19904 }; 19906 union TEST_STATEID4res switch (nfsstat4 tsr_status) { 19907 case NFS4_OK: 19908 TEST_STATEID4resok tsr_resok4; 19909 default: 19910 void; 19911 }; 19913 17.48.4. DESCRIPTION 19915 The TEST_STATEID operation is used to check the validity of a set of 19916 stateids. It is intended to be used when the client receives an 19917 indication that one or more of its stateids have been invalidated due 19918 to lock revocation. TEST_STATEID allows a large set of such stateids 19919 to be tested and allows problems with earlier stateids not to 19920 interfere with checking of subsequent ones, as would happen if 19921 individual stateids were tested by separate operations in a COMPOUND. 19923 For each stateid, the server provides the status code that would be 19924 returned if that stateid were to be used in normal operation. 19925 Returning such a status indication is not an error and does not 19926 cause processing to terminate. Checks for the validity of the 19927 stateid proceed as they would for normal operations, with two 19928 exceptions: there is no check for the type of stateid object, as 19929 would be the case for normal operations, and there is no reference to the current 19930 filehandle. 19932 The status codes that are validly returned within the tsr_status_codes array 19933 are: NFS4_OK, NFS4ERR_BAD_STATEID, NFS4ERR_EXPIRED, 19934 NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED. 19936 17.48.5. IMPLEMENTATION 19938 No discussion at this time. 19940 17.49. Operation 56: WANT_DELEGATION 19942 17.49.1. SYNOPSIS 19944 (cfh), (client ID) -> stateid, delegation 19946 17.49.2.
ARGUMENT 19948 union deleg_claim4 switch (open_claim_type4 dc_claim) { 19949 /* 19950 * No special rights to object. Ordinary delegation 19951 * request of the specified object. Object identified 19952 * by filehandle. 19953 */ 19954 case CLAIM_FH: /* new to v4.1 */ 19955 void; 19957 /* 19958 * Right to file based on a delegation granted to a previous boot 19959 * instance of the client. File is specified by filehandle. 19960 */ 19961 case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ 19962 /* CURRENT_FH: file being opened */ 19963 void; 19965 /* 19966 * Right to the file established by an open previous to server 19967 * reboot. File identified by filehandle. 19968 * Used during server reclaim grace period. 19969 */ 19970 case CLAIM_PREVIOUS: 19971 /* CURRENT_FH: file being reclaimed */ 19972 open_delegation_type4 dc_delegate_type; 19973 }; 19975 struct WANT_DELEGATION4args { 19976 uint32_t wda_want; 19977 deleg_claim4 wda_claim; 19978 }; 19980 17.49.3. RESULT 19982 union WANT_DELEGATION4res switch (nfsstat4 wdr_status) { 19983 case NFS4_OK: 19984 open_delegation4 wdr_resok4; 19985 default: 19986 void; 19987 }; 19989 17.49.4. DESCRIPTION 19991 Where this description mandates the return of a specific error code 19992 for a specific condition, and where multiple conditions apply, the 19993 server MAY return any of the mandated error codes. 19995 This operation allows a client to get a delegation on all types of 19996 files except directories. The server MAY support this operation. If 19997 the server does not support this operation, it MUST return 19998 NFS4ERR_NOTSUPP. This operation also allows the client to register a 19999 "want" for a delegation for the specified file object, and be 20000 notified via a callback when the delegation is available. The server 20001 MAY support notifications of availability via callbacks. If the 20002 server does not support registration of wants it MUST NOT return an 20003 error to indicate that. 
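The support choices just described (operation unsupported versus wants unsupported, which is not an error) can be sketched as follows. All names here (the dict keys and helper) are hypothetical, and the encoding of the successful result (wdr_resok4) is elided.

```python
# Illustrative sketch of the WANT_DELEGATION support/want choices.
# Hypothetical names; not from the specification's XDR.
NFS4_OK = "NFS4_OK"
NFS4ERR_NOTSUPP = "NFS4ERR_NOTSUPP"

def want_delegation(server, wda_want, wda_claim):
    if not server.get("supports_want_delegation"):
        return NFS4ERR_NOTSUPP       # the operation itself is unsupported
    if server["can_grant"](wda_claim):
        return NFS4_OK               # delegation returned to the client
    if server.get("supports_wants"):
        # Register the want; availability is later signaled via callback.
        server.setdefault("wants", []).append((wda_want, wda_claim))
    # Lack of want-registration support is not itself reported as an error.
    return NFS4_OK
```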
20005 The client SHOULD NOT set OPEN4_SHARE_ACCESS_READ and SHOULD NOT set 20006 OPEN4_SHARE_ACCESS_WRITE in wda_want. If it does, the server MUST 20007 ignore them. 20009 The meanings of the following flags in wda_want are the same as they 20010 are in OPEN: 20012 OPEN4_SHARE_ACCESS_WANT_READ_DELEG 20014 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 20016 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 20018 OPEN4_SHARE_ACCESS_WANT_NO_DELEG 20020 OPEN4_SHARE_ACCESS_WANT_CANCEL 20022 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 20024 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 20026 The handling of the above flags in WANT_DELEGATION is the same as in 20027 OPEN. 20029 A request for a conflicting delegation MUST NOT trigger the recall of 20030 the existing delegation. 20032 The successful results of WANT_DELEGATION are of type open_delegation4, 20033 which is the same type as the "delegation" field in the results of 20034 the OPEN operation. The server constructs wdr_resok4 the same way it 20035 constructs OPEN's "delegation", with one difference: WANT_DELEGATION 20036 MUST NOT return a delegation type of OPEN_DELEGATE_NONE. As with 20037 OPEN, if (wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is zero, then 20038 the client is indicating no desire for a delegation, and the server 20039 MAY (but is not required to) return a delegation in the WANT_DELEGATION response. 20041 17.49.5. IMPLEMENTATION 20043 TBD 20045 17.50. Operation 57: DESTROY_CLIENTID - Destroy existing client ID 20047 Destroy existing client ID. 20049 17.50.1. SYNOPSIS 20051 client ID -> - 20053 17.50.2. ARGUMENT 20055 struct DESTROY_CLIENTID4args { 20056 clientid4 dca_clientid; 20057 }; 20059 17.50.3. RESULT 20061 struct DESTROY_CLIENTID4res { 20062 nfsstat4 dcr_status; 20063 }; 20065 17.50.4. DESCRIPTION 20067 The DESTROY_CLIENTID operation destroys the client ID if there are no 20068 sessions, opens, locks, delegations, layouts, or wants 20069 associated with the client ID.
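The precondition above amounts to a check that no associated state of any kind remains. A minimal sketch, with the caveat that the error name used for the failure case is an assumption:

```python
# Sketch of the DESTROY_CLIENTID precondition: destruction is allowed
# only when no associated state of any kind remains. The error name
# for the failure case is an assumption.
NFS4_OK = "NFS4_OK"
NFS4ERR_CLIENTID_BUSY = "NFS4ERR_CLIENTID_BUSY"  # assumed error name

def destroy_clientid(client_state):
    kinds = ("sessions", "opens", "locks",
             "delegations", "layouts", "wants")
    if any(client_state.get(kind) for kind in kinds):
        return NFS4ERR_CLIENTID_BUSY
    # Safe to destroy; the server may later reuse this client ID.
    client_state["destroyed"] = True
    return NFS4_OK
```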
20071 If the COMPOUND request starts with SEQUENCE, then the session 20072 identified in SEQUENCE must not be one bound to the client ID 20073 identified in DESTROY_CLIENTID, or the DESTROY_CLIENTID operation will 20074 fail because there is still a session bound to the client ID. 20075 DESTROY_CLIENTID MAY be the only operation in a COMPOUND request. 20077 Note that because the operation can be sent outside of a session, a 20078 client that retransmits the request may receive an error in response, 20079 even though the original request resulted in the successful 20080 destruction of the client ID. 20082 17.50.5. IMPLEMENTATION 20084 DESTROY_CLIENTID allows a server to immediately reclaim the resources 20085 consumed by an unused client ID, and also to forget that it ever 20086 generated the client ID. By forgetting that it ever generated the 20087 client ID, the server can safely reuse the client ID on a future 20088 EXCHANGE_ID operation. 20090 17.51. Operation 10044: ILLEGAL - Illegal operation 20092 17.51.1. SYNOPSIS 20094 -> () 20096 17.51.2. ARGUMENTS 20098 void; 20100 17.51.3. RESULTS 20102 /* 20103 * ILLEGAL: Response for illegal operation numbers 20104 */ 20105 struct ILLEGAL4res { 20106 nfsstat4 status; 20107 }; 20109 17.51.4. DESCRIPTION 20111 This operation is a placeholder for encoding a result to handle the 20112 case of the client sending an operation code within COMPOUND that is 20113 not supported. See the COMPOUND procedure description for more 20114 details. 20116 The status field of ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL. 20118 17.51.5. IMPLEMENTATION 20120 A client will probably not send an operation with code OP_ILLEGAL, but 20121 if it does, the response will be ILLEGAL4res just as it would be with 20122 any other invalid operation code.
Note that if the server gets an 20123 illegal operation code that is not OP_ILLEGAL, and if the server 20124 checks for legal operation codes during the XDR decode phase, then 20125 the ILLEGAL4res would not be returned. 20127 18. NFS version 4.1 Callback Procedures 20129 The procedures used for callbacks are defined in the following 20130 sections. In the interest of clarity, the terms "client" and 20131 "server" refer to NFS clients and servers, despite the fact that for 20132 an individual callback RPC, the sense of these terms would be 20133 precisely the opposite. 20135 18.1. Procedure 0: CB_NULL - No Operation 20137 18.1.1. SYNOPSIS 20139 18.1.2. ARGUMENTS 20141 void; 20143 18.1.3. RESULTS 20145 void; 20147 18.1.4. DESCRIPTION 20149 Standard NULL procedure. Void argument, void response. Even though 20150 there is no direct functionality associated with this procedure, the 20151 server will use CB_NULL to confirm the existence of a path for RPCs 20152 from server to client. 20154 18.1.5. ERRORS 20156 None. 20158 18.2. Procedure 1: CB_COMPOUND - Compound Operations 20160 18.2.1. SYNOPSIS 20162 compoundargs -> compoundres 20164 18.2.2. ARGUMENTS 20166 enum nfs_cb_opnum4 { 20167 OP_CB_GETATTR = 3, 20168 OP_CB_RECALL = 4, 20169 OP_CB_ILLEGAL = 10044 20170 }; 20172 union nfs_cb_argop4 switch (unsigned argop) { 20173 case OP_CB_GETATTR: CB_GETATTR4args opcbgetattr; 20174 case OP_CB_RECALL: CB_RECALL4args opcbrecall; 20175 case OP_CB_ILLEGAL: void opcbillegal; 20176 }; 20178 struct CB_COMPOUND4args { 20179 utf8str_cs tag; 20180 uint32_t minorversion; 20181 nfs_cb_argop4 argarray<>; 20182 }; 20184 18.2.3. RESULTS 20186 union nfs_cb_resop4 switch (unsigned resop){ 20187 case OP_CB_GETATTR: CB_GETATTR4res opcbgetattr; 20188 case OP_CB_RECALL: CB_RECALL4res opcbrecall; 20189 }; 20191 struct CB_COMPOUND4res { 20192 nfsstat4 status; 20193 utf8str_cs tag; 20194 nfs_cb_resop4 resarray<>; 20195 }; 20197 18.2.4. 
DESCRIPTION 20199 The CB_COMPOUND procedure is used to combine one or more of the 20200 callback procedures into a single RPC request. The callback RPC 20201 program has two main procedures: CB_NULL and CB_COMPOUND. All other 20202 operations use the CB_COMPOUND procedure as a wrapper. 20204 In the processing of the CB_COMPOUND procedure, the client may find 20205 that it does not have the available resources to execute any or all 20206 of the operations within the CB_COMPOUND sequence. This is discussed 20207 in Section 2.10.4.4. 20209 The minorversion field of the arguments MUST be the same as the 20210 minorversion of the COMPOUND procedure used to create the client ID 20211 and session. For NFSv4.1, minorversion MUST be set to 1. 20213 Contained within the CB_COMPOUND results is a 'status' field. This 20214 status must be equivalent to the status of the last operation that 20215 was executed within the CB_COMPOUND procedure. Therefore, if an 20216 operation incurred an error, then the 'status' value will be the same 20217 error value as is being returned for the operation that failed. 20219 For the definition of the "tag" field, see the section "Procedure 1: 20220 COMPOUND - Compound Operations". [[Comment.21: Need an xref.]] 20222 Illegal operation codes are handled in the same way as they are 20223 handled for the COMPOUND procedure. 20225 18.2.5. IMPLEMENTATION 20227 The CB_COMPOUND procedure is used to combine individual operations 20228 into a single RPC request. The client interprets each of the 20229 operations in turn. If an operation is executed by the client and 20230 the status of that operation is NFS4_OK, then the next operation in 20231 the CB_COMPOUND procedure is executed. The client continues this 20232 process until there are no more operations to be executed or one of 20233 the operations has a status value other than NFS4_OK. 20235 18.2.6.
ERRORS 20237 NFS4ERR_BADHANDLE NFS4ERR_BAD_STATEID NFS4ERR_BADXDR 20238 NFS4ERR_OP_ILLEGAL NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT 20240 19. NFS version 4.1 Callback Operations 20242 19.1. Operation 3: CB_GETATTR - Get Attributes 20244 19.1.1. SYNOPSIS 20246 fh, attr_request -> attrmask, attr_vals 20248 19.1.2. ARGUMENT 20250 /* 20251 * NFS4 Callback Procedure Definitions and Program 20252 */ 20254 /* 20255 * CB_GETATTR: Get Current Attributes 20256 */ 20257 struct CB_GETATTR4args { 20258 nfs_fh4 fh; 20259 bitmap4 attr_request; 20260 }; 20262 19.1.3. RESULT 20264 struct CB_GETATTR4resok { 20265 fattr4 obj_attributes; 20266 }; 20268 union CB_GETATTR4res switch (nfsstat4 status) { 20269 case NFS4_OK: 20270 CB_GETATTR4resok resok4; 20271 default: 20272 void; 20273 }; 20275 19.1.4. DESCRIPTION 20277 The CB_GETATTR operation is used by the server to obtain the current 20278 modified state of a file that has been write delegated. The 20279 attributes size and change are the only ones guaranteed to be 20280 serviced by the client. See the section "Handling of CB_GETATTR" for 20281 a full description of how the client and server are to interact with 20282 the use of CB_GETATTR. 20284 If the filehandle specified is not one for which the client holds a 20285 write open delegation, an NFS4ERR_BADHANDLE error is returned. 20287 19.1.5. IMPLEMENTATION 20289 The client returns attrmask bits and the associated attribute values 20290 only for the change attribute, and attributes that it may change 20291 (time_modify, and size). 20293 19.2. Operation 4: CB_RECALL - Recall an Open Delegation 20295 19.2.1. SYNOPSIS 20297 stateid, truncate, fh -> () 20299 19.2.2. ARGUMENT 20301 /* 20302 * CB_RECALL: Recall an Open Delegation 20303 */ 20304 struct CB_RECALL4args { 20305 stateid4 stateid; 20306 bool truncate; 20307 nfs_fh4 fh; 20308 }; 20310 19.2.3. RESULT 20312 struct CB_RECALL4res { 20313 nfsstat4 status; 20314 }; 20316 19.2.4. 
DESCRIPTION 20318 The CB_RECALL operation is used to begin the process of recalling an 20319 open delegation and returning it to the server. 20321 The truncate flag is used to optimize recall for a file which is 20322 about to be truncated to zero. When it is set, the client is freed 20323 of obligation to propagate modified data for the file to the server, 20324 since this data is irrelevant. 20326 If the handle specified is not one for which the client holds an open 20327 delegation, an NFS4ERR_BADHANDLE error is returned. 20329 If the stateid specified is not one corresponding to an open 20330 delegation for the file specified by the filehandle, an 20331 NFS4ERR_BAD_STATEID is returned. 20333 19.2.5. IMPLEMENTATION 20335 The client should reply to the callback immediately. Replying does 20336 not complete the recall except when an error was returned. The 20337 recall is not complete until the delegation is returned using a 20338 DELEGRETURN. 20340 19.3. Operation 5: CB_LAYOUTRECALL 20342 19.3.1. SYNOPSIS 20344 layout_type, iomode, layoutchanged, layoutrecall -> - 20346 19.3.2. ARGUMENT 20348 /* 20349 * NFSv4.1 callback arguments and results 20350 */ 20352 enum layoutrecall_type4 { 20353 LAYOUTRECALL4_FILE = 1, 20354 LAYOUTRECALL4_FSID = 2, 20355 LAYOUTRECALL4_ALL = 3 20356 }; 20358 struct layoutrecall_file4 { 20359 nfs_fh4 lor_fh; 20360 offset4 lor_offset; 20361 length4 lor_length; 20362 }; 20364 union layoutrecall4 switch(layoutrecall_type4 recalltype) { 20365 case LAYOUTRECALL4_FILE: 20366 layoutrecall_file4 lor_layout; 20367 case LAYOUTRECALL4_FSID: 20368 fsid4 lor_fsid; 20369 case LAYOUTRECALL4_ALL: 20370 void; 20371 }; 20373 struct CB_LAYOUTRECALL4args { 20374 layouttype4 clora_type; 20375 layoutiomode4 clora_iomode; 20376 bool clora_changed; 20377 layoutrecall4 clora_recall; 20378 }; 20380 19.3.3. RESULT 20382 struct CB_LAYOUTRECALL4res { 20383 nfsstat4 clorr_status; 20384 }; 20386 19.3.4. 
DESCRIPTION 20388 The CB_LAYOUTRECALL operation is used to begin the process of 20389 recalling layout segments, a layout, all layouts pertaining to a 20390 particular file system (FSID), or layouts in all file systems (ALL). 20391 If LAYOUTRECALL4_FILE is specified, the lor_offset and lor_length 20392 fields specify the layout segments. If a lor_length of all ones is 20393 specified, then all layout segments identified by the current file 20394 handle, clora_type, clora_iomode, and corresponding to the octet 20395 range from lor_offset to the end-of-file MUST be returned (via 20396 LAYOUTRETURN, see Section 17.44). The clora_iomode specifies the set 20397 of layouts to be returned. A clora_iomode of LAYOUTIOMODE4_ANY 20398 specifies that all matching layout segments, regardless of iomode, 20399 must be returned; otherwise, only layout segments that exactly match 20400 the iomode must be returned. If clora_iomode is LAYOUTIOMODE4_ANY, 20401 lor_offset is zero, and lor_length is all ones, then the entire layout 20402 is to be returned. 20404 If the clora_changed field is TRUE, then the client SHOULD NOT write 20405 and commit its modified data to the storage devices specified by the 20406 layout being recalled. Instead, it is preferable for the client to 20407 write and commit the modified data through the metadata server. 20408 Alternatively, the client may attempt to obtain a new layout. Note: 20409 in order to obtain a new layout the client must first return the old 20410 layout. Since obtaining a new layout is not guaranteed to succeed, 20411 the client must be ready to write and commit its modified data 20412 through the metadata server. 20414 If the client does not hold any layout segment either matching or 20415 overlapping with the requested layout, it returns 20416 NFS4ERR_NOMATCHING_LAYOUT. 20418 If LAYOUTRECALL4_FSID is specified, the fsid specifies the file 20419 system for which any outstanding layouts MUST be returned.
If 20420 LAYOUTRECALL4_ALL is specified, all outstanding layouts MUST be 20421 returned. In addition, LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL 20422 specify that all the storage device ID to storage device address 20423 mappings in the affected file system(s) are also recalled. The 20424 respective LAYOUTRETURN with either LAYOUTRETURN4_FSID or 20425 LAYOUTRETURN4_ALL acknowledges to the server that the client 20426 invalidated those device mappings. Device mappings are also 20427 invalidated when no layouts are found for LAYOUTRECALL4_FSID or 20428 LAYOUTRECALL4_ALL and NFS4ERR_NOMATCHING_LAYOUT is returned. 20430 19.3.5. IMPLEMENTATION 20432 The client should reply to the callback immediately. Replying does 20433 not complete the recall except when an error is returned; otherwise 20434 the recall is not complete until the layout(s) are returned using a 20435 LAYOUTRETURN operation. 20437 The client should complete any in-flight I/O operations using the 20438 recalled layout(s) before returning them via LAYOUTRETURN. If the 20439 client has buffered modified data, there are a number of options for 20440 writing and committing that data. If clora_changed is false, the 20441 client may choose to write modified data directly to storage before 20442 calling LAYOUTRETURN. However, if clora_changed is true, the client 20443 may either choose to write it later using normal NFSv4 WRITE 20444 operations to the metadata server, or it may attempt to obtain a new 20445 layout after first returning the recalled layout, using the new 20446 layout to write the modified data. Regardless of whether the client 20447 is holding a layout, it may always write data through the metadata 20448 server. 20450 If modified data is written while the layout is held, the client must 20451 still issue LAYOUTCOMMIT operations at the appropriate time, 20452 especially before issuing the LAYOUTRETURN.
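The segment-matching rules in the DESCRIPTION above (exact iomode match unless the recall uses LAYOUTIOMODE4_ANY; a lor_length of all ones meaning "from lor_offset to end of file") can be sketched as follows. Names other than the lor_* and clora_* fields are illustrative.

```python
# Sketch of matching a held layout segment against a
# LAYOUTRECALL4_FILE recall. A length of all ones (64 bits) means
# "to end of file" for both the recall range and the held segment.
ALL_ONES = 2**64 - 1
LAYOUTIOMODE4_ANY = "ANY"

def segment_recalled(seg_iomode, seg_offset, seg_length,
                     clora_iomode, lor_offset, lor_length):
    if clora_iomode != LAYOUTIOMODE4_ANY and seg_iomode != clora_iomode:
        return False
    recall_end = ALL_ONES if lor_length == ALL_ONES else lor_offset + lor_length
    seg_end = ALL_ONES if seg_length == ALL_ONES else seg_offset + seg_length
    # The segment is recalled if its octet range overlaps the recall's.
    return seg_offset < recall_end and lor_offset < seg_end
```

If no held segment matches or overlaps, the client would respond with NFS4ERR_NOMATCHING_LAYOUT, per the DESCRIPTION.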
If a large amount of 20453 modified data is outstanding, the client may issue LAYOUTRETURNs for 20454 portions of the layout being recalled; this allows the server to 20455 monitor the client's progress and adherence to the callback. 20456 However, the last LAYOUTRETURN in a sequence of returns, MUST specify 20457 the full range being recalled (see Section 12.5.4.1 for details). 20459 19.4. Operation 6: CB_NOTIFY - Notify directory changes 20461 Tell the client of directory changes. 20463 19.4.1. SYNOPSIS 20465 stateid, notification -> {} 20467 19.4.2. ARGUMENT 20469 /* Changed entry information. */ 20470 struct notify_entry4 { 20471 component4 ne_file; 20472 fattr4 ne_attrs; 20473 }; 20475 /* Previous entry information */ 20476 struct prev_entry4 { 20477 notify_entry4 pe_prev_entry; 20478 /* what READDIR returned for this entry */ 20479 nfs_cookie4 pe_prev_entry_cookie; 20480 }; 20481 struct notify_add4 { 20482 notify_entry4 nad_new_entry; 20483 /* what READDIR would have returned for this entry */ 20484 nfs_cookie4 nad_new_entry_cookie<1>; 20485 prev_entry4 nad_prev_entry<1>; 20486 bool nad_last_entry; 20487 }; 20489 struct notify_attr4 { 20490 notify_entry4 na_changed_entry; 20491 }; 20493 struct notify_remove4 { 20494 notify_entry4 nrm_old_entry; 20495 nfs_cookie4 nrm_old_entry_cookie; 20496 }; 20498 struct notify_rename4 { 20499 notify_entry4 nrn_old_entry; 20500 notify_add4 nrn_new_entry; 20501 }; 20503 struct notify_verifier4 { 20504 verifier4 nv_old_cookieverf; 20505 verifier4 nv_new_cookieverf; 20506 }; 20508 enum notify_type4 { 20509 NOTIFY4_CHANGE_CHILD_ATTRS = 0, 20510 NOTIFY4_CHANGE_DIR_ATTRS = 1, 20511 NOTIFY4_REMOVE_ENTRY = 2, 20512 NOTIFY4_ADD_ENTRY = 3, 20513 NOTIFY4_RENAME_ENTRY = 4, 20514 NOTIFY4_CHANGE_COOKIE_VERIFIER = 5 20515 }; 20517 /* 20518 * Notification information sent to the client. 
20519 */ 20520 union notify4 switch (notify_type4 n_type) { 20521 case NOTIFY4_CHANGE_CHILD_ATTRS: 20522 notify_attr4 n_change_child_attrs; 20523 case NOTIFY4_CHANGE_DIR_ATTRS: 20524 fattr4 n_change_dir_attrs; 20525 case NOTIFY4_REMOVE_ENTRY: 20526 notify_remove4 n_remove_notify; 20527 case NOTIFY4_ADD_ENTRY: 20528 notify_add4 n_add_notify; 20530 case NOTIFY4_RENAME_ENTRY: 20531 notify_rename4 n_rename_notify; 20532 case NOTIFY4_CHANGE_COOKIE_VERIFIER: 20533 notify_verifier4 n_verf_notify; 20534 }; 20536 struct CB_NOTIFY4args { 20537 stateid4 cna_stateid; 20538 nfs_fh4 cna_fh; 20539 notify4 cna_changes<>; 20540 }; 20542 19.4.3. RESULT 20544 struct CB_NOTIFY4res { 20545 nfsstat4 cnr_status; 20546 }; 20548 19.4.4. DESCRIPTION 20550 The CB_NOTIFY operation is used by the server to send notifications 20551 to clients about changes in a delegated directory. These 20552 notifications are sent over the callback path. The notification is 20553 sent once the original request has been processed on the server. The 20554 server will send an array of notifications for all changes that might 20555 have occurred in the directory. The notify_type4 can only have one 20556 bit set for each notification in the array. If the client holding 20557 the delegation makes any changes in the directory that cause files or 20558 subdirectories to be added or removed, the server will notify that 20559 client of the resulting change(s). If the client holding the 20560 delegation is making attribute or cookie verifier changes only, the 20561 server does not need to send notifications to that client. The 20562 server will send the following information for each operation: 20564 ADDING A FILE The server will send information about the new entry 20565 being created along with the cookie for that entry. The entry 20566 information (data type notify_add4) includes the component name of 20567 the entry and attributes.
If this entry is added to the end of 20568 the directory, the server will set the nad_last_entry flag to 20569 true. If the file is added such that there is at least one entry 20570 before it, the server will also return the previous entry 20571 information (nad_prev_entry, a variable-length array of up to one 20572 element; a zero-length array indicates there is no previous 20573 entry), along with its cookie. This is to help clients find the 20574 right location in their DNLC or directory caches where this entry 20575 should be cached. If the new entry's cookie is available, it will 20576 be in nad_new_entry_cookie (another variable-length array of up to 20577 one element). 20579 REMOVING A FILE The server will send information about the directory 20580 entry being deleted. The server will also send the cookie value 20581 for the deleted entry so that clients can get to the cached 20582 information for this entry. 20584 RENAMING A FILE The server will send information about both the old 20585 entry and the new entry. This includes name and attributes for 20586 each entry. This notification is only sent if both entries are in 20587 the same directory. If the rename is across directories, the 20588 server will send a remove notification to one directory and an add 20589 notification to the other directory, assuming both have a 20590 directory delegation. 20592 FILE/DIR ATTRIBUTE CHANGE The client will use the attribute mask to 20593 inform the server of attributes for which it wants to receive 20594 notifications. This change notification can be requested for both 20595 changes to the attributes of the directory as well as changes to 20596 any file attributes in the directory by using two separate 20597 attribute masks. The client cannot ask for change attribute 20598 notification per file. One attribute mask covers all the files in 20599 the directory. Upon any attribute change, the server will send 20600 back the values of changed attributes.
Notifications might not 20601 make sense for some file-system-wide attributes, and it is up to 20602 the server to decide which subset it wants to support. The client 20603 can negotiate the frequency of attribute notifications by letting 20604 the server know how often it wants to be notified of an attribute 20605 change. The server will return supported notification frequencies 20606 or an indication that no notification is permitted for directory 20607 or child attributes by setting the dir_notif_delay and 20608 dir_entry_notif_delay attributes respectively. 20610 COOKIE VERIFIER CHANGE If the cookie verifier changes while a client 20611 is holding a delegation, the server will notify the client so that 20612 it can invalidate its cookies and reissue a READDIR to get the new 20613 set of cookies. 20615 19.4.5. IMPLEMENTATION 20617 19.5. Operation 7: CB_PUSH_DELEG 20619 19.5.1. SYNOPSIS 20621 fh, stateid -> { } 20623 19.5.2. ARGUMENT 20625 struct CB_PUSH_DELEG4args { 20626 stateid4 cpda_stateid; 20627 nfs_fh4 cpda_fh; 20628 open_delegation4 cpda_delegation; 20630 }; 20632 19.5.3. RESULT 20634 struct CB_PUSH_DELEG4res { 20635 nfsstat4 cpdr_status; 20636 }; 20638 19.5.4. DESCRIPTION 20640 CB_PUSH_DELEG is used by the server to both signal to the client that 20641 the delegation it wants is available and to simultaneously offer the 20642 delegation to the client. The client has the choice of accepting the 20643 delegation by returning NFS4_OK to the server, delaying the decision 20644 to accept the offered delegation by returning NFS4ERR_DELAY, or 20645 permanently rejecting the offer of the delegation via any other error 20646 status. 20648 The server MUST send in cpda_delegation a delegation corresponding to 20649 the type the client requested in the OPEN, WANT_DELEGATION, 20650 or GET_DIR_DELEGATION request.
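The client-side decision just described (accept, defer, or permanently reject) can be sketched as follows. The dict keys are hypothetical, and the specific rejection status chosen here is an illustrative assumption since any error other than NFS4ERR_DELAY rejects the offer.

```python
# Sketch of the client's choice on receiving CB_PUSH_DELEG:
# accept (NFS4_OK), defer (NFS4ERR_DELAY), or permanently reject
# (any other error status; the name below is an assumption).
NFS4_OK = "NFS4_OK"
NFS4ERR_DELAY = "NFS4ERR_DELAY"
NFS4ERR_REJECT_DELEG = "NFS4ERR_REJECT_DELEG"  # assumed rejection status

def cb_push_deleg(client, cpda_fh, cpda_delegation):
    if client["wants_now"]:
        client.setdefault("delegations", {})[cpda_fh] = cpda_delegation
        return NFS4_OK
    if client["may_want_later"]:
        # Defer; the server MAY offer the delegation elsewhere meanwhile.
        return NFS4ERR_DELAY
    return NFS4ERR_REJECT_DELEG  # permanently rejects the offer
```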
20652 If the client does return NFS4ERR_DELAY and there is a conflicting 20653 delegation request, the server MAY process it at the expense of the 20654 client that returned NFS4ERR_DELAY. The client's want will not be 20655 cancelled, but MAY be processed behind other delegation requests or 20656 registered wants. 20658 19.5.5. IMPLEMENTATION 20660 TBD 20662 19.6. Operation 8: CB_RECALL_ANY - Keep any N delegations 20664 Notify client to return delegations and keep N of them. 20666 19.6.1. SYNOPSIS 20668 N, type_mask -> {} 20670 19.6.2. ARGUMENT 20672 const RCA4_TYPE_MASK_RDATA_DLG = 0; 20673 const RCA4_TYPE_MASK_WDATA_DLG = 1; 20674 const RCA4_TYPE_MASK_DIR_DLG = 2; 20675 const RCA4_TYPE_MASK_FILE_LAYOUT = 3; 20676 const RCA4_TYPE_MASK_BLK_LAYOUT_MIN = 4; 20677 const RCA4_TYPE_MASK_BLK_LAYOUT_MAX = 7; 20678 const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; 20679 const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 11; 20680 const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN = 12; 20681 const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX = 15; 20683 struct CB_RECALL_ANY4args { 20684 uint32_t craa_objects_to_keep; 20685 bitmap4 craa_type_mask; 20686 }; 20688 19.6.3. RESULT 20690 struct CB_RECALL_ANY4res { 20691 nfsstat4 crar_status; 20692 }; 20694 19.6.4. DESCRIPTION 20696 The server may decide that it cannot hold all of the state for 20697 recallable objects, such as delegations and layouts, without running 20698 out of resources. In such a case, it is free to recall individual 20699 objects to reduce the load, but this would be far from optimal. 20701 Because the general purpose of such recallable objects as delegations 20702 is to eliminate client interaction with the server, the server cannot 20703 interpret lack of recent use as indicating that the object is no 20704 longer useful. The absence of visible use may be the result of a 20705 large number of potential operations having been eliminated.
In the case of 20706 layouts, the layout will be used explicitly but the meta-data server 20707 does not have direct knowledge of such use. 20709 In order to implement an effective reclaim scheme for such objects, 20710 the server's knowledge of available resources must be used to 20711 determine when objects must be recalled with the clients selecting 20712 the actual objects to be returned. 20714 Server implementations may differ in their resource allocation 20715 requirements. For example, one server may share resources among all 20716 classes of recallable objects whereas another may use separate 20717 resource pools for layouts and for delegations, or further separate 20718 resources by types of delegations. 20720 When a given resource pool is over-utilized, the server can issue a 20721 CB_RECALL_ANY to clients holding recallable objects of the types 20722 involved, allowing it to keep a certain number of such objects and 20723 return any excess. A mask specifies which types of objects are to be 20724 limited. The client chooses, based on its own knowledge of current 20725 usefulness, which of the objects in that class should be returned. 20727 For NFSv4.1, sixteen bits are defined. For some of these, ranges are 20728 defined and it is up to the definition of the storage protocol to 20729 specify how these are to be used. There are ranges for blocks-based 20730 storage protocols, for object-based storage protocols and a reserved 20731 range for other experimental storage protocols. The RFC defining 20732 such a storage protocol needs to specify how particular bits within 20733 its range are to be used. For example, it may specify a mapping 20734 between attributes of the layout (read vs. write, size of area) and 20735 the bit to be used or it may define a field in the layout where the 20736 associated bit position is made available by the server to the 20737 client. 20739 When an undefined bit is set in the type mask, NFS4ERR_INVAL should 20740 be returned. 
However, even if a client does not support an object of 20741 the specified type, NFS4ERR_INVAL should not 20742 be returned if the bit is defined. Future minor versions of NFSv4 may expand the set of 20743 valid type mask bits. 20745 CB_RECALL_ANY specifies a count of objects that the client may keep 20746 as opposed to a count that the client must return. This is to avoid 20747 a potential race between a CB_RECALL_ANY that had a count of objects to 20748 free and a set of client-originated operations to return layouts or 20749 delegations. As a result of the race, the client and server would 20750 have differing ideas as to how many objects to return. Hence the 20751 client could mistakenly free too many. 20753 If resource demands prompt it, the server may send another 20754 CB_RECALL_ANY with a lower count, even if it has not yet received an 20755 acknowledgement from the client for a previous CB_RECALL_ANY with the 20756 same type mask. Although the possibility exists that these will be 20757 received by the client in an order different from the order in which 20758 they were sent, any such permutation of the callback stream is 20759 harmless. It is the job of the client to bring down the size of the 20760 recallable object set in line with each CB_RECALL_ANY received, and 20761 until that obligation is met, it cannot be canceled or modified by any 20762 subsequent CB_RECALL_ANY for the same type mask. Thus if the server 20763 sends two CB_RECALL_ANYs, the effect will be the same as if the 20764 lower count had been sent, whatever the order of recall receipt. Note 20765 that this means that a server cannot cancel the effect of a 20766 CB_RECALL_ANY by sending another recall with a higher count. When a 20767 CB_RECALL_ANY is received and the count is already within the limit 20768 set or is above a limit that the client is working to get down to, 20769 that callback has no effect. 20771 The client can choose to return any type of object specified by the 20772 mask.
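The "lowest count in force wins, a higher count has no effect" semantics described above can be sketched as a small piece of client-side state per type mask. The structure and function names here are hypothetical illustrations, not protocol elements.

```c
#include <stdint.h>

/* Hypothetical per-type-mask recall state a client might keep. */
struct recall_target {
    uint32_t current_held;   /* objects of the masked types currently held */
    uint32_t target;         /* lowest craa_objects_to_keep still in force */
    int      active;         /* nonzero while the client is still reducing */
};

/* Apply a newly received CB_RECALL_ANY count for this type mask.
 * A lower count tightens the obligation; a higher count cannot cancel
 * it, and a count at or above what is held has no effect. */
void apply_recall_any(struct recall_target *rt, uint32_t objects_to_keep)
{
    if (!rt->active || objects_to_keep < rt->target)
        rt->target = objects_to_keep;
    rt->active = rt->current_held > rt->target;
}
```

As the client returns objects it would decrement `current_held` and clear `active` once the obligation is met; which particular objects to return is left to the client's own notion of usefulness, as the text requires.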
If a server wishes to limit use of objects of a specific type, 20773 it should specify only that type in the mask sent. The client might 20774 not return the requested objects, and it is up to the server to handle 20775 this situation, typically by doing specific recalls to properly limit 20776 resource usage. The server should give the client enough time to 20777 return objects before proceeding to specific recalls. This time 20778 should not be less than the lease period. 20780 Servers are generally free not to give out recallable objects when 20781 insufficient resources are available. Note that the effect of such a 20782 policy is implicitly to give precedence to existing objects relative 20783 to requested ones, with the result that resources might not be 20784 optimally used. To prevent this, servers are well advised to make 20785 the point at which they start issuing CB_RECALL_ANY callbacks 20786 somewhat below that at which they cease to give out new delegations 20787 and layouts. This allows the client to purge its less-used objects 20788 whenever appropriate and so continue to have its subsequent requests 20789 granted from resources freed up by object returns. 20791 19.6.5. IMPLEMENTATION 20793 19.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL 20795 19.7.1. SYNOPSIS 20797 TBD 20799 19.7.2. ARGUMENT 20801 typedef CB_RECALL_ANY4args CB_RECALLABLE_OBJ_AVAIL4args; 20803 19.7.3. RESULT 20805 struct CB_RECALLABLE_OBJ_AVAIL4res { 20806 nfsstat4 croa_status; 20807 }; 20809 19.7.4. DESCRIPTION 20811 CB_RECALLABLE_OBJ_AVAIL is used by the server to signal the client 20812 that the server has resources to grant recallable objects that might 20813 previously have been denied by OPEN, WANT_DELEGATION, GET_DIR_DELEGATION, 20814 or LAYOUTGET.
20816 The argument objects_to_keep is the total number of recallable 20817 objects of the types indicated in the argument type_mask that the 20818 server believes it can allow the client to have, including the number 20819 of such objects the client already has. A client that tries to 20820 acquire more recallable objects than the server has informed it can have 20821 runs the risk of having objects recalled. 20823 19.7.5. IMPLEMENTATION 20825 TBD 20827 19.8. Operation 10: CB_RECALL_SLOT - change flow control limits 20829 Change flow control limits 20831 19.8.1. SYNOPSIS 20833 targetcount -> status 20835 19.8.2. ARGUMENT 20837 struct CB_RECALL_SLOT4args { 20838 uint32_t rsa_target_highest_slotid; 20839 }; 20841 19.8.3. RESULT 20843 struct CB_RECALL_SLOT4res { 20844 nfsstat4 rsr_status; 20845 }; 20847 19.8.4. DESCRIPTION 20849 The CB_RECALL_SLOT operation requests the client to return session 20850 slots, and, if applicable, transport credits (e.g., RDMA credits for 20851 connections bound to the operations channel) to the server. 20852 CB_RECALL_SLOT specifies rsa_target_highest_slotid, the target 20853 highest_slot the server wants for the session. The client should 20854 then work toward reducing the highest_slot to the target. 20856 If the session has only non-RDMA connections bound to its operations 20857 channel, then the client need only wait for all outstanding requests 20858 with a slotid > rsa_target_highest_slotid to complete, then issue a 20859 single COMPOUND consisting of a single SEQUENCE operation, with the 20860 sa_highslot field set to rsa_target_highest_slotid. If there are 20861 RDMA-based connections bound to the operations channel, then the client 20862 also needs to issue enough zero-length RDMA Sends to take the total 20863 RDMA credit count to rsa_target_highest_slotid + 1 or below. 20865 19.8.5. IMPLEMENTATION 20867 No discussion at this time. 20869 19.9.
Operation 11: CB_SEQUENCE - Supply callback channel sequencing 20870 and control 20872 Sequence and control 20874 19.9.1. SYNOPSIS 20876 control -> control 20878 19.9.2. ARGUMENT 20880 struct referring_call4 { 20881 sequenceid4 rc_sequenceid; 20882 slotid4 rc_slotid; 20883 }; 20885 struct referring_call_list4 { 20886 sessionid4 rcl_sessionid; 20887 referring_call4 rcl_referring_calls<>; 20888 }; 20890 struct CB_SEQUENCE4args { 20891 sessionid4 csa_sessionid; 20892 sequenceid4 csa_sequenceid; 20893 slotid4 csa_slotid; 20894 slotid4 csa_highest_slotid; 20895 bool csa_cachethis; 20896 referring_call_list4 csa_referring_call_lists<>; 20897 }; 20899 19.9.3. RESULT 20901 struct CB_SEQUENCE4resok { 20902 sessionid4 csr_sessionid; 20903 sequenceid4 csr_sequenceid; 20904 slotid4 csr_slotid; 20905 slotid4 csr_highest_slotid; 20906 slotid4 csr_target_highest_slotid; 20907 }; 20909 union CB_SEQUENCE4res switch (nfsstat4 csr_status) { 20910 case NFS4_OK: 20911 CB_SEQUENCE4resok csr_resok4; 20912 default: 20913 void; 20914 }; 20916 19.9.4. DESCRIPTION 20918 The CB_SEQUENCE operation is used to manage operational accounting 20919 for the callback channel of the session on which the operation is 20920 sent. The contents include the session to which this request 20921 belongs, the slotid and sequenceid used by the server to implement 20922 session request control and exactly once semantics, and exchanged 20923 slot maximums which are used to adjust the size of the replay cache. 20924 This operation MUST appear once as the first operation in each 20925 CB_COMPOUND procedure sent after the callback channel is successfully 20926 bound, or a protocol error must result. See Section 17.46.4 for a 20927 description of how slots are processed. 20929 If csa_cachethis is TRUE, then the server is requesting that the 20930 client cache the reply in the callback reply cache. The client MUST 20931 cache the reply (see Section 2.10.4.1.2).
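The per-slot sequencing rules for csa_sequenceid (replay on an equal value, new request when one greater, NFS4ERR_SEQ_MISORDERED otherwise, all modulo 32-bit wraparound) can be sketched as follows. The function and enum names are hypothetical; the comparison logic follows the rules this section states.

```c
#include <stdint.h>

typedef uint32_t sequenceid4;

enum seq_disposition { SEQ_NEW, SEQ_REPLAY, SEQ_MISORDERED };

/* Sketch of the check a client applies to an arriving csa_sequenceid
 * against the sequenceid cached for the slot, using unsigned
 * wraparound arithmetic on the 32-bit sequenceid. */
enum seq_disposition check_slot_seqid(sequenceid4 cached, sequenceid4 arrived)
{
    sequenceid4 diff = arrived - cached;   /* wraps modulo 2^32 */

    if (diff == 0)
        return SEQ_REPLAY;                 /* return the cached reply */
    if (diff == 1)
        return SEQ_NEW;                    /* increment slot, process ops */
    return SEQ_MISORDERED;                 /* NFS4ERR_SEQ_MISORDERED */
}
```

A difference of two or more and a value behind the cached sequenceid both fall into the final branch, matching the two NFS4ERR_SEQ_MISORDERED cases the text describes.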
20933 The csa_referring_call_lists array is the list of COMPOUND calls, 20934 identified by sessionid, slotid, and sequenceid, that the client 20935 previously sent to the server that could have triggered the callback. 20936 A sessionid is included because leased state is tied to a client ID, 20937 and a client ID can have multiple sessions. See Section 2.10.4.3, 20938 "Resolving server callback races with sessions". 20940 If the difference between csa_sequenceid and the sequenceid the 20941 client has for the slot is two (2) or more, then the client MUST return 20942 NFS4ERR_SEQ_MISORDERED. If csa_sequenceid is less than the client's 20943 cached sequenceid (accounting for wraparound of the unsigned 20944 sequenceid value), then the client MUST return 20945 NFS4ERR_SEQ_MISORDERED. If csa_sequenceid and the cached sequenceid 20946 are the same, this is a replay, and the client returns the response 20947 to the CB_COMPOUND that is cached. Otherwise, if csa_sequenceid is 20948 one greater (accounting for wraparound) than the cached sequenceid, 20949 then this is a new request, and the slot's sequenceid is incremented. 20950 The operations subsequent to CB_SEQUENCE, if any, are processed. If 20951 there are no other operations, the only other effect is to cache 20952 the CB_SEQUENCE reply in the slot. 20954 If CB_SEQUENCE returns an error, then the state of the slot 20955 (sequenceid, cached reply) is not changed. 20957 The client returns two "highest_slotid" values: csr_highest_slotid 20958 and csr_target_highest_slotid. The former is the highest slotid the 20959 client will accept in a future CB_SEQUENCE operation, and must not be 20960 less than the value of csa_highest_slotid. The latter is the 20961 highest slotid the client would prefer the server use on a future 20962 CB_SEQUENCE operation. 20964 19.9.5. IMPLEMENTATION 20966 19.10. Operation 12: CB_WANTS_CANCELLED 20968 19.10.1. SYNOPSIS 20970 fh, size -> - 20972 19.10.2.
ARGUMENT 20974 struct CB_WANTS_CANCELLED4args { 20975 bool cwca_contended_wants_cancelled; 20976 bool cwca_resourced_wants_cancelled; 20977 }; 20979 19.10.3. RESULT 20981 struct CB_WANTS_CANCELLED4res { 20982 nfsstat4 cwcr_status; 20983 }; 20985 19.10.4. DESCRIPTION 20987 The CB_WANTS_CANCELLED operation is used to notify the client that 20988 some or all of the wants it registered for recallable delegations and 20989 layouts have been canceled. 20991 If cwca_contended_wants_cancelled is TRUE, this indicates the server 20992 will not be pushing to the client any delegations that become 20993 available after contention passes. 20995 If cwca_resourced_wants_cancelled is TRUE, this indicates the server 20996 will not notify the client when there are resources on the server to 20997 grant delegations or layouts. 20999 After receiving a CB_WANTS_CANCELLED operation, the client is free to 21000 attempt to acquire the delegations or layouts it was waiting for, and 21001 possibly re-register wants. 21003 19.10.5. IMPLEMENTATION 21005 19.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible lock 21006 availability 21008 19.11.1. SYNOPSIS 21010 fh, lockowner -> () 21012 19.11.2. ARGUMENT 21014 struct CB_NOTIFY_LOCK4args { 21015 lock_owner4 cnla_lock_owner; 21016 nfs_fh4 cnla_fh; 21017 }; 21019 19.11.3. RESULT 21021 struct CB_NOTIFY_LOCK4res { 21022 nfsstat4 cnlr_status; 21023 }; 21025 19.11.4. DESCRIPTION 21027 The server may use this operation to indicate that a lock for the 21028 given file and lockowner may have become available. 21030 This callback is meant to be used by servers to help reduce the 21031 latency of blocking locks in the case where they recognize that a 21032 client which has been polling for a blocking lock may now be able to 21033 acquire the lock. The notification is purely a hint, provided as a 21034 possible performance optimization, and is not required for 21035 correctness. 21037 19.11.5.
IMPLEMENTATION 21039 The server must not grant the lock to the client unless and until it 21040 receives an actual lock request from the client. Similarly, the 21041 client receiving this callback cannot assume that it now has the 21042 lock, or that a subsequent request for the lock will be successful. 21044 The server is not required to implement this callback, and even if it 21045 does, it is not required to use it in any particular case. Therefore 21046 the client must still rely on polling for blocking locks, as 21047 described in the "Blocking Locks" section. 21049 Similarly, the client is not required to implement this callback, and 21050 even if it does, it is still free to ignore it. Therefore the server must 21051 not assume that the client will act based on the callback. 21053 If the server supports this callback for a given file, it should set 21054 the OPEN4_RESULT_MAY_NOTIFY_LOCK flag when responding to successful 21055 opens for that file. This does not commit the server to use of 21056 CB_NOTIFY_LOCK, but the client may use this as a hint to decide how 21057 frequently to poll for locks derived from that open. 21059 19.12. Operation 10044: CB_ILLEGAL - Illegal Callback Operation 21061 19.12.1. SYNOPSIS 21063 -> () 21065 19.12.2. ARGUMENT 21067 void; 21069 19.12.3. RESULT 21071 /* 21072 * CB_ILLEGAL: Response for illegal operation numbers 21073 */ 21074 struct CB_ILLEGAL4res { 21075 nfsstat4 status; 21076 }; 21078 19.12.4. DESCRIPTION 21080 This operation is a placeholder for encoding a result to handle the 21081 case of the server sending an operation code within CB_COMPOUND that is 21082 not supported. See the COMPOUND procedure description for more 21083 details. 21085 The status field of CB_ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL. 21087 19.12.5.
IMPLEMENTATION 21089 A server will probably not send an operation with code OP_CB_ILLEGAL, 21090 but if it does, the response will be CB_ILLEGAL4res just as it would 21091 be with any other invalid operation code. Note that if the client 21092 gets an illegal operation code that is not OP_CB_ILLEGAL, and if the 21093 client checks for legal operation codes during the XDR decode phase, 21094 then the CB_ILLEGAL4res would not be returned. 21096 20. Security Considerations 21098 TBD 21100 21. IANA Considerations 21102 21.1. Defining new layout types 21104 New layout type numbers will be requested from IANA. IANA will only 21105 provide layout type numbers for Standards Track RFCs approved by the 21106 IESG, in accordance with the Standards Action policy defined in RFC 2434 21107 [16]. 21109 The author of a new pNFS layout specification must follow these steps 21110 to obtain acceptance of the layout type as a standard: 21112 1. The author devises the new layout specification. 21114 2. The new layout type specification MUST, at a minimum: 21116 * Define the contents of the layout-type-specific fields of the 21117 following data types: 21119 + the da_addr_body field of the device_addr4 data type; 21121 + the loh_body field of the layouthint4 data type; 21123 + the loc_body field of the layout_content4 data type (which in 21124 turn is the lo_content field of the layout4 data type); 21126 + the lou_body field of the layoutupdate4 data type; 21128 * Describe or define the storage access protocol used to access 21129 the data servers. 21131 * Describe the methods of recovery from storage device restart, 21132 and loss of layout state on the metadata server (see 21133 Section 12.7.3). 21135 * Include a security considerations section. 21137 3. The author documents the new layout specification as an Internet 21138 Draft. 21140 4. The author submits the Internet Draft for review through the IETF 21141 standards process as defined in "Internet Official Protocol 21142 Standards" (STD 1).
The new layout specification will be 21143 submitted for eventual publication as a standards track RFC. 21145 5. The layout specification progresses through the IETF standards 21146 process; the new option will be reviewed by the NFSv4 Working 21147 Group (if that group still exists), or as an Internet Draft not 21148 submitted by an IETF working group. 21150 22. References 21152 22.1. Normative References 21154 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 21155 Levels", March 1997. 21157 [2] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 21158 C., Eisler, M., and D. Noveck, "Network File System (NFS) 21159 version 4 Protocol", RFC 3530, April 2003. 21161 [3] Eisler, M., "XDR: External Data Representation Standard", 21162 STD 67, RFC 4506, May 2006. 21164 [4] Srinivasan, R., "RPC: Remote Procedure Call Protocol 21165 Specification Version 2", RFC 1831, August 1995. 21167 [5] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 21168 Specification", RFC 2203, September 1997. 21170 [6] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964, 21171 June 1996. 21173 [7] Eisler, M., "LIPKEY - A Low Infrastructure Public Key Mechanism 21174 Using SPKM", RFC 2847, June 2000. 21176 [8] Linn, J., "Generic Security Service Application Program 21177 Interface Version 2, Update 1", RFC 2743, January 2000. 21179 [9] Hinden, R. and S. Deering, "IP Version 6 Addressing 21180 Architecture", RFC 1884, December 1995. 21182 [10] International Organization for Standardization, "Information 21183 Technology - Universal Multiple-octet coded Character Set (UCS) 21184 - Part 1: Architecture and Basic Multilingual Plane", 21185 ISO Standard 10646-1, May 1993. 21187 [11] Alvestrand, H., "IETF Policy on Character Sets and Languages", 21188 BCP 18, RFC 2277, January 1998. 21190 [12] Hoffman, P. and M. Blanchet, "Preparation of Internationalized 21191 Strings ("stringprep")", RFC 3454, December 2002. 21193 [13] Hoffman, P. and M. 
Blanchet, "Nameprep: A Stringprep Profile 21194 for Internationalized Domain Names (IDN)", RFC 3491, 21195 March 2003. 21197 [14] Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed-Hashing 21198 for Message Authentication", RFC 2104, February 1997. 21200 [15] Schaad, J., Kaliski, B., and R. Housley, "Additional Algorithms 21201 and Identifiers for RSA Cryptography for use in the Internet 21202 X.509 Public Key Infrastructure Certificate and Certificate 21203 Revocation List (CRL) Profile", RFC 4055, June 2005. 21205 [16] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA 21206 Considerations Section in RFCs", BCP 26, RFC 2434, 21207 October 1998. 21209 22.2. Informative References 21211 [17] Nowicki, B., "NFS: Network File System Protocol specification", 21212 RFC 1094, March 1989. 21214 [18] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 21215 Protocol Specification", RFC 1813, June 1995. 21217 [19] Eisler, M., "NFS Version 2 and Version 3 Security Issues and 21218 the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", 21219 RFC 2623, June 1999. 21221 [20] Juszczak, C., "Improving the Performance and Correctness of an 21222 NFS Server", USENIX Conference Proceedings , June 1990. 21224 [21] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On- 21225 line Database", RFC 3232, January 2002. 21227 [22] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 21228 RFC 1833, August 1995. 21230 [23] Zelenka, J., Welch, B., and B. Halevy, "Object-based pNFS 21231 Operations", July 2005, . 21234 [24] Black, D., "pNFS Block/Volume Layout", July 2005, . 21237 [25] Callaghan, B., "WebNFS Client Specification", RFC 2054, 21238 October 1996. 21240 [26] Callaghan, B., "WebNFS Server Specification", RFC 2055, 21241 October 1996. 21243 [27] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, 21244 June 1999. 21246 [28] Simonsen, K., "Character Mnemonics and Character Sets", 21247 RFC 1345, June 1992. 
21249 [29] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E. 21250 Zeidner, "Internet Small Computer Systems Interface (iSCSI)", 21251 RFC 3720, April 2004. 21253 [30] Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version 21254 (FCP-2)", ANSI/INCITS 350-2003, Oct 2003. 21256 [31] Weber, R., "Object-Based Storage Device Commands (OSD)", ANSI/ 21257 INCITS 400-2004, July 2004, 21258 . 21260 [32] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997. 21262 [33] Chiu, A., Eisler, M., and B. Callaghan, "Security Negotiation 21263 for WebNFS", RFC 2755, January 2000. 21265 Appendix A. Acknowledgments 21267 The initial drafts for the SECINFO extensions were edited by Mike 21268 Eisler with contributions from Peng Dai, Sergey Klyushin, and Carl 21269 Burnett. 21271 The initial drafts for the SESSIONS extensions were edited by Tom 21272 Talpey, Spencer Shepler, Jon Bauman with contributions from Charles 21273 Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak, 21274 Trond Myklebust, Dave Noveck, John Scott, Mike Stolarchuk and Mark 21275 Wittle. [[Comment.22: global namespace stuff?]] 21277 The initial drafts for the Directory Delegations support were 21278 contributed by Saadia Khan with input from Dave Noveck, Mike Eisler, 21279 Carl Burnett, Ted Anderson and Tom Talpey. 21281 The initial drafts for the ACL explanations were contributed by Sam 21282 Falkner and Lisa Week. 21284 The initial drafts for the parallel NFS support were edited by Brent 21285 Welch and Garth Goodson. Additional authors for those documents were 21286 Benny Halevy, David Black, and Andy Adamson. Additional input came 21287 from the informal group which contributed to the construction of the 21288 initial pNFS drafts; specific acknowledgement goes to Gary Grider, 21289 Peter Corbett, Dave Noveck, and Peter Honeyman. The pNFS work was 21290 inspired by the NASD and OSD work done by Garth Gibson. 
Gary Grider 21291 of the national labs (LANL) has also been a champion of high- 21292 performance parallel I/O. 21294 Fredric Isaman found several errors in draft versions of the ONC RPC 21295 XDR description of the NFSv4.1 protocol. 21297 Authors' Addresses 21299 Spencer Shepler 21300 Sun Microsystems, Inc. 21301 7808 Moonflower Drive 21302 Austin, TX 78750 21303 USA 21305 Phone: +1-512-349-9376 21306 Email: spencer.shepler@sun.com 21308 Mike Eisler 21309 Network Appliance, Inc. 21310 5765 Chase Point Circle 21311 Colorado Springs, CO 80919 21312 USA 21314 Phone: +1-719-599-9026 21315 Email: email2mre-@yahoo.com 21316 URI: Insert ietf2 between the - and @ symbols in the above address 21317 David Noveck 21318 Network Appliance, Inc. 21319 1601 Trapelo Road, Suite 16 21320 Waltham, MA 02454 21321 USA 21323 Phone: +1-781-768-5347 21324 Email: dnoveck@netapp.com 21326 Full Copyright Statement 21328 Copyright (C) The IETF Trust (2007). 21330 This document is subject to the rights, licenses and restrictions 21331 contained in BCP 78, and except as set forth therein, the authors 21332 retain all their rights. 21334 This document and the information contained herein are provided on an 21335 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 21336 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 21337 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 21338 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 21339 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 21340 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
21342 Intellectual Property 21344 The IETF takes no position regarding the validity or scope of any 21345 Intellectual Property Rights or other rights that might be claimed to 21346 pertain to the implementation or use of the technology described in 21347 this document or the extent to which any license under such rights 21348 might or might not be available; nor does it represent that it has 21349 made any independent effort to identify any such rights. Information 21350 on the procedures with respect to rights in RFC documents can be 21351 found in BCP 78 and BCP 79. 21353 Copies of IPR disclosures made to the IETF Secretariat and any 21354 assurances of licenses to be made available, or the result of an 21355 attempt made to obtain a general license or permission for the use of 21356 such proprietary rights by implementers or users of this 21357 specification can be obtained from the IETF on-line IPR repository at 21358 http://www.ietf.org/ipr. 21360 The IETF invites any interested party to bring to its attention any 21361 copyrights, patents or patent applications, or other proprietary 21362 rights that may cover technology that may be required to implement 21363 this standard. Please address the information to the IETF at 21364 ietf-ipr@ietf.org. 21366 Acknowledgment 21368 Funding for the RFC Editor function is provided by the IETF 21369 Administrative Support Activity (IASA).