NFSv4                                                          S. Shepler
Internet-Draft                                                  M. Eisler
Intended status: Standards Track                                D. Noveck
Expires: June 18, 2009                                            Editors
                                                        December 15, 2008

                      NFS Version 4 Minor Version 1
                   draft-ietf-nfsv4-minorversion1-29.txt

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on June 18, 2009.

Copyright Notice

Copyright (c) 2008 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

This document describes NFS version 4 minor version one, including features retained from the base protocol (NFS version 4 minor version zero which is specified in RFC3530) and protocol extensions made subsequently.  Major extensions introduced in NFS version 4 minor version one include: Sessions, Directory Delegations, and parallel NFS (pNFS).  NFS version 4 minor version one has no dependencies on NFS version 4 minor version zero, and is considered a separate protocol.  Thus this document neither updates nor obsoletes RFC3530.  NFS minor version one is deemed superior to NFS minor version zero with no loss of functionality, and its use is preferred over version zero.
Both NFS minor version zero and one can be used simultaneously 59 on the same network, between the same client and server. 61 Requirements Language 63 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 64 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 65 document are to be interpreted as described in RFC 2119 [1]. 67 Table of Contents 69 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 12 70 1.1. The NFS Version 4 Minor Version 1 Protocol . . . . . . . 12 71 1.2. Scope of this Document . . . . . . . . . . . . . . . . . 12 72 1.3. NFSv4 Goals . . . . . . . . . . . . . . . . . . . . . . 12 73 1.4. NFSv4.1 Goals . . . . . . . . . . . . . . . . . . . . . 13 74 1.5. General Definitions . . . . . . . . . . . . . . . . . . 13 75 1.6. Overview of NFSv4.1 Features . . . . . . . . . . . . . . 16 76 1.6.1. RPC and Security . . . . . . . . . . . . . . . . . . 16 77 1.6.2. Protocol Structure . . . . . . . . . . . . . . . . . 17 78 1.6.3. File System Model . . . . . . . . . . . . . . . . . 17 79 1.6.4. Locking Facilities . . . . . . . . . . . . . . . . . 19 80 1.7. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 20 81 2. Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 21 82 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 21 83 2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . 21 84 2.2.1. RPC-based Security . . . . . . . . . . . . . . . . . 21 85 2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 24 86 2.4. Client Identifiers and Client Owners . . . . . . . . . . 25 87 2.4.1. Upgrade from NFSv4.0 to NFSv4.1 . . . . . . . . . . 29 88 2.4.2. Server Release of Client ID . . . . . . . . . . . . 29 89 2.4.3. Resolving Client Owner Conflicts . . . . . . . . . . 30 90 2.5. Server Owners . . . . . . . . . . . . . . . . . . . . . 31 91 2.6. Security Service Negotiation . . . . . . . . . . . . . . 31 92 2.6.1. NFSv4.1 Security Tuples . . . . . . . . . . . . . . 32 93 2.6.2. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 32 94 2.6.3. Security Error . . . . . . . . . . . . . . . . . . . 32 95 2.7. Minor Versioning . . . . . . . . . . . . . . . . . . . . 37 96 2.8. Non-RPC-based Security Services . . . . . . . . . . . . 39 97 2.8.1. Authorization . . . . . . . . . . . . . . . . . . . 39 98 2.8.2. Auditing . . . . . . . . . . . . . . . . . . . . . . 39 99 2.8.3. Intrusion Detection . . . . . . . . . . . . . . . . 40 100 2.9. Transport Layers . . . . . . . . . . . . . . . . . . . . 40 101 2.9.1. REQUIRED and RECOMMENDED Properties of Transports . 40 102 2.9.2. Client and Server Transport Behavior . . . . . . . . 41 103 2.9.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 42 104 2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . 42 105 2.10.1. Motivation and Overview . . . . . . . . . . . . . . 42 106 2.10.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 44 107 2.10.3. Channels . . . . . . . . . . . . . . . . . . . . . . 45 108 2.10.4. Server Scope . . . . . . . . . . . . . . . . . . . . 46 109 2.10.5. Trunking . . . . . . . . . . . . . . . . . . . . . . 49 110 2.10.6. Exactly Once Semantics . . . . . . . . . . . . . . . 52 111 2.10.7. RDMA Considerations . . . . . . . . . . . . . . . . 65 112 2.10.8. Sessions Security . . . . . . . . . . . . . . . . . 68 113 2.10.9. The Secret State Verifier (SSV) GSS Mechanism . . . 73 114 2.10.10. Session Mechanics - Steady State . . . . . . . . . . 77 115 2.10.11. Session Inactivity Timer . . . . . . . . . . . . . . 79 116 2.10.12. 
Session Mechanics - Recovery . . . . . . . . . . . . 80 117 2.10.13. Parallel NFS and Sessions . . . . . . . . . . . . . 84 118 3. Protocol Constants and Data Types . . . . . . . . . . . . . . 85 119 3.1. Basic Constants . . . . . . . . . . . . . . . . . . . . 85 120 3.2. Basic Data Types . . . . . . . . . . . . . . . . . . . . 86 121 3.3. Structured Data Types . . . . . . . . . . . . . . . . . 87 122 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 96 123 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 96 124 4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 96 125 4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 97 126 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 97 127 4.2.1. General Properties of a Filehandle . . . . . . . . . 98 128 4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 98 129 4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 99 130 4.3. One Method of Constructing a Volatile Filehandle . . . . 100 131 4.4. Client Recovery from Filehandle Expiration . . . . . . . 100 132 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 101 133 5.1. REQUIRED Attributes . . . . . . . . . . . . . . . . . . 103 134 5.2. RECOMMENDED Attributes . . . . . . . . . . . . . . . . . 103 135 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 103 136 5.4. Classification of Attributes . . . . . . . . . . . . . . 105 137 5.5. Set-Only and Get-Only Attributes . . . . . . . . . . . . 106 138 5.6. REQUIRED Attributes - List and Definition References . . 106 139 5.7. RECOMMENDED Attributes - List and Definition 140 References . . . . . . . . . . . . . . . . . . . . . . . 107 141 5.8. Attribute Definitions . . . . . . . . . . . . . . . . . 109 142 5.8.1. Definitions of REQUIRED Attributes . . . . . . . . . 109 143 5.8.2. Definitions of Uncategorized RECOMMENDED 144 Attributes . . . . . . . . . . . . . . . . . . . . . 111 145 5.9. Interpreting owner and owner_group . . . . . . . . . . . 117 146 5.10. Character Case Attributes . . . . . . . . . . . . . . . 119 147 5.11. Directory Notification Attributes . . . . . . . . . . . 120 148 5.12. pNFS Attribute Definitions . . . . . . . . . . . . . . . 120 149 5.13. Retention Attributes . . . . . . . . . . . . . . . . . . 122 150 6. Access Control Attributes . . . . . . . . . . . . . . . . . . 125 151 6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 125 152 6.2. File Attributes Discussion . . . . . . . . . . . . . . . 126 153 6.2.1. Attribute 12: acl . . . . . . . . . . . . . . . . . 126 154 6.2.2. Attribute 58: dacl . . . . . . . . . . . . . . . . . 141 155 6.2.3. Attribute 59: sacl . . . . . . . . . . . . . . . . . 141 156 6.2.4. Attribute 33: mode . . . . . . . . . . . . . . . . . 141 157 6.2.5. Attribute 74: mode_set_masked . . . . . . . . . . . 142 158 6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 143 159 6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 143 160 6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 144 161 6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 145 162 6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 145 163 6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 147 164 6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 147 165 7. Single-server Namespace . . . . . . . . . . . . . . . . . . . 151 166 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 152 167 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 152 168 7.3. 
Server Pseudo File System . . . . . . . . . . . . . . . 152 169 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 153 170 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 153 171 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 154 172 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 154 173 7.8. Security Policy and Namespace Presentation . . . . . . . 154 174 8. State Management . . . . . . . . . . . . . . . . . . . . . . 155 175 8.1. Client and Session ID . . . . . . . . . . . . . . . . . 156 176 8.2. Stateid Definition . . . . . . . . . . . . . . . . . . . 157 177 8.2.1. Stateid Types . . . . . . . . . . . . . . . . . . . 157 178 8.2.2. Stateid Structure . . . . . . . . . . . . . . . . . 158 179 8.2.3. Special Stateids . . . . . . . . . . . . . . . . . . 160 180 8.2.4. Stateid Lifetime and Validation . . . . . . . . . . 161 181 8.2.5. Stateid Use for I/O Operations . . . . . . . . . . . 164 182 8.2.6. Stateid Use for SETATTR Operations . . . . . . . . . 165 183 8.3. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 165 184 8.4. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 168 185 8.4.1. Client Failure and Recovery . . . . . . . . . . . . 168 186 8.4.2. Server Failure and Recovery . . . . . . . . . . . . 169 187 8.4.3. Network Partitions and Recovery . . . . . . . . . . 174 188 8.5. Server Revocation of Locks . . . . . . . . . . . . . . . 179 189 8.6. Short and Long Leases . . . . . . . . . . . . . . . . . 180 190 8.7. Clocks, Propagation Delay, and Calculating Lease 191 Expiration . . . . . . . . . . . . . . . . . . . . . . . 180 192 8.8. Obsolete Locking Infrastructure From NFSv4.0 . . . . . . 181 193 9. File Locking and Share Reservations . . . . . . . . . . . . . 182 194 9.1. Opens and Byte-Range Locks . . . . . . . . . . . . . . . 182 195 9.1.1. State-owner Definition . . . . . . . . . . . . . . . 182 196 9.1.2. Use of the Stateid and Locking . . . . . . . . . . . 182 197 9.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 185 198 9.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 186 199 9.4. Stateid Seqid Values and Byte-Range Locks . . . . . . . 186 200 9.5. Issues with Multiple Open-Owners . . . . . . . . . . . . 187 201 9.6. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 187 202 9.7. Share Reservations . . . . . . . . . . . . . . . . . . . 188 203 9.8. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 189 204 9.9. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 190 205 9.10. Parallel OPENs . . . . . . . . . . . . . . . . . . . . . 191 206 9.11. Reclaim of Open and Byte-Range Locks . . . . . . . . . . 191 207 10. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 192 208 10.1. Performance Challenges for Client-Side Caching . . . . . 192 209 10.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 193 210 10.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 195 212 10.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 198 213 10.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 198 214 10.3.2. Data Caching and File Locking . . . . . . . . . . . 199 215 10.3.3. Data Caching and Mandatory File Locking . . . . . . 201 216 10.3.4. Data Caching and File Identity . . . . . . . . . . . 201 217 10.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 202 218 10.4.1. Open Delegation and Data Caching . . . . . . . . . . 205 219 10.4.2. Open Delegation and File Locks . . . . . . . . . . . 206 220 10.4.3. Handling of CB_GETATTR . . . . . 
. . . . . . . . . . 206 221 10.4.4. Recall of Open Delegation . . . . . . . . . . . . . 209 222 10.4.5. Clients that Fail to Honor Delegation Recalls . . . 211 223 10.4.6. Delegation Revocation . . . . . . . . . . . . . . . 212 224 10.4.7. Delegations via WANT_DELEGATION . . . . . . . . . . 212 225 10.5. Data Caching and Revocation . . . . . . . . . . . . . . 213 226 10.5.1. Revocation Recovery for Write Open Delegation . . . 214 227 10.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 214 228 10.7. Data and Metadata Caching and Memory Mapped Files . . . 216 229 10.8. Name and Directory Caching without Directory 230 Delegations . . . . . . . . . . . . . . . . . . . . . . 219 231 10.8.1. Name Caching . . . . . . . . . . . . . . . . . . . . 219 232 10.8.2. Directory Caching . . . . . . . . . . . . . . . . . 220 233 10.9. Directory Delegations . . . . . . . . . . . . . . . . . 221 234 10.9.1. Introduction to Directory Delegations . . . . . . . 221 235 10.9.2. Directory Delegation Design . . . . . . . . . . . . 222 236 10.9.3. Attributes in Support of Directory Notifications . . 223 237 10.9.4. Directory Delegation Recall . . . . . . . . . . . . 223 238 10.9.5. Directory Delegation Recovery . . . . . . . . . . . 224 239 11. Multi-Server Namespace . . . . . . . . . . . . . . . . . . . 224 240 11.1. Location Attributes . . . . . . . . . . . . . . . . . . 225 241 11.2. File System Presence or Absence . . . . . . . . . . . . 225 242 11.3. Getting Attributes for an Absent File System . . . . . . 226 243 11.3.1. GETATTR Within an Absent File System . . . . . . . . 227 244 11.3.2. READDIR and Absent File Systems . . . . . . . . . . 228 245 11.4. Uses of Location Information . . . . . . . . . . . . . . 228 246 11.4.1. File System Replication . . . . . . . . . . . . . . 229 247 11.4.2. File System Migration . . . . . . . . . . . . . . . 230 248 11.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 231 249 11.5. Location Entries and Server Identity . . . . . . . . . . 233 250 11.6. Additional Client-side Considerations . . . . . . . . . 233 251 11.7. Effecting File System Transitions . . . . . . . . . . . 234 252 11.7.1. File System Transitions and Simultaneous Access . . 235 253 11.7.2. Simultaneous Use and Transparent Transitions . . . . 236 254 11.7.3. Filehandles and File System Transitions . . . . . . 239 255 11.7.4. Fileids and File System Transitions . . . . . . . . 239 256 11.7.5. Fsids and File System Transitions . . . . . . . . . 240 257 11.7.6. The Change Attribute and File System Transitions . . 241 258 11.7.7. Lock State and File System Transitions . . . . . . . 241 259 11.7.8. Write Verifiers and File System Transitions . . . . 246 260 11.7.9. Readdir Cookies and Verifiers and File System 261 Transitions . . . . . . . . . . . . . . . . . . . . 246 262 11.7.10. File System Data and File System Transitions . . . . 246 263 11.8. Effecting File System Referrals . . . . . . . . . . . . 248 264 11.8.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 248 265 11.8.2. Referral Example (READDIR) . . . . . . . . . . . . . 252 266 11.9. The Attribute fs_locations . . . . . . . . . . . . . . . 254 267 11.10. The Attribute fs_locations_info . . . . . . . . . . . . 257 268 11.10.1. The fs_locations_server4 Structure . . . . . . . . . 261 269 11.10.2. The fs_locations_info4 Structure . . . . . . . . . . 266 270 11.10.3. The fs_locations_item4 Structure . . . . . . . . . . 267 271 11.11. The Attribute fs_status . . . . . . . . . . . . . . . . 269 272 12. Parallel NFS (pNFS) . . . . . . . 
. . . . . . . . . . . . . . 273 273 12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 273 274 12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 274 275 12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 275 276 12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 275 277 12.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 275 278 12.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 275 279 12.2.5. Storage Protocol . . . . . . . . . . . . . . . . . . 276 280 12.2.6. Control Protocol . . . . . . . . . . . . . . . . . . 276 281 12.2.7. Layout Types . . . . . . . . . . . . . . . . . . . . 277 282 12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 277 283 12.2.9. Layout Iomode . . . . . . . . . . . . . . . . . . . 278 284 12.2.10. Device IDs . . . . . . . . . . . . . . . . . . . . . 278 285 12.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 280 286 12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 281 287 12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 281 288 12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 281 289 12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 282 290 12.5.3. Layout Stateid . . . . . . . . . . . . . . . . . . . 283 291 12.5.4. Committing a Layout . . . . . . . . . . . . . . . . 284 292 12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 287 293 12.5.6. Revoking Layouts . . . . . . . . . . . . . . . . . . 295 294 12.5.7. Metadata Server Write Propagation . . . . . . . . . 296 295 12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 296 296 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 297 297 12.7.1. Recovery from Client Restart . . . . . . . . . . . . 298 298 12.7.2. Dealing with Lease Expiration on the Client . . . . 298 299 12.7.3. Dealing with Loss of Layout State on the Metadata 300 Server . . . . . . . . . . . . . . . . . . . . . . . 299 301 12.7.4. Recovery from Metadata Server Restart . . . . . . . 300 302 12.7.5. Operations During Metadata Server Grace Period . . . 302 303 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 302 304 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 302 305 12.9. Security Considerations for pNFS . . . . . . . . . . . . 303 306 13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type . 304 307 13.1. Client ID and Session Considerations . . . . . . . . . . 304 308 13.1.1. Sessions Considerations for Data Servers . . . . . . 307 309 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 307 310 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 308 311 13.4. Interpreting the File Layout . . . . . . . . . . . . . . 312 312 13.4.1. Determining the Stripe Unit Number . . . . . . . . . 312 313 13.4.2. Interpreting the File Layout Using Sparse Packing . 312 314 13.4.3. Interpreting the File Layout Using Dense Packing . . 315 315 13.4.4. Sparse and Dense Stripe Unit Packing . . . . . . . . 317 316 13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 319 317 13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 320 318 13.7. COMMIT Through Metadata Server . . . . . . . . . . . . . 322 319 13.8. The Layout Iomode . . . . . . . . . . . . . . . . . . . 324 320 13.9. Metadata and Data Server State Coordination . . . . . . 324 321 13.9.1. Global Stateid Requirements . . . . . . . . . . . . 324 322 13.9.2. Data Server State Propagation . . . . . . . . . . . 325 323 13.10. Data Server Component File Size . . . . . . . . . . 
. . 327 324 13.11. Layout Revocation and Fencing . . . . . . . . . . . . . 328 325 13.12. Security Considerations for the File Layout Type . . . . 328 326 14. Internationalization . . . . . . . . . . . . . . . . . . . . 329 327 14.1. Stringprep profile for the utf8str_cs type . . . . . . . 330 328 14.2. Stringprep profile for the utf8str_cis type . . . . . . 332 329 14.3. Stringprep profile for the utf8str_mixed type . . . . . 333 330 14.4. UTF-8 Capabilities . . . . . . . . . . . . . . . . . . . 334 331 14.5. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 335 332 15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 335 333 15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 336 334 15.1.1. General Errors . . . . . . . . . . . . . . . . . . . 338 335 15.1.2. Filehandle Errors . . . . . . . . . . . . . . . . . 340 336 15.1.3. Compound Structure Errors . . . . . . . . . . . . . 341 337 15.1.4. File System Errors . . . . . . . . . . . . . . . . . 343 338 15.1.5. State Management Errors . . . . . . . . . . . . . . 345 339 15.1.6. Security Errors . . . . . . . . . . . . . . . . . . 345 340 15.1.7. Name Errors . . . . . . . . . . . . . . . . . . . . 346 341 15.1.8. Locking Errors . . . . . . . . . . . . . . . . . . . 347 342 15.1.9. Reclaim Errors . . . . . . . . . . . . . . . . . . . 348 343 15.1.10. pNFS Errors . . . . . . . . . . . . . . . . . . . . 349 344 15.1.11. Session Use Errors . . . . . . . . . . . . . . . . . 350 345 15.1.12. Session Management Errors . . . . . . . . . . . . . 351 346 15.1.13. Client Management Errors . . . . . . . . . . . . . . 352 347 15.1.14. Delegation Errors . . . . . . . . . . . . . . . . . 353 348 15.1.15. Attribute Handling Errors . . . . . . . . . . . . . 353 349 15.1.16. Obsoleted Errors . . . . . . . . . . . . . . . . . . 354 350 15.2. Operations and their valid errors . . . . . . . . . . . 355 351 15.3. Callback operations and their valid errors . . . . . . . 371 352 15.4. Errors and the operations that use them . . . . . . . . 373 353 16. NFSv4.1 Procedures . . . . . . . . . . . . . . . . . . . . . 388 354 16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 388 355 16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 389 357 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 400 358 18. NFSv4.1 Operations . . . . . . . . . . . . . . . . . . . . . 403 359 18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 403 360 18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 409 361 18.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 410 362 18.4. Operation 6: CREATE - Create a Non-Regular File Object . 413 363 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 364 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 416 365 18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 417 366 18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 417 367 18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 419 368 18.9. Operation 11: LINK - Create Link to a File . . . . . . . 420 369 18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 423 370 18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 427 371 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 428 372 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 430 373 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 431 374 18.15. Operation 17: NVERIFY - Verify Difference in 375 Attributes . . . . . . . . . . . . . . . . . . . . 
. . . 433 376 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 434 377 18.17. Operation 19: OPENATTR - Open Named Attribute 378 Directory . . . . . . . . . . . . . . . . . . . . . . . 453 379 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 454 380 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 456 381 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 456 382 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 458 383 18.22. Operation 25: READ - Read from File . . . . . . . . . . 459 384 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 461 385 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 465 386 18.25. Operation 28: REMOVE - Remove File System Object . . . . 466 387 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 468 388 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 472 389 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 473 390 18.29. Operation 33: SECINFO - Obtain Available Security . . . 474 391 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 478 392 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 481 393 18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 482 394 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control . . 486 395 18.34. Operation 41: BIND_CONN_TO_SESSION - Associate 396 Connection with Session . . . . . . . . . . . . . . . . 488 397 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 491 398 18.36. Operation 43: CREATE_SESSION - Create New Session and 399 Confirm Client ID . . . . . . . . . . . . . . . . . . . 509 400 18.37. Operation 44: DESTROY_SESSION - Destroy a Session . . . 519 401 18.38. Operation 45: FREE_STATEID - Free Stateid with No 402 Locks . . . . . . . . . . . . . . . . . . . . . . . . . 520 403 18.39. Operation 46: GET_DIR_DELEGATION - Get a directory 404 delegation . . . . . . . . . . . . . . . . . . . . . . . 521 406 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 525 407 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings 408 for a File System . . . . . . . . . . . . . . . . . . . 527 409 18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using 410 a Layout . . . . . . . . . . . . . . . . . . . . . . . . 529 411 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 532 412 18.44. Operation 51: LAYOUTRETURN - Release Layout 413 Information . . . . . . . . . . . . . . . . . . . . . . 542 414 18.45. Operation 52: SECINFO_NO_NAME - Get Security on 415 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 546 416 18.46. Operation 53: SEQUENCE - Supply Per-Procedure 417 Sequencing and Control . . . . . . . . . . . . . . . . . 547 418 18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 553 419 18.48. Operation 55: TEST_STATEID - Test Stateids for 420 Validity . . . . . . . . . . . . . . . . . . . . . . . . 555 421 18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 557 422 18.50. Operation 57: DESTROY_CLIENTID - Destroy a Client ID . . 561 423 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims 424 Finished . . . . . . . . . . . . . . . . . . . . . . . . 561 425 18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 564 426 19. NFSv4.1 Callback Procedures . . . . . . . . . . . . . . . . . 564 427 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 565 428 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 565 429 20. NFSv4.1 Callback Operations . . 
. . . . . . . . . . . . . . . 569 430 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 569 431 20.2. Operation 4: CB_RECALL - Recall a Delegation . . . . . . 570 432 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from 433 Client . . . . . . . . . . . . . . . . . . . . . . . . . 571 434 20.4. Operation 6: CB_NOTIFY - Notify Client of Directory 435 Changes . . . . . . . . . . . . . . . . . . . . . . . . 575 436 20.5. Operation 7: CB_PUSH_DELEG - Offer Previously 437 Requested Delegation to Client . . . . . . . . . . . . . 579 438 20.6. Operation 8: CB_RECALL_ANY - Keep Any N Recallable 439 Objects . . . . . . . . . . . . . . . . . . . . . . . . 580 440 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal 441 Resources for Recallable Objects . . . . . . . . . . . . 583 442 20.8. Operation 10: CB_RECALL_SLOT - Change Flow Control 443 Limits . . . . . . . . . . . . . . . . . . . . . . . . . 584 444 20.9. Operation 11: CB_SEQUENCE - Supply Backchannel 445 Sequencing and Control . . . . . . . . . . . . . . . . . 585 446 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending 447 Delegation Wants . . . . . . . . . . . . . . . . . . . . 587 448 20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of 449 Possible Lock Availability . . . . . . . . . . . . . . . 588 450 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of 451 Device ID Changes . . . . . . . . . . . . . . . . . . . 590 452 20.13. Operation 10044: CB_ILLEGAL - Illegal Callback 453 Operation . . . . . . . . . . . . . . . . . . . . . . . 592 455 21. Security Considerations . . . . . . . . . . . . . . . . . . . 592 456 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 594 457 22.1. Named Attribute Definitions . . . . . . . . . . . . . . 594 458 22.1.1. Initial Registry . . . . . . . . . . . . . . . . . . 595 459 22.1.2. Updating Registrations . . . . . . . . . . . . . . . 595 460 22.2. Device ID Notifications . . . . . . . . . . . . . . . . 595 461 22.2.1. Initial Registry . . . . . . . . . . . . . . . . . . 596 462 22.2.2. Updating Registrations . . . . . . . . . . . . . . . 596 463 22.3. Object Recall Types . . . . . . . . . . . . . . . . . . 596 464 22.3.1. Initial Registry . . . . . . . . . . . . . . . . . . 598 465 22.3.2. Updating Registrations . . . . . . . . . . . . . . . 598 466 22.4. Layout Types . . . . . . . . . . . . . . . . . . . . . . 598 467 22.4.1. Initial Registry . . . . . . . . . . . . . . . . . . 599 468 22.4.2. Updating Registrations . . . . . . . . . . . . . . . 599 469 22.4.3. Guidelines for Writing Layout Type Specifications . 599 470 22.5. Path Variable Definitions . . . . . . . . . . . . . . . 601 471 22.5.1. Path Variables Registry . . . . . . . . . . . . . . 601 472 22.5.2. Values for the ${ietf.org:CPU_ARCH} Variable . . . . 603 473 22.5.3. Values for the ${ietf.org:OS_TYPE} Variable . . . . 603 474 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 604 475 23.1. Normative References . . . . . . . . . . . . . . . . . . 604 476 23.2. Informative References . . . . . . . . . . . . . . . . . 607 477 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 608 478 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 611 479 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 611 481 1. Introduction 483 1.1. The NFS Version 4 Minor Version 1 Protocol 485 The NFS version 4 minor version 1 (NFSv4.1) protocol is the second 486 minor version of the NFS version 4 (NFSv4) protocol. 
The first minor version, NFSv4.0, is described in [29].  It generally follows the guidelines for the minor versioning model listed in Section 10 of RFC 3530.  However, it diverges from guidelines 11 ("a client and server that supports minor version X must support minor versions 0 through X-1"), and 12 ("no features may be introduced as mandatory in a minor version").  These divergences are due to the introduction of the sessions model for managing non-idempotent operations and the RECLAIM_COMPLETE operation.  These two new features are infrastructural in nature and simplify implementation of existing and other new features.  Making them anything but REQUIRED would add undue complexity to protocol definition and implementation.  NFSv4.1 accordingly updates the Minor Versioning guidelines (Section 2.7).

As a minor version, NFSv4.1 is consistent with the overall goals for NFSv4, but extends the protocol so as to better meet those goals, based on experiences with NFSv4.0.  In addition, NFSv4.1 has adopted some additional goals, which motivate some of the major extensions in NFSv4.1.

1.2.  Scope of this Document

This document describes the NFSv4.1 protocol.  With respect to NFSv4.0, this document does not:

o  describe the NFSv4.0 protocol, except where needed to contrast with NFSv4.1.

o  modify the specification of the NFSv4.0 protocol.

o  clarify the NFSv4.0 protocol.

1.3.  NFSv4 Goals

The NFSv4 protocol is a further revision of the NFS protocol defined already by NFSv3 [30].  It retains the essential characteristics of previous versions: easy recovery; independence of transport protocols, operating systems and file systems; simplicity; and good performance.  NFSv4 has the following goals:

o  Improved access and good performance on the Internet.

   The protocol is designed to transit firewalls easily, perform well where latency is high and bandwidth is low, and scale to very large numbers of clients per server.

o  Strong security with negotiation built into the protocol.

   The protocol builds on the work of the ONCRPC working group in supporting the RPCSEC_GSS protocol.  Additionally, the NFSv4.1 protocol provides a mechanism to allow clients and servers the ability to negotiate security and require clients and servers to support a minimal set of security schemes.

o  Good cross-platform interoperability.

   The protocol features a file system model that provides a useful, common set of features that does not unduly favor one file system or operating system over another.

o  Designed for protocol extensions.

   The protocol is designed to accept standard extensions within a framework that enables and encourages backward compatibility.

1.4.  NFSv4.1 Goals

NFSv4.1 has the following goals, within the framework established by the overall NFSv4 goals.

o  To correct significant structural weaknesses and oversights discovered in the base protocol.

o  To add clarity and specificity to areas left unaddressed or not addressed in sufficient detail in the base protocol.  However, as stated in Section 1.2, it is not a goal to clarify the NFSv4.0 protocol in the NFSv4.1 specification.

o  To add specific features based on experience with the existing protocol and recent industry developments.

o  To provide protocol support to take advantage of clustered server deployments, including the ability to provide scalable parallel access to files distributed among multiple servers.

1.5.  General Definitions

The following definitions are provided to establish an appropriate context for the reader.

Byte  This document defines a byte as an octet, i.e., a datum exactly 8 bits in length.

Client  The "client" is the entity that accesses the NFS server's resources.  The client may be an application that contains the logic to access the NFS server directly.  The client may also be the traditional operating system client that provides remote file system services for a set of applications.

   A client is uniquely identified by a Client Owner.

   With reference to file locking, the client is also the entity that maintains a set of locks on behalf of one or more applications.  This client is responsible for crash or failure recovery for those locks it manages.

   Note that multiple clients may share the same transport and connection and multiple clients may exist on the same network node.

Client ID  A 64-bit quantity used as a unique, short-hand reference to a client-supplied Verifier and client owner.  The server is responsible for supplying the client ID.

Client Owner  The client owner is a unique string, opaque to the server, which identifies a client.  Multiple network connections and source network addresses originating from those connections may share a client owner.  The server is expected to treat requests from connections with the same client owner as coming from the same client.

File System  The collection of objects on a server (as identified by the major identifier of a Server Owner, which is defined later in this section), that share the same fsid attribute (see Section 5.8.1.9).

Lease  An interval of time defined by the server for which the client is irrevocably granted a lock.  At the end of a lease period the lock may be revoked if the lease has not been extended.  The lock must be revoked if a conflicting lock has been granted after the lease interval.

   All leases granted by a server have the same fixed interval.  Note that the fixed interval was chosen to alleviate the expense a server would have in maintaining state about variable-length leases across server failures.

Lock  The term "lock" is used to refer to byte-range (in UNIX environments, also known as record) locks, share reservations, delegations, or layouts unless specifically stated otherwise.

Secret State Verifier (SSV)  The SSV is a unique secret key shared between a client and server.  The SSV serves as the secret key for an internal (that is, internal to NFSv4.1) GSS mechanism (the SSV GSS mechanism, see Section 2.10.9).  The SSV GSS mechanism uses the SSV to compute Message Integrity Code (MIC) and Wrap tokens.  See Section 2.10.8.3 for more details on how NFSv4.1 uses the SSV and the SSV GSS mechanism.

Server  The "Server" is the entity responsible for coordinating client access to a set of file systems and is identified by a Server owner.  A server can span multiple network addresses.

Server Owner  The "Server Owner" identifies the server to the client.  The server owner consists of a major and minor identifier.
When the client has two connections each to a peer with the same major identifier, the client assumes both peers are the same server (the server namespace is the same via each connection), and assumes that lock state is sharable across both connections.  When each peer has both the same major and minor identifier, the client assumes each connection might be associable with the same session.

Stable Storage  Stable storage is storage from which data stored by an NFSv4.1 server can be recovered without data loss from multiple power failures (including cascading power failures, that is, several power failures in quick succession), operating system failures, and/or hardware failure of components other than the storage medium itself (such as disk, nonvolatile RAM, flash memory, etc.).

   Some examples of stable storage that are allowable for an NFS server include:

   1.  Media commit of data, that is, the modified data has been successfully written to the disk media, for example, the disk platter.

   2.  An immediate reply disk drive with battery-backed on-drive intermediate storage or uninterruptible power system (UPS).

   3.  Server commit of data with battery-backed intermediate storage and recovery software.

   4.  Cache commit with uninterruptible power system (UPS) and recovery software.

Stateid  A 128-bit quantity returned by a server that uniquely defines the open and locking state provided by the server for a specific open-owner or lock-owner/open-owner pair for a specific file and type of lock.

Verifier  A 64-bit quantity generated by the client that the server can use to determine if the client has restarted and lost all previous lock state.

1.6.  Overview of NFSv4.1 Features

To provide a reasonable context for the reader, the major features of the NFSv4.1 protocol will be reviewed in brief.  This will be done to provide an appropriate context for both the reader who is familiar with the previous versions of the NFS protocol and the reader that is new to the NFS protocols.  For the reader new to the NFS protocols, there is still a set of fundamental knowledge that is expected.  The reader should be familiar with the XDR and RPC protocols as described in [2] and [3].  A basic knowledge of file systems and distributed file systems is expected as well.

In general, this specification of NFSv4.1 will not distinguish those features added in minor version one from those present in the base protocol but will treat NFSv4.1 as a unified whole.  See Section 1.7 for a summary of the differences between NFSv4.0 and NFSv4.1.

1.6.1.  RPC and Security

As with previous versions of NFS, the External Data Representation (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFSv4.1 protocol are those defined in [2] and [3].  To meet end-to-end security requirements, the RPCSEC_GSS framework [4] is used to extend the basic RPC security.  With the use of RPCSEC_GSS, various mechanisms can be provided to offer authentication, integrity, and privacy to the NFSv4 protocol.  Kerberos V5 is used as described in [5] to provide one security framework.  With the use of RPCSEC_GSS, other mechanisms may also be specified and used for NFSv4.1 security.
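
As a brief, non-normative illustration of these security triples, the sketch below models the Kerberos V5 pseudo flavors and RPCSEC_GSS services tabulated later in Section 2.2.1.1.1.2.1, together with one possible client-side policy for choosing between the integrity and privacy services.  The helper name and the selection policy are illustrative assumptions, not part of the protocol.

   # Illustrative sketch only: the triples mirror the table in
   # Section 2.2.1.1.1.2.1; the selection policy is a hypothetical example.
   from collections import namedtuple

   GssTriple = namedtuple("GssTriple", "pseudo_flavor name oid service")

   KRB5_TRIPLES = [
       GssTriple(390003, "krb5",  "1.2.840.113554.1.2.2", "rpc_gss_svc_none"),
       GssTriple(390004, "krb5i", "1.2.840.113554.1.2.2", "rpc_gss_svc_integrity"),
       GssTriple(390005, "krb5p", "1.2.840.113554.1.2.2", "rpc_gss_svc_privacy"),
   ]

   def pick_krb5_triple(want_privacy):
       # Prefer privacy when asked for; otherwise fall back to integrity,
       # which every NFSv4.1 client and server must support.
       wanted = "rpc_gss_svc_privacy" if want_privacy else "rpc_gss_svc_integrity"
       for triple in KRB5_TRIPLES:
           if triple.service == wanted:
               return triple
       return None

   print(pick_krb5_triple(want_privacy=False).name)   # prints "krb5i"

The in-band negotiation described next is how a client actually learns which mechanisms and services a server will accept for a particular file system resource.
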
To enable in-band security negotiation, the NFSv4.1 protocol has operations that provide the client a method of querying the server about its policies regarding which security mechanisms must be used for access to the server's file system resources.  With this, the client can securely match the security mechanism that meets the policies specified at both the client and server.

NFSv4.1 introduces parallel access (see Section 1.6.2.2), which is called pNFS.  The security framework described in this section is significantly modified by the introduction of pNFS (see Section 12.9), because data access is sometimes not over RPC.  The level of significance varies with the Storage Protocol (see Section 12.2.5) and can be as low as zero impact (see Section 13.12).

1.6.2.  Protocol Structure

1.6.2.1.  Core Protocol

Unlike NFSv3, which used a series of ancillary protocols (e.g., NLM, NSM, MOUNT), within all minor versions of NFSv4 a single RPC protocol is used to make requests to the server.  Facilities that had been separate protocols, such as locking, are now integrated within a single unified protocol.

1.6.2.2.  Parallel Access

Minor version one supports high-performance data access to a clustered server implementation by enabling a separation of metadata access and data access, with the latter done to multiple servers in parallel.

Such parallel data access is controlled by recallable objects known as "layouts", which are integrated into the protocol locking model.  Clients direct requests for data access to a set of data servers specified by the layout via a data storage protocol which may be NFSv4.1 or may be another protocol.

Because the protocols used for parallel data access are not necessarily RPC-based, the RPC-based security model (Section 1.6.1) is obviously impacted (see Section 12.9).  The degree of impact varies with the Storage Protocol (see Section 12.2.5) used for data access, and can be as low as zero (see Section 13.12).

1.6.3.  File System Model

The general file system model used for the NFSv4.1 protocol is the same as previous versions.  The server file system is hierarchical with the regular files contained within being treated as opaque byte streams.  In a slight departure, file and directory names are encoded with UTF-8 to deal with the basics of internationalization.

The NFSv4.1 protocol does not require a separate protocol to provide for the initial mapping between path name and filehandle.  All file systems exported by a server are presented as a tree so that all file systems are reachable from a special per-server global root filehandle.  This allows LOOKUP operations to be used to perform functions previously provided by the MOUNT protocol.  The server provides any necessary pseudo file systems to bridge any unexported gaps between exported file systems.

1.6.3.1.  Filehandles

As in previous versions of the NFS protocol, opaque filehandles are used to identify individual files and directories.  Lookup-type and create operations translate file and directory names to filehandles, which are then used to identify objects in subsequent operations.

The NFSv4.1 protocol provides support for persistent filehandles, guaranteed to be valid for the lifetime of the file system object designated.
In addition, it allows servers to provide filehandles with more limited validity guarantees, called volatile filehandles.

1.6.3.2.  File Attributes

The NFSv4.1 protocol has a rich and extensible file object attribute structure, which is divided into REQUIRED, RECOMMENDED, and named attributes (see Section 5).

Several (but not all) of the REQUIRED attributes are derived from the attributes of NFSv3 (see the definition of the fattr3 data type in [30]).  An example of a REQUIRED attribute is the file object's type (Section 5.8.1.2) so that regular files can be distinguished from directories (also known as folders in some operating environments) and other types of objects.  REQUIRED attributes are discussed in Section 5.1.

Three examples of RECOMMENDED attributes are acl, sacl, and dacl.  These attributes define an Access Control List (ACL) on a file object (Section 6).  An ACL provides directory and file access control beyond the model used in NFSv3.  The ACL definition allows for specification of specific sets of permissions for individual users and groups.  In addition, ACL inheritance allows propagation of access permissions and restriction down a directory tree as file system objects are created.  RECOMMENDED attributes are discussed in Section 5.2.

A named attribute is an opaque byte stream that is associated with a directory or file and referred to by a string name.  Named attributes are meant to be used by client applications as a method to associate application-specific data with a regular file or directory.  NFSv4.1 modifies named attributes relative to NFSv4.0 by tightening the allowed operations in order to prevent the development of non-interoperable implementations.  Named attributes are discussed in Section 5.3.

1.6.3.3.  Multi-server Namespace

NFSv4.1 contains a number of features to allow implementation of namespaces that cross server boundaries and that allow and facilitate a non-disruptive transfer of support for individual file systems between servers.  They are all based upon attributes that allow one file system to specify alternate or new locations for that file system.

These attributes may be used together with the concept of absent file systems, which provide specifications for additional locations but no actual file system content.  This allows a number of important facilities:

o  Location attributes may be used with absent file systems to implement referrals whereby one server may direct the client to a file system provided by another server.  This allows extensive multi-server namespaces to be constructed.

o  Location attributes may be provided for present file systems to provide the locations of alternate file system instances or replicas to be used in the event that the current file system instance becomes unavailable.

o  Location attributes may be provided when a previously present file system becomes absent.  This allows non-disruptive migration of file systems to alternate servers.

1.6.4.  Locking Facilities

As mentioned previously, NFSv4.1 is a single protocol that includes locking facilities.  These locking facilities include support for many types of locks, including a number of sorts of recallable locks.  Recallable locks such as delegations allow the client to be assured that certain events will not occur so long as that lock is held.
When circumstances change, the lock is recalled via a callback request.  The assurances provided by delegations allow more extensive caching to be done safely when circumstances allow it.

The types of locks are:

o  Share reservations as established by OPEN operations.

o  Byte-range locks.

o  File delegations, which are recallable locks that assure the holder that inconsistent opens and file changes cannot occur so long as the delegation is held.

o  Directory delegations, which are recallable locks that assure the holder that inconsistent directory modifications cannot occur so long as the delegation is held.

o  Layouts, which are recallable objects that assure the holder that direct access to the file data may be performed directly by the client and that no change to the data's location inconsistent with that access may be made so long as the layout is held.

All locks for a given client are tied together under a single client-wide lease.  All requests made on sessions associated with the client renew that lease.  When leases are not promptly renewed, locks are subject to revocation.  In the event of server restart, clients have the opportunity to safely reclaim their locks within a special grace period.

1.7.  Differences from NFSv4.0

The following summarizes the major differences between minor version one and the base protocol:

o  Implementation of the sessions model (Section 2.10).

o  Parallel access to data (Section 12).

o  Addition of the RECLAIM_COMPLETE operation to better structure the lock reclamation process (Section 18.51).

o  Enhanced delegation support as follows.

   *  Delegations on directories and other file types in addition to regular files (Section 18.39, Section 18.49).

   *  Operations to optimize acquisition of recalled or denied delegations (Section 18.49, Section 20.5, Section 20.7).

   *  Notifications of changes to files and directories (Section 18.39, Section 20.4).

   *  A method to allow a server to indicate it is recalling one or more delegations for resource management reasons, and thus a method to allow the client to pick which delegations to return (Section 20.6).

o  Attributes can be set atomically during exclusive file create via the OPEN operation (see the new EXCLUSIVE4_1 creation method in Section 18.16).

o  Open files can be preserved if removed and the hard link count ("hard link" is defined in an Open Group [6] standard) goes to zero, thus obviating the need for clients to rename deleted files to partially hidden names -- colloquially called "silly rename" (see the new OPEN4_RESULT_PRESERVE_UNLINKED reply flag in Section 18.16).

o  Improved compatibility with Microsoft Windows for Access Control Lists (Section 6.2.3, Section 6.2.2, Section 6.4.3.2).

o  Data retention (Section 5.13).

o  Identification of the implementation of the NFS client and server (Section 18.35).

o  Support for notification of the availability of byte-range locks (see the new OPEN4_RESULT_MAY_NOTIFY_LOCK reply flag in Section 18.16 and see Section 20.11).

o  In NFSv4.1, LIPKEY and SPKM-3 are not required security mechanisms [31].

2.  Core Infrastructure

2.1.  Introduction

NFSv4.1 relies on core infrastructure common to nearly every operation.  This core infrastructure is described in the remainder of this section.

2.2.  RPC and XDR

The NFSv4.1 protocol is a Remote Procedure Call (RPC) application that uses RPC version 2 and the corresponding eXternal Data Representation (XDR) as defined in [3] and [2].

2.2.1.  RPC-based Security

Previous NFS versions have been thought of as having a host-based authentication model, where the NFS server authenticates the NFS client, and trusts the client to authenticate all users.  Actually, NFS has always depended on RPC for authentication.  One of the first forms of RPC authentication, AUTH_SYS, had no strong authentication, and required a host-based authentication approach.  NFSv4.1 also depends on RPC for basic security services, and mandates RPC support for a user-based authentication model.  The user-based authentication model has user principals authenticated by a server, and in turn the server authenticated by user principals.  RPC provides some basic security services which are used by NFSv4.1.

2.2.1.1.  RPC Security Flavors

As described in section 7.2 "Authentication" of [3], RPC security is encapsulated in the RPC header, via a security or authentication flavor, and information specific to the specified security flavor.  Every RPC header conveys information used to identify and authenticate a client and server.  As discussed in Section 2.2.1.1.1, some security flavors provide additional security services.

NFSv4.1 clients and servers MUST implement RPCSEC_GSS.  (This requirement to implement is not a requirement to use.)  Other flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well.

2.2.1.1.1.  RPCSEC_GSS and Security Services

RPCSEC_GSS ([4]) uses the functionality of GSS-API [7].  This allows for the use of various security mechanisms by the RPC layer without the additional implementation overhead of adding RPC security flavors.

2.2.1.1.1.1.  Identification, Authentication, Integrity, Privacy

Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate users on clients to servers, and servers to users.  It can also perform integrity checking on the entire RPC message, including the RPC header, and the arguments or results.  Finally, privacy, usually via encryption, is a service available with RPCSEC_GSS.  Privacy is performed on the arguments and results.  Note that if privacy is selected, integrity, authentication, and identification are enabled.  If privacy is not selected, but integrity is selected, authentication and identification are enabled.  If integrity and privacy are not selected, but authentication is enabled, identification is enabled.  RPCSEC_GSS does not provide identification as a separate service.

Although GSS-API has an authentication service distinct from its privacy and integrity services, GSS-API's authentication service is not used for RPCSEC_GSS's authentication service.  Instead, each RPC request and response header is integrity protected with the GSS-API integrity service, and this allows RPCSEC_GSS to offer per-RPC authentication and identity.  See [4] for more information.

NFSv4.1 clients and servers MUST support RPCSEC_GSS's integrity and authentication service.  NFSv4.1 servers MUST support RPCSEC_GSS's privacy service.  NFSv4.1 clients SHOULD support RPCSEC_GSS's privacy service.

Security mechanisms for NFSv4.1 1011 RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide 1012 security services. Therefore NFSv4.1 clients and servers MUST 1013 support the Kerberos V5 security mechanism. 1015 The use of RPCSEC_GSS requires selection of: mechanism, quality of 1016 protection (QOP), and service (authentication, integrity, privacy). 1017 For the mandated security mechanisms, NFSv4.1 specifies that a QOP of 1018 zero (0) is used, leaving it up to the mechanism or the mechanism's 1019 configuration to map QOP zero to an appropriate level of protection. 1020 Each mandated mechanism specifies a minimum set of cryptographic 1021 algorithms for implementing integrity and privacy. NFSv4.1 clients 1022 and servers MUST be implemented on operating environments that comply 1023 with the REQUIRED cryptographic algorithms of each REQUIRED 1024 mechanism. 1026 2.2.1.1.1.2.1. Kerberos V5 1028 The Kerberos V5 GSS-API mechanism as described in [5] MUST be 1029 implemented with the RPCSEC_GSS services as specified in the 1030 following table: 1032 column descriptions: 1033 1 == number of pseudo flavor 1034 2 == name of pseudo flavor 1035 3 == mechanism's OID 1036 4 == RPCSEC_GSS service 1037 5 == NFSv4.1 clients MUST support 1038 6 == NFSv4.1 servers MUST support 1040 1 2 3 4 5 6 1041 ------------------------------------------------------------------ 1042 390003 krb5 1.2.840.113554.1.2.2 rpc_gss_svc_none yes yes 1043 390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes 1044 390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy no yes 1046 Note that the number and name of the pseudo flavor is presented here 1047 as a mapping aid to the implementor. Because the NFSv4.1 protocol 1048 includes a method to negotiate security and it understands the GSS- 1049 API mechanism, the pseudo flavor is not needed. The pseudo flavor is 1050 needed for NFSv3 since the security negotiation is done via the 1051 MOUNT protocol as described in [32]. 1053 At the time NFSv4.1 was specified, AES with HMAC-SHA1 was a REQUIRED 1054 algorithm set for Kerberos V5. In contrast, when NFSv4.0 was 1055 specified, weaker algorithm sets were REQUIRED for Kerberos V5, and 1056 were REQUIRED in the NFSv4.0 specification, because the Kerberos V5 1057 specification at the time did not specify stronger algorithms. The 1058 NFSv4.1 specification does not specify REQUIRED algorithms for 1059 Kerberos V5, and instead, the implementor is expected to track the 1060 evolution of the Kerberos V5 standard if and when stronger algorithms 1061 are specified. 1063 2.2.1.1.1.2.1.1. Security Considerations for Cryptographic Algorithms 1064 in Kerberos V5 1066 When deploying NFSv4.1, the strength of the security achieved depends 1067 on the existing Kerberos V5 infrastructure. The algorithms of 1068 Kerberos V5 are not directly exposed to or selectable by the client 1069 or server, so there is some due diligence required by the user of 1070 NFSv4.1 to ensure that security is acceptable where needed. 1072 2.2.1.1.1.3. GSS Server Principal 1074 Regardless of what security mechanism under RPCSEC_GSS is being used, 1075 the NFS server MUST identify itself in GSS-API via a 1076 GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE 1077 names are of the form: 1079 service@hostname 1081 For NFS, the "service" element is 1083 nfs 1085 Implementations of security mechanisms will convert nfs@hostname to 1086 various different forms.
For Kerberos V5 the following form is 1087 RECOMMENDED: 1089 nfs/hostname 1091 2.3. COMPOUND and CB_COMPOUND 1093 A significant departure from the versions of the NFS protocol before 1094 NFSv4 is the introduction of the COMPOUND procedure. For the NFSv4 1095 protocol, in all minor versions, there are exactly two RPC 1096 procedures, NULL and COMPOUND. The COMPOUND procedure is defined as 1097 a series of individual operations and these operations perform the 1098 sorts of functions performed by traditional NFS procedures. 1100 The operations combined within a COMPOUND request are evaluated in 1101 order by the server, without any atomicity guarantees. A limited set 1102 of facilities exist to pass results from one operation to another. 1103 Once an operation returns a failing result, the evaluation ends and 1104 the results of all evaluated operations are returned to the client. 1106 With the use of the COMPOUND procedure, the client is able to build 1107 simple or complex requests. These COMPOUND requests allow for a 1108 reduction in the number of RPCs needed for logical file system 1109 operations. For example, multi-component lookup requests can be 1110 constructed by combining multiple LOOKUP operations. Those can be 1111 further combined with operations such as GETATTR, READDIR, or OPEN 1112 plus READ to do more complicated sets of operation without incurring 1113 additional latency. 1115 NFSv4.1 also contains a considerable set of callback operations in 1116 which the server makes an RPC directed at the client. Callback RPCs 1117 have a similar structure to that of the normal server requests. In 1118 all minor versions of the NFSv4 protocol there are two callback RPC 1119 procedures, CB_NULL and CB_COMPOUND. The CB_COMPOUND procedure is 1120 defined in an analogous fashion to that of COMPOUND with its own set 1121 of callback operations. 1123 The addition of new server and callback operations within the 1124 COMPOUND and CB_COMPOUND request framework provides a means of 1125 extending the protocol in subsequent minor versions. 1127 Except for a small number of operations needed for session creation, 1128 server requests and callback requests are performed within the 1129 context of a session. Sessions provide a client context for every 1130 request and support robust reply protection for non-idempotent 1131 requests. 1133 2.4. Client Identifiers and Client Owners 1135 For each operation that obtains or depends on locking state, the 1136 specific client needs to be identifiable by the server. 1138 Each distinct client instance is represented by a client ID. A 1139 client ID is a 64-bit identifier representing a specific client at a 1140 given time. The client ID is changed whenever the client re- 1141 initializes, and may change when the server re-initializes. Client 1142 IDs are used to support lock identification and crash recovery. 1144 During steady state operation, the client ID associated with each 1145 operation is derived from the session (see Section 2.10) on which the 1146 operation is sent. A session is associated with a client ID when the 1147 session is created. 1149 Unlike NFSv4.0, the only NFSv4.1 operations possible before a client 1150 ID is established are those needed to establish the client ID. 1152 A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION 1153 operation using that client ID (eir_clientid as returned from 1154 EXCHANGE_ID) is required to establish and confirm the client ID on 1155 the server. 
Establishment of identification by a new incarnation of 1156 the client also has the effect of immediately releasing any locking 1157 state that a previous incarnation of that same client might have had 1158 on the server. Such released state would include all lock, share 1159 reservation, and layout state, and where the server is not supporting the 1160 CLAIM_DELEGATE_PREV claim type, all delegation state associated with 1161 the same client with the same identity. For discussion of delegation 1162 state recovery, see Section 10.2.1. For discussion of layout state 1163 recovery see Section 12.7.1. 1165 Releasing such state requires that the server be able to determine 1166 that one client instance is the successor of another. Where this 1167 cannot be done, for any of a number of reasons, the locking state 1168 will remain for a time subject to lease expiration (see Section 8.3) 1169 and the new client will need to wait for such state to be removed, if 1170 it makes conflicting lock requests. 1172 Client identification is encapsulated in the following Client Owner 1173 data type: 1175 struct client_owner4 { 1176 verifier4 co_verifier; 1177 opaque co_ownerid<NFS4_OPAQUE_LIMIT>; 1178 }; 1180 The first field, co_verifier, is a client incarnation verifier. The 1181 server will start the process of canceling the client's leased state 1182 if co_verifier is different than what the server has previously 1183 recorded for the identified client (as specified in the co_ownerid 1184 field). 1186 The second field, co_ownerid, is a variable length string that 1187 uniquely defines the client so that subsequent instances of the same 1188 client bear the same co_ownerid with a different verifier. 1190 There are several considerations for how the client generates the 1191 co_ownerid string: 1193 o The string should be unique so that multiple clients do not 1194 present the same string. The consequences of two clients 1195 presenting the same string range from one client getting an error 1196 to one client having its leased state abruptly and unexpectedly 1197 canceled. 1199 o The string should be selected so that subsequent incarnations 1200 (e.g. restarts) of the same client cause the client to present the 1201 same string. The implementor is cautioned against an approach that 1202 requires the string to be recorded in a local file because this 1203 precludes the use of the implementation in an environment where 1204 there is no local disk and all file access is from an NFSv4.1 1205 server. 1207 o The string should be the same for each server network address that 1208 the client accesses. This way, if a server has multiple 1209 interfaces, the client can trunk traffic over multiple network 1210 paths as described in Section 2.10.5. (Note: the precise opposite 1211 was advised in the NFSv4.0 specification [29].) 1213 o The algorithm for generating the string should not assume that the 1214 client's network address will not change, unless the client 1215 implementation knows it is using statically assigned network 1216 addresses. This includes changes between client incarnations and 1217 even changes while the client is still running in its current 1218 incarnation. Thus with dynamic address assignment, if the client 1219 includes just the client's network address in the co_ownerid 1220 string, there is a real risk that after the client gives up the 1221 network address, another client, using a similar algorithm for 1222 generating the co_ownerid string, would generate a conflicting 1223 co_ownerid string.
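As a non-normative illustration of these considerations, the following sketch (written in Python purely for exposition; the function name, the choice of inputs, and the hashing step are assumptions of the sketch, not protocol requirements) shows one way a client might assemble a co_ownerid string:

      import hashlib
      import os
      import uuid

      def make_co_ownerid(static_addr=None, user_level=False):
          # Combine stable, client-local inputs so that the string is
          # unique per client, stable across client restarts, and the
          # same for every server network address the client uses.
          parts = ["nfsv4.1-client"]
          if static_addr is not None:
              # Include a network address only if it is statically
              # assigned to this client.
              parts.append(static_addr)
          # uuid.getnode() is derived from a MAC address; hash it rather
          # than exposing the raw hardware address.
          parts.append(hashlib.sha256(
              str(uuid.getnode()).encode()).hexdigest())
          if user_level:
              # A user-level client adds a per-instance discriminator
              # (e.g. a process identifier, as suggested below).
              parts.append("pid=%d" % os.getpid())
          return "/".join(parts).encode()

The resulting string, which is opaque to the server, would be presented unchanged in every EXCHANGE_ID the client sends, while the co_verifier distinguishes successive incarnations.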
1225 Given the above considerations, an example of a well generated 1226 co_ownerid string is one that includes: 1228 o If applicable, the client's statically assigned network address. 1230 o Additional information that tends to be unique, such as one or 1231 more of: 1233 * The client machine's serial number (for privacy reasons, it is 1234 best to perform some one way function on the serial number). 1236 * A MAC address (again, a one way function should be performed). 1238 * The timestamp of when the NFSv4.1 software was first installed 1239 on the client (though this is subject to the previously 1240 mentioned caution about using information that is stored in a 1241 file, because the file might only be accessible over NFSv4.1). 1243 * A true random number. However since this number ought to be 1244 the same between client incarnations, this shares the same 1245 problem as that of using the timestamp of the software 1246 installation. 1248 o For a user level NFSv4.1 client, it should contain additional 1249 information to distinguish the client from other user level 1250 clients running on the same host, such as a process identifier or 1251 other unique sequence. 1253 The client ID is assigned by the server (the eir_clientid result from 1254 EXCHANGE_ID) and should be chosen so that it will not conflict with a 1255 client ID previously assigned by the server. This applies across 1256 server restarts. 1258 In the event of a server restart, a client may find out that its 1259 current client ID is no longer valid when it receives an 1260 NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on 1261 the characteristics of the sessions involved, specifically whether 1262 the session is persistent (see Section 2.10.6.5), but in each case 1263 the client will receive this error when it attempts to establish a 1264 new session with the existing client ID and receives the error 1265 NFS4ERR_STALE_CLIENTID, indicating that a new client ID needs to be 1266 obtained via EXCHANGE_ID and the new session established with that 1267 client ID. 1269 When a session is not persistent, the client will find out that it 1270 needs to create a new session as a result of getting an 1271 NFS4ERR_BADSESSION, since the session in question was lost as part of 1272 a server restart. When the existing client ID is presented to a 1273 server as part of creating a session and that client ID is not 1274 recognized, as would happen after a server restart, the server will 1275 reject the request with the error NFS4ERR_STALE_CLIENTID. 1277 In the case of the session being persistent, the client will re- 1278 establish communication using the existing session after the restart. 1279 This session will be associated with the existing client ID but may 1280 only be used to retransmit operations that the client previously 1281 transmitted and did not see replies to. Replies to operations that 1282 the server previously performed will come from the reply cache, 1283 otherwise NFS4ERR_DEADSESSION will be returned. Hence, such a 1284 session is referred to as "dead". In this situation, in order to 1285 perform new operations, the client needs to establish a new session. 1286 If an attempt is made to establish this new session with the existing 1287 client ID, the server will reject the request with 1288 NFS4ERR_STALE_CLIENTID. 
1290 When NFS4ERR_STALE_CLIENTID is received in either of these 1291 situations, the client needs to obtain a new client ID by use of the 1292 EXCHANGE_ID operation, then use that client ID as the basis of a new 1293 session, and then proceed to any other necessary recovery for the 1294 server restart case (See Section 8.4.2). 1296 See the descriptions of EXCHANGE_ID (Section 18.35) and 1297 CREATE_SESSION (Section 18.36) for a complete specification of these 1298 operations. 1300 2.4.1. Upgrade from NFSv4.0 to NFSv4.1 1302 To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a 1303 client_owner4 in an EXCHANGE_ID with an nfs_client_id4 established 1304 using the SETCLIENTID operation of NFSv4.0. A server that does so 1305 will allow an upgraded client to avoid waiting until the lease (i.e. 1306 the lease established by the NFSv4.0 instance client) expires. This 1307 requires the client_owner4 be constructed the same way as the 1308 nfs_client_id4. If the latter's contents included the server's 1309 network address (per the recommendations of the NFSv4.0 specification 1310 [29]), and the NFSv4.1 client does not wish to use a client ID that 1311 prevents trunking, it should send two EXCHANGE_ID operations. The 1312 first EXCHANGE_ID will have a client_owner4 equal to the 1313 nfs_client_id4. This will clear the state created by the NFSv4.0 1314 client. The second EXCHANGE_ID will not have the server's network 1315 address. The state created for the second EXCHANGE_ID will not have 1316 to wait for lease expiration, because there will be no state to 1317 expire. 1319 2.4.2. Server Release of Client ID 1321 NFSv4.1 introduces a new operation called DESTROY_CLIENTID 1322 (Section 18.50) which the client SHOULD use to destroy a client ID it 1323 no longer needs. This permits graceful, bilateral release of a 1324 client ID. The operation cannot be used if there are sessions 1325 associated with the client ID, or state with an unexpired lease. 1327 If the server determines that the client holds no associated state 1328 for its client ID (including sessions, opens, locks, delegations, 1329 layouts, and wants), the server may choose to unilaterally release 1330 the client ID in order to conserve resources. If the client contacts 1331 the server after this release, the server MUST ensure the client 1332 receives the appropriate error so that it will use the EXCHANGE_ID/ 1333 CREATE_SESSION sequence to establish a new client ID. The server 1334 ought to be very hesitant to release a client ID since the resulting 1335 work on the client to recover from such an event will be the same 1336 burden as if the server had failed and restarted. Typically a server 1337 would not release a client ID unless there had been no activity from 1338 that client for many minutes. As long as there are sessions, opens, 1339 locks, delegations, layouts, or wants, the server MUST NOT release 1340 the client ID. See Section 2.10.12.1.4 for discussion on releasing 1341 inactive sessions. 1343 2.4.3. Resolving Client Owner Conflicts 1345 When the server gets an EXCHANGE_ID for a client owner that currently 1346 has no state, or that has state, but the lease has expired, the 1347 server MUST allow the EXCHANGE_ID, and confirm the new client ID if 1348 followed by the appropriate CREATE_SESSION. 
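The following non-normative sketch (Python; the in-memory table, record fields, and helper names are assumptions of the sketch) illustrates this acceptance path for a client owner with no state, or with only expired-lease state; handling of an unexpired lease held by another incarnation is governed by the rules that follow:

      import secrets

      # clients: co_ownerid -> record for the currently known incarnation.
      clients = {}

      def exchange_id(co_ownerid, co_verifier):
          rec = clients.get(co_ownerid)
          if rec is None or rec["lease_expired"]:
              # No state, or state whose lease has expired: the server
              # MUST allow the EXCHANGE_ID and hand out a fresh,
              # as-yet-unconfirmed client ID.
              rec = {"clientid": secrets.randbits(64),
                     "verifier": co_verifier,
                     "confirmed": False,
                     "lease_expired": False}
              clients[co_ownerid] = rec
          # (The conflict cases for an unexpired lease are omitted here;
          # see the rules below.)
          return rec["clientid"]                  # eir_clientid

      def create_session(clientid):
          for rec in clients.values():
              if rec["clientid"] == clientid:
                  rec["confirmed"] = True         # CREATE_SESSION confirms it
                  return rec
          raise RuntimeError("NFS4ERR_STALE_CLIENTID")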
1350 When the server gets an EXCHANGE_ID for a new incarnation of a client 1351 owner that currently has an old incarnation with state and an 1352 unexpired lease, the server is allowed to dispose of the state of the 1353 previous incarnation of the client owner if one of the following is 1354 true: 1356 o The principal that created the client ID for the client owner is 1357 the same as the principal that is issuing the EXCHANGE_ID. Note 1358 that if the client ID was created with SP4_MACH_CRED state 1359 protection (Section 18.35), the principal MUST be based on 1360 RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be 1361 integrity or privacy, and the same GSS mechanism and principal 1362 MUST be used as that used when the client ID was created. 1364 o The client ID was established with SP4_SSV protection 1365 (Section 18.35, Section 2.10.8.3) and the client sends the 1366 EXCHANGE_ID with the security flavor set to RPCSEC_GSS using the 1367 GSS SSV mechanism (Section 2.10.9). 1369 o The client ID was established with SP4_SSV protection, and under 1370 the conditions described herein, the EXCHANGE_ID was sent with 1371 SP4_MACH_CRED state protection. Because the SSV might not persist 1372 across client and server restart, and because the first time a 1373 client sends EXCHANGE_ID to a server it does not have an SSV, the 1374 client MAY send the subsequent EXCHANGE_ID without an SSV 1375 RPCSEC_GSS handle. Instead, as with SP4_MACH_CRED protection, the 1376 principal MUST be based on RPCSEC_GSS authentication, the 1377 RPCSEC_GSS service used MUST be integrity or privacy, and the same 1378 GSS mechanism and principal MUST be used as that used when the 1379 client ID was created. 1381 If none of the above situations apply, the server MUST return 1382 NFS4ERR_CLID_INUSE. 1384 If the server accepts the principal and co_ownerid as matching that 1385 which created the client ID, and the co_verifier in the EXCHANGE_ID 1386 differs from the co_verifier used when the client ID was created, 1387 then after the server receives a CREATE_SESSION that confirms the 1388 client ID, the server deletes state. If the co_verifier values are 1389 the same (e.g. the client is either updating properties of the 1390 client ID (Section 18.35), or the client is attempting trunking 1391 (Section 2.10.5)), the server MUST NOT delete state. 1393 2.5. Server Owners 1395 The Server Owner is similar to a Client Owner (Section 2.4), but 1396 unlike the Client Owner, there is no shorthand server ID. The Server 1397 Owner is defined in the following data type: 1399 struct server_owner4 { 1400 uint64_t so_minor_id; 1401 opaque so_major_id<NFS4_OPAQUE_LIMIT>; 1402 }; 1404 The Server Owner is returned from EXCHANGE_ID. When the so_major_id 1405 fields are the same in two EXCHANGE_ID results, the connections each 1406 EXCHANGE_ID was sent over can be assumed to address the same Server 1407 (as defined in Section 1.5). If the so_minor_id fields are also the 1408 same, then not only do both connections connect to the same server, 1409 but the session can be shared across both connections. The reader is 1410 cautioned that multiple servers may deliberately or accidentally 1411 claim to have the same so_major_id or so_major_id/so_minor_id; the 1412 reader should examine Section 2.10.5 and Section 18.35 in order to 1413 avoid acting on falsely matching Server Owner values. 1415 The considerations for generating a so_major_id are similar to those 1416 for generating a co_ownerid string (see Section 2.4).
The 1417 consequences of two servers generating conflicting so_major_id values 1418 are less dire than they are for co_ownerid conflicts because the 1419 client can use RPCSEC_GSS to compare the authenticity of each server 1420 (see Section 2.10.5). 1422 2.6. Security Service Negotiation 1424 With the NFSv4.1 server potentially offering multiple security 1425 mechanisms, the client needs a method to determine or negotiate which 1426 mechanism is to be used for its communication with the server. The 1427 NFS server may have multiple points within its file system namespace 1428 that are available for use by NFS clients. These points can be 1429 considered security policy boundaries, and in some NFS 1430 implementations are tied to NFS export points. In turn, the NFS 1431 server may be configured such that each of these security policy 1432 boundaries may have different or multiple security mechanisms in use. 1434 The security negotiation between client and server SHOULD be done 1435 with a secure channel to eliminate the possibility of a third party 1436 intercepting the negotiation sequence and forcing the client and 1437 server to choose a lower level of security than required or desired. 1439 See Section 21 for further discussion. 1441 2.6.1. NFSv4.1 Security Tuples 1443 An NFS server can assign one or more "security tuples" to each 1444 security policy boundary in its namespace. Each security tuple 1445 consists of a security flavor (see Section 2.2.1.1), and if the 1446 flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of 1447 protection, and an RPCSEC_GSS service. 1449 2.6.2. SECINFO and SECINFO_NO_NAME 1451 The SECINFO and SECINFO_NO_NAME operations allow the client to 1452 determine, on a per-filehandle basis, what security tuple is to be 1453 used for server access. In general, the client will not have to use 1454 either operation except during initial communication with the server 1455 or when the client crosses security policy boundaries at the server. 1456 However, the server's policies may also change at any time and force 1457 the client to negotiate a new security tuple. 1459 Where the use of different security tuples would affect the type of 1460 access that would be allowed if a request was sent over the same 1461 connection used for the SECINFO or SECINFO_NO_NAME operation (e.g. 1462 read-only vs. read-write access), security tuples that allow greater 1463 access should be presented first. Where the general level of access 1464 is the same and different security flavors limit the range of 1465 principals whose privileges are recognized (e.g. allowing or 1466 disallowing root access), flavors supporting the greatest range of 1467 principals should be listed first. 1469 2.6.3. Security Error 1471 Based on the assumption that each NFSv4.1 client and server MUST 1472 support a minimum set of security (i.e., Kerberos V5 under 1473 RPCSEC_GSS), the NFS client will initiate file access to the server 1474 with one of the minimal security tuples. During communication with 1475 the server, the client may receive an NFS error of NFS4ERR_WRONGSEC. 1476 This error allows the server to notify the client that the security 1477 tuple currently being used contravenes the server's security policy. 1478 The client is then responsible for determining (see Section 2.6.3.1) 1479 what security tuples are available at the server and choosing one 1480 which is appropriate for the client. 1482 2.6.3.1.
Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME 1484 This section explains the mechanics of NFSv4.1 security 1485 negotiation. 1487 2.6.3.1.1. Put Filehandle Operations 1489 The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH, 1490 PUTFH, and RESTOREFH. Each of the subsections herein describes how 1491 the server handles a subseries of operations that starts with a put 1492 filehandle operation. 1494 2.6.3.1.1.1. Put Filehandle Operation + SAVEFH 1496 The client is saving a filehandle for a future RESTOREFH, LINK, or 1497 RENAME. SAVEFH MUST NOT return NFS4ERR_WRONGSEC. To determine 1498 whether the put filehandle operation returns NFS4ERR_WRONGSEC or not, 1499 the server implementation pretends SAVEFH is not in the series of 1500 operations and examines which of the situations described in the 1501 other subsections of Section 2.6.3.1.1 apply. 1503 2.6.3.1.1.2. Two or More Put Filehandle Operations 1505 For a series of N put filehandle operations, the server MUST NOT 1506 return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations. 1507 The N'th put filehandle operation is handled as if it is the first in 1508 a subseries of operations. For example if the server received PUTFH, 1509 PUTROOTFH, LOOKUP, then the PUTFH is ignored for NFS4ERR_WRONGSEC 1510 purposes, and the PUTROOTFH, LOOKUP subseries is processed 1511 according to Section 2.6.3.1.1.3. 1513 2.6.3.1.1.3. Put Filehandle Operation + LOOKUP (or OPEN of an Existing 1514 Name) 1516 This situation also applies to a put filehandle operation followed by 1517 a LOOKUP or an OPEN operation that specifies an existing component 1518 name. 1520 In this situation, the client is potentially crossing a security 1521 policy boundary, and the set of security tuples the parent directory 1522 supports may differ from those of the child. The server 1523 implementation may decide whether to impose any restrictions on 1524 security policy administration. There are at least three approaches 1525 (sec_policy_child is the tuple set of the child export, 1526 sec_policy_parent is that of the parent). 1528 a) sec_policy_child <= sec_policy_parent (<= for subset). This 1529 means that the set of security tuples specified on the security 1530 policy of a child directory is always a subset of that of its 1531 parent directory. 1533 b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection, 1534 {} for the empty set). This means that the security tuples 1535 specified on the security policy of a child directory always have a 1536 non-empty intersection with that of the parent. 1538 c) sec_policy_child ^ sec_policy_parent == {}. This means that 1539 the set of tuples specified on the security policy of a child 1540 directory may not intersect with that of the parent. In other 1541 words, there are no restrictions on how the system administrator 1542 may set up these tuples. 1544 In order for a server to support approaches (b) (for the case when a 1545 client chooses a flavor that is not a member of sec_policy_parent) 1546 and (c), the put filehandle operation cannot return NFS4ERR_WRONGSEC 1547 when there is a security tuple mismatch. Instead, it should be 1548 returned from the LOOKUP (or OPEN by existing component name) that 1549 follows. 1551 Since the above guideline does not contradict approach (a), it should 1552 be followed in general.
Even if approach (a) is implemented, it is 1553 possible for the security tuple used to be acceptable for the target 1554 of LOOKUP but not for the filehandles used in the put filehandle 1555 operation. The put filehandle operation could be a PUTROOTFH or 1556 PUTPUBFH, where the client cannot know the security tuples for the 1557 root or public filehandle. Or the security policy for the filehandle 1558 used by the put filehandle operation could have changed since the 1559 time the filehandle was obtained. 1561 Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in 1562 response to the put filehandle operation if the operation is 1563 immediately followed by a LOOKUP or an OPEN by component name. 1565 2.6.3.1.1.4. Put Filehandle Operation + LOOKUPP 1567 Since SECINFO only works its way down, there is no way LOOKUPP can 1568 return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME 1569 solves this issue via style SECINFO_STYLE4_PARENT, which works in the 1570 opposite direction from SECINFO. As with Section 2.6.3.1.1.3, a put 1571 filehandle operation that is followed by a LOOKUPP MUST NOT return 1572 NFS4ERR_WRONGSEC. If the server does not support SECINFO_NO_NAME, 1573 the client's only recourse is to send the put filehandle operation, 1574 LOOKUPP, GETFH sequence of operations with every security tuple it 1575 supports. 1577 Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server 1578 MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle 1579 operation if the operation is immediately followed by a LOOKUPP. 1581 2.6.3.1.1.5. Put Filehandle Operation + SECINFO/SECINFO_NO_NAME 1583 A security-sensitive client is allowed to choose a strong security 1584 tuple when querying a server to determine a file object's permitted 1585 security tuples. The security tuple chosen by the client does not 1586 have to be included in the tuple list of the security policy of 1587 either the parent directory indicated in the put filehandle operation, or 1588 the child file object indicated in SECINFO (or any parent directory 1589 indicated in SECINFO_NO_NAME). Of course the server has to be 1590 configured for whatever security tuple the client selects, otherwise 1591 the request will fail at the RPC layer with an appropriate authentication 1592 error. 1594 In theory, there is no connection between the security flavor used by 1595 SECINFO or SECINFO_NO_NAME and those supported by the security 1596 policy. But in practice, the client may start looking for strong 1597 flavors from those supported by the security policy, followed by 1598 those in the REQUIRED set. 1600 The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put 1601 filehandle operation that is immediately followed by SECINFO or 1602 SECINFO_NO_NAME. The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC 1603 from SECINFO or SECINFO_NO_NAME. 1605 2.6.3.1.1.6. Put Filehandle Operation + Nothing 1607 The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC. 1609 2.6.3.1.1.7. Put Filehandle Operation + Anything Else 1611 "Anything Else" includes OPEN by filehandle. 1613 The security policy enforcement applies to the filehandle specified 1614 in the put filehandle operation. Therefore the put filehandle 1615 operation MUST return NFS4ERR_WRONGSEC when there is a security tuple 1616 mismatch. This avoids the complexity of adding NFS4ERR_WRONGSEC as an 1617 allowable error to every other operation.
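The checks of Section 2.6.3.1.1.1 through Section 2.6.3.1.1.7 can be summarized by the following non-normative decision sketch (Python; the operation names are plain strings, and the tuple_ok() helper, which stands for the server's policy check on the filehandle supplied by the put filehandle operation, is an assumption of the sketch):

      # Decide whether a put filehandle operation itself fails with
      # NFS4ERR_WRONGSEC or defers the check to the operation that follows.
      DEFER_TO_NEXT_OP = {"LOOKUP", "LOOKUPP", "OPEN_BY_NAME",
                          "SECINFO", "SECINFO_NO_NAME"}
      PUTFH_OPS = {"PUTFH", "PUTROOTFH", "PUTPUBFH", "RESTOREFH"}

      def putfh_returns_wrongsec(ops, i, tuple_ok):
          """ops[i] is a put filehandle operation; tuple_ok() checks the
          request's security tuple against the policy of the filehandle
          supplied by ops[i]."""
          nxt = i + 1
          # Pretend an intervening SAVEFH is not there (2.6.3.1.1.1).
          while nxt < len(ops) and ops[nxt] == "SAVEFH":
              nxt += 1
          if nxt < len(ops) and ops[nxt] in PUTFH_OPS:
              return False   # only the last put FH op matters (2.6.3.1.1.2)
          if nxt >= len(ops):
              return False   # put FH op + nothing (2.6.3.1.1.6)
          if ops[nxt] in DEFER_TO_NEXT_OP:
              return False   # the following operation does the checking
          # "Anything else", including OPEN by filehandle (2.6.3.1.1.7).
          return not tuple_ok()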
1619 A COMPOUND containing the series put filehandle operation + 1620 SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way 1621 for the client to recover from NFS4ERR_WRONGSEC. 1623 The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation 1624 other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by 1625 component name). 1627 2.6.3.1.1.8. Operations after SECINFO and SECINFO_NO_NAME 1629 Suppose a client sends a COMPOUND procedure containing the series 1630 SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the security tuple 1631 used does not match that required for the target file. By rule (see 1632 Section 2.6.3.1.1.5), neither PUTFH nor SECINFO_NO_NAME can return 1633 NFS4ERR_WRONGSEC. By rule (see Section 2.6.3.1.1.7), READ cannot 1634 return NFS4ERR_WRONGSEC. The issue is resolved by the fact that 1635 SECINFO and SECINFO_NO_NAME consume the current filehandle (note that 1636 this is a change from NFSv4.0). This leaves no current filehandle 1637 for READ to use, and READ returns NFS4ERR_NOFILEHANDLE. 1639 2.6.3.1.2. LINK and RENAME 1641 The LINK and RENAME operations use both the current and saved 1642 filehandles. When the current filehandle is injected into a series 1643 of operations via a put filehandle operation, the server MUST return 1644 NFS4ERR_WRONGSEC, per Section 2.6.3.1.1. LINK and RENAME MAY return 1645 NFS4ERR_WRONGSEC if the security policy of the saved filehandle 1646 rejects the security flavor used in the COMPOUND request's 1647 credentials. If the server does so, then if there is no intersection 1648 between the security policies of saved and current filehandles, this 1649 means it will be impossible for the client to perform the intended LINK 1650 or RENAME operation. 1652 For example, suppose the client sends this COMPOUND request: 1653 SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", where 1654 filehandles bFH and aFH refer to different directories. Suppose no 1655 common security tuple exists between the security policies of aFH and 1656 bFH. If the client sends the request using credentials acceptable to 1657 bFH's security policy but not aFH's policy, then the PUTFH aFH 1658 operation will fail with NFS4ERR_WRONGSEC. After a SECINFO_NO_NAME 1659 request, the client sends SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, 1660 RENAME "c" "d", using credentials acceptable to aFH's security 1661 policy, but not bFH's policy. The server returns NFS4ERR_WRONGSEC on 1662 the RENAME operation. 1664 To prevent a client from an endless sequence of a request containing 1665 LINK or RENAME, followed by a request containing SECINFO_NO_NAME, the 1666 server MUST detect when the security policies of the current and 1667 saved filehandles have no mutually acceptable security tuple, and 1668 MUST NOT return NFS4ERR_WRONGSEC in that situation. Instead the 1669 server MUST return NFS4ERR_XDEV. 1671 Thus while a server MAY return NFS4ERR_WRONGSEC from LINK and RENAME, 1672 the server implementor may reasonably decide the consequences are not 1673 worth the security benefits, and so allow the security policy of the 1674 current filehandle to override that of the saved filehandle. 1676 2.7. Minor Versioning 1678 To address the requirement of an NFS protocol that can evolve as the 1679 need arises, the NFSv4.1 protocol contains the rules and framework to 1680 allow for future minor changes or versioning.
1682 The base assumption with respect to minor versioning is that any 1683 future accepted minor version will be documented in one or more 1684 standards track RFCs. Minor version zero of the NFSv4 protocol is 1685 represented by [29], and minor version one is represented by this 1686 document [[Comment.1: RFC Editor: change "document" to "RFC" when we 1687 publish]]. The COMPOUND and CB_COMPOUND procedures support the 1688 encoding of the minor version being requested by the client. 1690 The following items represent the basic rules for the development of 1691 minor versions. Note that a future minor version may modify or add 1692 to the following rules as part of the minor version definition. 1694 1. Procedures are not added or deleted 1696 To maintain the general RPC model, NFSv4 minor versions will not 1697 add to or delete procedures from the NFS program. 1699 2. Minor versions may add operations to the COMPOUND and 1700 CB_COMPOUND procedures. 1702 The addition of operations to the COMPOUND and CB_COMPOUND 1703 procedures does not affect the RPC model. 1705 * Minor versions may append attributes to the bitmap4 that 1706 represents sets of attributes and the fattr4 that represents 1707 sets of attribute values. 1709 This allows for the expansion of the attribute model to allow 1710 for future growth or adaptation. 1712 * Minor version X must append any new attributes after the last 1713 documented attribute. 1715 Since attribute results are specified as an opaque array of 1716 per-attribute XDR encoded results, the complexity of adding 1717 new attributes in the midst of the current definitions would 1718 be too burdensome. 1720 3. Minor versions must not modify the structure of an existing 1721 operation's arguments or results. 1723 Again the complexity of handling multiple structure definitions 1724 for a single operation is too burdensome. New operations should 1725 be added instead of modifying existing structures for a minor 1726 version. 1728 This rule does not preclude the following adaptations in a minor 1729 version. 1731 * adding bits to flag fields such as new attributes to 1732 GETATTR's bitmap4 data type and providing corresponding 1733 variants of opaque arrays, such as a notify4 used together 1734 with such bitmaps. 1736 * adding bits to existing attributes like ACLs that have flag 1737 words 1739 * extending enumerated types (including NFS4ERR_*) with new 1740 values 1742 * adding cases to a switched union 1744 4. Minor versions must not modify the structure of existing 1745 attributes. 1747 5. Minor versions must not delete operations. 1749 This prevents the potential reuse of a particular operation 1750 "slot" in a future minor version. 1752 6. Minor versions must not delete attributes. 1754 7. Minor versions must not delete flag bits or enumeration values. 1756 8. Minor versions may declare an operation MUST NOT be implemented. 1758 Specifying an operation MUST NOT be implemented is equivalent to 1759 obsoleting an operation. For the client, it means that the 1760 operation should not be sent to the server. For the server, an 1761 NFS error can be returned as opposed to "dropping" the request 1762 as an XDR decode error. This approach allows for the 1763 obsolescence of an operation while maintaining its structure so 1764 that a future minor version can reintroduce the operation. 1766 1. Minor versions may declare an attribute MUST NOT be 1767 implemented. 1769 2. Minor versions may declare a flag bit or enumeration value 1770 MUST NOT be implemented. 1772 9. 
Minor versions may downgrade features from REQUIRED to 1773 RECOMMENDED, or RECOMMENDED to OPTIONAL. 1775 10. Minor versions may upgrade features from OPTIONAL to RECOMMENDED 1776 or RECOMMENDED to REQUIRED. 1778 11. A client and server that supports minor version X should support 1779 minor versions 0 (zero) through X-1 as well. 1781 12. Except for infrastructural changes, a minor version must not 1782 introduce REQUIRED new features. 1784 This rule allows for the introduction of new functionality and 1785 forces the use of implementation experience before designating a 1786 feature as REQUIRED. On the other hand, some classes of 1787 features are infrastructural and have broad effects. Allowing 1788 infrastructural features to be RECOMMENDED or OPTIONAL 1789 complicates implementation of the minor version. 1791 13. A client MUST NOT attempt to use a stateid, filehandle, or 1792 similar returned object from the COMPOUND procedure with minor 1793 version X for another COMPOUND procedure with minor version Y, 1794 where X != Y. 1796 2.8. Non-RPC-based Security Services 1798 As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for 1799 identification, authentication, integrity, and privacy. NFSv4.1 1800 itself provides or enables additional security services as described 1801 in the next several subsections. 1803 2.8.1. Authorization 1805 Authorization to access a file object via an NFSv4.1 operation is 1806 ultimately determined by the NFSv4.1 server. A client can 1807 predetermine its access to a file object via the OPEN (Section 18.16) 1808 and the ACCESS (Section 18.1) operations. 1810 Principals with appropriate access rights can modify the 1811 authorization on a file object via the SETATTR (Section 18.30) 1812 operation. Attributes that affect access rights include: mode, 1813 owner, owner_group, acl, dacl, and sacl. See Section 5. 1815 2.8.2. Auditing 1817 NFSv4.1 provides auditing on a per file object basis, via the acl and 1818 sacl attributes as described in Section 6. It is outside the scope 1819 of this specification to specify audit log formats or management 1820 policies. 1822 2.8.3. Intrusion Detection 1824 NFSv4.1 provides alarm control on a per file object basis, via the 1825 acl and sacl attributes as described in Section 6. Alarms may serve 1826 as the basis for intrusion detection. It is outside the scope of 1827 this specification to specify heuristics for detecting intrusion via 1828 alarms. 1830 2.9. Transport Layers 1832 2.9.1. REQUIRED and RECOMMENDED Properties of Transports 1834 NFSv4.1 works over RDMA and non-RDMA-based transports with the 1835 following attributes: 1837 o The transport supports reliable delivery of data, which NFSv4.1 1838 requires but neither NFSv4.1 nor RPC has facilities for ensuring. 1839 [33] 1841 o The transport delivers data in the order it was sent. Ordered 1842 delivery simplifies detection of transmit errors, and simplifies 1843 the sending of arbitrary sized requests and responses, via the 1844 record marking protocol [3]. 1846 Where an NFSv4.1 implementation supports operation over the IP 1847 network protocol, any transport used between NFS and IP MUST be among 1848 the IETF-approved congestion control transport protocols. At the 1849 time this document was written, the only two transports that had the 1850 above attributes were TCP and SCTP. To enhance the possibilities for 1851 interoperability, an NFSv4.1 implementation MUST support operation 1852 over the TCP transport protocol. 
1854 Even if NFSv4.1 is used over a non-IP network protocol, it is 1855 RECOMMENDED that the transport support congestion control. 1857 It is permissible for a connectionless transport to be used under 1858 NFSv4.1; however, reliable and in-order delivery of data combined with 1859 congestion control by the connectionless transport is REQUIRED; as a 1860 consequence, UDP by itself MUST NOT be used as an NFSv4.1 transport. 1861 NFSv4.1 assumes that a client transport address and server transport 1862 address used to send data over a transport together constitute a 1863 connection, even if the underlying transport eschews the concept of a 1864 connection. 1866 2.9.2. Client and Server Transport Behavior 1868 If a connection-oriented transport (e.g. TCP) is used, the client 1869 and server SHOULD use long-lived connections for at least three 1870 reasons: 1872 1. This will prevent the weakening of the transport's congestion 1873 control mechanisms via short-lived connections. 1875 2. This will improve performance for the WAN environment by 1876 eliminating the need for connection setup handshakes. 1878 3. The NFSv4.1 callback model differs from NFSv4.0, and requires the 1879 client and server to maintain a client-created backchannel (see 1880 Section 2.10.3.1) for the server to use. 1882 In order to reduce congestion, if a connection-oriented transport is 1883 used, and the request is not the NULL procedure, 1885 o A requester MUST NOT retry a request unless the connection the 1886 request was sent over was lost before the reply was received. 1888 o A replier MUST NOT silently drop a request, even if the request is 1889 a retry. (The silent drop behavior of RPCSEC_GSS [4] does not 1890 apply because this behavior happens at the RPCSEC_GSS layer, a 1891 lower layer in the request processing). Instead, the replier 1892 SHOULD return an appropriate error (see Section 2.10.6.1) or it 1893 MAY disconnect the connection. 1895 When sending a reply, the replier MUST send the reply to the same 1896 full network address (e.g. if using an IP-based transport, the source 1897 port of the requester is part of the full network address) that the 1898 requester sent the request from. If using a connection-oriented 1899 transport, replies MUST be sent on the same connection the request 1900 was received from. 1902 If a connection is dropped after the replier receives the request but 1903 before the replier sends the reply, the replier might have a pending 1904 reply. If a connection is established with the same source and 1905 destination full network address as the dropped connection, then the 1906 replier MUST NOT send the reply until the client retries the request. 1907 The reason for this prohibition is that the client MAY retry a 1908 request over a different connection than is associated with the 1909 session. 1911 When using RDMA transports there are other reasons for not tolerating 1912 retries over the same connection: 1914 o RDMA transports use "credits" to enforce flow control, where a 1915 credit is a right to a peer to transmit a message. If one peer 1916 were to retransmit a request (or reply), it would consume an 1917 additional credit. If the replier retransmitted a reply, it would 1918 certainly result in an RDMA connection loss, since the requester 1919 would typically only post a single receive buffer for each 1920 request.
If the requester retransmitted a request, the additional 1921 credit consumed on the server might lead to RDMA connection 1922 failure unless the client accounted for it and decreased its 1923 available credit, leading to wasted resources. 1925 o RDMA credits present a new issue to the reply cache in NFSv4.1. 1926 The reply cache may be used when a connection within a session is 1927 lost, such as after the client reconnects. Credit information is 1928 a dynamic property of the RDMA connection, and stale values must 1929 not be replayed from the cache. This implies that the reply cache 1930 contents must not be blindly used when replies are sent from it, 1931 and credit information appropriate to the channel must be 1932 refreshed by the RPC layer. 1934 In addition, as described in Section 2.10.6.2, while a session is 1935 active, the NFSv4.1 requester MUST NOT stop waiting for a reply. 1937 2.9.3. Ports 1939 Historically, NFSv3 servers have listened over TCP port 2049. The 1940 registered port 2049 [34] for the NFS protocol should be the default 1941 configuration. NFSv4.1 clients SHOULD NOT use the RPC binding 1942 protocols as described in [35]. 1944 2.10. Session 1946 NFSv4.1 clients and servers MUST support and MUST use the session 1947 feature as described in this section. 1949 2.10.1. Motivation and Overview 1951 Previous versions and minor versions of NFS have suffered from the 1952 following: 1954 o Lack of support for Exactly Once Semantics (EOS). This includes 1955 lack of support for EOS through server failure and recovery. 1957 o Limited callback support, including no support for sending 1958 callbacks through firewalls, and races between replies to normal 1959 requests and callbacks. 1961 o Limited trunking over multiple network paths. 1963 o Requiring machine credentials for fully secure operation. 1965 Through the introduction of a session, NFSv4.1 addresses the above 1966 shortfalls with practical solutions: 1968 o EOS is enabled by a reply cache with a bounded size, making it 1969 feasible to keep the cache in persistent storage and enable EOS 1970 through server failure and recovery. One reason that previous 1971 revisions of NFS did not support EOS was because some EOS 1972 approaches often limited parallelism. As will be explained in 1973 Section 2.10.6, NFSv4.1 supports both EOS and unlimited 1974 parallelism. 1976 o The NFSv4.1 client (defined in Section 1.5, Paragraph 2) creates 1977 transport connections and provides them to the server to use for 1978 sending callback requests, thus solving the firewall issue 1979 (Section 18.34). Races between responses from client requests, 1980 and callbacks caused by the requests are detected via the 1981 session's sequencing properties which are a consequence of EOS 1982 (Section 2.10.6.3). 1984 o The NFSv4.1 client can add an arbitrary number of connections to 1985 the session, and thus provide trunking (Section 2.10.5). 1987 o The NFSv4.1 client and server produces a session key independent 1988 of client and server machine credentials which can be used to 1989 compute a digest for protecting critical session management 1990 operations (Section 2.10.8.3). 1992 o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for 1993 use by the session's backchannel that do not require the server to 1994 authenticate to a client machine principal (Section 2.10.8.2). 1996 A session is a dynamically created, long-lived server object created 1997 by a client, used over time from one or more transport connections. 
1998 Its function is to maintain the server's state relative to the 1999 connection(s) belonging to a client instance. This state is entirely 2000 independent of the connection itself, and indeed the state exists 2001 whether the connection exists or not. A client may have one or more 2002 sessions associated with it so that client-associated state may be 2003 accessed using any of the sessions associated with that client's 2004 client ID, when connections are associated with those sessions. When 2005 no connections are associated with any of a client ID's sessions for 2006 an extended time, such objects as locks, opens, delegations, layouts, 2007 etc. are subject to expiration. The session serves as an object 2008 representing a means of access by a client to the associated client 2009 state on the server, independent of the physical means of access to 2010 that state. 2012 A single client may create multiple sessions. A single session MUST 2013 NOT serve multiple clients. 2015 2.10.2. NFSv4 Integration 2017 Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major 2018 infrastructure change such as sessions would require a new major 2019 version number to an ONC RPC program like NFS. However, because 2020 NFSv4 encapsulates its functionality in a single procedure, COMPOUND, 2021 and because COMPOUND can support an arbitrary number of operations, 2022 sessions have been added to NFSv4.1 with little difficulty. COMPOUND 2023 includes a minor version number field, and for NFSv4.1 this minor 2024 version is set to 1. When the NFSv4 server processes a COMPOUND with 2025 the minor version set to 1, it expects a different set of operations 2026 than it does for NFSv4.0. NFSv4.1 defines the SEQUENCE operation, 2027 which is required for every COMPOUND that operates over an 2028 established session, with the exception of some session 2029 administration operations, such as DESTROY_SESSION (Section 18.37). 2031 2.10.2.1. SEQUENCE and CB_SEQUENCE 2033 In NFSv4.1, when the SEQUENCE operation is present, it MUST be the 2034 first operation in the COMPOUND procedure. The primary purpose of 2035 SEQUENCE is to carry the session identifier. The session identifier 2036 associates all other operations in the COMPOUND procedure with a 2037 particular session. SEQUENCE also contains required information for 2038 maintaining EOS (see Section 2.10.6). Session-enabled NFSv4.1 2039 COMPOUND requests thus have the form: 2041 +-----+--------------+-----------+------------+-----------+---- 2042 | tag | minorversion | numops |SEQUENCE op | op + args | ... 2043 | | (== 1) | (limited) | + args | | 2044 +-----+--------------+-----------+------------+-----------+---- 2046 and the replies have the form: 2048 +------------+-----+--------+-------------------------------+--// 2049 |last status | tag | numres |status + SEQUENCE op + results | // 2050 +------------+-----+--------+-------------------------------+--// 2051 //-----------------------+---- 2052 // status + op + results | ... 2053 //-----------------------+---- 2055 A CB_COMPOUND procedure request and reply has a similar form to 2056 COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE 2057 operation. CB_COMPOUND also has an additional field called 2058 "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored 2059 by the client. CB_SEQUENCE has the same information as SEQUENCE, and 2060 also includes other information needed to resolve callback races 2061 (Section 2.10.6.3). 2063 2.10.2.2. 
Client ID and Session Association 2065 Each client ID (Section 2.4) can have zero or more active sessions. 2066 A client ID and associated session are required to perform file 2067 access in NFSv4.1. Each time a session is used (whether by a client 2068 sending a request to the server, or the client replying to a callback 2069 request from the server), the state leased to its associated client 2070 ID is automatically renewed. 2072 State such as share reservations, locks, delegations, and layouts 2073 (Section 1.6.4) is tied to the client ID. Client state is not tied 2074 to any individual session. Successive state changing operations from 2075 a given state owner MAY go over different sessions, provided the 2076 session is associated with the same client ID. A callback MAY arrive 2077 over a different session than the session that originally 2078 acquired the state pertaining to the callback. For example, if 2079 session A is used to acquire a delegation, a request to recall the 2080 delegation MAY arrive over session B if both sessions are associated 2081 with the same client ID. Section 2.10.8.1 and Section 2.10.8.2 2082 discuss the security considerations around callbacks. 2084 2.10.3. Channels 2086 A channel is not a connection. A channel represents the direction 2087 ONC RPC requests are sent. 2089 Each session has one or two channels: the fore channel and the 2090 backchannel. Because there are at most two channels per session, and 2091 because each channel has a distinct purpose, channels are not 2092 assigned identifiers. 2094 The fore channel is used for ordinary requests from the client to the 2095 server, and carries COMPOUND requests and responses. A session 2096 always has a fore channel. 2098 The backchannel is used for callback requests from server to client, and 2099 carries CB_COMPOUND requests and responses. Whether there is a 2100 backchannel or not is a decision by the client; however, many features 2101 of NFSv4.1 require a backchannel. NFSv4.1 servers MUST support 2102 backchannels. 2104 Each session has resources for each channel, including separate reply 2105 caches (see Section 2.10.6.1). Note that even the backchannel 2106 requires a reply cache because some callback operations are 2107 nonidempotent. 2109 2.10.3.1. Association of Connections, Channels, and Sessions 2111 Each channel is associated with zero or more transport connections 2112 (whether of the same transport protocol or different transport 2113 protocols). A connection can be associated with one channel or both 2114 channels of a session; the client and server negotiate whether a 2115 connection will carry traffic for one channel or both channels via 2116 the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION 2117 (Section 18.34) operations. When a session is created via 2118 CREATE_SESSION, the connection that transported the CREATE_SESSION 2119 request is automatically associated with the fore channel, and 2120 optionally the backchannel. If the client specifies no state 2121 protection (Section 18.35) when the session is created, then when 2122 SEQUENCE is transmitted on a different connection, the connection is 2123 automatically associated with the fore channel of the session 2124 specified in the SEQUENCE operation. 2126 A connection's association with a session is not exclusive. A 2127 connection associated with the channel(s) of one session may be 2128 simultaneously associated with the channel(s) of other sessions 2129 including sessions associated with other client IDs.
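As a non-normative illustration of the relationships just described (a session's fore channel and optional backchannel, and the many-to-many association between connections and sessions), consider the following sketch; the class and attribute names are choices of the sketch, not protocol elements:

      # Sketch only: a connection may be bound to the fore channel, the
      # backchannel, or both, and a single connection may simultaneously
      # serve channels of several sessions, possibly for different
      # client IDs.
      class Session:
          def __init__(self, clientid):
              self.clientid = clientid
              self.fore_channel = set()    # connections carrying COMPOUND
              self.back_channel = set()    # connections carrying CB_COMPOUND

          def bind_connection(self, conn, fore=True, back=False):
              # CREATE_SESSION binds the creating connection implicitly;
              # BIND_CONN_TO_SESSION binds additional ones.
              if fore:
                  self.fore_channel.add(conn)
              if back:
                  self.back_channel.add(conn)

      # One connection shared by two sessions (client ID trunking itself
      # is discussed in Section 2.10.5):
      conn = object()
      s1, s2 = Session(clientid=1), Session(clientid=2)
      s1.bind_connection(conn, fore=True, back=True)
      s2.bind_connection(conn, fore=True)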
2131 It is permissible for connections of multiple transport types to be 2132 associated with the same channel. For example both a TCP and RDMA 2133 connection can be associated with the fore channel. In the event an 2134 RDMA and non-RDMA connection are associated with the same channel, 2135 the maximum number of slots SHOULD be at least one more than the 2136 total number of RDMA credits (Section 2.10.6.1). This way if all RDMA 2137 credits are used, the non-RDMA connection can have at least one 2138 outstanding request. If a server supports multiple transport types, 2139 it MUST allow a client to associate connections from each transport 2140 to a channel. 2142 It is permissible for a connection of one type of transport to be 2143 associated with the fore channel, and a connection of a different 2144 type to be associated with the backchannel. 2146 2.10.4. Server Scope 2148 Servers each specify a server scope value in the form of an opaque 2149 string eir_server_scope returned as part of the results of an 2150 EXCHANGE_ID operation. The purpose of the server scope is to allow a 2151 group of servers to indicate to clients that a set of servers sharing 2152 the same server scope value have arranged to use compatible values of 2153 otherwise opaque identifiers. Thus the identifiers generated by one 2154 server of that set may be presented to another of that same scope. 2156 The use of such compatible values does not imply that a value 2157 generated by one server will always be accepted by another. In most 2158 cases, it will not. However, a server will not accept a value 2159 generated by another inadvertently. When it does accept it, it will 2160 be because it is recognized as valid and carrying the same meaning as 2161 on another server of the same scope. 2163 When servers are of the same server scope, this compatibility of 2164 values applies to the following identifiers: 2166 o Filehandle values. A filehandle value accepted by two servers of 2167 the same server scope denotes the same object. A write done to 2168 one server is reflected immediately in a read done to the other 2169 and locks obtained on one server conflict with those requested on 2170 the other. 2172 o Session ID values. A session ID value accepted by two servers of 2173 the same server scope denotes the same session. 2175 o Client ID values. A client ID value accepted as valid by two 2176 servers of the same server scope is associated with two clients 2177 with the same client owner and verifier. 2179 o State ID values when the corresponding client ID is recognized as 2180 valid. If the same stateid value is accepted as valid on two 2181 servers of the same scope and the client IDs on the two servers 2182 represent the same client owner and verifier, then the two stateid 2183 values designate the same set of locks and are for the same file. 2185 o Server owner values. When the server scope values are the same, 2186 server owner values may be validly compared. In cases where the 2187 server scopes are different, server owner values are treated as 2188 different even if they contain all identical bytes. 2190 The co-ordination among servers required to provide such 2191 compatibility can be quite minimal, and limited to a simple partition 2192 of the ID space. The recognition of common values requires 2193 additional implementation, but this can be tailored to the specific 2194 situations in which that recognition is desired.
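A non-normative sketch of the comparison rule for server owner values follows (Python; the field names mirror the EXCHANGE_ID results discussed above, while the dictionary-shaped "results" and function names are assumptions of the sketch). It deliberately ignores the verification concerns discussed next, which a real client must also address:

      # Server owner values may only be compared when the server scope
      # values are identical; otherwise they are treated as different
      # even if every byte matches.
      def same_server(r1, r2):
          """r1 and r2 stand for EXCHANGE_ID results obtained over two
          network addresses, each carrying eir_server_scope and
          eir_server_owner."""
          if r1["eir_server_scope"] != r2["eir_server_scope"]:
              return False
          return (r1["eir_server_owner"]["so_major_id"] ==
                  r2["eir_server_owner"]["so_major_id"])

      def session_sharing_possible(r1, r2):
          # Sharing a session across both connections additionally
          # requires matching so_minor_id (see Section 2.5 and
          # Section 2.10.5).
          return (same_server(r1, r2) and
                  r1["eir_server_owner"]["so_minor_id"] ==
                  r2["eir_server_owner"]["so_minor_id"])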
2196 Clients will have occasion to compare the server scope values of 2197 multiple servers under a number of circumstances, each of which will 2198 be discussed under the appropriate functional section. 2200 o When server owner values received in response to EXCHANGE_ID 2201 operations issued to multiple network addresses are compared for 2202 the purpose of determining the validity of various forms of 2203 trunking, as described in Section 2.10.5. 2205 o When network or server reconfiguration causes the same network 2206 address to possibly be directed to different servers, with the 2207 necessity for the client to determine when lock reclaim should be 2208 attempted, as described in Section 8.4.2.1. 2210 o When file system migration causes the transfer of responsibility 2211 for a file system between servers, and the client needs to 2212 determine whether state has been transferred with the file system 2213 (as described in Section 11.7.7) or whether the client needs to 2214 reclaim state on a similar basis as in the case of server restart, 2215 as described in Section 8.4.2. 2217 When two replies from EXCHANGE_ID, each from two different server 2218 network addresses, have the same server scope, there are a number of 2219 ways a client can validate that the common server scope is due to two 2220 servers cooperating in a group. 2222 o If both EXCHANGE_ID requests were sent with RPCSEC_GSS 2223 authentication and the server principal is the same for both 2224 targets, the equality of server scope is validated. It is 2225 RECOMMENDED that two servers intending to share the same server 2226 scope also share the same principal name. 2228 o The client may accept the appearance of the second server in the 2229 fs_locations or fs_locations_info attribute for a relevant file 2230 system. For example, if there is a migration event for a 2231 particular file system or there are locks to be reclaimed on a 2232 particular file system, the attributes for that particular file 2233 system may be used. The client sends the GETATTR request to the 2234 first server for the fs_locations or fs_locations_info attribute 2235 with RPCSEC_GSS authentication. It may need to do this in advance 2236 of the need to verify the common server scope. If the client 2237 successfully authenticates the reply to GETATTR, and the GETATTR 2238 request and reply containing the fs_locations or fs_locations_info 2239 attribute refer to the second server, then the equality of server 2240 scope is supported. A client may choose to limit the use of this 2241 form of support to information relevant to the specific file 2242 system involved (e.g. a file system being migrated). 2244 2.10.5. Trunking 2246 Trunking is the use of multiple connections between a client and 2247 server in order to increase the speed of data transfer. NFSv4.1 2248 supports two types of trunking: session trunking and client ID 2249 trunking. 2251 NFSv4.1 servers MUST support both forms of trunking within the 2252 context of a single server network address and MUST support both 2253 forms within the context of the set of network addresses used to 2254 access a single server. NFSv4.1 servers in a clustered configuration 2255 MAY allow network addresses for different servers to use client ID 2256 trunking. 2258 Clients may use either form of trunking as long as they do not, when 2259 trunking between different server network addresses, violate the 2260 servers' mandates as to the kinds of trunking to be allowed (see 2261 below).
With regard to callback channels, the client MUST allow the 2262 server to choose among all callback channels valid for a given client 2263 ID and MUST support trunking when the connections supporting the 2264 backchannel allow session or client ID trunking to be used for 2265 callbacks. 2267 Session trunking is essentially the association of multiple 2268 connections, each with potentially different target and/or source 2269 network addresses, to the same session. When the target network 2270 addresses (server addresses) of the two connections are the same, the 2271 server MUST support such session trunking. When the target network 2272 addresses are different, the server MAY indicate such support using 2273 the data returned by the EXCHANGE_ID operation (see below). 2275 Client ID trunking is the association of multiple sessions to the 2276 same client ID. Servers MUST support client ID trunking for two 2277 target network addresses whenever they allow session trunking for 2278 those same two network addresses. In addition, a server MAY, by 2279 presenting the same major server owner ID (Section 2.5) and server 2280 scope (Section 2.10.4), allow an additional case of client ID 2281 trunking. When two servers return the same major server owner and 2282 server scope, it means that the two servers are cooperating on 2283 locking state management, which is a prerequisite for client ID 2284 trunking. 2286 Understanding and distinguishing when the client is allowed to use 2287 session and client ID trunking requires understanding how the results 2288 of the EXCHANGE_ID (Section 18.35) operation identify a server. 2289 Suppose a client sends EXCHANGE_ID over two different connections, 2290 each with a possibly different target network address, but each 2291 EXCHANGE_ID operation has the same value in the eia_clientowner 2292 field. If the same NFSv4.1 server is listening over each connection, 2293 then each EXCHANGE_ID result MUST return the same values of 2294 eir_clientid, eir_server_owner.so_major_id and eir_server_scope. The 2295 client can then treat each connection as referring to the same server 2296 (subject to verification, see Paragraph 8 later in this section), and 2297 it can use each connection to trunk requests and replies. The 2298 client's choice is whether session trunking or client ID trunking 2299 applies. 2301 Session Trunking. If the eia_clientowner argument is the same in two 2302 different EXCHANGE_ID requests, and the eir_clientid, 2303 eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and 2304 eir_server_scope results match in both EXCHANGE_ID results, then 2305 the client is permitted to perform session trunking. If the 2306 client has no session mapping to the tuple of eir_clientid, 2307 eir_server_owner.so_major_id, eir_server_scope, 2308 eir_server_owner.so_minor_id, then it creates the session via a 2309 CREATE_SESSION operation over one of the connections, which 2310 associates the connection to the session. If there is a session 2311 for the tuple, the client can send BIND_CONN_TO_SESSION to 2312 associate the connection to the session. 2314 Of course, if the client does not desire to use session trunking, 2315 it is not required to do so. It can invoke CREATE_SESSION on the 2316 connection. This will result in client ID trunking as described 2317 below. It can also decide to drop the connection if it does not 2318 choose to use trunking. 2320 Client ID Trunking.
If the eia_clientowner argument is the same in 2321 two different EXCHANGE_ID requests, and the eir_clientid, 2322 eir_server_owner.so_major_id, and eir_server_scope results match 2323 in both EXCHANGE_ID results, then the client is permitted to 2324 perform client ID trunking (regardless of whether the 2325 eir_server_owner.so_minor_id results match). The client can 2326 associate each connection with different sessions, where each 2327 session is associated with the same server. 2329 The client completes the act of client ID trunking by invoking 2330 CREATE_SESSION on each connection, using the same client ID that 2331 was returned in eir_clientid. These invocations create two 2332 sessions and also associate each connection with its respective 2333 session. The client is free to choose not to use client ID 2334 trunking by simply dropping the connection at this point. 2336 When doing client ID trunking, locking state is shared across 2337 sessions associated with that same client ID. This requires the 2338 server to coordinate state across sessions. 2340 The client should be prepared for the possibility that 2341 eir_server_owner values may be different on subsequent EXCHANGE_ID 2342 requests made to the same network address, as a result of various 2343 sorts of reconfiguration events. When this happens and the changes 2344 result in the invalidation of previously valid forms of trunking, the 2345 client should cease to use those forms, either by dropping 2346 connections or by adding sessions. For a discussion of lock reclaim 2347 as it relates to such reconfiguration events, see Section 8.4.2.1. 2349 When two servers over two connections claim matching or partially 2350 matching eir_server_owner, eir_server_scope, and eir_clientid values, 2351 the client does not have to trust the servers' claims. The client 2352 may verify these claims before trunking traffic in the following 2353 ways: 2355 o For session trunking, clients SHOULD reliably verify whether 2356 connections between different network paths are in fact associated 2357 with the same NFSv4.1 server and usable on the same session, and 2358 servers MUST allow clients to perform reliable verification. When 2359 a client ID is created, the client SHOULD specify that 2360 BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or 2361 SP4_MACH_CRED (Section 18.35) state protection options. For 2362 SP4_SSV, reliable verification depends on a shared secret (the 2363 SSV) that is established via the SET_SSV (Section 18.47) 2364 operation. 2366 When a new connection is associated with the session (via the 2367 BIND_CONN_TO_SESSION operation, see Section 18.34), if the client 2368 specified SP4_SSV state protection for the BIND_CONN_TO_SESSION 2369 operation, the client MUST send the BIND_CONN_TO_SESSION with 2370 RPCSEC_GSS protection, using integrity or privacy, and an 2371 RPCSEC_GSS handle created with the GSS SSV mechanism 2372 (Section 2.10.9). 2374 If the client mistakenly tries to associate a connection to a 2375 session of the wrong server, the server will either reject the 2376 attempt because it is not aware of the session identifier of the 2377 BIND_CONN_TO_SESSION arguments, or it will reject the attempt 2378 because the RPCSEC_GSS authentication fails.
Even if the server 2379 mistakenly or maliciously accepts the connection association 2380 attempt, the RPCSEC_GSS verifier it computes in the response will 2381 not be verified by the client, so the client will know it cannot 2382 use the connection for trunking the specified session. 2384 If the client specified SP4_MACH_CRED state protection, the 2385 BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or 2386 privacy, using the same credential that was used when the client 2387 ID was created. Mutual authentication via RPCSEC_GSS assures the 2388 client that the connection is associated with the correct session 2389 of the correct server. 2391 o For client ID trunking, the client has at least two options for 2392 verifying that the same client ID obtained from two different 2393 EXCHANGE_ID operations came from the same server. The first 2394 option is to use RPCSEC_GSS authentication when issuing each 2395 EXCHANGE_ID. Each time an EXCHANGE_ID is sent with RPCSEC_GSS 2396 authentication, the client notes the principal name of the GSS 2397 target. If the EXCHANGE_ID results indicate client ID trunking is 2398 possible, and the GSS targets' principal names are the same, the 2399 servers are the same and client ID trunking is allowed. 2401 The second option for verification is to use SP4_SSV protection. 2402 When the client sends EXCHANGE_ID, it specifies SP4_SSV protection. 2403 The first EXCHANGE_ID the client sends always has to be confirmed 2404 by a CREATE_SESSION call. The client then sends SET_SSV. Later 2405 the client sends EXCHANGE_ID to a second destination network 2406 address different from the one the first EXCHANGE_ID was sent to. 2407 The client checks that each EXCHANGE_ID reply has the same 2408 eir_clientid, eir_server_owner.so_major_id, and eir_server_scope. 2409 If so, the client verifies the claim by issuing a CREATE_SESSION 2410 to the second destination address, protected with RPCSEC_GSS 2411 integrity using an RPCSEC_GSS handle returned by the second 2412 EXCHANGE_ID. If the server accepts the CREATE_SESSION request, 2413 and if the client verifies the RPCSEC_GSS verifier and integrity 2414 codes, then the client has proof the second server knows the SSV, 2415 and thus the two servers are co-operating for the purposes of 2416 specifying server scope and client ID trunking. 2418 2.10.6. Exactly Once Semantics 2420 Via the session, NFSv4.1 offers Exactly Once Semantics (EOS) for 2421 requests sent over a channel. EOS is supported on both the fore and 2422 back channels. 2424 Each COMPOUND or CB_COMPOUND request that is sent with a leading 2425 SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver 2426 exactly once. This requirement holds regardless of whether the 2427 request is sent with reply caching specified (see 2428 Section 2.10.6.1.3). The requirement holds even if the requester is 2429 issuing the request over a session created between a pNFS data client 2430 and pNFS data server. To understand the rationale for this 2431 requirement, divide the requests into three classifications: 2433 o Nonidempotent requests. 2435 o Idempotent modifying requests. 2437 o Idempotent non-modifying requests. 2439 An example of a non-idempotent request is RENAME. It is obvious that 2440 if a replier executes the same RENAME request twice, and the first 2441 execution succeeds, the re-execution will fail. If the replier 2442 returns the result from the re-execution, this result is incorrect. 2443 Therefore, EOS is required for nonidempotent requests.
2445 An example of an idempotent modifying request is a COMPOUND request 2446 containing a WRITE operation. Repeated execution of the same WRITE 2447 has the same effect as execution of that write a single time. 2448 Nevertheless, enforcing EOS for WRITEs and other idempotent modifying 2449 requests is necessary to avoid data corruption. 2451 Suppose a client sends WRITE A to a noncompliant server that does not 2452 enforce EOS, and receives no response, perhaps due to a network 2453 partition. The client reconnects to the server and re-sends WRITE A. 2454 Now, the server has two outstanding instances of WRITE A. The server can be 2455 in a situation in which it executes and replies to the retry of A, 2456 while the first A is still waiting in the server's internal I/O 2457 system for some resource. Upon receiving the reply to the second 2458 attempt of WRITE A, the client believes its write is done, so it is 2459 free to send WRITE B, which overlaps the range of A. When the original 2460 A is dispatched from the server's I/O system, and executed (thus the 2461 second time A will have been written), then what has been written by 2462 B can be overwritten and thus corrupted. 2464 An example of an idempotent non-modifying request is a COMPOUND 2465 containing SEQUENCE, PUTFH, READLINK and nothing else. The 2466 re-execution of such a request will not cause data corruption, or 2467 produce an incorrect result. Nonetheless, to keep the implementation 2468 simple, the replier MUST enforce EOS for all requests, whether 2469 idempotent and non-modifying or not. 2471 Note that true and complete EOS is not possible unless the server 2472 persists the reply cache in stable storage, or unless the server is 2473 somehow implemented to never require a restart (indeed, if such a 2474 server exists, the distinction between a reply cache kept in stable 2475 storage versus one that is not is one without meaning). See 2476 Section 2.10.6.5 for a discussion of persistence in the reply cache. 2477 Regardless, even if the server does not persist the reply cache, EOS 2478 improves robustness and correctness over previous versions of NFS 2479 because the legacy duplicate request/reply caches were based on the 2480 ONC RPC transaction identifier (XID). Section 2.10.6.1 explains the 2481 shortcomings of the XID as a basis for a reply cache and describes 2482 how NFSv4.1 sessions improve upon the XID.
2500 In practice, previous versions of NFS have chosen to store a fixed 2501 number of replies in the cache, and use a least recently used (LRU) 2502 approach to replacing cache entries with new entries when the cache 2503 is full. In NFSv4.1, the number of outstanding requests is bounded 2504 by the size of the slot table, and a sequence ID per slot is used to 2505 tell the replier when it is safe to delete a cached reply. 2507 In the NFSv4.1 reply cache, when the requester sends a new request, 2508 it selects a slot ID in the range 0..N, where N is the replier's 2509 current maximum slot ID granted to the requester on the session over 2510 which the request is to be sent. The value of N starts out as equal 2511 to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the 2512 response to SEQUENCE or CB_SEQUENCE as described later in this 2513 section. The slot ID must be unused by any of the requests that the 2514 requester already has active on the session. "Unused" here means the 2515 requester has no outstanding request for that slot ID. 2517 A slot contains a sequence ID and the cached reply corresponding to 2518 the request sent with that sequence ID. The sequence ID is a 32 bit 2519 unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 - 2520 1). The first time a slot is used, the requester MUST specify a 2521 sequence ID of one (1) (Section 18.36). Each time a slot is reused, 2522 the request MUST specify a sequence ID that is one greater than that 2523 of the previous request on the slot. If the previous sequence ID was 2524 0xFFFFFFFF, then the next request for the slot MUST have the sequence 2525 ID set to zero (i.e. (2^32 - 1) + 1 mod 2^32). 2527 The sequence ID accompanies the slot ID in each request. It is for 2528 the critical check at the server: it is used to efficiently determine 2529 whether a request using a certain slot ID is a retransmit or a new, 2530 never-before-seen request. It is not feasible to implement this by having 2531 the client assert that it is retransmitting, because for any 2532 given request the client cannot know whether the server has seen it 2533 unless the server actually replies. Of course, if the client has 2534 seen the server's reply, the client would not retransmit. 2536 The replier compares each received request's sequence ID with the 2537 last one previously received for that slot ID, to see if the new 2538 request is: 2540 o A new request, in which the sequence ID is one greater than that 2541 previously seen in the slot (accounting for sequence wraparound). 2542 The replier proceeds to execute the new request, and the replier 2543 MUST increase the slot's sequence ID by one. 2545 o A retransmitted request, in which the sequence ID is equal to that 2546 currently recorded in the slot. If the original request has 2547 executed to completion, the replier returns the cached reply. See 2548 Section 2.10.6.2 for direction on how the replier deals with 2549 retries of requests that are still in progress. 2551 o A misordered retry, in which the sequence ID is less than 2552 (accounting for sequence wraparound) that previously seen in the 2553 slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the 2554 result from SEQUENCE or CB_SEQUENCE). 2556 o A misordered new request, in which the sequence ID is two or more 2557 greater than (accounting for sequence wraparound) that previously 2558 seen in the slot.
Note that because the sequence ID MUST 2559 wrap around to zero (0) once it reaches 0xFFFFFFFF, a misordered 2560 new request and a misordered retry cannot be distinguished. Thus, 2561 the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from 2562 SEQUENCE or CB_SEQUENCE). 2564 Unlike the XID, the slot ID is always within a specific range; this 2565 has two implications. The first implication is that for a given 2566 session, the replier need only cache the results of a limited number 2567 of COMPOUND requests. The second implication derives from the 2568 first, which is that unlike XID-indexed reply caches (also known as 2569 duplicate request caches - DRCs), the slot ID-based reply cache 2570 cannot be overflowed. Through use of the sequence ID to identify 2571 retransmitted requests, the replier does not need to actually cache 2572 the request itself, reducing the storage requirements of the reply 2573 cache further. These facilities make it practical to maintain all 2574 the required entries for an effective reply cache. 2576 The slot ID, sequence ID, and session ID therefore take over the 2577 traditional role of the XID and source network address in the 2578 replier's reply cache implementation. This approach is considerably 2579 more portable and completely robust - it is not subject to the 2580 reassignment of ports as clients reconnect over IP networks. In 2581 addition, the RPC XID is not used in the reply cache, enhancing 2582 robustness of the cache in the face of any rapid reuse of XIDs by the 2583 requester. While the replier does not care about the XID for the 2584 purposes of reply cache management (but the replier MUST return the 2585 same XID that was in the request), nonetheless there are 2586 considerations for the XID in NFSv4.1 that are the same as all other 2587 previous versions of NFS. The RPC XID remains in each message and 2588 needs to be formulated in NFSv4.1 requests as in any other ONC RPC 2589 request. The reasons include: 2591 o The RPC layer retains its existing semantics and implementation. 2593 o The requester and replier must be able to interoperate at the RPC 2594 layer, prior to the NFSv4.1 decoding of the SEQUENCE or 2595 CB_SEQUENCE operation. 2597 o If an operation is being used that does not start with SEQUENCE or 2598 CB_SEQUENCE (e.g. BIND_CONN_TO_SESSION), then the RPC XID is 2599 needed for correct operation to match the reply to the request. 2601 o The SEQUENCE or CB_SEQUENCE operation may generate an error. If 2602 so, the embedded slot ID, sequence ID, and session ID (if present) 2603 in the request will not be in the reply, and the requester has 2604 only the XID to match the reply to the request. 2606 Given that well formulated XIDs continue to be required, this raises 2607 the question: why do SEQUENCE and CB_SEQUENCE replies have a session ID, 2608 slot ID, and sequence ID? Having the session ID in the reply means 2609 the requester does not have to use the XID to look up the session ID, 2610 which would be necessary if the connection were associated with 2611 multiple sessions. Having the slot ID and sequence ID in the reply 2612 means the requester does not have to use the XID to look up the slot ID 2613 and sequence ID. Furthermore, since the XID is only 32 bits, it is 2614 too small to guarantee the re-association of a reply with its request 2615 ([36]); having session ID, slot ID, and sequence ID in the reply 2616 allows the client to validate that the reply in fact belongs to the 2617 matched request.
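The sequence ID check described earlier in this section reduces to unsigned 32-bit arithmetic. The C sketch below is illustrative only (the function and enumerator names are invented for the example); it classifies an incoming request against the sequence ID currently recorded in the slot, with wraparound from 0xFFFFFFFF to zero handled implicitly by unsigned arithmetic.

      #include <stdint.h>

      enum seq_class {
          SEQ_NEW,          /* execute, then bump the slot's sequence ID  */
          SEQ_RETRY,        /* return the cached reply (see 2.10.6.2 for
                               retries of requests still in progress)     */
          SEQ_MISORDERED    /* return NFS4ERR_SEQ_MISORDERED              */
      };

      static enum seq_class classify_request(uint32_t slot_seqid,
                                             uint32_t req_seqid)
      {
          uint32_t delta = req_seqid - slot_seqid;  /* modulo 2^32 */

          if (delta == 1)
              return SEQ_NEW;        /* one greater: a new request       */
          if (delta == 0)
              return SEQ_RETRY;      /* equal: a retransmitted request   */
          return SEQ_MISORDERED;     /* less, or two or more greater     */
      }

Because of the wraparound, a sequence ID that is two or more greater cannot be told apart from one that is less; both fall into the misordered case, matching the rule above.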
2619 The SEQUENCE (and CB_SEQUENCE) operation also carries a 2620 "highest_slotid" value, which carries additional requester slot usage 2621 information. The requester MUST always indicate the slot ID 2622 representing the outstanding request with the highest-numbered slot 2623 value. The requester should in all cases provide the most 2624 conservative value possible, although it can be increased somewhat 2625 above the actual instantaneous usage to maintain some minimum or 2626 optimal level. This provides a way for the requester to yield unused 2627 request slots back to the replier, which in turn can use the 2628 information to reallocate resources. 2630 The replier responds with both a new target highest_slotid, and an 2631 enforced highest_slotid, described as follows: 2633 o The target highest_slotid is an indication to the requester of the 2634 highest_slotid the replier wishes the requester to be using. This 2635 permits the replier to withdraw (or add) resources from a 2636 requester that has been found to not be using them, in order to 2637 more fairly share resources among a varying level of demand from 2638 other requesters. The requester must always comply with the 2639 replier's value updates, since they indicate newly established 2640 hard limits on the requester's access to session resources. 2641 However, because of request pipelining, the requester may have 2642 active requests in flight reflecting prior values; therefore, the 2643 replier must not immediately require the requester to comply. 2645 o The enforced highest_slotid indicates the highest slot ID the 2646 requester is permitted to use on a subsequent SEQUENCE or 2647 CB_SEQUENCE operation. The replier's enforced highest_slotid 2648 SHOULD be no less than the highest_slotid the requester indicated 2649 in the SEQUENCE or CB_SEQUENCE arguments. 2651 If a replier detects the client is being intransigent, i.e., it 2652 fails in a series of requests to honor the target highest_slotid 2653 even though the replier knows there are no outstanding requests at 2654 higher slot IDs, it MAY take more forceful action. When faced 2655 with intransigence, the replier MAY reply with a new enforced 2656 highest_slotid that is less than its previous enforced 2657 highest_slotid. Thereafter, if the requester continues to send 2658 requests with a highest_slotid that is greater than the replier's 2659 new enforced highest_slotid, the server MAY return 2660 NFS4ERR_BAD_HIGHSLOT, unless the slot ID in the request is greater 2661 than the new enforced highest_slotid, and the request is a retry. 2663 The replier SHOULD retain the slots it wants to retire until the 2664 requester sends a request with a highest_slotid less than or equal 2665 to the replier's new enforced highest_slotid. Also, if a request 2666 is received with a slot that is higher than the new enforced 2667 highest_slotid, and the sequence ID is one higher than what is in 2668 the slot's reply cache, then the server can both retire the slot 2669 and return NFS4ERR_BADSLOT (however, the server MUST NOT do one and 2670 not the other). (The reason it is safe to retire the slot is 2671 that by using the next sequence ID, the client is 2672 indicating it has received the previous reply for the slot.) Once 2673 the replier has forcibly lowered the enforced highest_slotid, the 2674 requester is only allowed to send retries on the to-be-retired 2675 slots. 2677 o The requester SHOULD use the lowest available slot when issuing a 2678 new request.
This way, the replier may be able to retire slot 2679 entries faster. However, where the replier is actively adjusting 2680 its granted highest_slotid, it will not be able to use only the 2681 receipt of the slot ID and highest_slotid in the request. Neither 2682 the slot ID nor the highest_slotid used in a request may reflect 2683 the replier's current idea of the requester's session limit, 2684 because the request may have been sent from the requester before 2685 the update was received. Therefore, in the downward adjustment 2686 case, the replier may have to retain a number of reply cache 2687 entries at least as large as the old value of maximum requests 2688 outstanding, until it can infer that the requester has seen a 2689 reply containing the new granted highest_slotid. The replier can 2690 infer that the requester has seen such a reply when it receives a new 2691 request with the same slot ID as the request replied to and the 2692 next higher sequence ID. 2694 2.10.6.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies 2696 When a SEQUENCE or CB_SEQUENCE operation is successfully executed, 2697 its reply MUST always be cached. Specifically, session ID, sequence 2698 ID, and slot ID MUST be cached in the reply cache. The reply from 2699 SEQUENCE also includes the highest slot ID, target highest slot ID, 2700 and status flags. Instead of caching these values, the server MAY 2701 re-compute the values from the current state of the fore channel, 2702 session and/or client ID as appropriate. Similarly, the reply from 2703 CB_SEQUENCE includes a highest slot ID and target highest slot ID. 2704 The client MAY re-compute the values from the current state of the 2705 session as appropriate. 2707 Regardless of whether a replier is re-computing highest slot ID, 2708 target slot ID, and status on replies to retries or not, the 2709 requester MUST NOT assume the values are being re-computed whenever 2710 it receives a reply after a retry is sent, since it has no way of 2711 knowing whether the reply it has received was sent by the server in 2712 response to the retry, or is a delayed response to the original 2713 request. Therefore, it may be the case that highest slot ID, target 2714 slot ID, or status bits may reflect the state of affairs when the 2715 request was first executed. Although acting based on such delayed 2716 information is valid, it may cause the receiver to do unneeded work. 2717 Requesters MAY choose to send additional requests to get the current 2718 state of affairs or use the state of affairs reported by subsequent 2719 requests, in preference to acting immediately on data which may be 2720 out of date. 2722 2.10.6.1.2. Errors from SEQUENCE and CB_SEQUENCE 2724 Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of 2725 the slot MUST NOT change. The replier MUST NOT modify the reply 2726 cache entry for the slot whenever an error is returned from SEQUENCE 2727 or CB_SEQUENCE. 2729 2.10.6.1.3. Optional Reply Caching 2731 On a per-request basis, the requester can choose to direct the replier 2732 to cache the reply to all operations after the first operation 2733 (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis 2734 fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it 2735 would not direct the replier to cache the entire reply is that the 2736 request is composed of all idempotent operations [33]. Caching the 2737 reply may offer little benefit. If the reply is too large (see 2738 Section 2.10.6.4), it may not be cacheable anyway.
Even if the reply 2739 to an idempotent request is small enough to cache, unnecessarily caching 2740 the reply slows down the server and increases RPC latency. 2742 Whether the requester requests the reply to be cached or not has no 2743 effect on the slot processing. If the results of SEQUENCE or 2744 CB_SEQUENCE are NFS4_OK, then the slot's sequence ID MUST be 2745 incremented by one. If a requester does not direct the replier to 2746 cache the reply, the replier MUST do one of the following: 2748 o The replier can cache the entire original reply. Even though 2749 sa_cachethis or csa_cachethis are FALSE, the replier is always 2750 free to cache. It may choose this approach in order to simplify 2751 implementation. 2753 o The replier enters into its reply cache a reply consisting of the 2754 original results to the SEQUENCE or CB_SEQUENCE operation, and 2755 with the next operation in COMPOUND or CB_COMPOUND having the 2756 error NFS4ERR_RETRY_UNCACHED_REP. Thus, if the requester later 2757 retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP. 2759 2.10.6.2. Retry and Replay of Reply 2761 A requester MUST NOT retry a request, unless the connection it used 2762 to send the request disconnects. The requester can then reconnect 2763 and re-send the request, or it can re-send the request over a 2764 different connection that is associated with the same session. 2766 If the requester is a server wanting to re-send a callback operation 2767 over the backchannel of a session, the requester of course cannot 2768 reconnect because only the client can associate connections with the 2769 backchannel. The server can re-send the request over another 2770 connection that is bound to the same session's backchannel. If there 2771 is no such connection, the server MUST indicate that the session has 2772 no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag 2773 bit in the response to the next SEQUENCE operation from the client. 2774 The client MUST then associate a connection with the session (or 2775 destroy the session). 2777 Note that it is not fatal for a client to retry without a disconnect 2778 between the request and retry. However, the retry does consume 2779 resources, especially with RDMA, where each request, retry or not, 2780 consumes a credit. Retries for no reason, especially retries sent 2781 shortly after the previous attempt, are a poor use of network 2782 bandwidth and defeat the purpose of a transport's inherent congestion 2783 control system. 2785 A requester MUST wait for a reply to a request before using the slot 2786 for another request. If it does not wait for a reply, then the 2787 requester does not know what sequence ID to use for the slot on its 2788 next request. For example, suppose a requester sends a request with 2789 sequence ID 1, and does not wait for the response. The next time it 2790 uses the slot, it sends the new request with sequence ID 2. If the 2791 replier has not seen the request with sequence ID 1, then the replier 2792 is not expecting sequence ID 2, and rejects the requester's new 2793 request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or 2794 CB_SEQUENCE). 2796 RDMA fabrics do not guarantee that the memory handles (Steering Tags) 2797 within each RPC/RDMA "chunk" ([8]) are valid on a scope outside that 2798 of a single connection. Therefore, handles used by the direct 2799 operations become invalid after connection loss.
The server must 2800 ensure that any RDMA operations which must be replayed from the reply 2801 cache use the newly provided handle(s) from the most recent request. 2803 A retry might be sent while the original request is still in progress 2804 on the replier. The replier SHOULD deal with the issue by returning 2805 NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE operation, but 2806 implementations MAY return NFS4ERR_SEQ_MISORDERED. Since errors from 2807 SEQUENCE and CB_SEQUENCE are never recorded in the reply cache, this 2808 approach allows the results of the execution of the original request 2809 to be properly recorded in the reply cache (assuming the requester 2810 specified the reply to be cached). 2812 2.10.6.3. Resolving Server Callback Races 2814 It is possible for server callbacks to arrive at the client before 2815 the reply from related fore channel operations. For example, a 2816 client may have been granted a delegation to a file it has opened, 2817 but the reply to the OPEN (informing the client of the granting of 2818 the delegation) may be delayed in the network. If a conflicting 2819 operation arrives at the server, it will recall the delegation using 2820 the backchannel, which may be on a different transport connection, 2821 perhaps even a different network, or even a different session 2822 associated with the same client ID. 2824 The presence of a session between client and server alleviates this 2825 issue. When a session is in place, each client request is uniquely 2826 identified by its { session ID, slot ID, sequence ID } triple. By 2827 the rules under which slot entries (reply cache entries) are retired, 2828 the server has knowledge of whether the client has "seen" each of the 2829 server's replies. The server can therefore provide sufficient 2830 information to the client to allow it to disambiguate between an 2831 erroneous or conflicting callback and a race condition. 2833 For each client operation which might result in some sort of server 2834 callback, the server SHOULD "remember" the { session ID, slot ID, 2835 sequence ID } triple of the client request until the slot ID 2836 retirement rules allow the server to determine that the client has, 2837 in fact, seen the server's reply. Until the time the { session ID, 2838 slot ID, sequence ID } request triple can be retired, any recalls of 2839 the associated object MUST carry an array of these referring 2840 identifiers (in the CB_SEQUENCE operation's arguments), for the 2841 benefit of the client. After this time, it is not necessary for the 2842 server to provide this information in related callbacks, since it is 2843 certain that a race condition can no longer occur. 2845 The CB_SEQUENCE operation which begins each server callback carries a 2846 list of "referring" { session ID, slot ID, sequence ID } triples. If 2847 the client finds the request corresponding to the referring session 2848 ID, slot ID and sequence ID to be currently outstanding (i.e. the 2849 server's reply has not been seen by the client), it can determine 2850 that the callback has raced the reply, and act accordingly. If the 2851 client does not find the request corresponding to the referring triple 2852 to be outstanding (including the case of a session ID referring to a 2853 destroyed session), then there is no race with respect to this 2854 triple. The server SHOULD limit the referring triples to requests 2855 that refer to just those that apply to the objects referred to in the 2856 CB_COMPOUND procedure.
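As an illustration of the referring-triple check just described, the C sketch below shows one way a client might decide whether a callback has raced a reply. It is a minimal sketch under assumed data structures; the types and the request_outstanding() helper are invented for the example and stand in for the client's own bookkeeping of requests still awaiting replies.

      #include <stdbool.h>
      #include <stdint.h>

      struct referring_call {
          unsigned char sessionid[16];   /* NFS4_SESSIONID_SIZE */
          uint32_t      slotid;
          uint32_t      seqid;
      };

      /* Placeholder: a real client would consult its table of requests
       * that have been sent but whose replies have not yet been seen.
       * A triple naming an unknown or destroyed session simply counts
       * as "not outstanding".
       */
      static bool request_outstanding(const struct referring_call *rc)
      {
          (void)rc;
          return false;   /* stub for the sketch */
      }

      /* Returns true if any referring triple names a request that is
       * still outstanding, i.e. the callback has raced the reply.
       */
      static bool callback_races_reply(const struct referring_call *list,
                                       unsigned int count)
      {
          for (unsigned int i = 0; i < count; i++)
              if (request_outstanding(&list[i]))
                  return true;
          return false;    /* no race with respect to these triples */
      }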
2858 The client must not simply wait forever for the expected server reply 2859 to arrive before responding to the CB_COMPOUND that won the race, 2860 because it is possible that it will be delayed indefinitely. The 2861 client should assume the likely case that the reply will arrive 2862 within the average round trip time for COMPOUND requests to the 2863 server, and wait that period of time. If that period of time expires, 2864 it can respond to the CB_COMPOUND with NFS4ERR_DELAY. 2866 There are other scenarios under which callbacks may race replies. 2867 Among them are pNFS layout recalls as described in Section 12.5.5.2. 2869 2.10.6.4. COMPOUND and CB_COMPOUND Construction Issues 2871 Very large requests and replies may pose both buffer management 2872 issues (especially with RDMA) and reply cache issues. When the 2873 session is created (Section 18.36), for each channel (fore and 2874 back), the client and server negotiate the maximum sized request they 2875 will send or process (ca_maxrequestsize), the maximum sized reply 2876 they will return or process (ca_maxresponsesize), and the maximum 2877 sized reply they will store in the reply cache 2878 (ca_maxresponsesize_cached). 2880 If a request exceeds ca_maxrequestsize, the reply will have the 2881 status NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG 2882 as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the 2883 request (which means no operations in the request executed, and the 2884 state of the slot in the reply cache is unchanged), or it MAY opt to 2885 return it on a subsequent operation in the same COMPOUND or 2886 CB_COMPOUND request (which means at least one operation did execute 2887 and the state of the slot in the reply cache does change). The replier 2888 SHOULD set NFS4ERR_REQ_TOO_BIG on the operation that exceeds 2889 ca_maxrequestsize. 2891 If a reply exceeds ca_maxresponsesize, the reply will have the status 2892 NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the 2893 status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, 2894 or it MAY opt to return it on a subsequent operation (in the same 2895 COMPOUND or CB_COMPOUND reply). A replier MAY return 2896 NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if 2897 the response would still exceed ca_maxresponsesize. 2899 If sa_cachethis or csa_cachethis are TRUE, then the replier MUST 2900 cache a reply except if an error is returned by the SEQUENCE or 2901 CB_SEQUENCE operation (see Section 2.10.6.1.2). If the reply exceeds 2902 ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis are 2903 TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even 2904 if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) 2905 is returned on an operation other than the first operation (SEQUENCE or 2906 CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or 2907 csa_cachethis are TRUE. For example, if a COMPOUND has eleven 2908 operations, including SEQUENCE, the fifth operation is a RENAME, and 2909 the tenth operation is a READ for one million bytes, the server may 2910 return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since 2911 the server executed several operations, especially the non-idempotent 2912 RENAME, the client's request to cache the reply needs to be honored 2913 in order for correct operation of exactly once semantics.
If the 2914 client retries the request, the server will have cached a reply that 2915 contains results for ten of the eleven requested operations, with the 2916 tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE. 2918 A client needs to take care, when sending operations that change 2919 the current filehandle (except for PUTFH, PUTPUBFH, PUTROOTFH and 2920 RESTOREFH), that it not exceed the maximum reply buffer before the 2921 GETFH operation. Otherwise, the client will have to retry the 2922 operation that changed the current filehandle, in order to obtain the 2923 desired filehandle. For the OPEN operation (see Section 18.16), 2924 retry is not always available as an option. The following guidelines 2925 for the handling of filehandle changing operations are advised: 2927 o Within the same COMPOUND procedure, a client SHOULD send GETFH 2928 immediately after a current filehandle changing operation. A 2929 client MUST send GETFH after a current filehandle changing 2930 operation that is also non-idempotent (e.g., the OPEN operation), 2931 unless the operation is RESTOREFH. RESTOREFH is an exception, 2932 because even though it is non-idempotent, the filehandle RESTOREFH 2933 produced originated from an operation that is either idempotent 2934 (e.g. PUTFH, LOOKUP), or non-idempotent (e.g. OPEN, CREATE). If 2935 the origin is non-idempotent, then because the client MUST send 2936 GETFH after the origin operation, the client can recover if 2937 RESTOREFH returns an error. 2939 o A server MAY return NFS4ERR_REP_TOO_BIG or 2940 NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a 2941 filehandle changing operation if the reply would be too large on 2942 the next operation. 2944 o A server SHOULD return NFS4ERR_REP_TOO_BIG or 2945 NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a 2946 filehandle changing non-idempotent operation if the reply would be 2947 too large on the next operation, especially if the operation is 2948 OPEN. 2950 o A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent 2951 current filehandle changing operation, if it looks at the next 2952 operation (in the same COMPOUND procedure) and finds it is not 2953 GETFH. The server SHOULD do this if it is unable to determine in 2954 advance whether the total response size would exceed 2955 ca_maxresponsesize_cached or ca_maxresponsesize. 2957 2.10.6.5. Persistence 2959 Since the reply cache is bounded, it is practical for the reply cache 2960 to persist across server restarts. The replier MUST persist the 2961 following information if it agreed to persist the session (when the 2962 session was created; see Section 18.36): 2964 o The session ID. 2966 o The slot table including the sequence ID and cached reply for each 2967 slot. 2969 The above are sufficient for a replier to provide EOS semantics for 2970 any requests that were sent and executed before the server restarted. 2971 If the replier is a client, then there is no need for it to persist 2972 any more information, unless the client will be persisting all other 2973 state across client restart, in which case the server will never 2974 see any NFSv4.1-level protocol manifestation of a client restart. If 2975 the replier is a server, with just the slot table and session ID 2976 persisting, any requests the client retries after the server restart 2977 will return the results that are cached in the reply cache, and any new 2978 requests (i.e.
the sequence ID is one (1) greater than the slot's 2979 sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by 2980 SEQUENCE). Such a session is considered dead. A server MAY 2981 re-animate a session after a server restart so that the session will 2982 accept new requests as well as retries. To re-animate a session, the 2983 server needs to persist additional information through server 2984 restart: 2986 o The client ID. This is a prerequisite to let the client create 2987 more sessions associated with the same client ID. 2989 o The client ID's sequence ID that is used for creating sessions 2990 (see Section 18.35 and Section 18.36). This is a prerequisite to 2991 let the client create more sessions. 2993 o The principal that created the client ID. This allows the server 2994 to authenticate the client when it sends EXCHANGE_ID. 2996 o The SSV, if SP4_SSV state protection was specified when the client 2997 ID was created (see Section 18.35). This lets the client create 2998 new sessions, and associate connections with the new and existing 2999 sessions. 3001 o The properties of the client ID as defined in Section 18.35. 3003 A persistent reply cache places certain demands on the server. The 3004 execution of the sequence of operations (starting with SEQUENCE) and 3005 placement of its results in the persistent cache MUST be atomic. If 3006 a client retries a sequence of operations that was previously 3007 executed on the server, the only acceptable outcomes are either the 3008 original cached reply or an indication that the client ID or session has 3009 been lost (indicating a catastrophic loss of the reply cache or a 3010 session that has been deleted because the client failed to use the 3011 session for an extended period of time). 3013 A server could fail and restart in the middle of a COMPOUND procedure 3014 that contains one or more non-idempotent or idempotent-but-modifying 3015 operations. This creates an even higher challenge for atomic 3016 execution and placement of results in the reply cache. One way to 3017 view the problem is as a single transaction consisting of each 3018 operation in the COMPOUND followed by storing the result in 3019 persistent storage, then finally a transaction commit. If there is a 3020 failure before the transaction is committed, then the server rolls 3021 back the transaction. If the server itself fails, then when it restarts, 3022 its recovery logic could roll back the transaction before starting 3023 the NFSv4.1 server. 3025 While the description of the implementation for atomic execution of 3026 the request and caching of the reply is beyond the scope of this 3027 document, an example implementation for NFSv2 [37] is described in 3028 [38]. 3030 2.10.7. RDMA Considerations 3032 A complete discussion of the operation of RPC-based protocols over 3033 RDMA transports is in [8]. A discussion of the operation of NFSv4, 3034 including NFSv4.1, over RDMA is in [9]. Where RDMA is considered, 3035 this specification assumes the use of such a layering; it addresses 3036 only the upper layer issues relevant to making best use of RPC/RDMA. 3038 2.10.7.1. RDMA Connection Resources 3040 RDMA requires its consumers to register memory and post buffers of a 3041 specific size and number for receive operations. 3043 Registration of memory can be a relatively high-overhead operation, 3044 since it requires pinning of buffers, assignment of attributes (e.g. 3045 readable/writable), and initialization of hardware translation.
3046 Preregistration is desirable to reduce overhead. These registrations 3047 are specific to hardware interfaces and even to RDMA connection 3048 endpoints; therefore, negotiation of their limits is desirable to 3049 manage resources effectively. 3051 Following basic registration, these buffers must be posted by the RPC 3052 layer to handle receives. These buffers remain in use by the 3053 RPC/NFSv4.1 implementation; the size and number of them must be known to 3054 the remote peer in order to avoid RDMA errors which would cause a 3055 fatal error on the RDMA connection. 3057 NFSv4.1 manages slots as resources on a per session basis (see 3058 Section 2.10), while RDMA connections manage credits on a per 3059 connection basis. This means that in order for a peer to send data 3060 over RDMA to a remote buffer, it has to have both an NFSv4.1 slot 3061 and an RDMA credit. If multiple RDMA connections are associated with 3062 a session, then if the total number of credits across all RDMA 3063 connections associated with the session is X, and the number of slots in 3064 the session is Y, then the maximum number of outstanding requests is 3065 the lesser of X and Y. 3067 2.10.7.2. Flow Control 3069 Previous versions of NFS do not provide flow control; instead they 3070 rely on the windowing provided by transports like TCP to throttle 3071 requests. This does not work with RDMA, which provides no operation 3072 flow control and will terminate a connection in error when limits are 3073 exceeded. Limits such as maximum number of requests outstanding are 3074 therefore negotiated when a session is created (see the 3075 ca_maxrequests field in Section 18.36). These limits then provide 3076 the maxima which each connection associated with the session's 3077 channel(s) must remain within. RDMA connections are managed within 3078 these limits as described in section 3.3 ("Flow Control"[[Comment.2: 3079 RFC Editor: please verify section and title of the RPCRDMA document 3080 which is currently at 3081 http://tools.ietf.org/html/draft-ietf-nfsv4-rpcrdma-08#section-3.3]]) 3082 of [8]; if there are multiple RDMA connections, then the maximum 3083 number of requests for a channel will be divided among the RDMA 3084 connections. Put a different way, the onus is on the replier to 3085 ensure that the total number of RDMA credits across all connections 3086 associated with the replier's channel does not exceed the channel's 3087 maximum number of outstanding requests. 3089 The limits may also be modified dynamically at the replier's choosing 3090 by manipulating certain parameters present in each NFSv4.1 reply. In 3091 addition, the CB_RECALL_SLOT callback operation (see Section 20.8) 3092 can be sent by a server to a client to return RDMA credits to the 3093 server, thereby lowering the maximum number of requests a client can 3094 have outstanding to the server. 3096 2.10.7.3. Padding 3098 Header padding is requested by each peer at session initiation (see 3099 the ca_headerpadsize argument to CREATE_SESSION in Section 18.36), 3100 and subsequently used by the RPC RDMA layer, as described in [8]. 3101 Zero padding is permitted. 3103 Padding leverages the useful property that RDMA preserves alignment of 3104 data, even when the data are placed into anonymous (untagged) buffers. 3105 If requested, client inline writes will insert appropriate pad bytes 3106 within the request header to align the data payload on the specified 3107 boundary.
The client is encouraged to add sufficient padding (up to 3108 the negotiated size) so that the "data" field of the NFSv4.1 WRITE 3109 operation is aligned. Most servers can make good use of such 3110 padding, which allows them to chain receive buffers in such a way 3111 that any data carried by client requests will be placed into 3112 appropriate buffers at the server, ready for file system processing. 3113 The receiver's RPC layer encounters no overhead from skipping over 3114 pad bytes, and the RDMA layer's high performance makes the insertion 3115 and transmission of padding on the sender a significant optimization. 3116 In this way, the need for servers to perform RDMA Read to satisfy all 3117 but the largest client writes is obviated. An added benefit is the 3118 reduction of message round trips on the network - a potentially good 3119 trade, where latency is present. 3121 The value to choose for padding is subject to a number of criteria. 3122 A primary source of variable-length data in the RPC header is the 3123 authentication information, the form of which is client-determined, 3124 possibly in response to server specification. The contents of 3125 COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all 3126 go into the determination of a maximal NFSv4.1 request size and 3127 therefore minimal buffer size. The client must select its offered 3128 value carefully, so as not to overburden the server, and vice versa. 3129 The benefit of an appropriate padding value is higher performance. 3130 [[Comment.3: RFC editor please keep this diagram on one page.]]

3132    Sender gather:
3133      |RPC Request|Pad bytes|Length| -> |User data...|
3134      \------+----------------------/ \
3135              \                        \
3136               \    Receiver scatter:   \-----------+- ...
3137          /-----+----------------\       \           \
3138          |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->...

3140 In the above case, the server may recycle unused buffers to the next 3141 posted receive if unused by the actual received request, or may pass 3142 the now-complete buffers by reference for normal write processing. 3143 For a server which can make use of it, this removes any need for data 3144 copies of incoming data, without resorting to complicated end-to-end 3145 buffer advertisement and management. This includes most kernel-based 3146 and integrated server designs, among many others. The client may 3147 perform similar optimizations, if desired. 3149 2.10.7.4. Dual RDMA and Non-RDMA Transports 3151 Some RDMA transports (e.g., RFC5040 [10]) permit a "streaming" 3152 (non-RDMA) phase, where ordinary traffic might flow before "stepping up" 3153 to RDMA mode, commencing RDMA traffic. Some RDMA transports always start 3154 connections in RDMA mode. NFSv4.1 allows, but does not 3155 assume, a streaming phase before RDMA mode. When a connection is 3156 associated with a session, the client and server negotiate whether 3157 the connection is used in RDMA or non-RDMA mode (see Section 18.36 3158 and Section 18.34). 3160 2.10.8. Sessions Security 3162 2.10.8.1. Session Callback Security 3164 Via session/connection association, NFSv4.1 improves security over 3165 that provided by NFSv4.0 for the backchannel. The connection is 3166 client-initiated (see Section 18.34), and subject to the same 3167 firewall and routing checks as the fore channel. At the client's 3168 option (see Section 18.35), connection association is fully 3169 authenticated before being activated (see Section 18.34).
Traffic 3170 from the server over the backchannel is authenticated exactly as the 3171 client specifies (see Section 2.10.8.2). 3173 2.10.8.2. Backchannel RPC Security 3175 When the NFSv4.1 client establishes the backchannel, it informs the 3176 server of the security flavors and principals to use when sending 3177 requests. If the security flavor is RPCSEC_GSS, the client expresses 3178 the principal in the form of an established RPCSEC_GSS context. The 3179 server is free to use any of the flavor/principal combinations the 3180 client offers, but it MUST NOT use unoffered combinations. This way, 3181 the client need not provide a target GSS principal for the 3182 backchannel as it did with NFSv4.0, nor does the server have to implement 3183 an RPCSEC_GSS initiator as it did with NFSv4.0 [29]. 3185 The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL 3186 (Section 18.33) operations allow the client to specify 3187 flavor/principal combinations. 3189 Also note that the SP4_SSV state protection mode (see Section 18.35 3190 and Section 2.10.8.3) has the side benefit of providing SSV-derived 3191 RPCSEC_GSS contexts (Section 2.10.9). 3193 2.10.8.3. Protection from Unauthorized State Changes 3195 As described to this point in the specification, the state model of 3196 NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation 3197 with a forged session ID and with a slot ID that it expects the 3198 legitimate client to use next. When the legitimate client uses the 3199 slot ID with the same sequence number, the server returns the 3200 attacker's result from the reply cache, which disrupts the legitimate 3201 client and thus denies service to it. Similarly, an attacker could 3202 send a CREATE_SESSION with a forged client ID to create a new session 3203 associated with the client ID. The attacker could send requests 3204 using the new session that change locking state, such as LOCKU 3205 operations to release locks the legitimate client has acquired. 3206 Setting a security policy on the file that requires RPCSEC_GSS 3207 credentials when manipulating the file's state is one potential 3208 workaround, but has the disadvantage of preventing a legitimate client 3209 from releasing state when RPCSEC_GSS is required to do so, but a GSS 3210 context cannot be obtained (possibly because the user has logged off 3211 the client). 3213 NFSv4.1 provides three options to a client for state protection, which 3214 are specified when a client creates a client ID via EXCHANGE_ID 3215 (Section 18.35). 3217 The first (SP4_NONE) is to simply waive state protection. 3219 The other two options (SP4_MACH_CRED and SP4_SSV) share several 3220 traits: 3222 o An RPCSEC_GSS-based credential is used to authenticate client ID 3223 and session maintenance operations, including creating and 3224 destroying a session, associating a connection with the session, 3225 and destroying the client ID. 3227 o Because RPCSEC_GSS is used to authenticate client ID and session 3228 maintenance, the attacker cannot associate a rogue connection with 3229 a legitimate session, or associate a rogue session with a 3230 legitimate client ID in order to maliciously alter the client ID's 3231 lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc. 3233 o In cases where the server's security policies on a portion of its 3234 namespace require RPCSEC_GSS authentication, a client may have to 3235 use an RPCSEC_GSS credential to remove per-file state (e.g., 3236 LOCKU, CLOSE, etc.).
The server may require that the principal that removes the state match certain criteria (e.g., the principal might have to be the same as the one that acquired the state).  However, the client might not have an RPCSEC_GSS context for such a principal, and might not be able to create such a context (perhaps because the user has logged off).  When the client establishes SP4_MACH_CRED or SP4_SSV protection, it can specify a list of operations that the server MUST allow using the machine credential (if SP4_MACH_CRED is used) or the SSV credential (if SP4_SSV is used).

The SP4_MACH_CRED state protection option uses a machine credential where the principal that creates the client ID MUST also be the principal that performs client ID and session maintenance operations.

The security of the machine credential state protection approach depends entirely on safeguarding the per-machine credential.  Assuming a proper safeguard, using the per-machine credential for operations like CREATE_SESSION, BIND_CONN_TO_SESSION, DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from associating a rogue connection with a session, or associating a rogue session with a client ID.

There are at least three scenarios for the SP4_MACH_CRED option:

1.  The system administrator configures a unique, permanent per-machine credential for one of the mandated GSS mechanisms (e.g., if Kerberos V5 is used, a "keytab" containing a principal derived from a client host name could be used).

2.  The client is used by a single user, and so the client ID and its sessions are used by just that user.  If the user's credential expires, then session and client ID maintenance cannot occur, but since the client has a single user, only that user is inconvenienced.

3.  The physical client has multiple users, but the client implementation has a unique client ID for each user.  This is effectively the same as the second scenario, but a disadvantage is that each user needs to be allocated at least one session, so the approach suffers from a lack of economy.

The SP4_SSV protection option uses the SSV (Section 1.5), via RPCSEC_GSS and the SSV GSS mechanism (Section 2.10.9), to protect state from attack.  The SP4_SSV protection option is intended for the situation in which a client has multiple active users and a system administrator who wants to avoid the burden of installing a permanent machine credential on each client.  The SSV is established and updated on the server via SET_SSV (see Section 18.47).  To prevent eavesdropping, a client SHOULD send SET_SSV via RPCSEC_GSS with the privacy service.  Several aspects of the SSV make it intractable for an attacker to guess the SSV, and thus to associate rogue connections with a session or rogue sessions with a client ID:

o  The arguments to and results of SET_SSV include digests of the old and new SSV, respectively.

o  Because the initial value of the SSV is zero and therefore known, the client that opts for SP4_SSV protection and opts to apply SP4_SSV protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST send at least one SET_SSV operation before the first BIND_CONN_TO_SESSION operation or before the second CREATE_SESSION operation on a client ID.
If it does not, the SSV mechanism will not generate 3300 tokens (Section 2.10.9). A client SHOULD send SET_SSV as soon as 3301 a session is created. 3303 o A SET_SSV request does not replace the SSV with the argument to 3304 SET_SSV. Instead, the current SSV on the server is logically 3305 exclusive ORed (XORed) with the argument to SET_SSV. Each time a 3306 new principal uses a client ID for the first time, the client 3307 SHOULD send a SET_SSV with that principal's RPCSEC_GSS 3308 credentials, with RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY. 3310 Here are the types of attacks that can be attempted by an attacker 3311 named Eve on a victim named Bob, and how SP4_SSV protection foils 3312 each attack: 3314 o Suppose Eve is the first user to log into a legitimate client. 3315 Eve's use of an NFSv4.1 file system will cause the legitimate 3316 client to create a client ID with SP4_SSV protection, specifying 3317 that the BIND_CONN_TO_SESSION operation MUST use the SSV 3318 credential. Eve's use of the file system also causes an SSV to be 3319 created. The SET_SSV operation that creates the SSV will be 3320 protected by the RPCSEC_GSS context created by the legitimate 3321 client which uses Eve's GSS principal and credentials. Eve can 3322 eavesdrop on the network while her RPCSEC_GSS context is created, 3323 and the SET_SSV using her context is sent. Even if the legitimate 3324 client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve 3325 knows her own credentials, she can decrypt the SSV. Eve can 3326 compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will 3327 accept, and so associate a new connection with the legitimate 3328 session. Eve can change the slot ID and sequence state of a 3329 legitimate session, and/or the SSV state, in such a way that when 3330 Bob accesses the server via the same legitimate client, the 3331 legitimate client will be unable to use the session. 3333 The client's only recourse is to create a new client ID for Bob to 3334 use, and establish a new SSV for the client ID. The client will 3335 be unable to delete the old client ID, and will let the lease on 3336 the old client ID expire. 3338 Once the legitimate client establishes an SSV over the new session 3339 using Bob's RPCSEC_GSS context, Eve can use the new session via 3340 the legitimate client, but she cannot disrupt Bob. Moreover, 3341 because the client SHOULD have modified the SSV due to Eve using 3342 the new session, Bob cannot get revenge on Eve by associating a 3343 rogue connection with the session. 3345 The question is how did the legitimate client detect that Eve has 3346 hijacked the old session? When the client detects that a new 3347 principal, Bob, wants to use the session, it SHOULD have sent a 3348 SET_SSV, which leads to following sub-scenarios: 3350 * Let us suppose that from the rogue connection, Eve sent a 3351 SET_SSV with the same slot ID and sequence ID that the 3352 legitimate client later uses. The server will assume the 3353 SET_SSV sent with Bob's credentials is a retry, and return to 3354 the legitimate client the reply it sent Eve. However, unless 3355 Eve can correctly guess the SSV the legitimate client will use, 3356 the digest verification checks in the SET_SSV response will 3357 fail. That is an indication to the client that the session has 3358 apparently been hijacked. 3360 * Alternatively, Eve sent a SET_SSV with a different slot ID than 3361 the legitimate client uses for its SET_SSV. 
Then the digest 3362 verification of the SET_SSV sent with Bob's credentials fails 3363 on the server, and the error returned to the client makes it 3364 apparent that the session has been hijacked. 3366 * Alternatively, Eve sent an operation other than SET_SSV, but 3367 with the same slot ID and sequence that the legitimate client 3368 uses for its SET_SSV. The server returns to the legitimate 3369 client the response it sent Eve. The client sees that the 3370 response is not at all what it expects. The client assumes 3371 either session hijacking or a server bug, and either way 3372 destroys the old session. 3374 o Eve associates a rogue connection with the session as above, and 3375 then destroys the session. Again, Bob goes to use the server from 3376 the legitimate client, which sends a SET_SSV using Bob's 3377 credentials. The client receives an error that indicates the 3378 session does not exist. When the client tries to create a new 3379 session, this will fail because the SSV it has does not match that 3380 the server has, and now the client knows the session was hijacked. 3381 The legitimate client establishes a new client ID. 3383 o If Eve creates a connection before the legitimate client 3384 establishes an SSV, because the initial value of the SSV is zero 3385 and therefore known, Eve can send a SET_SSV that will pass the 3386 digest verification check. However because the new connection has 3387 not been associated with the session, the SET_SSV is rejected for 3388 that reason. 3390 In summary, an attacker's disruption of state when SP4_SSV protection 3391 is in use is limited to the formative period of a client ID, its 3392 first session, and the establishment of the SSV. Once a non- 3393 malicious user uses the client ID, the client quickly detects any 3394 hijack and rectifies the situation. Once a non-malicious user 3395 successfully modifies the SSV, the attacker cannot use NFSv4.1 3396 operations to disrupt the non-malicious user. 3398 Note that neither the SP4_MACH_CRED nor SP4_SSV protection approaches 3399 prevent hijacking of a transport connection that has previously been 3400 associated with a session. If the goal of a counter threat strategy 3401 is to prevent connection hijacking, the use of IPsec is RECOMMENDED. 3403 If a connection hijack occurs, the hijacker could in theory change 3404 locking state and negatively impact the service to legitimate 3405 clients. However if the server is configured to require the use of 3406 RPCSEC_GSS with integrity or privacy on the affected file objects, 3407 and if EXCHGID4_FLAG_BIND_PRINC_STATEID capability (Section 18.35), 3408 is in force, this will thwart unauthorized attempts to change locking 3409 state. 3411 2.10.9. The Secret State Verifier (SSV) GSS Mechanism 3413 The SSV provides the secret key for a GSS mechanism internal to 3414 NFSv4.1 that NFSv4.1 uses for state protection. Contexts for this 3415 mechanism are not established via the RPCSEC_GSS protocol. Instead, 3416 the contexts are automatically created when EXCHANGE_ID specifies 3417 SP4_SSV protection. The only tokens defined are the PerMsgToken 3418 (emitted by GSS_GetMIC) and the SealedMessage token (emitted by 3419 GSS_Wrap). 3421 The mechanism OID for the SSV mechanism is: 3422 iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech 3423 (1.3.6.1.4.1.28882.1.1). 
While the SSV mechanism does not define any initial context tokens, the OID can be used to let servers indicate that the SSV mechanism is acceptable whenever the client sends a SECINFO or SECINFO_NO_NAME operation (see Section 2.6).

The SSV mechanism defines four subkeys derived from the SSV value.  Each time SET_SSV is invoked, the subkeys are recalculated by the client and server.  The calculation of each of the four subkeys depends on each of the four respective ssv_subkey4 enumerated values.  The calculation uses the HMAC [11] algorithm, with the current SSV as the key, the one-way hash algorithm negotiated by EXCHANGE_ID, and, as the input text, the XDR-encoded enumeration value of data type ssv_subkey4 for that subkey.  If the length of the output of the HMAC algorithm exceeds the length of the key of the encryption algorithm (which is also negotiated by EXCHANGE_ID), then the subkey MUST be truncated from the HMAC output; i.e., if the subkey is N bytes long, then the first N bytes of the HMAC output MUST be used as the subkey.  The specification of EXCHANGE_ID states that the length of the output of the HMAC algorithm MUST NOT be less than the length of the subkey needed for the encryption algorithm (see Section 18.35).

   /* Input for computing subkeys */
   enum ssv_subkey4 {
           SSV4_SUBKEY_MIC_I2T     = 1,
           SSV4_SUBKEY_MIC_T2I     = 2,
           SSV4_SUBKEY_SEAL_I2T    = 3,
           SSV4_SUBKEY_SEAL_T2I    = 4
   };

The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating message integrity codes (MICs) that originate from the NFSv4.1 client, whether as part of a request over the fore channel or a response over the backchannel.  The subkey derived from SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1 server.  The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for encryption text originating from the NFSv4.1 client, and the subkey derived from SSV4_SUBKEY_SEAL_T2I is used for encryption text originating from the NFSv4.1 server.
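The following sketch illustrates the subkey derivation just described.  It is non-normative: the SHA-1 hash and the 16-byte key length are placeholders for whatever one-way hash and encryption algorithms EXCHANGE_ID actually negotiated, and the 4-byte big-endian integer stands in for the XDR encoding of the ssv_subkey4 enumeration value.

   import hmac, hashlib, struct

   SSV4_SUBKEY_MIC_I2T, SSV4_SUBKEY_MIC_T2I = 1, 2
   SSV4_SUBKEY_SEAL_I2T, SSV4_SUBKEY_SEAL_T2I = 3, 4

   def derive_subkey(ssv, subkey_enum, key_len=16, hash_alg=hashlib.sha1):
       # HMAC keyed with the current SSV; the input text is the XDR
       # encoding of the ssv_subkey4 value (a 4-byte big-endian integer).
       digest = hmac.new(ssv, struct.pack(">I", subkey_enum), hash_alg).digest()
       # Truncate to the encryption algorithm's key length if the HMAC
       # output is longer; the leading bytes form the subkey.
       return digest[:key_len]

Because every SET_SSV changes the SSV, all four subkeys change with it.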
The PerMsgToken description is based on an XDR definition:

   /* Input for computing smt_hmac */
   struct ssv_mic_plain_tkn4 {
     uint32_t        smpt_ssv_seq;
     opaque          smpt_orig_plain<>;
   };

   /* SSV GSS PerMsgToken token */
   struct ssv_mic_tkn4 {
     uint32_t        smt_ssv_seq;
     opaque          smt_hmac<>;
   };

The field smt_hmac is an HMAC calculated by using the subkey derived from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the one-way hash algorithm negotiated by EXCHANGE_ID, and the input text as represented by data of type ssv_mic_plain_tkn4.  The field smpt_ssv_seq is the same as smt_ssv_seq.  The field smpt_orig_plain is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of [7]).  The caller of GSS_GetMIC() provides a pointer to a buffer containing the plain text.  The SSV mechanism's entry point for GSS_GetMIC() encodes this into an opaque array, and the encoding will include an initial four-byte length, plus any necessary padding.  Prepended to this will be the XDR-encoded value of smpt_ssv_seq, thus making up an XDR encoding of a value of data type ssv_mic_plain_tkn4, which in turn is the input into the HMAC.

The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type ssv_mic_tkn4.  The field smt_ssv_seq comes from the SSV sequence number, which is equal to 1 after SET_SSV (Section 18.47) is called the first time on a client ID.  Thereafter, the SSV sequence number is incremented on each SET_SSV.  Thus smt_ssv_seq represents the version of the SSV at the time GSS_GetMIC() was called.  As noted in Section 18.35, the client and server can maintain multiple concurrent versions of the SSV.  This allows the SSV to be changed without serializing all RPC calls that use the SSV mechanism with SET_SSV operations.  Once the HMAC is calculated, it is XDR encoded into smt_hmac, which will include an initial four-byte length, and any necessary padding.  Prepended to this will be the XDR-encoded value of smt_ssv_seq.

The SealedMessage description is based on an XDR definition:

   /* Input for computing ssct_encr_data and ssct_hmac */
   struct ssv_seal_plain_tkn4 {
     opaque          sspt_confounder<>;
     uint32_t        sspt_ssv_seq;
     opaque          sspt_orig_plain<>;
     opaque          sspt_pad<>;
   };

   /* SSV GSS SealedMessage token */
   struct ssv_seal_cipher_tkn4 {
     uint32_t      ssct_ssv_seq;
     opaque        ssct_iv<>;
     opaque        ssct_encr_data<>;
     opaque        ssct_hmac<>;
   };

The token emitted by GSS_Wrap() is XDR encoded and of XDR data type ssv_seal_cipher_tkn4.

The ssct_ssv_seq field has the same meaning as smt_ssv_seq.

The ssct_encr_data field is the result of encrypting a value of the XDR-encoded data type ssv_seal_plain_tkn4.  The encryption key is the subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and the encryption algorithm is that negotiated by EXCHANGE_ID.

The ssct_iv field is the initialization vector (IV) for the encryption algorithm (if applicable) and is sent in clear text.  The content and size of the IV MUST comply with the specification of the encryption algorithm.  For example, the id-aes256-CBC algorithm MUST use a 16-byte initialization vector (IV), which MUST be unpredictable for each instance of a value of type ssv_seal_plain_tkn4 that is encrypted with a particular SSV key.

The ssct_hmac field is the result of computing an HMAC using the value of the XDR-encoded data type ssv_seal_plain_tkn4 as the input text.  The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that negotiated by EXCHANGE_ID.

The sspt_confounder field is a random value.

The sspt_ssv_seq field is the same as ssct_ssv_seq.

The sspt_orig_plain field is the original plaintext and is the "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of [7]).  As with the handling of the plaintext by the SSV mechanism's GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a pointer to the plaintext, and will XDR encode an opaque array into sspt_orig_plain representing the plain text, along with the other fields of an instance of data type ssv_seal_plain_tkn4.

The sspt_pad field is present to support encryption algorithms that require inputs to be in fixed-sized blocks.  The content of sspt_pad is zero filled except for the length.  Beware that the XDR encoding of ssv_seal_plain_tkn4 contains three variable-length arrays, and so each array consumes four bytes for an array length, and each array that follows the length is always padded to a multiple of four bytes per the XDR standard.
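As a non-normative illustration, the following sketch shows one way a sender might compute the sspt_pad length, assuming (as is true of common block ciphers such as AES) that the cipher block size is a multiple of four bytes; the helper names are hypothetical and not part of this specification.  The worked example in the next paragraph uses the same arithmetic.

   def xdr_opaque_len(n):
       # XDR variable-length opaque data: a 4-byte length, followed by the
       # data zero-padded up to a multiple of 4 bytes.
       return 4 + ((n + 3) // 4) * 4

   def sspt_pad_len(confounder_len, plain_len, block_size):
       # Size of the ssv_seal_plain_tkn4 encoding with a zero-length
       # sspt_pad: sspt_confounder + sspt_ssv_seq (4 bytes) +
       # sspt_orig_plain + an empty sspt_pad (4 bytes of length).
       unpadded = (xdr_opaque_len(confounder_len) + 4
                   + xdr_opaque_len(plain_len) + 4)
       # Bytes of sspt_pad needed to reach the next cipher block boundary.
       # Because block_size is a multiple of 4, so is the result, and the
       # encoding of sspt_pad grows by exactly this many bytes.
       return (block_size - unpadded % block_size) % block_size

   # Mirrors the worked example that follows: 3-byte confounder, 15-byte
   # plaintext, 16-byte blocks => sspt_pad length 12, total encoding 48.
   assert sspt_pad_len(3, 15, 16) == 12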
3566 For example suppose the encryption algorithm uses 16 byte blocks, and 3567 the sspt_confounder is three bytes long, and the sspt_orig_plain 3568 field is 15 bytes long. The XDR encoding of sspt_confounder uses 3569 eight bytes (4 + 3 + 1 byte pad), the XDR encoding of sspt_ssv_seq 3570 uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes (4 3571 + 15 + 1 byte pad), and the smallest XDR encoding of the sspt_pad 3572 field is four bytes. This totals 36 bytes. The next multiple of 16 3573 is 48, thus the length field of sspt_pad needs to be set to 12 bytes, 3574 or a total encoding of 16 bytes. The total number of XDR encoded 3575 bytes is thus 8 + 4 + 20 + 16 = 48. 3577 GSS_Wrap() emits a token that is an XDR encoding of a value of data 3578 type ssv_seal_cipher_tkn4. Note that regardless whether the caller 3579 of GSS_Wrap() requests confidentiality or not, the token always has 3580 confidentiality. This is because the SSV mechanism is for 3581 RPCSEC_GSS, and RPCSEC_GSS never produces GSS_wrap() tokens without 3582 confidentiality. 3584 There is one SSV per client ID. Effectively there is a single GSS 3585 context for a client ID / SSV pair. All SSV mechanism RPCSEC_GSS 3586 handles of a client ID / SSV pair share the same GSS context. SSV 3587 GSS contexts do not expire except when the SSV is destroyed (causes 3588 would include the client ID being destroyed or a server restart). 3589 Since one purpose of context expiration is to replace keys that have 3590 been in use for "too long" hence vulnerable to compromise by brute 3591 force or accident, the client can replace the SSV key by sending 3592 periodic SET_SSV operations, by cycling through different users' 3593 RPCSEC_GSS credentials. This way the SSV is replaced without 3594 destroying the SSV's GSS contexts. 3596 SSV RPCSEC_GSS handles can be expired or deleted by the server at any 3597 time and the EXCHANGE_ID operation can be used to create more SSV 3598 RPCSEC_GSS handles. Expiration of SSV RPCSEC_GSS handles does not 3599 imply that the SSV or its GSS context have expired. 3601 The client MUST establish an SSV via SET_SSV before the SSV GSS 3602 context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC(). 3603 If SET_SSV has not been successfully called, attempts to emit tokens 3604 MUST fail. 3606 The SSV mechanism does not support replay detection and sequencing in 3607 its tokens because RPCSEC_GSS does not use those features (See 3608 Section 5.2.2 "Context Creation Requests" in [4]). 3610 2.10.10. Session Mechanics - Steady State 3612 2.10.10.1. Obligations of the Server 3614 The server has the primary obligation to monitor the state of 3615 backchannel resources that the client has created for the server 3616 (RPCSEC_GSS contexts and backchannel connections). If these 3617 resources vanish, the server takes action as specified in 3618 Section 2.10.12.2. 3620 2.10.10.2. Obligations of the Client 3622 The client SHOULD honor the following obligations in order to utilize 3623 the session: 3625 o Keep a necessary session from going idle on the server. A client 3626 that requires a session, but nonetheless is not sending operations 3627 risks having the session be destroyed by the server. This is 3628 because sessions consume resources, and resource limitations may 3629 force the server to cull an inactive session. A server MAY 3630 consider a session to be inactive if the client has not used the 3631 session before the session inactivity timer (Section 2.10.11) has 3632 expired. 
o  Destroy the session when not needed.  If a client has multiple sessions, one of which has no requests waiting for replies, and has been idle for some period of time, it SHOULD destroy the session.

o  Maintain GSS contexts for the backchannel.  If the client requires the server to use the RPCSEC_GSS security flavor for callbacks, then it needs to be sure the contexts handed to the server via BACKCHANNEL_CTL are unexpired.

o  Preserve a connection for a backchannel.  The server requires a backchannel in order to gracefully recall recallable state or notify the client of certain events.  Note that if the connection is not being used for the fore channel, there is no way for the client to tell if the connection is still alive (e.g., the server restarted without sending a disconnect).  The onus is on the server, not the client, to determine if the backchannel's connection is alive, and to indicate in the response to a SEQUENCE operation when the last connection associated with a session's backchannel has disconnected.

2.10.10.3. Steps the Client Takes To Establish a Session

If the client does not have a client ID, the client sends EXCHANGE_ID to establish a client ID.  If it opts for SP4_MACH_CRED or SP4_SSV protection, it SHOULD at minimum specify the following operations in the spo_must_enforce list: CREATE_SESSION, DESTROY_SESSION, BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID.  If it opts for SP4_SSV protection, the client needs to ask for SSV-based RPCSEC_GSS handles.

The client uses the client ID to send a CREATE_SESSION on a connection to the server.  The results of CREATE_SESSION indicate whether or not the server will persist the session reply cache through a server restart, and the client notes this for future reference.

If the client specified SP4_SSV state protection when the client ID was created, then it SHOULD send SET_SSV in the first COMPOUND after the session is created.  Each time a new principal goes to use the client ID, it SHOULD send a SET_SSV again.

If the client wants to use delegations, layouts, directory notifications, or any other state that requires a backchannel, then it needs to add a connection to the backchannel if CREATE_SESSION did not already do so.  The client creates a connection, and calls BIND_CONN_TO_SESSION to associate the connection with the session and the session's backchannel.  If CREATE_SESSION did not already do so, the client MUST tell the server what security is required in order for the client to accept callbacks.  The client does this via BACKCHANNEL_CTL.  If the client selected SP4_MACH_CRED or SP4_SSV protection when it called EXCHANGE_ID, then the client SHOULD specify that the backchannel use RPCSEC_GSS contexts for security.

If the client wants to use additional connections for the backchannel, then it needs to call BIND_CONN_TO_SESSION on each connection it wants to use with the session.  If the client wants to use additional connections for the fore channel, then it needs to call BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED state protection when the client ID was created.

At this point the session has reached steady state.

2.10.11. Session Inactivity Timer

The server MAY maintain a session inactivity timer for each session.
If the session inactivity timer expires, then the server MAY destroy the session.  To avoid losing a session due to inactivity, the client MUST renew the session inactivity timer.  The length of the session inactivity timer MUST NOT be less than the lease_time attribute (Section 5.8.1.11).  As with lease renewal (Section 8.3), when the server receives a SEQUENCE operation, it resets the session inactivity timer, and MUST NOT allow the timer to expire while the rest of the operations in the COMPOUND procedure's request are still executing.  Once the last operation has finished, the server MUST set the session inactivity timer to expire no sooner than the sum of the current time and the value of the lease_time attribute.

2.10.12. Session Mechanics - Recovery

2.10.12.1. Events Requiring Client Action

The following events require client action to recover.

2.10.12.1.1. RPCSEC_GSS Context Loss by Callback Path

If all RPCSEC_GSS contexts granted by the client to the server for callback use have expired, the client MUST establish a new context via BACKCHANNEL_CTL.  The sr_status_flags field of the SEQUENCE results indicates when callback contexts are nearly expired or fully expired (see Section 18.46.3).

2.10.12.1.2. Connection Loss

If the client loses the last connection of the session and wants to retain the session, then it needs to create a new connection, and if, when the client ID was created, BIND_CONN_TO_SESSION was specified in the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION to associate the connection with the session.

If there was a request outstanding at the time of the connection loss, then if the client wants to continue to use the session it MUST retry the request, as described in Section 2.10.6.2.  Note that it is not necessary to retry requests over a connection with the same source network address or the same destination network address as the lost connection.  As long as the session ID, slot ID, and sequence ID in the retry match those of the original request, the server will recognize the request as a retry if it executed the request prior to disconnect.

If the connection that was lost was the last one associated with the backchannel, and the client wants to retain the backchannel and/or not put recallable state at risk of revocation, the client needs to reconnect, and if it does, it MUST associate the connection with the session and its backchannel via BIND_CONN_TO_SESSION.  The server SHOULD indicate when it has no callback connection via the sr_status_flags result from SEQUENCE.

2.10.12.1.3. Backchannel GSS Context Loss

Via the sr_status_flags result of the SEQUENCE operation or other means, the client will learn if some or all of the RPCSEC_GSS contexts it assigned to the backchannel have been lost.  If the client wants to retain the backchannel and/or not put recallable state at risk of revocation, the client needs to use BACKCHANNEL_CTL to assign new contexts.

2.10.12.1.4. Loss of Session

The replier might lose a record of the session.  Causes include:

o  Replier failure and restart.

o  A catastrophe that causes the reply cache to be corrupted or lost on the media on which it was stored.  This applies even if the replier indicated in the CREATE_SESSION results that it would persist the cache.
3772 o The server purges the session of a client that has been inactive 3773 for a very extended period of time. 3775 o As a result of configuration changes among a set of clustered 3776 servers, a network address previously connected to one server 3777 becomes connected to a different server which has no knowledge of 3778 the session in question. Such a configuration change will 3779 generally only happen when the original server ceases to function 3780 for a time. 3782 Loss of reply cache is equivalent to loss of session. The replier 3783 indicates loss of session to the requester by returning 3784 NFS4ERR_BADSESSION on the next operation that uses the session ID 3785 that refers to the lost session. 3787 After an event like a server restart, the client may have lost its 3788 connections. The client assumes for the moment that the session has 3789 not been lost. It reconnects, and if it specified connection 3790 association enforcement when the session was created, it invokes 3791 BIND_CONN_TO_SESSION using the session ID. Otherwise, it invokes 3792 SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns 3793 NFS4ERR_BADSESSION, the client knows the session is not available to 3794 it when communicating with that network address. If the connection 3795 survives session loss, then the next SEQUENCE operation the client 3796 sends over the connection will get back NFS4ERR_BADSESSION. The 3797 client again knows the session was lost. 3799 Here is one suggested algorithm for the client when it gets 3800 NFS4ERR_BADSESSION. It is not obligatory in that, if a client does 3801 not want to take advantage of such features as trunking, it may omit 3802 parts of it. However, it is a useful example which draws attention 3803 to various possible recovery issues: 3805 1. If the client has other connections to other server network 3806 addresses associated with the same session, attempt a COMPOUND 3807 with a single operation, SEQUENCE, on each of the other 3808 connections. 3810 2. If the attempts succeed, the session is still alive, and this is 3811 a strong indicator the server's network address has moved. The 3812 client might send an EXCHANGE_ID on the connection that returned 3813 NFS4ERR_BADSESSION to see if there are opportunities for client 3814 ID trunking (i.e. the same client ID and so_major are returned). 3815 The client might use DNS to see if the moved network address was 3816 replaced with another, so that the performance and availability 3817 benefits of session trunking can continue. 3819 3. If the SEQUENCE requests fail with NFS4ERR_BADSESSION then the 3820 session no longer exists on any of the server network addresses 3821 the client has connections associated with that session ID. It 3822 is possible the session is still alive and available on other 3823 network addresses. The client sends an EXCHANGE_ID on all the 3824 connections to see if the server owner is still listening on 3825 those network addresses. If the same server owner is returned, 3826 but a new client ID is returned, this is a strong indicator of a 3827 server restart. If both the same server owner and same client ID 3828 are returned, then this is a strong indication that the server 3829 did delete the session, and the client will need to send a 3830 CREATE_SESSION if it has no other sessions for that client ID. 3831 If a different server owner is returned, the client can use DNS 3832 to find other network addresses. 
If the client does not use DNS, or if DNS does not find any other addresses for the server, then the client will be unable to provide NFSv4.1 service, and fatal errors should be returned to processes that were using the server.  If the client is using a "mount" paradigm, unmounting the server is advised.

4.  If the client knows of no other connections associated with the session ID, and of no other server network addresses that are or have been associated with the session ID, then the client can use DNS to find other network addresses.  If it does not, or if DNS does not find any other addresses for the server, then the client will be unable to provide NFSv4.1 service, and fatal errors should be returned to processes that were using the server.  If the client is using a "mount" paradigm, unmounting the server is advised.

If there is a reconfiguration event that results in the same network address being assigned to servers where the eir_server_scope value is different, it cannot be guaranteed that a session ID generated by the first server will be recognized as invalid by the second.  Therefore, in managing server reconfigurations among servers with different server scope values, it is necessary to make sure that all clients have disconnected from the first server before effecting the reconfiguration.  Nonetheless, clients should not assume that servers will always adhere to this requirement; clients MUST be prepared to deal with unexpected effects of server reconfigurations.  Even where a session ID is inappropriately recognized as valid, it is likely that either the connection will not be recognized as valid, or that a sequence value for a slot will not be correct.  Therefore, when a client receives results indicating such unexpected errors, the use of EXCHANGE_ID to determine the current server configuration and present the client to the server is RECOMMENDED.

A variation on the above is that after a server's network address moves, there is no NFSv4.1 server listening: e.g., there is no listener on port 2049, the NFSv4 server returns NFS4ERR_MINOR_VERS_MISMATCH, the NFS server returns a PROG_MISMATCH error, the RPC listener on port 2049 returns PROG_MISMATCH, or attempts to reconnect to the network address time out.  These situations SHOULD be treated as equivalent to SEQUENCE returning NFS4ERR_BADSESSION for these purposes.

When the client detects session loss, it needs to call CREATE_SESSION to recover.  Any non-idempotent operations that were in progress might have been performed on the server at the time of session loss.  The client has no general way to recover from this.

Note that loss of session does not imply loss of lock, open, delegation, or layout state, because locks, opens, delegations, and layouts are tied to the client ID and depend on the client ID, not the session.  Nor does loss of lock, open, delegation, or layout state imply loss of session state, because the session depends on the client ID; loss of the client ID, however, does imply loss of session, lock, open, delegation, and layout state.  See Section 8.4.2.  A session can survive a server restart, but lock recovery may still be needed.

It is possible that CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID (e.g., the server restarts and does not preserve client ID state).  If so, the client needs to call EXCHANGE_ID, followed by CREATE_SESSION.
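To tie the recovery steps above together, here is a non-normative sketch of the suggested NFS4ERR_BADSESSION handling.  The client object and every method it exposes are hypothetical placeholders for an implementation's own connection and RPC machinery; none of them are protocol elements.

   def handle_bad_session(client, sess):
       # Step 1: send a lone SEQUENCE over each other connection that is
       # associated with the session.
       others = [c for c in client.connections_for(sess)
                 if c is not sess.failed_conn]
       if any(client.sequence_succeeds(c, sess) for c in others):
           # Step 2: the session is alive elsewhere, so the failed network
           # address has probably moved.  EXCHANGE_ID on the failed
           # connection may reveal client ID trunking opportunities, and
           # DNS may yield a replacement address for session trunking.
           client.probe_trunking(sess.failed_conn)
           return

       # Step 3: the session is gone from every known address.  Use
       # EXCHANGE_ID to see whether the same server owner still answers.
       for conn in others or [sess.failed_conn]:
           owner, client_id = client.exchange_id(conn)
           if owner != sess.server_owner:
               break                    # different server; fall back to DNS
           if client_id != sess.client_id:
               # Same owner, new client ID: strong indication of a restart.
               client.recover_from_server_restart(conn)
           else:
               # Server deleted the session; recreate it if still needed.
               client.create_session(conn, client_id)
           return

       # Step 4: no usable address remains.  Try DNS; otherwise return
       # fatal errors to the processes using the server (and, with a
       # "mount" paradigm, unmount it).
       client.find_addresses_via_dns_or_fail(sess)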
2.10.12.2. Events Requiring Server Action

The following events require server action to recover.

2.10.12.2.1. Client Crash and Restart

As described in Section 18.35, a restarted client sends EXCHANGE_ID in such a way that it causes the server to delete any sessions it had.

2.10.12.2.2. Client Crash with No Restart

If a client crashes and never comes back, it will never send EXCHANGE_ID with its old client owner.  Thus the server has session state that will never be used again.  After an extended period of time, and if the server has resource constraints, it MAY destroy the old session as well as locking state.

2.10.12.2.3. Extended Network Partition

To the server, the extended network partition may be no different from a client crash with no restart (see Section 2.10.12.2.2).  Unless the server can discern that there is a network partition, it is free to treat the situation as if the client has crashed permanently.

2.10.12.2.4. Backchannel Connection Loss

If there were callback requests outstanding at the time of a connection loss, then the server MUST retry the requests, as described in Section 2.10.6.2.  Note that it is not necessary to retry requests over a connection with the same source network address or the same destination network address as the lost connection.  As long as the session ID, slot ID, and sequence ID in the retry match those of the original request, the callback target will recognize the request as a retry even if it did see the request prior to disconnect.

If the lost connection is the last one associated with the backchannel, then the server MUST indicate that in the sr_status_flags field of every SEQUENCE reply until the backchannel is reestablished.  There are two situations, each of which uses a different status flag: no connectivity for the session's backchannel, and no connectivity for any session backchannel of the client.  See Section 18.46 for a description of the appropriate flags in sr_status_flags.

2.10.12.2.5. GSS Context Loss

The server SHOULD monitor for the situation in which the number of RPCSEC_GSS contexts assigned to the backchannel has reached one and that one context is near expiry (i.e., between one and two periods of lease time), and indicate this in the sr_status_flags field of all SEQUENCE replies.  The server MUST indicate when all of the backchannel's assigned RPCSEC_GSS contexts have expired in the sr_status_flags field of all SEQUENCE replies.

2.10.13. Parallel NFS and Sessions

A client and server can potentially be a non-pNFS implementation, a metadata server implementation, a data server implementation, or a combination of two or three of these types of implementations.  The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not mutually exclusive) are passed in the EXCHANGE_ID arguments and results to allow the client to indicate how it wants to use sessions created under the client ID, and to allow the server to indicate how it will allow the sessions to be used.  See Section 13.1 for pNFS session considerations.

3. Protocol Constants and Data Types

The syntax and semantics used to describe the data types of the NFSv4.1 protocol are defined in the XDR (RFC4506 [2]) and RPC (RFC1831 [3]) documents.
The next sections build upon the XDR data types to define 3963 constants, types and structures specific to this protocol. The full 3964 list of XDR data types is in [12]. 3966 3.1. Basic Constants 3968 const NFS4_FHSIZE = 128; 3969 const NFS4_VERIFIER_SIZE = 8; 3970 const NFS4_OPAQUE_LIMIT = 1024; 3971 const NFS4_SESSIONID_SIZE = 16; 3973 const NFS4_INT64_MAX = 0x7fffffffffffffff; 3974 const NFS4_UINT64_MAX = 0xffffffffffffffff; 3975 const NFS4_INT32_MAX = 0x7fffffff; 3976 const NFS4_UINT32_MAX = 0xffffffff; 3978 const NFS4_MAXFILELEN = 0xffffffffffffffff; 3979 const NFS4_MAXFILEOFF = 0xfffffffffffffffe; 3981 Except where noted, all these constants are defined in bytes. 3983 o NFS4_FHSIZE is the maximum size of a filehandle. 3985 o NFS4_VERIFIER_SIZE is the fixed size of a verifier. 3987 o NFS4_OPAQUE_LIMIT is the maximum size of certain opaque 3988 information. 3990 o NFS4_SESSIONID_SIZE is the fixed size of a session identifier. 3992 o NFS4_INT64_MAX is the maximum value of a signed 64 bit integer. 3994 o NFS4_UINT64_MAX is the maximum value of an unsigned 64 bit 3995 integer. 3997 o NFS4_INT32_MAX is the maximum value of a signed 32 bit integer. 3999 o NFS4_UINT32_MAX is the maximum value of an unsigned 32 bit 4000 integer. 4002 o NFS4_MAXFILELEN is the maximum length of a regular file. 4004 o NFS4_MAXFILEOFF is the maximum offset into a regular file. 4006 3.2. Basic Data Types 4008 These are the base NFSv4.1 data types. 4010 +---------------+---------------------------------------------------+ 4011 | Data Type | Definition | 4012 +---------------+---------------------------------------------------+ 4013 | int32_t | typedef int int32_t; | 4014 | uint32_t | typedef unsigned int uint32_t; | 4015 | int64_t | typedef hyper int64_t; | 4016 | uint64_t | typedef unsigned hyper uint64_t; | 4017 | attrlist4 | typedef opaque attrlist4<>; | 4018 | | Used for file/directory attributes. | 4019 | bitmap4 | typedef uint32_t bitmap4<>; | 4020 | | Used in attribute array encoding. | 4021 | changeid4 | typedef uint64_t changeid4; | 4022 | | Used in the definition of change_info4. | 4023 | clientid4 | typedef uint64_t clientid4; | 4024 | | Shorthand reference to client identification. | 4025 | count4 | typedef uint32_t count4; | 4026 | | Various count parameters (READ, WRITE, COMMIT). | 4027 | length4 | typedef uint64_t length4; | 4028 | | The length of a byte range within a file. | 4029 | mode4 | typedef uint32_t mode4; | 4030 | | Mode attribute data type. | 4031 | nfs_cookie4 | typedef uint64_t nfs_cookie4; | 4032 | | Opaque cookie value for READDIR. | 4033 | nfs_fh4 | typedef opaque nfs_fh4; | 4034 | | Filehandle definition. | 4035 | nfs_ftype4 | enum nfs_ftype4; | 4036 | | Various defined file types. | 4037 | nfsstat4 | enum nfsstat4; | 4038 | | Return value for operations. | 4039 | offset4 | typedef uint64_t offset4; | 4040 | | Various offset designations (READ, WRITE, LOCK, | 4041 | | COMMIT). | 4042 | qop4 | typedef uint32_t qop4; | 4043 | | Quality of protection designation in SECINFO. | 4044 | sec_oid4 | typedef opaque sec_oid4<>; | 4045 | | Security Object Identifier. The sec_oid4 data | 4046 | | type is not really opaque. Instead it contains an | 4047 | | ASN.1 OBJECT IDENTIFIER as used by GSS-API in the | 4048 | | mech_type argument to GSS_Init_sec_context. See | 4049 | | [7] for details. | 4050 | sequenceid4 | typedef uint32_t sequenceid4; | 4051 | | Sequence number used for various session | 4052 | | operations (EXCHANGE_ID, CREATE_SESSION, | 4053 | | SEQUENCE, CB_SEQUENCE). 
| 4054 | seqid4 | typedef uint32_t seqid4; | 4055 | | Sequence identifier used for file locking. | 4056 | sessionid4 | typedef opaque sessionid4[NFS4_SESSIONID_SIZE]; | 4057 | | Session identifier. | 4058 | slotid4 | typedef uint32_t slotid4; | 4059 | | Sequencing artifact for various session | 4060 | | operations (SEQUENCE, CB_SEQUENCE). | 4061 | utf8string | typedef opaque utf8string<>; | 4062 | | UTF-8 encoding for strings. | 4063 | utf8str_cis | typedef utf8string utf8str_cis; | 4064 | | Case-insensitive UTF-8 string. | 4065 | utf8str_cs | typedef utf8string utf8str_cs; | 4066 | | Case-sensitive UTF-8 string. | 4067 | utf8str_mixed | typedef utf8string utf8str_mixed; | 4068 | | UTF-8 strings with a case sensitive prefix and a | 4069 | | case insensitive suffix. | 4070 | component4 | typedef utf8str_cs component4; | 4071 | | Represents path name components. | 4072 | linktext4 | typedef utf8str_cs linktext4; | 4073 | | Symbolic link contents ("symbolic link" is | 4074 | | defined in an Open Group [13] standard). | 4075 | pathname4 | typedef component4 pathname4<>; | 4076 | | Represents path name for fs_locations. | 4077 | verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; | 4078 | | Verifier used for various operations (COMMIT, | 4079 | | CREATE, EXCHANGE_ID, OPEN, READDIR, WRITE) | 4080 | | NFS4_VERIFIER_SIZE is defined as 8. | 4081 +---------------+---------------------------------------------------+ 4083 End of Base Data Types 4085 Table 1 4087 3.3. Structured Data Types 4089 3.3.1. nfstime4 4091 struct nfstime4 { 4092 int64_t seconds; 4093 uint32_t nseconds; 4094 }; 4096 The nfstime4 data type gives the number of seconds and nanoseconds 4097 since midnight or 0 hour January 1, 1970 Coordinated Universal Time 4098 (UTC). Values greater than zero for the seconds field denote dates 4099 after the 0 hour January 1, 1970. Values less than zero for the 4100 seconds field denote dates before the 0 hour January 1, 1970. In 4101 both cases, the nseconds field is to be added to the seconds field 4102 for the final time representation. For example, if the time to be 4103 represented is one-half second before 0 hour January 1, 1970, the 4104 seconds field would have a value of negative one (-1) and the 4105 nseconds fields would have a value of one-half second (500000000). 4106 Values greater than 999,999,999 for nseconds are invalid. 4108 This data type is used to pass time and date information. A server 4109 converts to and from its local representation of time when processing 4110 time values, preserving as much accuracy as possible. If the 4111 precision of timestamps stored for a file system object is less than 4112 defined, loss of precision can occur. An adjunct time maintenance 4113 protocol is RECOMMENDED to reduce client and server time skew. 4115 3.3.2. time_how4 4117 enum time_how4 { 4118 SET_TO_SERVER_TIME4 = 0, 4119 SET_TO_CLIENT_TIME4 = 1 4120 }; 4122 3.3.3. settime4 4124 union settime4 switch (time_how4 set_it) { 4125 case SET_TO_CLIENT_TIME4: 4126 nfstime4 time; 4127 default: 4128 void; 4129 }; 4131 The time_how4 and settime4 data types are used for setting timestamps 4132 in file object attributes. If set_it is SET_TO_SERVER_TIME4, then 4133 the server uses its local representation of time for the time value. 4135 3.3.4. specdata4 4137 struct specdata4 { 4138 uint32_t specdata1; /* major device number */ 4139 uint32_t specdata2; /* minor device number */ 4140 }; 4142 This data type represents the device numbers for the device file 4143 types NF4CHR and NF4BLK. 
4145 3.3.5. fsid4 4147 struct fsid4 { 4148 uint64_t major; 4149 uint64_t minor; 4150 }; 4152 3.3.6. chg_policy4 4154 struct change_policy4 { 4155 uint64_t cp_major; 4156 uint64_t cp_minor; 4157 }; 4159 The chg_policy4 data type is used for the change_policy RECOMMENDED 4160 attribute. It provides change sequencing indication analogous to the 4161 change attribute. To enable the server to present a value valid 4162 across server re-initialization without requiring persistent storage, 4163 two 64-bit quantities are used, allowing one to be a server instance 4164 ID and the second to be incremented non-persistently, within a given 4165 server instance. 4167 3.3.7. fattr4 4169 struct fattr4 { 4170 bitmap4 attrmask; 4171 attrlist4 attr_vals; 4172 }; 4174 The fattr4 data type is used to represent file and directory 4175 attributes. 4177 The bitmap is a counted array of 32 bit integers used to contain bit 4178 values. The position of the integer in the array that contains bit n 4179 can be computed from the expression (n / 32) and its bit within that 4180 integer is (n mod 32). 4182 0 1 4183 +-----------+-----------+-----------+-- 4184 | count | 31 .. 0 | 63 .. 32 | 4185 +-----------+-----------+-----------+-- 4187 3.3.8. change_info4 4189 struct change_info4 { 4190 bool atomic; 4191 changeid4 before; 4192 changeid4 after; 4193 }; 4195 This data type is used with the CREATE, LINK, OPEN, REMOVE, and 4196 RENAME operations to let the client know the value of the change 4197 attribute for the directory in which the target file system object 4198 resides. 4200 3.3.9. netaddr4 4202 struct netaddr4 { 4203 /* see struct rpcb in RFC 1833 */ 4204 string na_r_netid<>; /* network id */ 4205 string na_r_addr<>; /* universal address */ 4206 }; 4208 The netaddr4 data type is used to identify network transport 4209 endpoints. The r_netid and r_addr fields respectively contain a 4210 netid and uaddr. The netid and uaddr concepts are defined in [14]. 4211 The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are 4212 defined in [14], specifically Tables 2 and 3 and Sections 4.2.3.3 and 4213 4.2.3.4. 4215 3.3.10. state_owner4 4217 struct state_owner4 { 4218 clientid4 clientid; 4219 opaque owner; 4220 }; 4222 typedef state_owner4 open_owner4; 4223 typedef state_owner4 lock_owner4; 4225 The state_owner4 data type is the base type for the open_owner4 4226 Section 3.3.10.1 and lock_owner4 Section 3.3.10.2. 4228 3.3.10.1. open_owner4 4230 This data type is used to identify the owner of open state. 4232 3.3.10.2. lock_owner4 4234 This structure is used to identify the owner of byte-range locking 4235 state. 4237 3.3.11. open_to_lock_owner4 4239 struct open_to_lock_owner4 { 4240 seqid4 open_seqid; 4241 stateid4 open_stateid; 4242 seqid4 lock_seqid; 4243 lock_owner4 lock_owner; 4244 }; 4246 This data type is used for the first LOCK operation done for an 4247 open_owner4. It provides both the open_stateid and lock_owner such 4248 that the transition is made from a valid open_stateid sequence to 4249 that of the new lock_stateid sequence. Using this mechanism avoids 4250 the confirmation of the lock_owner/lock_seqid pair since it is tied 4251 to established state in the form of the open_stateid/open_seqid. 4253 3.3.12. stateid4 4255 struct stateid4 { 4256 uint32_t seqid; 4257 opaque other[12]; 4258 }; 4260 This data type is used for the various state sharing mechanisms 4261 between the client and server. The client never modifies a value of 4262 data type stateid. 
The starting value of the seqid field is undefined.  The server is required to increment the seqid field by one (1) at each transition of the stateid.  This is important since the client will inspect the seqid in OPEN stateids to determine the order of OPEN processing done by the server.

3.3.13. layouttype4

   enum layouttype4 {
           LAYOUT4_NFSV4_1_FILES   = 0x1,
           LAYOUT4_OSD2_OBJECTS    = 0x2,
           LAYOUT4_BLOCK_VOLUME    = 0x3
   };

This data type indicates what type of layout is being used.  The file server advertises the layout types it supports through the fs_layout_type file system attribute (Section 5.12.1).  A client asks for layouts of a particular type in LAYOUTGET, and processes those layouts in its layout-type-specific logic.

The layouttype4 data type is 32 bits in length.  The range represented by the layout type is split into three parts.  Type 0x0 is reserved.  Types within the range 0x00000001-0x7FFFFFFF are globally unique and are assigned according to the description in Section 22.4; they are maintained by IANA.  Types within the range 0x80000000-0xFFFFFFFF are site specific and for private use only.

The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file layout type, as defined in Section 13, is to be used.  The LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as defined in [39], is to be used.  Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the block/volume layout, as defined in [40], is to be used.

3.3.14. deviceid4

   const NFS4_DEVICEID4_SIZE = 16;

   typedef opaque  deviceid4[NFS4_DEVICEID4_SIZE];

Layout information includes device IDs that specify a storage device through a compact handle.  Addressing and type information is obtained with the GETDEVICEINFO operation.  Device IDs are not guaranteed to be valid across metadata server restarts.  A device ID is unique per client ID and layout type.  See Section 12.2.10 for more details.

3.3.15. device_addr4

   struct device_addr4 {
           layouttype4             da_layout_type;
           opaque                  da_addr_body<>;
   };

The device address is used to set up a communication channel with the storage device.  Different layout types will require different data types to define how they communicate with storage devices.  The opaque da_addr_body field is interpreted based on the specified da_layout_type field.

This document defines the device address for the NFSv4.1 file layout (see Section 13.3), which identifies a storage device by network IP address and port number.  This is sufficient for the clients to communicate with the NFSv4.1 storage devices, and may be sufficient for other layout types as well.  Device types for object storage devices and block storage devices (e.g., SCSI volume labels) are defined by their respective layout specifications.

3.3.16. layout_content4

   struct layout_content4 {
           layouttype4             loc_type;
           opaque                  loc_body<>;
   };

The loc_body field is interpreted based on the layout type (loc_type).  This document defines the loc_body for the NFSv4.1 file layout type; see Section 13.3 for its definition.

3.3.17. layout4

   struct layout4 {
           offset4                 lo_offset;
           length4                 lo_length;
           layoutiomode4           lo_iomode;
           layout_content4         lo_content;
   };

The layout4 data type defines a layout for a file.
The layout type 4351 specific data is opaque within lo_content. Since layouts are sub- 4352 dividable, the offset and length together with the file's filehandle, 4353 the client ID, iomode, and layout type, identify the layout. 4355 3.3.18. layoutupdate4 4357 struct layoutupdate4 { 4358 layouttype4 lou_type; 4359 opaque lou_body<>; 4360 }; 4362 The layoutupdate4 data type is used by the client to return updated 4363 layout information to the metadata server via the LAYOUTCOMMIT 4364 (Section 18.42) operation. This data type provides a channel to pass 4365 layout type specific information (in field lou_body) back to the 4366 metadata server. E.g., for the block/volume layout type this could 4367 include the list of reserved blocks that were written. The contents 4368 of the opaque lou_body argument are determined by the layout type. 4369 The NFSv4.1 file-based layout does not use this data type; if 4370 lou_type is LAYOUT4_NFSV4_1_FILES, the lou_body field MUST have a 4371 zero length. 4373 3.3.19. layouthint4 4375 struct layouthint4 { 4376 layouttype4 loh_type; 4377 opaque loh_body<>; 4378 }; 4380 The layouthint4 data type is used by the client to pass in a hint 4381 about the type of layout it would like created for a particular file. 4382 It is the data type specified by the layout_hint attribute described 4383 in Section 5.12.4. The metadata server may ignore the hint, or may 4384 selectively ignore fields within the hint. This hint should be 4385 provided at create time as part of the initial attributes within 4386 OPEN. The loh_body field is specific to the type of layout 4387 (loh_type). The NFSv4.1 file-based layout uses the 4388 nfsv4_1_file_layouthint4 data type as defined in Section 13.3. 4390 3.3.20. layoutiomode4 4392 enum layoutiomode4 { 4393 LAYOUTIOMODE4_READ = 1, 4394 LAYOUTIOMODE4_RW = 2, 4395 LAYOUTIOMODE4_ANY = 3 4396 }; 4398 The iomode specifies whether the client intends to just read or both 4399 read and write the data represented by the layout. While the 4400 LAYOUTIOMODE4_ANY iomode MUST NOT be used in the arguments to the 4401 LAYOUTGET operation, it MAY be used in the arguments to the 4402 LAYOUTRETURN and CB_LAYOUTRECALL operations. The LAYOUTIOMODE4_ANY 4403 iomode specifies that layouts pertaining to both LAYOUTIOMODE4_READ 4404 and LAYOUTIOMODE4_RW iomodes are being returned or recalled, 4405 respectively. The metadata server's use of the iomode may depend on 4406 the layout type being used. The storage devices MAY validate I/O 4407 accesses against the iomode and reject invalid accesses. 4409 3.3.21. nfs_impl_id4 4411 struct nfs_impl_id4 { 4412 utf8str_cis nii_domain; 4413 utf8str_cs nii_name; 4414 nfstime4 nii_date; 4415 }; 4417 This data type is used to identify client and server implementation 4418 details. The nii_domain field is the DNS domain name that the 4419 implementer is associated with. The nii_name field is the product 4420 name of the implementation and is completely free form. It is 4421 RECOMMENDED that the nii_name be used to distinguish machine 4422 architecture, machine platforms, revisions, versions, and patch 4423 levels. The nii_date field is the timestamp of when the software 4424 instance was published or built. 4426 3.3.22. 
threshold_item4 4428 struct threshold_item4 { 4429 layouttype4 thi_layout_type; 4430 bitmap4 thi_hintset; 4431 opaque thi_hintlist<>; 4432 }; 4434 This data type contains a list of hints specific to a layout type for 4435 helping the client determine when it should send I/O directly through 4436 the metadata server versus the storage devices. The data type 4437 consists of the layout type (thi_layout_type), a bitmap (thi_hintset) 4438 describing the set of hints supported by the server (they may differ 4439 based on the layout type), and a list of hints (thi_hintlist), whose 4440 content is determined by the hintset bitmap. See the mdsthreshold 4441 attribute for more details. 4443 The thi_hintset field is a bitmap of the following values: 4445 +-------------------------+---+---------+---------------------------+ 4446 | name | # | Data | Description | 4447 | | | Type | | 4448 +-------------------------+---+---------+---------------------------+ 4449 | threshold4_read_size | 0 | length4 | The file size below which | 4450 | | | | it is RECOMMENDED to read | 4451 | | | | data through the MDS. | 4452 | threshold4_write_size | 1 | length4 | The file size below which | 4453 | | | | it is RECOMMENDED to | 4454 | | | | write data through the | 4455 | | | | MDS. | 4456 | threshold4_read_iosize | 2 | length4 | For read I/O sizes below | 4457 | | | | this threshold it is | 4458 | | | | RECOMMENDED to read data | 4459 | | | | through the MDS | 4460 | threshold4_write_iosize | 3 | length4 | For write I/O sizes below | 4461 | | | | this threshold it is | 4462 | | | | RECOMMENDED to write data | 4463 | | | | through the MDS | 4464 +-------------------------+---+---------+---------------------------+ 4466 3.3.23. mdsthreshold4 4468 struct mdsthreshold4 { 4469 threshold_item4 mth_hints<>; 4470 }; 4472 This data type holds an array of elements of data type 4473 threshold_item4, each of which is valid for a particular layout type. 4474 An array is necessary because a server can support multiple layout 4475 types for a single file. 4477 4. Filehandles 4479 The filehandle in the NFS protocol is a per server unique identifier 4480 for a file system object. The contents of the filehandle are opaque 4481 to the client. Therefore, the server is responsible for translating 4482 the filehandle to an internal representation of the file system 4483 object. 4485 4.1. Obtaining the First Filehandle 4487 The operations of the NFS protocol are defined in terms of one or 4488 more filehandles. Therefore, the client needs a filehandle to 4489 initiate communication with the server. With the NFSv3 protocol 4490 (RFC1813 [30]), there exists an ancillary protocol to obtain this 4491 first filehandle. The MOUNT protocol, RPC program number 100005, 4492 provides the mechanism of translating a string based file system path 4493 name to a filehandle which can then be used by the NFS protocols. 4495 The MOUNT protocol has deficiencies in the area of security and use 4496 via firewalls. This is one reason that the use of the public 4497 filehandle was introduced in RFC2054 [41] and RFC2055 [42]. With the 4498 use of the public filehandle in combination with the LOOKUP operation 4499 in the NFSv3 protocol, it has been demonstrated that the MOUNT 4500 protocol is unnecessary for viable interaction between NFS client and 4501 server. 4503 Therefore, the NFSv4.1 protocol will not use an ancillary protocol 4504 for translation from string based path names to a filehandle. 
Two 4505 special filehandles will be used as starting points for the NFS 4506 client. 4508 4.1.1. Root Filehandle 4510 The first of the special filehandles is the ROOT filehandle. The 4511 ROOT filehandle is the "conceptual" root of the file system name 4512 space at the NFS server. The client uses or starts with the ROOT 4513 filehandle by employing the PUTROOTFH operation. The PUTROOTFH 4514 operation instructs the server to set the "current" filehandle to the 4515 ROOT of the server's file tree. Once this PUTROOTFH operation is 4516 used, the client can then traverse the entirety of the server's file 4517 tree with the LOOKUP operation. A complete discussion of the server 4518 name space is in Section 7. 4520 4.1.2. Public Filehandle 4522 The second special filehandle is the PUBLIC filehandle. Unlike the 4523 ROOT filehandle, the PUBLIC filehandle may be bound to or represent an 4524 arbitrary file system object at the server. The server is 4525 responsible for this binding. It may be that the PUBLIC filehandle 4526 and the ROOT filehandle refer to the same file system object. 4527 However, it is up to the administrative software at the server and 4528 the policies of the server administrator to define the binding of the 4529 PUBLIC filehandle and server file system object. The client may not 4530 make any assumptions about this binding. The client uses the PUBLIC 4531 filehandle via the PUTPUBFH operation. 4533 4.2. Filehandle Types 4535 In the NFSv3 protocol, there was one type of filehandle with a single 4536 set of semantics. This type of filehandle is termed "persistent" in 4537 NFSv4.1. The semantics of a persistent filehandle remain the same as 4538 before. A new type of filehandle introduced in NFSv4.1 is the 4539 "volatile" filehandle, which attempts to accommodate certain server 4540 environments. 4542 The volatile filehandle type was introduced to address server 4543 functionality or implementation issues which make correct 4544 implementation of a persistent filehandle infeasible. Some server 4545 environments do not provide a file system level invariant that can be 4546 used to construct a persistent filehandle. The underlying server 4547 file system may not provide the invariant or the server's file system 4548 programming interfaces may not provide access to the needed 4549 invariant. Volatile filehandles may ease the implementation of 4550 server functionality such as hierarchical storage management or file 4551 system reorganization or migration. However, the volatile filehandle 4552 increases the implementation burden for the client. 4554 Since the client will need to handle persistent and volatile 4555 filehandles differently, a file attribute is defined which may be 4556 used by the client to determine the filehandle types being returned 4557 by the server. 4559 4.2.1. General Properties of a Filehandle 4561 The filehandle contains all the information the server needs to 4562 distinguish an individual file. To the client, the filehandle is 4563 opaque. The client stores filehandles for use in a later request and 4564 can compare two filehandles from the same server for equality by 4565 doing a byte-by-byte comparison. However, the client MUST NOT 4566 otherwise interpret the contents of filehandles. If two filehandles 4567 from the same server are equal, they MUST refer to the same file. 4568 Servers SHOULD try to maintain a one-to-one correspondence between 4569 filehandles and files but this is not required.
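As a non-normative illustration of the comparison rule above, a client-side equality check reduces to a byte-by-byte comparison of the opaque handle contents; the following C sketch uses a simplified local stand-in for the XDR nfs_fh4 type.

   #include <stdbool.h>
   #include <stddef.h>
   #include <string.h>

   /* Simplified stand-in for the XDR nfs_fh4 type: an opaque byte
    * string of at most NFS4_FHSIZE (128) bytes. */
   struct nfs_fh {
       size_t        len;
       unsigned char data[128];
   };

   /* If two filehandles from the same server compare equal byte by
    * byte, they refer to the same object; if they differ, the client
    * can conclude nothing about the objects they designate. */
   static bool fh_equal(const struct nfs_fh *a, const struct nfs_fh *b)
   {
       return a->len == b->len &&
              memcmp(a->data, b->data, a->len) == 0;
   }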
Clients MUST use 4570 filehandle comparisons only to improve performance, not for correct 4571 behavior. All clients need to be prepared for situations in which it 4572 cannot be determined whether two filehandles denote the same object 4573 and in such cases, avoid making invalid assumptions which might cause 4574 incorrect behavior. Further discussion of filehandle and attribute 4575 comparison in the context of data caching is presented in 4576 Section 10.3.4. 4578 As an example, in the case that two different path names when 4579 traversed at the server terminate at the same file system object, the 4580 server SHOULD return the same filehandle for each path. This can 4581 occur if a hard link (see [6]) is used to create two file names which 4582 refer to the same underlying file object and associated data. For 4583 example, if paths /a/b/c and /a/d/c refer to the same file, the 4584 server SHOULD return the same filehandle for both path name 4585 traversals. 4587 4.2.2. Persistent Filehandle 4589 A persistent filehandle is defined as having a fixed value for the 4590 lifetime of the file system object to which it refers. Once the 4591 server creates the filehandle for a file system object, the server 4592 MUST accept the same filehandle for the object for the lifetime of 4593 the object. If the server restarts, the NFS server MUST honor the 4594 same filehandle value as it did in the server's previous 4595 instantiation. Similarly, if the file system is migrated, the new 4596 NFS server MUST honor the same filehandle as the old NFS server. 4598 The persistent filehandle will become stale or invalid when the 4599 file system object is removed. When the server is presented with a 4600 persistent filehandle that refers to a deleted object, it MUST return 4601 an error of NFS4ERR_STALE. A filehandle may become stale when the 4602 file system containing the object is no longer available. The file 4603 system may become unavailable if it exists on removable media and the 4604 media is no longer available at the server or the file system in 4605 whole has been destroyed or the file system has simply been removed 4606 from the server's name space (i.e. unmounted in a UNIX environment). 4608 4.2.3. Volatile Filehandle 4610 A volatile filehandle does not share the same longevity 4611 characteristics of a persistent filehandle. The server may determine 4612 that a volatile filehandle is no longer valid at many different 4613 points in time. If the server can definitively determine that a 4614 volatile filehandle refers to an object that has been removed, the 4615 server should return NFS4ERR_STALE to the client (as is the case for 4616 persistent filehandles). In all other cases where the server 4617 determines that a volatile filehandle can no longer be used, it 4618 should return an error of NFS4ERR_FHEXPIRED. 4620 The REQUIRED attribute "fh_expire_type" is used by the client to 4621 determine what type of filehandle the server is providing for a 4622 particular file system. This attribute is a bitmask with the 4623 following values: 4625 FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a 4626 persistent filehandle, which is valid until the object is removed 4627 from the file system. The server will not return 4628 NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined 4629 as a value in which none of the bits specified below are set. 4631 FH4_VOLATILE_ANY The filehandle may expire at any time, except as 4632 specifically excluded (i.e. FH4_NOEXPIRE_WITH_OPEN).
4634 FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set. 4635 If this bit is set, then the meaning of FH4_VOLATILE_ANY is 4636 qualified to exclude any expiration of the filehandle when it is 4637 open. 4639 FH4_VOL_MIGRATION The filehandle will expire as a result of a file 4640 system transition (migration or replication), in those cases in 4641 which the continuity of filehandle use is not specified by 4642 handle class information within the fs_locations_info attribute. 4643 When this bit is set, clients without access to fs_locations_info 4644 information should assume filehandles will expire on file system 4645 transitions. 4647 FH4_VOL_RENAME The filehandle will expire during rename. This 4648 includes a rename by the requesting client or a rename by any 4649 other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant. 4659 Servers which provide volatile filehandles that may expire while open 4660 require special care as regards handling of RENAMEs and REMOVEs. 4661 This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is 4662 set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN not set, 4663 or if a non-read-only file system has a transition target in a 4664 different handle class. In these cases, the server should deny a 4665 RENAME or REMOVE that would affect an OPEN file of any of the 4666 components leading to the OPEN file. In addition, the server should 4667 deny all RENAME or REMOVE requests during the grace period upon 4668 server restart, in order to make sure that reclaims of files where 4669 filehandles may have expired do not do a reclaim for the wrong file. 4671 Volatile filehandles are especially suitable for implementation of 4672 the pseudo file systems used to bridge exports. See Section 7.5 for 4673 a discussion of this. 4675 4.3. One Method of Constructing a Volatile Filehandle 4677 A volatile filehandle, while opaque to the client, could contain: 4679 [volatile bit = 1 | server boot time | slot | generation number] 4681 o slot is an index in the server volatile filehandle table 4683 o generation number is the generation number for the table entry/ 4684 slot 4686 When the client presents a volatile filehandle, the server makes the 4687 following checks, which assume that the check for the volatile bit 4688 has passed. If the server boot time is less than the current server 4689 boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return 4690 NFS4ERR_BADHANDLE. If the generation number does not match, return 4691 NFS4ERR_FHEXPIRED. 4693 When the server restarts, the table is gone (it is volatile). 4695 If the volatile bit is 0, then it is a persistent filehandle with a 4696 different structure following it. 4698 4.4. Client Recovery from Filehandle Expiration 4700 If possible, the client SHOULD recover from the receipt of an 4701 NFS4ERR_FHEXPIRED error. The client must take on additional 4702 responsibility so that it may prepare itself to recover from the 4703 expiration of a volatile filehandle. If the server returns 4704 persistent filehandles, the client does not need these additional 4705 steps.
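As a non-normative sketch of the bookkeeping described below, a client might retain the component names used to reach each object on a file system that returns volatile filehandles, and replay them when NFS4ERR_FHEXPIRED is received. All types and helper functions in the sketch are hypothetical wrappers around the corresponding NFSv4.1 operations.

   /* Hypothetical cache entry for an object reached through a file
    * system that returns volatile filehandles.  struct nfs_fh and
    * struct nfs_client are stand-ins for client-internal types. */
   struct fh_cache_entry {
       struct nfs_fh fh;               /* last filehandle obtained      */
       const char   *component[16];    /* names from the root, in order */
       int           ncomponents;
   };

   /* On NFS4ERR_FHEXPIRED, rebuild the filehandle by replaying the
    * stored component names: PUTROOTFH, one LOOKUP per component,
    * then GETFH to refresh the cached handle.  Returns 0 on success;
    * NFS4ERR_STALE indicates the object itself is gone. */
   static int recover_expired_fh(struct nfs_client *clnt,
                                 struct fh_cache_entry *e)
   {
       int i, status;

       status = nfs_putrootfh(clnt);               /* hypothetical */
       for (i = 0; status == 0 && i < e->ncomponents; i++)
           status = nfs_lookup(clnt, e->component[i]);
       if (status == 0)
           status = nfs_getfh(clnt, &e->fh);
       return status;
   }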
4707 For volatile filehandles, most commonly the client will need to store 4708 the component names leading up to and including the file system 4709 object in question. With these names, the client should be able to 4710 recover by finding a filehandle in the name space that is still 4711 available or by starting at the root of the server's file system name 4712 space. 4714 If the expired filehandle refers to an object that has been removed 4715 from the file system, obviously the client will not be able to 4716 recover from the expired filehandle. 4718 It is also possible that the expired filehandle refers to a file that 4719 has been renamed. If the file was renamed by another client, again 4720 it is possible that the original client will not be able to recover. 4721 However, in the case that the client itself is renaming the file and 4722 the file is open, it is possible that the client may be able to 4723 recover. The client can determine the new path name based on the 4724 processing of the rename request. The client can then regenerate the 4725 new filehandle based on the new path name. The client could also use 4726 the compound operation mechanism to construct a set of operations 4727 like: 4729 RENAME A B 4730 LOOKUP B 4731 GETFH 4733 Note that the COMPOUND procedure does not provide atomicity. This 4734 example only reduces the overhead of recovering from an expired 4735 filehandle. 4737 5. File Attributes 4739 To meet the requirements of extensibility and increased 4740 interoperability with non-UNIX platforms, attributes need to be 4741 handled in a flexible manner. The NFSv3 fattr3 structure contains a 4742 fixed list of attributes that not all clients and servers are able to 4743 support or care about. The fattr3 structure cannot be extended as 4744 new needs arise and it provides no way to indicate non-support. With 4745 the NFSv4.1 protocol, the client is able to query what attributes the 4746 server supports and construct requests with only those supported 4747 attributes (or a subset thereof). 4749 To this end, attributes are divided into three groups: REQUIRED, 4750 RECOMMENDED, and named. Both REQUIRED and RECOMMENDED attributes are 4751 supported in the NFSv4.1 protocol by a specific and well-defined 4752 encoding and are identified by number. They are requested by setting 4753 a bit in the bit vector sent in the GETATTR request; the server 4754 response includes a bit vector to list what attributes were returned 4755 in the response. New REQUIRED or RECOMMENDED attributes may be added 4756 to the NFSv4 protocol as part of a new minor version by publishing a 4757 standards-track RFC which allocates a new attribute number value and 4758 defines the encoding for the attribute. See Section 2.7 for further 4759 discussion. 4761 Named attributes are accessed by the new OPENATTR operation, which 4762 accesses a hidden directory of attributes associated with a file 4763 system object. OPENATTR takes a filehandle for the object and 4764 returns the filehandle for the attribute hierarchy. The filehandle 4765 for the named attributes is a directory object accessible by LOOKUP 4766 or READDIR and contains files whose names represent the named 4767 attributes and whose data bytes are the value of the attribute.
For 4768 example: 4770 +----------+-----------+---------------------------------+ 4771 | LOOKUP | "foo" | ; look up file | 4772 | GETATTR | attrbits | | 4773 | OPENATTR | | ; access foo's named attributes | 4774 | LOOKUP | "x11icon" | ; look up specific attribute | 4775 | READ | 0,4096 | ; read stream of bytes | 4776 +----------+-----------+---------------------------------+ 4778 Named attributes are intended for data needed by applications rather 4779 than by an NFS client implementation. NFS implementors are strongly 4780 encouraged to define their new attributes as RECOMMENDED attributes 4781 by bringing them to the IETF standards-track process. 4783 The set of attributes which are classified as REQUIRED is 4784 deliberately small since servers need to do whatever it takes to 4785 support them. A server should support as many of the RECOMMENDED 4786 attributes as possible but by their definition, the server is not 4787 required to support all of them. Attributes are deemed REQUIRED if 4788 the data is both needed by a large number of clients and is not 4789 otherwise reasonably computable by the client when support is not 4790 provided on the server. 4792 Note that the hidden directory returned by OPENATTR is a convenience 4793 for protocol processing. The client should not make any assumptions 4794 about the server's implementation of named attributes and whether the 4795 underlying file system at the server has a named attribute directory 4796 or not. Therefore, operations such as SETATTR and GETATTR on the 4797 named attribute directory are undefined. 4799 5.1. REQUIRED Attributes 4801 These MUST be supported by every NFSv4.1 client and server in order 4802 to ensure a minimum level of interoperability. The server MUST store 4803 and return these attributes and the client MUST be able to function 4804 with an attribute set limited to these attributes. With just the 4805 REQUIRED attributes some client functionality may be impaired or 4806 limited in some ways. A client may ask for any of these attributes 4807 to be returned by setting a bit in the GETATTR request and the server 4808 must return their value. 4810 5.2. RECOMMENDED Attributes 4812 These attributes are understood well enough to warrant support in the 4813 NFSv4.1 protocol. However, they may not be supported on all clients 4814 and servers. A client may ask for any of these attributes to be 4815 returned by setting a bit in the GETATTR request but must handle the 4816 case where the server does not return them. A client MAY ask for the 4817 set of attributes the server supports and SHOULD NOT request 4818 attributes the server does not support. A server should be tolerant 4819 of requests for unsupported attributes and simply not return them 4820 rather than considering the request an error. It is expected that 4821 servers will support all attributes they comfortably can and only 4822 fail to support attributes which are difficult to support in their 4823 operating environments. Servers should provide attributes whenever 4824 they don't have to "tell lies" to the client. For example, a file 4825 modification time should be either an accurate time or should not be 4826 supported by the server. This will not always be comfortable to 4827 clients but the client is better positioned to decide whether and how to 4828 fabricate or construct an attribute or whether to do without the 4829 attribute. 4831 5.3.
Named Attributes 4833 These attributes are not supported by direct encoding in the NFSv4 4834 protocol but are accessed by string names rather than numbers and 4835 correspond to an uninterpreted stream of bytes which are stored with 4836 the file system object. The name space for these attributes may be 4837 accessed by using the OPENATTR operation. The OPENATTR operation 4838 returns a filehandle for a virtual "named attribute directory" and 4839 further perusal and modification of the name space may be done using 4840 operations that work on more typical directories. In particular, 4841 READDIR may be used to get a list of such named attributes and LOOKUP 4842 and OPEN may select a particular attribute. Creation of a new named 4843 attribute may be the result of an OPEN specifying file creation. 4845 Once an OPEN is done, named attributes may be examined and changed by 4846 normal READ and WRITE operations using the filehandles and stateids 4847 returned by OPEN. 4849 Named attributes and the named attribute directory may have their own 4850 (non-named) attributes. Each of these objects MUST have all of the 4851 REQUIRED attributes and may have additional RECOMMENDED attributes. 4852 However, the set of attributes for named attributes and the named 4853 attribute directory need not be as large as, and typically will not 4854 be as large as that for other objects in that file system. 4856 Named attributes and the named attribute directory may be the target 4857 of delegations (in the case of the named attribute directory these 4858 will be directory delegations). However, since granting of 4859 delegations or not is within the server's discretion, a server need 4860 not support delegations on named attributes or the named attribute 4861 directory. 4863 It is RECOMMENDED that servers support arbitrary named attributes. A 4864 client should not depend on the ability to store any named attributes 4865 in the server's file system. If a server does support named 4866 attributes, a client which is also able to handle them should be able 4867 to copy a file's data and metadata with complete transparency from 4868 one location to another; this would imply that names allowed for 4869 regular directory entries are valid for named attribute names as 4870 well. 4872 In NFSv4.1, the structure of named attribute directories is 4873 restricted in a number of ways, in order to prevent the development 4874 of non-interoperable implementations in which some servers support a 4875 fully general hierarchical directory structure for named attributes 4876 while others support a limited set, but fully adequate to the 4877 feature's goals. In such an environment, clients or applications 4878 might come to depend on non-portable extensions. The restrictions 4879 are: 4881 o CREATE is not allowed in a named attribute directory. Thus, such 4882 objects as symbolic links and special files are not allowed to be 4883 named attributes. Further, directories may not be created in a 4884 named attribute directory so no hierarchical structure of named 4885 attributes for a single object is allowed. 4887 o If OPENATTR is done on a named attribute directory or on a named 4888 attribute, the server MUST return NFS4ERR_WRONG_TYPE. 4890 o Doing a RENAME of a named attribute to a different named attribute 4891 directory or to an ordinary (i.e. non-named-attribute) directory 4892 is not allowed. 
4894 o Creating hard links between named attribute directories or between 4895 named attribute directories and ordinary directories is not 4896 allowed. 4898 Names of attributes will not be controlled by this document or other 4899 IETF standards track documents. See Section 22.1 for further 4900 discussion. 4902 5.4. Classification of Attributes 4904 Each of the REQUIRED and RECOMMENDED attributes can be classified in 4905 one of three categories: per server (i.e. the value of the attribute 4906 will be the same for all file objects that share the same server 4907 owner; see Section 2.5 for a definition of server owner), per file 4908 system (i.e. the value of the attribute will be the same for some or 4909 all file objects that share the same fsid attribute (Section 5.8.1.9) 4910 and Server Owner), or per file system object. Note that it is 4911 possible that some per file system attributes may vary within the 4912 file system, depending on the value of the "homogeneous" 4913 (Section 5.8.2.16) attribute. Note that the attributes 4914 time_access_set and time_modify_set are not listed in this section 4915 because they are write-only attributes corresponding to time_access 4916 and time_modify, and are used in a special instance of SETATTR. 4918 o The per server attribute is: 4920 lease_time 4922 o The per file system attributes are: 4924 supported_attrs, suppattr_exclcreat, fh_expire_type, 4925 link_support, symlink_support, unique_handles, aclsupport, 4926 cansettime, case_insensitive, case_preserving, 4927 chown_restricted, files_avail, files_free, files_total, 4928 fs_locations, homogeneous, maxfilesize, maxname, maxread, 4929 maxwrite, no_trunc, space_avail, space_free, space_total, 4930 time_delta, change_policy, fs_status, fs_layout_type, 4931 fs_locations_info, fs_charset_cap 4933 o The per file system object attributes are: 4935 type, change, size, named_attr, fsid, rdattr_error, filehandle, 4936 acl, archive, fileid, hidden, maxlink, mimetype, mode, 4937 numlinks, owner, owner_group, rawdev, space_used, system, 4938 time_access, time_backup, time_create, time_metadata, 4939 time_modify, mounted_on_fileid, dir_notif_delay, 4940 dirent_notif_delay, dacl, sacl, layout_type, layout_hint, 4941 layout_blksize, layout_alignment, mdsthreshold, retention_get, 4942 retention_set, retentevt_get, retentevt_set, retention_hold, 4943 mode_set_masked 4945 For quota_avail_hard, quota_avail_soft, and quota_used see their 4946 definitions below for the appropriate classification. 4948 5.5. Set-Only and Get-Only Attributes 4950 Some REQUIRED and RECOMMENDED attributes are set-only, i.e. they can 4951 be set via SETATTR but not retrieved via GETATTR. Similarly, some 4952 REQUIRED and RECOMMENDED attributes are get-only, i.e. they can be 4953 retrieved via GETATTR but not set via SETATTR. If a client attempts to 4954 set a get-only attribute or get a set-only attribute, the server 4955 MUST return NFS4ERR_INVAL. 4957 5.6. REQUIRED Attributes - List and Definition References 4959 The list of REQUIRED attributes appears in Table 2. The meanings of 4960 the columns of the table are: 4962 o Name: the name of the attribute 4964 o Id: the number assigned to the attribute. In the event of 4965 conflicts between the assigned number and [12], the latter is 4966 likely authoritative, but should be resolved with Errata to this 4967 document and/or [12]. See [43] for the Errata process. 4969 o Data Type: The XDR data type of the attribute. 4971 o Acc: Access allowed to the attribute.
R means read-only (GETATTR 4972 may retrieve, SETATTR may not set). W means write-only (SETATTR 4973 may set, GETATTR may not retrieve). R W means read/write (GETATTR 4974 may retrieve, SETATTR may set). 4976 o Defined in: the section of this specification that describes the 4977 attribute. 4979 +--------------------+----+------------+-----+------------------+ 4980 | Name | Id | Data Type | Acc | Defined in: | 4981 +--------------------+----+------------+-----+------------------+ 4982 | supported_attrs | 0 | bitmap4 | R | Section 5.8.1.1 | 4983 | type | 1 | nfs_ftype4 | R | Section 5.8.1.2 | 4984 | fh_expire_type | 2 | uint32_t | R | Section 5.8.1.3 | 4985 | change | 3 | uint64_t | R | Section 5.8.1.4 | 4986 | size | 4 | uint64_t | R W | Section 5.8.1.5 | 4987 | link_support | 5 | bool | R | Section 5.8.1.6 | 4988 | symlink_support | 6 | bool | R | Section 5.8.1.7 | 4989 | named_attr | 7 | bool | R | Section 5.8.1.8 | 4990 | fsid | 8 | fsid4 | R | Section 5.8.1.9 | 4991 | unique_handles | 9 | bool | R | Section 5.8.1.10 | 4992 | lease_time | 10 | nfs_lease4 | R | Section 5.8.1.11 | 4993 | rdattr_error | 11 | enum | R | Section 5.8.1.12 | 4994 | filehandle | 19 | nfs_fh4 | R | Section 5.8.1.13 | 4995 | suppattr_exclcreat | 75 | bitmap4 | R | Section 5.8.1.14 | 4996 +--------------------+----+------------+-----+------------------+ 4998 Table 2 5000 5.7. RECOMMENDED Attributes - List and Definition References 5002 The RECOMMENDED attributes are defined in Table 3. The meanings of 5003 the column headers are the same as Table 2; see Section 5.6 for the 5004 meanings. 5006 +--------------------+----+----------------+-----+------------------+ 5007 | Name | Id | Data Type | Acc | Defined in: | 5008 +--------------------+----+----------------+-----+------------------+ 5009 | acl | 12 | nfsace4<> | R W | Section 6.2.1 | 5010 | aclsupport | 13 | uint32_t | R | Section 6.2.1.2 | 5011 | archive | 14 | bool | R W | Section 5.8.2.1 | 5012 | cansettime | 15 | bool | R | Section 5.8.2.2 | 5013 | case_insensitive | 16 | bool | R | Section 5.8.2.3 | 5014 | case_preserving | 17 | bool | R | Section 5.8.2.4 | 5015 | change_policy | 60 | chg_policy4 | R | Section 5.8.2.5 | 5016 | chown_restricted | 18 | bool | R | Section 5.8.2.6 | 5017 | dacl | 58 | nfsacl41 | R W | Section 6.2.2 | 5018 | dir_notif_delay | 56 | nfstime4 | R | Section 5.11.1 | 5019 | dirent_notif_delay | 57 | nfstime4 | R | Section 5.11.2 | 5020 | fileid | 20 | uint64_t | R | Section 5.8.2.7 | 5021 | files_avail | 21 | uint64_t | R | Section 5.8.2.8 | 5022 | files_free | 22 | uint64_t | R | Section 5.8.2.9 | 5023 | files_total | 23 | uint64_t | R | Section 5.8.2.10 | 5024 | fs_charset_cap | 76 | uint32_t | R | Section 5.8.2.11 | 5025 | fs_layout_type | 62 | layouttype4<> | R | Section 5.12.1 | 5026 | fs_locations | 24 | fs_locations | R | Section 5.8.2.12 | 5027 | fs_locations_info | 67 | * | R | Section 5.8.2.13 | 5028 | fs_status | 61 | fs4_status | R | Section 5.8.2.14 | 5029 | hidden | 25 | bool | R W | Section 5.8.2.15 | 5030 | homogeneous | 26 | bool | R | Section 5.8.2.16 | 5031 | layout_alignment | 66 | uint32_t | R | Section 5.12.2 | 5032 | layout_blksize | 65 | uint32_t | R | Section 5.12.3 | 5033 | layout_hint | 63 | layouthint4 | W | Section 5.12.4 | 5034 | layout_type | 64 | layouttype4<> | R | Section 5.12.5 | 5035 | maxfilesize | 27 | uint64_t | R | Section 5.8.2.17 | 5036 | maxlink | 28 | uint32_t | R | Section 5.8.2.18 | 5037 | maxname | 29 | uint32_t | R | Section 5.8.2.19 | 5038 | maxread | 30 | uint64_t | R | Section 
5.8.2.20 | 5039 | maxwrite | 31 | uint64_t | R | Section 5.8.2.21 | 5040 | mdsthreshold | 68 | mdsthreshold4 | R | Section 5.12.6 | 5041 | mimetype | 32 | utf8<> | R W | Section 5.8.2.22 | 5042 | mode | 33 | mode4 | R W | Section 6.2.4 | 5043 | mode_set_masked | 74 | mode_masked4 | W | Section 6.2.5 | 5044 | mounted_on_fileid | 55 | uint64_t | R | Section 5.8.2.23 | 5045 | no_trunc | 34 | bool | R | Section 5.8.2.24 | 5046 | numlinks | 35 | uint32_t | R | Section 5.8.2.25 | 5047 | owner | 36 | utf8<> | R W | Section 5.8.2.26 | 5048 | owner_group | 37 | utf8<> | R W | Section 5.8.2.27 | 5049 | quota_avail_hard | 38 | uint64_t | R | Section 5.8.2.28 | 5050 | quota_avail_soft | 39 | uint64_t | R | Section 5.8.2.29 | 5051 | quota_used | 40 | uint64_t | R | Section 5.8.2.30 | 5052 | rawdev | 41 | specdata4 | R | Section 5.8.2.31 | 5053 | retentevt_get | 71 | retention_get4 | R | Section 5.13.3 | 5054 | retentevt_set | 72 | retention_set4 | W | Section 5.13.4 | 5055 | retention_get | 69 | retention_get4 | R | Section 5.13.1 | 5056 | retention_hold | 73 | uint64_t | R W | Section 5.13.5 | 5057 | retention_set | 70 | retention_set4 | W | Section 5.13.2 | 5058 | sacl | 59 | nfsacl41 | R W | Section 6.2.3 | 5059 | space_avail | 42 | uint64_t | R | Section 5.8.2.32 | 5060 | space_free | 43 | uint64_t | R | Section 5.8.2.33 | 5061 | space_total | 44 | uint64_t | R | Section 5.8.2.34 | 5062 | space_used | 45 | uint64_t | R | Section 5.8.2.35 | 5063 | system | 46 | bool | R W | Section 5.8.2.36 | 5064 | time_access | 47 | nfstime4 | R | Section 5.8.2.37 | 5065 | time_access_set | 48 | settime4 | W | Section 5.8.2.38 | 5066 | time_backup | 49 | nfstime4 | R W | Section 5.8.2.39 | 5067 | time_create | 50 | nfstime4 | R W | Section 5.8.2.40 | 5068 | time_delta | 51 | nfstime4 | R | Section 5.8.2.41 | 5069 | time_metadata | 52 | nfstime4 | R | Section 5.8.2.42 | 5070 | time_modify | 53 | nfstime4 | R | Section 5.8.2.43 | 5071 | time_modify_set | 54 | settime4 | W | Section 5.8.2.44 | 5072 +--------------------+----+----------------+-----+------------------+ 5074 Table 3 5076 * fs_locations_info4 5078 5.8. Attribute Definitions 5080 5.8.1. Definitions of REQUIRED Attributes 5082 5.8.1.1. Attribute 0: supported_attrs 5084 The bit vector which would retrieve all REQUIRED and RECOMMENDED 5085 attributes that are supported for this object. The scope of this 5086 attribute applies to all objects with a matching fsid. 5088 5.8.1.2. Attribute 1: type 5090 Designates the type of an object in terms of one of a number of 5091 special constants: 5093 o NF4REG designates a regular file. 5095 o NF4DIR designates a directory. 5097 o NF4BLK designates a block device special file. 5099 o NF4CHR designates a character device special file. 5101 o NF4LNK designates a symbolic link. 5103 o NF4SOCK designates a named socket special file. 5105 o NF4FIFO designates a fifo special file. 5107 o NF4ATTRDIR designates a named attribute directory. 5109 o NF4NAMEDATTR designates a named attribute. 5111 Within the explanatory text and operation descriptions, the following 5112 phrases will be used with the meanings given below: 5114 o The phrase "is a directory" means that the object is of type 5115 NF4DIR or of type NF4ATTRDIR. 5117 o The phrase "is a special file" means that the object is of one of 5118 the types NF4BLK, NF4CHR, NF4SOCK, or NF4FIFO. 5120 o The phrase "is an ordinary file" means that the object is of type 5121 NF4REG or of type NF4NAMEDATTR. 5123 5.8.1.3. 
Attribute 2: fh_expire_type 5125 The server uses this to specify filehandle expiration behavior to the 5126 client. See Section 4 for additional description. 5128 5.8.1.4. Attribute 3: change 5130 A value created by the server that the client can use to determine if 5131 file data, directory contents, or attributes of the object have been 5132 modified. The server may return the object's time_metadata attribute 5133 for this attribute's value but only if the file system object cannot 5134 be updated more frequently than the resolution of time_metadata. 5136 5.8.1.5. Attribute 4: size 5138 The size of the object in bytes. 5140 5.8.1.6. Attribute 5: link_support 5142 True, if the object's file system supports hard links. 5144 5.8.1.7. Attribute 6: symlink_support 5146 True, if the object's file system supports symbolic links. 5148 5.8.1.8. Attribute 7: named_attr 5150 True, if this object has named attributes. In other words, the object 5151 has a non-empty named attribute directory. 5153 5.8.1.9. Attribute 8: fsid 5155 Unique file system identifier for the file system holding this 5156 object. fsid contains major and minor components each of which is of 5157 data type uint64_t. 5159 5.8.1.10. Attribute 9: unique_handles 5161 True, if two distinct filehandles are guaranteed to refer to two 5162 different file system objects. 5164 5.8.1.11. Attribute 10: lease_time 5166 Duration of leases at the server in seconds. 5168 5.8.1.12. Attribute 11: rdattr_error 5170 Error returned from an attempt to retrieve attributes during a 5171 READDIR operation. 5173 5.8.1.13. Attribute 19: filehandle 5175 The filehandle of this object (primarily for READDIR requests). 5177 5.8.1.14. Attribute 75: suppattr_exclcreat 5179 The bit vector which would set all REQUIRED and RECOMMENDED 5180 attributes that are supported by the EXCLUSIVE4_1 method of file 5181 creation via the OPEN operation. The scope of this attribute applies 5182 to all objects with a matching fsid. 5184 5.8.2. Definitions of Uncategorized RECOMMENDED Attributes 5186 The definitions of most of the RECOMMENDED attributes follow. 5187 Collections that share a common category are defined in other 5188 sections. 5190 5.8.2.1. Attribute 14: archive 5192 True, if this file has been archived since the time of last 5193 modification (deprecated in favor of time_backup). 5195 5.8.2.2. Attribute 15: cansettime 5197 True, if the server is able to change the times for a file system object 5198 as specified in a SETATTR operation. 5200 5.8.2.3. Attribute 16: case_insensitive 5202 True, if file name comparisons on this file system are case 5203 insensitive. 5205 5.8.2.4. Attribute 17: case_preserving 5207 True, if file name case on this file system is preserved. 5209 5.8.2.5. Attribute 60: change_policy 5211 A value created by the server that the client can use to determine if 5212 some server policy related to the current file system has been 5213 subject to change. If the value remains the same, then the client can 5214 be sure that the values of the attributes related to fs location and 5215 the fss_type field of the fs_status attribute have not changed. On 5216 the other hand, a change in this value does not necessarily imply a 5217 change in policy. It is up to the client to interrogate the server 5218 to determine if some policy relevant to it has changed. See 5219 Section 3.3.6 for details.
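As a non-normative illustration of client use, the following C sketch caches the most recent change_policy value for a file system and re-interrogates the policy-related attributes only when a newly returned value differs; the types and the refetch helper are hypothetical, and the attribute value is shown as a simple 64-bit quantity for brevity.

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical per-file-system state kept by the client. */
   struct fs_policy_cache {
       uint64_t last_change_policy;    /* simplified representation */
       bool     valid;
   };

   /* Called after a GETATTR that included change_policy.  Only when
    * the value differs from the cached one does the client go back
    * and fetch the policy-related attributes (fs_locations,
    * fs_locations_info, fs_status, ...). */
   static void note_change_policy(struct nfs_client *clnt,
                                  struct fs_policy_cache *cache,
                                  uint64_t seen)
   {
       if (cache->valid && cache->last_change_policy == seen)
           return;                      /* relevant policies unchanged */
       refetch_fs_policies(clnt);       /* hypothetical helper         */
       cache->last_change_policy = seen;
       cache->valid = true;
   }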
5221 This attribute MUST change when the value returned by the 5222 fs_locations or fs_locations_info attribute changes, when a file 5223 system goes from read-only to writable or vice versa, or when the 5224 allowable set of security flavors for the file system or any part 5225 thereof is changed. 5227 5.8.2.6. Attribute 18: chown_restricted 5229 If TRUE, the server will reject any request to change either the 5230 owner or the group associated with a file if the caller is not a 5231 privileged user (for example, "root" in UNIX operating environments 5232 or in Windows 2000 the "Take Ownership" privilege). 5234 5.8.2.7. Attribute 20: fileid 5236 A number uniquely identifying the file within the file system. 5238 5.8.2.8. Attribute 21: files_avail 5240 File slots available to this user on the file system containing this 5241 object - this should be the smallest relevant limit. 5243 5.8.2.9. Attribute 22: files_free 5245 Free file slots on the file system containing this object - this 5246 should be the smallest relevant limit. 5248 5.8.2.10. Attribute 23: files_total 5250 Total file slots on the file system containing this object. 5252 5.8.2.11. Attribute 76: fs_charset_cap 5254 Character set capabilities for this file system. See Section 14.4. 5256 5.8.2.12. Attribute 24: fs_locations 5258 Locations where this file system may be found. If the server returns 5259 NFS4ERR_MOVED as an error, this attribute MUST be supported. See 5260 Section 11.9 for more details. 5262 5.8.2.13. Attribute 67: fs_locations_info 5264 Full function file system location. See Section 11.10 for more 5265 details. 5267 5.8.2.14. Attribute 61: fs_status 5269 Generic file system type information. See Section 11.11 for more 5270 details. 5272 5.8.2.15. Attribute 25: hidden 5274 True, if the file is considered hidden with respect to the Windows 5275 API. 5277 5.8.2.16. Attribute 26: homogeneous 5279 True, if this object's file system is homogeneous, i.e. are per file 5280 system attributes the same for all file system's objects. 5282 5.8.2.17. Attribute 27: maxfilesize 5284 Maximum supported file size for the file system of this object. 5286 5.8.2.18. Attribute 28: maxlink 5288 Maximum number of links for this object. 5290 5.8.2.19. Attribute 29: maxname 5292 Maximum file name size supported for this object. 5294 5.8.2.20. Attribute 30: maxread 5296 Maximum read size supported for this object. 5298 5.8.2.21. Attribute 31: maxwrite 5300 Maximum write size supported for this object. This attribute SHOULD 5301 be supported if the file is writable. Lack of this attribute can 5302 lead to the client either wasting bandwidth or not receiving the best 5303 performance. 5305 5.8.2.22. Attribute 32: mimetype 5307 MIME body type/subtype of this object. 5309 5.8.2.23. Attribute 55: mounted_on_fileid 5311 Like fileid, but if the target filehandle is the root of a file 5312 system, this attribute represents the fileid of the underlying 5313 directory. 5315 UNIX-based operating environments connect a file system into the 5316 namespace by connecting (mounting) the file system onto the existing 5317 file object (the mount point, usually a directory) of an existing 5318 file system. When the mount point's parent directory is read via an 5319 API like readdir(), the return results are directory entries, each 5320 with a component name and a fileid. The fileid of the mount point's 5321 directory entry will be different from the fileid that the stat() 5322 system call returns. 
The stat() system call is returning the fileid 5323 of the root of the mounted file system, whereas readdir() is 5324 returning the fileid stat() would have returned before any file 5325 systems were mounted on the mount point. 5327 Unlike NFSv3, NFSv4.1 allows a client's LOOKUP request to cross other 5328 file systems. The client detects the file system crossing whenever 5329 the filehandle argument of LOOKUP has an fsid attribute different 5330 from that of the filehandle returned by LOOKUP. A UNIX-based client 5331 will consider this a "mount point crossing". UNIX has a legacy 5332 scheme for allowing a process to determine its current working 5333 directory. This relies on readdir() of a mount point's parent and 5334 stat() of the mount point returning fileids as previously described. 5335 The mounted_on_fileid attribute corresponds to the fileid that 5336 readdir() would have returned as described previously. 5338 While the NFSv4.1 client could simply fabricate a fileid 5339 corresponding to what mounted_on_fileid provides (and if the server 5340 does not support mounted_on_fileid, the client has no choice), there 5341 is a risk that the client will generate a fileid that conflicts with 5342 one that is already assigned to another object in the file system. 5343 Instead, if the server can provide the mounted_on_fileid, the 5344 potential for client operational problems in this area is eliminated. 5346 If the server detects that there is no mounted point at the target 5347 file object, then the value for mounted_on_fileid that it returns is 5348 the same as that of the fileid attribute. 5350 The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD 5351 provide it if possible, and for a UNIX-based server, this is 5352 straightforward. Usually, mounted_on_fileid will be requested during 5353 a READDIR operation, in which case it is trivial (at least for UNIX- 5354 based servers) to return mounted_on_fileid since it is equal to the 5355 fileid of a directory entry returned by readdir(). If 5356 mounted_on_fileid is requested in a GETATTR operation, the server 5357 should obey an invariant that has it returning a value that is equal 5358 to the file object's entry in the object's parent directory, i.e. 5359 what readdir() would have returned. Some operating environments 5360 allow a series of two or more file systems to be mounted onto a 5361 single mount point. In this case, for the server to obey the 5362 aforementioned invariant, it will need to find the base mount point, 5363 and not the intermediate mount points. 5365 5.8.2.24. Attribute 34: no_trunc 5367 If this attribute is TRUE, then if the client uses a file name longer 5368 than name_max, an error will be returned instead of the name being 5369 truncated. 5371 5.8.2.25. Attribute 35: numlinks 5373 Number of hard links to this object. 5375 5.8.2.26. Attribute 36: owner 5377 The string name of the owner of this object. 5379 5.8.2.27. Attribute 37: owner_group 5381 The string name of the group ownership of this object. 5383 5.8.2.28. Attribute 38: quota_avail_hard 5385 The value in bytes which represents the amount of additional disk 5386 space beyond the current allocation that can be allocated to this 5387 file or directory before further allocations will be refused. It is 5388 understood that this space may be consumed by allocations to other 5389 files or directories. 5391 5.8.2.29. 
Attribute 39: quota_avail_soft 5393 The value in bytes which represents the amount of additional disk 5394 space that can be allocated to this file or directory before the user 5395 may reasonably be warned. It is understood that this space may be 5396 consumed by allocations to other files or directories though there is 5397 a rule as to which other files or directories. 5399 5.8.2.30. Attribute 40: quota_used 5401 The value in bytes which represents the amount of disk space used by 5402 this file or directory and possibly a number of other similar files 5403 or directories, where the set of "similar" meets at least the 5404 criterion that allocating space to any file or directory in the set 5405 will reduce the "quota_avail_hard" of every other file or directory 5406 in the set. 5408 Note that there may be a number of distinct but overlapping sets of 5409 files or directories for which a quota_used value is maintained. 5410 E.g., "all files with a given owner", "all files with a given group 5411 owner", etc. The server is at liberty to choose any of those sets 5412 when providing the content of the quota_used attribute, but should do 5413 so in a repeatable way. The rule may be configured per file system 5414 or may be "choose the set with the smallest quota". 5416 5.8.2.31. Attribute 41: rawdev 5418 Raw device identifier; the UNIX device major/minor node information. 5419 If the value of type is not NF4BLK or NF4CHR, the value returned 5420 SHOULD NOT be considered useful. 5422 5.8.2.32. Attribute 42: space_avail 5424 Disk space in bytes available to this user on the file system 5425 containing this object - this should be the smallest relevant limit. 5427 5.8.2.33. Attribute 43: space_free 5429 Free disk space in bytes on the file system containing this object - 5430 this should be the smallest relevant limit. 5432 5.8.2.34. Attribute 44: space_total 5434 Total disk space in bytes on the file system containing this object. 5436 5.8.2.35. Attribute 45: space_used 5438 Number of file system bytes allocated to this object. 5440 5.8.2.36. Attribute 46: system 5442 This attribute is TRUE if this file is a "system" file with respect 5443 to the Windows operating environment. 5445 5.8.2.37. Attribute 47: time_access 5447 The time_access attribute represents the time of last access to the 5448 object by a read that was satisfied by the server. The notion of 5449 what is an "access" depends on the server's operating environment and/or 5450 the server's file system semantics. For example, for servers obeying 5451 POSIX semantics, time_access would be updated only by the READ and 5452 READDIR operations and not any of the operations that modify the 5453 content of the object [15], [16], [17]. Of course, setting the 5454 corresponding time_access_set attribute is another way to modify the 5455 time_access attribute. 5457 Whenever the file object resides on a writable file system, the 5458 server should make best efforts to record time_access into stable 5459 storage. However, to mitigate the performance effects of doing so, 5460 and most especially whenever the server is satisfying the read of the 5461 object's content from its cache, the server MAY cache access time 5462 updates and lazily write them to stable storage. It is also 5463 acceptable to give administrators of the server the option to disable 5464 time_access updates. 5466 5.8.2.38. Attribute 48: time_access_set 5468 Set the time of last access to the object. SETATTR use only. 5470 5.8.2.39.
Attribute 49: time_backup 5472 The time of last backup of the object. 5474 5.8.2.40. Attribute 50: time_create 5476 The time of creation of the object. This attribute does not have any 5477 relation to the traditional UNIX file attribute "ctime" or "change 5478 time". 5480 5.8.2.41. Attribute 51: time_delta 5482 Smallest useful server time granularity. 5484 5.8.2.42. Attribute 52: time_metadata 5486 The time of last metadata modification of the object. 5488 5.8.2.43. Attribute 53: time_modify 5490 The time of last modification to the object. 5492 5.8.2.44. Attribute 54: time_modify_set 5494 Set the time of last modification to the object. SETATTR use only. 5496 5.9. Interpreting owner and owner_group 5498 The RECOMMENDED attributes "owner" and "owner_group" (and also users 5499 and groups within the "acl" attribute) are represented in terms of a 5500 UTF-8 string. To avoid a representation that is tied to a particular 5501 underlying implementation at the client or server, the use of the 5502 UTF-8 string has been chosen. Note that section 6.1 of RFC2624 [44] 5503 provides additional rationale. It is expected that the client and 5504 server will have their own local representation of owner and 5505 owner_group that is used for local storage or presentation to the end 5506 user. Therefore, it is expected that when these attributes are 5507 transferred between the client and server that the local 5508 representation is translated to a syntax of the form "user@ 5509 dns_domain". This will allow for a client and server that do not use 5510 the same local representation the ability to translate to a common 5511 syntax that can be interpreted by both. 5513 Similarly, security principals may be represented in different ways 5514 by different security mechanisms. Servers normally translate these 5515 representations into a common format, generally that used by local 5516 storage, to serve as a means of identifying the users corresponding 5517 to these security principals. When these local identifiers are 5518 translated to the form of the owner attribute, associated with files 5519 created by such principals they identify, in a common format, the 5520 users associated with each corresponding set of security principals. 5522 The translation used to interpret owner and group strings is not 5523 specified as part of the protocol. This allows various solutions to 5524 be employed. For example, a local translation table may be consulted 5525 that maps between a numeric identifier to the user@dns_domain syntax. 5526 A name service may also be used to accomplish the translation. A 5527 server may provide a more general service, not limited by any 5528 particular translation (which would only translate a limited set of 5529 possible strings) by storing the owner and owner_group attributes in 5530 local storage without any translation or it may augment a translation 5531 method by storing the entire string for attributes for which no 5532 translation is available while using the local representation for 5533 those cases in which a translation is available. 5535 Servers that do not provide support for all possible values of the 5536 owner and owner_group attributes, SHOULD return an error 5537 (NFS4ERR_BADOWNER) when a string is presented that has no 5538 translation, as the value to be set for a SETATTR of the owner, 5539 owner_group, or acl attributes. 
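As a non-normative sketch of one such translation, a server might split the string at "@", require a domain it recognizes, and map the user portion through its local account database, returning NFS4ERR_BADOWNER when no mapping exists; the helper functions below are hypothetical.

   #include <stddef.h>
   #include <string.h>
   #include <sys/types.h>

   /* Hypothetical translation of an owner string of the form
    * "user@dns_domain" into a local uid.  Returns an NFSv4.1 status
    * code (NFS4_OK or NFS4ERR_BADOWNER). */
   static int owner_to_uid(const char *owner, uid_t *uid_out)
   {
       const char *at = strchr(owner, '@');

       if (at == NULL)
           return NFS4ERR_BADOWNER;     /* no translation available   */
       if (!domain_is_recognized(at + 1))       /* hypothetical        */
           return NFS4ERR_BADOWNER;
       if (!lookup_local_user(owner, (size_t)(at - owner), uid_out))
           return NFS4ERR_BADOWNER;     /* unknown user in the domain */
       return NFS4_OK;
   }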
When a server does accept an owner 5540 or owner_group value as valid on a SETATTR (and similarly for the 5541 owner and group strings in an acl), it is promising to return that 5542 same string when a corresponding GETATTR is done. Configuration 5543 changes (including changes from the mapping of the string to the 5544 local representation) and ill-constructed name translations (those 5545 that contain aliasing) may make that promise impossible to honor. 5546 Servers should make appropriate efforts to avoid a situation in which 5547 these attributes have their values changed when no real change to 5548 ownership has occurred. 5550 The "dns_domain" portion of the owner string is meant to be a DNS 5551 domain name. For example, user@example.org. Servers should accept 5552 as valid a set of users for at least one domain. A server may treat 5553 other domains as having no valid translations. A more general 5554 service is provided when a server is capable of accepting users for 5555 multiple domains, or for all domains, subject to security 5556 constraints. 5558 In the case where there is no translation available to the client or 5559 server, the attribute value will be constructed without the "@". 5560 Therefore, the absence of the @ from the owner or owner_group 5561 attribute signifies that no translation was available at the sender 5562 and that the receiver of the attribute should not use that string as 5563 a basis for translation into its own internal format. Even though 5564 the attribute value can not be translated, it may still be useful. 5565 In the case of a client, the attribute string may be used for local 5566 display of ownership. 5568 To provide a greater degree of compatibility with NFSv3, which 5569 identified users and groups by 32-bit unsigned user identifiers and 5570 group identifiers, owner and group strings that consist of decimal 5571 numeric values with no leading zeros can be given a special 5572 interpretation by clients and servers which choose to provide such 5573 support. The receiver may treat such a user or group string as 5574 representing the same user as would be represented by an NFSv3 uid or 5575 gid having the corresponding numeric value. A server is not 5576 obligated to accept such a string, but may return an NFS4ERR_BADOWNER 5577 instead. To avoid this mechanism being used to subvert user and 5578 group translation, so that a client might pass all of the owners and 5579 groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER 5580 error when there is a valid translation for the user or owner 5581 designated in this way. In that case, the client must use the 5582 appropriate name@domain string and not the special form for 5583 compatibility. 5585 The owner string "nobody" may be used to designate an anonymous user, 5586 which will be associated with a file created by a security principal 5587 that cannot be mapped through normal means to the owner attribute. 5588 Users and implementations of NFSv4.1 SHOULD NOT use "nobody" to 5589 designate a real user whose access is not anonymous. 5591 5.10. Character Case Attributes 5593 With respect to the case_insensitive and case_preserving attributes, 5594 each UCS-4 character (which UTF-8 encodes) can be mapped according to 5595 Appendix B.2 of RFC3454 [18]. For general character handling and 5596 internationalization issues, see Section 14. 5598 5.11. 
Directory Notification Attributes 5600 As described in Section 18.39, the client can request a minimum delay 5601 for notifications of changes to attributes, but the server is free to 5602 ignore what the client requests. The client can determine in advance 5603 what notification delays the server will accept by issuing a GETATTR 5604 for either or both of two directory notification attributes. When 5605 the client calls the GET_DIR_DELEGATION operation and asks for 5606 attribute change notifications, it should request notification delays 5607 that are no less than the values in the server-provided attributes. 5609 5.11.1. Attribute 56: dir_notif_delay 5611 The dir_notif_delay attribute is the minimum number of seconds the 5612 server will delay before notifying the client of a change to the 5613 directory's attributes. 5615 5.11.2. Attribute 57: dirent_notif_delay 5617 The dirent_notif_delay attribute is the minimum number of seconds the 5618 server will delay before notifying the client of a change to a file 5619 object that has an entry in the directory. 5621 5.12. pNFS Attribute Definitions 5623 5.12.1. Attribute 62: fs_layout_type 5625 The fs_layout_type attribute (see Section 3.3.13) applies to a file 5626 system and indicates what layout types are supported by the file 5627 system. When the client encounters a new fsid, the client SHOULD 5628 obtain the value for the fs_layout_type attribute associated with the 5629 new file system. This attribute is used by the client to determine 5630 if the layout types supported by the server match any of the client's 5631 supported layout types. 5633 5.12.2. Attribute 66: layout_alignment 5635 When a client holds layouts on files of a file system, the 5636 layout_alignment attribute indicates the preferred alignment for I/O 5637 to files on that file system. Where possible, the client should send 5638 READ and WRITE operations with offsets that are whole multiples of 5639 the layout_alignment attribute. 5641 5.12.3. Attribute 65: layout_blksize 5643 When a client holds layouts on files of a file system, the 5644 layout_blksize attribute indicates the preferred block size for I/O 5645 to files on that file system. Where possible, the client should send 5646 READ operations with a count argument that is a whole multiple of 5647 layout_blksize, and WRITE operations with a data argument of size 5648 that is a whole multiple of layout_blksize. 5650 5.12.4. Attribute 63: layout_hint 5652 The layout_hint attribute (see Section 3.3.19) may be set on newly 5653 created files to influence the metadata server's choice for the 5654 file's layout. If possible, this attribute is one of those set in 5655 the initial attributes within the OPEN operation. The metadata 5656 server may choose to ignore this attribute. The layout_hint 5657 attribute is a sub-set of the layout structure returned by LAYOUTGET. 5658 For example, instead of specifying particular devices, this would be 5659 used to suggest the stripe width of a file. The server 5660 implementation determines which fields within the layout will be 5661 used. 5663 5.12.5. Attribute 64: layout_type 5665 This attribute lists the layout type(s) available for a file. The 5666 value returned by the server is for informational purposes only. The 5667 client will use the LAYOUTGET operation to obtain the information 5668 needed in order to perform I/O. For example, the specific device 5669 information for the file and its layout. 5671 5.12.6. 
Attribute 68: mdsthreshold 5673 This attribute is a server provided hint used to communicate to the 5674 client when it is more efficient to send READ and WRITE operations to 5675 the metadata server or the data server. The two types of thresholds 5676 described are file size thresholds and I/O size thresholds. If a 5677 file's size is smaller than the file size threshold, data accesses 5678 SHOULD be sent to the metadata server. If an I/O request has a 5679 length that is below the I/O size threshold, the I/O SHOULD be sent 5680 to the metadata server. Each threshold type is specified separately 5681 for READ and WRITE. 5683 The server MAY provide both types of thresholds for a file. If both 5684 file size and I/O size are provided, the client SHOULD reach or 5685 exceed both thresholds before issuing its READ or WRITE requests to 5686 the data server. Alternatively, if only one of the specified 5687 thresholds are reached or exceeded, the I/O requests are sent to the 5688 metadata server. 5690 For each threshold type, a value of 0 indicates no READ or WRITE 5691 should be sent to the metadata server, while a value of all 1s 5692 indicates all READS or WRITES should be sent to the metadata server. 5694 The attribute is available on a per filehandle basis. If the current 5695 filehandle refers to a non-pNFS file or directory, the metadata 5696 server should return an attribute that is representative of the 5697 filehandle's file system. It is suggested that this attribute is 5698 queried as part of the OPEN operation. Due to dynamic system 5699 changes, the client should not assume that the attribute will remain 5700 constant for any specific time period, thus it should be periodically 5701 refreshed. 5703 5.13. Retention Attributes 5705 Retention is a concept whereby a file object can be placed in an 5706 immutable, undeletable, unrenamable state for a fixed or infinite 5707 duration of time. Once in this "retained" state, the file cannot be 5708 moved out of the state until the duration of retention has been 5709 reached. 5711 When retention is enabled, retention MUST extend to the data of the 5712 file, and the name of file. The server MAY extend retention to any 5713 other property of the file, including any subset of REQUIRED, 5714 RECOMMENDED, and named attributes, with the exceptions noted in this 5715 section. 5717 Servers MAY support or not support retention on any file object type. 5719 The five retention attributes are explained in the next subsections. 5721 5.13.1. Attribute 69: retention_get 5723 If retention is enabled for the associated file, this attribute's 5724 value represents the retention begin time of the file object. This 5725 attribute's value is only readable with the GETATTR operation and 5726 MUST NOT be modified by the SETATTR operation (Section 5.5). The 5727 value of the attribute consists of: 5729 const RET4_DURATION_INFINITE = 0xffffffffffffffff; 5730 struct retention_get4 { 5731 uint64_t rg_duration; 5732 nfstime4 rg_begin_time<1>; 5733 }; 5735 The field rg_duration is the duration in seconds indicating how long 5736 the file will be retained once retention is enabled. The field 5737 rg_begin_time is an array of up to one absolute time value. If the 5738 array is zero length, no beginning retention time has been 5739 established, and retention is not enabled. If rg_duration is equal 5740 to RET4_DURATION_INFINITE, the file, once retention is enabled, will 5741 be retained for an infinite duration. 
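As a non-normative illustration of how a client might interpret this attribute, the following C sketch mirrors the XDR above, treating retention as enabled only when a begin time is present; the structure is a simplified local view rather than the on-the-wire form.

   #include <stdbool.h>
   #include <stdint.h>

   #define RET4_DURATION_INFINITE 0xffffffffffffffffULL

   /* Simplified local view of retention_get4: the begin time is
    * optional (an array of zero or one element in the XDR). */
   struct retention_state {
       uint64_t duration;          /* seconds; counts down once enabled */
       bool     has_begin_time;    /* true if rg_begin_time is present  */
       int64_t  begin_seconds;     /* valid only if has_begin_time      */
   };

   /* Retention is enabled only when a begin time was returned. */
   static bool retention_enabled(const struct retention_state *rs)
   {
       return rs->has_begin_time;
   }

   /* True if the file is retained forever once retention is enabled. */
   static bool retention_is_infinite(const struct retention_state *rs)
   {
       return rs->duration == RET4_DURATION_INFINITE;
   }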
5743 If rg_duration is zero (including the case where it has counted down to zero), rg_begin_time will be of 5744 zero length, and retention is not, or is no longer, enabled. 5746 5.13.2. Attribute 70: retention_set 5748 This attribute is used to set the retention duration and optionally 5749 enable retention for the associated file object. This attribute is 5750 only modifiable via the SETATTR operation and MUST NOT be retrieved 5751 by the GETATTR operation (Section 5.5). This attribute corresponds 5752 to retention_get. The value of the attribute consists of: 5754 struct retention_set4 { 5755 bool rs_enable; 5756 uint64_t rs_duration<1>; 5757 }; 5759 If the client sets rs_enable to TRUE, then it is enabling retention 5760 on the file object with the begin time of retention starting from the 5761 server's current time and date. The duration of the retention can 5762 also be provided if the rs_duration array is of length one. The 5763 duration is the time in seconds from the begin time of retention, and 5764 if set to RET4_DURATION_INFINITE, the file is to be retained forever. 5765 If retention is enabled, with no duration specified in either this 5766 SETATTR or a previous SETATTR, the duration defaults to zero seconds. 5767 The server MAY restrict the enabling of retention or the duration of 5768 retention on the basis of the ACE4_WRITE_RETENTION ACL permission. 5769 The enabling of retention MUST NOT prevent the enabling of event- 5770 based retention or the modification of the retention_hold attribute. 5772 The following rules apply to both the retention_set and retentevt_set 5773 attributes. 5775 o As long as retention is not enabled, the client is permitted to 5776 decrease the duration. 5778 o The duration can always be set to an equal or higher value, even 5779 if retention is enabled. Note that once retention is enabled, the 5780 actual duration (as returned by the retention_get or retentevt_get 5781 attributes, see Section 5.13.1 or Section 5.13.3) is constantly 5782 counting down to zero (one unit per second), unless the duration 5783 was set to RET4_DURATION_INFINITE. Thus, it will not be possible 5784 for the client to precisely extend the duration on a file that has 5785 retention enabled. 5787 o While retention is enabled, attempts to disable retention or 5788 decrease the retention's duration MUST fail with the error 5789 NFS4ERR_INVAL. 5791 o If the principal attempting to change retention_set or 5792 retentevt_set does not have ACE4_WRITE_RETENTION permissions, the 5793 attempt MUST fail with NFS4ERR_ACCESS. 5795 5.13.3. Attribute 71: retentevt_get 5797 Get the event-based retention duration, and if enabled, the event- 5798 based retention begin time of the file object. This attribute is 5799 like retention_get but refers to event-based retention. The event 5800 that triggers event-based retention is not defined by the NFSv4.1 5801 specification. 5803 5.13.4. Attribute 72: retentevt_set 5805 Set the event-based retention duration, and optionally enable event- 5806 based retention on the file object. This attribute corresponds to 5807 retentevt_get and is like retention_set, but refers to event-based 5808 retention. When event-based retention is set, the file MUST be 5809 retained even if non-event-based retention has been set, and the 5810 duration of non-event-based retention has been reached. Conversely, 5811 when non-event-based retention has been set, the file MUST be 5812 retained even if event-based retention has been set, and the duration 5813 of event-based retention has been reached.
The server MAY restrict 5814 the enabling of event-based retention or the duration of event-based 5815 retention on the basis of the ACE4_WRITE_RETENTION ACL permission. 5816 The enabling of event-based retention MUST NOT prevent the enabling 5817 of non-event-based retention or the modification of the 5818 retention_hold attribute. 5820 5.13.5. Attribute 73: retention_hold 5822 Get or set administrative retention holds, one hold per bit position. 5824 This attribute allows up to 64 administrative holds, one hold per 5825 bit of the attribute. If retention_hold is not zero, then the file 5826 MUST NOT be deleted, renamed, or modified, even if the duration of 5827 enabled event-based or non-event-based retention has been reached. The 5828 server MAY restrict the modification of retention_hold on the basis 5829 of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of 5830 administrative retention holds does not prevent the enabling of 5831 event-based or non-event-based retention. 5833 If the principal attempting to change retention_hold does not have 5834 ACE4_WRITE_RETENTION_HOLD permissions, the attempt MUST fail with 5835 NFS4ERR_ACCESS. 5837 6. Access Control Attributes 5839 Access Control Lists (ACLs) are file attributes that specify fine- 5840 grained access control. This chapter covers the "acl", "dacl", 5841 "sacl", "aclsupport", "mode", and "mode_set_masked" file attributes, and 5842 their interactions. Note that file attributes may apply to any file 5843 system object. 5845 6.1. Goals 5847 ACLs and modes represent two well-established models for specifying 5848 permissions. This chapter specifies requirements that attempt to 5849 meet the following goals: 5851 o If a server supports the mode attribute, it should provide 5852 reasonable semantics to clients that only set and retrieve the 5853 mode attribute. 5855 o If a server supports ACL attributes, it should provide reasonable 5856 semantics to clients that only set and retrieve those attributes. 5858 o On servers that support the mode attribute, if ACL attributes have 5859 never been set on an object, via inheritance or explicitly, the 5860 behavior should be traditional UNIX-like behavior. 5862 o On servers that support the mode attribute, if the ACL attributes 5863 have been previously set on an object, either explicitly or via 5864 inheritance: 5866 * Setting only the mode attribute should effectively control the 5867 traditional UNIX-like permissions of read, write, and execute 5868 on owner, owner_group, and other. 5870 * Setting only the mode attribute should provide reasonable 5871 security. For example, setting a mode of 000 should be enough 5872 to ensure that future opens for read or write by any principal 5873 fail, regardless of a previously existing or inherited ACL. 5875 o NFSv4.1 may introduce different semantics relating to the mode and 5876 ACL attributes, but it does not render invalid any previously 5877 existing implementations. Additionally, this chapter provides 5878 clarifications based on previous implementations and discussions 5879 around them. 5881 o On servers that support both the mode and the acl or dacl 5882 attributes, the server must keep the two consistent with each 5883 other. The value of the mode attribute (with the exception of the 5884 three high-order bits described in Section 6.2.4) must be 5885 determined entirely by the value of the ACL, so that use of the 5886 mode is never required for anything other than setting the three 5887 high-order bits.
See Section 6.4.1 for exact requirements. 5889 o When a mode attribute is set on an object, the ACL attributes may 5890 need to be modified so as to not conflict with the new mode. In 5891 such cases, it is desirable that the ACL keep as much information 5892 as possible. This includes information about inheritance, AUDIT 5893 and ALARM ACEs, and permissions granted and denied that do not 5894 conflict with the new mode. 5896 6.2. File Attributes Discussion 5898 6.2.1. Attribute 12: acl 5900 The NFSv4.1 ACL attribute contains an array of access control entries 5901 (ACEs) that are associated with the file system object. Although the 5902 client can read and write the acl attribute, the server is 5903 responsible for using the ACL to perform access control. The client 5904 can use the OPEN or ACCESS operations to check access without 5905 modifying or reading data or metadata. 5907 The NFS ACE structure is defined as follows: 5909 typedef uint32_t acetype4; 5911 typedef uint32_t aceflag4; 5913 typedef uint32_t acemask4; 5915 struct nfsace4 { 5916 acetype4 type; 5917 aceflag4 flag; 5918 acemask4 access_mask; 5919 utf8str_mixed who; 5920 }; 5922 To determine if a request succeeds, the server processes each nfsace4 5923 entry in order. Only ACEs which have a "who" that matches the 5924 requester are considered. Each ACE is processed until all of the 5925 bits of the requester's access have been ALLOWED. Once a bit (see 5926 below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer 5927 considered in the processing of later ACEs. If an ACCESS_DENIED_ACE 5928 is encountered where the requester's access still has unALLOWED bits 5929 in common with the "access_mask" of the ACE, the request is denied. 5931 When the ACL is fully processed, if there are bits in the requester's 5932 mask that have not been ALLOWED or DENIED, access is denied. 5934 Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do 5935 not affect a requester's access, and instead are for triggering 5936 events as a result of a requester's access attempt. Therefore, AUDIT 5937 and ALARM ACEs are processed only after processing ALLOW and DENY 5938 ACEs. 5940 The NFSv4.1 ACL model is quite rich. Some server platforms may 5941 provide access control functionality that goes beyond the UNIX-style 5942 mode attribute, but which is not as rich as the NFS ACL model. So 5943 that users can take advantage of this more limited functionality, the 5944 server may support the acl attributes by mapping between its ACL 5945 model and the NFSv4.1 ACL model. Servers must ensure that the ACL 5946 they actually store or enforce is at least as strict as the NFSv4 ACL 5947 that was set. It is tempting to accomplish this by rejecting any ACL 5948 that falls outside the small set that can be represented accurately. 5949 However, such an approach can render ACLs unusable without special 5950 client-side knowledge of the server's mapping, which defeats the 5951 purpose of having a common NFSv4 ACL protocol. Therefore servers 5952 should accept every ACL that they can without compromising security. 5953 To help accomplish this, servers may make a special exception, in the 5954 case of unsupported permission bits, to the rule that bits not 5955 ALLOWED or DENIED by an ACL must be denied. For example, a UNIX- 5956 style server might choose to silently allow read attribute 5957 permissions even though an ACL does not explicitly allow those 5958 permissions. 
(An ACL that explicitly denies permission to read 5959 attributes should still be rejected.) 5961 The situation is complicated by the fact that a server may have 5962 multiple modules that enforce ACLs. For example, the enforcement for 5963 NFSv4.1 access may be different from, but not weaker than, the 5964 enforcement for local access, and both may be different from the 5965 enforcement for access through other protocols such as SMB. So it 5966 may be useful for a server to accept an ACL even if not all of its 5967 modules are able to support it. 5969 The guiding principle with regard to NFSv4 access is that the server 5970 must not accept ACLs that appear to make access to the file more 5971 restrictive than it really is. 5973 6.2.1.1. ACE Type 5975 The constants used for the type field (acetype4) are as follows: 5977 const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000; 5978 const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001; 5979 const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002; 5980 const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003; 5982 Only the ALLOWED and DENIED bits types may be used in the dacl 5983 attribute, and only the AUDIT and ALARM bits may be used in the sacl 5984 attribute. All four are permitted in the acl attribute. 5986 +------------------------------+--------------+---------------------+ 5987 | Value | Abbreviation | Description | 5988 +------------------------------+--------------+---------------------+ 5989 | ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW | Explicitly grants | 5990 | | | the access defined | 5991 | | | in acemask4 to the | 5992 | | | file or directory. | 5993 | ACE4_ACCESS_DENIED_ACE_TYPE | DENY | Explicitly denies | 5994 | | | the access defined | 5995 | | | in acemask4 to the | 5996 | | | file or directory. | 5997 | ACE4_SYSTEM_AUDIT_ACE_TYPE | AUDIT | LOG (in a system | 5998 | | | dependent way) any | 5999 | | | access attempt to a | 6000 | | | file or directory | 6001 | | | which uses any of | 6002 | | | the access methods | 6003 | | | specified in | 6004 | | | acemask4. | 6005 | ACE4_SYSTEM_ALARM_ACE_TYPE | ALARM | Generate a system | 6006 | | | ALARM (system | 6007 | | | dependent) when any | 6008 | | | access attempt is | 6009 | | | made to a file or | 6010 | | | directory for the | 6011 | | | access methods | 6012 | | | specified in | 6013 | | | acemask4. | 6014 +------------------------------+--------------+---------------------+ 6016 The "Abbreviation" column denotes how the types will be referred to 6017 throughout the rest of this chapter. 6019 6.2.1.2. Attribute 13: aclsupport 6021 A server need not support all of the above ACE types. This attribute 6022 indicates which ACE types are supported for the current file system. 6023 The bitmask constants used to represent the above definitions within 6024 the aclsupport attribute are as follows: 6026 const ACL4_SUPPORT_ALLOW_ACL = 0x00000001; 6027 const ACL4_SUPPORT_DENY_ACL = 0x00000002; 6028 const ACL4_SUPPORT_AUDIT_ACL = 0x00000004; 6029 const ACL4_SUPPORT_ALARM_ACL = 0x00000008; 6031 Servers which support either the ALLOW or DENY ACE type SHOULD 6032 support both ALLOW and DENY ACE types. 6034 Clients should not attempt to set an ACE unless the server claims 6035 support for that ACE type. If the server receives a request to set 6036 an ACE that it cannot store, it MUST reject the request with 6037 NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE 6038 that it can store but cannot enforce, the server SHOULD reject the 6039 request with NFS4ERR_ATTRNOTSUPP. 
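A non-normative sketch of the corresponding client-side check follows. It assumes the XDR const definitions above are available as C macros (as produced by rpcgen) and that the aclsupport attribute has already been fetched with GETATTR; the function name is hypothetical.

   #include <stdbool.h>
   #include <stdint.h>

   /* Return true if the server has claimed support for ACEs of the
    * given type in its aclsupport attribute. */
   static bool
   ace_type_supported(uint32_t aclsupport, acetype4 type)
   {
           if (type == ACE4_ACCESS_ALLOWED_ACE_TYPE)
                   return (aclsupport & ACL4_SUPPORT_ALLOW_ACL) != 0;
           if (type == ACE4_ACCESS_DENIED_ACE_TYPE)
                   return (aclsupport & ACL4_SUPPORT_DENY_ACL) != 0;
           if (type == ACE4_SYSTEM_AUDIT_ACE_TYPE)
                   return (aclsupport & ACL4_SUPPORT_AUDIT_ACL) != 0;
           if (type == ACE4_SYSTEM_ALARM_ACE_TYPE)
                   return (aclsupport & ACL4_SUPPORT_ALARM_ACL) != 0;
           return false;
   }

A client would skip (or expect NFS4ERR_ATTRNOTSUPP for) any SETATTR of an ACL containing an ACE type for which this check fails.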
6041 Support for any of the ACL attributes is optional (albeit, 6042 RECOMMENDED). However, a server that supports either of the new ACL 6043 attributes (dacl or sacl) MUST allow use of the new ACL attributes to 6044 access all of the ACE types which it supports. In other words, if 6045 such a server supports ALLOW or DENY ACEs, then it MUST support the 6046 dacl attribute, and if it supports AUDIT or ALARM ACEs, then it MUST 6047 support the sacl attribute. 6049 6.2.1.3. ACE Access Mask 6051 The bitmask constants used for the access mask field are as follows: 6053 const ACE4_READ_DATA = 0x00000001; 6054 const ACE4_LIST_DIRECTORY = 0x00000001; 6055 const ACE4_WRITE_DATA = 0x00000002; 6056 const ACE4_ADD_FILE = 0x00000002; 6057 const ACE4_APPEND_DATA = 0x00000004; 6058 const ACE4_ADD_SUBDIRECTORY = 0x00000004; 6059 const ACE4_READ_NAMED_ATTRS = 0x00000008; 6060 const ACE4_WRITE_NAMED_ATTRS = 0x00000010; 6061 const ACE4_EXECUTE = 0x00000020; 6062 const ACE4_DELETE_CHILD = 0x00000040; 6063 const ACE4_READ_ATTRIBUTES = 0x00000080; 6064 const ACE4_WRITE_ATTRIBUTES = 0x00000100; 6065 const ACE4_WRITE_RETENTION = 0x00000200; 6066 const ACE4_WRITE_RETENTION_HOLD = 0x00000400; 6068 const ACE4_DELETE = 0x00010000; 6069 const ACE4_READ_ACL = 0x00020000; 6070 const ACE4_WRITE_ACL = 0x00040000; 6071 const ACE4_WRITE_OWNER = 0x00080000; 6072 const ACE4_SYNCHRONIZE = 0x00100000; 6073 Note that some masks have coincident values, for example, 6074 ACE4_READ_DATA and ACE4_LIST_DIRECTORY. The mask entries 6075 ACE4_LIST_DIRECTORY, ACE4_ADD_FILE, and ACE4_ADD_SUBDIRECTORY are 6076 intended to be used with directory objects, while ACE4_READ_DATA, 6077 ACE4_WRITE_DATA, and ACE4_APPEND_DATA are intended to be used with 6078 non-directory objects. 6080 6.2.1.3.1. Discussion of Mask Attributes 6082 ACE4_READ_DATA 6084 Operation(s) affected: 6086 READ 6088 OPEN 6090 Discussion: 6092 Permission to read the data of the file. 6094 Servers SHOULD allow a user the ability to read the data of the 6095 file when only the ACE4_EXECUTE access mask bit is allowed. 6097 ACE4_LIST_DIRECTORY 6099 Operation(s) affected: 6101 READDIR 6103 Discussion: 6105 Permission to list the contents of a directory. 6107 ACE4_WRITE_DATA 6109 Operation(s) affected: 6111 WRITE 6113 OPEN 6115 SETATTR of size 6117 Discussion: 6119 Permission to modify a file's data. 6121 ACE4_ADD_FILE 6123 Operation(s) affected: 6125 CREATE 6127 LINK 6129 OPEN 6131 RENAME 6133 Discussion: 6135 Permission to add a new file in a directory. The CREATE 6136 operation is affected when nfs_ftype4 is NF4LNK, NF4BLK, 6137 NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because it 6138 is covered by ACE4_ADD_SUBDIRECTORY.) OPEN is affected when 6139 used to create a regular file. LINK and RENAME are always 6140 affected. 6142 ACE4_APPEND_DATA 6144 Operation(s) affected: 6146 WRITE 6148 OPEN 6150 SETATTR of size 6152 Discussion: 6154 The ability to modify a file's data, but only starting at EOF. 6155 This allows for the notion of append-only files, by allowing 6156 ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to the same user 6157 or group. If a file has an ACL such as the one described above 6158 and a WRITE request is made for somewhere other than EOF, the 6159 server SHOULD return NFS4ERR_ACCESS. 6161 ACE4_ADD_SUBDIRECTORY 6163 Operation(s) affected: 6165 CREATE 6167 RENAME 6169 Discussion: 6171 Permission to create a subdirectory in a directory. The CREATE 6172 operation is affected when nfs_ftype4 is NF4DIR. The RENAME 6173 operation is always affected. 
6175 ACE4_READ_NAMED_ATTRS 6177 Operation(s) affected: 6179 OPENATTR 6181 Discussion: 6183 Permission to read the named attributes of a file or to lookup 6184 the named attributes directory. OPENATTR is affected when it 6185 is not used to create a named attribute directory. This is 6186 when 1.) createdir is TRUE, but a named attribute directory 6187 already exists, or 2.) createdir is FALSE. 6189 ACE4_WRITE_NAMED_ATTRS 6191 Operation(s) affected: 6193 OPENATTR 6195 Discussion: 6197 Permission to write the named attributes of a file or to create 6198 a named attribute directory. OPENATTR is affected when it is 6199 used to create a named attribute directory. This is when 6200 createdir is TRUE and no named attribute directory exists. The 6201 ability to check whether or not a named attribute directory 6202 exists depends on the ability to look it up, therefore, users 6203 also need the ACE4_READ_NAMED_ATTRS permission in order to 6204 create a named attribute directory. 6206 ACE4_EXECUTE 6208 Operation(s) affected: 6210 READ 6212 OPEN 6214 REMOVE 6216 RENAME 6218 LINK 6220 CREATE 6222 Discussion: 6224 Permission to execute a file. 6226 Servers SHOULD allow a user the ability to read the data of the 6227 file when only the ACE4_EXECUTE access mask bit is allowed. 6228 This is because there is no way to execute a file without 6229 reading the contents. Though a server may treat ACE4_EXECUTE 6230 and ACE4_READ_DATA bits identically when deciding to permit a 6231 READ operation, it SHOULD still allow the two bits to be set 6232 independently in ACLs, and MUST distinguish between them when 6233 replying to ACCESS operations. In particular, servers SHOULD 6234 NOT silently turn on one of the two bits when the other is set, 6235 as that would make it impossible for the client to correctly 6236 enforce the distinction between read and execute permissions. 6238 As an example, following a SETATTR of the following ACL: 6240 nfsuser:ACE4_EXECUTE:ALLOW 6242 A subsequent GETATTR of ACL for that file SHOULD return: 6244 nfsuser:ACE4_EXECUTE:ALLOW 6246 Rather than: 6248 nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW 6250 ACE4_EXECUTE 6252 Operation(s) affected: 6254 LOOKUP 6256 Discussion: 6258 Permission to traverse/search a directory. 6260 ACE4_DELETE_CHILD 6262 Operation(s) affected: 6264 REMOVE 6266 RENAME 6268 Discussion: 6270 Permission to delete a file or directory within a directory. 6271 See Section 6.2.1.3.2 for information on ACE4_DELETE and 6272 ACE4_DELETE_CHILD interact. 6274 ACE4_READ_ATTRIBUTES 6276 Operation(s) affected: 6278 GETATTR of file system object attributes 6280 VERIFY 6282 NVERIFY 6284 READDIR 6286 Discussion: 6288 The ability to read basic attributes (non-ACLs) of a file. On 6289 a UNIX system, basic attributes can be thought of as the stat 6290 level attributes. Allowing this access mask bit would mean the 6291 entity can execute "ls -l" and stat. If a READDIR operation 6292 requests attributes, this mask must be allowed for the READDIR 6293 to succeed. 6295 ACE4_WRITE_ATTRIBUTES 6297 Operation(s) affected: 6299 SETATTR of time_access_set, time_backup, 6301 time_create, time_modify_set, mimetype, hidden, system 6303 Discussion: 6305 Permission to change the times associated with a file or 6306 directory to an arbitrary value. Also permission to change the 6307 mimetype, hidden and system attributes. A user having 6308 ACE4_WRITE_DATA or ACE4_WRITE_ATTRIBUTES will be allowed to set 6309 the times associated with a file to the current server time. 
6311 ACE4_WRITE_RETENTION 6313 Operation(s) affected: 6315 SETATTR of retention_set, retentevt_set. 6317 Discussion: 6319 Permission to modify the durations of event-based and non-event-based 6320 retention. Also permission to enable event-based and non-event-based 6321 retention. A server MAY behave such that setting 6322 ACE4_WRITE_ATTRIBUTES allows ACE4_WRITE_RETENTION. 6324 ACE4_WRITE_RETENTION_HOLD 6326 Operation(s) affected: 6328 SETATTR of retention_hold. 6330 Discussion: 6332 Permission to modify the administrative retention holds. A 6333 server MAY map ACE4_WRITE_ATTRIBUTES to 6334 ACE4_WRITE_RETENTION_HOLD. 6336 ACE4_DELETE 6338 Operation(s) affected: 6340 REMOVE 6342 Discussion: 6344 Permission to delete the file or directory. See 6345 Section 6.2.1.3.2 for information on how ACE4_DELETE and 6346 ACE4_DELETE_CHILD interact. 6348 ACE4_READ_ACL 6350 Operation(s) affected: 6352 GETATTR of acl, dacl, or sacl 6354 NVERIFY 6356 VERIFY 6358 Discussion: 6360 Permission to read the ACL. 6362 ACE4_WRITE_ACL 6364 Operation(s) affected: 6366 SETATTR of acl and mode 6368 Discussion: 6370 Permission to write the acl and mode attributes. 6372 ACE4_WRITE_OWNER 6374 Operation(s) affected: 6376 SETATTR of owner and owner_group 6378 Discussion: 6380 Permission to write the owner and owner_group attributes. On 6381 UNIX systems, this is the ability to execute chown() and 6382 chgrp(). 6384 ACE4_SYNCHRONIZE 6386 Operation(s) affected: 6388 NONE 6390 Discussion: 6392 Permission to use the file object as a synchronization 6393 primitive for interprocess communication. This permission is 6394 not enforced or interpreted by the NFSv4.1 server on behalf of 6395 the client. 6397 Typically, the ACE4_SYNCHRONIZE permission is only meaningful 6398 on local file systems, i.e., file systems not accessed via 6399 NFSv4.1. The reason that the permission bit exists is that 6400 some operating environments, such as Windows, use 6401 ACE4_SYNCHRONIZE. 6403 For example, if a client copies a file that has 6404 ACE4_SYNCHRONIZE set from a local file system to an NFSv4.1 6405 server, and then later copies the file from the NFSv4.1 server 6406 to a local file system, it is likely that if ACE4_SYNCHRONIZE 6407 was set in the original file, the client will want it set in 6408 the second copy. The first copy will not have the permission 6409 set unless the NFSv4.1 server has the means to set the 6410 ACE4_SYNCHRONIZE bit. The second copy will not have the 6411 permission set unless the NFSv4.1 server has the means to 6412 retrieve the ACE4_SYNCHRONIZE bit. 6414 Server implementations need not provide the granularity of control 6415 that is implied by this list of masks. For example, POSIX-based 6416 systems might not distinguish ACE4_APPEND_DATA (the ability to append 6417 to a file) from ACE4_WRITE_DATA (the ability to modify existing 6418 contents); both masks would be tied to a single "write" permission 6419 [19]. When such a server returns attributes to the client, it would 6420 show both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the 6421 write permission is enabled. 6423 If a server receives a SETATTR request that it cannot accurately 6424 implement, it should err in the direction of more restricted access, 6425 except in the previously discussed cases of execute and read. For 6426 example, suppose a server cannot distinguish overwriting data from 6427 appending new data, as described in the previous paragraph.
If a 6428 client submits an ALLOW ACE where ACE4_APPEND_DATA is set but 6429 ACE4_WRITE_DATA is not (or vice versa), the server should either turn 6430 off ACE4_APPEND_DATA or reject the request with NFS4ERR_ATTRNOTSUPP. 6432 6.2.1.3.2. ACE4_DELETE vs. ACE4_DELETE_CHILD 6434 Two access mask bits govern the ability to delete a directory entry: 6435 ACE4_DELETE on the object itself (the "target"), and 6436 ACE4_DELETE_CHILD on the containing directory (the "parent"). 6438 Many systems also take the "sticky bit" (MODE4_SVTX) on a directory 6439 to allow unlink only to a user that owns either the target or the 6440 parent; on some such systems the decision also depends on whether the 6441 target is writable. 6443 Servers SHOULD allow unlink if either ACE4_DELETE is permitted on the 6444 target, or ACE4_DELETE_CHILD is permitted on the parent. (Note that 6445 this is true even if the parent or target explicitly denies one of 6446 these permissions.) 6448 If the ACLs in question neither explicitly ALLOW nor DENY either of 6449 the above, and if MODE4_SVTX is not set on the parent, then the 6450 server SHOULD allow the removal if and only if ACE4_ADD_FILE is 6451 permitted. In the case where MODE4_SVTX is set, the server may also 6452 require the remover to own either the parent or the target, or may 6453 require the target to be writable. 6455 This allows servers to support something close to traditional UNIX- 6456 like semantics, with ACE4_ADD_FILE taking the place of the write bit. 6458 6.2.1.4. ACE flag 6460 The bitmask constants used for the flag field are as follows: 6462 const ACE4_FILE_INHERIT_ACE = 0x00000001; 6463 const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; 6464 const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; 6465 const ACE4_INHERIT_ONLY_ACE = 0x00000008; 6466 const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; 6467 const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; 6468 const ACE4_IDENTIFIER_GROUP = 0x00000040; 6469 const ACE4_INHERITED_ACE = 0x00000080; 6471 A server need not support any of these flags. If the server supports 6472 flags that are similar to, but not exactly the same as, these flags, 6473 the implementation may define a mapping between the protocol-defined 6474 flags and the implementation-defined flags. 6476 For example, suppose a client tries to set an ACE with 6477 ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the 6478 server does not support any form of ACL inheritance, the server 6479 should reject the request with NFS4ERR_ATTRNOTSUPP. If the server 6480 supports a single "inherit ACE" flag that applies to both files and 6481 directories, the server may reject the request (i.e., requiring the 6482 client to set both the file and directory inheritance flags). The 6483 server may also accept the request and silently turn on the 6484 ACE4_DIRECTORY_INHERIT_ACE flag. 6486 6.2.1.4.1. Discussion of Flag Bits 6488 ACE4_FILE_INHERIT_ACE 6489 Any non-directory file in any sub-directory will get this ACE 6490 inherited. 6492 ACE4_DIRECTORY_INHERIT_ACE 6493 Can be placed on a directory and indicates that this ACE should be 6494 added to each new directory created. 6495 If this flag is set in an ACE in an ACL attribute to be set on a 6496 non-directory file system object, the operation attempting to set 6497 the ACL SHOULD fail with NFS4ERR_ATTRNOTSUPP. 
6499 ACE4_INHERIT_ONLY_ACE 6500 Can be placed on a directory but does not apply to the directory; 6501 ALLOW and DENY ACEs with this bit set do not affect access to the 6502 directory, and AUDIT and ALARM ACEs with this bit set do not 6503 trigger log or alarm events. Such ACEs only take effect once they 6504 are applied (with this bit cleared) to newly created files and 6505 directories as specified by the above two flags. 6506 If this flag is present on an ACE, but neither 6507 ACE4_DIRECTORY_INHERIT_ACE nor ACE4_FILE_INHERIT_ACE is present, 6508 then an operation attempting to set such an attribute SHOULD fail 6509 with NFS4ERR_ATTRNOTSUPP. 6511 ACE4_NO_PROPAGATE_INHERIT_ACE 6512 Can be placed on a directory. This flag tells the server that 6513 inheritance of this ACE should stop at newly created child 6514 directories. 6516 ACE4_INHERITED_ACE 6517 Indicates that this ACE is inherited from a parent directory. A 6518 server that supports automatic inheritance will place this flag on 6519 any ACEs inherited from the parent directory when creating a new 6520 object. Client applications will use this to perform automatic 6521 inheritance. Clients and servers MUST clear this bit in the acl 6522 attribute; it may only be used in the dacl and sacl attributes. 6524 ACE4_SUCCESSFUL_ACCESS_ACE_FLAG 6525 ACE4_FAILED_ACCESS_ACE_FLAG 6526 The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and 6527 ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits may be set only on 6528 ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE 6529 (ALARM) ACE types. If during the processing of the file's ACL, 6530 the server encounters an AUDIT or ALARM ACE that matches the 6531 principal attempting the OPEN, the server notes that fact, and the 6532 presence, if any, of the SUCCESS and FAILED flags encountered in 6533 the AUDIT or ALARM ACE. Once the server completes the ACL 6534 processing, it then notes if the operation succeeded or failed. 6535 If the operation succeeded, and if the SUCCESS flag was set for a 6536 matching AUDIT or ALARM ACE, then the appropriate AUDIT or ALARM 6537 event occurs. If the operation failed, and if the FAILED flag was 6538 set for the matching AUDIT or ALARM ACE, then the appropriate 6539 AUDIT or ALARM event occurs. Either or both of the SUCCESS or 6540 FAILED can be set, but if neither is set, the AUDIT or ALARM ACE 6541 is not useful. 6543 The previously described processing applies to ACCESS operations 6544 even when they return NFS4_OK. For the purposes of AUDIT and 6545 ALARM, we consider an ACCESS operation to be a "failure" if it 6546 fails to return a bit that was requested and supported. 6548 ACE4_IDENTIFIER_GROUP 6549 Indicates that the "who" refers to a GROUP as defined under UNIX 6550 or a GROUP ACCOUNT as defined under Windows. Clients and servers 6551 MUST ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who 6552 value equal to one of the special identifiers outlined in 6553 Section 6.2.1.5. 6555 6.2.1.5. ACE Who 6557 The "who" field of an ACE is an identifier that specifies the 6558 principal or principals to whom the ACE applies. It may refer to a 6559 user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying 6560 which. 6562 There are several special identifiers which need to be understood 6563 universally, rather than in the context of a particular DNS domain. 6564 Some of these identifiers cannot be understood when an NFS client 6565 accesses the server, but have meaning when a local process accesses 6566 the file. 
The ability to display and modify these permissions is 6567 permitted over NFS, even if none of the access methods on the server 6568 understands the identifiers. 6570 +---------------+--------------------------------------------------+ 6571 | Who | Description | 6572 +---------------+--------------------------------------------------+ 6573 | OWNER | The owner of the file | 6574 | GROUP | The group associated with the file. | 6575 | EVERYONE | The world, including the owner and owning group. | 6576 | INTERACTIVE | Accessed from an interactive terminal. | 6577 | NETWORK | Accessed via the network. | 6578 | DIALUP | Accessed as a dialup user to the server. | 6579 | BATCH | Accessed from a batch job. | 6580 | ANONYMOUS | Accessed without any authentication. | 6581 | AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS) | 6582 | SERVICE | Access from a system service. | 6583 +---------------+--------------------------------------------------+ 6585 Table 4 6587 To avoid conflict, these special identifiers are distinguished by an 6588 appended "@" and should appear in the form "xxxx@" (with no domain 6589 name after the "@"). For example: ANONYMOUS@. 6591 The ACE4_IDENTIFIER_GROUP flag MUST be ignored on entries with these 6592 special identifiers. When encoding entries with these special 6593 identifiers, the ACE4_IDENTIFIER_GROUP flag SHOULD be set to zero. 6595 6.2.1.5.1. Discussion of EVERYONE@ 6597 It is important to note that "EVERYONE@" is not equivalent to the 6598 UNIX "other" entity. This is because, by definition, UNIX "other" 6599 does not include the owner or owning group of a file. "EVERYONE@" 6600 means literally everyone, including the owner or owning group. 6602 6.2.2. Attribute 58: dacl 6604 The dacl attribute is like the acl attribute, but dacl allows just 6605 ALLOW and DENY ACEs. The dacl attribute supports automatic 6606 inheritance (see Section 6.4.3.2). 6608 6.2.3. Attribute 59: sacl 6610 The sacl attribute is like the acl attribute, but sacl allows just 6611 AUDIT and ALARM ACEs. The sacl attribute supports automatic 6612 inheritance (see Section 6.4.3.2). 6614 6.2.4. Attribute 33: mode 6616 The NFSv4.1 mode attribute is based on the UNIX mode bits. The 6617 following bits are defined: 6619 const MODE4_SUID = 0x800; /* set user id on execution */ 6620 const MODE4_SGID = 0x400; /* set group id on execution */ 6621 const MODE4_SVTX = 0x200; /* save text even after use */ 6622 const MODE4_RUSR = 0x100; /* read permission: owner */ 6623 const MODE4_WUSR = 0x080; /* write permission: owner */ 6624 const MODE4_XUSR = 0x040; /* execute permission: owner */ 6625 const MODE4_RGRP = 0x020; /* read permission: group */ 6626 const MODE4_WGRP = 0x010; /* write permission: group */ 6627 const MODE4_XGRP = 0x008; /* execute permission: group */ 6628 const MODE4_ROTH = 0x004; /* read permission: other */ 6629 const MODE4_WOTH = 0x002; /* write permission: other */ 6630 const MODE4_XOTH = 0x001; /* execute permission: other */ 6632 Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal 6633 identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and 6634 MODE4_XGRP apply to principals identified in the owner_group 6635 attribute but who are not identified in the owner attribute. Bits 6636 MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any principal that does 6637 not match that in the owner attribute, and does not have a group 6638 matching that of the owner_group attribute. 
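As a non-normative illustration of this partitioning, the following C sketch selects the triple of mode bits that governs a given requester. Whether the requester matches the owner or the owner_group is assumed to have been determined already by comparing credentials against those attributes; the function name is hypothetical.

   #include <stdbool.h>
   #include <stdint.h>

   /* Return the three mode bits that apply to this requester. */
   static uint32_t
   applicable_mode_bits(uint32_t mode, bool is_owner, bool in_owning_group)
   {
           if (is_owner)
                   return mode & (MODE4_RUSR | MODE4_WUSR | MODE4_XUSR);
           if (in_owning_group)
                   return mode & (MODE4_RGRP | MODE4_WGRP | MODE4_XGRP);
           return mode & (MODE4_ROTH | MODE4_WOTH | MODE4_XOTH);
   }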
6640 Bits within the mode other than those specified above are not defined 6641 by this protocol. A server MUST NOT return bits other than those 6642 defined above in a GETATTR or READDIR operation, and it MUST return 6643 NFS4ERR_INVAL if bits other than those defined above are set in a 6644 SETATTR, CREATE, OPEN, VERIFY or NVERIFY operation. 6646 6.2.5. Attribute 74: mode_set_masked 6648 The mode_set_masked attribute is a write-only attribute that allows 6649 individual bits in the mode attribute to be set or reset, without 6650 changing others. It allows, for example, the bits MODE4_SUID, 6651 MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified 6652 any of the nine low-order mode bits devoted to permissions. 6654 In such instances that the nine low-order bits are left unmodified, 6655 then neither the acl nor the dacl attribute should be automatically 6656 modified as discussed in Section 6.4.1. 6658 The mode_set_masked attribute consists of two words each in the form 6659 of a mode4. The first consists of the value to be applied to the 6660 current mode value and the second is a mask. Only bits set to one in 6661 the mask word are changed (set or reset) in the file's mode. All 6662 other bits in the mode remain unchanged. Bits in the first word that 6663 correspond to bits which are zero in the mask are ignored, except 6664 that undefined bits are checked for validity and can result in 6665 NFS4ERR_INVAL as described below. 6667 The mode_set_masked attribute is only valid in a SETATTR operation. 6668 If it is used in a CREATE or OPEN operation, the server MUST return 6669 NFS4ERR_INVAL. 6671 Bits not defined as valid in the mode attribute are not valid in 6672 either word of the mode_set_masked attribute. The server MUST return 6673 NFS4ERR_INVAL if any of those are on in a SETATTR. If the mode and 6674 mode_set_masked attributes are both specified in the same SETATTR, 6675 the server MUST also return NFS4ERR_INVAL. 6677 6.3. Common Methods 6679 The requirements in this section will be referred to in future 6680 sections, especially Section 6.4. 6682 6.3.1. Interpreting an ACL 6684 6.3.1.1. Server Considerations 6686 The server uses the algorithm described in Section 6.2.1 to determine 6687 whether an ACL allows access to an object. However, the ACL might 6688 not be the sole determiner of access. For example: 6690 o In the case of a file system exported as read-only, the server may 6691 deny write permissions even though an object's ACL grants it. 6693 o Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL 6694 permissions to prevent a situation from arising in which there is 6695 no valid way to ever modify the ACL. 6697 o All servers will allow a user the ability to read the data of the 6698 file when only the execute permission is granted (i.e. If the ACL 6699 denies the user the ACE4_READ_DATA access and allows the user 6700 ACE4_EXECUTE, the server will allow the user to read the data of 6701 the file). 6703 o Many servers have the notion of owner-override in which the owner 6704 of the object is allowed to override accesses that are denied by 6705 the ACL. This may be helpful, for example, to allow users 6706 continued access to open files on which the permissions have 6707 changed. 6709 o Many servers have the notion of a "superuser" that has privileges 6710 beyond an ordinary user. The superuser may be able to read or 6711 write data or metadata in ways that would not be permitted by the 6712 ACL. 
6714 o A retention attribute might also block access otherwise allowed by 6715 ACLs (see Section 5.13). 6717 6.3.1.2. Client Considerations 6719 Clients SHOULD NOT do their own access checks based on their 6720 interpretation of the ACL, but rather use the OPEN and ACCESS operations 6721 to do access checks. This allows the client to act on the results of 6722 having the server determine whether or not access should be granted 6723 based on its interpretation of the ACL. 6725 Clients must be aware of situations in which an object's ACL will 6726 define a certain access even though the server will not enforce it. 6727 In general, but especially in these situations, the client needs to 6728 do its part in the enforcement of access as defined by the ACL. To 6729 do this, the client MAY send the appropriate ACCESS operation prior 6730 to servicing the request of the user or application in order to 6731 determine whether the user or application should be granted the 6732 access requested. For examples in which the ACL may define accesses 6733 that the server doesn't enforce, see Section 6.3.1.1. 6735 6.3.2. Computing a Mode Attribute from an ACL 6737 The following method can be used to calculate the MODE4_R*, MODE4_W*, 6738 and MODE4_X* bits of a mode attribute, based upon an ACL. 6740 First, for each of the special identifiers OWNER@, GROUP@, and 6741 EVERYONE@, evaluate the ACL in order, considering only ALLOW and DENY 6742 ACEs for the identifier EVERYONE@ and for the identifier under 6743 consideration. The result of the evaluation will be an NFSv4 ACL 6744 mask showing exactly which bits are permitted to that identifier. 6746 Then translate the calculated mask for OWNER@, GROUP@, and EVERYONE@ 6747 into mode bits for, respectively, the user, group, and other, as 6748 follows: 6750 1. Set the read bit (MODE4_RUSR, MODE4_RGRP, or MODE4_ROTH) if and 6751 only if ACE4_READ_DATA is set in the corresponding mask. 6753 2. Set the write bit (MODE4_WUSR, MODE4_WGRP, or MODE4_WOTH) if and 6754 only if ACE4_WRITE_DATA and ACE4_APPEND_DATA are both set in the 6755 corresponding mask. 6757 3. Set the execute bit (MODE4_XUSR, MODE4_XGRP, or MODE4_XOTH), if 6758 and only if ACE4_EXECUTE is set in the corresponding mask. 6760 6.3.2.1. Discussion 6762 Some server implementations also add bits permitted to named users 6763 and groups to the group bits (MODE4_RGRP, MODE4_WGRP, and 6764 MODE4_XGRP). 6766 Implementations are discouraged from doing this, because it has been 6767 found to cause confusion for users who see members of a file's group 6768 denied access that the mode bits appear to allow. (The presence of 6769 DENY ACEs may also lead to such behavior, but DENY ACEs are expected 6770 to be more rarely used.) 6772 The same user confusion seen when fetching the mode also results if 6773 setting the mode does not effectively control permissions for the 6774 owner, group, and other users; this motivates some of the 6775 requirements that follow. 6777 6.4. Requirements 6779 A server that supports both mode and ACL must take care to 6780 synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the 6781 ACEs which have respective who fields of "OWNER@", "GROUP@", and 6782 "EVERYONE@", so that the client can see that semantically equivalent access 6783 permissions exist whether the client asks for the owner, owner_group, and 6784 mode attributes or for just the ACL. 6786 In this section, much is made of the methods in Section 6.3.2. Many 6787 requirements refer to this section.
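As a non-normative illustration, the translation step of that method can be sketched in C as follows, where owner_mask, group_mask, and everyone_mask are assumed to be the ACL evaluation results already computed for OWNER@, GROUP@, and EVERYONE@; the function name is hypothetical.

   #include <stdint.h>

   static uint32_t
   mode_from_acl_masks(acemask4 owner_mask, acemask4 group_mask,
                       acemask4 everyone_mask)
   {
           const acemask4 wr = ACE4_WRITE_DATA | ACE4_APPEND_DATA;
           uint32_t mode = 0;

           /* Step 1: read bits require ACE4_READ_DATA. */
           if (owner_mask & ACE4_READ_DATA)       mode |= MODE4_RUSR;
           if (group_mask & ACE4_READ_DATA)       mode |= MODE4_RGRP;
           if (everyone_mask & ACE4_READ_DATA)    mode |= MODE4_ROTH;

           /* Step 2: write bits require both ACE4_WRITE_DATA and
            * ACE4_APPEND_DATA. */
           if ((owner_mask & wr) == wr)           mode |= MODE4_WUSR;
           if ((group_mask & wr) == wr)           mode |= MODE4_WGRP;
           if ((everyone_mask & wr) == wr)        mode |= MODE4_WOTH;

           /* Step 3: execute bits require ACE4_EXECUTE. */
           if (owner_mask & ACE4_EXECUTE)         mode |= MODE4_XUSR;
           if (group_mask & ACE4_EXECUTE)         mode |= MODE4_XGRP;
           if (everyone_mask & ACE4_EXECUTE)      mode |= MODE4_XOTH;

           return mode;
   }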
But note that the methods have 6788 behaviors specified with "SHOULD". This is intentional, to avoid 6789 invalidating existing implementations that compute the mode according 6790 to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by 6791 actual permissions on owner, group, and other. 6793 6.4.1. Setting the mode and/or ACL Attributes 6795 In the case where a server supports the sacl or dacl attribute, in 6796 addition to the acl attribute, the server MUST fail a request to set 6797 the acl attribute simultaneously with a dacl or sacl attribute. The 6798 error to be given is NFS4ERR_ATTRNOTSUPP. 6800 6.4.1.1. Setting mode and not ACL 6802 When any of the nine low-order mode bits are subject to change, 6803 either because the mode attribute was set or because the 6804 mode_set_masked attribute was set and the mask included one or more 6805 bits from the nine low-order mode bits, and no ACL attribute is 6806 explicitly set, the acl and dacl attributes must be modified in 6807 accordance with the updated value of those bits. This must happen 6808 even if the value of the low-order bits is the same after the mode is 6809 set as before. 6811 Note that any AUDIT or ALARM ACEs (hence any ACEs in the sacl 6812 attribute) are unaffected by changes to the mode. 6814 In cases in which the permissions bits are subject to change, the acl 6815 and dacl attributes MUST be modified such that the mode computed via 6816 the method in Section 6.3.2 yields the low-order nine bits (MODE4_R*, 6817 MODE4_W*, MODE4_X*) of the mode attribute as modified by the 6818 attribute change. The ACL attributes SHOULD also be modified such 6819 that: 6821 1. If MODE4_RGRP is not set, entities explicitly listed in the ACL 6822 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6823 ACE4_READ_DATA. 6825 2. If MODE4_WGRP is not set, entities explicitly listed in the ACL 6826 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6827 ACE4_WRITE_DATA or ACE4_APPEND_DATA. 6829 3. If MODE4_XGRP is not set, entities explicitly listed in the ACL 6830 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6831 ACE4_EXECUTE. 6833 Access mask bits other those listed above, appearing in ALLOW ACEs, 6834 MAY also be disabled. 6836 Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect 6837 the permissions of the ACL itself, nor do ACEs of the type AUDIT and 6838 ALARM. As such, it is desirable to leave these ACEs unmodified when 6839 modifying the ACL attributes. 6841 Also note that the requirement may be met by discarding the acl and 6842 dacl, in favor of an ACL that represents the mode and only the mode. 6843 This is permitted, but it is preferable for a server to preserve as 6844 much of the ACL as possible without violating the above requirements. 6845 Discarding the ACL makes it effectively impossible for a file created 6846 with a mode attribute to inherit an ACL (see Section 6.4.3). 6848 6.4.1.2. Setting ACL and not mode 6850 When setting the acl or dacl and not setting the mode or 6851 mode_set_masked attributes, the permission bits of the mode need to 6852 be derived from the ACL. In this case, the ACL attribute SHOULD be 6853 set as given. The nine low-order bits of the mode attribute 6854 (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result 6855 of the method Section 6.3.2. The three high-order bits of the mode 6856 (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged. 6858 6.4.1.3. 
Setting both ACL and mode 6860 When setting both the mode (includes use of either the mode attribute 6861 or the mode_set_masked attribute) and the acl or dacl attributes in 6862 the same operation, the attributes MUST be applied in this order: 6863 mode (or mode_set_masked), then ACL. The mode-related attribute is 6864 set as given, then the ACL attribute is set as given, possibly 6865 changing the final mode, as described above in Section 6.4.1.2. 6867 6.4.2. Retrieving the mode and/or ACL Attributes 6869 This section applies only to servers that support both the mode and 6870 ACL attributes. 6872 Some server implementations may have a concept of "objects without 6873 ACLs", meaning that all permissions are granted and denied according 6874 to the mode attribute, and that no ACL attribute is stored for that 6875 object. If an ACL attribute is requested of such a server, the 6876 server SHOULD return an ACL that does not conflict with the mode; 6877 that is to say, the ACL returned SHOULD represent the nine low-order 6878 bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as 6879 described in Section 6.3.2. 6881 For other server implementations, the ACL attribute is always present 6882 for every object. Such servers SHOULD store at least the three high- 6883 order bits of the mode attribute (MODE4_SUID, MODE4_SGID, 6884 MODE4_SVTX). The server SHOULD return a mode attribute if one is 6885 requested, and the low-order nine bits of the mode (MODE4_R*, 6886 MODE4_W*, MODE4_X*) MUST match the result of applying the method in 6887 Section 6.3.2 to the ACL attribute. 6889 6.4.3. Creating New Objects 6891 If a server supports any ACL attributes, it may use the ACL 6892 attributes on the parent directory to compute an initial ACL 6893 attribute for a newly created object. This will be referred to as 6894 the inherited ACL within this section. The act of adding one or more 6895 ACEs to the inherited ACL that are based upon ACEs in the parent 6896 directory's ACL will be referred to as inheriting an ACE within this 6897 section. 6899 Implementors should standardize on what the behavior of CREATE and 6900 OPEN must be depending on the presence or absence of the mode and ACL 6901 attributes. 6903 1. If just the mode is given in the call: 6905 In this case, inheritance SHOULD take place, but the mode MUST be 6906 applied to the inherited ACL as described in Section 6.4.1.1, 6907 thereby modifying the ACL. 6909 2. If just the ACL is given in the call: 6911 In this case, inheritance SHOULD NOT take place, and the ACL as 6912 defined in the CREATE or OPEN will be set without modification, 6913 and the mode modified as in Section 6.4.1.2 6915 3. If both mode and ACL are given in the call: 6917 In this case, inheritance SHOULD NOT take place, and both 6918 attributes will be set as described in Section 6.4.1.3. 6920 4. If neither mode nor ACL are given in the call: 6922 In the case where an object is being created without any initial 6923 attributes at all, e.g. an OPEN operation with an opentype4 of 6924 OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD 6925 NOT take place (note that EXCLUSIVE4_1 is a better choice of 6926 createmode4, since it does permit initial attributes). Instead, 6927 the server SHOULD set permissions to deny all access to the newly 6928 created object. 
It is expected that the appropriate client will 6929 set the desired attributes in a subsequent SETATTR operation, and 6930 the server SHOULD allow that operation to succeed, regardless of 6931 what permissions the object is created with. For example, an 6932 empty ACL denies all permissions, but the server should allow the 6933 owner's SETATTR to succeed even though WRITE_ACL is implicitly 6934 denied. 6936 In other cases, inheritance SHOULD take place, and no 6937 modifications to the ACL will happen. The mode attribute, if 6938 supported, MUST be as computed in Section 6.3.2, with the 6939 MODE4_SUID, MODE4_SGID, and MODE4_SVTX bits clear. If no 6940 inheritable ACEs exist on the parent directory, the rules for 6941 creating the acl, dacl, or sacl attributes are implementation-defined. 6942 If either the dacl or sacl attribute is supported, then the 6943 ACL4_DEFAULTED flag SHOULD be set on the newly created 6944 attributes. 6946 6.4.3.1. The Inherited ACL 6948 If the object being created is not a directory, the inherited ACL 6949 SHOULD NOT inherit ACEs from the parent directory ACL unless the 6950 ACE4_FILE_INHERIT_ACE flag is set. 6952 If the object being created is a directory, the inherited ACL should 6953 inherit all inheritable ACEs from the parent directory, i.e., those that 6954 have the ACE4_FILE_INHERIT_ACE or ACE4_DIRECTORY_INHERIT_ACE flag set. 6955 If the inheritable ACE has ACE4_FILE_INHERIT_ACE set, but 6956 ACE4_DIRECTORY_INHERIT_ACE is clear, the inherited ACE on the newly 6957 created directory MUST have the ACE4_INHERIT_ONLY_ACE flag set to 6958 prevent the directory from being affected by ACEs meant for non- 6959 directories. 6961 When a new directory is created, the server MAY split any inherited 6962 ACE which is both inheritable and effective (in other words, which 6963 has neither ACE4_INHERIT_ONLY_ACE nor ACE4_NO_PROPAGATE_INHERIT_ACE 6964 set) into two ACEs, one with no inheritance flags and one with 6965 ACE4_INHERIT_ONLY_ACE set. (In the case of a dacl or sacl attribute, 6966 both of those ACEs SHOULD also have the ACE4_INHERITED_ACE flag set.) 6967 This makes it simpler to modify the effective permissions on the 6968 directory without modifying the ACE which is to be inherited by the 6969 new directory's children. 6971 6.4.3.2. Automatic Inheritance 6973 The acl attribute consists only of an array of ACEs, but the sacl 6974 (Section 6.2.3) and dacl (Section 6.2.2) attributes also include an 6975 additional flag field. 6977 struct nfsacl41 { 6978 aclflag4 na41_flag; 6979 nfsace4 na41_aces<>; 6980 }; 6982 The flag field applies to the entire sacl or dacl; three flag values 6983 are defined: 6985 const ACL4_AUTO_INHERIT = 0x00000001; 6986 const ACL4_PROTECTED = 0x00000002; 6987 const ACL4_DEFAULTED = 0x00000004; 6989 and all other bits must be cleared. The ACE4_INHERITED_ACE flag may 6990 be set in the ACEs of the sacl or dacl (whereas it must always be 6991 cleared in the acl). 6993 Together these features allow a server to support automatic 6994 inheritance, which we now explain in more detail. 6996 Inheritable ACEs are normally inherited by child objects only at the 6997 time that the child objects are created; later modifications to 6998 inheritable ACEs do not result in modifications to inherited ACEs on 6999 descendants. 7001 However, the dacl and sacl provide an OPTIONAL mechanism which allows 7002 a client application to propagate changes to inheritable ACEs to an 7003 entire directory hierarchy.
7005 A server that supports this performs inheritance at object creation 7006 time in the normal way, and SHOULD set the ACE4_INHERITED_ACE flag on 7007 any inherited ACEs as they are added to the new object. 7009 A client application such as an ACL editor may then propagate changes 7010 to inheritable ACEs on a directory by recursively traversing that 7011 directory's descendants and modifying each ACL encountered to remove 7012 any ACEs with the ACE4_INHERITED_ACE flag and to replace them by the 7013 new inheritable ACEs (also with the ACE4_INHERITED_ACE flag set). It 7014 uses the existing ACE inheritance flags in the obvious way to decide 7015 which ACEs to propagate. (Note that it may encounter further 7016 inheritable ACEs when descending the directory hierarchy, and that 7017 those will also need to be taken into account when propagating 7018 inheritable ACEs to further descendants.) 7020 The reach of this propagation may be limited in two ways: first, 7021 automatic inheritance is not performed from any directory ACL that 7022 has the ACL4_AUTO_INHERIT flag cleared; and second, automatic 7023 inheritance stops wherever an ACL with the ACL4_PROTECTED flag is 7024 set, preventing modification of that ACL and also (if the ACL is set 7025 on a directory) of the ACL on any of the object's descendants. 7027 This propagation is performed independently for the sacl and the dacl 7028 attributes; thus the ACL4_AUTO_INHERIT and ACL4_PROTECTED flags may 7029 be independently set for the sacl and the dacl, and propagation of 7030 one type of acl may continue down a hierarchy even where propagation 7031 of the other acl has stopped. 7033 New objects should be created with a dacl and a sacl that both have 7034 the ACL4_PROTECTED flag cleared and the ACL4_AUTO_INHERIT flag set to 7035 the same value as that on, respectively, the sacl or dacl of the 7036 parent object. 7038 Both the dacl and sacl attributes are RECOMMENDED, and a server may 7039 support one without supporting the other. 7041 A server that supports both the old acl attribute and one or both of 7042 the new dacl or sacl attributes must do so in such a way as to keep 7043 all three attributes consistent with each other. Thus the ACEs 7044 reported in the acl attribute should be the union of the ACEs 7045 reported in the dacl and sacl attributes, except that the 7046 ACE4_INHERITED_ACE flag must be cleared from the ACEs in the acl. 7047 And of course a client that queries only the acl will be unable to 7048 determine the values of the sacl or dacl flag fields. 7050 When a client performs a SETATTR for the acl attribute, the server 7051 SHOULD set the ACL4_PROTECTED flag to true on both the sacl and the 7052 dacl. By using the acl attribute, as opposed to the dacl or sacl 7053 attributes, the client signals that it may not understand automatic 7054 inheritance, and thus cannot be trusted to set an ACL for which 7055 automatic inheritance would make sense. 7057 When a client application queries an ACL, modifies it, and sets it 7058 again, it should leave any ACEs marked with ACE4_INHERITED_ACE 7059 unchanged, in their original order, at the end of the ACL. If the 7060 application is unable to do this, it should set the ACL4_PROTECTED 7061 flag. This behavior is not enforced by servers, but violations of 7062 this rule may lead to unexpected results when applications perform 7063 automatic inheritance. 
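The following non-normative C sketch shows one step of such a propagation applied to a single child's dacl held in memory. It assumes the rpcgen-style mapping of na41_aces<> to _len/_val fields; ace_dup() is a hypothetical helper that deep-copies an nfsace4, and error handling (including a failed realloc) is omitted.

   #include <stdbool.h>
   #include <stdlib.h>

   /* Replace the inherited ACEs in 'child' with 'ninherit' new ACEs
    * derived from the parent; returns false if the child's ACL must be
    * left untouched. */
   static bool
   refresh_inherited_aces(nfsacl41 *child,
                          const nfsace4 *inherit, unsigned int ninherit)
   {
           unsigned int i, n = 0;

           /* Propagation is not performed where ACL4_AUTO_INHERIT is
            * cleared, and stops where ACL4_PROTECTED is set. */
           if (!(child->na41_flag & ACL4_AUTO_INHERIT) ||
               (child->na41_flag & ACL4_PROTECTED))
                   return false;

           /* Keep non-inherited ACEs, in their original order. */
           for (i = 0; i < child->na41_aces.na41_aces_len; i++)
                   if (!(child->na41_aces.na41_aces_val[i].flag &
                         ACE4_INHERITED_ACE))
                           child->na41_aces.na41_aces_val[n++] =
                               child->na41_aces.na41_aces_val[i];

           /* Append the new inheritable ACEs, marked ACE4_INHERITED_ACE,
            * at the end of the ACL. */
           child->na41_aces.na41_aces_val =
               realloc(child->na41_aces.na41_aces_val,
                       (n + ninherit) * sizeof(nfsace4));
           for (i = 0; i < ninherit; i++) {
                   nfsace4 ace = ace_dup(&inherit[i]);
                   ace.flag |= ACE4_INHERITED_ACE;
                   child->na41_aces.na41_aces_val[n++] = ace;
           }
           child->na41_aces.na41_aces_len = n;
           return true;
   }

The caller would then write the updated dacl back with SETATTR and recurse into subdirectories, applying the same procedure with whatever inheritable ACEs are appropriate at each level.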
7065 If a server also supports the mode attribute, it SHOULD set the mode 7066 in such a way that leaves inherited ACEs unchanged, in their original 7067 order, at the end of the ACL. If it is unable to do so, it SHOULD 7068 set the ACL4_PROTECTED flag on the file's dacl. 7070 Finally, in the case where the request that creates a new file or 7071 directory does not also set permissions for that file or directory, 7072 and there are also no ACEs to inherit from the parent's directory, 7073 then the server's choice of ACL for the new object is implementation- 7074 dependent. In this case, the server SHOULD set the ACL4_DEFAULTED 7075 flag on the ACL it chooses for the new object. An application 7076 performing automatic inheritance takes the ACL4_DEFAULTED flag as a 7077 sign that the ACL should be completely replaced by one generated 7078 using the automatic inheritance rules. 7080 7. Single-server Namespace 7082 This chapter describes the NFSv4 single-server namespace. Single- 7083 server namespaces may be presented directly to clients, or they may 7084 be used as a basis to form larger multi-server namespaces (e.g. site- 7085 wide or organization-wide) to be presented to clients, as described 7086 in Section 11. 7088 7.1. Server Exports 7090 On a UNIX server, the namespace describes all the files reachable by 7091 pathnames under the root directory or "/". On a Windows server the 7092 namespace constitutes all the files on disks named by mapped disk 7093 letters. NFS server administrators rarely make the entire server's 7094 file system namespace available to NFS clients. More often portions 7095 of the namespace are made available via an "export" feature. In 7096 previous versions of the NFS protocol, the root filehandle for each 7097 export is obtained through the MOUNT protocol; the client sent a 7098 string that identified the export name within the namespace and the 7099 server returned the root filehandle for that export. The MOUNT 7100 protocol also provided an EXPORTS procedure that enumerated server's 7101 exports. 7103 7.2. Browsing Exports 7105 The NFSv4.1 protocol provides a root filehandle that clients can use 7106 to obtain filehandles for the exports of a particular server, via a 7107 series of LOOKUP operations within a COMPOUND, to traverse a path. A 7108 common user experience is to use a graphical user interface (perhaps 7109 a file "Open" dialog window) to find a file via progressive browsing 7110 through a directory tree. The client must be able to move from one 7111 export to another export via single-component, progressive LOOKUP 7112 operations. 7114 This style of browsing is not well supported by the NFSv3 protocol. 7115 In NFSv3, the client expects all LOOKUP operations to remain within a 7116 single server file system. For example, the device attribute will 7117 not change. This prevents a client from taking namespace paths that 7118 span exports. 7120 In the case of NFSv3, an automounter on the client can obtain a 7121 snapshot of the server's namespace using the EXPORTS procedure of the 7122 MOUNT protocol. If it understands the server's pathname syntax, it 7123 can create an image of the server's namespace on the client. The 7124 parts of the namespace that are not exported by the server are filled 7125 in with directories that might be constructed similarly to an NFSv4.1 7126 "pseudo file system" (see Section 7.3) that allows the user to browse 7127 from one mounted file system to another. 
There is a drawback to this 7128 representation of the server's namespace on the client: it is static. 7129 If the server administrator adds a new export, the client will be 7130 unaware of it. 7132 7.3. Server Pseudo File System 7134 NFSv4.1 servers avoid this namespace inconsistency by presenting all 7135 the exports for a given server within the framework of a single 7136 namespace for that server. An NFSv4.1 client uses LOOKUP and 7137 READDIR operations to browse seamlessly from one export to another. 7139 Where there are portions of the server namespace that are not 7140 exported, clients require some way of traversing those portions to 7141 reach actual exported file systems. A technique that servers may use 7142 to provide for this is to bridge the unexported portions of the namespace 7143 via a "pseudo file system" that provides a view of exported 7144 directories only. A pseudo file system has a unique fsid and behaves 7145 like a normal, read-only file system. 7147 Based on the construction of the server's namespace, it is possible 7148 that multiple pseudo file systems may exist. For example, 7150 /a pseudo file system 7151 /a/b real file system 7152 /a/b/c pseudo file system 7153 /a/b/c/d real file system 7155 Each of the pseudo file systems is considered a separate entity and 7156 therefore MUST have its own fsid, unique among all the fsids for that 7157 server. 7159 7.4. Multiple Roots 7161 Certain operating environments are sometimes described as having 7162 "multiple roots". In such environments individual file systems are 7163 commonly represented by disk or volume names. NFSv4 servers for 7164 these platforms can construct a pseudo file system above these root 7165 names so that disk letters or volume names are simply directory names 7166 in the pseudo root. 7168 7.5. Filehandle Volatility 7170 The nature of the server's pseudo file system is that it is a logical 7171 representation of file system(s) available from the server. 7172 Therefore, the pseudo file system is most likely constructed 7173 dynamically when the server is first instantiated. It is expected 7174 that the pseudo file system may not have an on-disk counterpart from 7175 which persistent filehandles could be constructed. Even though it is 7176 preferable that the server provide persistent filehandles for the 7177 pseudo file system, the NFS client should expect that pseudo file 7178 system filehandles are volatile. This can be confirmed by checking 7179 the associated "fh_expire_type" attribute for those filehandles in 7180 question. If the filehandles are volatile, the NFS client must be 7181 prepared to recover a filehandle value (e.g. with a series of LOOKUP 7182 operations) when receiving an error of NFS4ERR_FHEXPIRED. 7184 Because it is quite likely that servers will implement pseudo file 7185 systems using volatile filehandles, clients need to be prepared for 7186 them, rather than assuming that all filehandles will be persistent. 7188 7.6. Exported Root 7190 If the server's root file system is exported, one might conclude that 7191 a pseudo file system is unneeded. This is not necessarily so. Assume 7192 the following file systems on a server: 7194 / fs1 (exported) 7195 /a fs2 (not exported) 7196 /a/b fs3 (exported) 7198 Because fs2 is not exported, fs3 cannot be reached with simple 7199 LOOKUPs. The server must bridge the gap with a pseudo file system. 7201 7.7.
Mount Point Crossing 7203 The server file system environment may be constructed in such a way 7204 that one file system contains a directory which is 'covered' or 7205 mounted upon by a second file system. For example: 7207 /a/b (file system 1) 7208 /a/b/c/d (file system 2) 7210 The pseudo file system for this server may be constructed to look 7211 like: 7213 / (place holder/not exported) 7214 /a/b (file system 1) 7215 /a/b/c/d (file system 2) 7217 It is the server's responsibility to present to the client a pseudo 7218 file system that is complete. If the client sends a lookup request 7219 for the path "/a/b/c/d", the server's response is the filehandle of 7220 the root of the file system "/a/b/c/d". In previous versions of the 7221 NFS protocol, the server would respond with the filehandle of 7222 directory "/a/b/c/d" within the file system "/a/b". 7224 The NFS client will be able to determine if it crosses a server mount 7225 point by a change in the value of the "fsid" attribute. 7227 7.8. Security Policy and Namespace Presentation 7229 Because NFSv4 clients possess the ability to change the security 7230 mechanisms used, after determining what is allowed, by using SECINFO 7231 and SECINFO_NONAME, the server SHOULD NOT present a different view of 7232 the namespace based on the security mechanism being used by a client. 7233 Instead, it should present a consistent view and return 7234 NFS4ERR_WRONGSEC if an attempt is made to access data with an 7235 inappropriate security mechanism. 7237 If security considerations make it necessary to hide the existence of 7238 a particular file system, as opposed to all of the data within it, 7239 the server can apply the security policy of a shared resource in the 7240 server's namespace to components of the resource's ancestors. For 7241 example: 7243 / (place holder/not exported) 7244 /a/b (file system 1) 7245 /a/b/MySecretProject (file system 2) 7247 The /a/b/MySecretProject directory is a real file system and is the 7248 shared resource. Suppose the security policy for /a/b/ 7249 MySecretProject is Kerberos with integrity and it is desired to limit 7250 knowledge of the existence of this file system. In this case, the 7251 server should apply the same security policy to /a/b. This allows 7252 for knowledge of the existence of a file system to be secured when 7253 desirable. 7255 For the case of the use of multiple, disjoint security mechanisms in 7256 the server's resources, applying that sort of policy would result in 7257 the higher-level file system not being accessible using any security 7258 flavor, which would make that higher-level file system 7259 inaccessible. Therefore, that sort of configuration is not 7260 compatible with hiding the existence (as opposed to the contents) 7261 from clients using multiple disjoint sets of security flavors. 7263 In other circumstances, a desirable policy is for the security of a 7264 particular object in the server's namespace to include the union 7265 of all security mechanisms of all direct descendants. A common and 7266 convenient practice, unless strong security requirements dictate 7267 otherwise, is to make all of the pseudo file system accessible by all 7268 of the valid security mechanisms. 7270 Where there is concern about the security of data on the network, 7271 clients should use strong security mechanisms to access the pseudo 7272 file system in order to prevent man-in-the-middle attacks. 7274 8.
State Management 7276 Integrating locking into the NFS protocol necessarily causes it to be 7277 stateful. With the inclusion of such features as share reservations, 7278 file and directory delegations, recallable layouts, and support for 7279 mandatory byte-range locking, the protocol becomes substantially more 7280 dependent on proper management of state than the traditional 7281 combination of NFS and NLM [45]. These features include expanded 7282 locking facilities, which provide some measure of interclient 7283 exclusion, but the state also offers features not readily providable 7284 using a stateless model. There are three components to making this 7285 state manageable: 7287 o Clear division between client and server 7289 o Ability to reliably detect inconsistency in state between client 7290 and server 7292 o Simple and robust recovery mechanisms 7294 In this model, the server owns the state information. The client 7295 requests changes in locks and the server responds with the changes 7296 made. Non-client-initiated changes in locking state are infrequent. 7297 The client receives prompt notification of such changes and can 7298 adjust its view of the locking state to reflect the server's changes. 7300 Individual pieces of state created by the server and passed to the 7301 client at its request are represented by 128-bit stateids. These 7302 stateids may represent a particular open file, a set of byte-range 7303 locks held by a particular owner, or a recallable delegation of 7304 privileges to access a file in particular ways, or at a particular 7305 location. 7307 In all cases, there is a transition from the most general information 7308 which represents a client as a whole to the eventual lightweight 7309 stateid used for most client and server locking interactions. The 7310 details of this transition will vary with the type of object but it 7311 always starts with a client ID. 7313 8.1. Client and Session ID 7315 A client must establish a client ID (see Section 2.4) and then one or 7316 more sessionids (see Section 2.10) before performing any operations 7317 to open, lock, delegate, or obtain a layout for a file object. Each 7318 session ID is associated with a specific client ID, and thus serves 7319 as a shorthand reference to an NFSv4.1 client. 7321 For some types of locking interactions, the client will represent 7322 some number of internal locking entities called "owners", which 7323 normally correspond to processes internal to the client. For other 7324 types of locking-related objects, such as delegations and layouts, no 7325 such intermediate entities are provided for, and the locking-related 7326 objects are considered to be transferred directly between the server 7327 and a unitary client. 7329 8.2. Stateid Definition 7331 When the server grants a lock of any type (including opens, byte- 7332 range locks, delegations, and layouts) it responds with a unique 7333 stateid, that represents a set of locks (often a single lock) for the 7334 same file, of the same type, and sharing the same ownership 7335 characteristics. Thus opens of the same file by different open- 7336 owners each have an identifying stateid. Similarly, each set of 7337 byte-range locks on a file owned by a specific lock-owner has its own 7338 identifying stateid. Delegations and layouts also have associated 7339 stateids by which they may be referenced. 
The stateid is used as a 7340 shorthand reference to a lock or set of locks and given a stateid the 7341 server can determine the associated state-owner or state-owners (in 7342 the case of an open-owner/lock-owner pair) and the associated 7343 filehandle. When stateids are used, the current filehandle must be 7344 the one associated with that stateid. 7346 All stateids associated with a given client ID are associated with a 7347 common lease which represents the claim of those stateids and the 7348 objects they represent to be maintained by the server. See 7349 Section 8.3 for a discussion of leases. 7351 The server may assign stateids independently for different clients. 7352 A stateid with the same bit pattern for one client may designate an 7353 entirely different set of locks for a different client. The stateid 7354 is always interpreted with respect to the client ID associated with 7355 the current session. Stateids apply to all sessions associated with 7356 the given client ID and the client may use a stateid obtained from 7357 one session on another session associated with the same client ID. 7359 8.2.1. Stateid Types 7361 With the exception of special stateids (see Section 8.2.3), each 7362 stateid represents locking objects of one of a set of types defined 7363 by the NFSv4.1 protocol. Note that in all these cases, where we 7364 speak of guarantee, it is understood there are situations such as a 7365 client restart, or lock revocation, that allow the guarantee to be 7366 voided. 7368 o Stateids may represent opens of files. 7370 Each stateid in this case represents the open state for a given 7371 client ID/open-owner/filehandle triple. Such stateids are subject 7372 to change (with consequent incrementing of the stateid's seqid) in 7373 response to OPENs that result in upgrade and OPEN_DOWNGRADE 7374 operations. 7376 o Stateids may represent sets of byte-range locks. 7378 All locks held on a particular file by a particular owner and all 7379 gotten under the aegis of a particular open file are associated 7380 with a single stateid with the seqid being incremented whenever 7381 LOCK and LOCKU operations affect that set of locks. 7383 o Stateids may represent file delegations, which are recallable 7384 guarantees by the server to the client, that other clients will 7385 not reference, or will not modify a particular file, until the 7386 delegation is returned. In NFSv4.1, file delegations may be 7387 obtained on both regular and non-regular files. 7389 A stateid represents a single delegation held by a client for a 7390 particular filehandle. 7392 o Stateids may represent directory delegations, which are recallable 7393 guarantees by the server to the client, that other clients will 7394 not modify the directory, until the delegation is returned. 7396 A stateid represents a single delegation held by a client for a 7397 particular directory filehandle. 7399 o Stateids may represent layouts, which are recallable guarantees by 7400 the server to the client, that particular files may be accessed 7401 via an alternate data access protocol at specific locations. Such 7402 access is limited to particular sets of byte ranges and may 7403 proceed until those byte ranges are reduced or the layout is 7404 returned. 7406 A stateid represents the set of all layouts held by a particular 7407 client for a particular filehandle with a given layout type. The 7408 seqid is updated as the layouts of that set changes with layout 7409 stateid changing operations such as LAYOUTGET and LAYOUTRETURN. 
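The stateid types just enumerated can be summarized in tabular form. The small Python table below is a non-normative aid only; it restates, for each type, what the stateid designates and which operations change the associated set of locks (and hence its seqid), as described above.

   from enum import Enum, auto

   class StateidType(Enum):
       OPEN = auto()                  # per client ID/open-owner/filehandle triple
       BYTE_RANGE_LOCK = auto()       # per lock-owner, per open file
       FILE_DELEGATION = auto()       # per client, per filehandle
       DIRECTORY_DELEGATION = auto()  # per client, per directory filehandle
       LAYOUT = auto()                # per client/filehandle/layout type

   # Operations whose effect on the designated set of locks causes the
   # seqid to be incremented, per the descriptions above.
   SEQID_CHANGING_OPS = {
       StateidType.OPEN:                 {"OPEN", "OPEN_DOWNGRADE"},
       StateidType.BYTE_RANGE_LOCK:      {"LOCK", "LOCKU"},
       StateidType.FILE_DELEGATION:      set(),  # held until returned (DELEGRETURN)
       StateidType.DIRECTORY_DELEGATION: set(),  # held until returned
       StateidType.LAYOUT:               {"LAYOUTGET", "LAYOUTRETURN"},
   }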
7411 8.2.2. Stateid Structure 7413 Stateids are divided into two fields, a 96-bit "other" field 7414 identifying the specific set of locks and a 32-bit "seqid" sequence 7415 value. Except in the case of special stateids (see Section 8.2.3), a 7416 particular value of the "other" field denotes a set of locks of the 7417 same type (for example byte-range locks, opens, delegations, or 7418 layouts), for a specific file or directory, and sharing the same 7419 ownership characteristics. The seqid designates a specific instance 7420 of such a set of locks, and is incremented to indicate changes in 7421 such a set of locks, either by the addition or deletion of locks from 7422 the set, a change in the byte-range they apply to, or an upgrade or 7423 downgrade in the type of one or more locks. 7425 When such a set of locks is first created the server returns a 7426 stateid with seqid value of one. On subsequent operations which 7427 modify the set of locks the server is required to increment the seqid 7428 field by one (1) whenever it returns a stateid for the same state- 7429 owner/file/type combination and there is some change in the set of 7430 locks actually designated. In this case the server will return a 7431 stateid with an other field the same as previously used for that 7432 state-owner/file/type combination, with an incremented seqid field. 7433 This pattern continues until the seqid is incremented past 7434 NFS4_UINT32_MAX, and one (not zero) is the next seqid value. 7436 The purpose of the incrementing of the seqid is to allow the server 7437 to communicate to the client the order in which operations that 7438 modified locking state associated with a stateid have been processed 7439 and to make it possible for the client to send requests that are 7440 conditional on the set of locks not having changed since the stateid 7441 in question was returned. 7443 Except for layout stateids (Section 12.5.3) when a client sends a 7444 stateid to the server, it has two choices with regard to the seqid 7445 sent. It may set the seqid to zero to indicate to the server that it 7446 wishes the most up-to-date seqid for that stateid's "other" field to 7447 be used. This would be the common choice in the case of a stateid 7448 sent with a READ or WRITE operation. It also may set a non-zero 7449 value in which case the server checks if that seqid is the correct 7450 one. In that case the server is required to return 7451 NFS4ERR_OLD_STATEID if the seqid is lower than the most current value 7452 and NFS4ERR_BAD_STATEID if the seqid is greater than the most current 7453 value. This would be the common choice in the case of stateids sent 7454 with a CLOSE or OPEN_DOWNGRADE. Because OPENs may be sent in 7455 parallel for the same owner, a client might close a file without 7456 knowing that an OPEN upgrade had been done by the server, changing 7457 the lock in question. If CLOSE were sent with a zero seqid, the OPEN 7458 upgrade would be canceled before the client even received an 7459 indication that an upgrade had happened. 7461 When a stateid is sent by the server to client as part of a callback 7462 operation, it is not subject to checking for a current seqid and 7463 returning NFS4ERR_OLD_STATEID. This is because the client is not in 7464 a position to know the most up-to-date seqid and thus cannot verify 7465 it. Unless specially noted, the seqid value for a stateid sent by 7466 the server to the client as part of a callback is required to be zero 7467 with NFS4ERR_BAD_STATEID returned if it is not. 
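The seqid arithmetic described in this section can be sketched as follows. This is a hedged, non-normative illustration in Python: it shows the 96-bit "other"/32-bit "seqid" split, the wrap from NFS4_UINT32_MAX back to one (never zero), and the server-side check of a client-supplied seqid; the wraparound comparison rule discussed in the next paragraphs is deliberately omitted here.

   from dataclasses import dataclass

   NFS4_UINT32_MAX = 0xFFFFFFFF

   @dataclass
   class Stateid:
       other: bytes   # 12 octets identifying the set of locks
       seqid: int     # 32-bit instance counter, 1..NFS4_UINT32_MAX

   def next_seqid(current: int) -> int:
       """Increment a seqid, wrapping past NFS4_UINT32_MAX to 1, never 0."""
       return 1 if current == NFS4_UINT32_MAX else current + 1

   def check_client_seqid(sent: int, current: int) -> str:
       """Server-side check of the seqid sent by a client with a stateid.

       A sent value of zero asks the server to use the most up-to-date
       seqid for this "other" value.  Wraparound handling is omitted.
       """
       if sent == 0 or sent == current:
           return "NFS4_OK"
       if sent < current:
           return "NFS4ERR_OLD_STATEID"
       return "NFS4ERR_BAD_STATEID"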
7469 In making comparisons between seqids, both by the client in 7470 determining the order of operations and by the server in determining 7471 whether NFS4ERR_OLD_STATEID is to be returned, the possibility of 7472 the seqid being wrapped around past the NFS4_UINT32_MAX value needs 7473 to be taken into account. When two seqid values are being compared, 7474 the total count of slots for all sessions associated with the current 7475 client is used to do this. When one seqid value is less than this 7476 total slot count and another seqid value is greater than 7477 NFS4_UINT32_MAX minus the total slot count, the former is to be 7478 treated as lower than the latter, despite the fact that it is 7479 numerically greater. 7481 8.2.3. Special Stateids 7483 Stateid values whose "other" field is either all zeros or all ones 7484 are reserved. They may not be assigned by the server but have 7485 special meanings defined by the protocol. The particular meaning 7486 depends on whether the "other" field is all zeros or all ones and the 7487 specific value of the "seqid" field. 7489 The following combinations of "other" and "seqid" are defined in 7490 NFSv4.1: 7492 o When "other" and "seqid" are both zero, the stateid is treated as 7493 a special anonymous stateid, which can be used in READ, WRITE, and 7494 SETATTR requests to indicate the absence of any open state 7495 associated with the request. When an anonymous stateid value is 7496 used, and an existing open denies the form of access requested, 7497 then access will be denied to the request. This stateid MUST NOT 7498 be used on operations to data servers (Section 13.6). 7500 o When "other" and "seqid" are both all ones, the stateid is a 7501 special read bypass stateid. When this value is used in WRITE or 7502 SETATTR, it is treated like the anonymous value. When used in 7503 READ, the server MAY grant access, even if access would normally 7504 be denied to READ requests. This stateid MUST NOT be used on 7505 operations to data servers. 7507 o When "other" is zero and "seqid" is one, the stateid represents 7508 the current stateid, which is whatever value is the last stateid 7509 returned by an operation within the COMPOUND. In the case of an 7510 OPEN, the stateid returned for the open file, and not the 7511 delegation, is used. The stateid passed to the operation in place 7512 of the special value has its "seqid" value set to zero, except 7513 when the current stateid is used by the operation CLOSE or 7514 OPEN_DOWNGRADE. If there is no operation in the COMPOUND which 7515 has returned a stateid value, the server MUST return the error 7516 NFS4ERR_BAD_STATEID. As illustrated in Figure 6, if the value of 7517 a current stateid is a special stateid, and the stateid of an 7518 operation's arguments has "other" set to zero, and "seqid" set to 7519 one, then the server MUST return the error NFS4ERR_BAD_STATEID. 7521 o When "other" is zero and "seqid" is NFS4_UINT32_MAX, the stateid 7522 represents a reserved stateid value defined to be invalid. When 7523 this stateid is used, the server MUST return the error 7524 NFS4ERR_BAD_STATEID. 7526 If a stateid value is used which has all zeros or all ones in the 7527 "other" field, but does not match one of the cases above, the server 7528 MUST return the error NFS4ERR_BAD_STATEID. 7530 Special stateids, unlike other stateids, are not associated with 7531 individual client IDs or filehandles and can be used with all valid 7532 client IDs and filehandles.
In the case of a special stateid 7533 designating the current stateid, the current stateid value 7534 substituted for the special stateid is associated with a particular 7535 client ID and filehandle, and so, if it is used where current 7536 filehandle does not match that associated with the current stateid, 7537 the operation to which the stateid is passed will return 7538 NFS4ERR_BAD_STATEID. 7540 8.2.4. Stateid Lifetime and Validation 7542 Stateids must remain valid until either a client restart or a server 7543 restart or until the client returns all of the locks associated with 7544 the stateid by means of an operation such as CLOSE or DELEGRETURN. 7545 If the locks are lost due to revocation the stateid remains a valid 7546 designation of that revoked state until the client frees it by using 7547 FREE_STATEID. Stateids associated with byte-range locks are an 7548 exception. They remain valid even if a LOCKU frees all remaining 7549 locks, so long as the open file with which they are associated 7550 remains open, unless the client does a FREE_STATEID to cause the 7551 stateid to be freed. 7553 It should be noted that there are situations in which the client's 7554 locks become invalid, without the client requesting they be returned. 7555 These include lease expiration and a number of forms of lock 7556 revocation within the lease period. It is important to note that in 7557 these situations, the stateid remains valid and the client can use it 7558 to determine the disposition of the associated lost locks. 7560 An "other" value must never be reused for a different purpose (i.e. 7561 different filehandle, owner, or type of locks) within the context of 7562 a single client ID. A server may retain the "other" value for the 7563 same purpose beyond the point where it may otherwise be freed but if 7564 it does so, it must maintain "seqid" continuity with previous values. 7566 One mechanism that may be used to satisfy the requirement that the 7567 server recognize invalid and out-of-date stateids is for the server 7568 to divide the "other" field of the stateid into two fields. 7570 o An index into a table of locking-state structures. 7572 o A generation number which is incremented on each allocation of a 7573 table entry for a particular use. 7575 And then store in each table entry, 7577 o The client ID with which the stateid is associated. 7579 o The current generation number for the (at most one) valid stateid 7580 sharing this index value. 7582 o The filehandle of the file on which the locks are taken. 7584 o An indication of the type of stateid (open, byte-range lock, file 7585 delegation, directory delegation, layout). 7587 o The last "seqid" value returned corresponding to the current 7588 "other" value. 7590 o An indication of the current status of the locks associated with 7591 this stateid. In particular, whether these have been revoked and 7592 if so, for what reason. 7594 With this information, an incoming stateid can be validated and the 7595 appropriate error returned when necessary. Special and non-special 7596 stateids are handled separately. (See Section 8.2.3 for a discussion 7597 of special stateids.) 7599 Note that stateids are implicitly qualified by the current client ID, 7600 as derived from the client ID associated with the current session. 7601 Note however, that the semantics of the session will prevent stateids 7602 associated with a previous client or server instance from being 7603 analyzed by this procedure. 
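The table-based scheme just described might be sketched as follows. This is a non-normative illustration in Python under the assumptions of this section: the "other" field is split into a table index and a generation number, and each entry carries the client ID, filehandle, stateid type, last seqid, and revocation status. The checks mirror the validation steps enumerated in the procedure below (seqid wraparound is ignored for brevity); all names are illustrative.

   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class LockStateEntry:
       generation: int
       client_id: int
       filehandle: bytes
       stateid_type: str        # e.g. "open", "lock", "delegation", "layout"
       last_seqid: int
       revoked: Optional[str]   # None, or the NFS4ERR_* revocation error

   def validate_stateid(table, index, generation, seqid, current_fh,
                        session_client_id, types_valid_here):
       """Return NFS4_OK or the error to be returned for this stateid."""
       if index >= len(table) or table[index] is None:
           return "NFS4ERR_BAD_STATEID"
       entry = table[index]
       if entry.generation != generation:
           return "NFS4ERR_BAD_STATEID"
       if entry.filehandle != current_fh:
           return "NFS4ERR_BAD_STATEID"
       if entry.client_id != session_client_id:
           return "NFS4ERR_BAD_STATEID"
       if entry.revoked is not None:
           # e.g. NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, NFS4ERR_DELEG_REVOKED
           return entry.revoked
       if entry.stateid_type not in types_valid_here:
           return "NFS4ERR_BAD_STATEID"
       if seqid != 0 and seqid > entry.last_seqid:
           return "NFS4ERR_BAD_STATEID"
       if seqid != 0 and seqid < entry.last_seqid:
           return "NFS4ERR_OLD_STATEID"
       return "NFS4_OK"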
7605 If a server restart has resulted in an invalid client ID or an 7606 invalid session ID, SEQUENCE will return an error and the operation 7607 that takes a stateid as an argument will never be processed. 7609 If there has been a server restart where there is a persistent 7610 session, and all leased state has been lost, then the session in 7611 question will, although valid, be marked as dead, and any operation 7612 not satisfied by means of the reply cache will receive the error 7613 NFS4ERR_DEADSESSION, and thus not be processed as indicated below. 7615 When a stateid is being tested, and the "other" field is all zeros or 7616 all ones, a check that the "other" and "seqid" fields match a defined 7617 combination for a special stateid is done and the results determined 7618 as follows: 7620 o If the "other" and "seqid" fields do not match a defined 7621 combination associated with a special stateid, the error 7622 NFS4ERR_BAD_STATEID is returned. 7624 o If the special stateid is one designating the current stateid, and 7625 there is a current stateid, then the current stateid is 7626 substituted for the special stateid and the checks appropriate to 7627 non-special stateids are performed. 7629 o If the combination is valid in general but is not appropriate to 7630 the context in which the stateid is used (e.g. an all-zero stateid 7631 is used when an open stateid is required in a LOCK operation), the 7632 error NFS4ERR_BAD_STATEID is also returned. 7634 o Otherwise, the check is completed and the special stateid is 7635 accepted as valid. 7637 When a stateid is being tested, and the "other" field is neither all 7638 zeros nor all ones, the following procedure could be used to validate 7639 an incoming stateid and return an appropriate error, when necessary, 7640 assuming that the "other" field would be divided into a table index 7641 and an entry generation. 7643 o If the table index field is outside the range of the associated 7644 table, return NFS4ERR_BAD_STATEID. 7646 o If the selected table entry is of a different generation than that 7647 specified in the incoming stateid, return NFS4ERR_BAD_STATEID. 7649 o If the selected table entry does not match the current filehandle, 7650 return NFS4ERR_BAD_STATEID. 7652 o If the client ID in the table entry does not match the client ID 7653 associated with the current session, return NFS4ERR_BAD_STATEID. 7655 o If the stateid represents revoked state, then return 7656 NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED, 7657 as appropriate. 7659 o If the stateid type is not valid for the context in which the 7660 stateid appears, return NFS4ERR_BAD_STATEID. Note that a stateid 7661 may be valid in general, as would be reported by the TEST_STATEID 7662 operation, but be invalid for a particular operation, as, for 7663 example, when a stateid which doesn't represent byte-range locks 7664 is passed to the non-from_open case of LOCK or to LOCKU, or when a 7665 stateid which does not represent an open is passed to CLOSE or 7666 OPEN_DOWNGRADE. In such cases, the server MUST return 7667 NFS4ERR_BAD_STATEID. 7669 o If the "seqid" field is not zero, and it is greater than the 7670 current sequence value corresponding to the current "other" field, 7671 return NFS4ERR_BAD_STATEID. 7673 o If the "seqid" field is not zero, and it is less than the current 7674 sequence value corresponding to the current "other" field, return 7675 NFS4ERR_OLD_STATEID.
7677 o Otherwise, the stateid is valid and the table entry should contain 7678 any additional information about the type of stateid and 7679 information associated with that particular type of stateid, such 7680 as the associated set of locks, including open-owner and lock-owner 7681 information, as well as information on the specific locks, such as 7682 open modes and byte ranges. 7684 8.2.5. Stateid Use for I/O Operations 7686 Clients performing I/O operations need to select an appropriate 7687 stateid based on the locks (including opens and delegations) held by 7688 the client and the various types of state-owners issuing the I/O 7689 requests. SETATTR operations which change the file size are treated 7690 like I/O operations in this regard. 7692 The following rules, applied in order of decreasing priority, govern 7693 the selection of the appropriate stateid. In following these rules, 7694 the client will only consider locks of which it has actually received 7695 notification by an appropriate operation response or callback. Note 7696 that the rules are slightly different in the case of I/O to data 7697 servers when file layouts are being used (see Section 13.9.1). 7699 o If the client holds a delegation for the file in question, the 7700 delegation stateid SHOULD be used. 7702 o Otherwise, if the entity corresponding to the lock-owner (e.g. a process) 7703 issuing the I/O has a lock stateid for the associated open file, 7704 then the lock stateid for that lock-owner and open file SHOULD be 7705 used. 7707 o If there is no lock stateid, then the open stateid for the open 7708 file in question SHOULD be used. 7710 o Finally, if none of the above apply, then a special stateid SHOULD 7711 be used. 7713 Ignoring these rules may result in situations in which the server 7714 does not have information necessary to properly process the request. 7715 For example, when mandatory byte-range locks are in effect, if the 7716 stateid does not indicate the proper lock-owner, via a lock stateid, 7717 a request might be avoidably rejected. 7719 The server, however, should not try to enforce these ordering rules and 7720 should use whatever information is available to properly process I/O 7721 requests. In particular, when a client has a delegation for a given 7722 file, it SHOULD take note of this fact in processing a request, even 7723 if it is sent with a special stateid. 7725 8.2.6. Stateid Use for SETATTR Operations 7727 Because each operation is associated with a session ID and from that 7728 the client ID can be determined, operations do not need to include a 7729 stateid for the server to be able to determine whether they should 7730 cause a delegation to be recalled or are to be treated as done within 7731 the scope of the delegation. 7733 In the case of SETATTR operations, a stateid is present. In cases 7734 other than those which set the file size, the client may send either 7735 a special stateid or, when a delegation is held for the file in 7736 question, a delegation stateid. While the server SHOULD validate the 7737 stateid and may use the stateid to optimize the determination as to 7738 whether a delegation is held, it SHOULD note the presence of a 7739 delegation even when a special stateid is sent, and MUST accept a 7740 valid delegation stateid when sent. 7742 8.3. Lease Renewal 7744 Each client/server pair, as represented by a client ID, has a single 7745 lease.
The purpose of the lease is to allow the client to indicate 7746 to the server, in a low-overhead way, that it is active, and thus 7747 that the server is to retain the client's locks. This arrangement 7748 allows the server to remove stale locking-related objects that are 7749 held by a client that has crashed or is otherwise unreachable, once 7750 the relevant lease expires. This in turn allows other clients to 7751 obtain conflicting locks without being delayed indefinitely by 7752 inactive or unreachable clients. It is not a mechanism for cache 7753 consistency, and lease renewals may not be denied if the lease 7754 interval has not expired. 7756 Since each session is associated with a specific client (identified 7757 by the client's client ID), any operation sent on that session is an 7758 indication that the associated client is reachable. When a request 7759 is sent for a given session, successful execution of a SEQUENCE 7760 operation (or successful retrieval of the result of SEQUENCE from the 7761 reply cache) on an unexpired lease will result in the lease being 7762 implicitly renewed, for the standard renewal period (equal to the 7763 lease_time attribute). 7765 If the client ID's lease has not expired when the server receives a 7766 SEQUENCE operation, then the server MUST renew the lease. If the 7767 client ID's lease has expired when the server receives a SEQUENCE 7768 operation, the server MAY renew the lease; this depends on whether 7769 any state was revoked as a result of the client's failure to renew 7770 the lease before expiration. 7772 Absent other activity that would renew the lease, a COMPOUND 7773 consisting of a single SEQUENCE operation will suffice. The client 7774 should also take communication-related delays into account and take 7775 steps to ensure that the renewal messages actually reach the server 7776 in good time. For example: 7778 o When trunking is in effect, the client should consider issuing 7779 multiple requests on different connections, in order to ensure 7780 that renewal occurs, even in the event of blockage in the path 7781 used for one of those connections. 7783 o Transport retransmission delays might become so large as to 7784 approach or exceed the length of the lease period. This may be 7785 particularly likely when the server is unresponsive due to a 7786 restart; see Section 8.4.2.1. If the client implementation is not 7787 careful, transport retransmission delays can result in the client 7788 failing to detect a server restart before the grace period ends. 7789 The scenario is that the client is using a transport with 7790 exponential backoff, such that the maximum retransmission timeout 7791 exceeds both the grace period and the lease_time attribute. A 7792 network partition causes the client's connection's retransmission 7793 interval to back off, and even after the partition heals, the next 7794 transport-level retransmission is sent after the server has 7795 restarted and its grace period ends. 7797 The client MUST either recover from the ensuing NFS4ERR_NO_GRACE 7798 errors, or it MUST ensure that, despite transport-level 7799 retransmission intervals that exceed the lease_time, a 7800 SEQUENCE operation is nonetheless sent that renews the lease before 7801 expiration. The client can achieve this by associating a new 7802 connection with the session, and sending a SEQUENCE operation on 7803 it. However, if the attempt to establish a new connection is 7804 delayed for some reason (e.g.
exponential backoff of the 7805 connection establishment packets), the client will have to abort 7806 the connection establishment attempt before the lease expires, and 7807 attempt to re-connect. 7809 If the server renews the lease upon receiving a SEQUENCE operation, 7810 the server MUST NOT allow the lease to expire while the rest of the 7811 operations in the COMPOUND procedure's request are still executing. 7812 Once the last operation has finished, and the response to COMPOUND 7813 has been sent, the server MUST set the lease to expire no sooner than 7814 the sum of the current time and the value of the lease_time attribute. 7816 A client ID's lease can expire when it has been at least the lease 7817 interval (lease_time) since the last lease-renewing SEQUENCE 7818 operation was sent on any of the client ID's sessions and there are 7819 no active COMPOUND operations on any such sessions. 7821 Because the SEQUENCE operation is the basic mechanism to renew a 7822 lease, and because it must be done at least once for each lease 7823 period, it is the natural mechanism whereby the server will inform 7824 the client of changes in the lease status that the client needs to be 7825 informed of. The client should inspect the status flags 7826 (sr_status_flags) returned by SEQUENCE and take the appropriate 7827 action (see Section 18.46.3 for details). 7829 o The status bits SEQ4_STATUS_CB_PATH_DOWN and 7830 SEQ4_STATUS_CB_PATH_DOWN_SESSION indicate problems with the 7831 backchannel which the client may need to address in order to 7832 receive callback requests. 7834 o The status bits SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING and 7835 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED indicate problems with GSS 7836 contexts for the backchannel which the client may have to address 7837 to allow callback requests to be sent to it. 7839 o The status bits SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, 7840 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, 7841 SEQ4_STATUS_ADMIN_STATE_REVOKED, and 7842 SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock 7843 revocation events. When these bits are set, the client should use 7844 TEST_STATEID to find what stateids have been revoked and use 7845 FREE_STATEID to acknowledge loss of the associated state. 7847 o The status bit SEQ4_STATUS_LEASE_MOVED indicates that 7848 responsibility for lease renewal has been transferred to one or 7849 more new servers. 7851 o The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that 7852 due to a server restart the client must reclaim locking state. 7854 o The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates the server 7855 has encountered an unrecoverable fault with the backchannel (e.g. 7856 it has lost track of a sequence ID for a slot in the backchannel). 7858 8.4. Crash Recovery 7860 A critical requirement in crash recovery is that both the client and 7861 the server know when the other has failed. Additionally, it is 7862 required that a client see a consistent view of data across server 7863 restarts. All READ and WRITE operations that may have been queued 7864 within the client or network buffers must wait until the client has 7865 successfully recovered the locks protecting the READ and WRITE 7866 operations. Any that reach the server before the server can safely 7867 determine that the client has recovered enough locking state to be 7868 sure that such operations can be safely processed must be rejected.
7869 This will happen because either: 7871 o The state presented is no longer valid since it is associated with 7872 a now invalid client ID. In this case the client will receive 7873 either an NFS4ERR_BADSESSION or NFS4ERR_DEADSESSION error, and any 7874 attempt to attach a new session to that invalid client ID will 7875 result in an NFS4ERR_STALE_CLIENTID error. 7877 o Subsequent recovery of locks may make execution of the operation 7878 inappropriate (NFS4ERR_GRACE). 7880 8.4.1. Client Failure and Recovery 7882 In the event that a client fails, the server may release the client's 7883 locks when the associated lease has expired. Conflicting locks from 7884 another client may only be granted after this lease expiration. As 7885 discussed in Section 8.3, when a client has not failed and re- 7886 establishes its lease before expiration occurs, requests for 7887 conflicting locks will not be granted. 7889 To minimize client delay upon restart, lock requests are associated 7890 with an instance of the client by a client-supplied verifier. This 7891 verifier is part of the client_owner4 sent in the initial EXCHANGE_ID 7892 call made by the client. The server returns a client ID as a result 7893 of the EXCHANGE_ID operation. The client then confirms the use of 7894 the client ID by establishing a session associated with that client 7895 ID (see Section 18.36.3 for a description how this is done). All 7896 locks, including opens, byte-range locks, delegations, and layouts 7897 obtained by sessions using that client ID are associated with that 7898 client ID. 7900 Since the verifier will be changed by the client upon each 7901 initialization, the server can compare a new verifier to the verifier 7902 associated with currently held locks and determine that they do not 7903 match. This signifies the client's new instantiation and subsequent 7904 loss (upon confirmation of the new client ID) of locking state. As a 7905 result, the server is free to release all locks held which are 7906 associated with the old client ID which was derived from the old 7907 verifier. At this point conflicting locks from other clients, kept 7908 waiting while the lease had not yet expired, can be granted. In 7909 addition, all stateids associated with the old client ID can also be 7910 freed, as they are no longer reference-able. 7912 Note that the verifier must have the same uniqueness properties as 7913 the verifier for the COMMIT operation. 7915 8.4.2. Server Failure and Recovery 7917 If the server loses locking state (usually as a result of a restart), 7918 it must allow clients time to discover this fact and re-establish the 7919 lost locking state. The client must be able to re-establish the 7920 locking state without having the server deny valid requests because 7921 the server has granted conflicting access to another client. 7922 Likewise, if there is a possibility that clients have not yet re- 7923 established their locking state for a file, and that such locking 7924 state might make it invalid to perform READ or WRITE operations, for 7925 example through the establishment of mandatory locks, the server must 7926 disallow READ and WRITE operations for that file. 7928 A client can determine that loss of locking state has occurred via 7929 several methods. 7931 1. When a SEQUENCE (most common) or other operation returns 7932 NFS4ERR_BADSESSION, this may mean the session has been destroyed, 7933 but the client ID is still valid. 
The client sends a 7934 CREATE_SESSION request with the client ID to re-establish the 7935 session. If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID, 7936 the client must establish a new client ID (see Section 8.1) and 7937 re-establish its lock state with the new client ID, after the 7938 CREATE_SESSION operation succeeds (see Section 8.4.2.1). 7940 2. When a SEQUENCE (most common) or other operation on a persistent 7941 session returns NFS4ERR_DEADSESSION, this indicates that the 7942 session is no longer usable for new operations, i.e. those not 7943 satisfied from the reply cache. Once all pending operations are 7944 determined to be either performed before the retry or not 7945 performed, the client sends a CREATE_SESSION request with the 7946 client ID to re-establish the session. If CREATE_SESSION fails 7947 with NFS4ERR_STALE_CLIENTID, the client must establish a new 7948 client ID (see Section 8.1) and re-establish its lock state after 7949 the CREATE_SESSION, with the new client ID, succeeds 7950 (Section 8.4.2.1). 7952 3. When an operation that is neither SEQUENCE nor preceded by SEQUENCE (for 7953 example, CREATE_SESSION or DESTROY_SESSION) returns 7954 NFS4ERR_STALE_CLIENTID, the client MUST establish a new client 7955 ID (Section 8.1) and re-establish its lock state 7956 (Section 8.4.2.1). 7958 8.4.2.1. State Reclaim 7960 When state information and the associated locks are lost as a result 7961 of a server restart, the protocol must provide a way to cause that 7962 state to be re-established. The approach used is to define, for most 7963 types of locking state (layouts are an exception), a request whose 7964 function is to allow the client to re-establish on the server a lock 7965 first obtained from a previous instance. Generally these requests 7966 are variants of the requests normally used to create locks of that 7967 type and are referred to as "reclaim-type" requests, and the process 7968 of re-establishing such locks is referred to as "reclaiming" them. 7970 Because each client must have an opportunity to reclaim all of the 7971 locks that it has without the possibility that some other client will 7972 be granted a conflicting lock, a special period called the "grace 7973 period" is devoted to the reclaim process. During this period, 7974 requests creating client IDs and sessions are handled normally, but 7975 locking requests are subject to special restrictions. Only reclaim- 7976 type locking requests are allowed, unless the server can reliably 7977 determine (through state persistently maintained across restart 7978 instances) that granting any such lock cannot possibly conflict with 7979 a subsequent reclaim. When a request is made to obtain a new lock 7980 (i.e. not a reclaim-type request) during the grace period and such a 7981 determination cannot be made, the server must return the error 7982 NFS4ERR_GRACE. 7984 Once a session is established using the new client ID, the client 7985 will use reclaim-type locking requests (e.g. LOCK requests with 7986 reclaim set to TRUE and OPEN operations with a claim type of 7987 CLAIM_PREVIOUS; see Section 9.11) to re-establish its locking state. 7988 Once this is done, or if there is no such locking state to reclaim, 7989 the client sends a global RECLAIM_COMPLETE operation, i.e. one with 7990 the rca_one_fs argument set to FALSE, to indicate that it has 7991 reclaimed all of the locking state that it will reclaim.
Once a 7992 client sends such a RECLAIM_COMPLETE operation, it may attempt non- 7993 reclaim locking operations, although it may get NFS4ERR_GRACE errors 7994 on the operations until the period of special handling is over. See 7995 Section 11.7.7 for a discussion of the analogous handling of lock 7996 reclamation in the case of file systems transitioning from server to 7997 server. 7999 During the grace period, the server must reject READ and WRITE 8000 operations and non-reclaim locking requests (i.e. other LOCK and OPEN 8001 operations) with an error of NFS4ERR_GRACE, unless it can guarantee 8002 that these may be done safely, as described below. 8004 The grace period may last until all clients which are known to 8005 possibly have had locks have done a global RECLAIM_COMPLETE 8006 operation, indicating that they have finished reclaiming the locks 8007 they held before the server restart. This means that a client which 8008 has done a RECLAIM_COMPLETE must be prepared to receive an 8009 NFS4ERR_GRACE when attempting to acquire new locks. In order for the 8010 server to know that all clients with possible prior lock state have 8011 done a RECLAIM_COMPLETE, the server must maintain in stable storage a 8012 list of clients which may have such locks. The server may also 8013 terminate the grace period before all clients have done a global 8014 RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace period 8015 before a time equal to the lease period in order to give clients an 8016 opportunity to find out about the server restart, as a result of 8017 issuing requests on associated sessions with a frequency governed by 8018 the lease time. Note that when a client does not issue such requests 8019 (or they are issued by the client but not received by the server), it 8020 is possible for the grace period to expire before the client finds 8021 out that the server restart has occurred. 8023 Some additional time, in order to allow a client to establish a new 8024 client ID and session and to effect lock reclaims, may be added to the 8025 lease time. Note that analogous rules apply to file system-specific 8026 grace periods discussed in Section 11.7.7. 8028 If the server can reliably determine that granting a non-reclaim 8029 request will not conflict with reclamation of locks by other clients, 8030 the NFS4ERR_GRACE error does not have to be returned even within the 8031 grace period, although NFS4ERR_GRACE must always be returned to 8032 clients attempting a non-reclaim lock request before doing their own 8033 global RECLAIM_COMPLETE. For the server to be able to service READ 8034 and WRITE operations during the grace period, it must again be able 8035 to guarantee that no possible conflict could arise between a 8036 potential reclaim locking request and the READ or WRITE operation. 8037 If the server is unable to offer that guarantee, the NFS4ERR_GRACE 8038 error must be returned to the client. 8040 For a server to provide simple, valid handling during the grace 8041 period, the easiest method is to simply reject all non-reclaim 8042 locking requests and READ and WRITE operations by returning the 8043 NFS4ERR_GRACE error. However, a server may keep information about 8044 granted locks in stable storage. With this information, the server 8045 could determine if a regular lock or READ or WRITE operation can be 8046 safely processed.
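The decision logic described in the preceding paragraphs can be sketched, non-normatively, as follows. The predicate names are hypothetical stand-ins for whatever stable-storage-backed knowledge a server actually has; a server with no such knowledge simply treats the conflict check as always true and therefore rejects every non-reclaim request with NFS4ERR_GRACE.

   def grace_period_disposition(request, client,
                                sent_global_reclaim_complete,
                                may_conflict_with_reclaim):
       """Decide how to handle a request received during the grace period."""
       if request.is_reclaim:
           # Reclaim-type requests are what the grace period exists for.
           return "PROCESS"
       if request.is_locking and not sent_global_reclaim_complete(client):
           # Non-reclaim locking requests from a client that has not yet
           # sent its global RECLAIM_COMPLETE are always refused.
           return "NFS4ERR_GRACE"
       if may_conflict_with_reclaim(request):
           # Without a guarantee that no subsequent reclaim could
           # conflict, READ, WRITE, and non-reclaim locking requests
           # must be refused.
           return "NFS4ERR_GRACE"
       return "PROCESS"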
8048 For example, if the server maintained on stable storage summary 8049 information on whether mandatory locks exist, either mandatory byte- 8050 range locks, or share reservations specifying deny modes, many 8051 requests could be allowed during the grace period. If it is known 8052 that no such share reservations exist, OPEN requests that do not 8053 specify deny modes may be safely granted. If, in addition, it is 8054 known that no mandatory byte-range locks exist, either through 8055 information stored on stable storage or simply because the server 8056 does not support such locks, READ and WRITE requests may be safely 8057 processed during the grace period. Another important case is where 8058 it is known that no mandatory byte-range locks exist, either because 8059 the server does not provide support for them, or because their 8060 absence is known from persistently recorded data. In this case, READ 8061 and WRITE operations specifying stateids derived from reclaim-type 8062 operations may be validly processed during the grace period because 8063 the fact of the valid reclaim ensures that no lock subsequently 8064 granted can prevent the I/O. 8066 To reiterate, for a server that allows non-reclaim lock and I/O 8067 requests to be processed during the grace period, it MUST determine 8068 that no lock subsequently reclaimed will be rejected and that no lock 8069 subsequently reclaimed would have prevented any I/O operation 8070 processed during the grace period. 8072 Clients should be prepared for the return of NFS4ERR_GRACE errors for 8073 non-reclaim lock and I/O requests. In this case the client should 8074 employ a retry mechanism for the request. A delay (on the order of 8075 several seconds) between retries should be used to avoid overwhelming 8076 the server. Further discussion of the general issue is included in 8077 [46]. The client must account for servers that can perform I/O 8078 and non-reclaim locking requests within the grace period as well as 8079 those that cannot do so. 8081 A reclaim-type locking request outside the server's grace period can 8082 only succeed if the server can guarantee that no conflicting lock or 8083 I/O request has been granted since restart. 8085 A server may, upon restart, establish a new value for the lease 8086 period. Therefore, clients should, once a new client ID is 8087 established, refetch the lease_time attribute and use it as the basis 8088 for lease renewal for the lease associated with that server. 8089 However, the server must establish, for this restart event, a grace 8090 period at least as long as the lease period for the previous server 8091 instantiation. This allows the client state obtained during the 8092 previous server instance to be reliably re-established. 8094 The possibility exists that, because of server configuration events, 8095 the client will be communicating with a server different than the one 8096 on which the locks were obtained, as shown by the combination of 8097 eir_server_scope and eir_server_owner. This leads to the issue of 8098 whether and when the client should attempt to reclaim locks previously 8099 obtained on what is being reported as a different server. The rules 8100 to resolve this question are as follows: 8102 o If the server scope is different, the client should not attempt to 8103 reclaim locks. In this situation no lock reclaim is possible.
8104 Any attempt to re-obtain the locks with non-reclaim operations is 8105 problematic since there is no guarantee that the existing 8106 filehandles will be recognized by the new server, or that, if 8107 recognized, they denote the same objects. It is best to treat the 8108 locks as having been revoked by the reconfiguration event. 8110 o If the server scope is the same, the client should attempt to 8111 reclaim locks, even if the eir_server_owner value is different. 8112 In this situation, it is the responsibility of the server to 8113 return NFS4ERR_NO_GRACE if it cannot provide correct support for 8114 lock reclaim operations, including the prevention of edge 8115 conditions. 8117 The eir_server_owner field is not used in making this determination. 8118 Its function is to specify trunking possibilities for the client (see 8119 Section 2.10.5) and not to control lock reclaim. 8121 8.4.2.1.1. Security Considerations for State Reclaim 8123 During the grace period, a client can reclaim state it believes or 8124 asserts it had before the server restarted. Unless the server 8125 maintained a complete record of all the state the client had, the 8126 server has little choice but to trust the client. (Of course if the 8127 server maintained a complete record, then it would not have to force 8128 the client to reclaim state after server restart.) While the server 8129 has to trust the client to tell the truth, such trust does not have 8130 any negative consequences for security. The fundamental rule for the 8131 server when processing reclaim requests is that it MUST NOT grant the 8132 reclaim if an equivalent non-reclaim request would not be granted 8133 during steady-state due to access control or access conflict issues. 8134 For example, an OPEN request during a reclaim will be refused with 8135 NFS4ERR_ACCESS if the principal making the request does not have 8136 access to open the file according to the discretionary ACL 8137 (Section 6.2.2) on the file. 8139 Nonetheless, it is possible that a client operating in error or 8140 maliciously could, during reclaim, prevent another client from 8141 reclaiming access to state. For example, an attacker could send an 8142 OPEN reclaim operation with a deny mode that prevents another client 8143 from reclaiming the open state it had before the server restarted. 8145 The attacker could perform the same denial of service during steady 8146 state prior to server restart, as long as the attacker had 8147 permissions. Given that the attack vectors are equivalent, the grace 8148 period does not offer any additional opportunity for denial of 8149 service, and any concerns about this attack vector, whether during 8150 grace or steady state, are addressed the same way: use RPCSEC_GSS for 8151 authentication, and limit access to the file only to principals that the 8152 owner of the file trusts. 8154 Note that if prior to restart the server had client IDs with the 8155 EXCHGID4_FLAG_BIND_PRINC_STATEID (Section 18.35) capability set, then 8156 the server SHOULD record in stable storage the client owner and the 8157 principal that established the client ID via EXCHANGE_ID. If the 8158 server does not, then there is a risk that a client will be unable to 8159 reclaim state if it does not have a credential for a principal that 8160 was originally authorized to establish the state. 8162 8.4.3.
Network Partitions and Recovery 8164 If the duration of a network partition is greater than the lease 8165 period provided by the server, the server will not have received a 8166 lease renewal from the client. If this occurs, the server may free 8167 all locks held for the client, or it may allow the lock state to 8168 remain for a considerable period, subject to the constraint that if a 8169 request for a conflicting lock is made, locks associated with an 8170 expired lease do not prevent such a conflicting lock from being 8171 granted but MUST be revoked as necessary so as not to interfere with 8172 such conflicting requests. 8174 If the server chooses to delay freeing of lock state until there is a 8175 conflict, it may either free all of the client's locks once there is a 8176 conflict, or it may only revoke the minimum set of locks necessary to 8177 allow conflicting requests. When it adopts the finer-grained 8178 approach, it must revoke all locks associated with a given stateid, 8179 even if the conflict is with only a subset of locks. 8181 When the server chooses to free all of a client's lock state, either 8182 immediately upon lease expiration, or as a result of the first attempt 8183 to obtain a conflicting lock, the server may report the loss of 8184 lock state in a number of ways. 8186 The server may choose to invalidate the session and the associated 8187 client ID. In this case, once the client can communicate with the 8188 server, it will receive an NFS4ERR_BADSESSION error. Upon attempting 8189 to create a new session, it would get an NFS4ERR_STALE_CLIENTID. 8190 Upon creating the new client ID and new session, it would attempt to 8191 reclaim locks but not be allowed to do so by the server. 8193 Another possibility is for the server to maintain the session and 8194 client ID but for all stateids held by the client to become invalid 8195 or stale. Once the client can reach the server after such a network 8196 partition, the status returned by the SEQUENCE operation will 8197 indicate a loss of locking state, i.e. the flag 8198 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED will be set in sr_status_flags. 8199 In addition, all I/O submitted by the client with the now invalid 8200 stateids will fail with the server returning the error 8201 NFS4ERR_EXPIRED. Once the client learns of the loss of locking 8202 state, it will suitably notify the applications that held the 8203 invalidated locks. The client should then take action to free 8204 invalidated stateids, either by establishing a new client ID using a 8205 new verifier or by doing a FREE_STATEID operation to release each of 8206 the invalidated stateids. 8208 When the server adopts a finer-grained approach to revocation of 8209 locks when a client's lease has expired, only a subset of stateids 8210 will normally become invalid during a network partition. When the 8211 client can communicate with the server after such a network partition 8212 heals, the status returned by the SEQUENCE operation will indicate a 8213 partial loss of locking state 8214 (SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED). In addition, operations, 8215 including I/O submitted by the client, with the now invalid stateids 8216 will fail with the server returning the error NFS4ERR_EXPIRED. Once 8217 the client learns of the loss of locking state, it will use the 8218 TEST_STATEID operation on all of its stateids to determine which 8219 locks have been lost and then suitably notify the applications that 8220 held the invalidated locks.
The client can then release the 8221 invalidated locking state and acknowledge the revocation of the 8222 associated locks by doing a FREE_STATEID operation on each of the 8223 invalidated stateids. 8225 When a network partition is combined with a server restart, there are 8226 edge conditions that place requirements on the server in order to 8227 avoid silent data corruption following the server restart. Two of 8228 these edge conditions are known, and are discussed below. 8230 The first edge condition arises as a result of scenarios such as 8231 the following: 8233 1. Client A acquires a lock. 8235 2. Client A and server experience mutual network partition, such 8236 that client A is unable to renew its lease. 8238 3. Client A's lease expires, and the server releases the lock. 8240 4. Client B acquires a lock that would have conflicted with that of 8241 Client A. 8243 5. Client B releases its lock. 8245 6. Server restarts. 8247 7. Network partition between client A and server heals. 8249 8. Client A connects to new server instance and finds out about 8250 server restart. 8252 9. Client A reclaims its lock within the server's grace period. 8254 Thus, at the final step, the server has erroneously granted client 8255 A's lock reclaim. If client B modified the object the lock was 8256 protecting, client A will experience object corruption. 8258 The second known edge condition arises in situations such as the 8259 following: 8261 1. Client A acquires one or more locks. 8263 2. Server restarts. 8265 3. Client A and server experience mutual network partition, such 8266 that client A is unable to reclaim all of its locks within the 8267 grace period. 8269 4. Server's reclaim grace period ends. Client A has either no 8270 locks or an incomplete set of locks known to the server. 8272 5. Client B acquires a lock that would have conflicted with a lock 8273 of client A that was not reclaimed. 8275 6. Client B releases the lock. 8277 7. Server restarts a second time. 8279 8. Network partition between client A and server heals. 8281 9. Client A connects to new server instance and finds out about 8282 server restart. 8284 10. Client A reclaims its lock within the server's grace period. 8286 As with the first edge condition, the final step of the scenario of 8287 the second edge condition has the server erroneously granting client 8288 A's lock reclaim. 8290 Solving the first and second edge conditions requires either that the 8291 server always assume after it restarts that some edge condition has 8292 occurred, and thus return NFS4ERR_NO_GRACE for all reclaim attempts, or 8293 that the server record some information in stable storage. The 8294 amount of information the server records in stable storage is in 8295 inverse proportion to how harsh the server intends to be whenever 8296 edge conditions arise. A server that is completely tolerant of all 8297 edge conditions will record in stable storage every lock that is 8298 acquired, removing the lock record from stable storage only when the 8299 lock is released. For the two edge conditions discussed above, the 8300 harshest a server can be, and still support a grace period for 8301 reclaims, requires that the server record some minimal information 8302 in stable storage. For example, a server 8303 implementation could, for each client, save in stable storage a 8304 record containing: 8306 o the co_ownerid field from the client_owner4 presented in the 8307 EXCHANGE_ID operation.
8309 o a boolean that indicates if the client's lease expired or if there 8310 was administrative intervention (see Section 8.5) to revoke a 8311 byte-range lock, share reservation, or delegation and there has 8312 been no acknowledgement, via FREE_STATEID, of such revocation. 8314 o a boolean that indicates whether the client may have locks that it 8315 believes to be reclaimable in situations which the grace period 8316 was terminated, making the server's view of lock reclaimability 8317 suspect. The server will set this for any client record in stable 8318 storage where the client has not done a suitable RECLAIM_COMPLETE 8319 (global or file system-specific depending on the target of the 8320 lock request) before it grants any new (i.e. not reclaimed) lock 8321 to any client. 8323 Assuming the above record keeping, for the first edge condition, 8324 after the server restarts, the record that client A's lease expired 8325 means that another client could have acquired a conflicting byte- 8326 range lock, share reservation, or delegation. Hence the server must 8327 reject a reclaim from client A with the error NFS4ERR_NO_GRACE. 8329 For the second edge condition, after the server restarts for a second 8330 time, the indication that the client had not completed its reclaims 8331 at the time at which the grace period ended means that the server 8332 must reject a reclaim from client A with the error NFS4ERR_NO_GRACE. 8334 When either edge condition occurs, the client's attempt to reclaim 8335 locks will result in the error NFS4ERR_NO_GRACE. When this is 8336 received, or after the client restarts with no lock state, the client 8337 will send a global RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is 8338 received, the server and client are again in agreement regarding 8339 reclaimable locks and both booleans in persistent storage can be 8340 reset, to be set again only when there is a subsequent event that 8341 causes lock reclaim operations to be questionable. 8343 Regardless of the level and approach to record keeping, the server 8344 MUST implement one of the following strategies (which apply to 8345 reclaims of share reservations, byte-range locks, and delegations): 8347 1. Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely 8348 unforgiving, but necessary if the server does not record lock 8349 state in stable storage. 8351 2. Record sufficient state in stable storage such that all known 8352 edge conditions involving server restart, including the two noted 8353 in this section, are detected. It is acceptable to erroneously 8354 recognize an edge condition and not allow a reclaim, when, with 8355 sufficient knowledge it would be allowed. The error the server 8356 would return in this case is NFS4ERR_NO_GRACE. Note it is not 8357 known if there are other edge conditions. 8359 In the event that, after a server restart, the server determines 8360 that there is unrecoverable damage or corruption to the 8361 information in stable storage, then for all clients and/or locks 8362 which may be affected, the server MUST return NFS4ERR_NO_GRACE. 8364 A mandate for the client's handling of the NFS4ERR_NO_GRACE error is 8365 outside the scope of this specification, since the strategies for 8366 such handling are very dependent on the client's operating 8367 environment. However, one potential approach is described below. 
8369 When the client receives NFS4ERR_NO_GRACE, it could examine the 8370 change attribute of the objects the client is trying to reclaim state 8371 for, and use that to determine whether to re-establish the state via 8372 normal OPEN or LOCK requests. This is acceptable provided the 8373 client's operating environment allows it. In other words, the client 8374 implementor is advised to document this behavior for users. The 8375 client could also inform the application that its byte-range lock or 8376 share reservations (whether they were delegated or not) have been 8377 lost, such as via a UNIX signal, a GUI pop-up window, etc. See 8378 Section 10.5 for a discussion of what the client should do for 8379 dealing with unreclaimed delegations on client state. 8381 For further discussion of revocation of locks see Section 8.5. 8383 8.5. Server Revocation of Locks 8385 At any point, the server can revoke locks held by a client and the 8386 client must be prepared for this event. When the client detects that 8387 its locks have been or may have been revoked, the client is 8388 responsible for validating the state information between itself and 8389 the server. Validating locking state for the client means that it 8390 must verify or reclaim state for each lock currently held. 8392 The first occasion of lock revocation is upon server restart. Note 8393 that this includes situations in which sessions are persistent and 8394 locking state is lost. In this class of instances, the client will 8395 receive an error (NFS4ERR_STALE_CLIENTID) on an operation that takes 8396 a client ID (usually as part of recovery in response to a problem with 8397 the current session), and the client will proceed with normal crash 8398 recovery as described in Section 8.4.2.1. 8400 The second occasion of lock revocation is the inability to renew the 8401 lease before expiration, as discussed in Section 8.4.3. While this 8402 is considered a rare or unusual event, the client must be prepared to 8403 recover. The server is responsible for determining the precise 8404 consequences of the lease expiration, informing the client of the 8405 scope of the lock revocation decided upon. The client then uses the 8406 status information provided by the server in the SEQUENCE results 8407 (field sr_status_flags, see Section 18.46.3) to synchronize its 8408 locking state with that of the server, in order to recover. 8410 The third occasion of lock revocation can occur as a result of 8411 revocation of locks within the lease period, either because of 8412 administrative intervention, or because a recallable lock (a 8413 delegation or layout) was not returned within the lease period after 8414 having been recalled. While these are considered rare events, they 8415 are possible and the client must be prepared to deal with them. When 8416 either of these events occurs, the client finds out about the 8417 situation through the status returned by the SEQUENCE operation. Any 8418 use of stateids associated with locks revoked during the lease period 8419 will receive the error NFS4ERR_ADMIN_REVOKED or 8420 NFS4ERR_DELEG_REVOKED, as appropriate. 8422 In all situations in which a subset of locking state may have been 8423 revoked, which include all cases in which locking state is revoked 8424 within the lease period, it is up to the client to determine which 8425 locks have been revoked and which have not. It does this by using 8426 the TEST_STATEID operation on the appropriate set of stateids.
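The following is a minimal client-side sketch of this determination and the subsequent cleanup, under the assumption that test_stateid(), free_stateid(), and notify_application() are hypothetical wrappers around the corresponding requests sent in COMPOUNDs on an established session. It is an illustration of the sequence described here, not a mandated implementation.

   /* Sketch only: recovery from partial revocation of locking state.
    * The wrappers below are hypothetical stand-ins for TEST_STATEID,
    * FREE_STATEID, and application notification. */
   #include <stddef.h>
   #include <stdint.h>
   #include <stdio.h>

   typedef struct {
       uint32_t seqid;
       uint8_t  other[12];
   } stateid4;

   enum ts_result { TS_VALID, TS_REVOKED };

   /* Hypothetical wrapper around a TEST_STATEID request. */
   static enum ts_result test_stateid(const stateid4 *sid)
   {
       (void)sid;
       return TS_REVOKED;              /* stub for illustration only */
   }

   /* Hypothetical wrapper around a FREE_STATEID request. */
   static void free_stateid(const stateid4 *sid)
   {
       printf("FREE_STATEID seqid=%u\n", (unsigned)sid->seqid);
   }

   /* Hypothetical notification of the application that held the locks. */
   static void notify_application(const stateid4 *sid)
   {
       printf("locks under stateid seqid=%u were revoked\n",
              (unsigned)sid->seqid);
   }

   /* Test every held stateid; for each revoked one, notify the holder
    * and acknowledge the revocation so the server can release its
    * record of it. */
   static void recover_revoked_state(stateid4 *held, size_t n)
   {
       for (size_t i = 0; i < n; i++) {
           if (test_stateid(&held[i]) == TS_REVOKED) {
               notify_application(&held[i]);
               free_stateid(&held[i]);
           }
       }
   }

   int main(void)
   {
       stateid4 held[2] = { { 1, {0} }, { 2, {0} } };
       recover_revoked_state(held, 2);
       return 0;
   }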
Once 8427 the set of revoked locks has been determined, the applications can be 8428 notified, and the invalidated stateids can be freed and lock 8429 revocation acknowledged by using FREE_STATEID. 8431 8.6. Short and Long Leases 8433 When determining the time period for the server lease, the usual 8434 lease tradeoffs apply. Short leases are good for fast server 8435 recovery at a cost of increased operations to effect lease renewal 8436 (when there are no other operations during the period to effect lease 8437 renewal as a side-effect). Long leases are certainly kinder and 8438 gentler to servers trying to handle very large numbers of clients. 8439 The number of extra requests to effect lock renewal drops in inverse 8440 proportion to the lease time. The disadvantages of long leases 8441 include the possibility of slower recovery after certain failures. 8442 After server failure, a longer grace period may be required when some 8443 clients do not promptly reclaim their locks and do a global 8444 RECLAIM_COMPLETE. In the event of client failure, there can be a 8445 longer period for leases to expire thus forcing conflicting requests 8446 to wait. 8448 Long leases are practical if the server can store lease state in non- 8449 volatile memory. Upon recovery, the server can reconstruct the lease 8450 state from its non-volatile memory and continue operation with its 8451 clients and therefore long leases would not be an issue. 8453 8.7. Clocks, Propagation Delay, and Calculating Lease Expiration 8455 To avoid the need for synchronized clocks, lease times are granted by 8456 the server as a time delta. However, there is a requirement that the 8457 client and server clocks do not drift excessively over the duration 8458 of the lease. There is also the issue of propagation delay across 8459 the network which could easily be several hundred milliseconds as 8460 well as the possibility that requests will be lost and need to be 8461 retransmitted. 8463 To take propagation delay into account, the client should subtract it 8464 from lease times (e.g. if the client estimates the one-way 8465 propagation delay as 200 milliseconds, then it can assume that the 8466 lease is already 200 milliseconds old when it gets it). In addition, 8467 it will take another 200 milliseconds to get a response back to the 8468 server. So the client must send a lease renewal or write data back 8469 to the server at least 400 milliseconds before the lease would 8470 expire. If the propagation delay varies over the life of the lease 8471 (e.g. the client is on a mobile host), the client will need to 8472 continuously subtract the increase in propagation delay from the 8473 lease times. 8475 The server's lease period configuration should take into account the 8476 network distance of the clients that will be accessing the server's 8477 resources. It is expected that the lease period will take into 8478 account the network propagation delays and other network delay 8479 factors for the client population. Since the protocol does not allow 8480 for an automatic method to determine an appropriate lease period, the 8481 server's administrator may have to tune the lease period. 8483 8.8. Obsolete Locking Infrastructure From NFSv4.0 8485 There are a number of operations and fields within existing 8486 operations that no longer have a function in NFSv4.1. 
In one way or 8487 another, these changes are all due to the implementation of sessions 8488 which provides client context and exactly once semantics as a base 8489 feature of the protocol, separate from locking itself. 8491 The following NFSv4.0 operations MUST NOT be implemented in NFSv4.1. 8492 The server MUST return NFS4ERR_NOTSUPP if these operations are found 8493 in an NFSv4.1 COMPOUND. 8495 o SETCLIENTID since its function has been replaced by EXCHANGE_ID. 8497 o SETCLIENTID_CONFIRM since client ID confirmation now happens by 8498 means of CREATE_SESSION. 8500 o OPEN_CONFIRM because state-owner-based seqids have been replaced 8501 by the sequence ID in the SEQUENCE operation. 8503 o RELEASE_LOCKOWNER because lock-owners with no associated locks do 8504 not have any sequence-related state and so can be deleted by the 8505 server at will. 8507 o RENEW because every SEQUENCE operation for a session causes lease 8508 renewal, making a separate operation superfluous. 8510 Also, there are a number of fields, present in existing operations 8511 related to locking that have no use in minor version one. They were 8512 used in minor version zero to perform functions now provided in a 8513 different fashion. 8515 o Sequence ids used to sequence requests for a given state-owner and 8516 to provide retry protection, now provided via sessions. 8518 o Client IDs used to identify the client associated with a given 8519 request. Client identification is now available using the client 8520 ID associated with the current session, without needing an 8521 explicit client ID field. 8523 Such vestigial fields in existing operations have no function in 8524 NFSv4.1 and are ignored by the server. Note that client IDs in 8525 operations new to NFSv4.1 (such as CREATE_SESSION and 8526 DESTROY_CLIENTID) are not ignored. 8528 9. File Locking and Share Reservations 8530 To support Win32 share reservations it is necessary to provide 8531 operations which atomically open or create files. Having a separate 8532 share/unshare operation would not allow correct implementation of the 8533 Win32 OpenFile API. In order to correctly implement share semantics, 8534 the previous NFS protocol mechanisms used when a file is opened or 8535 created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFSv4.1 8536 protocol defines an OPEN operation which is capable of atomically 8537 looking up, creating, and locking a file on the server. 8539 9.1. Opens and Byte-Range Locks 8541 It is assumed that manipulating a byte-range lock is rare when 8542 compared to READ and WRITE operations. It is also assumed that 8543 server restarts and network partitions are relatively rare. 8544 Therefore it is important that the READ and WRITE operations have a 8545 lightweight mechanism to indicate if they possess a held lock. A 8546 byte-range lock request contains the heavyweight information required 8547 to establish a lock and uniquely define the owner of the lock. 8549 9.1.1. State-owner Definition 8551 When opening a file or requesting a byte-range lock, the client must 8552 specify an identifier which represents the owner of the requested 8553 lock. This identifier is in the form of a state-owner, represented 8554 in the protocol by a state_owner4, a variable-length opaque array 8555 which, when concatenated with the current client ID uniquely defines 8556 the owner of lock managed by the client. This may be a thread ID, 8557 process ID, or other unique value. 
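As an illustration of the preceding paragraph, the sketch below shows one way a server might key its table of state-owners: the opaque owner array is meaningful only in combination with the client ID, so two owners match only if both parts are identical. The owner_key structure and helper names are hypothetical and not part of the protocol.

   /* Sketch only: a server-side key for state-owners.  The opaque owner
    * array is interpreted together with the client ID. */
   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>

   #define NFS4_OPAQUE_LIMIT 1024    /* protocol limit on owner length */

   struct owner_key {
       uint64_t clientid;                 /* client ID for the session  */
       uint32_t owner_len;
       uint8_t  owner[NFS4_OPAQUE_LIMIT]; /* opaque state-owner array   */
   };

   /* Two owners are the same only if both the client ID and the opaque
    * array match byte for byte. */
   static bool owner_key_equal(const struct owner_key *a,
                               const struct owner_key *b)
   {
       return a->clientid == b->clientid &&
              a->owner_len == b->owner_len &&
              memcmp(a->owner, b->owner, a->owner_len) == 0;
   }

   int main(void)
   {
       struct owner_key k1 = { .clientid = 0x1, .owner_len = 4 };
       struct owner_key k2 = { .clientid = 0x2, .owner_len = 4 };
       memcpy(k1.owner, "lk01", 4);
       memcpy(k2.owner, "lk01", 4);  /* same opaque array, other client */
       printf("same owner? %s\n", owner_key_equal(&k1, &k2) ? "yes" : "no");
       return 0;
   }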
8559 Owners of opens and owners of byte-range locks are separate entities 8560 and remain separate even if the same opaque arrays are used to 8561 designate owners of each. The protocol distinguishes between open- 8562 owners (represented by open_owner4 structures) and lock-owners 8563 (represented by lock_owner4 structures). 8565 Each open is associated with a specific open-owner while each byte- 8566 range lock is associated with a lock-owner and an open-owner, the 8567 latter being the open-owner associated with the open file under which 8568 the LOCK operation was done. Delegations and layouts, on the other 8569 hand, are not associated with a specific owner but are associated 8570 with the client as a whole (identified by a client ID). 8572 9.1.2. Use of the Stateid and Locking 8574 All READ, WRITE and SETATTR operations contain a stateid. For the 8575 purposes of this section, SETATTR operations which change the size 8576 attribute of a file are treated as if they are writing the area 8577 between the old and new size (i.e. the range truncated or added to 8578 the file by means of the SETATTR), even where SETATTR is not 8579 explicitly mentioned in the text. The stateid passed to one of these 8580 operations must be one that represents an open, a set of byte-range 8581 locks, or a delegation, or it may be a special stateid representing 8582 anonymous access or the special bypass stateid. 8584 If the state-owner performs a READ or WRITE in a situation in which 8585 it has established a byte-range lock or share reservation on the 8586 server (any OPEN constitutes a share reservation) the stateid 8587 (previously returned by the server) must be used to indicate what 8588 locks, including both byte-range locks and share reservations, are 8589 held by the state-owner. If no state is established by the client, 8590 either byte-range lock or share reservation, a special stateid for 8591 anonymous state (zero as "other" and "seqid") is used. (See 8592 Section 8.2.3 for a description of 'special' stateids in general.) 8593 Regardless whether a stateid for anonymous state or a stateid 8594 returned by the server is used, if there is a conflicting share 8595 reservation or mandatory byte-range lock held on the file, the server 8596 MUST refuse to service the READ or WRITE operation. 8598 Share reservations are established by OPEN operations and by their 8599 nature are mandatory in that when the OPEN denies READ or WRITE 8600 operations, that denial results in such operations being rejected 8601 with error NFS4ERR_LOCKED. Byte-range locks may be implemented by 8602 the server as either mandatory or advisory, or the choice of 8603 mandatory or advisory behavior may be determined by the server on the 8604 basis of the file being accessed (for example, some UNIX-based 8605 servers support a "mandatory lock bit" on the mode attribute such 8606 that if set, byte-range locks are required on the file before I/O is 8607 possible). When byte-range locks are advisory, they only prevent the 8608 granting of conflicting lock requests and have no effect on READs or 8609 WRITEs. Mandatory byte-range locks, however, prevent conflicting I/O 8610 operations. When they are attempted, they are rejected with 8611 NFS4ERR_LOCKED. When the client gets NFS4ERR_LOCKED on a file it 8612 knows it has the proper share reservation for, it will need to send a 8613 LOCK request on the region of the file that includes the region the 8614 I/O was to be performed on, with an appropriate locktype (i.e. 
8615 READ*_LT for a READ operation, WRITE*_LT for a WRITE operation). 8617 Note that for UNIX environments that support mandatory file locking, 8618 the distinction between advisory and mandatory locking is subtle. In 8619 fact, advisory and mandatory byte-range locks are exactly the same 8620 insofar as the APIs and requirements on implementation are concerned. If the 8621 mandatory lock attribute is set on the file, the server checks to see 8622 if the lock-owner has an appropriate shared (read) or exclusive 8623 (write) byte-range lock on the region it wishes to read or write to. 8625 If there is no appropriate lock, the server checks if there is a 8626 conflicting lock (which can be done by attempting to acquire the 8627 conflicting lock on behalf of the lock-owner, and if successful, 8628 releasing the lock after the READ or WRITE is done), and if there is, 8629 the server returns NFS4ERR_LOCKED. 8631 For Windows environments, byte-range locks are always mandatory, so 8632 the server always checks for byte-range locks during I/O requests. 8634 Thus, the NFSv4.1 LOCK operation does not need to distinguish between 8635 advisory and mandatory byte-range locks. It is the NFSv4.1 server's 8636 processing of the READ and WRITE operations that introduces the 8637 distinction. 8639 Every stateid which is validly passed to READ, WRITE or SETATTR, with 8640 the exception of special stateid values, defines an access mode for 8641 the file (i.e., READ, WRITE, or READ-WRITE). 8643 o For stateids associated with opens, this is the mode defined by 8644 the original OPEN which caused the allocation of the open stateid 8645 and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the 8646 same open-owner/file pair. 8648 o For stateids returned by byte-range lock requests, the appropriate 8649 mode is the access mode for the open stateid associated with the 8650 lock set represented by the stateid. 8652 o For delegation stateids the access mode is based on the type of 8653 delegation. 8655 When a READ, WRITE, or SETATTR (which specifies the size attribute) 8656 is done, the operation is subject to checking against the access mode 8657 to verify that the operation is appropriate given the stateid with 8658 which the operation is associated. 8660 In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which 8661 set size), the server MUST verify that the access mode allows writing 8662 and MUST return an NFS4ERR_OPENMODE error if it does not. In the 8663 case of READ, the server may perform the corresponding check on the 8664 access mode, or it may choose to allow READ on opens for WRITE only, 8665 to accommodate clients whose write implementation may unavoidably do 8666 reads (e.g. due to buffer cache constraints). However, even if READs 8667 are allowed in these circumstances, the server MUST still check for 8668 locks that conflict with the READ (e.g. another open specifying denial 8669 of READs). Note that a server which does enforce the access mode 8670 check on READs need not explicitly check for conflicting share 8671 reservations since the existence of an OPEN for read access guarantees 8672 that no conflicting share reservation can exist. 8674 The read bypass special stateid (all bits of "other" and "seqid" set 8675 to one) indicates a desire to bypass locking checks. The server MAY 8676 allow READ operations to bypass locking checks at the server, when 8677 this special stateid is used.
However, WRITE operations with this 8678 special stateid value MUST NOT bypass locking checks and are treated 8679 exactly the same as if a special stateid for anonymous state were 8680 used. 8682 A lock may not be granted while a READ or WRITE operation using one 8683 of the special stateids is being performed and the scope of the lock 8684 to be granted would conflict with the READ or WRITE operation. This 8685 can occur when: 8687 o A mandatory byte-range lock is requested with a range that conflicts 8688 with the range of the READ or WRITE operation. For the purposes 8689 of this paragraph, a conflict occurs when a shared lock is 8690 requested and a WRITE operation is being performed, or an 8691 exclusive lock is requested and either a READ or a WRITE operation 8692 is being performed. 8694 o A share reservation is requested which denies reading and/or 8695 writing and the corresponding operation is being performed. 8697 o A delegation is to be granted and the delegation type would 8698 prevent the I/O operation, i.e. READ and WRITE conflict with a 8699 write delegation and WRITE conflicts with a read delegation. 8701 When a client holds a delegation, it needs to ensure that the stateid 8702 sent conveys the association of the operation with the delegation, to 8703 avoid having the delegation unnecessarily recalled. When the 8704 delegation stateid, an open stateid associated with that 8705 delegation, or a stateid representing byte-range locks derived from 8706 such an open is used, the server knows that the READ, WRITE, or 8707 SETATTR does not conflict with the delegation, but is sent under the 8708 aegis of the delegation. Even though it is possible for the server 8709 to determine from the client ID (via the session ID) that the client 8710 does in fact have a delegation, the server is not obliged to check 8711 this, so using a special stateid can result in avoidable recall of 8712 the delegation. 8714 9.2. Lock Ranges 8716 The protocol allows a lock-owner to request a lock with a byte range 8717 and then either upgrade, downgrade, or unlock a sub-range of the 8718 initial lock, or a range that overlaps, fully or partially, either 8719 the initial lock or a combination of existing locks for the same 8720 lock-owner. It is expected that this 8721 will be an uncommon type of request. In any case, servers or server 8722 file systems may not be able to support sub-range lock semantics. In 8723 the event that a server receives a locking request that represents a 8724 sub-range of current locking state for the lock-owner, the server is 8725 allowed to return the error NFS4ERR_LOCK_RANGE to signify that it 8726 does not support sub-range lock operations. Therefore, the client 8727 should be prepared to receive this error and, if appropriate, report 8728 the error to the requesting application. 8730 The client is discouraged from combining multiple independent locking 8731 ranges that happen to be adjacent into a single request since the 8732 server may not support sub-range requests and for reasons related to 8733 the recovery of file locking state in the event of server failure. 8734 As discussed in Section 8.4.2, the server may employ certain 8735 optimizations during recovery that work effectively only when the 8736 client's behavior during lock recovery is similar to the client's 8737 locking behavior prior to server failure. 8739 9.3.
Upgrading and Downgrading Locks 8741 If a client has a write lock on a byte-range, it can request an 8742 atomic downgrade of the lock to a read lock via the LOCK request, by 8743 setting the type to READ_LT. If the server supports atomic 8744 downgrade, the request will succeed. If not, it will return 8745 NFS4ERR_LOCK_NOTSUPP. The client should be prepared to receive this 8746 error, and if appropriate, report the error to the requesting 8747 application. 8749 If a client has a read lock on a byte-range, it can request an atomic 8750 upgrade of the lock to a write lock via the LOCK request by setting 8751 the type to WRITE_LT or WRITEW_LT. If the server does not support 8752 atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade 8753 can be achieved without an existing conflict, the request will 8754 succeed. Otherwise, the server will return either NFS4ERR_DENIED or 8755 NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the 8756 client sent the LOCK request with the type set to WRITEW_LT and the 8757 server has detected a deadlock. The client should be prepared to 8758 receive such errors and if appropriate, report the error to the 8759 requesting application. 8761 9.4. Stateid Seqid Values and Byte-Range Locks 8763 When a lock or unlock request is done, passing a stateid, the stateid 8764 returned has the same "other" value and a "seqid" value that is 8765 incremented to reflect the occurrence of the lock or unlock request. 8766 The server MUST increment the value of the "seqid" field whenever 8767 there is any change to the locking status of any byte offset as 8768 described by any of locks covered by the stateid. A change in 8769 locking status includes a change from locked to unlocked or the 8770 reverse or a change from being locked for read to being locked for 8771 write or the reverse. 8773 When there is no such change, as, for example when a range already 8774 locked for write is locked again for write, the server MAY increment 8775 the "seqid" value. 8777 9.5. Issues with Multiple Open-Owners 8779 When the same file is opened by multiple open-owners, a client will 8780 have multiple open stateids for that file, each associated with a 8781 different open-owner. In that case, there can be multiple LOCK and 8782 LOCKU requests for the same lock-owner issued using the different 8783 open stateids, and so a situation may arise in which there are 8784 multiple stateids, each representing byte-range locks on the same 8785 file and held by the same lock-owner but each associated with a 8786 different open-owner. 8788 In such a situation, the locking status of each byte (i.e. whether it 8789 is locked, the read or write mode of the lock and the lock-owner 8790 holding the lock) MUST reflect the last LOCK or LOCKU operation done 8791 for the lock-owner in question, independent of the stateid through 8792 which the request was issued. 8794 When a byte is locked by the lock-owner in question, the open-owner 8795 to which that lock is assigned SHOULD be that of the open-owner 8796 associated with the stateid through which the last LOCK of that byte 8797 was done. When there is a change in the open-owner associated with 8798 locks for the stateid through which a LOCK or LOCKU was done, the 8799 "seqid" field of the stateid MUST be incremented, even if the 8800 locking, in terms of lock-owners has not changed. When there is a 8801 change to the set of locked bytes associated with a different stateid 8802 for the same lock-owner, i.e. 
associated with a different open-owner, 8803 the "seqid" value for that stateid MUST NOT be incremented. 8805 9.6. Blocking Locks 8807 Some clients require the support of blocking locks. While NFSv4.1 8808 provides a callback when a previously unavailable lock becomes 8809 available, this is an OPTIONAL feature and clients cannot depend on 8810 its presence. Clients need to be prepared to continually poll for 8811 the lock. This presents a fairness problem. Two of the lock types, 8812 READW and WRITEW, are used to indicate to the server that the client 8813 is requesting a blocking lock. When the callback is not used, the 8814 server should maintain an ordered list of pending blocking locks. 8815 When the conflicting lock is released, the server may wait for a 8816 period of time equal to lease_time for the first waiting client to 8817 re-request the lock. After the lease period expires, the next 8818 waiting client request is allowed the lock. Clients are required to 8819 poll at an interval sufficiently small that they are likely to acquire 8820 the lock in a timely manner. The server is not required to maintain 8821 a list of pending blocked locks as the list is used to increase fairness 8822 and not for correct operation. Because of the unordered nature of crash 8823 recovery, storing of lock state to stable storage would be required 8824 to guarantee ordered granting of blocking locks. 8826 Servers may also note the lock types and delay returning denial of 8827 the request to allow extra time for a conflicting lock to be 8828 released, allowing a successful return. In this way, clients can 8829 avoid the burden of needlessly frequent polling for blocking locks. 8830 The server should take care in the length of delay in the event the 8831 client retransmits the request. 8833 If a server receives a blocking lock request, denies it, and then 8834 later receives a nonblocking request for the same lock, which is also 8835 denied, then it should remove the lock in question from its list of 8836 pending blocking locks. Clients should use such a nonblocking 8837 request to indicate to the server that this is the last time they 8838 intend to poll for the lock, as may happen when the process 8839 requesting the lock is interrupted. This is a courtesy to the 8840 server, to prevent it from unnecessarily waiting a lease period 8841 before granting other lock requests. However, clients are not 8842 required to perform this courtesy, and servers must not depend on 8843 them doing so. Also, clients must be prepared for the possibility 8844 that this final locking request will be accepted. 8846 When the server indicates, via the flag OPEN4_RESULT_MAY_NOTIFY_LOCK, 8847 that CB_NOTIFY_LOCK callbacks will be done for the current open file, 8848 the client should take notice of this, but, since this is a hint, 8849 cannot rely on a CB_NOTIFY_LOCK always being done. A client may 8850 reasonably reduce the frequency with which it polls for a denied 8851 lock, since the greater latency that might occur is likely to be 8852 eliminated given a prompt callback, but it still needs to poll. When 8853 it receives a CB_NOTIFY_LOCK it should promptly try to obtain the 8854 lock, but it should be aware that other clients may be polling and the 8855 server is under no obligation to reserve the lock for that particular 8856 client. 8858 9.7. Share Reservations 8860 A share reservation is a mechanism to control access to a file. It 8861 is a separate and independent mechanism from byte-range locking.
8862 When a client opens a file, it sends an OPEN operation to the server 8863 specifying the type of access required (READ, WRITE, or BOTH) and the 8864 type of access to deny others (deny NONE, READ, WRITE, or BOTH). If 8865 the OPEN fails, the client will fail the application's open request. 8867 Pseudo-code definition of the semantics: 8869 if (request.access == 0) { 8870 return (NFS4ERR_INVAL) 8871 } else { 8872 if ((request.access & file_state.deny) || 8873 (request.deny & file_state.access)) { 8874 return (NFS4ERR_SHARE_DENIED) 8875 } 8876 return (NFS4ERR_OK) 8877 } 8878 When doing this checking of share reservations on OPEN, the current 8879 file_state used in the algorithm includes bits that reflect all 8880 current opens, including those for the open-owner making the new OPEN 8881 request. 8883 The constants used for the OPEN and OPEN_DOWNGRADE operations for the 8884 access and deny fields are as follows: 8886 const OPEN4_SHARE_ACCESS_READ = 0x00000001; 8887 const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; 8888 const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; 8890 const OPEN4_SHARE_DENY_NONE = 0x00000000; 8891 const OPEN4_SHARE_DENY_READ = 0x00000001; 8892 const OPEN4_SHARE_DENY_WRITE = 0x00000002; 8893 const OPEN4_SHARE_DENY_BOTH = 0x00000003; 8895 9.8. OPEN/CLOSE Operations 8897 To provide correct share semantics, a client MUST use the OPEN 8898 operation to obtain the initial filehandle and indicate the desired 8899 access and what access, if any, to deny. Even if the client intends 8900 to use a special stateid for anonymous state or read bypass, it must 8901 still obtain the filehandle for the regular file with the OPEN 8902 operation so the appropriate share semantics can be applied. For 8903 clients that do not have a deny mode built into their open 8904 programming interfaces, deny equal to NONE should be used. 8906 The OPEN operation with the CREATE flag also subsumes the CREATE 8907 operation for regular files as used in previous versions of the NFS 8908 protocol. This allows a create with a share to be done atomically. 8910 The CLOSE operation removes all share reservations held by the open- 8911 owner on that file. If byte-range locks are held, the client SHOULD 8912 release all locks before issuing a CLOSE. The server MAY free all 8913 outstanding locks on CLOSE, but some servers may not support the CLOSE 8914 of a file that still has byte-range locks held. The server MUST 8915 return failure, NFS4ERR_LOCKS_HELD, if any locks would exist after 8916 the CLOSE. 8918 The LOOKUP operation will return a filehandle without establishing 8919 any lock state on the server. Without a valid stateid, the server 8920 will assume the client has the least access. For example, a file 8921 opened with deny READ/WRITE using a filehandle obtained through 8922 LOOKUP could only be read using the special read bypass stateid and 8923 could not be written at all because it would not have a valid stateid 8924 and the special anonymous stateid would not be allowed access. 8926 9.9. Open Upgrade and Downgrade 8928 When an OPEN is done for a file and the open-owner for which the open 8929 is being done already has the file open, the result is to upgrade the 8930 open file status maintained on the server to include the access and 8931 deny bits specified by the new OPEN as well as those for the existing 8932 OPEN. The result is that there is one open file, as far as the 8933 protocol is concerned, and it includes the union of the access and 8934 deny bits for all of the OPEN requests completed.
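The sketch below illustrates this union-based tracking of access and deny bits for a single open-owner/file pair, together with the OPEN_DOWNGRADE recomputation and "seqid" increments described in the following paragraphs. The constants are those defined in Section 9.7; the open_file structure and function names are hypothetical, not part of the protocol.

   /* Sketch only: cumulative access/deny state for one open-owner/file
    * pair as OPEN upgrades and OPEN_DOWNGRADEs are processed. */
   #include <stdint.h>
   #include <stdio.h>

   #define OPEN4_SHARE_ACCESS_READ  0x00000001
   #define OPEN4_SHARE_ACCESS_WRITE 0x00000002
   #define OPEN4_SHARE_DENY_NONE    0x00000000
   #define OPEN4_SHARE_DENY_READ    0x00000001

   struct open_file {
       uint32_t access;   /* union of access bits of all OPENs by owner */
       uint32_t deny;     /* union of deny bits of all OPENs by owner   */
       uint32_t seqid;    /* "seqid" of the open stateid                */
   };

   /* An OPEN by an owner that already has the file open upgrades the
    * open: the access and deny bits are ORed in and the stateid seqid
    * is incremented, even if the resulting mode did not change. */
   static void open_upgrade(struct open_file *of, uint32_t access,
                            uint32_t deny)
   {
       of->access |= access;
       of->deny   |= deny;
       of->seqid++;
   }

   /* OPEN_DOWNGRADE replaces the cumulative bits with the (possibly
    * smaller) union of the remaining opens; the seqid is incremented
    * regardless of whether the bits change. */
   static void open_downgrade(struct open_file *of, uint32_t access,
                              uint32_t deny)
   {
       of->access = access;
       of->deny   = deny;
       of->seqid++;
   }

   int main(void)
   {
       struct open_file of = { OPEN4_SHARE_ACCESS_READ,
                               OPEN4_SHARE_DENY_NONE, 1 };
       open_upgrade(&of, OPEN4_SHARE_ACCESS_WRITE, OPEN4_SHARE_DENY_READ);
       printf("access=%u deny=%u seqid=%u\n", (unsigned)of.access,
              (unsigned)of.deny, (unsigned)of.seqid);
       open_downgrade(&of, OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_DENY_NONE);
       printf("access=%u deny=%u seqid=%u\n", (unsigned)of.access,
              (unsigned)of.deny, (unsigned)of.seqid);
       return 0;
   }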
The open is 8935 represented by a single stateid whose "other" value matches that of 8936 the original open, and whose "seqid" value is incremented to reflect 8937 the occurrence of the upgrade. The increment is required in cases in 8938 which the "upgrade" results in no change to the open mode (e.g. an 8939 OPEN is done for read when the existing open file is opened for read- 8940 write). Only a single CLOSE will be done to reset the effects of 8941 both OPENs. The client may use the stateid returned by the OPEN 8942 effecting the upgrade or a stateid sharing the same "other" 8943 field and a seqid of zero, although care needs to be taken with regard 8944 to upgrades which happen while the CLOSE is pending. Note that the 8945 client, when issuing the OPEN, may not know that the same file is in 8946 fact being opened. The above only applies if both OPENs result in 8947 the OPENed object being designated by the same filehandle. 8949 When the server chooses to export multiple filehandles corresponding 8950 to the same file object and returns different filehandles on two 8951 different OPENs of the same file object, the server MUST NOT "OR" 8952 together the access and deny bits and coalesce the two open files. 8953 Instead, the server must maintain separate OPENs with separate 8954 stateids and will require separate CLOSEs to free them. 8956 When multiple open files on the client are merged into a single open 8957 file object on the server, the close of one of the open files (on the 8958 client) may necessitate change of the access and deny status of the 8959 open file on the server. This is because the union of the access and 8960 deny bits for the remaining opens may be smaller (i.e. a proper 8961 subset) than previously. The OPEN_DOWNGRADE operation is used to 8962 make the necessary change and the client should use it to update the 8963 server so that share reservation requests by other clients are 8964 handled properly. The stateid returned has the same "other" field as 8965 that passed to the server. The "seqid" value in the returned stateid 8966 MUST be incremented, even in situations in which there is no change to 8967 the access and deny bits for the file. 8969 9.10. Parallel OPENs 8971 Unlike the case of NFSv4.0, in which OPEN operations for the same 8972 open-owner are inherently serialized because of the owner-based 8973 seqid, multiple OPENs for the same open-owner may be done in 8974 parallel. When clients do this, they may encounter situations in 8975 which, because of the existence of hard links, two OPEN operations 8976 may turn out to open the same file, with a later OPEN performed being 8977 an upgrade of the first, with this fact only visible to the client 8978 once the operations complete. 8980 In this situation, clients may determine the order in which the OPENs 8981 were performed by examining the stateids returned by the OPENs. 8982 Stateids that share a common value of the "other" field can be 8983 recognized as having opened the same file, with the order of the 8984 operations determinable from the order of the "seqid" fields, mod any 8985 possible wraparound of the 32-bit field. 8987 When the possibility exists that the client will send multiple OPENs 8988 for the same open-owner in parallel, it may be the case that an open 8989 upgrade may happen without the client knowing beforehand that this 8990 could happen.
Because of this possibility, CLOSEs and 8991 OPEN_DOWNGRADEs should generally be sent with a non-zero seqid in 8992 the stateid, to avoid the possibility that the status change 8993 associated with an open upgrade is inadvertently lost. 8995 9.11. Reclaim of Open and Byte-Range Locks 8997 Special forms of the LOCK and OPEN operations are provided when it is 8998 necessary to re-establish byte-range locks or opens after a server 8999 failure. 9001 o To reclaim existing opens, an OPEN operation is performed using a 9002 claim type of CLAIM_PREVIOUS. Because the client, in this type of 9003 situation, will have already opened the file and have the filehandle 9004 of the target file, this operation requires that the current 9005 filehandle be the target file rather than a directory, and no file 9006 name is specified. 9008 o To reclaim byte-range locks, a LOCK operation with the reclaim 9009 parameter set to true is used. 9011 Reclaims of opens associated with delegations are discussed in 9012 Section 10.2.1. 9014 10. Client-Side Caching 9016 Client-side caching of data, of file attributes, and of file names is 9017 essential to providing good performance with the NFS protocol. 9018 Providing distributed cache coherence is a difficult problem and 9019 previous versions of the NFS protocol have not attempted it. 9020 Instead, several NFS client implementation techniques have been used 9021 to reduce the problems that a lack of coherence poses for users. 9022 These techniques have not been clearly defined by earlier protocol 9023 specifications and it is often unclear what is valid or invalid 9024 client behavior. 9026 The NFSv4.1 protocol uses many techniques similar to those that have 9027 been used in previous protocol versions. The NFSv4.1 protocol does 9028 not provide distributed cache coherence. However, it defines a more 9029 limited set of caching guarantees to allow locks and share 9030 reservations to be used without destructive interference from client 9031 side caching. 9033 In addition, the NFSv4.1 protocol introduces a delegation mechanism 9034 which allows many decisions normally made by the server to be made 9035 locally by clients. This mechanism provides efficient support of the 9036 common cases where sharing is infrequent or where sharing is read- 9037 only. 9039 10.1. Performance Challenges for Client-Side Caching 9041 Caching techniques used in previous versions of the NFS protocol have 9042 been successful in providing good performance. However, several 9043 scalability challenges can arise when those techniques are used with 9044 very large numbers of clients. This is particularly true when 9045 clients are geographically distributed, which classically increases 9046 the latency for cache revalidation requests. 9048 The previous versions of the NFS protocol repeat their file data 9049 cache validation requests at the time the file is opened. This 9050 behavior can have serious performance drawbacks. A common case is 9051 one in which a file is only accessed by a single client. Therefore, 9052 sharing is infrequent. 9054 In this case, repeated reference to the server to find that no 9055 conflicts exist is expensive. A better option with regard to 9056 performance is to allow a client that repeatedly opens a file to do 9057 so without reference to the server. This is done until potentially 9058 conflicting operations from another client actually occur. 9060 A similar situation arises in connection with file locking.
Sending 9061 file lock and unlock requests to the server as well as the read and 9062 write requests necessary to make data caching consistent with the 9063 locking semantics (see Section 10.3.2) can severely limit 9064 performance. When locking is used to provide protection against 9065 infrequent conflicts, a large penalty is incurred. This penalty may 9066 discourage the use of file locking by applications. 9068 The NFSv4.1 protocol provides more aggressive caching strategies with 9069 the following design goals: 9071 o Compatibility with a large range of server semantics. 9073 o Providing the same caching benefits as previous versions of the 9074 NFS protocol when unable to support the more aggressive model. 9076 o Requirements for aggressive caching are organized so that a large 9077 portion of the benefit can be obtained even when not all of the 9078 requirements can be met. 9080 The appropriate requirements for the server are discussed in later 9081 sections in which specific forms of caching are covered (see 9082 Section 10.4). 9084 10.2. Delegation and Callbacks 9086 Recallable delegation of server responsibilities for a file to a 9087 client improves performance by avoiding repeated requests to the 9088 server in the absence of inter-client conflict. With the use of a 9089 "callback" RPC from server to client, a server recalls delegated 9090 responsibilities when another client engages in sharing of a 9091 delegated file. 9093 A delegation is passed from the server to the client, specifying the 9094 object of the delegation and the type of delegation. There are 9095 different types of delegations but each type contains a stateid to be 9096 used to represent the delegation when performing operations that 9097 depend on the delegation. This stateid is similar to those 9098 associated with locks and share reservations but differs in that the 9099 stateid for a delegation is associated with a client ID and may be 9100 used on behalf of all the open-owners for the given client. A 9101 delegation is made to the client as a whole and not to any specific 9102 process or thread of control within it. 9104 The backchannel is established by CREATE_SESSION and 9105 BIND_CONN_TO_SESSION, and the client is required to maintain it. 9107 Because the backchannel may be down, even temporarily, correct 9108 protocol operation does not depend on them. Preliminary testing of 9109 backchannel functionality by means of a CB_COMPOUND procedure with a 9110 single operation, CB_SEQUENCE, can be used to check the continuity of 9111 the backchannel. A server avoids delegating responsibilities until 9112 it has determined that the backchannel exists. Because the granting 9113 of a delegation is always conditional upon the absence of conflicting 9114 access, clients MUST NOT assume that a delegation will be granted and 9115 they MUST always be prepared for OPENs, WANT_DELEGATIONs, and 9116 GET_DIR_DELEGATIONs to be processed without any delegations being 9117 granted. 9119 Once granted, a delegation behaves in many ways like a lock. There 9120 is an associated lease that is subject to renewal together with all 9121 of the other leases held by that client. 9123 Unlike locks, an operation by a second client to a delegated file 9124 will cause the server to recall a delegation through a callback. For 9125 individual operations, we will describe, under IMPLEMENTATION, when 9126 such operations are required to effect a recall. A number of points 9127 should be noted, however. 
9129 o The server is free to recall a delegation whenever it feels it is 9130 desirable and may do so even if no operations requiring recall are 9131 being done. 9133 o Operations done outside the NFSv4 protocol, due to, for example, 9134 access by other protocols, or by local access, also need to result 9135 in delegation recall when they make analogous changes to file 9136 system data. What is crucial is whether the change would invalidate 9137 the guarantees provided by the delegation. When this is possible, 9138 the delegation needs to be recalled and MUST be returned or 9139 revoked before allowing the operation to proceed. 9141 o The semantics of the file system are crucial in defining when 9142 delegation recall is required. If a particular change within a 9143 specific implementation causes a change to a file attribute, then 9144 delegation recall is required, whether or not that operation has been 9145 specifically listed as requiring delegation recall. Again, what 9146 is critical is whether the guarantees provided by the delegation 9147 are being invalidated. 9149 Despite those caveats, the implementation sections for a number of 9150 operations describe situations in which delegation recall would be 9151 required under some common circumstances: 9153 o For GETATTR, see Section 18.7.4. 9155 o For OPEN, see Section 18.16.4. 9157 o For READ, see Section 18.22.4. 9159 o For REMOVE, see Section 18.25.4. 9161 o For RENAME, see Section 18.26.4. 9163 o For SETATTR, see Section 18.30.4. 9165 o For WRITE, see Section 18.32.4. 9167 On recall, the client holding the delegation needs to flush modified 9168 state (such as modified data) to the server and return the 9169 delegation. The conflicting request will not be acted on until the 9170 recall is complete. The recall is considered complete when the 9171 client returns the delegation or the server times out its wait for the 9172 delegation to be returned and revokes the delegation as a result of 9173 the timeout. In the interim, the server will either delay responding 9174 to conflicting requests or respond to them with NFS4ERR_DELAY. 9175 Following the resolution of the recall, the server has the 9176 information necessary to grant or deny the second client's request. 9178 At the time the client receives a delegation recall, it may have 9179 substantial state that needs to be flushed to the server. Therefore, 9180 the server should allow sufficient time for the delegation to be 9181 returned since it may involve numerous RPCs to the server. If the 9182 server is able to determine that the client is diligently flushing 9183 state to the server as a result of the recall, the server may extend 9184 the usual time allowed for a recall. However, the time allowed for 9185 recall completion should not be unbounded. 9187 An example of this is when responsibility to mediate opens on a given 9188 file is delegated to a client (see Section 10.4). The server will 9189 not know what opens are in effect on the client. Without this 9190 knowledge the server will be unable to determine if the access and 9191 deny state for the file allows any particular open until the 9192 delegation for the file has been returned. 9194 A client failure or a network partition can result in failure to 9195 respond to a recall callback. In this case, the server will revoke 9196 the delegation which in turn will render useless any modified state 9197 still on the client. 9199 10.2.1.
Delegation Recovery 9201 There are three situations that delegation recovery needs to deal 9202 with: 9204 o Client restart 9206 o Server restart 9208 o Network partition (full or backchannel-only) 9210 In the event the client restarts, the failure to renew the lease will 9211 result in the revocation of byte-range locks and share reservations. 9212 Delegations, however, may be treated a bit differently. 9214 There will be situations in which delegations will need to be 9215 reestablished after a client restarts. The reason for this is the 9216 client may have file data stored locally and this data was associated 9217 with the previously held delegations. The client will need to 9218 reestablish the appropriate file state on the server. 9220 To allow for this type of client recovery, the server MAY extend the 9221 period for delegation recovery beyond the typical lease expiration 9222 period. This implies that requests from other clients that conflict 9223 with these delegations will need to wait. Because the normal recall 9224 process may require significant time for the client to flush changed 9225 state to the server, other clients need be prepared for delays that 9226 occur because of a conflicting delegation. This longer interval 9227 would increase the window for clients to restart and consult stable 9228 storage so that the delegations can be reclaimed. For open 9229 delegations, such delegations are reclaimed using OPEN with a claim 9230 type of CLAIM_DELEGATE_PREV or CLAIM_DELEG_PREV_FH (See Section 10.5 9231 and Section 18.16 for discussion of open delegation and the details 9232 of OPEN respectively). 9234 A server MAY support claim types of CLAIM_DELEGATE_PREV and 9235 CLAIM_DELEG_PREV_FH, and if it does, it MUST NOT remove delegations 9236 upon a CREATE_SESSION that confirms a client ID created by 9237 EXCHANGE_ID, and instead MUST, for a period of time no less than that 9238 of the value of the lease_time attribute, maintain the client's 9239 delegations to allow time for the client to send CLAIM_DELEGATE_PREV 9240 requests. The server that supports CLAIM_DELEGATE_PREV and/or 9241 CLAIM_DELEG_PREV_FH MUST support the DELEGPURGE operation. 9243 When the server restarts, delegations are reclaimed (using the OPEN 9244 operation with CLAIM_PREVIOUS) in a similar fashion to byte-range 9245 locks and share reservations. However, there is a slight semantic 9246 difference. In the normal case if the server decides that a 9247 delegation should not be granted, it performs the requested action 9248 (e.g. OPEN) without granting any delegation. For reclaim, the 9249 server grants the delegation but a special designation is applied so 9250 that the client treats the delegation as having been granted but 9251 recalled by the server. Because of this, the client has the duty to 9252 write all modified state to the server and then return the 9253 delegation. This process of handling delegation reclaim reconciles 9254 three principles of the NFSv4.1 protocol: 9256 o Upon reclaim, a client reporting resources assigned to it by an 9257 earlier server instance must be granted those resources. 9259 o The server has unquestionable authority to determine whether 9260 delegations are to be granted and, once granted, whether they are 9261 to be continued. 9263 o The use of callbacks is not to be depended upon until the client 9264 has proven its ability to receive them. 
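The following client-side sketch illustrates the reclaim behavior just described: a delegation reclaimed with CLAIM_PREVIOUS is treated as having been granted and immediately recalled, so the client writes back any modified state and then returns the delegation. The helper functions are hypothetical stand-ins for the corresponding OPEN, WRITE/COMMIT, and DELEGRETURN requests; this is an illustration, not a mandated implementation.

   /* Sketch only: reclaiming a delegation with CLAIM_PREVIOUS after a
    * server restart and honoring its "granted and recalled" status. */
   #include <stdbool.h>
   #include <stdio.h>

   struct delegation {
       int  type;          /* read or write delegation */
       bool dirty_data;    /* locally modified data exists */
   };

   /* Hypothetical wrappers around protocol requests. */
   static bool open_claim_previous(struct delegation *d)
   {
       (void)d;
       return true;        /* stub: reclaim accepted within the grace period */
   }

   static void flush_modified_state(struct delegation *d)
   {
       d->dirty_data = false;   /* stand-in for WRITE/COMMIT traffic */
   }

   static void delegreturn(struct delegation *d)
   {
       (void)d;
       printf("DELEGRETURN sent\n");
   }

   static void reclaim_delegation(struct delegation *d)
   {
       if (!open_claim_previous(d))
           return;   /* reclaim refused (e.g. NFS4ERR_NO_GRACE) */

       /* The server granted the reclaim with the "granted and recalled"
        * designation: write back modified state, then return it. */
       if (d->dirty_data)
           flush_modified_state(d);
       delegreturn(d);
   }

   int main(void)
   {
       struct delegation d = { .type = 1, .dirty_data = true };
       reclaim_delegation(&d);
       return 0;
   }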
9266 When a client needs to reclaim a delegation and there is no 9267 associated open, the client may use the CLAIM_PREVIOUS variant of the 9268 WANT_DELEGATION operation. However, since the server is not required 9269 to support this operation, an alternative is to reclaim via a dummy 9270 open together with the delegation using an OPEN of type 9271 CLAIM_PREVIOUS. The dummy open file can be released using a CLOSE to 9272 re-establish the original state to be reclaimed, a delegation without 9273 an associated open. 9275 When a client has more than a single open associated with a 9276 delegation, state for those additional opens can be established using 9277 OPEN operations of type CLAIM_DELEGATE_CUR. When these are used to 9278 establish opens associated with reclaimed delegations, the server 9279 MUST allow them when made within the grace period. 9281 When a network partition occurs, delegations are subject to freeing 9282 by the server when the lease renewal period expires. This is similar 9283 to the behavior for locks and share reservations. For delegations, 9284 however, the server may extend the period in which conflicting 9285 requests are held off. Eventually the occurrence of a conflicting 9286 request from another client will cause revocation of the delegation. 9287 A loss of the backchannel (e.g. by later network configuration 9288 change) will have the same effect. A recall request will fail and 9289 revocation of the delegation will result. 9291 A client normally finds out about revocation of a delegation when it 9292 uses a stateid associated with a delegation and receives one of the 9293 errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or 9294 NFS4ERR_DELEG_REVOKED. It also may find out about delegation 9295 revocation after a client restart when it attempts to reclaim a 9296 delegation and receives that same error. Note that in the case of a 9297 revoked write open delegation, there are issues because data may have 9298 been modified by the client whose delegation is revoked and 9299 separately by other clients. See Section 10.5.1 for a discussion of 9300 such issues. Note also that when delegations are revoked, 9301 information about the revoked delegation will be written by the 9302 server to stable storage (as described in Section 8.4.3). This is 9303 done to deal with the case in which a server restarts after revoking 9304 a delegation but before the client holding the revoked delegation is 9305 notified about the revocation. 9307 10.3. Data Caching 9309 When applications share access to a set of files, they need to be 9310 implemented so as to take account of the possibility of conflicting 9311 access by another application. This is true whether the applications 9312 in question execute on different clients or reside on the same 9313 client. 9315 Share reservations and byte-range locks are the facilities the 9316 NFSv4.1 protocol provides to allow applications to coordinate access 9317 by using mutual exclusion facilities. The NFSv4.1 protocol's data 9318 caching must be implemented such that it does not invalidate the 9319 assumptions that those using these facilities depend upon. 9321 10.3.1. Data Caching and OPENs 9323 In order to avoid invalidating the sharing assumptions that 9324 applications rely on, NFSv4.1 clients should not provide cached data 9325 to applications or modify it on behalf of an application when it 9326 would not be valid to obtain or modify that same data via a READ or 9327 WRITE operation. 
9329 Furthermore, in the absence of open delegation (see Section 10.4), 9330 two additional rules apply. Note that these rules are obeyed in 9331 practice by many NFSv3 clients.

9333 o First, cached data present on a client must be revalidated after 9334 doing an OPEN. Revalidating means that the client fetches the 9335 change attribute from the server, compares it with the cached 9336 change attribute, and if different, declares the cached data (as 9337 well as the cached attributes) as invalid. This is to ensure that 9338 the data for the OPENed file is still correctly reflected in the 9339 client's cache. This validation must be done at least when the 9340 client's OPEN operation includes DENY=WRITE or DENY=BOTH, thus 9341 terminating a period in which other clients may have had the 9342 opportunity to open the file with WRITE access. Clients may 9343 choose to do the revalidation more often (i.e. at OPENs specifying 9344 DENY=NONE) to parallel the NFSv3 protocol's practice for the 9345 benefit of users assuming this degree of cache revalidation.

9347 Since the change attribute is updated for data and metadata 9348 modifications, some client implementors may be tempted to use the 9349 time_modify attribute and not the change attribute to validate 9350 cached data, so that metadata changes do not spuriously invalidate 9351 clean data. The implementor is cautioned against this approach. The 9352 change attribute is guaranteed to change for each update to the 9353 file, whereas time_modify is guaranteed to change only at the 9354 granularity of the time_delta attribute. If the client's data 9355 cache validation logic uses time_modify and not change, it runs the risk 9356 of incorrectly marking stale data as valid. Thus any 9357 cache validation approach by the client MUST include the use of 9358 the change attribute.

9360 o Second, modified data must be flushed to the server before closing 9361 a file OPENed for write. This is complementary to the first rule. 9362 If the data is not flushed at CLOSE, the revalidation done after 9363 the client OPENs a file is unable to achieve its purpose. The other 9364 aspect to flushing the data before close is that the data must be 9365 committed to stable storage, at the server, before the CLOSE 9366 operation is requested by the client. In the case of a server 9367 restart and a CLOSEd file, it may not be possible to retransmit 9368 the data to be written to the file. Hence, this requirement.

9370 10.3.2. Data Caching and File Locking

9372 For those applications that choose to use file locking instead of 9373 share reservations to exclude inconsistent file access, there is an 9374 analogous set of constraints that apply to client side data caching. 9375 These rules are effective only if the file locking is used in a way 9376 that matches, in an equivalent way, the actual READ and WRITE 9377 operations executed. This is as opposed to file locking that is 9378 based on pure convention. For example, it is possible to manipulate 9379 a two-megabyte file by dividing the file into two one-megabyte 9380 regions and protecting access to the two regions by file locks on 9381 bytes zero and one. A lock for write on byte zero of the file would 9382 represent the right to do READ and WRITE operations on the first 9383 region. A lock for write on byte one of the file would represent the 9384 right to do READ and WRITE operations on the second region. As long 9385 as all applications manipulating the file obey this convention, they 9386 will work on a local file system.
However, they may not work with 9387 the NFSv4.1 protocol unless clients refrain from data caching.

9389 The rules for data caching in the file locking environment are:

9391 o First, when a client obtains a file lock for a particular region, 9392 the data cache corresponding to that region (if any cache data 9393 exists) must be revalidated. If the change attribute indicates 9394 that the file may have been updated since the cached data was 9395 obtained, the client must flush or invalidate the cached data for 9396 the newly locked region. A client might choose to invalidate all 9397 of the non-modified cached data that it has for the file, but the only 9398 requirement for correct operation is to invalidate all of the data 9399 in the newly locked region.

9401 o Second, before releasing a write lock for a region, all modified 9402 data for that region must be flushed to the server. The modified 9403 data must also be written to stable storage.

9405 Note that flushing data to the server and the invalidation of cached 9406 data must reflect the actual byte ranges locked or unlocked. 9407 Rounding these up or down to reflect client cache block boundaries 9408 will cause problems if not carefully done. For example, writing a 9409 modified block when only half of that block is within an area being 9410 unlocked may cause invalid modification to the region outside the 9411 unlocked area. This, in turn, may be part of a region locked by 9412 another client. Clients can avoid this situation by synchronously 9413 performing portions of write operations that overlap that portion 9414 (initial or final) that is not a full block. Similarly, invalidating 9415 a locked area which is not an integral number of full buffer blocks 9416 would require the client to read one or two partial blocks from the 9417 server if the revalidation procedure shows that the data which the 9418 client possesses may not be valid.

9420 The data that is written to the server as a prerequisite to the 9421 unlocking of a region must be written, at the server, to stable 9422 storage. The client may accomplish this either with synchronous 9423 writes or by following asynchronous writes with a COMMIT operation. 9424 This is required because retransmission of the modified data after a 9425 server restart might conflict with a lock held by another client.

9427 A client implementation may choose to accommodate applications which 9428 use byte-range locking in non-standard ways (e.g. using a byte-range 9429 lock as a global semaphore) by flushing to the server more data upon 9430 a LOCKU than is covered by the locked range. This may include 9431 modified data within files other than the one for which the unlocks 9432 are being done. In such cases, the client must not interfere with 9433 applications whose READs and WRITEs are being done only within the 9434 bounds of byte-range locks which the application holds. For example, 9435 an application locks a single byte of a file and proceeds to write 9436 that single byte. A client that chose to handle a LOCKU by flushing 9437 all modified data to the server could validly write that single byte 9438 in response to an unrelated unlock. However, it would not be valid 9439 to write the entire block in which that single written byte was 9440 located since it includes an area that is not locked and might be 9441 locked by another client.
Client implementations can avoid this 9442 problem by dividing files with modified data into those for which all 9443 modifications are done to areas covered by an appropriate byte-range 9444 lock and those for which there are modifications not covered by a 9445 byte-range lock. Any writes done for the former class of files must 9446 not include areas not locked and thus not modified on the client. 9448 10.3.3. Data Caching and Mandatory File Locking 9450 Client side data caching needs to respect mandatory file locking when 9451 it is in effect. The presence of mandatory file locking for a given 9452 file is indicated when the client gets back NFS4ERR_LOCKED from a 9453 READ or WRITE on a file it has an appropriate share reservation for. 9454 When mandatory locking is in effect for a file, the client must check 9455 for an appropriate file lock for data being read or written. If a 9456 lock exists for the range being read or written, the client may 9457 satisfy the request using the client's validated cache. If an 9458 appropriate file lock is not held for the range of the read or write, 9459 the read or write request must not be satisfied by the client's cache 9460 and the request must be sent to the server for processing. When a 9461 read or write request partially overlaps a locked region, the request 9462 should be subdivided into multiple pieces with each region (locked or 9463 not) treated appropriately. 9465 10.3.4. Data Caching and File Identity 9467 When clients cache data, the file data needs to be organized 9468 according to the file system object to which the data belongs. For 9469 NFSv3 clients, the typical practice has been to assume for the 9470 purpose of caching that distinct filehandles represent distinct file 9471 system objects. The client then has the choice to organize and 9472 maintain the data cache on this basis. 9474 In the NFSv4.1 protocol, there is now the possibility to have 9475 significant deviations from a "one filehandle per object" model 9476 because a filehandle may be constructed on the basis of the object's 9477 pathname. Therefore, clients need a reliable method to determine if 9478 two filehandles designate the same file system object. If clients 9479 were simply to assume that all distinct filehandles denote distinct 9480 objects and proceed to do data caching on this basis, caching 9481 inconsistencies would arise between the distinct client side objects 9482 which mapped to the same server side object. 9484 By providing a method to differentiate filehandles, the NFSv4.1 9485 protocol alleviates a potential functional regression in comparison 9486 with the NFSv3 protocol. Without this method, caching 9487 inconsistencies within the same client could occur and this has not 9488 been present in previous versions of the NFS protocol. Note that it 9489 is possible to have such inconsistencies with applications executing 9490 on multiple clients but that is not the issue being addressed here. 9492 For the purposes of data caching, the following steps allow an 9493 NFSv4.1 client to determine whether two distinct filehandles denote 9494 the same server side object: 9496 o If GETATTR directed to two filehandles returns different values of 9497 the fsid attribute, then the filehandles represent distinct 9498 objects. 9500 o If GETATTR for any file with an fsid that matches the fsid of the 9501 two filehandles in question returns a unique_handles attribute 9502 with a value of TRUE, then the two objects are distinct. 
9504 o If GETATTR directed to the two filehandles does not return the 9505 fileid attribute for both of the handles, then it cannot be 9506 determined whether the two objects are the same. Therefore, 9507 operations which depend on that knowledge (e.g. client side data 9508 caching) cannot be done reliably. Note that if GETATTR does not 9509 return the fileid attribute for both filehandles, it will return 9510 it for neither of the filehandles, since the fsid for both 9511 filehandles is the same.

9513 o If GETATTR directed to the two filehandles returns different 9514 values for the fileid attribute, then they are distinct objects.

9516 o Otherwise they are the same object.

9518 10.4. Open Delegation

9520 When a file is being OPENed, the server may delegate further handling 9521 of opens and closes for that file to the opening client. Any such 9522 delegation is recallable, since the circumstances that allowed for 9523 the delegation are subject to change. In particular, the server may 9524 receive a conflicting OPEN from another client; in that case, the server must 9525 recall the delegation before deciding whether the OPEN from the other 9526 client may be granted. Making a delegation is up to the server, and 9527 clients should not assume that any particular OPEN either will or 9528 will not result in an open delegation. The following is a typical 9529 set of conditions that servers might use in deciding whether OPEN 9530 should be delegated:

9532 o The client must be able to respond to the server's callback 9533 requests. If a backchannel has been established, the server will 9534 send a CB_COMPOUND request, containing a single operation, 9535 CB_SEQUENCE, for a test of backchannel availability.

9537 o The client must have responded properly to previous recalls.

9539 o There must be no current open conflicting with the requested 9540 delegation.

9542 o There should be no current delegation that conflicts with the 9543 delegation being requested.

9545 o The probability of future conflicting open requests should be low 9546 based on the recent history of the file.

9548 o The existence of any server-specific semantics of OPEN/CLOSE that 9549 would make the required handling incompatible with the prescribed 9550 handling that the delegated client would apply (see below).

9552 There are two types of open delegations, read and write. A read open 9553 delegation allows a client to handle, on its own, requests to open a 9554 file for reading that do not deny read access to others. Multiple 9555 read open delegations may be outstanding simultaneously and do not 9556 conflict. A write open delegation allows the client to handle, on 9557 its own, all opens. Only one write open delegation may exist for a 9558 given file at a given time, and it is inconsistent with any read open 9559 delegations.

9561 When a client has a read open delegation, it is assured that neither 9562 the contents, the attributes (with the exception of time_access), nor 9563 the names of any links to the file will change without its knowledge, 9564 so long as the delegation is held. When a client has a write open 9565 delegation, it may modify the file data locally since no other client 9566 will be accessing the file's data. The client holding a write 9567 delegation may locally affect only those file attributes which are 9568 intimately connected with the file data: size, change, time_access, 9569 time_metadata, and time_modify. All other attributes must be 9570 reflected on the server.
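The attribute rule in the preceding paragraph amounts to a simple check, sketched below in non-normative C; the enumeration is illustrative only and is not the protocol's attribute numbering.

   /* Attributes that the holder of a write open delegation may update
    * locally; all others must be reflected on the server. */
   enum file_attr {
       ATTR_SIZE, ATTR_CHANGE, ATTR_TIME_ACCESS,
       ATTR_TIME_METADATA, ATTR_TIME_MODIFY,
       ATTR_OWNER, ATTR_MODE, ATTR_ACL   /* ... and so on ... */
   };

   static int write_deleg_may_update_locally(enum file_attr attr)
   {
       switch (attr) {
       case ATTR_SIZE:
       case ATTR_CHANGE:
       case ATTR_TIME_ACCESS:
       case ATTR_TIME_METADATA:
       case ATTR_TIME_MODIFY:
           return 1;   /* intimately connected with the file data */
       default:
           return 0;   /* must be sent to the server */
       }
   }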
9572 When a client has an open delegation, it does not need to send OPENs 9573 or CLOSEs to the server. Instead the client may update the 9574 appropriate status internally. For a read open delegation, opens 9575 that cannot be handled locally (opens for write or that deny read 9576 access) must be sent to the server. 9578 When an open delegation is made, the reply to the OPEN contains an 9579 open delegation structure which specifies the following: 9581 o the type of delegation (read or write). 9583 o space limitation information to control flushing of data on close 9584 (write open delegation only, see Section 10.4.1). 9586 o an nfsace4 specifying read and write permissions. 9588 o a stateid to represent the delegation for READ and WRITE. 9590 The delegation stateid is separate and distinct from the stateid for 9591 the OPEN proper. The standard stateid, unlike the delegation 9592 stateid, is associated with a particular lock-owner and will continue 9593 to be valid after the delegation is recalled and the file remains 9594 open. 9596 When a request internal to the client is made to open a file and an 9597 open delegation is in effect, it will be accepted or rejected solely 9598 on the basis of the following conditions. Any requirement for other 9599 checks to be made by the delegate should result in open delegation 9600 being denied so that the checks can be made by the server itself. 9602 o The access and deny bits for the request and the file as described 9603 in Section 9.7. 9605 o The read and write permissions as determined below. 9607 The nfsace4 passed with delegation can be used to avoid frequent 9608 ACCESS calls. The permission check should be as follows: 9610 o If the nfsace4 indicates that the open may be done, then it should 9611 be granted without reference to the server. 9613 o If the nfsace4 indicates that the open may not be done, then an 9614 ACCESS request must be sent to the server to obtain the definitive 9615 answer. 9617 The server may return an nfsace4 that is more restrictive than the 9618 actual ACL of the file. This includes an nfsace4 that specifies 9619 denial of all access. Note that some common practices such as 9620 mapping the traditional user "root" to the user "nobody" (see 9621 Section 5.9) may make it incorrect to return the actual ACL of the 9622 file in the delegation response. 9624 The use of a delegation together with various other forms of caching 9625 creates the possibility that no server authentication and 9626 authorization will ever be performed for a given user since all of 9627 the user's requests might be satisfied locally. Where the client is 9628 depending on the server for authentication and authorization, the 9629 client should be sure authentication and authorization occurs for 9630 each user by use of the ACCESS operation. This should be the case 9631 even if an ACCESS operation would not be required otherwise. As 9632 mentioned before, the server may enforce frequent authentication by 9633 returning an nfsace4 denying all access with every open delegation. 9635 10.4.1. Open Delegation and Data Caching 9637 An OPEN delegation allows much of the message overhead associated 9638 with the opening and closing files to be eliminated. An open when an 9639 open delegation is in effect does not require that a validation 9640 message be sent to the server. The continued endurance of the "read 9641 open delegation" provides a guarantee that no OPEN for write and thus 9642 no write has occurred. 
Similarly, when closing a file opened for 9643 write and if a write open delegation is in effect, the data written 9644 does not have to be written to the server until the open delegation 9645 is recalled. The continued endurance of the open delegation provides 9646 a guarantee that no open and thus no read or write has been done by 9647 another client.

9649 For the purposes of open delegation, READs and WRITEs done without an 9650 OPEN are treated as the functional equivalents of a corresponding 9651 type of OPEN. Although a client SHOULD NOT use special stateids when 9652 an open exists, delegation handling on the server can use the client 9653 ID associated with the current session to determine if the operation 9654 has been done by the holder of the delegation, in which case no 9655 recall is necessary, or by another client, in which case the 9656 delegation must be recalled and the I/O must not proceed until the delegation 9657 is returned or revoked.

9659 With delegations, a client is able to avoid writing data to the 9660 server when the CLOSE of a file is serviced. The file close system 9661 call is the usual point at which the client is notified of a lack of 9662 stable storage for the modified file data generated by the 9663 application. At the close, file data is written to the server, and 9664 through normal accounting the server is able to determine if the 9665 available file system space for the data has been exceeded (i.e. the 9666 server returns NFS4ERR_NOSPC or NFS4ERR_DQUOT). This accounting 9667 includes quotas. The introduction of delegations requires that an 9668 alternative method be in place for the same type of communication to 9669 occur between client and server.

9671 In the delegation response, the server provides either the limit of 9672 the size of the file or the number of modified blocks and associated 9673 block size. The server must ensure that the client will be able to 9674 write modified data to the server of a size equal to that provided in 9675 the original delegation. The server must make this assurance for all 9676 outstanding delegations. Therefore, the server must be careful in 9677 its management of available space for new or modified data, taking 9678 into account available file system space and any applicable quotas. 9679 The server can recall delegations as a result of managing the 9680 available file system space. The client should abide by the server's 9681 stated space limits for delegations. If the client exceeds the stated 9682 limits for the delegation, the server's behavior is undefined.

9684 Based on server conditions, quotas, or available file system space, 9685 the server may grant write open delegations with very restrictive 9686 space limitations. The limitations may be defined in a way that will 9687 always force modified data to be flushed to the server on close.

9689 With respect to authentication, flushing modified data to the server 9690 after a CLOSE has occurred may be problematic. For example, the user 9691 of the application may have logged off the client, and unexpired 9692 authentication credentials may not be present. In this case, the 9693 client may need to take special care to ensure that local unexpired 9694 credentials will in fact be available. This may be accomplished by 9695 tracking the expiration time of credentials and flushing data well in 9696 advance of their expiration or by making private copies of 9697 credentials to assure their availability when needed.

9699 10.4.2.
Open Delegation and File Locks

9701 When a client holds a write open delegation, lock operations are 9702 performed locally. This includes those required for mandatory file 9703 locking. This can be done since the delegation implies that there 9704 can be no conflicting locks. Similarly, all of the revalidations 9705 that would normally be associated with obtaining locks and the 9706 flushing of data associated with the releasing of locks need not be 9707 done.

9709 When a client holds a read open delegation, lock operations are not 9710 performed locally. All lock operations, including those requesting 9711 non-exclusive locks, are sent to the server for resolution.

9713 10.4.3. Handling of CB_GETATTR

9715 The server needs to employ special handling for a GETATTR where the 9716 target is a file that has a write open delegation in effect. The 9717 reason for this is that the client holding the write delegation may 9718 have modified the data, and the server needs to reflect this change to 9719 the second client that submitted the GETATTR. Therefore, the client 9720 holding the write delegation needs to be interrogated. The server 9721 will use the CB_GETATTR operation. The only attributes that the 9722 server can reliably query via CB_GETATTR are size and change.

9724 Since CB_GETATTR is being used to satisfy another client's GETATTR 9725 request, the server only needs to know if the client holding the 9726 delegation has a modified version of the file. If the client's copy 9727 of the delegated file is not modified (data or size), the server can 9728 satisfy the second client's GETATTR request from the attributes 9729 stored locally at the server. If the file is modified, the server 9730 only needs to know about this modified state. If the server 9731 determines that the file is currently modified, it will respond to 9732 the second client's GETATTR as if the file had been modified locally 9733 at the server.

9735 Since the form of the change attribute is determined by the server 9736 and is opaque to the client, the client and server need to agree on a 9737 method of communicating the modified state of the file. For the size 9738 attribute, the client will report its current view of the file size. 9739 For the change attribute, the handling is more involved.

9741 For the client, the following steps will be taken when receiving a 9742 write delegation:

9744 o The value of the change attribute will be obtained from the server 9745 and cached. Let this value be represented by c.

9747 o The client will create a value greater than c that will be used 9748 for communicating that modified data is held at the client. Let this 9749 value be represented by d.

9751 o When the client is queried via CB_GETATTR for the change 9752 attribute, it checks to see if it holds modified data. If the 9753 file is modified, the value d is returned for the change attribute 9754 value. If the file is not currently modified, the client returns 9755 the value c for the change attribute.

9757 For simplicity of implementation, the client MAY for each CB_GETATTR 9758 return the same value d. This is true even if, between successive 9759 CB_GETATTR operations, the client again modifies the file's data 9760 or metadata in its cache. The client can return the same value 9761 because the only requirement is that the client be able to indicate 9762 to the server that the client holds modified data. Therefore, the 9763 value of d may always be c + 1.
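The client-side steps above can be summarized by the following non-normative sketch: c is the change attribute value cached when the write delegation was received, and d (here simply c + 1) is returned whenever the client holds modified data. The structure and function names are assumptions of the sketch.

   #include <stdbool.h>
   #include <stdint.h>

   struct write_delegation {
       uint64_t c;          /* change value obtained from the server at grant */
       bool     modified;   /* client cache holds modified data for the file  */
   };

   /* Value reported for the change attribute in a CB_GETATTR reply. */
   static uint64_t cb_getattr_change(const struct write_delegation *d)
   {
       /* The same d may be returned on every CB_GETATTR; the only
        * requirement is that it tell the server modified data is held. */
       return d->modified ? d->c + 1 : d->c;
   }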
9765 While the change attribute is opaque to the client in the sense that 9766 it has no idea what units of time, if any, the server is counting 9767 change with, it is not opaque in that the client has to treat it as 9768 an unsigned integer, and the server has to be able to see the results 9769 of the client's changes to that integer. Therefore, the server MUST 9770 encode the change attribute in network order when sending it to the 9771 client. The client MUST decode it from network order to its native 9772 order when receiving it, and the client MUST encode it in network order 9773 when sending it to the server. For this reason, change is defined as 9774 an unsigned integer rather than an opaque array of bytes.

9776 For the server, the following steps will be taken when providing a 9777 write delegation:

9779 o Upon providing a write delegation, the server will cache a copy of 9780 the change attribute in the data structure it uses to record the 9781 delegation. Let this value be represented by sc.

9783 o When a second client sends a GETATTR operation on the same file to 9784 the server, the server obtains the change attribute from the first 9785 client. Let this value be cc.

9787 o If the value cc is equal to sc, the file is not modified and the 9788 server returns the current values for change, time_metadata, and 9789 time_modify (for example) to the second client.

9791 o If the value cc is NOT equal to sc, the file is currently modified 9792 at the first client and most likely will be modified at the server 9793 at a future time. The server then uses its current time to 9794 construct attribute values for time_metadata and time_modify. A 9795 new value of sc, which we will call nsc, is computed by the 9796 server, such that nsc >= sc + 1. The server then returns the 9797 constructed time_metadata, time_modify, and nsc values to the 9798 requester. The server replaces sc in the delegation record with 9799 nsc. To prevent time_modify, time_metadata, 9800 and change from appearing to go backward (which would happen if 9801 the client holding the delegation fails to write its modified data 9802 to the server before the delegation is revoked or returned), the 9803 server SHOULD update the file's metadata record with the 9804 constructed attribute values. For reasons of reasonable 9805 performance, committing the constructed attribute values to stable 9806 storage is OPTIONAL.

9808 As discussed earlier in this section, the client MAY return the same 9809 cc value on subsequent CB_GETATTR calls, even if the file was 9810 modified in the client's cache yet again between successive 9811 CB_GETATTR calls. Therefore, the server must assume that the file 9812 has been modified yet again, and MUST take care to ensure that the 9813 new nsc it constructs and returns is greater than the previous nsc it 9814 returned. An example implementation's delegation record would 9815 satisfy this mandate by including a boolean field (let us call it 9816 "modified") that is set to FALSE when the delegation is granted, and 9817 an sc value set at the time of grant to the change attribute value. 9818 The modified field would be set to TRUE the first time cc != sc, and 9819 would stay TRUE until the delegation is returned or revoked.
The 9820 processing for constructing nsc, time_modify, and time_metadata would 9821 use this pseudo code:

9823       if (!modified) {
9824           do CB_GETATTR for change and size;

9826           if (cc != sc)
9827               modified = TRUE;
9828       } else {
9829           do CB_GETATTR for size;
9830       }

9832       if (modified) {
9833           sc = sc + 1;
9834           time_modify = time_metadata = current_time;
9835           update sc, time_modify, time_metadata into file's metadata;
9836       }

9838 This would return to the client (that sent GETATTR) the attributes it 9839 requested, but make sure size comes from what CB_GETATTR returned. 9840 The server would not update the file's metadata with the client's 9841 modified size.

9843 In the case that the file attribute size is different from the 9844 server's current value, the server treats this as a modification 9845 regardless of the value of the change attribute retrieved via 9846 CB_GETATTR and responds to the second client as in the last step.

9848 This methodology resolves issues of clock differences between client 9849 and server and other scenarios where the use of CB_GETATTR breaks 9850 down.

9852 It should be noted that the server is under no obligation to use 9853 CB_GETATTR, and therefore the server MAY simply recall the delegation 9854 to avoid its use.

9856 10.4.4. Recall of Open Delegation

9858 The following events necessitate recall of an open delegation:

9860 o Potentially conflicting OPEN request (or READ/WRITE done with 9861 "special" stateid)

9863 o SETATTR sent by another client

9865 o REMOVE request for the file

9867 o RENAME request for the file as either source or target of the 9868 RENAME

9870 Whether a RENAME of a directory in the path leading to the file 9871 results in recall of an open delegation depends on the semantics of 9872 the server's file system. If that file system denies such RENAMEs 9873 when a file is open, the recall must be performed to determine 9874 whether the file in question is, in fact, open.

9876 In addition to the situations above, the server may choose to recall 9877 open delegations at any time if resource constraints make it 9878 advisable to do so. Clients should always be prepared for the 9879 possibility of recall.

9881 When a client receives a recall for an open delegation, it needs to 9882 update state on the server before returning the delegation. These 9883 same updates must be done whenever a client chooses to return a 9884 delegation voluntarily. The following items of state need to be 9885 dealt with:

9887 o If the file associated with the delegation is no longer open and 9888 no previous CLOSE operation has been sent to the server, a CLOSE 9889 operation must be sent to the server.

9891 o If a file has other open references at the client, then OPEN 9892 operations must be sent to the server. The appropriate stateids 9893 will be provided by the server for subsequent use by the client 9894 since the delegation stateid will no longer be valid. These OPEN 9895 requests are done with the claim type of CLAIM_DELEGATE_CUR. This 9896 will allow the presentation of the delegation stateid so that the 9897 client can establish the appropriate rights to perform the OPEN. 9898 (See Section 18.16, which describes the OPEN operation, for 9899 details.)

9901 o If there are granted file locks, the corresponding LOCK operations 9902 need to be performed. This applies to the write open delegation 9903 case only.
9905 o For a write open delegation, if at the time of recall the file is 9906 not open for write, all modified data for the file must be flushed 9907 to the server. If the delegation had not existed, the client 9908 would have done this data flush before the CLOSE operation. 9910 o For a write open delegation when a file is still open at the time 9911 of recall, any modified data for the file needs to be flushed to 9912 the server. 9914 o With the write open delegation in place, it is possible that the 9915 file was truncated during the duration of the delegation. For 9916 example, the truncation could have occurred as a result of an OPEN 9917 UNCHECKED with a size attribute value of zero. Therefore, if a 9918 truncation of the file has occurred and this operation has not 9919 been propagated to the server, the truncation must occur before 9920 any modified data is written to the server. 9922 In the case of write open delegation, file locking imposes some 9923 additional requirements. To precisely maintain the associated 9924 invariant, it is required to flush any modified data in any region 9925 for which a write lock was released while the write delegation was in 9926 effect. However, because the write open delegation implies no other 9927 locking by other clients, a simpler implementation is to flush all 9928 modified data for the file (as described just above) if any write 9929 lock has been released while the write open delegation was in effect. 9931 An implementation need not wait until delegation recall (or deciding 9932 to voluntarily return a delegation) to perform any of the above 9933 actions, if implementation considerations (e.g. resource availability 9934 constraints) make that desirable. Generally, however, the fact that 9935 the actual open state of the file may continue to change makes it not 9936 worthwhile to send information about opens and closes to the server, 9937 except as part of delegation return. Only in the case of closing the 9938 open that resulted in obtaining the delegation would clients be 9939 likely to do this early, since, in that case, the close once done 9940 will not be undone. Regardless of the client's choices on scheduling 9941 these actions, all must be performed before the delegation is 9942 returned, including (when applicable) the close that corresponds to 9943 the open that resulted in the delegation. These actions can be 9944 performed either in previous requests or in previous operations in 9945 the same COMPOUND request. 9947 10.4.5. Clients that Fail to Honor Delegation Recalls 9949 A client may fail to respond to a recall for various reasons, such as 9950 a failure of the backchannel from server to the client. The client 9951 may be unaware of a failure in the backchannel. This lack of 9952 awareness could result in the client finding out long after the 9953 failure that its delegation has been revoked, and another client has 9954 modified the data for which the client had a delegation. This is 9955 especially a problem for the client that held a write delegation. 9957 Status bits returned by SEQUENCE operations help to provide an 9958 alternate way of informing the client of issues regarding the status 9959 of the backchannel and of recalled delegations. When the backchannel 9960 is not available, the server returns the status bit 9961 SEQ4_STATUS_CB_PATH_DOWN on SEQUENCE operations. 
The client can 9962 react by attempting to re-establish the backchannel and by returning 9963 recallable objects if a backchannel cannot be successfully re- 9964 established.

9966 Whether the backchannel is functioning or not, it may be that the 9967 recalled delegation is not returned. Note that the client's lease 9968 might still be renewed, even though the recalled delegation is not 9969 returned. In this situation, servers SHOULD revoke delegations that 9970 are not returned in a period of time equal to the lease period. This 9971 period of time should allow the client time to note the backchannel- 9972 down status and re-establish the backchannel.

9974 When delegations are revoked, the server will return with the 9975 SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit set on subsequent 9976 SEQUENCE operations. The client should note this and then use 9977 TEST_STATEID to find which delegations have been revoked.

9979 10.4.6. Delegation Revocation

9981 At the point a delegation is revoked, if there are associated opens 9982 on the client, these opens may or may not be revoked. If no lock or 9983 open is granted that is inconsistent with the existing open, the 9984 stateid for the open may remain valid, and be disconnected from the 9985 revoked delegation, just as would be the case if the delegation were 9986 returned.

9988 For example, if an OPEN for read-write with DENY=NONE is associated 9989 with the delegation, granting of another such OPEN to a different 9990 client will revoke the delegation but need not revoke the OPEN, since 9991 no lock inconsistent with that OPEN has been granted. On the other 9992 hand, if an OPEN denying write is granted, then the existing open 9993 must be revoked.

9995 When opens and/or locks are revoked, the applications holding these 9996 opens or locks need to be notified. This notification usually occurs 9997 by returning errors for READ/WRITE operations or when a close is 9998 attempted for the open file.

10000 If no opens exist for the file at the point the delegation is 10001 revoked, then notification of the revocation is unnecessary. 10002 However, if there is modified data present at the client for the 10003 file, the user of the application should be notified. Unfortunately, 10004 it may not be possible to notify the user since active applications 10005 may not be present at the client. See Section 10.5.1 for additional 10006 details.

10008 10.4.7. Delegations via WANT_DELEGATION

10010 In addition to providing delegations as part of the reply to OPEN 10011 operations, servers MAY provide delegations separate from open, via 10012 the OPTIONAL WANT_DELEGATION operation. This allows delegations to 10013 be obtained in advance of an OPEN that might benefit from them, for 10014 objects which are not a valid target of OPEN, or to deal with cases 10015 in which a delegation has been recalled and the client wants to make 10016 an attempt to re-establish it if the absence of use by other clients 10017 allows that.

10019 The WANT_DELEGATION operation may be performed on any type of file 10020 object other than a directory.

10022 When a delegation is obtained using WANT_DELEGATION, any open files 10023 for the same filehandle held by that client are to be treated as 10024 subordinate to the delegation, just as if they had been created using 10025 an OPEN of type CLAIM_DELEGATE_CUR. They are otherwise unchanged as 10026 to seqid, access and deny modes, and the relationship with byte-range 10027 locks.
Similarly, existing byte-range locks subordinate to an open 10028 that becomes subordinate to a delegation become indirectly 10029 subordinate to that new delegation.

10031 The WANT_DELEGATION operation provides for delivery of delegations 10032 via callbacks, when the delegations are not immediately available. 10033 When a requested delegation is available, it is delivered to the 10034 client via a CB_PUSH_DELEG operation. When this happens, open files 10035 for the same filehandle become subordinate to the new delegation at 10036 the point at which the delegation is delivered, just as if they had 10037 been created using an OPEN of type CLAIM_DELEGATE_CUR. The same 10038 applies to existing byte-range locks subordinate to such an open.

10040 10.5. Data Caching and Revocation

10042 When locks and delegations are revoked, the assumptions upon which 10043 successful caching depend are no longer guaranteed. For any locks or 10044 share reservations that have been revoked, the corresponding state- 10045 owner needs to be notified. This notification includes applications 10046 with a file open that has a corresponding delegation which has been 10047 revoked. Cached data associated with the revocation must be removed 10048 from the client. In the case of modified data existing in the 10049 client's cache, that data must be removed from the client without it 10050 being written to the server. As mentioned, the assumptions made by 10051 the client are no longer valid at the point when a lock or delegation 10052 has been revoked. For example, another client may have been granted 10053 a conflicting lock after the revocation of the lock at the first 10054 client. Therefore, the data within the lock range may have been 10055 modified by the other client. Obviously, the first client is unable 10056 to guarantee to the application what has occurred to the file in the 10057 case of revocation.

10059 Notification to a state-owner will in many cases consist of simply 10060 returning an error on the next and all subsequent READs/WRITEs to the 10061 open file or on the close. Where the methods available to a client 10062 make such notification impossible because errors for certain 10063 operations may not be returned, more drastic action such as signals 10064 or process termination may be appropriate. The justification for 10065 this is that an invariant on which an application depends may be 10066 violated. Depending on how errors are typically treated for the 10067 client operating environment, further levels of notification 10068 including logging, console messages, and GUI pop-ups may be 10069 appropriate.

10071 10.5.1. Revocation Recovery for Write Open Delegation

10073 Revocation recovery for a write open delegation poses the special 10074 issue of modified data in the client cache while the file is not 10075 open. In this situation, any client which does not flush modified 10076 data to the server on each close must ensure that the user receives 10077 appropriate notification of the failure as a result of the 10078 revocation. Since such situations may require human action to 10079 correct problems, notification schemes in which the appropriate user 10080 or administrator is notified may be necessary. Logging and console 10081 messages are typical examples.

10083 If there is modified data on the client, it must not be flushed 10084 normally to the server.
A client may attempt to provide a copy of 10085 the file data as modified during the delegation under a different 10086 name in the file system name space to ease recovery. Note that when 10087 the client can determine that the file has not been modified by any 10088 other client, or when the client has a complete cached copy of the file 10089 in question, such a saved copy of the client's view of the file may 10090 be of particular value for recovery. In other cases, recovery using a 10091 copy of the file based partially on the client's cached data and 10092 partially on the server's copy as modified by other clients will be 10093 anything but straightforward, so clients may avoid saving file 10094 contents in these situations or mark the results specially to warn 10095 users of possible problems.

10097 Saving of such modified data in delegation revocation situations may 10098 be limited to files of a certain size or might be used only when 10099 sufficient disk space is available within the target file system. 10100 Such saving may also be restricted to situations in which the client has 10101 sufficient buffering resources to keep the cached copy available 10102 until it is properly stored to the target file system.

10104 10.6. Attribute Caching

10106 This section pertains to the caching of a file's attributes on a 10107 client when that client does not hold a delegation on the file.

10109 The attributes discussed in this section do not include named 10110 attributes. Individual named attributes are analogous to files, and 10111 caching of the data for these needs to be handled just as data 10112 caching is for ordinary files. Similarly, LOOKUP results from an 10113 OPENATTR directory are to be cached on the same basis as any other 10114 pathnames, and similarly for directory contents.

10116 Clients may cache file attributes obtained from the server and use 10117 them to avoid subsequent GETATTR requests. Such caching is write-through 10118 in that modification to file attributes is always done by 10119 means of requests to the server and should not be done locally and 10120 cached. The exceptions to this are modifications to attributes that 10121 are intimately connected with data caching. Therefore, extending a 10122 file by writing data to the local data cache is reflected immediately 10123 in the size as seen on the client without this change being 10124 immediately reflected on the server. Normally such changes are not 10125 propagated directly to the server, but when the modified data is 10126 flushed to the server, analogous attribute changes are made on the 10127 server. When open delegation is in effect, the modified attributes 10128 may be returned to the server in reaction to a CB_RECALL call.

10130 The result of local caching of attributes is that the attribute 10131 caches maintained on individual clients will not be coherent. 10132 Changes made in one order on the server may be seen in a different 10133 order on one client and in a third order on a different client.

10135 The typical file system application programming interfaces do not 10136 provide means to atomically modify or interrogate attributes for 10137 multiple files at the same time. The following rules provide an 10138 environment where the potential incoherences mentioned above can be 10139 reasonably managed. These rules are derived from the practice of 10140 previous NFS protocols.
10142 o All attributes for a given file (per-fsid attributes excepted) are 10143 cached as a unit at the client so that no non-serializability can 10144 arise within the context of a single file. 10146 o An upper time boundary is maintained on how long a client cache 10147 entry can be kept without being refreshed from the server. 10149 o When operations are performed that change attributes at the 10150 server, the updated attribute set is requested as part of the 10151 containing RPC. This includes directory operations that update 10152 attributes indirectly. This is accomplished by following the 10153 modifying operation with a GETATTR operation and then using the 10154 results of the GETATTR to update the client's cached attributes. 10156 Note that if the full set of attributes to be cached is requested by 10157 READDIR, the results can be cached by the client on the same basis as 10158 attributes obtained via GETATTR. 10160 A client may validate its cached version of attributes for a file by 10161 fetching just both the change and time_access attributes and assuming 10162 that if the change attribute has the same value as it did when the 10163 attributes were cached, then no attributes other than time_access 10164 have changed. The reason why time_access is also fetched is because 10165 many servers operate in environments where the operation that updates 10166 change does not update time_access. For example, POSIX file 10167 semantics do not update access time when a file is modified by the 10168 write system call [17]. Therefore, the client that wants a current 10169 time_access value should fetch it with change during the attribute 10170 cache validation processing and update its cached time_access. 10172 The client may maintain a cache of modified attributes for those 10173 attributes intimately connected with data of modified regular files 10174 (size, time_modify, and change). Other than those three attributes, 10175 the client MUST NOT maintain a cache of modified attributes. 10176 Instead, attribute changes are immediately sent to the server. 10178 In some operating environments, the equivalent to time_access is 10179 expected to be implicitly updated by each read of the content of the 10180 file object. If an NFS client is caching the content of a file 10181 object, whether it is a regular file, directory, or symbolic link, 10182 the client SHOULD NOT update the time_access attribute (via SETATTR 10183 or a small READ or READDIR request) on the server with each read that 10184 is satisfied from cache. The reason is that this can defeat the 10185 performance benefits of caching content, especially since an explicit 10186 SETATTR of time_access may alter the change attribute on the server. 10187 If the change attribute changes, clients that are caching the content 10188 will think the content has changed, and will re-read unmodified data 10189 from the server. Nor is the client encouraged to maintain a modified 10190 version of time_access in its cache, since this would mean that the 10191 client will either eventually have to write the access time to the 10192 server with bad performance effects, or it would never update the 10193 server's time_access, thereby resulting in a situation where an 10194 application that caches access time between a close and open of the 10195 same file observes the access time oscillating between the past and 10196 present. The time_access attribute always means the time of last 10197 access to a file by a read that was satisfied by the server. 
This 10198 way clients will tend to see only time_access changes that go forward 10199 in time.

10201 10.7. Data and Metadata Caching and Memory Mapped Files

10203 Some operating environments include the capability for an application 10204 to map a file's content into the application's address space. Each 10205 time the application accesses a memory location that corresponds to a 10206 block that has not been loaded into the address space, a page fault 10207 occurs and the file is read (or if the block does not exist in the 10208 file, the block is allocated and then instantiated in the 10209 application's address space).

10211 As long as each memory mapped access to the file requires a page 10212 fault, the relevant attributes of the file that are used to detect 10213 access and modification (time_access, time_metadata, time_modify, and 10214 change) will be updated. However, in many operating environments, 10215 when page faults are not required, these attributes will not be 10216 updated on reads or updates to the file via memory access (regardless 10217 of whether the file is a local file or is being accessed remotely). A 10218 client or server MAY fail to update attributes of a file that is 10219 being accessed via memory mapped I/O. This has several implications:

10221 o If there is an application on the server that has memory mapped a 10222 file that a client is also accessing, the client may not be able 10223 to get a consistent value of the change attribute to determine 10224 whether its cache is stale or not. A server that knows that the 10225 file is memory mapped could always pessimistically return updated 10226 values for change so as to force the application to always get the 10227 most up-to-date data and metadata for the file. However, due to 10228 the negative performance implications of this, such behavior is 10229 OPTIONAL.

10231 o If the memory mapped file is not being modified on the server, and 10232 instead is just being read by an application via the memory mapped 10233 interface, the client will not see an updated time_access 10234 attribute. However, in many operating environments, neither will 10235 any process running on the server. Thus NFS clients are at no 10236 disadvantage with respect to local processes.

10238 o If there is another client that is memory mapping the file, and if 10239 that client is holding a write delegation, the same set of issues 10240 as discussed in the previous two bullet items apply. So, when a 10241 server does a CB_GETATTR to a file that the client has modified in 10242 its cache, the reply from CB_GETATTR will not necessarily be 10243 accurate. As discussed earlier, the client's obligation is to 10244 report that the file has been modified since the delegation was 10245 granted, not whether it has been modified again between successive 10246 CB_GETATTR calls, and the server MUST assume that any file the 10247 client has modified in cache has been modified again between 10248 successive CB_GETATTR calls. Depending on the nature of the 10249 client's memory management system, it may not be possible to fulfill even 10250 this weak obligation. A client MAY return stale information in CB_GETATTR 10251 whenever the file is memory mapped.

10253 o The mixture of memory mapping and file locking on the same file is 10254 problematic. Consider the following scenario, where the page size 10255 on each client is 8192 bytes.
10257 * Client A memory maps the first page (8192 bytes) of file X

10259 * Client B memory maps the first page (8192 bytes) of file X

10261 * Client A write locks the first 4096 bytes

10263 * Client B write locks the second 4096 bytes

10265 * Client A, via a STORE instruction, modifies part of its locked 10266 region.

10268 * Simultaneously with client A, client B executes a STORE on part of 10269 its locked region.

10271 Here the challenge is for each client to resynchronize to get a 10272 correct view of the first page. In many operating environments, the 10273 virtual memory management systems on each client only know that a page is 10274 modified, not that a subset of the page corresponding to the 10275 respective lock regions has been modified. So it is not possible for 10276 each client to do the right thing, which is to write to the 10277 server only that portion of the page that is locked. For example, if 10278 client A simply writes out the page, and then client B writes out the 10279 page, client A's data is lost.

10281 Moreover, if mandatory locking is enabled on the file, then we have a 10282 different problem. When clients A and B execute the STORE 10283 instructions, the resulting page faults require a byte-range lock on 10284 the entire page. Each client then tries to extend its locked range 10285 to the entire page, which results in a deadlock. Communicating the 10286 NFS4ERR_DEADLOCK error to a STORE instruction is difficult at best.

10288 If a client is locking the entire memory mapped file, there is no 10289 problem with advisory or mandatory byte-range locking, at least until 10290 the client unlocks a region in the middle of the file.

10292 Given the above issues, the following are permitted:

10294 o Clients and servers MAY deny memory mapping a file for which they know 10295 there are byte-range locks.

10297 o Clients and servers MAY deny a byte-range lock on a file they know 10298 is memory mapped.

10300 o A client MAY deny memory mapping a file that it knows requires 10301 mandatory locking for I/O. If mandatory locking is enabled after 10302 the file is opened and mapped, the client MAY deny the application 10303 further access to its mapped file.

10305 10.8. Name and Directory Caching without Directory Delegations

10307 The NFSv4.1 directory delegation facility (described in Section 10.9 10308 below) is OPTIONAL for servers to implement. Even where it is 10309 implemented, it may not always be functional because of resource 10310 availability issues or other constraints. Thus, it is important to 10311 understand how name and directory caching are done in the absence of 10312 directory delegations. Those topics are discussed in 10313 Section 10.8.1.

10315 10.8.1. Name Caching

10317 The results of LOOKUP and READDIR operations may be cached to avoid 10318 the cost of subsequent LOOKUP operations. Just as in the case of 10319 attribute caching, inconsistencies may arise among the various client 10320 caches. To mitigate the effects of these inconsistencies, and given 10321 the context of typical file system APIs, an upper time boundary is 10322 maintained on how long a client name cache entry can be kept without 10323 verifying that the entry has not been made invalid by a directory 10324 change operation performed by another client.

10326 When a client is not making changes to a directory for which there 10327 exist name cache entries, the client needs to periodically fetch 10328 attributes for that directory to ensure that it is not being 10329 modified.
After determining that no modification has occurred, the 10330 expiration time for the associated name cache entries may be updated 10331 to be the current time plus the name cache staleness bound.

10333 When a client is making changes to a given directory, it needs to 10334 determine whether there have been changes made to the directory by 10335 other clients. It does this by using the change attribute as 10336 reported before and after the directory operation in the associated 10337 change_info4 value returned for the operation. The server is able to 10338 communicate to the client whether the change_info4 data is provided 10339 atomically with respect to the directory operation. If the change 10340 values are provided atomically, the client has a basis for 10341 determining, given proper care, whether other clients are modifying 10342 the directory in question.

10344 The simplest way to enable the client to make this determination is 10345 for the client to serialize all changes made to a specific directory. 10346 When this is done, and the server provides before and after values of 10347 the change attribute atomically, the client can simply compare the 10348 after value of the change attribute from one operation on a directory 10349 with the before value on the next subsequent operation modifying that 10350 directory. When these are equal, the client is assured that no other 10351 client is modifying the directory in question.

10353 When such serialization is not used, and there may be multiple 10354 simultaneous outstanding operations modifying a single directory sent 10355 from a single client, making this sort of determination can be more 10356 complicated, since two such operations that are recognized as 10357 complete in a different order than they were actually performed 10358 might give an appearance consistent with modification being made by 10359 another client. Where this appears to happen, the client needs to 10360 await the completion of all such modifications that were started 10361 previously, to see if the outstanding before and after change numbers 10362 can be sorted into a chain such that the before value of one change 10363 number matches the after value of a previous one, in a chain 10364 consistent with this client being the only one modifying the 10365 directory.

10367 In either of these cases, the client is able to determine whether the 10368 directory is being modified by another client. If the comparison 10369 indicates that the directory was updated by another client, the name 10370 cache associated with the modified directory is purged from the 10371 client. If the comparison indicates no modification, the name cache 10372 can be updated on the client to reflect the directory operation and 10373 the associated timeout extended. The post-operation change value 10374 needs to be saved as the basis for future change_info4 comparisons.

10376 As demonstrated by the scenario above, name caching requires that the 10377 client revalidate name cache data by inspecting the change attribute 10378 of a directory at the point when the name cache item was cached. 10379 This requires that the server update the change attribute for 10380 directories when the contents of the corresponding directory are 10381 modified. For a client to use the change_info4 information 10382 appropriately and correctly, the server must report the pre and post 10383 operation change attribute values atomically.
When the server is 10384 unable to report the before and after values atomically with respect 10385 to the directory operation, the server must indicate that fact in the 10386 change_info4 return value. When the information is not atomically 10387 reported, the client should not assume that other clients have not 10388 changed the directory. 10390 10.8.2. Directory Caching 10392 The results of READDIR operations may be used to avoid subsequent 10393 READDIR operations. Just as in the cases of attribute and name 10394 caching, inconsistencies may arise among the various client caches. 10395 To mitigate the effects of these inconsistencies, and given the 10396 context of typical file system APIs, the following rules should be 10397 followed: 10399 o Cached READDIR information for a directory that is not obtained 10400 in a single READDIR operation must always be a consistent snapshot 10401 of directory contents. This is determined by using a GETATTR 10402 before the first READDIR and after the last READDIR that 10403 contributes to the cache. 10405 o An upper time boundary is maintained to indicate the length of 10406 time a directory cache entry is considered valid before the client 10407 must revalidate the cached information. 10409 The revalidation technique parallels that discussed in the case of 10410 name caching. When the client is not changing the directory in 10411 question, checking the change attribute of the directory with GETATTR 10412 is adequate. The lifetime of the cache entry can be extended at 10413 these checkpoints. When a client is modifying the directory, the 10414 client needs to use the change_info4 data to determine whether there 10415 are other clients modifying the directory. If it is determined that 10416 no other client modifications are occurring, the client may update 10417 its directory cache to reflect its own changes. 10419 As demonstrated previously, directory caching requires that the 10420 client revalidate directory cache data by inspecting the change 10421 attribute of a directory at the point when the directory was cached. 10422 This requires that the server update the change attribute for 10423 directories when the contents of the corresponding directory are 10424 modified. For a client to use the change_info4 information 10425 appropriately and correctly, the server must report the pre- and post-operation 10426 change attribute values atomically. When the server is 10427 unable to report the before and after values atomically with respect 10428 to the directory operation, the server must indicate that fact in the 10429 change_info4 return value. When the information is not atomically 10430 reported, the client should not assume that other clients have not 10431 changed the directory. 10433 10.9. Directory Delegations 10435 10.9.1. Introduction to Directory Delegations 10437 Directory caching for the NFSv4.1 protocol, as previously described, 10438 is similar to file caching in previous versions. Clients typically 10439 cache directory information for a duration determined by the client. 10440 At the end of a predefined timeout, the client will query the server 10441 to see if the directory has been updated. By caching attributes, 10442 clients reduce the number of GETATTR calls made to the server to 10443 validate attributes.
Furthermore, frequently accessed files and 10444 directories, such as the current working directory, have their 10445 attributes cached on the client so that some NFS operations can be 10446 performed without having to make an RPC call. By caching name and 10447 inode information about the most recently looked up entries in a 10448 Directory Name Lookup Cache (DNLC), clients do not need to send 10449 LOOKUP calls to the server every time these files are accessed. 10451 This caching approach works reasonably well at reducing network 10452 traffic in many environments. However, it does not address 10453 environments where there are numerous queries for files that do not 10454 exist. In these cases of "misses", the client sends requests to the 10455 server in order to provide reasonable application semantics and 10456 promptly detect the creation of new directory entries. An example of 10457 high miss activity is compilation in software development 10458 environments. The current behavior of NFS limits its potential 10459 scalability and wide-area sharing effectiveness in these types of 10460 environments. Other distributed stateful file system architectures 10461 such as AFS and DFS have proven that adding state around directory 10462 contents can greatly reduce network traffic in high-miss 10463 environments. 10465 Delegation of directory contents is an OPTIONAL feature of NFSv4.1. 10466 Directory delegations provide traffic reduction benefits similar to 10467 those of file delegations. By allowing clients to cache directory 10468 contents (in a read-only fashion) while being notified of changes, 10469 the client can avoid making frequent requests to interrogate the 10470 contents of slowly-changing directories, reducing network traffic and 10471 improving client performance. It can also simplify the task of 10472 determining whether other clients are making changes to the directory 10473 when the client itself is making many changes to the directory and 10474 changes are not serialized. 10476 Directory delegations allow improved namespace cache consistency to 10477 be achieved through delegations and synchronous recalls, in the 10478 absence of notifications. In addition, if time-based consistency is 10479 sufficient, asynchronous notifications can provide performance 10480 benefits for the client, and possibly the server, under some common 10481 operating conditions such as slowly-changing and/or very large 10482 directories. 10484 10.9.2. Directory Delegation Design 10486 NFSv4.1 introduces the GET_DIR_DELEGATION (Section 18.39) operation 10487 to allow the client to ask for a directory delegation. The 10488 delegation covers directory attributes and all entries in the 10489 directory. If either of these changes, the delegation will be 10490 recalled synchronously. The operation causing the recall will have 10491 to wait until the recall is complete. Any changes to directory 10492 entry attributes will not cause the delegation to be recalled. 10494 In addition to asking for delegations, a client can also ask for 10495 notifications for certain events. These events include changes to 10496 the directory's attributes and/or its contents. If a client asks for 10497 notification for a certain event, the server will notify the client 10498 when that event occurs. This will not result in the delegation being 10499 recalled for that client.
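The division of labor between notifications and recalls in this design can be sketched as follows. This is purely illustrative client-side bookkeeping, with hypothetical class and method names; the protocol elements involved are the GET_DIR_DELEGATION operation, the notifications a client may request with it, and the recall callback.

   # Sketch only: client-side state for a directory covered by a
   # read-only directory delegation.  Names are illustrative.

   class CachedDirectory:
       def __init__(self, entries):
           self.entries = dict(entries)   # name -> cached attributes
           self.delegated = True

       def on_notify_entry_added(self, name, attrs):
           # Asynchronous notification: the cache stays current and
           # the delegation is not recalled.
           self.entries[name] = attrs

       def on_notify_entry_removed(self, name):
           self.entries.pop(name, None)

       def on_recall(self):
           # Synchronous recall: the directory changed in a way not
           # covered by the requested notifications, so the cached
           # contents must be refreshed from the server (e.g., via
           # new READDIR/GETATTR requests) before further use.
           self.delegated = False
           self.entries.clear()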
The notifications are asynchronous and 10500 provide a way of avoiding recalls in situations where a directory is 10501 changing enough that the pure recall model may not be effective, while 10502 still allowing the client to get substantial benefit. In the 10503 absence of notifications, once the delegation is recalled the client 10504 has to refresh its directory cache, which might not be very efficient 10505 for very large directories. 10507 The delegation is read-only and the client may not make changes to 10508 the directory other than by performing NFSv4.1 operations that modify 10509 the directory or the associated file attributes so that the server 10510 has knowledge of these changes. In order to keep the client 10511 namespace synchronized with the server, the server will, if the 10512 client has requested notifications, notify the client holding the 10513 delegation of the changes made as a result. This is to avoid any 10514 need for subsequent GETATTR or READDIR calls to the server. If a 10515 single client is holding the delegation and that client makes any 10516 changes to the directory (i.e., the changes are made via operations 10517 sent through a session associated with the client ID holding the 10518 delegation), the delegation will not be recalled. Multiple clients 10519 may hold a delegation on the same directory, but if any such client 10520 modifies the directory, the server MUST recall the delegation from 10521 the other clients, unless those clients have made provisions to be 10522 notified of that sort of modification. 10524 Delegations can be recalled by the server at any time. Normally, the 10525 server will recall the delegation when the directory changes in a way 10526 that is not covered by the notification, or when the directory 10527 changes and notifications have not been requested. If another client 10528 removes the directory for which a delegation has been granted, the 10529 server will recall the delegation. 10531 10.9.3. Attributes in Support of Directory Notifications 10533 See Section 5.11 for a description of the attributes associated with 10534 directory notifications. 10536 10.9.4. Directory Delegation Recall 10538 The server will recall the directory delegation by sending a callback 10539 to the client. It will use the same callback procedure as used for 10540 recalling file delegations. The server will recall the delegation 10541 when the directory changes in a way that is not covered by the 10542 notification. However, the server need not recall the delegation if 10543 attributes of an entry within the directory change. 10545 If the server notices that handing out a delegation for a directory 10546 is causing too many notifications to be sent out, it may decide not 10547 to hand out delegations for that directory, or to recall those already 10548 granted. If a client tries to remove the directory for which a 10549 delegation has been granted, the server will recall all associated 10550 delegations. 10552 The implementation sections for a number of operations describe 10553 situations in which notification or delegation recall would be 10554 required under some common circumstances. In this regard, a similar 10555 set of caveats to those listed in Section 10.2 apply. 10557 o For CREATE, see Section 18.4.4. 10559 o For LINK, see Section 18.9.4. 10561 o For OPEN, see Section 18.16.4. 10563 o For REMOVE, see Section 18.25.4. 10565 o For RENAME, see Section 18.26.4. 10567 o For SETATTR, see Section 18.30.4. 10569 10.9.5.
Directory Delegation Recovery 10571 Recovery from client or server restart for state on regular files has 10572 two main goals: avoiding the necessity of breaking application 10573 guarantees with respect to locked files, and delivery of updates 10574 cached at the client. Neither of these goals applies to directories 10575 protected by read delegations and notifications. Thus, no provision 10576 is made for reclaiming directory delegations in the event of client 10577 or server restart. The client can simply establish a directory 10578 delegation in the same fashion as was done initially. 10580 11. Multi-Server Namespace 10582 NFSv4.1 supports attributes that allow a namespace to extend beyond 10583 the boundaries of a single server. It is RECOMMENDED that clients 10584 and servers support construction of such multi-server namespaces. 10585 Use of such multi-server namespaces is OPTIONAL, however, and for many 10586 purposes, single-server namespaces are perfectly acceptable. Use of 10587 multi-server namespaces can provide many advantages, however, by 10588 separating a file system's logical position in a namespace from the 10589 (possibly changing) logistical and administrative considerations that 10590 result in particular file systems being located on particular 10591 servers. 10593 11.1. Location Attributes 10595 NFSv4.1 contains RECOMMENDED attributes that allow file systems on 10596 one server to be associated with one or more instances of that file 10597 system on other servers. These attributes specify such file system 10598 instances by specifying a server address target (either as a DNS name 10599 representing one or more IP addresses or as a literal IP address) 10600 together with the path of that file system within the associated 10601 single-server namespace. 10603 The fs_locations_info RECOMMENDED attribute allows specification of 10604 one or more file system instance locations where the data 10605 corresponding to a given file system may be found. This attribute 10606 provides to the client, in addition to information about file system 10607 instance locations, significant information about the various file 10608 system instance choices (e.g., priority for use, writability, 10609 currency, etc.). It also includes information to help the client 10610 efficiently effect as seamless a transition as possible among 10611 multiple file system instances, when and if that should be necessary. 10613 The fs_locations RECOMMENDED attribute is inherited from NFSv4.0 and 10614 only allows specification of the file system locations where the data 10615 corresponding to a given file system may be found. Servers SHOULD 10616 make this attribute available whenever fs_locations_info is 10617 supported, but client use of fs_locations_info is to be preferred. 10619 11.2. File System Presence or Absence 10621 A given location in an NFSv4.1 namespace (typically but not 10622 necessarily a multi-server namespace) can have a number of file 10623 system instance locations associated with it (via the fs_locations or 10624 fs_locations_info attribute). There may also be an actual current 10625 file system at that location, accessible via normal namespace 10626 operations (e.g., LOOKUP). In this case, the file system is said to 10627 be "present" at that position in the namespace and clients will 10628 typically use it, reserving use of additional locations specified via 10629 the location-related attributes to situations in which the principal 10630 location is no longer available.
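On the client side, the location attributes described in Section 11.1 amount to a list of (server target, path) pairs for each file system. The sketch below shows one possible client-side representation; the class and field names are illustrative only and are not the XDR names used on the wire.

   # Sketch only: a client-side view of location information for one
   # file system.  Field names are not the on-the-wire XDR names.

   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class FsLocation:
       server: str           # DNS name (possibly mapping to several
                             # IP addresses) or a literal IP address
       rootpath: List[str]   # path of the file system within that
                             # server's single-server namespace

   @dataclass
   class FsLocationSet:
       fs_root: List[str]    # position of the fs in the local namespace
       locations: List[FsLocation] = field(default_factory=list)

       def candidates(self):
           # Alternate locations to try when the principal location
           # becomes unavailable; ordering and priority information
           # is only available when fs_locations_info is used.
           return list(self.locations)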
10632 When there is no actual file system at the namespace location in 10633 question, the file system is said to be "absent". An absent file 10634 system contains no files or directories other than the root. Any 10635 reference to it, except to access a small set of attributes useful in 10636 determining alternate locations, will result in an error, 10637 NFS4ERR_MOVED. Note that if the server ever returns the error 10638 NFS4ERR_MOVED, it MUST support the fs_locations attribute and SHOULD 10639 support the fs_locations_info and fs_status attributes. 10641 While the error name suggests that we have a case of a file system 10642 which once was present, and has only become absent later, this is 10643 only one possibility. A position in the namespace may be permanently 10644 absent with the set of file system(s) designated by the location 10645 attributes being the only realization. The name NFS4ERR_MOVED 10646 reflects an earlier, more limited conception of its function, but 10647 this error will be returned whenever the referenced file system is 10648 absent, whether it has moved or not. 10650 Except in the case of GETATTR-type operations (to be discussed 10651 later), when the current filehandle at the start of an operation is 10652 within an absent file system, that operation is not performed and the 10653 error NFS4ERR_MOVED returned, to indicate that the file system is 10654 absent on the current server. 10656 Because a GETFH cannot succeed if the current filehandle is within an 10657 absent file system, filehandles within an absent file system cannot 10658 be transferred to the client. When a client does have filehandles 10659 within an absent file system, it is the result of obtaining them when 10660 the file system was present, and having the file system become absent 10661 subsequently. 10663 It should be noted that because the check for the current filehandle 10664 being within an absent file system happens at the start of every 10665 operation, operations that change the current filehandle so that it 10666 is within an absent file system will not result in an error. This 10667 allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be 10668 used to get attribute information, particularly location attribute 10669 information, as discussed below. 10671 The RECOMMENDED file system attribute fs_status can be used to 10672 interrogate the present/absent status of a given file system. 10674 11.3. Getting Attributes for an Absent File System 10676 When a file system is absent, most attributes are not available, but 10677 it is necessary to allow the client access to the small set of 10678 attributes that are available, and most particularly those that give 10679 information about the correct current locations for this file system, 10680 fs_locations and fs_locations_info. 10682 11.3.1. GETATTR Within an Absent File System 10684 As mentioned above, an exception is made for GETATTR in that 10685 attributes may be obtained for a filehandle within an absent file 10686 system. This exception only applies if the attribute mask contains 10687 at least one attribute bit that indicates the client is interested in 10688 a result regarding an absent file system: fs_locations, 10689 fs_locations_info, or fs_status. If none of these attributes is 10690 requested, GETATTR will result in an NFS4ERR_MOVED error. 10692 When a GETATTR is done on an absent file system, the set of supported 10693 attributes is very limited. 
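The exception for GETATTR can be illustrated with the following sketch. The getattr_fn helper is a hypothetical stand-in for whatever a client implementation uses to send a GETATTR and obtain the mask of attributes actually returned; only the rule about including at least one of the location-related attributes comes from the text above.

   # Sketch only: fetching attributes for a possibly absent file
   # system.  getattr_fn(fh, wanted) is a hypothetical helper that
   # returns (supported_mask, values) or raises on NFS4ERR_MOVED.

   LOCATION_ATTRS = {"fs_locations", "fs_locations_info", "fs_status"}

   def probe_possibly_absent_fs(getattr_fn, fh, wanted):
       if not (set(wanted) & LOCATION_ATTRS):
           # Without at least one location-related attribute in the
           # request, a GETATTR within an absent file system will
           # simply fail with NFS4ERR_MOVED.
           raise ValueError("request at least one location attribute")
       supported_mask, values = getattr_fn(fh, wanted)
       # The reply carries the mask of attributes the server could
       # actually supply; anything missing from that mask (even a
       # normally REQUIRED attribute) is unavailable here.
       return {name: values[name] for name in supported_mask}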
Many attributes, including those that 10694 are normally REQUIRED, will not be available on an absent file 10695 system. In addition to the attributes mentioned above (fs_locations, 10696 fs_locations_info, fs_status), the following attributes SHOULD be 10697 available on absent file systems, in the case of RECOMMENDED 10698 attributes at least to the same degree that they are available on 10699 present file systems. 10701 change_policy: This attribute is useful for absent file systems and 10702 can be helpful in summarizing to the client when any of the 10703 location-related attributes changes. 10705 fsid: This attribute should be provided so that the client can 10706 determine file system boundaries, including, in particular, the 10707 boundary between present and absent file systems. This value must 10708 be different from any other fsid on the current server and need 10709 have no particular relationship to fsids on any particular 10710 destination to which the client might be directed. 10712 mounted_on_fileid: For objects at the top of an absent file system 10713 this attribute needs to be available. Since the fileid is one 10714 which is within the present parent file system, there should be no 10715 need to reference the absent file system to provide this 10716 information. 10718 Other attributes SHOULD NOT be made available for absent file 10719 systems, even when it is possible to provide them. The server should 10720 not assume that more information is always better and should avoid 10721 gratuitously providing additional information. 10723 When a GETATTR operation includes a bit mask for one of the 10724 attributes fs_locations, fs_locations_info, or fs_status, but where 10725 the bit mask includes attributes which are not supported, GETATTR 10726 will not return an error, but will return the mask of the actual 10727 attributes supported with the results. 10729 Handling of VERIFY/NVERIFY is similar to GETATTR in that if the 10730 attribute mask does not include fs_locations, fs_locations_info, or 10731 fs_status, the error NFS4ERR_MOVED will result. It differs in that 10732 any appearance in the attribute mask of an attribute not supported 10733 for an absent file system (and note that this will include some 10734 normally REQUIRED attributes), will also cause an NFS4ERR_MOVED 10735 result. 10737 11.3.2. READDIR and Absent File Systems 10739 A READDIR performed when the current filehandle is within an absent 10740 file system will result in an NFS4ERR_MOVED error, since, unlike the 10741 case of GETATTR, no such exception is made for READDIR. 10743 Attributes for an absent file system may be fetched via a READDIR for 10744 a directory in a present file system, when that directory contains 10745 the root directories of one or more absent file systems. In this 10746 case, the handling is as follows: 10748 o If the attribute set requested includes one of the attributes 10749 fs_locations, fs_locations_info, or fs_status, then fetching of 10750 attributes proceeds normally and no NFS4ERR_MOVED indication is 10751 returned, even when the rdattr_error attribute is requested. 10753 o If the attribute set requested does not include one of the 10754 attributes fs_locations, fs_locations_info, or fs_status, then if 10755 the rdattr_error attribute is requested, each directory entry for 10756 the root of an absent file system, will report NFS4ERR_MOVED as 10757 the value of the rdattr_error attribute. 
10759 o If the attribute set requested does not include any of the 10760 attributes fs_locations, fs_locations_info, fs_status, or 10761 rdattr_error then the occurrence of the root of an absent file 10762 system within the directory will result in the READDIR failing 10763 with an NFS4ERR_MOVED error. 10765 o The unavailability of an attribute because of a file system's 10766 absence, even one that is ordinarily REQUIRED, does not result in 10767 any error indication. The set of attributes returned for the root 10768 directory of the absent file system in that case is simply 10769 restricted to those actually available. 10771 11.4. Uses of Location Information 10773 The location-bearing attributes (fs_locations and fs_locations_info), 10774 provide, together with the possibility of absent file systems, a 10775 number of important facilities in providing reliable, manageable, and 10776 scalable data access. 10778 When a file system is present, these attributes can provide 10779 alternative locations, to be used to access the same data, in the 10780 event of server failures, communications problems, or other 10781 difficulties that make continued access to the current file system 10782 impossible or otherwise impractical. Under some circumstances 10783 multiple alternative locations may be used simultaneously to provide 10784 higher performance access to the file system in question. Provision 10785 of such alternate locations is referred to as "replication" although 10786 there are cases in which replicated sets of data are not in fact 10787 present, and the replicas are instead different paths to the same 10788 data. 10790 When a file system is present and becomes absent, clients can be 10791 given the opportunity to have continued access to their data, at an 10792 alternate location. In this case, a continued attempt to use the 10793 data in the now-absent file system will result in an NFS4ERR_MOVED 10794 error and at that point the successor locations (typically only one 10795 but multiple choices are possible) can be fetched and used to 10796 continue access. Transfer of the file system contents to the new 10797 location is referred to as "migration", but it should be kept in mind 10798 that there are cases in which this term can be used, like 10799 "replication", when there is no actual data migration per se. 10801 Where a file system was not previously present, specification of file 10802 system location provides a means by which file systems located on one 10803 server can be associated with a namespace defined by another server, 10804 thus allowing a general multi-server namespace facility. A 10805 designation of such a location, in place of an absent file system, is 10806 called a "referral". 10808 Because client support for location-related attributes is OPTIONAL, a 10809 server may (but is not required to) take action to hide migration and 10810 referral events from such clients, by acting as a proxy, for example. 10811 The server can determine the presence of client support from the 10812 arguments of the EXCHANGE_ID operation (see Section 18.35.3). 10814 11.4.1. File System Replication 10816 The fs_locations and fs_locations_info attributes provide alternative 10817 locations, to be used to access data in place of or in addition to 10818 the current file system instance. 
On first access to a file system, 10819 the client should obtain the value of the set of alternate locations 10820 by interrogating the fs_locations or fs_locations_info attribute, 10821 with the latter being preferred. 10823 In the event that server failures, communications problems, or other 10824 difficulties make continued access to the current file system 10825 impossible or otherwise impractical, the client can use the alternate 10826 locations as a way to get continued access to its data. Depending on 10827 specific attributes of these alternate locations, as indicated within 10828 the fs_locations_info attribute, multiple locations may be used 10829 simultaneously, to provide higher performance through the 10830 exploitation of multiple paths between client and target file system. 10832 The alternate locations may be physical replicas of the (typically 10833 read-only) file system data, or they may reflect alternate paths to 10834 the same server or provide for the use of various forms of server 10835 clustering in which multiple servers provide alternate ways of 10836 accessing the same physical file system. How these different modes 10837 of file system transition are represented within the fs_locations and 10838 fs_locations_info attributes and how the client deals with file 10839 system transition issues will be discussed in detail below. 10841 Multiple server addresses, whether they are derived from a single 10842 entry with a DNS name representing a set of IP addresses, or from 10843 multiple entries each with its own server address may correspond to 10844 the same actual server. The fact that two addresses correspond to 10845 the same server is shown by a common so_major_id field within the 10846 eir_server_owner field returned by EXCHANGE_ID (see Section 18.35.3). 10847 For a detailed discussion of how server address targets interact with 10848 the determination of server identity specified by the server owner 10849 field, see Section 11.5. 10851 11.4.2. File System Migration 10853 When a file system is present and becomes absent, clients can be 10854 given the opportunity to have continued access to their data, at an 10855 alternate location, as specified by the fs_locations or 10856 fs_locations_info attribute. Typically, a client will be accessing 10857 the file system in question, get an NFS4ERR_MOVED error, and then use 10858 the fs_locations or fs_locations_info attribute to determine the new 10859 location of the data. When fs_locations_info is used, additional 10860 information will be available which will define the nature of the 10861 client's handling of the transition to a new server. 10863 Such migration can be helpful in providing load balancing or general 10864 resource reallocation. The protocol does not specify how the file 10865 system will be moved between servers. It is anticipated that a 10866 number of different server-to-server transfer mechanisms might be 10867 used with the choice left to the server implementer. The NFSv4.1 10868 protocol specifies the method used to communicate the migration event 10869 between client and server. 10871 The new location may be an alternate communication path to the same 10872 server, or, in the case of various forms of server clustering, 10873 another server providing access to the same physical file system. 10875 The client's responsibilities in dealing with this transition depend 10876 on the specific nature of the new access path and how and whether 10877 data was in fact migrated. 
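In outline, a client might react to migration along the lines of the following sketch. The helpers passed in are hypothetical placeholders for a client implementation's internals; the protocol elements actually involved are the NFS4ERR_MOVED error and the fs_locations/fs_locations_info attributes.

   # Sketch only: reacting to NFS4ERR_MOVED on a migrated file
   # system.  All helpers are hypothetical placeholders.

   class Moved(Exception):
       """Raised by the (hypothetical) I/O layer on NFS4ERR_MOVED."""

   def access_with_migration(io_fn, fetch_locations, switch_to):
       # io_fn performs the desired operation; fetch_locations
       # retrieves fs_locations_info (preferred) or fs_locations for
       # the file system; switch_to re-targets access at a new
       # location.
       try:
           return io_fn()
       except Moved:
           locations = fetch_locations()
           if not locations:
               raise
           # A real client would use the priority and class
           # information in fs_locations_info to pick a target.
           switch_to(locations[0])
           return io_fn()    # retry at the new location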
These issues will be discussed in detail 10878 below. 10880 When multiple server addresses correspond to the same actual server, 10881 as shown by a common value for the so_major_id field of the 10882 eir_server_owner field returned by EXCHANGE_ID, the location or 10883 locations may designate alternate server addresses in the form of 10884 specific server network addresses. These can be used to access the 10885 file system in question at those addresses and when it is no longer 10886 accessible at the original address. 10888 Although a single successor location is typical, multiple locations 10889 may be provided, together with information that allows priority among 10890 the choices to be indicated, via information in the fs_locations_info 10891 attribute. Where suitable clustering mechanisms make it possible to 10892 provide multiple identical file systems or paths to them, this allows 10893 the client the opportunity to deal with any resource or 10894 communications issues that might limit data availability. 10896 When an alternate location is designated as the target for migration, 10897 it must designate the same data (with metadata being the same to the 10898 degree indicated by the fs_locations_info attribute). Where file 10899 systems are writable, a change made on the original file system must 10900 be visible on all migration targets. Where a file system is not 10901 writable but represents a read-only copy (possibly periodically 10902 updated) of a writable file system, similar requirements apply to the 10903 propagation of updates. Any change visible in the original file 10904 system must already be effected on all migration targets, to avoid 10905 any possibility, that a client in effecting a transition to the 10906 migration target will see any reversion in file system state. 10908 11.4.3. Referrals 10910 Referrals provide a way of placing a file system in a location within 10911 the namespace essentially without respect to its physical location on 10912 a given server. This allows a single server or a set of servers to 10913 present a multi-server namespace that encompasses file systems 10914 located on multiple servers. Some likely uses of this include 10915 establishment of site-wide or organization-wide namespaces, or even 10916 knitting such together into a truly global namespace. 10918 Referrals occur when a client determines, upon first referencing a 10919 position in the current namespace, that it is part of a new file 10920 system and that the file system is absent. When this occurs, 10921 typically by receiving the error NFS4ERR_MOVED, the actual location 10922 or locations of the file system can be determined by fetching the 10923 fs_locations or fs_locations_info attribute. 10925 The locations-related attribute may designate a single file system 10926 location or multiple file system locations, to be selected based on 10927 the needs of the client. The server, in the fs_locations_info 10928 attribute may specify priorities to be associated with various file 10929 system location choices. The server may assign different priorities 10930 to different locations as reported to individual clients, in order to 10931 adapt to client physical location or to effect load balancing. When 10932 both read-only and read-write file systems are present, some of the 10933 read-only locations may not be absolutely up-to-date (as they would 10934 have to be in the case of replication and migration). 
Servers may 10935 also specify file system locations that include client-substituted 10936 variables so that different clients are referred to different file 10937 systems (with different data contents) based on client attributes 10938 such as CPU architecture. 10940 When the fs_locations_info attribute indicates that there are 10941 multiple possible targets listed, the relationships among them may be 10942 important to the client in selecting the one to use. The same rules 10943 specified in Section 11.4.1 defining the appropriate standards for 10944 the data propagation, apply to these multiple replicas as well. For 10945 example, the client might prefer a writable target on a server that 10946 has additional writable replicas to which it subsequently might 10947 switch. Note that, as distinguished from the case of replication, 10948 there is no need to deal with the case of propagation of updates made 10949 by the current client, since the current client has not accessed the 10950 file system in question. 10952 Use of multi-server namespaces is enabled by NFSv4.1 but is not 10953 required. The use of multi-server namespaces and their scope will 10954 depend on the applications used, and system administration 10955 preferences. 10957 Multi-server namespaces can be established by a single server 10958 providing a large set of referrals to all of the included file 10959 systems. Alternatively, a single multi-server namespace may be 10960 administratively segmented with separate referral file systems (on 10961 separate servers) for each separately-administered portion of the 10962 namespace. Any segment or the top-level referral file system may use 10963 replicated referral file systems for higher availability. 10965 Generally, multi-server namespaces are for the most part uniform, in 10966 that the same data made available to one client at a given location 10967 in the namespace is made available to all clients at that location. 10968 There are however facilities provided which allow different clients 10969 to be directed to different sets of data, so as to adapt to such 10970 client characteristics as CPU architecture. 10972 11.5. Location Entries and Server Identity 10974 As mentioned above, a single location entry may have a server address 10975 target in the form of a DNS name which may represent multiple IP 10976 addresses, while multiple location entries may have their own server 10977 address targets, that reference the same server. Whether two IP 10978 addresses designate the same server is indicated by the existence of 10979 a common so_major_id field within the eir_server_owner field returned 10980 by EXCHANGE_ID (see Section 18.35.3), subject to further 10981 verification, for details of which see Section 2.10.5. 10983 When multiple addresses for the same server exist, the client may 10984 assume that for each file system in the namespace of a given server 10985 network address, there exist file systems at corresponding namespace 10986 locations for each of the other server network addresses. It may do 10987 this even in the absence of explicit listing in fs_locations and 10988 fs_locations_info. Such corresponding file system locations can be 10989 used as alternate locations, just as those explicitly specified via 10990 the fs_locations and fs_locations_info attributes. Where these 10991 specific addresses are explicitly designated in the fs_locations_info 10992 attribute, the conditions of use specified in this attribute (e.g. 
10993 priorities, specification of simultaneous use) may limit the client's 10994 use of these alternate locations. 10996 If a single location entry designates multiple server IP addresses, 10997 the client cannot assume that these addresses are multiple paths to 10998 the same server. In most cases they will be, but the client MUST 10999 verify that before acting on that assumption. When two server 11000 addresses are designated by a single location entry and they 11001 correspond to different servers, this normally indicates some sort of 11002 misconfiguration, and so the client should avoid using such location 11003 entries when alternatives are available. When they are not, clients 11004 should pick one of the IP addresses and use it, without using others that 11005 are not directed to the same server. 11007 11.6. Additional Client-side Considerations 11009 When clients make use of servers that implement referrals, 11010 replication, and migration, care should be taken so that a user who 11011 mounts a given file system that includes a referral or a relocated 11012 file system continues to see a coherent picture of that user-side 11013 file system despite the fact that it contains a number of server-side 11014 file systems which may be on different servers. 11016 One important issue is upward navigation from the root of a server- 11017 side file system to its parent (specified as ".." in UNIX), in the 11018 case in which the client transitions to that file system as a result of 11019 referral, migration, or a transition as a result of replication. 11021 When the client is at such a point, and it needs to ascend to the 11022 parent, it must go back to the parent as seen within the multi-server 11023 namespace rather than issuing a LOOKUPP call to the server, which would 11024 result in the parent within that server's single-server namespace. 11025 In order to do this, the client needs to remember the filehandles 11026 that represent such file system roots, and use these instead of 11027 issuing a LOOKUPP to the current server. This will allow the client 11028 to present to applications a consistent namespace, where upward 11029 navigation and downward navigation are consistent. 11031 Another issue concerns refresh of referral locations. When referrals 11032 are used extensively, they may change as server configurations 11033 change. It is expected that clients will cache information related 11034 to traversing referrals so that future client-side requests are 11035 resolved locally without server communication. This is usually 11036 rooted in client-side name lookup caching. Clients should 11037 periodically purge this data for referral points in order to detect 11038 changes in location information. When the change_policy attribute 11039 changes for directories that hold referral entries or for the 11040 referral entries themselves, clients should consider any associated 11041 cached referral information to be out of date. 11043 11.7. Effecting File System Transitions 11045 Transitions between file system instances, whether due to switching 11046 between replicas upon server unavailability, or in response to 11047 server-initiated migration events, are best dealt with together. This 11048 is so even though, for the server, pragmatic considerations will 11049 normally force different implementation strategies for planned and 11050 unplanned transitions.
Even though the prototypical use cases of 11051 replication and migration contain distinctive sets of features, when 11052 all possibilities for these operations are considered, there is an 11053 underlying unity of these operations, from the client's point of 11054 view, that makes treating them together desirable. 11056 A number of methods are possible for servers to replicate data and to 11057 track client state in order to allow clients to transition between 11058 file system instances with a minimum of disruption. Such methods 11059 vary between those that use inter-server clustering techniques to 11060 limit the changes seen by the client, to those that are less 11061 aggressive, use more standard methods of replicating data, and impose 11062 a greater burden on the client to adapt to the transition. 11064 The NFSv4.1 protocol does not impose choices on clients and servers 11065 with regard to that spectrum of transition methods. In fact, there 11066 are many valid choices, depending on client and application 11067 requirements and their interaction with server implementation 11068 choices. The NFSv4.1 protocol does define the specific choices that 11069 can be made, how these choices are communicated to the client and how 11070 the client is to deal with any discontinuities. 11072 In the sections below, references will be made to various possible 11073 server implementation choices as a way of illustrating the transition 11074 scenarios that clients may deal with. The intent here is not to 11075 define or limit server implementations but rather to illustrate the 11076 range of issues that clients may face. 11078 In the discussion below, references will be made to a file system 11079 having a particular property or of two file systems (typically the 11080 source and destination) belonging to a common class of any of several 11081 types. Two file systems that belong to such a class share some 11082 important aspect of file system behavior that clients may depend upon 11083 when present, to easily effect a seamless transition between file 11084 system instances. Conversely, where the file systems do not belong 11085 to such a common class, the client has to deal with various sorts of 11086 implementation discontinuities which may cause performance or other 11087 issues in effecting a transition. 11089 Where the fs_locations_info attribute is available, such file system 11090 classification data will be made directly available to the client 11091 (see Section 11.10 for details). When only fs_locations is 11092 available, default assumptions with regard to such classifications 11093 have to be inferred (see Section 11.9 for details). 11095 In cases in which one server is expected to accept opaque values from 11096 the client that originated from another server, the servers SHOULD 11097 encode the "opaque" values in big endian byte order. If this is 11098 done, servers acting as replicas or immigrating file systems will be 11099 able to parse values like stateids, directory cookies, filehandles, 11100 etc. even if their native byte order is different from that of other 11101 servers cooperating in the replication and migration of the file 11102 system. 11104 11.7.1. 
File System Transitions and Simultaneous Access 11106 When a single file system may be accessed at multiple locations, 11107 whether this is because of an indication of file system identity as 11108 reported by the fs_locations or fs_locations_info attributes or 11109 because two file system instances have corresponding locations on 11110 server addresses which connect to the same server (as indicated by a 11111 common so_major_id field in the eir_server_owner field returned by 11112 EXCHANGE_ID), the client will, depending on specific circumstances as 11113 discussed below, behave in one of the following ways: 11115 o The client accesses multiple instances simultaneously, as 11116 representing alternate paths to the same data and metadata. 11118 o The client accesses one instance (or set of instances) and then 11119 transitions to an alternative instance (or set of instances) as a 11120 result of network issues, server unresponsiveness, or server- 11121 directed migration. The transition may involve changes in 11122 filehandles, fileids, the change attribute, and/or locking state, 11123 depending on the attributes of the source and destination file 11124 system instances, as specified in the fs_locations_info attribute. 11126 Which of these choices is possible, and how a transition is effected, 11127 is governed by equivalence classes of file system instances as 11128 reported by the fs_locations_info attribute, and, for file system 11129 instances in the same location within multiple single-server 11130 namespaces, as indicated by the so_major_id field in the 11131 eir_server_owner field returned by EXCHANGE_ID. 11133 11.7.2. Simultaneous Use and Transparent Transitions 11135 When two file system instances have the same location within their 11136 respective single-server namespaces and those two server network 11137 addresses designate the same server (as indicated by the same 11138 so_major_id value in the eir_server_owner value returned in response 11139 to EXCHANGE_ID), those file system instances can be treated as the 11140 same, and either used together simultaneously or serially with no 11141 transition activity required on the part of the client. In this case, 11142 we refer to the transition as "transparent", and the client, in 11143 transferring access from one to the other, is acting as it would in the 11144 event that communication is interrupted, with a new connection and 11145 possibly a new session being established to continue access to the 11146 same file system. 11148 Whether simultaneous use of the two file system instances is valid is 11149 controlled by whether the fs_locations_info attribute shows the two 11150 instances as having the same _simultaneous-use_ class. See 11151 Section 11.10.1 for information about the definition of the various 11152 use classes, including the _simultaneous-use_ class. 11154 Note that for two such file systems, any information within the 11155 fs_locations_info attribute that indicates the need for special 11156 transition activity, i.e., the appearance of the two file system 11157 instances with different _handle_, _fileid_, _write-verifier_, 11158 _change_, or _readdir_ classes, indicates a serious problem and the 11159 client, if it allows transition to the file system instance at all, 11160 must not treat this as a transparent transition.
The server SHOULD 11161 NOT indicate that these instances belong to different _handle_, 11162 _fileid_, _write-verifier_, _change_, or _readdir_ classes, whether the 11163 two instances are shown belonging to the same _simultaneous-use_ 11164 class or not. 11166 Where these conditions do not apply, a non-transparent file system 11167 instance transition is required, with the details depending on the 11168 respective _handle_, _fileid_, _write-verifier_, _change_, and _readdir_ 11169 classes of the two file system instances and whether the two server 11170 addresses in question have the same eir_server_scope value as reported 11171 by EXCHANGE_ID. 11173 11.7.2.1. Simultaneous Use of File System Instances 11175 When the conditions in Section 11.7.2 hold, in either of the 11176 following two cases, the client may use the two file system instances 11177 simultaneously. 11179 o The fs_locations_info attribute does not contain separate per- 11180 network-address entries for file system instances at the distinct 11181 network addresses. This includes the case in which the 11182 fs_locations_info attribute is unavailable. In this case, the 11183 fact that the two server addresses connect to the same server (as 11184 indicated by the two addresses sharing the same so_major_id 11185 value and subsequently confirmed as described in Section 2.10.5) 11186 justifies simultaneous use, and there is no fs_locations_info 11187 attribute information contradicting that. 11189 o The fs_locations_info attribute indicates that two file system 11190 instances belong to the same _simultaneous-use_ class. 11192 In this case, the client may use both file system instances 11193 simultaneously, as representations of the same file system, whether 11194 that happens because the two network addresses connect to the same 11195 physical server or because different servers connect to clustered 11196 file systems and export their data in common. When simultaneous use 11197 is in effect, any change made to one file system instance must be 11198 immediately reflected in the other file system instance(s). Locks 11199 are treated as part of a common lease, associated with a common 11200 client ID. Depending on the details of the eir_server_owner returned 11201 by EXCHANGE_ID, the two server instances may be accessed by different 11202 sessions or a single session in common. 11204 11.7.2.2. Transparent File System Transitions 11206 When the conditions in Section 11.7.2.1 hold and the 11207 fs_locations_info attribute explicitly shows the file system 11208 instances for these distinct network addresses as belonging to 11209 different _simultaneous-use_ classes, the file system instances 11210 should not be used by the client simultaneously, but rather serially, 11211 with one being used unless and until communication difficulties, lack 11212 of responsiveness, or an explicit migration event causes another file 11213 system instance (or set of file system instances sharing a common 11214 _simultaneous-use_ class) to be used. 11216 When a change of file system instance is to be done, the client will 11217 use the same client ID already in effect. If it already has 11218 connections to the new server address, these will be used. Otherwise, 11219 new connections to existing sessions or new sessions associated with 11220 the existing client ID are established as indicated by the 11221 eir_server_owner returned by EXCHANGE_ID.
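The client-side handling of such a transition can be outlined as in the sketch below. The connection- and session-management helpers are hypothetical; the point taken from the text above is that the existing client ID (and, where possible, existing sessions) are simply reused.

   # Sketch only: shifting access to a different network address of
   # the same server (a transparent transition).  The methods on
   # 'client' are hypothetical placeholders.

   def transparent_transition(client, new_addr):
       conn = client.find_connection(new_addr)
       if conn is None:
           conn = client.connect(new_addr)
       if client.can_bind_existing_session(conn):
           # Associate an existing session with the new connection;
           # no new state needs to be established.
           client.bind_session_to_connection(conn)
       else:
           # Create a new session under the *same* client ID; locks,
           # stateids, and the lease are unaffected.
           client.create_session(conn, client.client_id)
       return conn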
11223 In all such transparent transition cases, the following apply: 11225 o If filehandles are persistent they stay the same. If filehandles 11226 are volatile, they either stay the same, or if they expire, the 11227 reason for expiration is not due to the file system transition. 11229 o Fileid values do not change across the transition. 11231 o The file system will have the same fsid in both the old and new 11232 locations. 11234 o Change attribute values are consistent across the transition and 11235 do not have to be refetched. When change attributes indicate that 11236 a cached object is still valid, it can remain cached. 11238 o Client and state identifiers retain their validity across the 11239 transition, except where their staleness is recognized and 11240 reported by the new server. Except where such staleness requires 11241 it, no lock reclamation is needed. Any such staleness is an 11242 indication that the server should be considered to have restarted 11243 and is reported as discussed in Section 8.4.2. 11245 o Write verifiers are presumed to retain their validity and can be 11246 used to compare with verifiers returned by COMMIT on the new 11247 server, with the expectation that if COMMIT on the new server 11248 returns an identical verifier, then that server has all of the 11249 data unstably written to the original server and has committed it 11250 to stable storage as requested. 11252 o Readdir cookies are presumed to retain their validity and can be 11253 presented to subsequent READDIR requests together with the readdir 11254 verifier with which they are associated. When the verifier is 11255 accepted as valid, the cookie will continue the READDIR operation 11256 so that the entire directory can be obtained by the client. 11258 11.7.3. Filehandles and File System Transitions 11260 There are a number of ways in which filehandles can be handled across 11261 a file system transition. These can be divided into two broad 11262 classes depending upon whether the two file systems across which the 11263 transition happens share sufficient state to effect some sort of 11264 continuity of file system handling. 11266 When there is no such co-operation in filehandle assignment, the two 11267 file systems are reported as being in different _handle_ classes. In 11268 this case, all filehandles are assumed to expire as part of the file 11269 system transition. Note that this behavior does not depend on 11270 fh_expire_type attribute and supersedes the specification of 11271 FH4_VOL_MIGRATION bit, which only affects behavior when 11272 fs_locations_info is not available. 11274 When there is co-operation in filehandle assignment, the two file 11275 systems are reported as being in the same _handle_ classes. In this 11276 case, persistent filehandles remain valid after the file system 11277 transition, while volatile filehandles (excluding those that are only 11278 volatile due to the FH4_VOL_MIGRATION bit) are subject to expiration 11279 on the target server. 11281 11.7.4. Fileids and File System Transitions 11283 In NFSv4.0, the issue of continuity of fileids in the event of a file 11284 system transition was not addressed. The general expectation had 11285 been that in situations in which the two file system instances are 11286 created by a single vendor using some sort of file system image copy, 11287 fileids will be consistent across the transition while in the 11288 analogous multi-vendor transitions they will not. 
This poses 11289 difficulties, especially for the client without special knowledge of 11290 the transition mechanisms adopted by the server. Note that although 11291 fileid is not a REQUIRED attribute, many servers support fileids and 11292 many clients provide APIs that depend on fileids. 11294 It is important to note that while clients themselves may have no 11295 trouble with a fileid changing as a result of a file system 11296 transition event, applications do typically have access to the fileid 11297 (e.g., via stat), and the result of this is that an application may 11298 work perfectly well if there is no file system instance transition or 11299 if any such transition is among instances created by a single vendor, 11300 yet be unable to deal with the situation in which a multi-vendor 11301 transition occurs at the wrong time. 11303 Providing the same fileids in a multi-vendor (multiple server 11304 vendors) environment has generally been held to be quite difficult. 11305 While there is work to be done, it needs to be pointed out that this 11306 difficulty is partly self-imposed. Servers have typically identified 11307 fileid with inode number, i.e., with a quantity used to find the file 11308 in question. This identification poses special difficulties for 11309 migration of a file system between vendors where assigning the same 11310 index to a given file may not be possible. Note here that a fileid 11311 is not required to be useful to find the file in question, only that 11312 it is unique within the given file system. Servers prepared to 11313 accept a fileid as a single piece of metadata and store it apart from 11314 the value used to index the file information can relatively easily 11315 maintain a fileid value across a migration event, allowing a truly 11316 transparent migration event. 11318 In any case, where servers can provide continuity of fileids, they 11319 should, and the client should be able to find out that such 11320 continuity is available and take appropriate action. Information 11321 about the continuity (or lack thereof) of fileids across a file 11322 system transition is represented by specifying whether the file 11323 systems in question are of the same _fileid_ class. 11325 Note that when consistent fileids do not exist across a transition 11326 (either because there is no continuity of fileids or because fileid 11327 is not a supported attribute on one of the instances involved), and there 11328 are no reliable filehandles across a transition event (either because 11329 there is no filehandle continuity or because the filehandles are 11330 volatile), the client is in a position where it cannot verify that 11331 files it was accessing before the transition are the same objects. 11332 It is forced to assume that no object has been renamed, and, unless 11333 there are guarantees that provide this (e.g., the file system is read- 11334 only), problems for applications may occur. Therefore, use of such 11335 configurations should be limited to situations where the problems 11336 that this may cause can be tolerated. 11338 11.7.5. Fsids and File System Transitions 11340 Since fsids are generally only unique on a per-server basis, it 11341 is likely that they will change during a file system transition. One 11342 exception is the case of transparent transitions, but in that case we 11343 have multiple network addresses that are defined as the same server 11344 (as specified by a common value of the so_major_id field of 11345 eir_server_owner).
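One way for a client to shield applications from such fsid changes is to hand out locally generated identifiers and keep a private mapping to whatever fsid the current server reports. The sketch below is purely an implementation illustration with no protocol standing; the class and method names are invented.

   # Sketch only: presenting stable, client-local fsids to
   # applications while the server-reported fsid may change across a
   # file system transition.

   import itertools

   class FsidMap:
       def __init__(self):
           self._counter = itertools.count(1)
           self._local_by_server = {}   # server fsid -> local id

       def local_fsid(self, server_fsid):
           # Return a stable local identifier for a server fsid.
           if server_fsid not in self._local_by_server:
               self._local_by_server[server_fsid] = next(self._counter)
           return self._local_by_server[server_fsid]

       def note_transition(self, old_server_fsid, new_server_fsid):
           # After a transition, the same file system may appear
           # under a different server fsid; keep the local id stable.
           local = self._local_by_server.get(old_server_fsid)
           if local is not None:
               self._local_by_server[new_server_fsid] = local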
Clients should not make the fsids received from 11346 the server visible to applications since they may not be globally 11347 unique, and because they may change during a file system transition 11348 event. Applications are best served if they are isolated from such 11349 transitions to the extent possible. 11351 Although normally a single source file system will transition to a 11352 single target file system, there is a provision for splitting a 11353 single source file system into multiple target file systems, by 11354 specifying the FSLI4F_MULTI_FS flag. 11356 11.7.5.1. File System Splitting 11358 When a file system transition is made and the fs_locations_info 11359 indicates that the file system in question may be split into multiple 11360 file systems (via the FSLI4F_MULTI_FS flag), the client SHOULD do 11361 GETATTRs to determine the fsid attribute on all known objects within 11362 the file system undergoing transition to determine the new file 11363 system boundaries. 11365 Clients may maintain the fsids passed to existing applications by 11366 mapping all of the fsids for the descendant file systems to the 11367 common fsid used for the original file system. 11369 Splitting a file system may be done on a transition between file 11370 systems of the same _fileid_ class, since the fact that fileids are 11371 unique within the source file system ensures they will be unique in 11372 each of the target file systems. 11374 11.7.6. The Change Attribute and File System Transitions 11376 Since the change attribute is defined as a server-specific one, 11377 change attributes fetched from one server are normally presumed to be 11378 invalid on another server. Such a presumption is troublesome since 11379 it would invalidate all cached change attributes, requiring 11380 refetching. Even more disruptive, the absence of any assured 11381 continuity for the change attribute means that even if the same value 11382 is retrieved on refetch, no conclusions can be drawn as to whether the 11383 object in question has changed. The identical change attribute could 11384 be merely an artifact of a modified file with a different change 11385 attribute construction algorithm, with that new algorithm just 11386 happening to result in an identical change value. 11388 When the two file systems have consistent change attribute formats, 11389 and this fact is communicated to the client by reporting them as being in the 11390 same _change_ class, the client may assume a continuity of change 11391 attribute construction and handle this situation just as it would be 11392 handled without any file system transition. 11394 11.7.7. Lock State and File System Transitions 11396 In a file system transition, the client needs to handle cases in 11397 which the two servers have cooperated in state management and in 11398 which they have not. Cooperation by two servers in state management 11399 requires coordination of client IDs. Before the client attempts to 11400 use a client ID associated with one server in a request to the server 11401 of the other file system, it must eliminate the possibility that two 11402 non-cooperating servers have assigned the same client ID by accident. 11403 The client needs to compare the eir_server_scope values returned by 11404 each server. If the scope values do not match, then the servers have 11405 not cooperated in state management.
11.7.7.  Lock State and File System Transitions

In a file system transition, the client needs to handle cases in which the two servers have cooperated in state management and in which they have not.  Cooperation by two servers in state management requires coordination of client IDs.  Before the client attempts to use a client ID associated with one server in a request to the server of the other file system, it must eliminate the possibility that two non-cooperating servers have assigned the same client ID by accident.  The client needs to compare the eir_server_scope values returned by each server.  If the scope values do not match, then the servers have not cooperated in state management.  If the scope values match, then this indicates that the servers have cooperated in assigning client IDs to the point that they will reject client IDs that refer to state they do not know about.  See Section 2.10.4 for more information about the use of server scope.

In the case of migration, the servers involved in the migration of a file system SHOULD transfer all server state from the original to the new server.  When this is done, it must be done in a way that is transparent to the client.  With replication, such a degree of common state is typically not the case.  Clients, however, should use the information provided by the eir_server_scope returned by EXCHANGE_ID (as modified by the validation procedures described in Section 2.10.4) to determine whether such sharing may be in effect, rather than making assumptions based on the reason for the transition.

This state transfer will reduce disruption to the client when a file system transition occurs.  If the servers are successful in transferring all state, the client can attempt to establish sessions associated with the client ID used for the source file system instance.  If the server accepts that as a valid client ID, then the client may use the existing stateids associated with that client ID for the old file system instance in connection with that same client ID on the transitioned file system instance.  If the client in question already had a client ID on the target system, it may interrogate the stateid values from the source system under that new client ID, with the assurance that if they are accepted as valid, then they represent validly transferred lock state for the source file system, transferred to the target server.

When the two servers belong to the same server scope, it does not mean that, when dealing with the transition, the client will not have to reclaim state.  However, it does mean that the client may proceed using its current client ID when establishing communication with the new server, and the new server will either recognize the client ID as valid or reject it, in which case locks must be reclaimed by the client.

File systems co-operating in state management may actually share state or simply divide the identifier space so as to recognize (and reject as stale) each other's stateids and client IDs.  Servers which do share state may not do so under all conditions or at all times.  The requirement for the server is that if it cannot be sure in accepting a client ID that it reflects the locks the client was given, it must treat all associated state as stale and report it as such to the client.

When the two file system instances are on servers that do not share a server scope value, the client must establish a new client ID on the destination, if it does not have one already, and reclaim locks if allowed by the server.  In this case, old stateids and client IDs should not be presented to the new server since there is no assurance that they will not conflict with IDs valid on that server.  Note that in this case lock reclaim may be attempted even when the servers involved in the transfer have different server scope values (see Section 8.4.2.1 for the contrary case of reclaim after server reboot).

11462 Servers with different server scope values may co-operate to allow 11463 reclaim for locks associated with the transfer of a filesystem even 11464 if they do not co-operate sufficiently to share a server scope. 11466 In either case, when actual locks are not known to be maintained, the 11467 destination server may establish a grace period specific to the given 11468 file system, with non-reclaim locks being rejected for that file 11469 system, even though normal locks are being granted for other file 11470 systems. Clients should not infer the absence of a grace period for 11471 file systems being transitioned to a server from responses to 11472 requests for other file systems. 11474 In the case of lock reclamation for a given file system after a file 11475 system transition, edge conditions can arise similar to those for 11476 reclaim after server restart (although in the case of the planned 11477 state transfer associated with migration, these can be avoided by 11478 securely recording lock state as part of state migration). Unless 11479 the destination server can guarantee that locks will not be 11480 incorrectly granted, the destination server should not allow lock 11481 reclaims and avoid establishing a grace period. 11483 Once all locks have been reclaimed, or there were no locks to 11484 reclaim, the client indicates that there are no more reclaims to be 11485 done for the file system in question by issuing a RECLAIM_COMPLETE 11486 operation with the rca_one_fs parameter set to true. Once this has 11487 been done, non-reclaim locking operations may be done, and any 11488 subsequent request to do reclaims will be rejected with the error 11489 NFS4ERR_NO_GRACE. 11491 Information about client identity may be propagated between servers 11492 in the form of client_owner4 and associated verifiers, under the 11493 assumption that the client presents the same values to all the 11494 servers with which it deals. 11496 Servers are encouraged to provide facilities to allow locks to be 11497 reclaimed on the new server after a file system transition. Often, 11498 however, in cases in which the two servers do not share a server 11499 scope value, such facilities may not be available and client should 11500 be prepared to re-obtain locks, even though it is possible that the 11501 client may have its LOCK or OPEN request denied due to a conflicting 11502 lock. 11504 The consequences of having no facilities available to reclaim locks 11505 on the new server will depend on the type of environment. In some 11506 environments, such as the transition between read-only file systems, 11507 such denial of locks should not pose large difficulties in practice. 11508 When an attempt to re-establish a lock on a new server is denied, the 11509 client should treat the situation as if its original lock had been 11510 revoked. Note that when the lock is granted, the client cannot 11511 assume that no conflicting lock could have been granted in the 11512 interim. Where change attribute continuity is present, the client 11513 may check the change attribute to check for unwanted file 11514 modifications. Where even this is not available, and the file system 11515 is not read-only, a client may reasonably treat all pending locks as 11516 having been revoked. 11518 11.7.7.1. Leases and File System Transitions 11520 In the case of lease renewal, the client may not be submitting 11521 requests for a file system that has been transferred to another 11522 server. 
This can occur because of the lease renewal mechanism.  The client renews the lease associated with all file systems when submitting a request on an associated session, regardless of the specific file system being referenced.

In order for the client to schedule renewal of leases where there is locking state that may have been relocated to the new server, the client must find out about lease relocation before those leases expire.  To accomplish this, the SEQUENCE operation will return the status bit SEQ4_STATUS_LEASE_MOVED if responsibility for any of the locking state renewed has been transferred to a new server.  This will continue until the client receives an NFS4ERR_MOVED error for each of the file systems for which there has been locking state relocation.

When a client receives a SEQ4_STATUS_LEASE_MOVED indication, it should perform an operation on each file system associated with the server for which the current client holds locking state.  The client may choose to reference all file systems in the interests of simplicity, but what is important is that it must reference all file systems for which there was locking state that moved.  Once the client receives an NFS4ERR_MOVED error for each such file system, the SEQ4_STATUS_LEASE_MOVED indication is cleared.  The client can terminate the process of checking file systems once this indication is cleared (but only if the client has received a reply for all outstanding SEQUENCE requests on all sessions it has with the server), since there are no others for which locking state has moved.

A client may use GETATTR of the fs_status (or fs_locations_info) attribute on all of the file systems to get absence indications in a single (or a few) request(s), since absent file systems will not cause an error in this context.  However, it still must do an operation that receives NFS4ERR_MOVED on each file system, in order to clear the SEQ4_STATUS_LEASE_MOVED indication.

Once the set of file systems with transferred locking state has been determined, the client can follow the normal process to obtain the new server information (through the fs_locations and fs_locations_info attributes) and perform renewal of those leases on the new server, unless information in the fs_locations_info attribute shows that no state could have been transferred.  If the server has not had state transferred to it transparently, the client will receive NFS4ERR_STALE_CLIENTID from the new server, as described above, and the client can then reclaim locks as is done in the event of server failure.
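
The following C fragment sketches the client behavior just described.  The structures and helper functions (probe_fs, recover_moved_fs) are hypothetical stand-ins; only the status bit and the NFS4ERR_MOVED error come from the protocol.

   #include <stdbool.h>
   #include <stddef.h>

   #define NFS4ERR_MOVED 10019     /* from the NFSv4.1 error definitions */

   struct client_fs {
       struct client_fs *next;
       bool holds_locking_state;
       bool known_moved;
   };

   /* Hypothetical helper: a cheap operation (e.g. GETATTR of fs_status)
    * directed at the given file system; returns an NFSv4.1 status.     */
   int probe_fs(struct client_fs *fs);

   /* Hypothetical helper: begin migration recovery for one file system. */
   void recover_moved_fs(struct client_fs *fs);

   void handle_lease_moved(struct client_fs *fs_list)
   {
       for (struct client_fs *fs = fs_list; fs != NULL; fs = fs->next) {
           if (!fs->holds_locking_state || fs->known_moved)
               continue;
           if (probe_fs(fs) == NFS4ERR_MOVED) {
               fs->known_moved = true;
               recover_moved_fs(fs);  /* fetch fs_locations{,_info}, renew
                                       * on the destination or reclaim   */
           }
       }
       /* The client repeats this whenever SEQ4_STATUS_LEASE_MOVED is
        * still set in subsequent SEQUENCE replies.                      */
   }
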
11.7.7.2.  Transitions and the Lease_time Attribute

In order that the client may appropriately manage its leases in the case of a file system transition, the destination server must establish proper values for the lease_time attribute.

When state is transferred transparently, that state should include the correct value of the lease_time attribute.  The lease_time attribute on the destination server must never be less than that on the source, since this would result in premature expiration of leases granted by the source server.  Upon transitions in which state is transferred transparently, the client is under no obligation to re-fetch the lease_time attribute and may continue to use the value previously fetched (on the source server).

If state has not been transferred transparently, either because the associated servers are shown as having different eir_server_scope strings or because the client ID is rejected when presented to the new server, the client should fetch the value of lease_time on the new (i.e. destination) server, and use it for subsequent locking requests.  However, the server must respect a grace period at least as long as the lease_time on the source server, in order to ensure that clients have ample time to reclaim their locks before potentially conflicting non-reclaimed locks are granted.

11.7.8.  Write Verifiers and File System Transitions

In a file system transition, the two file systems may be clustered in the handling of unstably written data.  When this is the case, and the two file systems belong to the same _write-verifier_ class, write verifiers returned from one system may be compared to those returned by the other and superfluous writes avoided.

When two file systems belong to different _write-verifier_ classes, any verifier generated by one must not be compared to one provided by the other.  Instead, it should be treated as not equal even when the values are identical.

11.7.9.  Readdir Cookies and Verifiers and File System Transitions

In a file system transition, the two file systems may be consistent in their handling of READDIR cookies and verifiers.  When this is the case, and the two file systems belong to the same _readdir_ class, READDIR cookies and verifiers from one system may be recognized by the other, and READDIR operations started on one server may be validly continued on the other simply by presenting the cookie and verifier returned by a READDIR operation done on the first file system to the second.

When two file systems belong to different _readdir_ classes, any READDIR cookie and verifier generated by one is not valid on the second and must not be presented to that server by the client.  The client should act as if the verifier was rejected.
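
A minimal C sketch of the two rules in Sections 11.7.8 and 11.7.9 follows.  The helper names and boolean arguments are hypothetical; the verifier size is the one defined by the NFSv4.1 XDR.

   #include <stdbool.h>
   #include <string.h>

   #define NFS4_VERIFIER_SIZE 8    /* from the NFSv4.1 XDR definitions */
   typedef unsigned char verifier4[NFS4_VERIFIER_SIZE];

   static bool write_verifier_matches(bool same_write_verifier_class,
                                      const verifier4 before,
                                      const verifier4 after)
   {
       if (!same_write_verifier_class)
           return false;            /* must re-send uncommitted WRITEs */
       return memcmp(before, after, NFS4_VERIFIER_SIZE) == 0;
   }

   static bool readdir_cookie_usable(bool same_readdir_class)
   {
       /* In a different _readdir_ class the cookie and cookie verifier
        * must not be presented to the new server at all; the client
        * restarts the directory scan as if the verifier were rejected. */
       return same_readdir_class;
   }
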
11.7.10.  File System Data and File System Transitions

When multiple replicas exist and are used simultaneously or in succession by a client, applications using them will normally expect that they contain the same data or data that is consistent with the normal sorts of changes that are made by other clients updating the data of the file system (with metadata being the same to the degree indicated by the fs_locations_info attribute).  However, when multiple file systems are presented as replicas of one another, the precise relationship between the data of one and the data of another is not, as a general matter, specified by the NFSv4.1 protocol.  It is quite possible to present as replicas file systems where the data of those file systems is sufficiently different that some applications have problems dealing with the transition between replicas.  The namespace will typically be constructed so that applications can choose an appropriate level of support, so that in one position in the namespace a varied set of replicas will be listed, while in another only those that are up-to-date may be considered replicas.  The protocol does define four special cases of the relationship among replicas to be specified by the server and relied upon by clients:

o  When multiple server addresses correspond to the same actual server, as indicated by a common so_major_id field within the eir_server_owner field returned by EXCHANGE_ID, the client may depend on the fact that changes to data, metadata, or locks made on one file system are immediately reflected on others.

o  When multiple replicas exist and are used simultaneously by a client (see the FSLI4BX_CLSIMUL definition within fs_locations_info), they must designate the same data.  Where file systems are writable, a change made on one instance must be visible on all instances, immediately upon the earlier of the return of the modifying request or the visibility of that change on any of the associated replicas.  This allows a client to use these replicas simultaneously without any special adaptation to the fact that there are multiple replicas.  In this case, locks, whether shared or byte-range, and delegations obtained on one replica are immediately reflected on all replicas, even though these locks will be managed under a set of client IDs.

o  When one replica is designated as the successor instance to another existing instance after a return of NFS4ERR_MOVED (i.e. the case of migration), the client may depend on the fact that all changes securely made to data (uncommitted writes are dealt with in Section 11.7.8) on the original instance are made to the successor image.

o  Where a file system is not writable but represents a read-only copy (possibly periodically updated) of a writable file system, clients have similar requirements with regard to the propagation of updates.  They may need a guarantee that any change visible on the original file system instance must be immediately visible on any replica before the client transitions access to that replica, in order to avoid any possibility that a client, in effecting a transition to a replica, will see any reversion in file system state.  The specific means by which this will be prevented varies based on the fs4_status_type reported as part of the fs_status attribute (see Section 11.11).  Since these file systems are presumed not to be suitable for simultaneous use, there is no specification of how locking is handled, and it generally will be the case that locks obtained on one file system will be separate from those on others.  Since these are going to be read-only file systems, this is not expected to pose an issue for clients or applications.

11.8.  Effecting File System Referrals

Referrals are effected when an absent file system is encountered and one or more alternate locations are made available by the fs_locations or fs_locations_info attributes.
The client will 11694 typically get an NFS4ERR_MOVED error, fetch the appropriate location 11695 information and proceed to access the file system on a different 11696 server, even though it retains its logical position within the 11697 original namespace. Referrals differ from migration events in that 11698 they happen only when the client has not previously referenced the 11699 file system in question (so there is nothing to transition). 11700 Referrals can only come into effect when an absent file system is 11701 encountered at its root. 11703 The examples given in the sections below are somewhat artificial in 11704 that an actual client will not typically do a multi-component lookup, 11705 but will have cached information regarding the upper levels of the 11706 name hierarchy. However, these example are chosen to make the 11707 required behavior clear and easy to put within the scope of a small 11708 number of requests, without getting unduly into details of how 11709 specific clients might choose to cache things. 11711 11.8.1. Referral Example (LOOKUP) 11713 Let us suppose that the following COMPOUND is sent in an environment 11714 in which /this/is/the/path is absent from the target server. This 11715 may be for a number of reasons. It may be the case that the file 11716 system has moved, or, it may be the case that the target server is 11717 functioning mainly, or solely, to refer clients to the servers on 11718 which various file systems are located. 11720 o PUTROOTFH 11722 o LOOKUP "this" 11724 o LOOKUP "is" 11726 o LOOKUP "the" 11728 o LOOKUP "path" 11730 o GETFH 11732 o GETATTR fsid,fileid,size,time_modify 11734 Under the given circumstances, the following will be the result. 11736 o PUTROOTFH --> NFS_OK. The current fh is now the root of the 11737 pseudo-fs. 11739 o LOOKUP "this" --> NFS_OK. The current fh is for /this and is 11740 within the pseudo-fs. 11742 o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 11743 within the pseudo-fs. 11745 o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 11746 is within the pseudo-fs. 11748 o LOOKUP "path" --> NFS_OK. The current fh is for /this/is/the/path 11749 and is within a new, absent file system, but ... the client will 11750 never see the value of that fh. 11752 o GETFH --> NFS4ERR_MOVED. Fails because current fh is in an absent 11753 file system at the start of the operation and the spec makes no 11754 exception for GETFH. 11756 o GETATTR fsid,fileid,size,time_modify. Not executed because the 11757 failure of the GETFH stops processing of the COMPOUND. 11759 Given the failure of the GETFH, the client has the job of determining 11760 the root of the absent file system and where to find that file 11761 system, i.e. the server and path relative to that server's root fh. 11762 Note here that in this example, the client did not obtain filehandles 11763 and attribute information (e.g. fsid) for the intermediate 11764 directories, so that it would not be sure where the absent file 11765 system starts. It could be the case, for example, that /this/is/the 11766 is the root of the moved file system and that the reason that the 11767 lookup of "path" succeeded is that the file system was not absent on 11768 that operation but was moved between the last LOOKUP and the GETFH 11769 (since COMPOUND is not atomic). Even if we had the fsids for all of 11770 the intermediate directories, we could have no way of knowing that 11771 /this/is/the/path was the root of a new file system, since we don't 11772 yet have its fsid. 
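
The logic the client applies, once it has the fsids, can be sketched in isolation.  The following C fragment is purely illustrative and not part of the protocol; lookup_and_get_fsid() is a hypothetical stand-in for the LOOKUP/GETATTR(fsid) exchanges shown in the walkthrough below.

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   struct fsid { uint64_t major, minor; };   /* mirrors the fsid attribute */

   /* Hypothetical helper: LOOKUP the component and GETATTR(fsid);
    * returns false if the fsid cannot be obtained (e.g. NFS4ERR_MOVED). */
   bool lookup_and_get_fsid(const char *component, struct fsid *out);

   static bool same_fsid(const struct fsid *a, const struct fsid *b)
   {
       return a->major == b->major && a->minor == b->minor;
   }

   /* Returns the index of the first component that lies in a different
    * file system than its parent, or ncomp if no boundary was found.   */
   size_t find_fs_boundary(const char **components, size_t ncomp,
                           struct fsid root_fsid)
   {
       struct fsid prev = root_fsid, cur;

       for (size_t i = 0; i < ncomp; i++) {
           if (!lookup_and_get_fsid(components[i], &cur))
               return i;               /* absent here: boundary at i */
           if (!same_fsid(&prev, &cur))
               return i;
           prev = cur;
       }
       return ncomp;
   }
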
In order to get the necessary information, let us re-send the chain of LOOKUPs with GETFHs and GETATTRs to at least get the fsids so we can be sure where the appropriate file system boundaries are.  The client could choose to get fs_locations_info at the same time, but in most cases the client will have a good guess as to where file system boundaries are (because of where NFS4ERR_MOVED was, and was not, received), making fetching of fs_locations_info unnecessary.

OP01:  PUTROOTFH --> NFS_OK

- Current fh is root of pseudo-fs.

OP02:  GETATTR(fsid) --> NFS_OK

- Just for completeness.  Normally, clients will know the fsid of the pseudo-fs as soon as they establish communication with a server.

OP03:  LOOKUP "this" --> NFS_OK

OP04:  GETATTR(fsid) --> NFS_OK

- Get current fsid to see where file system boundaries are.  The fsid will be that for the pseudo-fs in this example, so no boundary.

OP05:  GETFH --> NFS_OK

- Current fh is for /this and is within pseudo-fs.

OP06:  LOOKUP "is" --> NFS_OK

- Current fh is for /this/is and is within pseudo-fs.

OP07:  GETATTR(fsid) --> NFS_OK

- Get current fsid to see where file system boundaries are.  The fsid will be that for the pseudo-fs in this example, so no boundary.

OP08:  GETFH --> NFS_OK

- Current fh is for /this/is and is within pseudo-fs.

OP09:  LOOKUP "the" --> NFS_OK

- Current fh is for /this/is/the and is within pseudo-fs.

OP10:  GETATTR(fsid) --> NFS_OK

- Get current fsid to see where file system boundaries are.  The fsid will be that for the pseudo-fs in this example, so no boundary.

OP11:  GETFH --> NFS_OK

- Current fh is for /this/is/the and is within pseudo-fs.

OP12:  LOOKUP "path" --> NFS_OK

- Current fh is for /this/is/the/path and is within a new, absent file system, but ...

- The client will never see the value of that fh.

OP13:  GETATTR(fsid, fs_locations_info) --> NFS_OK

- We are getting the fsid to know where the file system boundaries are.  In this operation the fsid will be different from that of the parent directory (which in turn was retrieved in OP10).  Note that the fsid we are given will not necessarily be preserved at the new location.  That fsid might be different, and in fact the fsid we have for this file system might be a valid fsid of a different file system on that new server.

- In this particular case, we are pretty sure anyway that what has moved is /this/is/the/path rather than /this/is/the since we have the fsid of the latter and it is that of the pseudo-fs, which presumably cannot move.  However, in other examples, we might not have this kind of information to rely on (e.g. /this/is/the might be a non-pseudo file system separate from /this/is/the/path), so we need to have another reliable source of information on the boundary of the file system which is moved.  If, for example, the file system "/this/is" had moved, we would have a case of migration rather than referral, and once the boundaries of the migrated file system were clear we could fetch fs_locations_info.

- We are fetching fs_locations_info because the fact that we got an NFS4ERR_MOVED at this point means that it is most likely that this is a referral and we need the destination.
Even if it is the case 11864 that "/this/is/the" is a file system which has migrated, we will 11865 still need the location information for that file system. 11867 OP14: GETFH --> NFS4ERR_MOVED 11869 - Fails because current fh is in an absent file system at the start 11870 of the operation and the spec makes no exception for GETFH. Note 11871 that this means the server will never send the client a filehandle 11872 from within an absent file system. 11874 Given the above, the client knows where the root of the absent file 11875 system is (/this/is/the/path), by noting where the change of fsid 11876 occurred (between "the" and "path"). The fs_locations_info attribute 11877 also gives the client the actual location of the absent file system, 11878 so that the referral can proceed. The server gives the client the 11879 bare minimum of information about the absent file system so that 11880 there will be very little scope for problems of conflict between 11881 information sent by the referring server and information of the file 11882 system's home. No filehandles and very few attributes are present on 11883 the referring server and the client can treat those it receives as 11884 basically transient information with the function of enabling the 11885 referral. 11887 11.8.2. Referral Example (READDIR) 11889 Another context in which a client may encounter referrals is when it 11890 does a READDIR on directory in which some of the sub-directories are 11891 the roots of absent file systems. 11893 Suppose such a directory is read as follows: 11895 o PUTROOTFH 11897 o LOOKUP "this" 11899 o LOOKUP "is" 11901 o LOOKUP "the" 11903 o READDIR (fsid, size, time_modify, mounted_on_fileid) 11905 In this case, because rdattr_error is not requested, 11906 fs_locations_info is not requested, and some of attributes cannot be 11907 provided, the result will be an NFS4ERR_MOVED error on the READDIR, 11908 with the detailed results as follows: 11910 o PUTROOTFH --> NFS_OK. The current fh is at the root of the 11911 pseudo-fs. 11913 o LOOKUP "this" --> NFS_OK. The current fh is for /this and is 11914 within the pseudo-fs. 11916 o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 11917 within the pseudo-fs. 11919 o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 11920 is within the pseudo-fs. 11922 o READDIR (fsid, size, time_modify, mounted_on_fileid) --> 11923 NFS4ERR_MOVED. Note that the same error would have been returned 11924 if /this/is/the had migrated, when in fact it is because the 11925 directory contains the root of an absent file system. 11927 So now suppose that we re-send with rdattr_error: 11929 o PUTROOTFH 11931 o LOOKUP "this" 11933 o LOOKUP "is" 11935 o LOOKUP "the" 11937 o READDIR (rdattr_error, fsid, size, time_modify, mounted_on_fileid) 11939 The results will be: 11941 o PUTROOTFH --> NFS_OK. The current fh is at the root of the 11942 pseudo-fs. 11944 o LOOKUP "this" --> NFS_OK. The current fh is for /this and is 11945 within the pseudo-fs. 11947 o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 11948 within the pseudo-fs. 11950 o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 11951 is within the pseudo-fs. 11953 o READDIR (rdattr_error, fsid, size, time_modify, mounted_on_fileid) 11954 --> NFS_OK. The attributes for directory entry with the component 11955 named "path" will only contain rdattr_error with the value 11956 NFS4ERR_MOVED, together with an fsid value and a value for 11957 mounted_on_fileid. 
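
A client-side sketch (hypothetical structures and helper, not the XDR form) of how the entries returned above might be processed: an entry whose rdattr_error is NFS4ERR_MOVED marks the root of an absent file system and carries only fsid and mounted_on_fileid, so the client records it as a referral point rather than treating the missing attributes as an error.

   #include <stdbool.h>
   #include <stdint.h>

   #define NFS4ERR_MOVED 10019

   struct readdir_entry {            /* illustrative, not the XDR form */
       const char *name;
       int         rdattr_error;
       uint64_t    mounted_on_fileid;
       bool        have_size;        /* size/time_modify may be absent */
   };

   /* Hypothetical helper: note that "name" is the root of an absent
    * file system so that a later GETATTR(fs_locations_info) can be
    * issued for it (the entry's fsid would be recorded as well).      */
   void record_referral_point(const char *name, uint64_t mounted_on_fileid);

   void process_entry(const struct readdir_entry *e)
   {
       if (e->rdattr_error == NFS4ERR_MOVED) {
           record_referral_point(e->name, e->mounted_on_fileid);
           return;
       }
       /* Normal entry: all requested attributes are expected. */
   }
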
11959 So suppose we do another READDIR to get fs_locations_info (although 11960 we could have used a GETATTR directly, as in Section 11.8.1). 11962 o PUTROOTFH 11964 o LOOKUP "this" 11966 o LOOKUP "is" 11968 o LOOKUP "the" 11970 o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, 11971 size, time_modify) 11973 The results would be: 11975 o PUTROOTFH --> NFS_OK. The current fh is at the root of the 11976 pseudo-fs. 11978 o LOOKUP "this" --> NFS_OK. The current fh is for /this and is 11979 within the pseudo-fs. 11981 o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 11982 within the pseudo-fs. 11984 o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 11985 is within the pseudo-fs. 11987 o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, 11988 size, time_modify) --> NFS_OK. The attributes will be as shown 11989 below. 11991 The attributes for the directory entry with the component named 11992 "path" will only contain 11994 o rdattr_error (value: NFS_OK) 11996 o fs_locations_info 11998 o mounted_on_fileid (value: unique fileid within referring file 11999 system) 12001 o fsid (value: unique value within referring server) 12003 The attributes for entry "path" will not contain size or time_modify 12004 because these attributes are not available within an absent file 12005 system. 12007 11.9. The Attribute fs_locations 12009 The fs_locations attribute is structured in the following way: 12011 struct fs_location4 { 12012 utf8str_cis server<>; 12013 pathname4 rootpath; 12014 }; 12016 struct fs_locations4 { 12017 pathname4 fs_root; 12018 fs_location4 locations<>; 12020 }; 12022 The fs_location4 data type is used to represent the location of a 12023 file system by providing a server name and the path to the root of 12024 the file system within that server's namespace. When a set of 12025 servers have corresponding file systems at the same path within their 12026 namespaces, an array of server names may be provided. An entry in 12027 the server array is a UTF-8 string and represents one of a 12028 traditional DNS host name, IPv4 address, or IPv6 address, or a zero- 12029 length string. An IPv4 or IPv6 address is represented as a universal 12030 address (see Section 3.3.9 and [14]), minus the netid, and either 12031 with or without the trailing ".p1.p2" suffix that represents the port 12032 number. If the suffix is omitted, then the default port, 2049, 12033 SHOULD be assumed. A zero-length string SHOULD be used to indicate 12034 the current address being used for the RPC call. It is not a 12035 requirement that all servers that share the same rootpath be listed 12036 in one fs_location4 instance. The array of server names is provided 12037 for convenience. Servers that share the same rootpath may also be 12038 listed in separate fs_location4 entries in the fs_locations 12039 attribute. 12041 The fs_locations4 data type and fs_locations attribute contain an 12042 array of such locations. Since the namespace of each server may be 12043 constructed differently, the "fs_root" field is provided. The path 12044 represented by fs_root represents the location of the file system in 12045 the current server's namespace, i.e. that of the server from which 12046 the fs_locations attribute was obtained. The fs_root path is meant 12047 to aid the client by clearly referencing the root of the file system 12048 whose locations are being reported, no matter what object within the 12049 current file system the current filehandle designates. 
The fs_root is simply the pathname the client used to reach the object on the current server, the object being the one to which the fs_locations attribute applies.

When the fs_locations attribute is interrogated and there are no alternate file system locations, the server SHOULD return a zero-length array of fs_location4 structures, together with a valid fs_root.

As an example, suppose there is a replicated file system located at two servers (servA and servB).  At servA, the file system is located at path "/a/b/c".  At servB, the file system is located at path "/x/y/z".  If the client were to obtain the fs_locations value for the directory at "/a/b/c/d", it might not necessarily know that the file system's root is located in servA's namespace at "/a/b/c".  When the client switches to servB, it will need to determine that the directory it first referenced at servA is now represented by the path "/x/y/z/d" on servB.  To facilitate this, the fs_locations attribute provided by servA would have an fs_root value of "/a/b/c" and two entries in fs_locations.  One entry in fs_locations will be for itself (servA) and the other will be for servB with a path of "/x/y/z".  With this information, the client is able to substitute "/x/y/z" for the "/a/b/c" at the beginning of its access path and construct "/x/y/z/d" to use for the new server.

Note that: there is no requirement that the number of components in each rootpath be the same; there is no relation between the number of components in a rootpath and in fs_root; and none of the components in each rootpath and fs_root have to be the same.  In the above example, we could have had a third element in the locations array, with server equal to "servC" and rootpath equal to "/I/II", and a fourth element in locations with server equal to "servD" and rootpath equal to "/aleph/beth/gimel/daleth/he".

The relationship of fs_root to a rootpath is that the client replaces the pathname indicated in fs_root for the current server with the substitute indicated in rootpath for the new server.

As an example of a referred or migrated file system, suppose there is a file system located at serv1.  At serv1, the file system is located at "/az/buky/vedi/glagoli".  The client finds that the object at "glagoli" has migrated (or is a referral).  The client gets the fs_locations attribute, which contains an fs_root of "/az/buky/vedi/glagoli", and one element in the locations array, with server equal to "serv2" and rootpath equal to "/izhitsa/fita".  The client replaces "/az/buky/vedi/glagoli" with "/izhitsa/fita" and uses the latter pathname on "serv2".

Thus, the server MUST return an fs_root that is equal to the path the client used to reach the object the fs_locations attribute applies to.  Otherwise, the client cannot determine the new path to use on the new server.
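
The substitution just described can be illustrated with a short C sketch.  It operates on plain strings rather than pathname4 component arrays, and the function is hypothetical rather than part of any defined API.

   #include <stdio.h>
   #include <string.h>

   /* Writes rootpath + (path minus the fs_root prefix) into out;
    * returns 0 on success, -1 if path does not start with fs_root.   */
   int translate_path(const char *path, const char *fs_root,
                      const char *rootpath, char *out, size_t outlen)
   {
       size_t rlen = strlen(fs_root);

       if (strncmp(path, fs_root, rlen) != 0)
           return -1;
       if ((size_t)snprintf(out, outlen, "%s%s",
                            rootpath, path + rlen) >= outlen)
           return -1;                   /* truncated */
       return 0;
   }

   int main(void)
   {
       char newpath[256];

       /* The servA/servB example above: "/a/b/c/d" becomes "/x/y/z/d". */
       if (translate_path("/a/b/c/d", "/a/b/c", "/x/y/z",
                          newpath, sizeof(newpath)) == 0)
           printf("%s\n", newpath);     /* prints /x/y/z/d */
       return 0;
   }
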
Since the fs_locations attribute lacks information defining various attributes of the various file system choices presented, it SHOULD only be interrogated and used when fs_locations_info is not available.  When fs_locations is used, information about the specific locations should be assumed based on the following rules.

The following rules are general and apply irrespective of the context.

o  All listed file system instances should be considered as of the same _handle_ class if and only if the current fh_expire_type attribute does not include the FH4_VOL_MIGRATION bit.  Note that in the case of referral, filehandle issues do not apply since there can be no filehandles known within the current file system, nor is there any access to the fh_expire_type attribute on the referring (absent) file system.

o  All listed file system instances should be considered as of the same _fileid_ class if and only if the fh_expire_type attribute indicates persistent filehandles and does not include the FH4_VOL_MIGRATION bit.  Note that in the case of referral, fileid issues do not apply since there can be no fileids known within the referring (absent) file system, nor is there any access to the fh_expire_type attribute.

o  All listed file system instances should be considered as of different _change_ classes.

For other class assignments, handling of file system transitions depends on the reasons for the transition:

o  When the transition is due to migration, that is, the client was directed to a new file system after receiving an NFS4ERR_MOVED error, the target should be treated as being of the same _write-verifier_ class as the source.

o  When the transition is due to failover to another replica, that is, the client selected another replica without receiving an NFS4ERR_MOVED error, the target should be treated as being of a different _write-verifier_ class from the source.

The specific choices reflect typical implementation patterns for failover and controlled migration, respectively.  Since other choices are possible and useful, this information is better obtained by using fs_locations_info.  When a server implementation needs to communicate other choices, it MUST support the fs_locations_info attribute.

See Section 21 for a discussion on the recommendations for the security flavor to be used by any GETATTR operation that requests the "fs_locations" attribute.

11.10.  The Attribute fs_locations_info

The fs_locations_info attribute is intended as a more functional replacement for fs_locations, which will continue to exist and be supported.  Clients can use it to get a more complete set of information about alternative file system locations.  When the server does not support fs_locations_info, fs_locations can be used to get a subset of the information.  A server that supports fs_locations_info MUST support fs_locations as well.

There is additional information present in fs_locations_info that is not available in fs_locations:

o  Attribute continuity information to allow a client to select a location that meets the transparency requirements of the applications accessing the data and to take advantage of optimizations that server guarantees as to attribute continuity may provide (e.g. the change attribute).

o  File system identity information which indicates when multiple replicas, from the client's point of view, correspond to the same target file system, allowing them to be used interchangeably, without disruption, as multiple paths to the same thing.

o  Information which will bear on the suitability of various replicas, depending on the use that the client intends.
For example, many applications need an absolutely up-to-date copy (e.g. those that write), while others may only need access to the most up-to-date copy reasonably available.

o  Server-derived preference information for replicas, which can be used to implement load-balancing while giving the client the entire file system list to be used in case the primary fails.

The fs_locations_info attribute is structured similarly to the fs_locations attribute.  A top-level structure (fs_locations_info4) contains the entire attribute, including the root pathname of the file system and an array of lower-level structures that define replicas that share a common root path on their respective servers.  The lower-level structure in turn (fs_locations_item4) contains a specific pathname and information on one or more individual server replicas.  At that lowest level, fs_locations_info has an fs_locations_server4 structure that contains per-server-replica information in addition to the server name.  This per-server-replica information includes a nominally opaque array, fls_info, in which specific pieces of information are located at the specific indices listed below.

The attribute will always contain at least a single fs_locations_server4 entry.  Typically, this will be an entry with the FSLI4GF_CUR_REQ flag set, although in the case of a referral there will be no entry with that flag set.

It should be noted that fs_locations_info attributes returned by servers for various replicas may differ for various reasons.  One server may know about a set of replicas that are not known to other servers.  Further, compatibility attributes may differ.  Filehandles might be of the same class going from replica A to replica B but not going in the reverse direction.  This might happen because the filehandles are the same but replica B's server implementation might not have provision to note and report that equivalence.

The fs_locations_info attribute consists of a root pathname (fli_fs_root, just like fs_root in the fs_locations attribute), together with an array of fs_locations_item4 structures.  The fs_locations_item4 structures in turn consist of a root pathname (fli_rootpath) together with an array (fli_entries) of elements of data type fs_locations_server4, all defined as follows.

   /*
    * Defines an individual server replica
    */
   struct fs_locations_server4 {
           int32_t         fls_currency;
           opaque          fls_info<>;
           utf8str_cis     fls_server;
   };

   /*
    * Byte indices of items within
    * fls_info: flag fields, class numbers,
    * bytes indicating ranks and orders.
    */
   const FSLI4BX_GFLAGS            = 0;
   const FSLI4BX_TFLAGS            = 1;

   const FSLI4BX_CLSIMUL           = 2;
   const FSLI4BX_CLHANDLE          = 3;
   const FSLI4BX_CLFILEID          = 4;
   const FSLI4BX_CLWRITEVER        = 5;
   const FSLI4BX_CLCHANGE          = 6;
   const FSLI4BX_CLREADDIR         = 7;

   const FSLI4BX_READRANK          = 8;
   const FSLI4BX_WRITERANK         = 9;
   const FSLI4BX_READORDER         = 10;
   const FSLI4BX_WRITEORDER        = 11;

   /*
    * Bits defined within the general flag byte.
    */
   const FSLI4GF_WRITABLE          = 0x01;
   const FSLI4GF_CUR_REQ           = 0x02;
   const FSLI4GF_ABSENT            = 0x04;
   const FSLI4GF_GOING             = 0x08;
   const FSLI4GF_SPLIT             = 0x10;

   /*
    * Bits defined within the transport flag byte.
    */
   const FSLI4TF_RDMA              = 0x01;

   /*
    * Defines a set of replicas sharing
    * a common value of the root path
    * within the corresponding
    * single-server namespaces.
    */
   struct fs_locations_item4 {
           fs_locations_server4    fli_entries<>;
           pathname4               fli_rootpath;
   };

   /*
    * Defines the overall structure of
    * the fs_locations_info attribute.
    */
   struct fs_locations_info4 {
           uint32_t                fli_flags;
           int32_t                 fli_valid_for;
           pathname4               fli_fs_root;
           fs_locations_item4      fli_items<>;
   };

   /*
    * Flag bits in fli_flags.
    */
   const FSLI4IF_VAR_SUB = 0x00000001;

   typedef fs_locations_info4 fattr4_fs_locations_info;

As noted above, the fs_locations_info attribute, when supported, may be requested of absent file systems without causing NFS4ERR_MOVED to be returned, and it is generally expected that it will be available for both present and absent file systems even if only a single fs_locations_server4 entry is present, designating the current (present) file system, or two fs_locations_server4 entries designating the previous location of an absent file system (the one just referenced) and its successor location.  Servers are strongly urged to support this attribute on all file systems if they support it on any file system.

The data presented in the fs_locations_info attribute may be obtained by the server in any number of ways, including specification by the administrator or by current protocols for transferring data among replicas and protocols not yet developed.  NFSv4.1 only defines how this information is presented by the server to the client.

11.10.1.  The fs_locations_server4 Structure

The fs_locations_server4 structure consists of the following items:

o  An indication of file system up-to-date-ness (fls_currency) in terms of approximate seconds before the present.  This value is relative to the master copy.  A negative value indicates that the server is unable to give any reasonably useful value here.  A zero indicates that the file system is the actual writable data or a reliably coherent and fully up-to-date copy.  Positive values indicate how out-of-date this copy can normally be before it is considered for update.  Such a value is not a guarantee that such updates will always be performed on the required schedule but instead serves as a hint about how far the copy of the data would be expected to be behind the most up-to-date copy.

o  A counted array of one-byte values (fls_info) containing information about the particular file system instance.  This data includes general flags, transport capability flags, file system equivalence class information, and selection priority information.  The encoding will be discussed below.

o  The server string (fls_server).  For the case of the replica currently being accessed (via GETATTR), a zero-length string MAY be used to indicate the current address being used for the RPC call.
The fls_server field can also be an IPv4 or IPv6 address, formatted the same way as an IPv4 or IPv6 address in the "server" field of the fs_location4 data type (see Section 11.9).

Data within the fls_info array is in the form of 8-bit data items with constants giving the offsets within the array of various values describing this particular file system instance.  This style of definition was chosen, in preference to explicit XDR structure definitions for these values, for a number of reasons.

o  The kinds of data in the fls_info array, representing flags, file system classes, and priorities among a set of file systems representing the same data, are such that eight bits provides a quite acceptable range of values.  Even where there might be more than 256 such file system instances, having more than 256 distinct classes or priorities is unlikely.

o  Explicit definition of the various specific data items within XDR would limit expandability in that any extension within a subsequent minor version would require yet another attribute, leading to specification and implementation clumsiness.

o  Such explicit definitions would also make it impossible to propose standards-track extensions apart from a full minor version.

This encoding scheme can be adapted to the specification of multi-byte numeric values, even though none are currently defined.  If extensions are made via standards-track RFCs, multi-byte quantities will be encoded as a range of bytes with a range of indices, with the bytes interpreted in big-endian byte order.  Further, any such index assignments are constrained so that the relevant quantities will not cross XDR word boundaries.

The set of fls_info data is subject to expansion in a future minor version, or in a standards-track RFC, within the context of a single minor version.  The server SHOULD NOT send and the client MUST NOT use indices within the fls_info array that are not defined in standards-track RFCs.

The fls_info array contains within it (an illustrative sketch of reading these bytes follows this list):

o  Two 8-bit flag fields, one devoted to general file-system characteristics and a second reserved for transport-related capabilities.

o  Six 8-bit class values which define various file system equivalence classes as explained below.

o  Four 8-bit priority values which govern file system selection as explained below.
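
The following C fragment sketches how a client might pull individual values out of a decoded fls_info array using the byte indices and flag bits defined earlier.  The fls_info structure shown, and the treatment of indices beyond the transmitted length as carrying no information, are illustrative assumptions rather than protocol requirements.

   #include <stdbool.h>
   #include <stdint.h>

   /* Byte indices and general-flag bits from the fs_locations_info XDR. */
   #define FSLI4BX_GFLAGS     0
   #define FSLI4BX_CLWRITEVER 5
   #define FSLI4GF_WRITABLE   0x01

   struct fls_info {            /* illustrative decoded form of fls_info<> */
       const uint8_t *bytes;
       uint32_t       len;
   };

   /* Bytes not transmitted are treated here as zero (no information). */
   static uint8_t fls_byte(const struct fls_info *fi, uint32_t idx)
   {
       return idx < fi->len ? fi->bytes[idx] : 0;
   }

   static bool replica_is_writable(const struct fls_info *fi)
   {
       return (fls_byte(fi, FSLI4BX_GFLAGS) & FSLI4GF_WRITABLE) != 0;
   }

   static uint8_t replica_write_verifier_class(const struct fls_info *fi)
   {
       return fls_byte(fi, FSLI4BX_CLWRITEVER);
   }
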
The general file system characteristics flag (at byte index FSLI4BX_GFLAGS) has the following bits defined within it:

o  FSLI4GF_WRITABLE indicates that this file system target is writable, allowing it to be selected by clients which may need to write on this file system.  When the current file system instance is writable and is defined as of the same simultaneous use class (as specified by the value at index FSLI4BX_CLSIMUL) to which the client was previously writing, then it must incorporate within its data any committed write made on the source file system instance.  See Section 11.7.8, which discusses the write-verifier class.  While there is no harm in not setting this flag for a file system that turns out to be writable, turning the flag on for a read-only file system can cause problems for clients which select a migration or replication target based on it and then find themselves unable to write.

o  FSLI4GF_CUR_REQ indicates that this replica is the one on which the request is being made.  Only a single server entry may have this flag set and, in the case of a referral, no entry will have it.

o  FSLI4GF_ABSENT indicates that this entry corresponds to an absent file system replica.  It can only be set if FSLI4GF_CUR_REQ is set.  When both such bits are set, it indicates that a file system instance is not usable but that the information in the entry can be used to determine the sorts of continuity available when switching from this replica to other possible replicas.  Since this bit can only be true if FSLI4GF_CUR_REQ is true, the value could be determined using the fs_status attribute, but the information is also made available here for the convenience of the client.  An entry with this bit, since it represents a true file system (albeit absent), does not appear in the event of a referral, but only where a file system has been accessed at this location and has subsequently been migrated.

o  FSLI4GF_GOING indicates that a replica, while still available, should not be used further.  The client, if using it, should make an orderly transfer to another file system instance as expeditiously as possible.  It is expected that file systems going out of service will be announced as FSLI4GF_GOING some time before the actual loss of service and that the valid_for value will be sufficiently small to allow clients to detect and act on scheduled events, while large enough that the cost of the requests to fetch the fs_locations_info values will not be excessive.  Values on the order of ten minutes seem reasonable.

When this flag is seen as part of a transition into a new file system, a client might choose to transfer immediately to another replica, or it may reference the current file system and only transition when a migration event occurs.  Similarly, when this flag appears in a replica in a referral, clients would likely avoid being referred to this instance whenever there is another choice.

o  FSLI4GF_SPLIT indicates that when a transition occurs from the current file system instance to this one, the replacement may consist of multiple file systems.  In this case, the client has to be prepared for the possibility that objects on the same file system before migration will be on different ones after.  Note that FSLI4GF_SPLIT is not incompatible with the file systems belonging to the same _fileid_ class since, if one has a set of fileids that are unique within a file system, each subset assigned to a smaller file system after migration would not have any conflicts internal to that file system.

A client, in the case of a split file system, will interrogate existing files with which it has continuing connection (it is free simply to forget cached filehandles).  If the client remembers the directory filehandle associated with each open file, it may proceed upward using LOOKUPP to find the new file system boundaries.  Note that in the event of a referral, there will not be any such files, and so these actions will not be performed.  Instead, a reference to a portion of the original file system now split off into other file systems will encounter an fsid change and possibly a further referral.

Once the client recognizes that one file system has been split into two, it can prevent the disruption of running applications by presenting the two file systems as a single one until a convenient point to recognize the transition, such as a restart.  This would require a mapping from the server's fsids to fsids as seen by the client, but this is already necessary for other reasons.  As noted above, existing fileids within the two descendant file systems will not conflict.  Providing non-conflicting fileids for newly-created files on the split file systems is the responsibility of the server (or servers working in concert).  The server can encode filehandles such that filehandles generated before the split event can be discerned from those generated after the split, allowing the server to determine when the need for emulating two file systems as one is over.

Although it is possible for this flag to be present in the event of referral, it would generally be of little interest to the client, since the client is not expected to have information regarding the current contents of the absent file system.

The transport-flag field (at byte index FSLI4BX_TFLAGS) contains the following bits related to the transport capabilities of the specific file system.

o  FSLI4TF_RDMA indicates that this file system provides NFSv4.1 file system access using an RDMA-capable transport.

Attribute continuity and file system identity information are expressed by defining equivalence relations on the sets of file systems presented to the client.  Each such relation is expressed as a set of file system equivalence classes.  For each relation, a file system has an 8-bit class number.  Two file systems belong to the same class if both have identical non-zero class numbers.  Zero is treated as non-matching.  Most often, the relevant question for the client will be whether a given replica is identical-to/continuous-with the current one in a given respect, but the information should be available also as to whether two other replicas match in that respect as well.
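
That matching rule amounts to the following small comparison, shown here as an illustrative C helper (the function name is not part of the protocol):

   #include <stdbool.h>
   #include <stdint.h>

   /* Two instances are in the same class for a given relation only
    * when both report identical non-zero class numbers; zero never
    * matches anything, including another zero.                       */
   static bool same_class(uint8_t class_a, uint8_t class_b)
   {
       return class_a != 0 && class_a == class_b;
   }
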
The following fields specify the file system's class numbers for the equivalence relations used in determining the nature of file system transitions.  See Section 11.7 for details about how this information is to be used.  Servers may assign these values as they wish, so long as file system instances that share the same value have the specified relationship to one another and, conversely, file systems that have the specified relationship to one another share a common class value.  As each instance entry is added, the relationships of this instance to previously entered instances can be consulted, and if one is found that bears the specified relationship, that entry's class value can be copied to the new entry.  When no such previous entry exists, a new value for that byte index, not previously used, can be selected, most likely by incrementing the last class value assigned for that index.

o  The field with byte index FSLI4BX_CLSIMUL defines the simultaneous-use class for the file system.

o  The field with byte index FSLI4BX_CLHANDLE defines the handle class for the file system.

o  The field with byte index FSLI4BX_CLFILEID defines the fileid class for the file system.

o  The field with byte index FSLI4BX_CLWRITEVER defines the write-verifier class for the file system.

o  The field with byte index FSLI4BX_CLCHANGE defines the change class for the file system.

o  The field with byte index FSLI4BX_CLREADDIR defines the readdir class for the file system.

Server-specified preference information is also provided via 8-bit values within the fls_info array.  The values provide a rank and an order (see below) to be used, with separate values specifiable for the cases of read-only and writable file systems.  These values are compared for different file systems to establish the server-specified preference, with lower values indicating "more preferred".

Rank is used to express a strict server-imposed ordering on clients, with lower values indicating "more preferred".  Clients should attempt to use all replicas with a given rank before they use one with a higher rank.  Only if all of those file systems are unavailable should the client proceed to those of a higher rank.  Because specifying a rank will override client preferences, servers should be conservative about using this mechanism, particularly when the environment is one in which client communication characteristics are not tightly controlled and visible to the server.

Within a rank, the order value is used to specify the server's preference to guide the client's selection when the client's own preferences are not controlling, with lower values of order indicating "more preferred".  If replicas are approximately equal in all respects, clients should defer to the order specified by the server.  When clients look at server latency as part of their selection, they are free to use this criterion, but it is suggested that when latency differences are not significant, the server-specified order should guide selection.

o  The field at byte index FSLI4BX_READRANK gives the rank value to be used for read-only access.

o  The field at byte index FSLI4BX_READORDER gives the order value to be used for read-only access.

o  The field at byte index FSLI4BX_WRITERANK gives the rank value to be used for writable access.

o  The field at byte index FSLI4BX_WRITEORDER gives the order value to be used for writable access.

Depending on the potential need for write access by a given client, one of the pairs of rank and order values is used.  The read rank and order should only be used if the client knows that only reading will ever be done or if it is prepared to switch to a different replica in the event that any write access capability is required in the future.
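
An illustrative comparison function (the replica_prefs structure is hypothetical) implementing the server-specified preference described above: lower rank always wins, and within equal ranks lower order is preferred.  A client would fill in the read or write pair depending on whether it may need write access.

   #include <stdint.h>
   #include <stdlib.h>

   struct replica_prefs {
       uint8_t rank;      /* FSLI4BX_READRANK or FSLI4BX_WRITERANK   */
       uint8_t order;     /* FSLI4BX_READORDER or FSLI4BX_WRITEORDER */
   };

   /* qsort(3)-style comparator: "more preferred" sorts first. */
   static int compare_replicas(const void *pa, const void *pb)
   {
       const struct replica_prefs *a = pa, *b = pb;

       if (a->rank != b->rank)
           return (int)a->rank - (int)b->rank;
       return (int)a->order - (int)b->order;
   }

   static void sort_by_preference(struct replica_prefs *r, size_t n)
   {
       qsort(r, n, sizeof(*r), compare_replicas);
   }
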
12592 o The fli_fs_root field which contains the pathname of the root of 12593 the current file system on the current server, just as it does in 12594 the fs_locations4 structure. 12596 o An array called fli_items of fs_locations4_item structures, which 12597 contain information about replicas of the current file system. 12598 Where the current file system is actually present, or has been 12599 present, i.e. this is not a referral situation, one of the 12600 fs_locations_item4 structures will contain an fs_locations_server4 12601 for the current server. This structure will have FSLI4GF_ABSENT 12602 set if the current file system is absent, i.e. normal access to it 12603 will return NFS4ERR_MOVED. 12605 o The fli_valid_for field specifies a time in seconds for which it 12606 is reasonable for a client to use the fs_locations_info attribute 12607 without refetch. The fli_valid_for value does not provide a 12608 guarantee of validity since servers can unexpectedly go out of 12609 service or become inaccessible for any number of reasons. Clients 12610 are well-advised to refetch this information for actively accessed 12611 file system at every fli_valid_for seconds. This is particularly 12612 important when file system replicas may go out of service in a 12613 controlled way using the FSLI4GF_GOING flag to communicate an 12614 ongoing change. The server should set fli_valid_for to a value 12615 which allows well-behaved clients to notice the FSLI4GF_GOING flag 12616 and make an orderly switch before the loss of service becomes 12617 effective. If this value is zero, then no refetch interval is 12618 appropriate and the client need not refetch this data on any 12619 particular schedule. In the event of a transition to a new file 12620 system instance, a new value of the fs_locations_info attribute 12621 will be fetched at the destination and it is to be expected that 12622 this may have a different valid_for value, which the client should 12623 then use, in the same fashion as the previous value. 12625 The FSLI4IF_VAR_SUB flag within fli_flags controls whether variable 12626 substitution is to be enabled. See Section 11.10.3 for an 12627 explanation of variable substitution. 12629 11.10.3. The fs_locations_item4 Structure 12631 The fs_locations_item4 structure contains a pathname (in the field 12632 fli_rootpath) which encodes the path of the target file system 12633 replicas on the set of servers designated by the included 12634 fs_locations_server4 entries. The precise manner in which this 12635 target location is specified depends on the value of the 12636 FSLI4IF_VAR_SUB flag within the associated fs_locations_info4 12637 structure. 12639 If this flag is not set, then fli_rootpath simply designates the 12640 location of the target file system within each server's single-server 12641 namespace just as it does for the rootpath within the fs_location4 12642 structure. When this bit is set, however, component entries of a 12643 certain form are subject to client-specific variable substitution so 12644 as to allow a degree of namespace non-uniformity in order to 12645 accommodate the selection of client-specific file system targets to 12646 adapt to different client architectures or other characteristics. 12648 When such substitution is in effect a variable beginning with the 12649 string "${" and ending with the string "}" and containing a colon is 12650 to be replaced by the client-specific value associated with that 12651 variable. 
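   The following fragment sketches, in C, one way a client might apply
   this substitution to a single pathname component.  It is purely
   illustrative: the lookup_client_variable() helper and the fixed-size
   buffers are assumptions of the sketch, not elements of the protocol.

      /* Illustrative sketch only: substitute one pathname component of
       * the form "${domain:name}" with the client-specific value.  The
       * function lookup_client_variable() stands in for a client-local
       * table lookup and returns NULL when no value is configured. */
      #include <stdio.h>
      #include <string.h>

      const char *lookup_client_variable(const char *domain,
                                         const char *name);

      void substitute_component(const char *comp, char *out, size_t outlen)
      {
          size_t len = strlen(comp);
          const char *colon = strchr(comp, ':');

          /* Only components of the form "${...:...}" are substituted. */
          if (len > 3 && strncmp(comp, "${", 2) == 0 &&
              comp[len - 1] == '}' && colon != NULL) {
              char domain[256], name[256];

              snprintf(domain, sizeof(domain), "%.*s",
                       (int)(colon - comp - 2), comp + 2);
              snprintf(name, sizeof(name), "%.*s",
                       (int)((comp + len - 1) - (colon + 1)), colon + 1);

              const char *val = lookup_client_variable(domain, name);
              /* "unknown" is the fallback described in the text that
               * follows, used when the client has no value to supply. */
              snprintf(out, outlen, "%s", val != NULL ? val : "unknown");
          } else {
              snprintf(out, outlen, "%s", comp);
          }
      }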
The string "unknown" should be used by the client when it 12652 has no value for such a variable. The pathname resulting from such 12653 substitutions is used to designate the target file system, so that 12654 different clients may have different file systems, corresponding to 12655 that location in the multi-server namespace. 12657 As mentioned above, such substituted pathname variables contain a 12658 colon. The part before the colon is to be a DNS domain name with the 12659 part after being a case-insensitive alphanumeric string. 12661 Where the domain is "ietf.org", only variable names defined in this 12662 document or subsequent standards-track RFC's are subject to such 12663 substitution. Organizations are free to use their domain names to 12664 create their own sets of client-specific variables, to be subject to 12665 such substitution. In case where such variables are intended to be 12666 used more broadly than a single organization, publication of an 12667 informational RFC defining such variables is RECOMMENDED. 12669 The variable ${ietf.org:CPU_ARCH} is used to denote the CPU 12670 architecture object files are compiled. This specification does not 12671 limit the acceptable values (except that they must be valid UTF-8 12672 strings) but such values as "x86", "x86_64" and "sparc" would be 12673 expected to be used in line with industry practice. 12675 The variable ${ietf.org:OS_TYPE} is used to denote the operating 12676 system and thus the kernel and library API's for which code might be 12677 compiled. This specification does not limit the acceptable values 12678 (except that they must be valid UTF-8 strings) but such values as 12679 "linux" and "freebsd" would be expected to be used in line with 12680 industry practice. 12682 The variable ${ietf.org:OS_VERSION} is used to denote the operating 12683 system version and thus the specific details of versioned interfaces 12684 for which code might be compiled. This specification does not limit 12685 the acceptable values (except that they must be valid UTF-8 strings). 12686 However, combinations of numbers and letters with interspersed dots 12687 would be expected to be used in line with industry practice, with the 12688 details of the version format depending on the specific value of the 12689 variable ${ietf.org:OS_TYPE} with which it is used. 12691 Use of these variable could result in direction of different clients 12692 to different file systems on the same server, as appropriate to 12693 particular clients. In cases in which the target file systems are 12694 located on different servers, a single server could serve as a 12695 referral point so that each valid combination of variable values 12696 would designate a referral hosted on a single server, with the 12697 targets of those referrals on a number of different servers. 12699 Because namespace administration is affected by the values selected 12700 to substitute for various variables, clients should provide 12701 convenient means of determining what variable substitutions a client 12702 will implement, as well as, where appropriate, providing means to 12703 control the substitutions to be used. The exact means by which this 12704 will be done is outside the scope of this specification. 12706 Although variable substitution is most suitable for use in the 12707 context of referrals, if may be used in the context of replication 12708 and migration. 
If it is used in these contexts, the server must 12709 ensure that no matter what values the client presents for the 12710 substituted variables, the result is always a valid successor file 12711 system instance to that from which a transition is occurring, i.e. 12712 that the data is identical or represents a later image of a writable 12713 file system. 12715 Note that when fli_rootpath is a null pathname (that is, one with 12716 zero components), the file system designated is at the root of the 12717 specified server, whether the FSLI4IF_VAR_SUB flag within the 12718 associated fs_locations_info4 structure is set or not. 12720 11.11. The Attribute fs_status 12722 In an environment in which multiple copies of the same basic set of 12723 data are available, information regarding the particular source of 12724 such data and the relationships among different copies can be very 12725 helpful in providing consistent data to applications. 12727 enum fs4_status_type { 12728 STATUS4_FIXED = 1, 12729 STATUS4_UPDATED = 2, 12730 STATUS4_VERSIONED = 3, 12731 STATUS4_WRITABLE = 4, 12732 STATUS4_REFERRAL = 5 12733 }; 12735 struct fs4_status { 12736 bool fss_absent; 12737 fs4_status_type fss_type; 12738 utf8str_cs fss_source; 12739 utf8str_cs fss_current; 12740 int32_t fss_age; 12741 nfstime4 fss_version; 12742 }; 12744 The boolean fss_absent indicates whether the file system is currently 12745 absent. This value will be set if the file system was previously 12746 present and becomes absent, or if the file system has never been 12747 present and the type is STATUS4_REFERRAL. When this boolean is set 12748 and the type is not STATUS4_REFERRAL, the remaining information in 12749 the fs4_status reflects that last valid when the file system was 12750 present. 12752 The fss_type field indicates the kind of file system image 12753 represented. This is of particular importance when using the version 12754 values to determine appropriate succession of file system images. 12755 When fss_absent is set, and the file system was previously present, 12756 the value of fss_type reflected is that when the file was last 12757 present. Five values are distinguished: 12759 o STATUS4_FIXED which indicates a read-only image in the sense that 12760 it will never change. The possibility is allowed that, as a 12761 result of migration or switch to a different image, changed data 12762 can be accessed, but within the confines of this instance, no 12763 change is allowed. The client can use this fact to cache 12764 aggressively. 12766 o STATUS4_VERSIONED which indicates that the image, like the 12767 STATUS4_UPDATED case, is updated externally, but it provides a 12768 guarantee that the server will carefully update an associated 12769 version value so that the client can protect itself from a 12770 situation in which it reads data from one version of the file 12771 system, and then later reads data from an earlier version of the 12772 same file system. See below for a discussion of how this can be 12773 done. 12775 o STATUS4_UPDATED which indicates an image that cannot be updated by 12776 the user writing to it but may be changed externally, typically 12777 because it is a periodically updated copy of another writable file 12778 system somewhere else. In this case, version information is not 12779 provided and the client does not have the responsibility of making 12780 sure that this version only advances upon a file system instance 12781 transition. 
In this case, it is the responsibility of the server 12782 to make sure that the data presented after a file system instance 12783 transition is a proper successor image and includes all changes 12784 seen by the client and any change made before all such changes. 12786 o STATUS4_WRITABLE which indicates that the file system is an actual 12787 writable one. The client need not, of course, actually write to 12788 the file system, but once it does, it should not accept a 12789 transition to anything other than a writable instance of that same 12790 file system. 12792 o STATUS4_REFERRAL which indicates that the file system in question 12793 is absent and has never been present on this server. 12795 Note that in the STATUS4_UPDATED and STATUS4_VERSIONED cases, the 12796 server is responsible for the appropriate handling of locks that are 12797 inconsistent with external changes to delegations. If a server gives 12798 out delegations, they SHOULD be recalled before an inconsistent 12799 change is made to the data, and MUST be revoked if this is not possible. 12800 Similarly, if an open is inconsistent with data that is changed (the 12801 open denies WRITE and the data is changed), that lock SHOULD be 12802 considered administratively revoked. 12804 The opaque strings fss_source and fss_current provide a way of 12805 presenting information about the source of the file system image 12806 currently present. It is not intended that the client do anything with this 12807 information other than make it available to administrative tools. It 12808 is intended that this information be helpful when researching 12809 possible problems with a file system image that might arise when it 12810 is unclear if the correct image is being accessed and if not, how 12811 that image came to be made. This kind of diagnostic information will 12812 be helpful, if, as seems likely, copies of file systems are made in 12813 many different ways (e.g. simple user-level copies, file system-level 12814 point-in-time copies, clones of the underlying storage), under a 12815 variety of administrative arrangements. In such environments, 12816 determining how a given set of data was constructed can be very 12817 helpful in resolving problems. 12819 The opaque string fss_source is used to indicate the source of a 12820 given file system with the expectation that tools capable of creating 12821 a file system image propagate this information, when that is 12822 possible. It is understood that this may not always be possible 12823 since a user-level copy may be thought of as creating a new data set 12824 and the tools used may have no mechanism to propagate this data. 12825 When a file system is initially created, it is desirable to associate 12826 with it data regarding how the file system was created, where it was 12827 created, by whom, etc. Making this information available in this 12828 attribute in a human-readable string form will be helpful for 12829 applications and system administrators and also serves to make it 12830 available when the original file system is used to make subsequent 12831 copies. 12833 The opaque string fss_current should provide whatever information is 12834 available about the source of the current copy. Such information might include 12835 the tool creating it, any relevant parameters to that tool, the time 12836 at which the copy was done, the user making the change, the server on 12837 which the change was made, etc. All such information should be in a 12838 human-readable string form.
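   Returning to the fss_type values above, the following fragment
   sketches one check a client might apply when deciding whether a
   replica's fs_status makes it an acceptable successor.  It is an
   illustrative sketch only: the C rendering of fs4_status, the
   has_written flag, and the function itself are assumptions; only the
   rules (reject absent or referral entries, and accept only a writable
   instance once the client has written) come from the descriptions
   above.

      /* Illustrative sketch: is this replica's fs_status acceptable as
       * a successor, given whether this client has written to the file
       * system?  has_written is client-local bookkeeping. */
      #include <stdbool.h>

      bool acceptable_successor(const fs4_status *candidate,
                                bool has_written)
      {
          /* An absent file system or a pure referral cannot serve as a
           * successor image. */
          if (candidate->fss_absent ||
              candidate->fss_type == STATUS4_REFERRAL)
              return false;

          /* Once the client has written, only a writable instance of
           * the same file system should be accepted. */
          if (has_written)
              return candidate->fss_type == STATUS4_WRITABLE;

          /* For read-only use, FIXED, VERSIONED, UPDATED, and WRITABLE
           * images are all candidates; version ordering for the
           * STATUS4_VERSIONED case is discussed below. */
          return true;
      }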
12840 The field fss_age provides an indication of how out-of-date the file 12841 system currently is with respect to its ultimate data source (in case 12842 of cascading data updates). This complements the fls_currency field 12843 of fs_locations_server4 (see Section 11.10) in the following way: the 12844 information in fls_currency gives a bound for how out of date the 12845 data in a file system might typically get, while the value in fss_age 12846 gives a bound on how out of date that data actually is. Negative 12847 values imply that no information is available. A zero means that 12848 this data is known to be current. A positive value means that this 12849 data is known to be no older than that number of seconds with respect 12850 to the ultimate data source. Using this value, the client may be 12851 able to decide that a data copy is too old, so that it may search for 12852 a newer version to use. 12854 The fss_version field provides a version identification, in the form 12855 of a time value, such that successive versions always have later time 12856 values. When the fs_type is anything other than STATUS4_VERSIONED, 12857 the server may provide such a value but there is no guarantee as to 12858 its validity and clients will not use it except to provide additional 12859 information to add to fss_source and fss_current. 12861 When fss_type is STATUS4_VERSIONED, servers SHOULD provide a value of 12862 version which progresses monotonically whenever any new version of 12863 the data is established. This allows the client, if reliable image 12864 progression is important to it, to fetch this attribute as part of 12865 each COMPOUND where data or metadata from the file system is used. 12867 When it is important to the client to make sure that only valid 12868 successor images are accepted, it must make sure that it does not 12869 read data or metadata from the file system without updating its sense 12870 of the current state of the image, to avoid the possibility that the 12871 fs_status which the client holds will be one for an earlier image, 12872 and so accept a new file system instance which is later than that but 12873 still earlier than updated data read by the client. 12875 In order to do this reliably, it must do a GETATTR of the fs_status 12876 attribute that follows any interrogation of data or metadata within 12877 the file system in question. Often this is most conveniently done by 12878 appending such a GETATTR after all other operations that reference a 12879 given file system. When errors occur between reading file system 12880 data and performing such a GETATTR, care must be exercised to make 12881 sure that the data in question is not used before obtaining the 12882 proper fs_status value. In this connection, when an OPEN is done 12883 within such a versioned file system and the associated GETATTR of 12884 fs_status is not successfully completed, the open file in question 12885 must not be accessed until that fs_status is fetched. 12887 The procedure above will ensure that before using any data from the 12888 file system the client has in hand a newly-fetched current version of 12889 the file system image. Multiple values for multiple requests in 12890 flight can be resolved by assembling them into the required partial 12891 order (and the elements should form a total order within it) and 12892 using the last. 
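   A minimal sketch of this bookkeeping follows, under the assumption
   that the XDR types are rendered directly in C and that the client
   keeps a per-file-system record (struct fs_state below, an invention
   of the sketch) holding the latest fss_version observed.

      /* Illustrative sketch: keep the most recent fss_version observed
       * and refuse to accept an instance whose version is earlier. */
      #include <stdbool.h>

      struct fs_state {
          nfstime4 latest_version;   /* highest fss_version seen so far */
      };

      /* Returns true when a is strictly earlier than b. */
      static bool nfstime_before(const nfstime4 *a, const nfstime4 *b)
      {
          if (a->seconds != b->seconds)
              return a->seconds < b->seconds;
          return a->nseconds < b->nseconds;
      }

      /* Called with the fs_status fetched by the GETATTR appended to
       * each COMPOUND that touched the file system; keeps the maximum
       * value seen. */
      void record_fs_status(struct fs_state *fs, const fs4_status *st)
      {
          if (st->fss_type == STATUS4_VERSIONED &&
              nfstime_before(&fs->latest_version, &st->fss_version))
              fs->latest_version = st->fss_version;
      }

      /* A candidate successor instance is usable only if it is
       * versioned and its version is not earlier than the latest
       * already observed. */
      bool version_acceptable(const struct fs_state *fs,
                              const fs4_status *candidate)
      {
          return candidate->fss_type == STATUS4_VERSIONED &&
                 !nfstime_before(&candidate->fss_version,
                                 &fs->latest_version);
      }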
The client may then, when switching among file 12893 system instances, decline to use an instance which does not have an 12894 fss_type of STATUS4_VERSIONED or whose fss_version field is earlier 12895 than the last one obtained from the predecessor file system instance. 12897 12. Parallel NFS (pNFS) 12899 12.1. Introduction 12901 pNFS is an OPTIONAL feature within NFSv4.1; the pNFS feature set 12902 allows direct client access to the storage devices containing file 12903 data. When file data for a single NFSv4 server is stored on multiple 12904 and/or higher throughput storage devices (by comparison to the 12905 server's throughput capability), the result can be significantly 12906 better file access performance. The relationship among multiple 12907 clients, a single server, and multiple storage devices for pNFS 12908 (server and clients have access to all storage devices) is shown in 12909 Figure 1. 12911 +-----------+ 12912 |+-----------+ +-----------+ 12913 ||+-----------+ | | 12914 ||| | NFSv4.1 + pNFS | | 12915 +|| Clients |<------------------------------>| Server | 12916 +| | | | 12917 +-----------+ | | 12918 ||| +-----------+ 12919 ||| | 12920 ||| | 12921 ||| Storage +-----------+ | 12922 ||| Protocol |+-----------+ | 12923 ||+----------------||+-----------+ Control | 12924 |+-----------------||| | Protocol| 12925 +------------------+|| Storage |------------+ 12926 +| Devices | 12927 +-----------+ 12929 Figure 1 12931 In this model, the clients, server, and storage devices are 12932 responsible for managing file access. This is in contrast to NFSv4 12933 without pNFS where it is primarily the server's responsibility; some 12934 of this responsibility may be delegated to the client under strictly 12935 specified conditions. See Section 12.2.6 for a discussion of the 12936 Control Protocol. See Section 12.2.5 for a discussion of the Storage 12937 Protocol. 12939 pNFS takes the form of OPTIONAL operations that manage protocol 12940 objects called 'layouts' (Section 12.2.7) which contain a byte-range 12941 and storage location information. The layout is managed in a similar 12942 fashion as NFSv4.1 data delegations. For example, the layout is 12943 leased, recallable and revocable. However, layouts are distinct 12944 abstractions and are manipulated with new operations. When a client 12945 holds a layout, it is granted the ability to directly access the 12946 byte-range at the storage location specified in the layout. 12948 There are interactions between layouts and other NFSv4.1 abstractions 12949 such as data delegations and byte-range locking. Delegation issues 12950 are discussed in Section 12.5.5. Byte range locking issues are 12951 discussed in Section 12.2.9 and Section 12.5.1. 12953 12.2. pNFS Definitions 12955 NFSv4.1's pNFS feature provides parallel data access to a file system 12956 that stripes its content across multiple storage servers. The first 12957 instantiation of pNFS, as part of NFSv4.1, separates the file system 12958 protocol processing into two parts: metadata processing and data 12959 processing. Data consist of the contents of regular files which are 12960 striped across storage servers. Data striping occurs in at least two 12961 ways: on a file-by-file basis, and within sufficiently large files, 12962 on a block-by-block basis. In contrast, striped access to metadata 12963 by pNFS clients is not provided in NFSv4.1, even though the file 12964 system back end of a pNFS server might stripe metadata. 
Metadata 12965 consist of everything else, including the contents of non-regular 12966 files (e.g. directories); see Section 12.2.1. The metadata 12967 functionality is implemented by an NFSv4.1 server that supports pNFS 12968 and the operations described in Section 18; such a server is called 12969 a metadata server (Section 12.2.2). 12971 The data functionality is implemented by one or more storage devices, 12972 each of which is accessed by the client via a storage protocol. A 12973 subset (defined in Section 13.6) of NFSv4.1 is one such storage 12974 protocol. New terms are introduced to the NFSv4.1 nomenclature and 12975 existing terms are clarified to allow for the description of the pNFS 12976 feature. 12978 12.2.1. Metadata 12980 Information about a file system object, such as its name, location 12981 within the namespace, owner, ACL and other attributes. Metadata may 12982 also include storage location information and this will vary based on 12983 the underlying storage mechanism that is used. 12985 12.2.2. Metadata Server 12987 An NFSv4.1 server which supports the pNFS feature. A variety of 12988 architectural choices exists for the metadata server and its use of 12989 file system information held at the server. Some servers may contain 12990 metadata only for file objects residing at the metadata server while 12991 the file data resides on associated storage devices. Other metadata 12992 servers may hold both metadata and a varying degree of file data. 12994 12.2.3. pNFS Client 12996 An NFSv4.1 client that supports pNFS operations and supports at least 12997 one storage protocol for performing I/O to storage devices. 12999 12.2.4. Storage Device 13001 A storage device stores a regular file's data, but leaves metadata 13002 management to the metadata server. A storage device could be another 13003 NFSv4.1 server, an object storage device (OSD), a block device 13004 accessed over a SAN (e.g., either a Fibre Channel or iSCSI SAN), or some 13005 other entity. 13007 12.2.5. Storage Protocol 13009 As noted in Figure 1, the storage protocol is the method used by 13010 the client to store and retrieve data directly from the storage 13011 devices. 13013 The NFSv4.1 pNFS feature has been structured to allow for a variety 13014 of storage protocols to be defined and used. One example storage 13015 protocol is NFSv4.1 itself (as documented in Section 13). Other 13016 options for the storage protocol are described elsewhere and include: 13018 o Block/volume protocols such as iSCSI ([47]), and FCP ([48]). The 13019 block/volume protocol support can be independent of the addressing 13020 structure of the block/volume protocol used, allowing more than 13021 one protocol to access the same file data and enabling 13022 extensibility to other block/volume protocols. See [40] for a 13023 layout specification that allows pNFS to use block/volume storage 13024 protocols. 13026 o Object protocols such as OSD over iSCSI or Fibre Channel [49]. 13027 See [39] for a layout specification that allows pNFS to use object 13028 storage protocols. 13030 Various storage protocols may be available to both 13031 client and server, and it is possible that a client and server do 13032 not have a matching storage protocol available to them. Because of 13033 this, the pNFS server MUST support normal NFSv4.1 access to any file 13034 accessible by the pNFS feature; this will allow for continued 13035 interoperability between an NFSv4.1 client and server. 13037 12.2.6.
Control Protocol 13039 As noted in Figure 1, the control protocol is used by the 13040 exported file system between the metadata server and storage devices. 13041 Specification of such protocols is outside the scope of the NFSv4.1 13042 protocol. Such control protocols would be used to control activities 13043 such as the allocation and deallocation of storage, the management of 13044 state required by the storage devices to perform client access 13045 control, and, depending on the storage protocol, the enforcement of 13046 authentication and authorization so that restrictions that would be 13047 enforced by the metadata server are also enforced by the storage 13048 device. 13050 A particular control protocol is not REQUIRED by NFSv4.1 but 13051 requirements are placed on the control protocol for maintaining 13052 attributes like modify time, the change attribute, and the end-of- 13053 file (EOF) position. Note that if pNFS is layered over a clustered, 13054 parallel file system (e.g. PVFS [50]), the mechanisms that enable 13055 clustering and parallelism in that file system can be considered the 13056 control protocol. 13058 12.2.7. Layout Types 13060 A layout describes the mapping of a file's data to the storage 13061 devices that hold the data. A layout is said to belong to a specific 13062 layout type (data type layouttype4, see Section 3.3.13). The layout 13063 type allows for variants to handle different storage protocols, such 13064 as those associated with block/volume [40], object [39], and file 13065 (Section 13) layout types. A metadata server, along with its control 13066 protocol, MUST support at least one layout type. A private sub-range 13067 of the layout type name space is also defined. Values from the 13068 private layout type range MAY be used for internal testing or 13069 experimentation. 13071 As an example, the organization of the file layout type could be an 13072 array of tuples (e.g., device ID, filehandle), along with a 13073 definition of how the data is stored across the devices (e.g., 13074 striping). A block/volume layout might be an array of tuples that 13075 store <device ID, block number, block count> along with information 13076 about block size and the associated file offset of the block number. 13077 An object layout might be an array of tuples <device ID, object ID> 13078 and an additional structure (i.e., the aggregation map) that defines 13079 how the logical byte sequence of the file data is serialized into the 13080 different objects. Note that the actual layouts are typically more 13081 complex than these simple expository examples. 13083 Requests for pNFS-related operations will often specify a layout 13084 type. Examples of such operations are GETDEVICEINFO and LAYOUTGET. 13085 The response for these operations will include structures such as a 13086 device_addr4 or a layout4, each of which includes a layout type 13087 within it. The layout type sent by the server MUST always be the 13088 same one requested by the client. When a server sends a response 13089 that includes a different layout type, the client SHOULD ignore the 13090 response and behave as if the server had returned an error response. 13092 12.2.8. Layout 13094 A layout defines how a file's data is organized on one or more 13095 storage devices. There are many potential layout types; each of the 13096 layout types is differentiated by the storage protocol used to 13097 access data and by the aggregation scheme that lays out the file data 13098 on the underlying storage devices.
A layout is precisely identified 13099 by the following tuple: <client ID, filehandle, layout type, iomode, 13100 range>; where filehandle refers to the filehandle of the file on the 13101 metadata server. 13103 It is important to define when layouts overlap and/or conflict with 13104 each other. For two layouts with overlapping byte ranges to actually 13105 overlap each other, both layouts must be of the same layout type, 13106 correspond to the same filehandle, and have the same iomode. Layouts 13107 conflict when they overlap and differ in the content of the layout 13108 (i.e., the storage device/file mapping parameters differ). Note that 13109 differing iomodes do not lead to conflicting layouts. It is 13110 permissible for layouts with different iomodes, pertaining to the 13111 same byte range, to be held by the same client. An example of this 13112 would be copy-on-write functionality for a block/volume layout type. 13114 12.2.9. Layout Iomode 13116 The layout iomode (data type layoutiomode4, see Section 3.3.20) 13117 indicates to the metadata server the client's intent to perform 13118 either just read operations or a mixture of I/O possibly containing 13119 read and write operations. For certain layout types, it is useful 13120 for a client to specify this intent at the time it sends LAYOUTGET 13121 (Section 18.43). For example, for block/volume-based protocols, block 13122 allocation could occur when a READ/WRITE iomode is specified. A 13123 special LAYOUTIOMODE4_ANY iomode is defined and can only be used for 13124 LAYOUTRETURN and CB_LAYOUTRECALL, not for LAYOUTGET. It specifies 13125 that layouts pertaining to both READ and READ/WRITE iomodes are being 13126 returned or recalled, respectively. 13128 A storage device may validate I/O with regard to the iomode; this is 13129 dependent upon storage device implementation and layout type. Thus, 13130 if the client's layout iomode is inconsistent with the I/O being 13131 performed, the storage device may reject the client's I/O with an 13132 error indicating that a new layout with the correct iomode should be 13133 obtained via LAYOUTGET. For example, if a client gets a layout with 13134 a READ iomode and performs a WRITE to a storage device, the storage 13135 device is allowed to reject that WRITE. 13137 The use of the layout iomode does not conflict with OPEN share modes 13138 or byte-range lock requests; open mode and lock conflicts are 13139 enforced as they are without the use of pNFS, and are logically 13140 separate from the pNFS layout level. Open modes and locks are the 13141 preferred method for restricting user access to data files. For 13142 example, an OPEN of read, deny-write does not conflict with a 13143 LAYOUTGET containing an iomode of READ/WRITE performed by another 13144 client. Applications that depend on writing into the same file 13145 concurrently may use byte-range locking to serialize their accesses. 13147 12.2.10. Device IDs 13149 The device ID (data type deviceid4, see Section 3.3.14) identifies a 13150 group of storage devices. The scope of a device ID is the pair 13151 <client ID, layout type>. In practice, a significant amount of 13152 information may be required to fully address a storage device. 13153 Rather than embedding all such information in a layout, layouts embed 13154 device IDs. The NFSv4.1 operation GETDEVICEINFO (Section 18.40) is 13155 used to retrieve the complete address information (including all 13156 device addresses for the device ID) regarding the storage device 13157 according to its layout type and device ID.
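   The fragment below sketches a client-side cache of these mappings,
   keyed as described above.  The cache structure, the stand-in type
   definitions, and the getdeviceinfo_fetch() wrapper are assumptions of
   the sketch; only the keying by <layout type, device ID> and the
   fetch-on-miss use of GETDEVICEINFO follow from the text.

      /* Illustrative sketch: per-client cache of device-ID-to-address
       * mappings, keyed by <layout type, device ID>. */
      #include <stdint.h>
      #include <string.h>

      typedef uint32_t layouttype4;              /* stand-in for the XDR enum */
      typedef struct { char body[64]; } device_addr4;  /* stand-in address */

      struct devid_entry {
          layouttype4         type;
          unsigned char       devid[16];         /* deviceid4: 16-byte opaque */
          device_addr4        addr;              /* result of GETDEVICEINFO */
          struct devid_entry *next;
      };

      struct devid_cache { struct devid_entry *head; };

      /* Hypothetical wrapper that sends GETDEVICEINFO to the metadata
       * server, inserts the result into the cache, and returns it. */
      const device_addr4 *getdeviceinfo_fetch(struct devid_cache *c,
                                              layouttype4 type,
                                              const unsigned char devid[16]);

      const device_addr4 *devid_lookup(struct devid_cache *c,
                                       layouttype4 type,
                                       const unsigned char devid[16])
      {
          for (struct devid_entry *e = c->head; e != NULL; e = e->next)
              if (e->type == type && memcmp(e->devid, devid, 16) == 0)
                  return &e->addr;
          /* Miss: the address information must be obtained with
           * GETDEVICEINFO before the storage device can be used. */
          return getdeviceinfo_fetch(c, type, devid);
      }

   Note that, as discussed below, such cached entries are not leased and
   must be discarded when the corresponding layouts are recalled or a
   notification invalidates the mapping.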
For example, the address 13158 of an NFSv4.1 data server or of an object storage device could be an 13159 IP address and port. The address of a block storage device could be 13160 a volume label. 13162 Clients cannot expect the mapping between a device ID and its storage 13163 device address(es) to persist across metadata server restart. See 13164 Section 12.7.4 for a description of how recovery works in that 13165 situation. 13167 A device ID lives as long as there is a layout referring to the 13168 device ID. If there are no layouts referring to the device ID, the 13169 server is free to delete the device ID at any time. Once a device ID is 13170 deleted by the server, the server MUST NOT reuse the device ID for 13171 the same layout type and client ID again. This requirement is 13172 feasible because the device ID is 16 bytes long, leaving sufficient 13173 room to store a generation number if the server's implementation requires 13174 most of the rest of the device ID's content to be reused. This 13175 requirement is necessary because otherwise the race conditions 13176 between asynchronous notification of device ID addition and deletion 13177 would be too difficult to sort out. 13179 Device ID to device address mappings are not leased, and can be 13180 changed at any time. (Note that while device ID to device address 13181 mappings are likely to change after the metadata server restarts, the 13182 server is not required to change the mappings.) A server has two 13183 choices for changing mappings. It can recall all layouts referring 13184 to the device ID or it can use a notification mechanism. 13186 The NFSv4.1 protocol has no optimal way to recall all layouts that 13187 referred to a particular device ID (unless the server associates a 13188 single device ID with a single fsid or a single client ID; in which 13189 case, CB_LAYOUTRECALL has options for recalling all layouts 13190 associated with the fsid, client ID pair or just the client ID). 13192 Via a notification mechanism (see Section 20.12), device ID to device 13193 address mappings can change over the duration of server operation 13194 without recalling or revoking the layouts that refer to the device ID. 13195 The notification mechanism can also delete a device ID, but only if 13196 the client has no layouts referring to the device ID. A notification 13197 of a change to a device ID to device address mapping will immediately 13198 or eventually invalidate some or all of the device ID's mappings. 13200 The server MUST support notifications and the client must request 13201 them before they can be used. For further information about the 13202 notification types, see Section 20.12. 13204 12.3. pNFS Operations 13206 NFSv4.1 has several operations that are needed for pNFS servers, 13207 regardless of layout type or storage protocol. These operations are 13208 all sent to a metadata server and are summarized here. While pNFS is an 13209 OPTIONAL feature, if pNFS is implemented, some operations are 13210 REQUIRED in order to comply with pNFS. See Section 17. 13212 These are the fore channel pNFS operations: 13214 GETDEVICEINFO. As noted previously (Section 12.2.10), GETDEVICEINFO 13215 (Section 18.40) returns the mapping of device ID to storage device 13216 address. 13218 GETDEVICELIST (Section 18.41) allows clients to fetch all device 13219 IDs for a specific file system. 13221 LAYOUTGET (Section 18.43) is used by a client to get a layout for a 13222 file.
13224 LAYOUTCOMMIT (Section 18.42) is used to inform the metadata server 13225 of the client's intent to commit data which has been written to 13226 the storage device, that is, the storage device as originally indicated in 13227 the return value of LAYOUTGET. 13229 LAYOUTRETURN (Section 18.44) is used to return layouts for a file, 13230 for an FSID, or for a client ID. 13232 These are the backchannel pNFS operations: 13234 CB_LAYOUTRECALL (Section 20.3) recalls a layout or all layouts 13235 belonging to a file system, or all layouts belonging to a client 13236 ID. 13238 CB_RECALL_ANY (Section 20.6) tells a client that it needs to return 13239 some number of recallable objects, including layouts, to the 13240 metadata server. 13242 CB_RECALLABLE_OBJ_AVAIL (Section 20.7) tells a client that a 13243 recallable object that it was denied (in the case of pNFS, a layout 13244 denied by LAYOUTGET) due to resource exhaustion, is now available. 13246 CB_NOTIFY_DEVICEID (Section 20.12) notifies the client of changes to 13247 device IDs. 13249 12.4. pNFS Attributes 13251 A number of attributes specific to pNFS are listed and described in 13252 Section 5.12. 13254 12.5. Layout Semantics 13256 12.5.1. Guarantees Provided by Layouts 13258 Layouts grant to the client the ability to access data located at a 13259 storage device with the appropriate storage protocol. The client is 13260 guaranteed that the layout will be recalled when one of two things occurs: 13261 either a conflicting layout is requested, or the state encapsulated by 13262 the layout becomes invalid (this can happen when an event directly 13263 or indirectly modifies the layout). When a layout is recalled and 13264 returned by the client, the client continues with the ability to 13265 access file data with normal NFSv4.1 operations through the metadata 13266 server. Only the ability to access the storage devices is affected. 13268 The requirement of NFSv4.1, that all user access rights MUST be 13269 obtained through the appropriate open, lock, and access operations, 13270 is not modified with the existence of layouts. Layouts are provided 13271 to NFSv4.1 clients and user access still follows the rules of the 13272 protocol as if they did not exist. It is a requirement that for a 13273 client to access a storage device, a layout must be held by the 13274 client. If a storage device receives an I/O for a byte range for 13275 which the client does not hold a layout, the storage device SHOULD 13276 reject that I/O request. Note that the act of modifying a file for 13277 which a layout is held does not necessarily conflict with the 13278 holding of the layout that describes the file being modified. 13279 Therefore, it is the requirement of the storage protocol or layout 13280 type that determines the necessary behavior. For example, block/ 13281 volume layout types require that the layout's iomode agree with the 13282 type of I/O being performed. 13284 Depending upon the layout type and storage protocol in use, storage 13285 device access permissions may be granted by LAYOUTGET and may be 13286 encoded within the type-specific layout. For an example of storage 13287 device access permissions, see an object-based protocol such as [49]. 13288 If access permissions are encoded within the layout, the metadata 13289 server SHOULD recall the layout when those permissions become invalid 13290 for any reason, for example, when a file becomes unwritable or 13291 inaccessible to a client.
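   The following fragment sketches the kind of check a storage device
   (or a client auditing itself) might make before allowing an I/O,
   based on the rules just stated.  The list structure and field names
   are assumptions of the sketch; whether the iomode is checked at all
   depends on the layout type, as noted above.

      /* Illustrative sketch: is an I/O covered by a layout the client
       * holds?  For simplicity the I/O must fall within a single layout
       * segment; a real implementation would handle spanning segments. */
      #include <stdbool.h>
      #include <stdint.h>

      enum { LAYOUTIOMODE4_READ = 1, LAYOUTIOMODE4_RW = 2 };
      enum io_kind { IO_READ, IO_WRITE };

      struct held_layout {
          uint64_t            offset;
          uint64_t            length;
          int                 iomode;            /* READ or RW */
          struct held_layout *next;
      };

      bool io_permitted(const struct held_layout *layouts, uint64_t off,
                        uint64_t len, enum io_kind kind,
                        bool layout_type_checks_iomode)
      {
          for (const struct held_layout *l = layouts; l != NULL;
               l = l->next) {
              bool covers = off >= l->offset &&
                            off + len <= l->offset + l->length;
              if (!covers)
                  continue;
              /* Some layout types (e.g., block/volume) require the
               * iomode to agree with the type of I/O being performed. */
              if (layout_type_checks_iomode && kind == IO_WRITE &&
                  l->iomode != LAYOUTIOMODE4_RW)
                  continue;
              return true;
          }
          /* No covering layout: the storage device SHOULD reject the
           * I/O. */
          return false;
      }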
Note, clients are still required to 13292 perform the appropriate access operations with open, lock and access 13293 as described above. The degree to which it is possible for the 13294 client to circumvent these access operations and the consequences of 13295 doing so must be clearly specified by the individual layout type 13296 specifications. In addition, these specifications must be clear 13297 about the requirements and non-requirements for the checking 13298 performed by the server. 13300 In the presence of pNFS functionality, mandatory file locks MUST 13301 behave as they would without pNFS. Therefore, if mandatory file 13302 locks and layouts are provided simultaneously, the storage device 13303 MUST be able to enforce the mandatory file locks. For example, if 13304 one client obtains a mandatory lock and a second client accesses the 13305 storage device, the storage device MUST appropriately restrict I/O 13306 for the byte range of the mandatory file lock. If the storage device 13307 is incapable of providing this check in the presence of mandatory 13308 file locks, the metadata server then MUST NOT grant layouts and 13309 mandatory file locks simultaneously. 13311 12.5.2. Getting a Layout 13313 A client obtains a layout with the LAYOUTGET operation. The metadata 13314 server will grant layouts of a particular type (e.g., block/volume, 13315 object, or file). The client selects an appropriate layout type that 13316 the server supports and the client is prepared to use. The layout 13317 returned to the client might not exactly match the requested byte 13318 range as described in Section 18.43.3. As needed a client may make 13319 multiple LAYOUTGET requests; these might result in multiple 13320 overlapping, non-conflicting layouts (see Section 12.2.8). 13322 In order to get a layout, the client must first have opened the file 13323 via the OPEN operation. When a client has no layout on a file, it 13324 MUST present a stateid as returned by OPEN, a delegation stateid, or 13325 a byte-range lock stateid in the loga_stateid argument. A successful 13326 LAYOUTGET result includes a layout stateid. The first successful 13327 LAYOUTGET processed by the server using a non-layout stateid as an 13328 argument MUST have the "seqid" field of the layout stateid in the 13329 response set to one. Thereafter, the client MUST use a layout 13330 stateid (see Section 12.5.3) on future invocations of LAYOUTGET on 13331 the file, and the "seqid" MUST NOT be set to zero. Once the layout 13332 has been retrieved, it can be held across multiple OPEN and CLOSE 13333 sequences. Therefore, a client may hold a layout for a file that is 13334 not currently open by any user on the client. This allows for the 13335 caching of layouts beyond CLOSE. 13337 The storage protocol used by the client to access the data on the 13338 storage device is determined by the layout's type. The client is 13339 responsible for matching the layout type with an available method to 13340 interpret and use the layout. The method for this layout type 13341 selection is outside the scope of the pNFS functionality. 13343 Although the metadata server is in control of the layout for a file, 13344 the pNFS client can provide hints to the server when a file is opened 13345 or created about the preferred layout type and aggregation schemes. 13346 pNFS introduces a layout_hint (Section 5.12.4) attribute that the 13347 client can set at file creation time to provide a hint to the server 13348 for new files. 
Setting this attribute separately, after the file has 13349 been created might make it difficult, or impossible, for the server 13350 implementation to comply. 13352 Because the EXCLUSIVE4 createmode4 does not allow the setting of 13353 attributes at file creation time, NFSv4.1 introduces the EXCLUSIVE4_1 13354 createmode4, which does allow attributes to be set at file creation 13355 time. In addition, if the session is created with persistent reply 13356 caches, EXCLUSIVE4_1 is neither necessary nor allowed. Instead, 13357 GUARDED4 both works better and is prescribed. Table 10 in 13358 Section 18.16.3, summarizes how a client is allowed to send an 13359 exclusive create. 13361 12.5.3. Layout Stateid 13363 As with all other stateids, the layout stateid consists of a "seqid" 13364 and "other" field. Once a layout stateid is changed, the "other" 13365 field will stay constant unless the stateid is revoked, or the client 13366 returns all layouts on the file and the server disposes of the 13367 stateid. The "seqid" field is initially set to one, and is never 13368 zero on any NFSv4.1 operation that uses layout stateids, whether it 13369 is a fore channel or backchannel operation. After the layout stateid 13370 is established, the server increments by one the value of the "seqid" 13371 in each subsequent LAYOUTGET and LAYOUTRETURN response, and in each 13372 CB_LAYOUTRECALL request. 13374 Given the design goal of pNFS to provide parallelism, the layout 13375 stateid differs from other stateid types in that the client is 13376 expected to send LAYOUTGET and LAYOUTRETURN operations in parallel. 13377 The "seqid" value is used by the client to properly sort responses to 13378 LAYOUTGET and LAYOUTRETURN. The "seqid" is also used to prevent race 13379 conditions between LAYOUTGET and CB_LAYOUTRECALL. Given the 13380 processing rules differ from layout stateids and other stateid types, 13381 only the pNFS sections of this document should be considered to 13382 determine proper layout stateid handling. 13384 Once the client receives a layout stateid, it MUST use the correct 13385 "seqid" for subsequent LAYOUTGET or LAYOUTRETURN operations. The 13386 correct "seqid" is defined as the highest "seqid" value from 13387 responses of fully processed LAYOUTGET or LAYOUTRETURN operations or 13388 arguments of a fully processed CB_LAYOUTRECALL operation. Since the 13389 server is incrementing the "seqid" value on each layout operation, 13390 the client may determine the order of operation processing by 13391 inspecting the "seqid" value. In the case of overlapping layout 13392 ranges, the ordering information will provide the client the 13393 knowledge of which layout ranges are held. Note that overlapping 13394 layout ranges may occur because of the client's specific requests or 13395 because the server is allowed to expand the range of a requested 13396 layout and notify the client in the LAYOUTRETURN results. Additional 13397 layout stateid sequencing requirements are provided in 13398 Section 12.5.5.2. 13400 The client's receipt of a "seqid" is not sufficient for subsequent 13401 use. The client must fully process the operations before the "seqid" 13402 can be used. For LAYOUTGET results, if the client is not using the 13403 forgetful model (Section 12.5.5.1), it MUST first update its record 13404 of what ranges of the file's layout it has before using the seqid. 
13405 For LAYOUTRETURN results, the client MUST delete the range from its 13406 record of what ranges of the file's layout it had before using the 13407 seqid. For CB_LAYOUTRECALL arguments, the client MUST send a 13408 response to the recall before using the seqid. The fundamental 13409 requirement in client processing is that the "seqid" is used to 13410 provide the order of processing. LAYOUTGET results may be processed 13411 in parallel. LAYOUTRETURN results may be processed in parallel. 13412 LAYOUTGET and LAYOUTRETURN responses may be processed in parallel as 13413 long as the ranges do not overlap. CB_LAYOUTRECALL requests 13414 MUST be processed in "seqid" order at all times. 13416 Once a client has no more layouts on a file, the layout stateid is no 13417 longer valid, and MUST NOT be used. Any attempt to use such a layout 13418 stateid will result in NFS4ERR_BAD_STATEID. 13420 12.5.4. Committing a Layout 13422 Allowing for varying storage protocol capabilities, the pNFS 13423 protocol does not require the metadata server and storage devices to 13424 have a consistent view of file attributes and data location mappings. 13425 Data location mapping refers to aspects such as which offsets store 13426 data as opposed to storing holes (see Section 13.4.4 for a 13427 discussion). Related issues arise for storage protocols where a 13428 layout may hold provisionally allocated blocks where the allocation 13429 of those blocks does not survive a complete restart of both the 13430 client and server. Because of this inconsistency, it is necessary to 13431 re-synchronize the client with the metadata server and its storage 13432 devices and make any potential changes available to other clients. 13433 This is accomplished by use of the LAYOUTCOMMIT operation. 13435 The LAYOUTCOMMIT operation is responsible for committing a modified 13436 layout to the metadata server. The data should be written and 13437 committed to the appropriate storage devices before the LAYOUTCOMMIT 13438 occurs. The scope of the LAYOUTCOMMIT operation depends on the 13439 storage protocol in use. It is important to note that the level of 13440 synchronization is from the point of view of the client which sent 13441 the LAYOUTCOMMIT. The updated state on the metadata server need only 13442 reflect the state as of the client's last operation previous to the 13443 LAYOUTCOMMIT. It is not REQUIRED to maintain a global view that 13444 accounts for other clients' I/O that may have occurred within the 13445 same time frame. 13447 For block/volume-based layouts, LAYOUTCOMMIT may require updating the 13448 block list that comprises the file and committing this layout to 13449 stable storage. For file-based layouts, synchronization of attributes 13450 between the metadata server and storage devices, primarily the size attribute, 13451 is required. 13453 The control protocol is free to synchronize the attributes before it 13454 receives a LAYOUTCOMMIT; however, upon successful completion of a 13455 LAYOUTCOMMIT, state that exists on the metadata server that describes 13456 the file MUST be in sync with the state existing on the storage 13457 devices that comprise that file as of the issuing client's last 13458 operation. Thus, a client that queries the size of a file between a 13459 WRITE to a storage device and the LAYOUTCOMMIT may observe a size 13460 that does not reflect the actual data written. 13462 The client MUST have a layout in order to issue LAYOUTCOMMIT. 13464 12.5.4.1.
LAYOUTCOMMIT and change/time_modify 13466 The change and time_modify attributes may be updated by the server 13467 when the LAYOUTCOMMIT operation is processed. The reason for this is 13468 that some layout types do not support the update of these attributes 13469 when the storage devices process I/O operations. If client has a 13470 layout with the LAYOUTIOMODE4_RW iomode on the file, the client MAY 13471 provide a suggested value to the server for time_modify within the 13472 arguments to LAYOUTCOMMIT. Based on the layout type, the provided 13473 value may or may not be used. The server should sanity check the 13474 client provided values before they are used. For example, the server 13475 should ensure that time does not flow backwards. The client always 13476 has the option to set time_modify through an explicit SETATTR 13477 operation. 13479 For some layout protocols, the storage device is able to notify the 13480 metadata server of the occurrence of an I/O and as a result the 13481 change and time_modify attributes may be updated at the metadata 13482 server. For a metadata server that is capable of monitoring updates 13483 to the change and time_modify attributes, LAYOUTCOMMIT processing is 13484 not required to update the change attribute; in this case the 13485 metadata server must ensure that no further update to the data has 13486 occurred since the last update of the attributes; file-based 13487 protocols may have enough information to make this determination or 13488 may update the change attribute upon each file modification. This 13489 also applies for the time_modify attribute. If the server 13490 implementation is able to determine that the file has not been 13491 modified since the last time_modify update, the server need not 13492 update time_modify at LAYOUTCOMMIT. At LAYOUTCOMMIT completion, the 13493 updated attributes should be visible if that file was modified since 13494 the latest previous LAYOUTCOMMIT or LAYOUTGET. 13496 12.5.4.2. LAYOUTCOMMIT and size 13498 The size of a file may be updated when the LAYOUTCOMMIT operation is 13499 used by the client. One of the fields in the argument to 13500 LAYOUTCOMMIT is loca_last_write_offset; this field indicates the 13501 highest byte offset written but not yet committed with the 13502 LAYOUTCOMMIT operation. The data type of loca_last_write_offset is 13503 newoffset4 and is switched on a boolean value, no_newoffset, that 13504 indicates if a previous write occurred or not. If no_newoffset is 13505 FALSE, an offset is not given. If the client has a layout with 13506 LAYOUTIOMODE4_RW iomode on the file, with an lo_offset and lo_length 13507 that overlaps loca_last_write_offset, then the client MAY set 13508 no_newoffset to TRUE and provide an offset that will update the file 13509 size. Keep in mind that offset is not the same as length, though 13510 they are related. For example, a loca_last_write_offset value of 13511 zero means that one byte was written at offset zero, and so the 13512 length of the file is at least one byte. 13514 The metadata server may do one of the following: 13516 1. Update the file's size using the last write offset provided by 13517 the client as either the true file size or as a hint of the file 13518 size. If the metadata server has a method available, any new 13519 value for file size should be sanity checked. For example, the 13520 file must not be truncated if the client presents a last write 13521 offset less than the file's current size. 13523 2. 
Ignore the client provided last write offset; the metadata server 13524 must have sufficient knowledge from other sources to determine 13525 the file's size. For example, the metadata server queries the 13526 storage devices with the control protocol. 13528 The method chosen to update the file's size will depend on the 13529 storage device's and/or the control protocol's capabilities. For 13530 example, if the storage devices are block devices with no knowledge 13531 of file size, the metadata server must rely on the client to set the 13532 last write offset appropriately. 13534 The results of LAYOUTCOMMIT contain a new size value in the form of a 13535 newsize4 union data type. If the file's size is set as a result of 13536 LAYOUTCOMMIT, the metadata server must reply with the new size; 13537 otherwise the new size is not provided. If the file size is updated, 13538 the metadata server SHOULD update the storage devices such that the 13539 new file size is reflected when LAYOUTCOMMIT processing is complete. 13540 For example, the client should be able to READ up to the new file 13541 size. 13543 The client can extend the length of a file or truncate a file by 13544 sending a SETATTR operation to the metadata server with the size 13545 attribute specified. If the size specified is larger than the 13546 current size of the file, the file is "zero extended", i.e., zeroes 13547 are implicitly added between the file's previous EOF and the new EOF. 13548 (In many implementations the zero extended region of the file 13549 consists of unallocated holes in the file.) When the client writes 13550 past EOF via WRITE, the SETATTR operation does not need to be used. 13552 12.5.4.3. LAYOUTCOMMIT and layoutupdate 13554 The LAYOUTCOMMIT argument contains a loca_layoutupdate field 13555 (Section 18.42.1) of data type layoutupdate4 (Section 3.3.18). This 13556 argument is a layout type-specific structure. The structure can be 13557 used to pass arbitrary layout type-specific information from the 13558 client to the metadata server at LAYOUTCOMMIT time. For example, if 13559 using a block/volume layout, the client can indicate to the metadata 13560 server which reserved or allocated blocks the client used or did not 13561 use. The content of loca_layoutupdate (field lou_body) need not be 13562 the same layout type-specific content returned by LAYOUTGET 13563 (Section 18.43.2) in the loc_body field of the lo_content field, of 13564 the logr_layout field. The content of loca_layoutupdate is defined 13565 by the layout type specification and is opaque to LAYOUTCOMMIT. 13567 12.5.5. Recalling a Layout 13569 Since a layout protects a client's access to a file via a direct 13570 client-storage-device path, a layout need only be recalled when it is 13571 semantically unable to serve this function. Typically, this occurs 13572 when the layout no longer encapsulates the true location of the file 13573 over the byte range it represents. Any operation or action, such as 13574 server driven restriping or load balancing, that changes the layout 13575 will result in a recall of the layout. A layout is recalled by the 13576 CB_LAYOUTRECALL callback operation (see Section 20.3) and returned 13577 with LAYOUTRETURN Section 18.44. The CB_LAYOUTRECALL operation may 13578 recall a layout identified by a byte range, all the layouts 13579 associated with a file system (FSID), or all layouts associated with 13580 a client ID. 
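   The fragment below sketches how a client might test whether a layout
   it holds falls within the scope of such a recall.  The structures are
   local stand-ins rather than protocol data types, and the iomode
   matching discussed just below is omitted for brevity.

      /* Illustrative sketch: does a recall of scope FILE, FSID, or ALL
       * cover a layout held by the client?  Ranges extending to EOF
       * (length of all ones) are not handled by this simplified overlap
       * test. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>

      enum recall_scope { RECALL_FILE, RECALL_FSID, RECALL_ALL };

      struct client_layout {
          unsigned char fh[128];  size_t fh_len;   /* file's filehandle */
          uint64_t      fsid;                      /* stand-in for fsid4 */
          uint64_t      offset, length;
      };

      struct recall {
          enum recall_scope scope;
          unsigned char fh[128];  size_t fh_len;   /* valid for RECALL_FILE */
          uint64_t      fsid;                      /* valid for RECALL_FSID */
          uint64_t      offset, length;            /* valid for RECALL_FILE */
      };

      static bool overlaps(uint64_t o1, uint64_t l1,
                           uint64_t o2, uint64_t l2)
      {
          return o1 < o2 + l2 && o2 < o1 + l1;
      }

      bool layout_recalled(const struct client_layout *l,
                           const struct recall *r)
      {
          switch (r->scope) {
          case RECALL_ALL:
              return true;                 /* all layouts for the client */
          case RECALL_FSID:
              return l->fsid == r->fsid;   /* all layouts for the fsid */
          case RECALL_FILE:
              return l->fh_len == r->fh_len &&
                     memcmp(l->fh, r->fh, r->fh_len) == 0 &&
                     overlaps(l->offset, l->length, r->offset, r->length);
          }
          return false;
      }

   Layouts for which this test is true are returned with LAYOUTRETURN,
   possibly in portions; if none match, NFS4ERR_NOMATCHING_LAYOUT is the
   appropriate response, as discussed in Section 12.5.5.1.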
Section 12.5.5.2 discusses sequencing issues 13581 surrounding the getting, returning, and recalling of layouts. 13583 An iomode is also specified when recalling a layout. Generally, the 13584 iomode in the recall request must match the layout being returned; 13585 for example, a recall with an iomode of LAYOUTIOMODE4_RW should cause 13586 the client to only return LAYOUTIOMODE4_RW layouts and not 13587 LAYOUTIOMODE4_READ layouts. However, a special LAYOUTIOMODE4_ANY 13588 enumeration is defined to enable recalling a layout of any iomode; in 13589 other words, the client must return both read-only and read/write 13590 layouts. 13592 A REMOVE operation SHOULD cause the metadata server to recall the 13593 layout to prevent the client from accessing a non-existent file and 13594 to reclaim state stored on the client. Since a REMOVE may be delayed 13595 until the last close of the file has occurred, the recall may also be 13596 delayed until this time. After the last reference on the file has 13597 been released and the file has been removed, the client should no 13598 longer be able to perform I/O using the layout. In the case of a 13599 file-based layout, the data server SHOULD return NFS4ERR_STALE in 13600 response to any operation on the removed file. 13602 Once a layout has been returned, the client MUST NOT send I/Os to the 13603 storage devices for the file, byte range, and iomode represented by 13604 the returned layout. If a client does send an I/O to a storage 13605 device for which it does not hold a layout, the storage device SHOULD 13606 reject the I/O. 13608 Although pNFS does not alter the file data caching capabilities of 13609 clients, or their semantics, it recognizes that some clients may 13610 perform more aggressive write-behind caching to optimize the benefits 13611 provided by pNFS. However, write-behind caching may negatively 13612 affect the latency in returning a layout in response to a 13613 CB_LAYOUTRECALL; this is similar to file delegations and the impact 13614 that file data caching has on DELEGRETURN. Client implementations 13615 SHOULD limit the amount of unwritten data they have outstanding at 13616 any one time in order to prevent excessively long responses to 13617 CB_LAYOUTRECALL. Once a layout is recalled, a server MUST wait one 13618 lease period before taking further action. As soon as a lease period 13619 has passed, the server may choose to fence the client's access to the 13620 storage devices if the server perceives that the client has taken too long 13621 to return a layout. However, just as in the case of data delegation 13622 and DELEGRETURN, the server may choose to wait given that the client 13623 is showing forward progress on its way to returning the layout. This 13624 forward progress can take the form of successful interaction with the 13625 storage devices or sub-portions of the layout being returned by the 13626 client. The server can also limit exposure to these problems by 13627 limiting the byte ranges initially provided in the layouts and thus 13628 the amount of outstanding modified data. 13630 12.5.5.1. Layout Recall Callback Robustness 13632 It has been assumed thus far that pNFS client state for a file 13633 exactly matches the pNFS server state for that file and client 13634 regarding layout ranges and iomode.
This assumption leads to the 13635 implication that any callback results in a LAYOUTRETURN or set of 13636 LAYOUTRETURNs that exactly match the range in the callback, since 13637 both client and server agree about the state being maintained. 13638 However, it can be useful if this assumption does not always hold. 13639 For example: 13641 o If conflicts that require callbacks are very rare, and a server 13642 can use a multi-file callback to recover per-client resources 13643 (e.g., via a FSID recall, or a multi-file recall within a single 13644 compound), the result may be significantly less client-server pNFS 13645 traffic. 13647 o It may be useful for servers to maintain information about what 13648 ranges are held by a client on a coarse-grained basis, leading to 13649 the server's layout ranges being beyond those actually held by the 13650 client. In the extreme, a server could manage conflicts on a per- 13651 file basis, only issuing whole-file callbacks even though clients 13652 may request and be granted sub-file ranges. 13654 o It may be useful for clients to "forget" details about what 13655 layouts and ranges the client actually has, leading to the 13656 server's layout ranges being beyond those what the client "thinks" 13657 it has. As long as the client does not assume it has layouts that 13658 are beyond what the server has granted, this is a safe practice. 13659 When a client forgets what ranges and layouts it has, and it 13660 receives a CB_LAYOUTRECALL operation, the client MUST follow up 13661 with a LAYOUTRETURN for what the server recalled, or alternatively 13662 return the NFS4ERR_NOMATCHING_LAYOUT error if it has no layout to 13663 return in the recalled range. 13665 o In order to avoid errors, it is vital that a client not assign 13666 itself layout permissions beyond what the server has granted and 13667 that the server not forget layout permissions that have been 13668 granted. On the other hand, if a server believes that a client 13669 holds a layout that the client does not know about, it is useful 13670 for the client to cleanly indicate completion of the requested 13671 recall either by issuing a LAYOUTRETURN for the entire requested 13672 range or by returning an NFS4ERR_NOMATCHING_LAYOUT error to the 13673 CB_LAYOUTRECALL. 13675 Thus, in light of the above, it is useful for a server to be able to 13676 send callbacks for layout ranges it has not granted to a client, and 13677 for a client to return ranges it does not hold. A pNFS client MUST 13678 always return layouts that comprise the full range specified by the 13679 recall. Note, the full recalled layout range need not be returned as 13680 part of a single operation, but may be returned in portions. This 13681 allows the client to stage the flushing of dirty data, layout 13682 commits, and returns. Also, it indicates to the metadata server that 13683 the client is making progress. 13685 When a layout is returned, the client MUST NOT have any outstanding 13686 I/O requests to the storage devices involved in the layout. 13687 Rephrasing, the client MUST NOT return the layout while it has 13688 outstanding I/O requests to the storage device. 13690 Even with this requirement for the client, it is possible that I/O 13691 requests may be presented to a storage device no longer allowed to 13692 perform them. 
Since the server has no strict control as to when the 13693 client will return the layout, the server may later decide to 13694 unilaterally revoke the client's access to the storage devices as 13695 provided by the layout. In choosing to revoke access, the server 13696 must deal with the possibility of lingering I/O requests; those 13697 outstanding I/O requests are still in flight to storage devices 13698 identified by the revoked layout. All layout type specifications 13699 MUST define whether unilateral layout revocation by the metadata 13700 server is supported; if it is, the specification must also describe 13701 how lingering writes are processed. For example, storage devices 13702 identified by the revoked layout could be fenced off from the client 13703 that held the layout. 13705 In order to ensure client/server convergence with regard to layout 13706 state, the final LAYOUTRETURN operation in a sequence of LAYOUTRETURN 13707 operations for a particular recall MUST specify the entire range 13708 being recalled, echoing the recalled layout type, iomode, recall/ 13709 return type (FILE, FSID, or ALL), and byte range, even if layouts 13710 pertaining to partial ranges were previously returned. In addition, 13711 if the client holds no layouts that overlap the range being 13712 recalled, the client should return the NFS4ERR_NOMATCHING_LAYOUT 13713 error code to CB_LAYOUTRECALL. This allows the server to update its 13714 view of the client's layout state. 13716 12.5.5.2. Sequencing of Layout Operations 13718 As with other stateful operations, pNFS requires the correct 13719 sequencing of layout operations. pNFS uses the "seqid" in the layout 13720 stateid to provide the correct sequencing between regular operations 13721 and callbacks. It is the server's responsibility to avoid 13722 inconsistencies regarding the layouts provided and the client's 13723 responsibility to properly serialize its layout requests and layout 13724 returns. 13726 12.5.5.2.1. Layout Recall and Return Sequencing 13728 One critical issue with regard to layout operations sequencing 13729 concerns callbacks. The protocol must defend against races between 13730 the reply to a LAYOUTGET or LAYOUTRETURN operation and a subsequent 13731 CB_LAYOUTRECALL. A client MUST NOT process a CB_LAYOUTRECALL that 13732 implies one or more outstanding LAYOUTGET or LAYOUTRETURN operations 13733 to which the client has not yet received a reply. The client detects 13734 such a CB_LAYOUTRECALL by examining the "seqid" field of the recall's 13735 layout stateid. If the "seqid" is not one higher than what the 13736 client currently has recorded, and the client has at least one 13737 LAYOUTGET and/or LAYOUTRETURN operation outstanding, the client knows 13738 the server sent the CB_LAYOUTRECALL after sending a response to an 13739 outstanding LAYOUTGET or LAYOUTRETURN. The client MUST wait before 13740 processing such a CB_LAYOUTRECALL until it processes all replies for 13741 outstanding LAYOUTGET and LAYOUTRETURN operations for the 13742 corresponding file with seqid less than the seqid given by 13743 CB_LAYOUTRECALL (lor_stateid; see Section 20.3). 13745 In addition to the seqid-based mechanism, Section 2.10.6.3 describes 13746 the sessions mechanism for allowing the client to detect callback 13747 race conditions and delay processing such a CB_LAYOUTRECALL. The 13748 server MAY reference conflicting operations in the CB_SEQUENCE that 13749 precedes the CB_LAYOUTRECALL.
Because the server has already sent 13750 replies for these operations before issuing the callback, the replies 13751 may race with the CB_LAYOUTRECALL. The client MUST wait for all the 13752 referenced calls to complete and update its view of the layout state 13753 before processing the CB_LAYOUTRECALL. 13755 12.5.5.2.1.1. Get/Return Sequencing 13757 The protocol allows the client to send concurrent LAYOUTGET and 13758 LAYOUTRETURN operations to the server. The protocol does not provide 13759 any means for the server to process the requests in the same order in 13760 which they were created. However, through the use of the "seqid" 13761 field in the layout stateid, the client can determine the order in 13762 which parallel outstanding operations were processed by the server. 13763 Thus, when a layout retrieved by an outstanding LAYOUTGET operation 13764 intersects with a layout returned by an outstanding LAYOUTRETURN on 13765 the same file, the order in which the two conflicting operations are 13766 processed determines the final state of the overlapping layout. The 13767 order is determined by the "seqid" returned in each operation: the 13768 operation with the higher seqid was executed later. 13770 It is permissible for the client to send in parallel multiple 13771 LAYOUTGET operations for the same file or multiple LAYOUTRETURN 13772 operations for the same file, and a mix of both. 13774 It is permissible for the client to use the current stateid (see 13775 Section 16.2.3.1.2) for LAYOUTGET operations for example when 13776 compounding LAYOUTGETs or compounding OPEN and LAYOUTGETs. It is 13777 also permissible to use the current stateid when compounding 13778 LAYOUTRETURNs. 13780 It is permissible for the client to use the current stateid when 13781 combining LAYOUTRETURN and LAYOUTGET operations for the same file in 13782 the same COMPOUND request since the server MUST process these in 13783 order. However, if a client does send such COMPOUND requests, it 13784 MUST NOT have more than one outstanding for the same file at the same 13785 time and MUST NOT have other LAYOUTGET or LAYOUTRETURN operations 13786 outstanding at the same time for that same file. 13788 12.5.5.2.1.2. Client Considerations 13790 Consider a pNFS client that has sent a LAYOUTGET and before it 13791 receives the reply to LAYOUTGET, it receives a CB_LAYOUTRECALL for 13792 the same file with an overlapping range. There are two 13793 possibilities, which the client can distinguish via the layout 13794 stateid in the recall. 13796 1. The server processed the LAYOUTGET before issuing the recall, so 13797 the LAYOUTGET must be waited for because it may be carrying 13798 layout information that will need to be returned to deal with the 13799 CB_LAYOUTRECALL. 13801 2. The server sent the callback before receiving the LAYOUTGET. The 13802 server will not respond to the LAYOUTGET until the 13803 CB_LAYOUTRECALL is processed. 13805 If these possibilities cannot be distinguished, a deadlock could 13806 result, as the client must wait for the LAYOUTGET response before 13807 processing the recall in the first case, but that response will not 13808 arrive until after the recall is processed in the second case. Note 13809 that in the first case, the "seqid" in the layout stateid of the 13810 recall is two greater than what the client has recorded and in the 13811 second case, the "seqid" is one greater than what the client has 13812 recorded. This allows the client to disambiguate between the two 13813 cases. 
The client thus knows precisely which possibility applies. 13815 In case 1 the client knows it needs to wait for the LAYOUTGET 13816 response before processing the recall (or the client can return 13817 NFS4ERR_DELAY). 13819 In case 2 the client will not wait for the LAYOUTGET response before 13820 processing the recall, because waiting would cause deadlock. 13821 Therefore, the action at the client will only require waiting in the 13822 case that the client has not yet seen the server's earlier responses 13823 to the LAYOUTGET operation(s). 13825 The recall process can be considered completed when the final 13826 LAYOUTRETURN operation for the recalled range is completed. The 13827 LAYOUTRETURN uses the layout stateid (with seqid) specified in 13828 CB_LAYOUTRECALL. If the client uses multiple LAYOUTRETURNs in 13829 processing the recall, the first LAYOUTRETURN will use the layout 13830 stateid as specified in CB_LAYOUTRECALL. Subsequent LAYOUTRETURNs 13831 will use the highest seqid as is the usual case. 13833 12.5.5.2.1.3. Server Considerations 13835 Consider a race from the metadata server's point of view. The 13836 metadata server has sent a CB_LAYOUTRECALL and receives an 13837 overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s) 13838 that respond to the CB_LAYOUTRECALL. There are three cases: 13840 1. The client sent the LAYOUTGET before processing the 13841 CB_LAYOUTRECALL. The "seqid" in the layout stateid of LAYOUTGET 13842 is two less than the "seqid" in CB_LAYOUTRECALL. The server 13843 returns NFS4ERR_RECALLCONFLICT to the client, which indicates to 13844 the client that there is a pending recall. 13846 2. The client sent the LAYOUTGET after processing the 13847 CB_LAYOUTRECALL, but the LAYOUTGET arrived before the 13848 LAYOUTRETURN and the response to CB_LAYOUTRECALL that completed 13849 that processing. The "seqid" in the layout stateid of LAYOUTGET 13850 is equal to or greater than that of the "seqid" in 13851 CB_LAYOUTRECALL. The server has not received a response to the 13852 CB_LAYOUTRECALL, so it returns NFS4ERR_RECALLCONFLICT. 13854 3. The client sent the LAYOUTGET after processing the 13855 CB_LAYOUTRECALL, the server received the CB_LAYOUTRECALL 13856 response, but the LAYOUTGET arrived before the LAYOUTRETURN that 13857 completed that processing. The "seqid" in the layout stateid of 13858 LAYOUTGET is equal to that of the "seqid" in CB_LAYOUTRECALL. 13859 The server has received a response to the CB_LAYOUTRECALL, so it 13860 returns NFS4ERR_RETURNCONFLICT. 13862 12.5.5.2.1.4. Wraparound and Validation of Seqid 13864 The rules for layout stateid processing differ from other stateids in 13865 the protocol because the "seqid" value cannot be zero and the 13866 stateid's "seqid" value changes in a CB_LAYOUTRECALL operation. The 13867 non-zero requirement combined with the inherent parallelism of layout 13868 operations means that a set of LAYOUTGET and LAYOUTRETURN operations 13869 may contain the same value for "seqid". The server uses a slightly 13870 modified version of the modulo arithmetic as described in 13871 Section 2.10.6.1 when incrementing the layout stateid's "seqid". The 13872 modification to that modulo arithmetic description is to not use 13873 zero. The modulo arithmetic is also used for the comparisons of 13874 "seqid" values in the processing of CB_LAYOUTRECALL events as 13875 described above in Section 12.5.5.2.1.3. 
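The following C fragment is an illustrative sketch only and is not
part of the protocol definition; the helper names seqid_next() and
seqid_in_window() are hypothetical.  It shows one way an
implementation might step a layout stateid "seqid" using the modified
modulo arithmetic described above (wrapping around but never taking
the value zero) and test whether a received "seqid" falls within a
window of recent values such as the VALID_SEQID_RANGE discussed
below.

   #include <stdint.h>
   #include <stdbool.h>

   /* Advance a layout stateid "seqid", skipping zero on wraparound. */
   static uint32_t seqid_next(uint32_t seqid)
   {
       uint32_t next = seqid + 1;      /* natural 2^32 wraparound     */
       return (next == 0) ? 1 : next;  /* zero is never a valid seqid */
   }

   /*
    * Return true if "candidate" equals "current" or one of the "range"
    * values preceding it in the modified modulo order.  Written as a
    * loop for clarity; a real implementation would compute this
    * arithmetically.
    */
   static bool seqid_in_window(uint32_t current, uint32_t candidate,
                               uint32_t range)
   {
       uint32_t s = current;

       for (uint32_t back = 0; back <= range; back++) {
           if (s == candidate)
               return true;
           s = (s == 1) ? 0xFFFFFFFF : s - 1;  /* inverse of seqid_next */
       }
       return false;
   }
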
13877 Just as the server validates the "seqid" in the event of 13878 CB_LAYOUTRECALL usage, as described in Section 12.5.5.2.1.3, the 13879 server also validates the "seqid" value to ensure that it is within 13880 an appropriate range. This range represents the degree of 13881 parallelism the server supports for layout stateids. If the client 13882 is sending multiple layout operations to the server in parallel, by 13883 definition, the "seqid" value in the supplied stateid will not be the 13884 current "seqid" as held by the server. The range of parallelism 13885 spans from the highest or current "seqid" to a "seqid" value in the 13886 past. To assist in the discussion, the server's current "seqid" 13887 value for a layout stateid is defined as: SERVER_CURRENT_SEQID. The 13888 lowest "seqid" value that is acceptable to the server is represented 13889 by PAST_SEQID. And the value for the range of valid "seqid"s or 13890 range of parallelism is VALID_SEQID_RANGE. Therefore, the following 13891 holds: VALID_SEQID_RANGE = SERVER_CURRENT_SEQID - PAST_SEQID. In the 13892 following, all arithmetic is the modulo arithmetic as described 13893 above. 13895 The server MUST support a minimum VALID_SEQID_RANGE. The minimum is 13896 defined as: VALID_SEQID_RANGE = summation of 1..N of 13897 (ca_maxoperations(i) - 1) where N is the number of session fore 13898 channels and ca_maxoperations(i) is the value of the ca_maxoperations 13899 returned from CREATE_SESSION of the i'th session. The reason for 13900 minus 1 is to allow for the required SEQUENCE operation. The server 13901 MAY support a VALID_SEQID_RANGE value larger than the minimum. The 13902 maximum VALID_SEQID_RANGE is (2 ^ 32 - 2) (accounts for 0 not being a 13903 valid "seqid" value). 13905 If the server finds the "seqid" is zero, the NFS4ERR_BAD_STATEID 13906 error is returned to the client. The server further validates the 13907 "seqid" to ensure it is within the range of parallelism, 13908 VALID_SEQID_RANGE. If the "seqid" value is outside of that range, 13909 the error NFS4ERR_OLD_STATEID is returned to the client. Upon 13910 receipt of NFS4ERR_OLD_STATEID, the client updates the stateid in the 13911 layout request based on processing of other layout requests and re- 13912 sends the operation to the server. 13914 12.5.5.2.1.5. Bulk Recall and Return 13916 pNFS supports recalling and returning all layouts that are for files 13917 belonging to a particular fsid (LAYOUTRECALL4_FSID, 13918 LAYOUTRETURN4_FSID) or client ID (LAYOUTRECALL4_ALL, 13919 LAYOUTRETURN4_ALL). There are no "bulk" stateids, so detection of 13920 races via the seqid is not possible. The server MUST NOT initiate 13921 bulk recall while another recall is in progress, or the corresponding 13922 LAYOUTRETURN is in progress or pending. In the event the server 13923 sends a bulk recall while the client has pending or in progress 13924 LAYOUTRETURN, CB_LAYOUTRECALL, or LAYOUTGET, the client returns 13925 NFS4ERR_DELAY. In the event the client sends a LAYOUTGET or 13926 LAYOUTRETURN while a bulk recall is in progress, the server returns 13927 NFS4ERR_RECALLCONFLICT. If the client sends a LAYOUTGET or 13928 LAYOUTRETURN after the server receives NFS4ERR_DELAY from a bulk 13929 recall, then to ensure forward progress, the server MAY return 13930 NFS4ERR_RECALLCONFLICT. 13932 Once a CB_LAYOUTRECALL of LAYOUTRECALL4_ALL is sent, the server MUST 13933 NOT allow the client to use any layout stateid except for 13934 LAYOUTCOMMIT operations. 
Once the client receives a CB_LAYOUTRECALL 13935 of LAYOUTRECALL4_ALL, it MUST NOT use any layout stateid except for 13936 LAYOUTCOMMIT operations. Once a LAYOUTRETURN of LAYOUTRETURN4_ALL is 13937 sent, all layout stateids granted to the client ID are freed. The 13938 client MUST NOT use the layout stateids again. It MUST use LAYOUTGET 13939 to obtain new layout stateids. 13941 Once a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID is sent, the server MUST 13942 NOT allow the client to use any layout stateid that refers to a file 13943 with the specified fsid except for LAYOUTCOMMIT operations. Once the 13944 client receives a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID, it MUST NOT 13945 use any layout stateid that refers to a file with the specified fsid 13946 except for LAYOUTCOMMIT operations. Once a LAYOUTRETURN of 13947 LAYOUTRETURN4_FSID is sent, all layout stateids granted to the 13948 referenced fsid are freed. The client MUST NOT use the layout 13949 stateids for files with the referenced fsid again. It MUST use 13950 LAYOUTGET to obtain new layout stateids for files with the referenced 13951 fsid. 13953 If the server has sent a bulk CB_LAYOUTRECALL and receives a 13954 LAYOUTGET, or a LAYOUTRETURN with a stateid, the server MUST return 13955 NFS4ERR_RECALLCONFLICT. If the server has sent a bulk 13956 CB_LAYOUTRECALL and receives a LAYOUTRETURN with an lr_returntype 13957 that is not equal to the lor_recalltype of the CB_LAYOUTRECALL, the 13958 server MUST return NFS4ERR_RECALLCONFLICT. 13960 12.5.6. Revoking Layouts 13962 Parallel NFS permits servers to revoke layouts from clients that fail 13963 to respond to recalls and/or fail to renew their lease in time. 13964 Whether the server revokes the layout or not depends on the layout 13965 type, and what actions are taken with respect to the client's I/O to 13966 data servers is also layout type specific. 13968 12.5.7. Metadata Server Write Propagation 13970 Asynchronous writes written through the metadata server may be 13971 propagated lazily to the storage devices. For data written 13972 asynchronously through the metadata server, a client performing a 13973 read at the appropriate storage device is not guaranteed to see the 13974 newly written data until a COMMIT occurs at the metadata server. 13975 While the write is pending, reads to the storage device may give out 13976 either the old data, the new data, or a mixture of new and old. Upon 13977 completion of a synchronous WRITE or COMMIT (for asynchronously 13978 written data), the metadata server MUST ensure that storage devices 13979 give out the new data and that the data has been written to stable 13980 storage. If the server implements its storage in any way such that 13981 it cannot obey these constraints, then it MUST recall the layouts to 13982 prevent reads being done that cannot be handled correctly. Note that 13983 the layouts MUST be recalled prior to the server responding to the 13984 associated WRITE operations. 13986 12.6. pNFS Mechanics 13988 This section describes the operations flow taken by a pNFS client to 13989 a metadata server and storage device. 13991 When a pNFS client encounters a new FSID, it sends a GETATTR to the 13992 NFSv4.1 server for the fs_layout_type (Section 5.12.1) attribute. If 13993 the attribute returns at least one layout type, and the layout types 13994 returned are among the set supported by the client, the client knows 13995 that pNFS is a possibility for the file system.
If, from the server 13996 that returned the new FSID, the client does not have a client ID that 13997 came from an EXCHANGE_ID result that returned 13998 EXCHGID4_FLAG_USE_PNFS_MDS, it MUST send an EXCHANGE_ID to the server 13999 with the EXCHGID4_FLAG_USE_PNFS_MDS bit set. If the server's 14000 response does not have EXCHGID4_FLAG_USE_PNFS_MDS, then contrary to 14001 what the fs_layout_type attribute said, the server does not support 14002 pNFS, and the client will not be able use pNFS to that server; in 14003 this case, the server MUST return NFS4ERR_NOTSUPP in response to any 14004 pNFS operation. 14006 The client then creates a session, requesting a persistent session, 14007 so that exclusive creates can be done with single round trip via the 14008 createmode4 of GUARDED4. If the session ends up not being 14009 persistent, the client will use EXCLUSIVE4_1 for exclusive creates. 14011 If a file is to be created on a pNFS enabled file system, the client 14012 uses the OPEN operation. With the normal set of attributes that may 14013 be provided upon OPEN used for creation, there is an OPTIONAL 14014 layout_hint attribute. The client's use of layout_hint allows the 14015 client to express its preference for a layout type and its associated 14016 layout details. The use of a createmode4 of UNCHECKED4, GUARDED4, or 14017 EXCLUSIVE4_1 will allow the client to provide the layout_hint 14018 attribute at create time. The client MUST NOT use EXCLUSIVE4 (see 14019 Table 10). The client is RECOMMENDED to combine a GETATTR operation 14020 after the OPEN within the same COMPOUND. The GETATTR may then 14021 retrieve the layout_type attribute for the newly created file. The 14022 client will then know what layout type the server has chosen for the 14023 file and therefore what storage protocol the client must use. 14025 If the client wants to open an existing file, then it also includes a 14026 GETATTR to determine what layout type the file supports. 14028 The GETATTR in either the file creation or plain file open case can 14029 also include the layout_blksize and layout_alignment attributes so 14030 that the client can determine optimal offsets and lengths for I/O on 14031 the file. 14033 Assuming the client supports the layout type returned by GETATTR and 14034 it chooses to use pNFS for data access, it then sends LAYOUTGET using 14035 the filehandle and stateid returned by OPEN, specifying the range it 14036 wants to do I/O on. The response is a layout, which may be a subset 14037 of the range for which the client asked. It also includes device IDs 14038 and a description of how data is organized (or in the case of 14039 writing, how data is to be organized) across the devices. The device 14040 IDs and data description are encoded in a format that is specific to 14041 the layout type, but the client is expected to understand. 14043 When the client wants to send an I/O, it determines which device ID 14044 it needs to send the I/O command to by examining the data description 14045 in the layout. It then sends a GETDEVICEINFO to find the device 14046 address(es) of the device ID. The client then sends the I/O request 14047 one of device ID's device addresses, using the storage protocol 14048 defined for the layout type. Note that if a client has multiple I/Os 14049 to send, these I/O requests may be done in parallel. 
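The outline below is purely illustrative and uses invented helper
names (send_layoutget(), send_getdeviceinfo(), send_storage_io(), and
do_io_through_mds()); none of these names or structures are defined
by this specification.  It simply restates the flow just described in
code form: obtain a layout for the range, resolve the layout's device
ID to device addresses, and then issue the I/O with the layout type's
storage protocol, falling back to I/O through the metadata server
when no layout is available.

   #include <stdbool.h>
   #include <stdint.h>

   struct pnfs_file;   /* client per-file state: session, fh, stateid */
   struct layout;      /* result of LAYOUTGET for a byte range        */
   struct dev_addr;    /* result of GETDEVICEINFO for a device ID     */

   struct layout   *send_layoutget(struct pnfs_file *, uint64_t off,
                                   uint64_t len, bool rw);
   struct dev_addr *send_getdeviceinfo(struct pnfs_file *,
                                       struct layout *);
   int send_storage_io(struct dev_addr *, struct layout *,
                       uint64_t off, uint64_t len, bool is_write);
   int do_io_through_mds(struct pnfs_file *, uint64_t off,
                         uint64_t len, bool is_write);

   static int pnfs_do_io(struct pnfs_file *f, uint64_t off,
                         uint64_t len, bool is_write)
   {
       /* 1. Obtain a layout covering the range, using the filehandle
        *    and stateid from the earlier OPEN.                       */
       struct layout *lo = send_layoutget(f, off, len, is_write);
       if (lo == NULL)
           return do_io_through_mds(f, off, len, is_write);

       /* 2. Resolve the device ID named in the layout to its device
        *    address(es).                                             */
       struct dev_addr *da = send_getdeviceinfo(f, lo);

       /* 3. Send the I/O to a storage device using the layout type's
        *    storage protocol; independent requests may be issued in
        *    parallel.                                                */
       return send_storage_io(da, lo, off, len, is_write);
   }
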
14051 If the I/O was a WRITE, then at some point the client may want to use 14052 LAYOUTCOMMIT to commit the modification time and the new size of the 14053 file (if it believes it extended the file size) to the metadata 14054 server and the modified data to the file system. 14056 12.7. Recovery 14058 Recovery is complicated by the distributed nature of the pNFS 14059 protocol. In general, crash recovery for layouts is similar to crash 14060 recovery for delegations in the base NFSv4.1 protocol. However, the 14061 client's ability to perform I/O without contacting the metadata 14062 server introduces subtleties that must be handled correctly if the 14063 possibility of file system corruption is to be avoided. 14065 12.7.1. Recovery from Client Restart 14067 Client recovery for layouts is similar to client recovery for other 14068 lock and delegation state. When a pNFS client restarts, it will 14069 lose all information about the layouts that it previously owned. 14070 There are two methods by which the server can reclaim these resources 14071 and allow otherwise conflicting layouts to be provided to other 14072 clients. 14074 The first is through the expiry of the client's lease. If the client 14075 recovery time is longer than the lease period, the client's lease 14076 will expire and the server will know that state may be released. For 14077 layouts, the server may release the state immediately upon lease 14078 expiry or it may allow the layout to persist awaiting possible lease 14079 revival, as long as no other layout conflicts. 14081 The second is through the client restarting in less time than it 14082 takes for the lease period to expire. In such a case, the client 14083 will contact the server through the standard EXCHANGE_ID protocol. 14084 The server will find that the client's co_ownerid matches the 14085 co_ownerid of the previous client invocation, but that the verifier 14086 is different. The server uses this as a signal to release all layout 14087 state associated with the client's previous invocation. In this 14088 scenario, the data written by the client but not covered by a 14089 successful LAYOUTCOMMIT is in an undefined state; it may have been 14090 written or it may now be lost. This is acceptable behavior and it is 14091 the client's responsibility to use LAYOUTCOMMIT to achieve the 14092 desired level of stability. 14094 12.7.2. Dealing with Lease Expiration on the Client 14096 If a client believes its lease has expired, it MUST NOT send I/O to 14097 the storage device until it has validated its lease. The client can 14098 send a SEQUENCE operation to the metadata server. If the SEQUENCE 14099 operation is successful, but sr_status_flags has 14100 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, 14101 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or 14102 SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client MUST NOT use 14103 currently held layouts. The client has two choices to recover from 14104 the lease expiration. First, the client can write all modified but 14105 uncommitted data to the metadata server using the FILE_SYNC4 flag for 14106 the WRITEs, or using WRITE and COMMIT. Second, the client can reestablish a 14107 client ID and session with the server, obtain new layouts and 14108 device ID to device address mappings for the modified data ranges, and 14109 then write the data to the storage devices with the newly obtained 14110 layouts.
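For illustration only, the client-side checks described in this
section and in the paragraphs that follow might be organized along
the lines of the sketch below.  The SEQ4_STATUS_* constants are the
SEQUENCE status flags defined by this specification; the structure
and helper functions are hypothetical placeholders for the recovery
actions described in the surrounding text.

   #include <stdint.h>

   struct pnfs_client;   /* client-wide pNFS state (hypothetical) */

   void stop_using_layouts(struct pnfs_client *);
   void recover_modified_data(struct pnfs_client *);
   void recover_from_mds_restart(struct pnfs_client *);
   void recover_lease_moved(struct pnfs_client *);

   #define REVOKED_FLAGS (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED  | \
                          SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED | \
                          SEQ4_STATUS_ADMIN_STATE_REVOKED)

   static void pnfs_check_lease(struct pnfs_client *clp,
                                uint32_t sr_status_flags)
   {
       if (sr_status_flags & REVOKED_FLAGS) {
           /* Layouts may no longer be used.  Either write modified,
            * uncommitted data through the metadata server with
            * FILE_SYNC4 (or WRITE and COMMIT), or reestablish state
            * and obtain new layouts before resuming I/O to the
            * storage devices.                                       */
           stop_using_layouts(clp);
           recover_modified_data(clp);
       } else if (sr_status_flags & SEQ4_STATUS_RESTART_RECLAIM_NEEDED) {
           recover_from_mds_restart(clp);   /* see Section 12.7.4    */
       } else if (sr_status_flags & SEQ4_STATUS_LEASE_MOVED) {
           recover_lease_moved(clp);        /* see Section 11.7.7.1  */
       }
       /* Otherwise the lease has been renewed and I/O to the storage
        * devices may continue.                                      */
   }
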
14112 If sr_status_flags from the metadata server has 14113 SEQ4_STATUS_RESTART_RECLAIM_NEEDED set (or SEQUENCE returns 14114 NFS4ERR_BAD_SESSION and CREATE_SESSION returns 14115 NFS4ERR_STALE_CLIENTID) then the metadata server has restarted, and 14116 the client SHOULD recover using the methods described in 14117 Section 12.7.4. 14119 If sr_status_flags from the metadata server has 14120 SEQ4_STATUS_LEASE_MOVED set, then the client recovers by following 14121 the procedure described in Section 11.7.7.1. After that, the client 14122 may get an indication that the layout state was not moved with the 14123 file system. The client recovers as in the other applicable 14124 situations discussed in Paragraph 1 or Paragraph 2 of this section. 14126 If sr_status_flags reports no loss of state, then the lease for the 14127 layouts the client has are valid and renewed, and the client can once 14128 again send I/O requests to the storage devices. 14130 While clients SHOULD NOT send I/Os to storage devices that may extend 14131 past the lease expiration time period, this is not always possible; 14132 for example, an extended network partition that starts after the I/O 14133 is sent and does not heal until the I/O request is received by the 14134 storage device. Thus the metadata server and/or storage devices are 14135 responsible for protecting themselves from I/Os that are sent before 14136 the lease expires, but arrive after the lease expires. See 14137 Section 12.7.3. 14139 12.7.3. Dealing with Loss of Layout State on the Metadata Server 14141 This is a description of the case where all of the following are 14142 true: 14144 o the metadata server has not restarted 14146 o a pNFS client's layouts have been discarded (usually because the 14147 client's lease expired) and are invalid 14149 o an I/O from the pNFS client arrives at the storage device 14151 The metadata server and its storage devices MUST solve this by 14152 fencing the client. In other words, prevent the execution of I/O 14153 operations from the client to the storage devices after layout state 14154 loss. The details of how fencing is done are specific to the layout 14155 type. The solution for NFSv4.1 file-based layouts is described in 14156 (Section 13.11), and for other layout types in their respective 14157 external specification documents. 14159 12.7.4. Recovery from Metadata Server Restart 14161 The pNFS client will discover that the metadata server has restarted 14162 via the methods described in Section 8.4.2 and discussed in a pNFS- 14163 specific context in Paragraph 2, of Section 12.7.2. The client MUST 14164 stop using layouts and delete the device ID to device address 14165 mappings it previously received from the metadata server. Having 14166 done that, if the client wrote data to the storage device without 14167 committing the layouts via LAYOUTCOMMIT, then the client has 14168 additional work to do in order to have the client, metadata server 14169 and storage device(s) all synchronized on the state of the data. 14171 o If the client has data still modified and unwritten in the 14172 client's memory, the client has only two choices. 14174 1. The client can obtain a layout via LAYOUTGET after the 14175 server's grace period and write the data to the storage 14176 devices. 14178 2. The client can write that data through the metadata server 14179 using the WRITE (Section 18.32) operation, and then obtain 14180 layouts as desired. 
14182 o If the client asynchronously wrote data to the storage device, but 14183 still has a copy of the data in its memory, then it has available 14184 to it the recovery options listed above in the previous bullet 14185 point. If the metadata server is also in its grace period, the 14186 client has available to it the options below in the next bullet 14187 item. 14189 o The client does not have a copy of the data in its memory and the 14190 metadata server is still in its grace period. The client cannot 14191 use LAYOUTGET (within or outside the grace period) to reclaim a 14192 layout because the contents of the response from LAYOUTGET may not 14193 match what it had previously. The range might be different, or it 14194 might get the same range but the content of the layout might be 14195 different. Even if the content of the layout appears to be the 14196 same, the device IDs may map to different device addresses, and 14197 even if the device addresses are the same, the device addresses 14198 could have been assigned to a different storage device. The 14199 option of retrieving the data from the storage device and writing 14200 it to the metadata server per the recovery scenario described 14201 above is not available because, again, the mappings of range to 14202 device ID, device ID to device address, and device address to physical 14203 device are stale, and new mappings via a new LAYOUTGET do not solve 14204 the problem. 14206 The only recovery option for this scenario is to send a 14207 LAYOUTCOMMIT in reclaim mode, which the metadata server will 14208 accept as long as it is in its grace period. The use of 14209 LAYOUTCOMMIT in reclaim mode informs the metadata server that the 14210 layout has changed. It is critical that the metadata server receive 14211 this information before its grace period ends, and thus before it 14212 starts allowing updates to the file system. 14214 To send LAYOUTCOMMIT in reclaim mode, the client sets the 14215 loca_reclaim field of the operation's arguments (Section 18.42.1) 14216 to TRUE. During the metadata server's recovery grace period (and 14217 only during the recovery grace period), the metadata server is 14218 prepared to accept LAYOUTCOMMIT requests with the loca_reclaim 14219 field set to TRUE. 14221 When loca_reclaim is TRUE, the client is attempting to commit 14222 changes to the layout that occurred prior to the restart of the 14223 metadata server. The metadata server applies some consistency 14224 checks on the loca_layoutupdate field of the arguments to 14225 determine whether the client can commit the data written to the 14226 storage device to the file system. The loca_layoutupdate field is 14227 of data type layoutupdate4, and contains layout type-specific 14228 content (in the lou_body field of loca_layoutupdate). The layout 14229 type-specific information that loca_layoutupdate might have is 14230 discussed in Section 12.5.4.3. If the metadata server's 14231 consistency checks on loca_layoutupdate succeed, then the metadata 14232 server MUST commit the data (as described by the loca_offset, 14233 loca_length, and loca_layoutupdate fields of the arguments) that 14234 was written to the storage device. If the metadata server's 14235 consistency checks on loca_layoutupdate fail, the metadata server 14236 rejects the LAYOUTCOMMIT operation, and makes no changes to the 14237 file system. However, any time LAYOUTCOMMIT with loca_reclaim 14238 TRUE fails, the pNFS client has lost all the data in the range 14239 defined by <loca_offset, loca_length>.
A client can defend 14240 against this risk by caching all data in its memory, whether written 14241 synchronously or asynchronously, and by not releasing the 14242 cached data until a successful LAYOUTCOMMIT. This condition does 14243 not hold true for all layout types; for example, files-based 14244 storage devices need not suffer from this limitation. 14246 o The client does not have a copy of the data in its memory and the 14247 metadata server is no longer in its grace period; i.e., the 14248 metadata server returns NFS4ERR_NO_GRACE. As with the scenario in 14249 the above bullet item, the failure of LAYOUTCOMMIT means the data 14250 in the range <loca_offset, loca_length> is lost. The defense against 14251 the risk is the same: cache all written data on the client until a 14252 successful LAYOUTCOMMIT. 14254 12.7.5. Operations During Metadata Server Grace Period 14256 Some of the recovery scenarios thus far noted that some operations, 14257 namely WRITE and LAYOUTGET, might be permitted during the metadata 14258 server's grace period. The metadata server may allow these 14259 operations during its grace period. For LAYOUTGET, the metadata 14260 server must reliably determine that servicing such a request will not 14261 conflict with an impending LAYOUTCOMMIT reclaim request. For WRITE, 14262 it must reliably determine that it will not conflict with an 14263 impending OPEN or with a LOCK where the file has mandatory file locking 14264 enabled. 14266 As mentioned previously, some operations, namely WRITE and LAYOUTGET, 14267 may be rejected during the metadata server's grace period, because 14268 the easiest way to provide simple, valid handling during the grace 14269 period is to reject all non-reclaim pNFS requests and WRITE 14270 operations by returning the NFS4ERR_GRACE error. However, depending 14271 on the storage protocol (which is specific to the layout type) and 14272 the metadata server implementation, the metadata server may be able to 14273 determine that a particular request is safe. For example, a metadata 14274 server may save provisional allocation mappings for each file to 14275 stable storage, as well as information about potentially conflicting 14276 OPEN share modes and mandatory byte-range locks that might have been 14277 in effect at the time of restart, and use this information during the 14278 recovery grace period to determine that a WRITE request is safe. 14280 12.7.6. Storage Device Recovery 14282 Recovery from storage device restart is mostly dependent upon the 14283 layout type in use. However, there are a few general techniques a 14284 client can use if it discovers that a storage device has crashed while 14285 the client holds modified, uncommitted data that was asynchronously written. 14286 First and foremost, it is important to realize that the client is the 14287 only one that has the information necessary to recover non-committed 14288 data, since it holds the modified data and probably nothing else 14289 does. Second, the best solution is for the client to err on the side 14290 of caution and attempt to re-write the modified data through another 14291 path. 14293 The client SHOULD immediately write the data to the metadata server, 14294 with the stable field in the WRITE4args set to FILE_SYNC4. Once it 14295 does this, there is no need to wait for the original storage device. 14297 12.8.
Metadata and Storage Device Roles 14299 If the same physical hardware is used to implement both a metadata 14300 server and storage device, then the same hardware entity is to be 14301 understood to be implementing two distinct roles and it is important 14302 that it be clearly understood on behalf of which role the hardware is 14303 executing at any given time. 14305 Two sub-cases can be distinguished. 14307 1. The storage device uses NFSv4.1 as the storage protocol, i.e. 14308 same physical hardware is used to implement both a metadata and 14309 data server. See Section 13.1 for a description how multiple 14310 roles are handled. 14312 2. The storage device does not use NFSv4.1 as the storage protocol, 14313 and the same physical hardware is used to implement both a 14314 metadata and storage device. Whether distinct network addresses 14315 are used to access metadata server and storage device is 14316 immaterial, because, it is always clear to the pNFS client and 14317 server, from upper layer protocol being used (NFSv4.1 or non- 14318 NFSv4.1) what role the request to the common server network 14319 address is directed to. 14321 12.9. Security Considerations for pNFS 14323 pNFS separates file system metadata and data and provides access to 14324 both. There are pNFS-specific operations (listed in Section 12.3) 14325 that provide access to the metadata; all existing NFSv4.1 14326 conventional (non-pNFS) security mechanisms and features apply to 14327 accessing the metadata. The combination of components in a pNFS 14328 system (see Figure 1) is required to preserve the security properties 14329 of NFSv4.1 with respect to an entity accessing storage device from a 14330 client, including security countermeasures to defend against threats 14331 that NFSv4.1 provides defenses for in environments where these 14332 threats are considered significant. 14334 In some cases, the security countermeasures for connections to 14335 storage devices may take the form of physical isolation or a 14336 recommendation not to use pNFS in an environment. For example, it 14337 may be impractical to provide confidentiality protection for some 14338 storage protocols to protect against eavesdropping; in environments 14339 where eavesdropping on such protocols is of sufficient concern to 14340 require countermeasures, physical isolation of the communication 14341 channel (e.g., via direct connection from client(s) to storage 14342 device(s)) and/or a decision to forgo use of pNFS (e.g., and fall 14343 back to conventional NFSv4.1) may be appropriate courses of action. 14345 Where communication with storage devices is subject to the same 14346 threats as client to metadata server communication, the protocols 14347 used for that communication need to provide security mechanisms as 14348 strong as or no weaker than those available via RPCSEC_GSS for 14349 NFSv4.1. Except for the storage protocol used for the 14350 LAYOUT4_NFSV4_1_FILES layout (see Section 13), i.e. except for 14351 NFSv4.1, it is beyond the scope of this document to specify the 14352 security mechanisms for storage access protocols. 14354 pNFS implementations MUST NOT remove NFSv4.1's access controls. The 14355 combination of clients, storage devices, and the metadata server are 14356 responsible for ensuring that all client to storage device file data 14357 access respects NFSv4.1's ACLs and file open modes. 
This entails 14358 performing both of these checks on every access in the client, the 14359 storage device, or both (as applicable; when the storage device is an 14360 NFSv4.1 server, the storage device is ultimately responsible for 14361 controlling access as described in Section 13.9.2). If a pNFS 14362 configuration performs these checks only in the client, the risk of a 14363 misbehaving client obtaining unauthorized access is an important 14364 consideration in determining when it is appropriate to use such a 14365 pNFS configuration. Such layout types SHOULD NOT be used when 14366 client-only access checks do not provide sufficient assurance that 14367 NFSv4.1 access control is being applied correctly. (This is not a 14368 problem for the file layout type described in Section 13 because the 14369 storage access protocol for LAYOUT4_NFSV4_1_FILES is NFSv4.1, and 14370 thus the security model for storage device access via 14371 LAYOUT4_NFSv4_1_FILES is the sames as that of the metadata server.) 14372 For handling of access control specific to a layout, the reader 14373 should examine the layout specification, such as the NFSv4.1/ 14374 files-based layout (Section 13) of this document, the blocks layout 14375 [40], and objects layout [39]. 14377 13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type 14379 This section describes the semantics and format of NFSv4.1 file-based 14380 layouts for pNFS. NFSv4.1 file-based layouts uses the 14381 LAYOUT4_NFSV4_1_FILES layout type. The LAYOUT4_NFSV4_1_FILES type 14382 defines striping data across multiple NFSv4.1 data servers. 14384 13.1. Client ID and Session Considerations 14386 Sessions are a REQUIRED feature of NFSv4.1, and this extends to both 14387 the metadata server and file-based (NFSv4.1-based) data servers. 14389 The role a server plays in pNFS is determined by the result it 14390 returns from EXCHANGE_ID. The roles are: 14392 o metadata server (EXCHGID4_FLAG_USE_PNFS_MDS is set in the result 14393 eir_flags), 14395 o data server (EXCHGID4_FLAG_USE_PNFS_DS) 14396 o non-metadata server (EXCHGID4_FLAG_USE_NON_PNFS). This is an 14397 NFSv4.1 server that does not support operations (e.g. LAYOUTGET) 14398 or attributes that pertain to pNFS. 14400 The client MAY request zero or more of EXCHGID4_FLAG_USE_NON_PNFS, 14401 EXCHGID4_FLAG_USE_PNFS_DS, or EXCHGID4_FLAG_USE_PNFS_MDS, even though 14402 some combinations (e.g. EXCHGID4_FLAG_USE_NON_PNFS | 14403 EXCHGID4_FLAG_USE_PNFS_MDS) are contradictory. The server however 14404 MUST only return the following acceptable combinations: 14406 +--------------------------------------------------------+ 14407 | Acceptable Results from EXCHANGE_ID | 14408 +--------------------------------------------------------+ 14409 | EXCHGID4_FLAG_USE_PNFS_MDS | 14410 | EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS | 14411 | EXCHGID4_FLAG_USE_PNFS_DS | 14412 | EXCHGID4_FLAG_USE_NON_PNFS | 14413 | EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS | 14414 +--------------------------------------------------------+ 14416 As the above table implies, a server can have one or two roles. A 14417 server can be both a metadata server and a data server or it can be 14418 both a data server and non-metadata server. 
In addition to returning 14419 two roles in EXCHANGE_ID's results, and thus serving both roles via a 14420 common client ID, a server can serve two roles by returning a unique 14421 client ID and server owner for each role in each of two EXCHANGE_ID 14422 results, with each result indicating each role. 14424 In the case of a server with concurrent pNFS roles that are served by 14425 a common client ID, if the EXCHANGE_ID request from the client has 14426 zero or a combination of the bits set in eia_flags, the server result 14427 should set bits which represent the higher of the acceptable 14428 combination of the server roles, with a preference to match the roles 14429 requested by the client. Thus if a client request has 14430 (EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_USE_PNFS_MDS | 14431 EXCHGID4_FLAG_USE_PNFS_DS) flags set, and the server is both a 14432 metadata server and a data server, serving both the roles by a common 14433 client ID, the server SHOULD return with (EXCHGID4_FLAG_USE_PNFS_MDS 14434 | EXCHGID4_FLAG_USE_PNFS_DS) set. 14436 In the case of a server that has multiple concurrent pNFS roles, each 14437 role served by a unique client ID, if the client specifies zero or a 14438 combination of roles in the request, the server results SHOULD return 14439 only one of the roles from the combination specified by the client 14440 request. If the role specified by the server result does not match 14441 the intended use by the client, the client should send the 14442 EXCHANGE_ID specifying just the interested pNFS role. 14444 If a pNFS metadata client gets a layout that refers it to an NFSv4.1 14445 data server, it needs a client ID on that data server. If it does 14446 not yet have a client ID from the server that had the 14447 EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID results, then 14448 the client needs to send an EXCHANGE_ID to the data server, using the 14449 same co_ownerid as it sent to the metadata server, with the 14450 EXCHGID4_FLAG_USE_PNFS_DS flag set in the arguments. If the server's 14451 EXCHANGE_ID results have EXCHGID4_FLAG_USE_PNFS_DS set, then the 14452 client may use the client ID to create sessions that will exchange 14453 pNFS data operations. The client ID returned by the data server has 14454 no relationship with the client ID returned by a metadata server 14455 unless the client IDs are equal and the server owners and server 14456 scopes of the data server and metadata server are equal. 14458 In NFSv4.1, the session ID in the SEQUENCE operation implies the 14459 client ID, which in turn might be used by the server to map the 14460 stateid to the right client/server pair. However, when a data server 14461 is presented with a READ or WRITE operation with a stateid, because 14462 the stateid is associated with client ID on a metadata server, and 14463 because the session ID in the preceding SEQUENCE operation is tied to 14464 the client ID of the data server, the data server has no obvious way 14465 to determine the metadata server from the COMPOUND procedure, and 14466 thus has no way to validate the stateid. One RECOMMENDED approach is 14467 for pNFS servers to encode metadata server routing and/or identity 14468 information in the data server filehandles as returned in the layout. 
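As a purely hypothetical illustration of that recommendation, and not
something mandated or defined by this specification (filehandles are
opaque to the client, so their internal layout is entirely
server-private), a server might structure the data server filehandles
it returns in layouts along the following lines so that a data server
can identify the issuing metadata server when validating a stateid:

   #include <stdint.h>

   /* Hypothetical server-internal view of an opaque DS filehandle. */
   struct example_ds_fh {
       uint32_t      fh_format_version; /* server-private versioning   */
       uint32_t      mds_id;            /* identifies the issuing MDS  */
       uint64_t      mds_fileid;        /* the file on that MDS        */
       uint32_t      fh_flags;          /* e.g., "this is a DS handle" */
       unsigned char integrity[8];      /* checksum/MAC over the above */
   };

Any such encoding must fit within the protocol's limit on filehandle
size, and marking the handle as a data server handle also helps keep
it distinct from the filehandle returned by OPEN, as discussed below.
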
14470 If metadata server routing and/or identity information is encoded in 14471 data server filehandles, when the metadata server identity or 14472 location changes, the data server filehandles it gave out will become 14473 invalid (stale), and so the metadata server MUST first recall the 14474 layouts. Invalidating a data server filehandle does not render the 14475 NFS client's data cache invalid. The client's cache should map a 14476 data server filehandle to a metadata server filehandle, and a 14477 metadata server filehandle to cached data. 14479 If a server is both a metadata server and a data server, the server 14480 might need to distinguish operations on files that are directed to 14481 the metadata server from those that are directed to the data server. 14482 It is RECOMMENDED that the values of the filehandles returned by the 14483 LAYOUTGET operation to be different than the value of the filehandle 14484 returned by the OPEN of the same file. 14486 Another scenario is for the metadata server and the storage device to 14487 be distinct from one client's point of view, and the roles reversed 14488 from another client's point of view. For example, in the cluster 14489 file system model, a metadata server to one client might be a data 14490 server to another client. If NFSv4.1 is being used as the storage 14491 protocol, then pNFS servers need to encode the values of filehandles 14492 according to their specific roles. 14494 13.1.1. Sessions Considerations for Data Servers 14496 Section 2.10.10.2 states that a client has to keep its lease renewed 14497 in order to prevent a session from being deleted by the server. If 14498 the reply to EXCHANGE_ID has just the EXCHGID4_FLAG_USE_PNFS_DS role 14499 set, then as noted in Section 13.6 the client will not be able to 14500 determine the data server's lease_time attribute, because GETATTR 14501 will not be permitted. Instead, the rule is that any time a client 14502 receives a layout referring it to a data server that returns just the 14503 EXCHGID4_FLAG_USE_PNFS_DS role, the client MAY assume that the 14504 lease_time attribute from the metadata server that returned the 14505 layout applies to the data server. Thus the data server MUST be 14506 aware of the values of all lease_time attributes of all metadata 14507 servers it is providing I/O for, and MUST use the maximum of all such 14508 lease_time values as the lease interval for all client IDs and 14509 sessions established on it. 14511 For example, if one metadata server has a lease_time attribute of 20 14512 seconds, and a second metadata server has a lease_time attribute of 14513 10 seconds, then if both servers return layouts that refer to an 14514 EXCHGID4_FLAG_USE_PNFS_DS-only data server, the data server MUST 14515 renew a client's lease if the interval between two SEQUENCE 14516 operations on different COMPOUND requests is less than 20 seconds. 14518 13.2. File Layout Definitions 14520 The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout 14521 type, and may be applicable to other layout types. 14523 Unit. A unit is a fixed size quantity of data written to a data 14524 server. 14526 Pattern. A pattern is a method of distributing one or more equal 14527 sized units across a set of data servers. A pattern is iterated 14528 one or more times. 14530 Stripe. An stripe is a set of data distributed across a set of data 14531 servers in a pattern before that pattern repeats. 14533 Stripe Count. A stripe count is the number of units in a pattern. 14535 Stripe Width. 
A stripe width is the size of stripe in bytes. The 14536 stripe width = the stripe count * the size of the stripe unit. 14538 Hereafter, this document will refer to a unit that is a written in a 14539 pattern as a "stripe unit". 14541 A pattern may have more stripe units than data servers. If so, some 14542 data servers will have more than one stripe unit per stripe. A data 14543 server that has multiple stripe units per stripe MAY store each unit 14544 in a different data file (and depending on the implementation, will 14545 possibly assign a unique data filehandle to each data file). 14547 13.3. File Layout Data Types 14549 The high level NFSv4.1 layout types are nfsv4_1_file_layouthint4, 14550 nfsv4_1_file_layout_ds_addr4, and nfsv4_1_file_layout4. 14552 The SETATTR operation supports a layout hint attribute 14553 (Section 5.12.4). When the client sets a layout hint (data type 14554 layouthint4) with a layout type of LAYOUT4_NFSV4_1_FILES (the 14555 loh_type field), the loh_body field contains a value of data type 14556 nfsv4_1_file_layouthint4. 14558 const NFL4_UFLG_MASK = 0x0000003F; 14559 const NFL4_UFLG_DENSE = 0x00000001; 14560 const NFL4_UFLG_COMMIT_THRU_MDS = 0x00000002; 14561 const NFL4_UFLG_STRIPE_UNIT_SIZE_MASK 14562 = 0xFFFFFFC0; 14564 typedef uint32_t nfl_util4; 14566 enum filelayout_hint_care4 { 14567 NFLH4_CARE_DENSE = NFL4_UFLG_DENSE, 14569 NFLH4_CARE_COMMIT_THRU_MDS 14570 = NFL4_UFLG_COMMIT_THRU_MDS, 14572 NFLH4_CARE_STRIPE_UNIT_SIZE 14573 = 0x00000040, 14575 NFLH4_CARE_STRIPE_COUNT = 0x00000080 14576 }; 14578 /* Encoded in the loh_body field of type layouthint4: */ 14580 struct nfsv4_1_file_layouthint4 { 14581 uint32_t nflh_care; 14582 nfl_util4 nflh_util; 14583 count4 nflh_stripe_count; 14584 }; 14586 The generic layout hint structure is described in Section 3.3.19. 14587 The client uses the layout hint in the layout_hint (Section 5.12.4) 14588 attribute to indicate the preferred type of layout to be used for a 14589 newly created file. The LAYOUT4_NFSV4_1_FILES layout type-specific 14590 content for the layout hint is composed of three fields. The first 14591 field, nflh_care, is a set of flags indicating which values of the 14592 hint the client cares about. If the NFLH4_CARE_DENSE flag is set, 14593 then the client indicates in the second field, nflh_util, a 14594 preference for how the data file is packed (Section 13.4.4), which is 14595 controlled by the value of nflh_util & NFL4_UFLG_DENSE. If the 14596 NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a 14597 preference for whether the client should send COMMIT operations to 14598 the metadata server or data server (Section 13.7), which is 14599 controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS. If 14600 the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client indicates its 14601 preferred stripe unit size, which is indicated in nflh_util & 14602 NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus the stripe unit size MUST be a 14603 multiple of 64 bytes). The minimum stripe unit size is 64 bytes. If 14604 the NFLH4_CARE_STRIPE_COUNT flag is set, the client indicates in the 14605 third field, nflh_stripe_count, the stripe count. The stripe count 14606 multiplied by the stripe unit size is the stripe width. 14608 When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in 14609 the loc_type field of the lo_content field), the loc_body field of 14610 the lo_content field contains a value of data type 14611 nfsv4_1_file_layout4. 
Among other content, nfsv4_1_file_layout4 has 14612 a storage device ID (field nfl_deviceid) of data type deviceid4. The 14613 GETDEVICEINFO operation maps a device ID to a storage device address 14614 (type device_addr4). When GETDEVICEINFO returns a device address 14615 with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type 14616 field), the da_addr_body field contains a value of data type 14617 nfsv4_1_file_layout_ds_addr4. 14619 typedef netaddr4 multipath_list4<>; 14621 /* Encoded in the da_addr_body field of type device_addr4: */ 14622 struct nfsv4_1_file_layout_ds_addr4 { 14623 uint32_t nflda_stripe_indices<>; 14624 multipath_list4 nflda_multipath_ds_list<>; 14625 }; 14627 The nfsv4_1_file_layout_ds_addr4 data type represents the device 14628 address. It is composed of two fields: 14630 1. nflda_multipath_ds_list: An array of lists of data servers, where 14631 each list can be one or more elements, and each element 14632 represents a (see Section 13.5) data server address which may 14633 serve equally as the target of IO operations. The length of this 14634 array might be different than the stripe count. 14636 2. nflda_stripe_indices: An array of indices used to index into 14637 nflda_multipath_ds_list. The value of each element of 14638 nflda_stripe_indices MUST be less than the number of elements in 14639 nflda_multipath_ds_list. Each element of nflda_multipath_ds_list 14640 SHOULD be referred to by one or more elements of 14641 nflda_stripe_indices. The number of elements in 14642 nflda_stripe_indices is always equal to the stripe count. 14644 /* Encoded in the loc_body field of type layout_content4: */ 14645 struct nfsv4_1_file_layout4 { 14646 deviceid4 nfl_deviceid; 14647 nfl_util4 nfl_util; 14648 uint32_t nfl_first_stripe_index; 14649 offset4 nfl_pattern_offset; 14650 nfs_fh4 nfl_fh_list<>; 14651 }; 14653 The nfsv4_1_file_layout4 data type represents the layout. It is 14654 composed of the following fields: 14656 1. nfl_deviceid: The device ID which maps to a value of type 14657 nfsv4_1_file_layout_ds_addr4. 14659 2. nfl_util: Like the nflh_util field of data type 14660 nfsv4_1_file_layouthint4, a compact representation of how the 14661 data on a file on each data server is packed, whether the client 14662 should send COMMIT operations to the metadata server or data 14663 server, and the stripe unit size. If a server returns two or 14664 more overlapping layouts, each stripe unit size in each 14665 overlapping layout MUST be the same. 14667 3. nfl_first_stripe_index: The index into the first element of the 14668 nflda_stripe_indices array to use. 14670 4. nfl_pattern_offset: This field is the logical offset into the 14671 file where the striping pattern starts. It is required for 14672 converting the client's logical I/O offset (e.g. the current 14673 offset in a POSIX file descriptor before the read() or write() 14674 system call is sent) into the stripe unit number (see 14675 Section 13.4.1). 14677 If dense packing is used, then nfl_pattern_offset is also needed 14678 to convert the client's logical I/O offset to an offset on the 14679 file on the data server corresponding to the stripe unit number 14680 (see Section 13.4.4). 14682 Note that nfl_pattern_offset is not always the same as lo_offset. 14683 For example, via the LAYOUTGET operation, a client might request 14684 a layout starting at offset 1000 of a file that has its striping 14685 pattern start at offset 0. 14687 5. 
nfl_fh_list: An array of data server filehandles for each list of 14688 data servers in each element of the nflda_multipath_ds_list 14689 array. The number of elements in nfl_fh_list depends on whether 14690 sparse or dense packing is being used. 14692 * If sparse packing is being used, the number of elements in 14693 nfl_fh_list MUST be one of three values: 14695 + Zero. This means that filehandles used for each data 14696 server are the same as the filehandle returned by the OPEN 14697 operation from the metadata server. 14699 + One. This means that every data server uses the same 14700 filehandle: what is specified in nfl_fh_list[0]. 14702 + The same number of elements in nflda_multipath_ds_list. 14703 Thus, in this case, when issuing an I/O to any data server 14704 in nflda_multipath_ds_list[X], the filehandle in 14705 nfl_fh_list[X] MUST be used. 14707 See the discussion on sparse packing in Section 13.4.4. 14709 * If dense packing is being used, number of elements in 14710 nfl_fh_list MUST be the same as the number of elements in 14711 nflda_stripe_indices. Thus when issuing I/O to any data 14712 server in nflda_multipath_ds_list[nflda_stripe_indices[Y]], 14713 the filehandle in nfl_fh_list[Y] MUST be used. In addition, 14714 any time there exists i, and j, (i != j) such that the 14715 intersection of 14716 nflda_multipath_ds_list[nflda_stripe_indices[i]] and 14717 nflda_multipath_ds_list[nflda_stripe_indices[j]] is not empty, 14718 then nfl_fh_list[i] MUST NOT equal nfl_fh_list[j]. In other 14719 words, when dense packing is being used, if a data server 14720 appears in two or more units of a striping pattern, each 14721 reference to the data server MUST use a different filehandle. 14723 Indeed, if there are multiple striping patterns, as indicated 14724 by the presence of multiple objects of data type layout4 14725 (either returned in one or multiple LAYOUTGET operations), and 14726 a data server is the target of a unit of one pattern and 14727 another unit of another pattern, then each reference to each 14728 data server MUST use a different filehandle. 14730 See the discussion on dense packing in Section 13.4.4. 14732 The details on the interpretation of the layout are in Section 13.4. 14734 13.4. Interpreting the File Layout 14736 13.4.1. Determining the Stripe Unit Number 14738 To find the stripe unit number that corresponds to the client's 14739 logical file offset, the pattern offset will also be used. The i'th 14740 stripe unit (SUi) is: 14742 relative_offset = file_offset - nfl_pattern_offset; 14743 SUi = floor(relative_offset / stripe_unit_size); 14745 13.4.2. 
Interpreting the File Layout Using Sparse Packing 14747 When sparse packing is used, the algorithm for determining the 14748 filehandle and set of data server network addresses to write stripe 14749 unit i (SUi) to is: 14751 stripe_count = number of elements in nflda_stripe_indices; 14753 j = (SUi + nfl_first_stripe_index) % stripe_count; 14755 idx = nflda_stripe_indices[j]; 14757 fh_count = number of elements in nfl_fh_list; 14758 ds_count = number of elements in nflda_multipath_ds_list; 14760 switch (fh_count) { 14761 case ds_count: 14762 fh = nfl_fh_list[idx]; 14763 break; 14765 case 1: 14766 fh = nfl_fh_list[0]; 14767 break; 14769 case 0: 14770 fh = filehandle returned by OPEN; 14771 break; 14773 default: 14774 throw a fatal exception; 14775 break; 14776 } 14778 address_list = nflda_multipath_ds_list[idx]; 14780 The client would then select a data server from address_list, and 14781 send a READ or WRITE operation using the filehandle specified in fh. 14783 Consider the following example: 14785 Suppose we have a device address consisting of seven data servers, 14786 arranged in three equivalence (Section 13.5) classes: 14788 { A, B, C, D }, { E }, { F, G } 14790 Where A through G are network addresses. 14792 Then 14794 nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G } 14796 i.e. 14798 nflda_multipath_ds_list[0] = { A, B, C, D } 14800 nflda_multipath_ds_list[1] = { E } 14802 nflda_multipath_ds_list[2] = { F, G } 14804 Suppose the striping index array is: 14806 nflda_stripe_indices<> = { 2, 0, 1, 0 } 14808 Now suppose the client gets a layout which has a device ID that maps 14809 to the above device address. The initial index, 14811 nfl_first_stripe_index = 2, 14813 and 14815 nfl_fh_list = { 0x36, 0x87, 0x67 }. 14817 If the client wants to write to SU0, the set of valid { network 14818 address, filehandle } combinations for SUi are determined by: 14820 nfl_first_stripe_index = 2 14822 So 14824 idx = nflda_stripe_indices[(0 + 2) % 4] 14826 = nflda_stripe_indices[2] 14828 = 1 14830 So 14832 nflda_multipath_ds_list[1] = { E } 14834 and 14836 nfl_fh_list[1] = { 0x87 } 14838 The client can thus write SU0 to { 0x87, { E }, }. 14840 The destinations of the first thirteen storage units are: 14842 +-----+------------+--------------+ 14843 | SUi | filehandle | data servers | 14844 +-----+------------+--------------+ 14845 | 0 | 87 | E | 14846 | 1 | 36 | A,B,C,D | 14847 | 2 | 67 | F,G | 14848 | 3 | 36 | A,B,C,D | 14849 | 4 | 87 | E | 14850 | 5 | 36 | A,B,C,D | 14851 | 6 | 67 | F,G | 14852 | 7 | 36 | A,B,C,D | 14853 | 8 | 87 | E | 14854 | 9 | 36 | A,B,C,D | 14855 | 10 | 67 | F,G | 14856 | 11 | 36 | A,B,C,D | 14857 | 12 | 87 | E | 14858 +-----+------------+--------------+ 14860 13.4.3. 
Interpreting the File Layout Using Dense Packing 14862 When dense packing is used, the algorithm for determining the 14863 filehandle and set of data server network addresses to write stripe 14864 unit i (SUi) to is: 14866 stripe_count = number of elements in nflda_stripe_indices; 14868 j = (SUi + nfl_first_stripe_index) % stripe_count; 14870 idx = nflda_stripe_indices[j]; 14872 fh_count = number of elements in nfl_fh_list; 14873 ds_count = number of elements in nflda_multipath_ds_list; 14875 switch (fh_count) { 14876 case stripe_count: 14877 fh = nfl_fh_list[j]; 14878 break; 14880 default: 14881 throw a fatal exception; 14882 break; 14883 } 14885 address_list = nflda_multipath_ds_list[idx]; 14887 The client would then select a data server from address_list, and 14888 send a READ or WRITE operation using the filehandle specified in fh. 14890 Consider the following example (which is the same as the sparse 14891 packing example, except for the filehandle list): 14893 Suppose we have a device address consisting of seven data servers, 14894 arranged in three equivalence (Section 13.5) classes: 14896 { A, B, C, D }, { E }, { F, G } 14898 Where A through G are network addresses. 14900 Then 14902 nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G } 14904 i.e. 14906 nflda_multipath_ds_list[0] = { A, B, C, D } 14908 nflda_multipath_ds_list[1] = { E } 14910 nflda_multipath_ds_list[2] = { F, G } 14912 Suppose the striping index array is: 14914 nflda_stripe_indices<> = { 2, 0, 1, 0 } 14916 Now suppose the client gets a layout which has a device ID that maps 14917 to the above device address. The initial index, 14919 nfl_first_stripe_index = 2, 14921 and 14923 nfl_fh_list = { 0x67, 0x37, 0x87, 0x36 }. 14925 The interesting examples for dense packing are SU1 and SU3, because 14926 each stripe unit refers to the same data server list, yet MUST use a 14927 different filehandle. If the client wants to write to SU1, the set 14928 of valid { network address, filehandle } combinations for SUi are 14929 determined by: 14931 nfl_first_stripe_index = 2 14933 So 14934 j = (1 + 2) % 4 = 3 14936 idx = nflda_stripe_indices[j] 14938 = nflda_stripe_indices[3] 14940 = 0 14942 So 14944 nflda_multipath_ds_list[0] = { A, B, C, D } 14946 and 14948 nfl_fh_list[3] = { 0x36 } 14950 The client can thus write SU1 to { 0x36, { A, B, C, D }, }. 14952 For SU3, j = (3 + 2) % 4 = 1, and nflda_stripe_indices[1] = 0. Then 14953 nflda_multipath_ds_list[0] = { A, B, C, D }, and nfl_fh_list[1] = 14954 0x37. The client can thus write SU3 to { 0x37, { A, B, C, D } }. 14956 The destinations of the first thirteen storage units are: 14958 +-----+------------+--------------+ 14959 | SUi | filehandle | data servers | 14960 +-----+------------+--------------+ 14961 | 0 | 87 | E | 14962 | 1 | 36 | A,B,C,D | 14963 | 2 | 67 | F,G | 14964 | 3 | 37 | A,B,C,D | 14965 | 4 | 87 | E | 14966 | 5 | 36 | A,B,C,D | 14967 | 6 | 67 | F,G | 14968 | 7 | 37 | A,B,C,D | 14969 | 8 | 87 | E | 14970 | 9 | 36 | A,B,C,D | 14971 | 10 | 67 | F,G | 14972 | 11 | 37 | A,B,C,D | 14973 | 12 | 87 | E | 14974 +-----+------------+--------------+ 14976 13.4.4. Sparse and Dense Stripe Unit Packing 14978 The flag NFL4_UFLG_DENSE of the nfl_util4 data type (field nflh_util 14979 of the data type nfsv4_1_file_layouthint4 and field nfl_util of data 14980 type nfsv4_1_file_layout_ds_addr4) specifies how the data is packed 14981 within the data file on a data server. It allows for two different 14982 data packings: sparse and dense. 
The packing type determines the
14983 calculation that will be made to map the client-visible file offset
14984 to the offset within the data file located on the data server.

14986 If nfl_util & NFL4_UFLG_DENSE is zero, this means that sparse packing
14987 is being used.  Hence the logical offsets of the file as viewed by a
14988 client issuing READs and WRITEs directly to the metadata server are
14989 the same offsets each data server uses when storing a stripe unit.
14990 The effect then, for striping patterns consisting of at least two
14991 stripe units, is for each data server file to be sparse or holey.
14992 For example, suppose there is a pattern with three stripe units, the
14993 stripe unit size is 4096 bytes, and there are three data servers in
14994 the pattern.  Then the file on data server 1 will have stripe units 0,
14995 3, 6, 9, ... filled, data server 2's file will have stripe units 1,
14996 4, 7, 10, ... filled, and data server 3's file will have stripe units
14997 2, 5, 8, 11, ... filled.  The unfilled stripe units of each file will
14998 be holes; hence the files on each data server are sparse.

15000 If sparse packing is being used and a client attempts I/O to one of
15001 the holes, then an error MUST be returned by the data server.  Using
15002 the above example, if data server 3 received a READ or WRITE request
15003 for block 4, the data server would return NFS4ERR_PNFS_IO_HOLE.  Thus
15004 data servers need to understand the striping pattern in order to
15005 support sparse packing.

15007 If nfl_util & NFL4_UFLG_DENSE is one, this means that dense packing
15008 is being used and the data server files have no holes.  Dense packing
15009 might be selected because the data server does not (efficiently)
15010 support holey files, or because the data server cannot recognize
15011 read-ahead unless there are no holes.  If dense packing is indicated
15012 in the layout, the data files will be packed.  Using the example
15013 striping pattern and stripe unit size that was used for the sparse
15014 packing example, the corresponding dense packing would have all
15015 stripe units of all data files filled.  Logical stripe units 0, 3, 6,
15016 ... of the file would live on stripe units 0, 1, 2, ... of the file
15017 of data server 1, logical stripe units 1, 4, 7, ... of the file would
15018 live on stripe units 0, 1, 2, ... of the file of data server 2, and
15019 logical stripe units 2, 5, 8, ... of the file would live on stripe
15020 units 0, 1, 2, ... of the file of data server 3.

15022 Because dense packing does not leave holes on the data servers, the
15023 pNFS client is allowed to write to any offset of any data file of any
15024 data server in the stripe.  Thus the data servers need not know the
15025 file's striping pattern.

15027 The calculation to determine the byte offset within the data file for
15028 dense data server layouts is:

15030    stripe_width = stripe_unit_size * N;
15031      where N = number of elements in nflda_stripe_indices.

15033    relative_offset = file_offset - nfl_pattern_offset;

15035    data_file_offset = floor(relative_offset / stripe_width)
15036                       * stripe_unit_size
15037                       + relative_offset % stripe_unit_size
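The following fragment is a non-normative sketch that combines the stripe
unit calculation of Section 13.4.1 with the sparse and dense offset rules
above.  The structure and function names (example_file_layout,
example_map_offset) are invented for illustration and are not part of the
protocol; an implementation would derive the fields from nfl_util,
nfl_pattern_offset, nfl_first_stripe_index, and nflda_stripe_indices.

   #include <stdint.h>

   struct example_file_layout {
       uint64_t pattern_offset;      /* nfl_pattern_offset */
       uint32_t stripe_unit_size;    /* nfl_util & NFL4_UFLG_STRIPE_UNIT_SIZE_MASK */
       uint32_t first_stripe_index;  /* nfl_first_stripe_index */
       uint32_t stripe_count;        /* number of elements in nflda_stripe_indices */
       int      dense;               /* nonzero if nfl_util & NFL4_UFLG_DENSE */
   };

   /* Returns the index j into nflda_stripe_indices (and, for dense
    * packing, into nfl_fh_list); stores the offset to use on the data
    * server in *ds_offset.  Assumes the layout fields are valid. */
   uint32_t
   example_map_offset(const struct example_file_layout *fl,
                      uint64_t file_offset, uint64_t *ds_offset)
   {
       uint64_t relative_offset = file_offset - fl->pattern_offset;
       uint64_t sui = relative_offset / fl->stripe_unit_size;
       uint32_t j = (uint32_t)((sui + fl->first_stripe_index) %
                               fl->stripe_count);

       if (fl->dense) {
           uint64_t stripe_width =
               (uint64_t)fl->stripe_unit_size * fl->stripe_count;
           *ds_offset = (relative_offset / stripe_width)
                        * fl->stripe_unit_size
                        + relative_offset % fl->stripe_unit_size;
       } else {
           /* Sparse packing: the data server uses the same offsets as
            * the client-visible file (see above). */
           *ds_offset = file_offset;
       }
       return j;
   }

For dense packing, the returned index j also selects the filehandle
(nfl_fh_list[j]); for sparse packing, the filehandle is chosen by the
fh_count rules of Section 13.4.2.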
15039 If dense packing is being used, and a data server appears more than
15040 once in a striping pattern, then to distinguish one stripe unit from
15041 another, the data server MUST use a different filehandle.  Let's
15042 suppose there are two data servers.  Logical stripe units 0, 3, 6 are
15043 served by data server 1; logical stripe units 1, 4, 7 are served by
15044 data server 2; and logical stripe units 2, 5, 8 are also served by
15045 data server 2.  Unless data server 2 has two filehandles (each
15046 referring to a different data file), a write to
15047 logical stripe unit 1, for example, overwrites the write to logical
15048 stripe unit 2, because both logical stripe units are located in the
15049 same stripe unit (0) of data server 2.

15051 13.5.  Data Server Multipathing

15053 The NFSv4.1 file layout supports multipathing to multiple data server
15054 addresses.  Data server-level multipathing is used for bandwidth
15055 scaling via trunking (Section 2.10.5) and for higher availability of
15056 use in the case of a data server failure.  Multipathing allows the
15057 client to switch to another data server address, which may be that of
15058 another data server that is exporting the same data stripe unit,
15059 without having to contact the metadata server for a new layout.

15061 To support data server multipathing, each element of the
15062 nflda_multipath_ds_list contains an array of one or more data server
15063 network addresses.  This array (data type multipath_list4) represents
15064 a list of data servers (each identified by a network address), with
15065 it being possible that some data servers will appear in the list
15066 multiple times.

15068 The client is free to use any of the network addresses as a
15069 destination to send data server requests.  If some network addresses
15070 are less optimal paths to the data than others, then the MDS SHOULD
15071 NOT include those network addresses in an element of
15072 nflda_multipath_ds_list.  If less optimal network addresses exist to
15073 provide failover, the RECOMMENDED method to offer the addresses is
15074 to provide them in a replacement device ID to device address mapping,
15075 or a replacement device ID.  When a client finds that no data server
15076 in an element of nflda_multipath_ds_list responds, it SHOULD send a
15077 GETDEVICEINFO to attempt to replace the existing device ID to device
15078 address mappings.  If the MDS detects that all data servers
15079 represented by an element of nflda_multipath_ds_list are unavailable,
15080 the MDS SHOULD send a CB_NOTIFY_DEVICEID (if the client has indicated
15081 it wants device ID notifications for changed device IDs) to change
15082 the device ID to device address mappings to the available data
15083 servers.  If the device ID itself will be replaced, the MDS SHOULD
15084 recall all layouts with the device ID, and thus force the client to
15085 get new layouts and device ID mappings via LAYOUTGET and
15086 GETDEVICEINFO.

15088 Generally, if two network addresses appear in an element of
15089 nflda_multipath_ds_list, they will designate the same data server,
15090 the two data server addresses will support the implementation of
15091 client ID or session trunking (the latter is RECOMMENDED) as defined
15092 in Section 2.10.5, and the two data server addresses will share the
15093 same server owner, or major ID of the server owner.  It is not always
15094 necessary for the two data server addresses to designate the same
15095 server with trunking being used.  For example, the data could be
15096 read-only, and the data could consist of exact replicas.
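As an illustration of the behavior described in this section, the
non-normative sketch below tries each address within a single element of
nflda_multipath_ds_list and falls back to refreshing the device-ID-to-address
mapping when no address in the element responds.  All names prefixed with
example_ are invented for this sketch, and the two helpers are stubs
standing in for the client's real data-server I/O and GETDEVICEINFO paths.

   #include <stddef.h>

   struct example_netaddr {
       const char *na_r_netid;   /* e.g., "tcp" */
       const char *na_r_addr;    /* universal address string */
   };

   /* One element of nflda_multipath_ds_list: equivalent addresses. */
   struct example_multipath_list {
       size_t                  count;
       struct example_netaddr *addrs;
   };

   /* Stubs for illustration only; a real client would send the
    * data-server I/O and the GETDEVICEINFO operation here. */
   static int example_ds_send_io(const struct example_netaddr *a)
   {
       (void)a;
       return -1;                /* pretend the address did not respond */
   }
   static int example_getdeviceinfo_refresh(void)
   {
       return 0;                 /* pretend the mapping was refreshed */
   }

   /* Try every equivalent address in the element; if none responds,
    * ask the metadata server for an updated device address mapping. */
   static int
   example_send_with_failover(const struct example_multipath_list *mpl)
   {
       size_t i;

       for (i = 0; i < mpl->count; i++)
           if (example_ds_send_io(&mpl->addrs[i]) == 0)
               return 0;

       return example_getdeviceinfo_refresh();
   }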
15098 13.6.  Operations Sent to NFSv4.1 Data Servers

15100 Clients accessing data on an NFSv4.1 data server MUST send only the
15101 NULL procedure and COMPOUND procedures whose operations are taken
15102 only from two restricted subsets of the operations defined as valid
15103 NFSv4.1 operations.  Clients MUST use the filehandle specified by the
15104 layout when accessing data on NFSv4.1 data servers.

15106 The first of these operation subsets consists of management
15107 operations.  This subset consists of the BACKCHANNEL_CTL,
15108 BIND_CONN_TO_SESSION, CREATE_SESSION, DESTROY_CLIENTID,
15109 DESTROY_SESSION, EXCHANGE_ID, SECINFO_NO_NAME, SET_SSV, and SEQUENCE
15110 operations.  The client may use these operations in order to set up
15111 and maintain the appropriate client IDs, sessions, and security
15112 contexts involved in communication with the data server.  Henceforth,
15113 these will be referred to as data-server housekeeping operations.

15115 The second subset consists of COMMIT, READ, WRITE, and PUTFH.  These
15116 operations MUST be used with a current filehandle specified by the
15117 layout.  In the case of PUTFH, the new current filehandle MUST be one
15118 taken from the layout.  Henceforth, these will be referred to as
15119 data-server I/O operations.  As described in Section 12.5.1, a client
15120 MUST NOT send an I/O to a data server for which it does not hold a
15121 valid layout; the data server MUST reject such an I/O.

15123 Unless the server has a concurrent non-data-server personality, i.e.,
15124 EXCHANGE_ID results returned (EXCHGID4_FLAG_USE_PNFS_DS |
15125 EXCHGID4_FLAG_USE_PNFS_MDS) or (EXCHGID4_FLAG_USE_PNFS_DS |
15126 EXCHGID4_FLAG_USE_NON_PNFS), see Section 13.1, any attempted use of
15127 operations against a data server other than those specified in the
15128 two subsets above MUST return NFS4ERR_NOTSUPP to the client.

15130 When the server has concurrent data server and non-data-server
15131 personalities, each COMPOUND sent by the client MUST be constructed
15132 so that it is appropriate to one of the two personalities, and MUST
15133 NOT contain operations directed to a mix of those personalities.  The
15134 server MUST enforce this.  To understand the constraints, operations
15135 within a COMPOUND are divided into the following three classes:

15137 1.  An operation which is ambiguous regarding its personality
15138     assignment.  These include all of the data-server housekeeping
15139     operations.  Additionally, if the server has assigned filehandles
15140     so that the ones defined by the layout are the same as those used
15141     by the metadata server, all operations using such filehandles are
15142     within this class, with the following exception.  The exception
15143     is that if the operation uses a stateid that is incompatible with
15144     a data-server personality (e.g., a special stateid or a stateid
15145     with a non-zero seqid field; see Section 13.9.1), the
15146     operation is in class 3, as described below.  A COMPOUND
15147     containing multiple class 1 operations (and operations of no
15148     other class) MAY be sent to a server with multiple concurrent
15149     data server and non-data-server personalities.

15151 2.  An operation which is unambiguously referable to the data server
15152     personality.  These are data-server I/O operations where the
15153     filehandle is one that can only be validly directed to the data-
15154     server personality.

15156 3.  An operation which is unambiguously referable to the non-data-
15157     server personality.
These include all COMPOUND operations that 15158 are neither data-server housekeeping nor data-server I/O 15159 operations plus data-server I/O operations where the current fh 15160 (or the one to be made the current fh in the case of PUTFH) is 15161 one that is only valid on the metadata server or where a stateid 15162 is used that is incompatible with the data server, i.e. is a 15163 special stateid or has a non-zero seqid value. 15165 When a COMPOUND first executes an operation from class 3 above, it 15166 acts as a normal COMPOUND on any other server and the data server 15167 personality ceases to be relevant. There are no special restrictions 15168 on the operations in the COMPOUND to limit them to those for a data 15169 server. When a PUTFH is done, filehandles derived from the layout 15170 are not valid. If their format is not normally acceptable, then 15171 NFS4ERR_BADHANDLE MUST result. Similarly, current filehandles for 15172 other operations do not accept filehandles derived from layouts and 15173 are not normally usable on the metadata server. Using these will 15174 result in NFS4ERR_STALE. 15176 When a COMPOUND first executes an operation from class 2, which would 15177 be PUTFH where the filehandle is one from a layout, the COMPOUND 15178 henceforth is interpreted with respect to the data server 15179 personality. Operations outside the two classes discussed above MUST 15180 result in NFS4ERR_NOTSUPP. Filehandles are validated using the rules 15181 of the data server, resulting in NFS4ERR_BADHANDLE and/or 15182 NFS4ERR_STALE even when they would not normally do so when addressed 15183 to the non-data-server personality. Stateids must obey the rules of 15184 the data server in that any use of special stateids or stateids with 15185 non-zero seqid values must result in NFS4ERR_BAD_STATEID. 15187 Until the server first executes an operation from class 2 or class 3, 15188 the client MUST NOT depend on the operation being executed by either 15189 the data-server or the non-data-server personality. The server MUST 15190 pick one personality consistently for a given COMPOUND, with the only 15191 possible transition being a single one when the first operation from 15192 class 2 or class 3 is executed. 15194 Because of the complexity induced by assigning filehandles so they 15195 can be used on both a data server and a metadata server, it is 15196 RECOMMENDED that where the same server can have both personalities, 15197 the server assign separate unique filehandles to both personalities. 15198 This makes it unambiguous for which server a given request is 15199 intended. 15201 GETATTR and SETATTR MUST be directed to the metadata server. In the 15202 case of a SETATTR of the size attribute, the control protocol is 15203 responsible for propagating size updates/truncations to the data 15204 servers. In the case of extending WRITEs to the data servers, the 15205 new size must be visible on the metadata server once a LAYOUTCOMMIT 15206 has completed (see Section 12.5.4.2). Section 13.10, describes the 15207 mechanism by which the client is to handle data server files that do 15208 not reflect the metadata server's size. 15210 13.7. COMMIT Through Metadata Server 15212 The file layout provides two alternate means of providing for the 15213 commit of data written through data servers. 
The flag
15214 NFL4_UFLG_COMMIT_THRU_MDS in the field nfl_util of the file layout
15215 (data type nfsv4_1_file_layout4) is an indication from the metadata
15216 server to the client of the REQUIRED way of performing COMMIT, either
15217 by sending the COMMIT to the data server or to the metadata server.
15218 These two methods of dealing with the issue correspond to broad
15219 styles of implementation for a pNFS server supporting the files
15220 layout type.

15222 o  When the flag is FALSE, COMMIT operations MUST be sent to the
15223    data server to which the corresponding WRITE operations were sent.
15224    This approach is most useful when striping of files is implemented
15225    as part of the pNFS server, with the individual data servers each
15226    implementing their own file systems.

15228 o  When the flag is TRUE, COMMIT operations MUST be sent to the
15229    metadata server, rather than to the individual data servers.  This
15230    approach is most useful when the pNFS server is implemented on top
15231    of a clustered file system.  In such an implementation, sending
15232    COMMITs to multiple data servers may result in repeated writes of
15233    metadata blocks as each individual COMMIT is executed, to the
15234    detriment of write performance.  Sending a single COMMIT to the
15235    metadata server can provide more efficiency when there exists a
15236    clustered file system capable of implementing such a coordinated
15237    COMMIT.

15239    If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is TRUE, then in order to
15240    maintain the current NFSv4.1 commit and recovery model, the data
15241    servers MUST return a common writeverf verifier in all WRITE
15242    responses for a given file layout, and the metadata server's
15243    COMMIT implementation must return the same writeverf.  The value
15244    of the writeverf verifier MUST be changed at the metadata server
15245    or any data server that is referenced in the layout, whenever
15246    there is a server event that can possibly lead to loss of
15247    uncommitted data.  The scope of the verifier can be for a file or
15248    for the entire pNFS server.  It might be more difficult for the
15249    server to maintain the verifier at the file level, but the benefit
15250    is that only events that impact a given file will require recovery
15251    action.

15253 Note that if the layout specifies dense packing, then the offset used
15254 in a COMMIT sent to the MDS may differ from the offset used in a
15255 COMMIT sent to the data server.

15257 The single COMMIT to the metadata server will return a verifier, and
15258 the client should compare it to all the verifiers from the WRITEs and
15259 fail the COMMIT if there are any mismatched verifiers.  If the COMMIT
15260 to the metadata server fails, the client should re-send WRITEs for
15261 all the modified data in the file.  The client should treat modified
15262 data with a mismatched verifier as a WRITE failure and try to recover
15263 by reissuing the WRITEs to the original data server or using another
15264 path to that data if the layout has not been recalled.  Another
15265 option the client has is to get a new layout or to simply rewrite the
15266 data through the metadata server.  If nfl_util &
15267 NFL4_UFLG_COMMIT_THRU_MDS is FALSE, sending a COMMIT to the metadata
15268 server might have no effect.  If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS
15269 is FALSE, a COMMIT sent to the metadata server should be used only to
15270 commit data that was written to the metadata server.  See
15271 Section 12.7.6 for recovery options.

15273 13.8.
The Layout Iomode 15275 The layout iomode need not be used by the metadata server when 15276 servicing NFSv4.1 file-based layouts, although in some circumstances 15277 it may be useful. For example, if the server implementation supports 15278 reading from read-only replicas or mirrors, it would be useful for 15279 the server to return a layout enabling the client to do so. As such, 15280 the client SHOULD set the iomode based on its intent to read or write 15281 the data. The client may default to an iomode of LAYOUTIOMODE4_RW. 15282 The iomode need not be checked by the data servers when clients 15283 perform I/O. However, the data servers SHOULD still validate that the 15284 client holds a valid layout and return an error if the client does 15285 not. 15287 13.9. Metadata and Data Server State Coordination 15289 13.9.1. Global Stateid Requirements 15291 When the client sends I/O to a data server, the stateid used MUST NOT 15292 be a layout stateid as returned by LAYOUTGET or sent by 15293 CB_LAYOUTRECALL. Permitted stateids are based on one of the 15294 following: an open stateid (the stateid field of data type OPEN4resok 15295 as returned by OPEN), a delegation stateid (the stateid field of data 15296 types open_read_delegation4 and open_write_delegation4 as returned by 15297 OPEN or WANT_DELEGATION, or as sent by CB_PUSH_DELEG), or a stateid 15298 returned by the LOCK or LOCKU operations. The stateid sent to the 15299 data server MUST be sent with the seqid set to zero, indicating the 15300 most current version of that stateid, rather than indicating a 15301 specific non-zero seqid value. In no case is the use of special 15302 stateid values allowed. 15304 The stateid used for I/O MUST have the same effect and be subject to 15305 the same validation on a data server as it would if the I/O was being 15306 performed on the metadata server itself in the absence of pNFS. This 15307 has the implication that stateids are globally valid on both the 15308 metadata and data servers. This requires the metadata server to 15309 propagate changes in lock and open state to the data servers, so that 15310 the data servers can validate I/O accesses. This is discussed 15311 further in Section 13.9.2. Depending on when stateids are 15312 propagated, the existence of a valid stateid on the data server may 15313 act as proof of a valid layout. 15315 Clients performing I/O operations need to select an appropriate 15316 stateid based on the locks (including opens and delegations) held by 15317 the client and the various types of state-owners issuing the I/O 15318 requests. The rules for doing so when referencing data servers are 15319 somewhat different from those discussed in Section 8.2.5 which apply 15320 when accessing metadata servers. 15322 The following rules, applied in order of decreasing priority, govern 15323 the selection of the appropriate stateid: 15325 o If the client holds a delegation for the file in question, the 15326 delegation stateid should be used. 15328 o Otherwise, there must be an open stateid for the current open- 15329 owner, and that open stateid for the open file in question is 15330 used, unless mandatory locking, prevents that. See below. 15332 o If the data server had previously responded with NFS4ERR_LOCKED to 15333 use of the open stateid, then the client should use the lock 15334 stateid whenever one exists for that open file with the current 15335 lock-owner. 
15337 o Special stateids should never be used and if used the data server 15338 MUST reject the I/O with an NFS4ERR_BAD_STATEID error. 15340 13.9.2. Data Server State Propagation 15342 Since the metadata server, which handles lock and open-mode state 15343 changes, as well as ACLs, might not be co-located with the data 15344 servers where I/O access are validated, the server implementation 15345 MUST take care of propagating changes of this state to the data 15346 servers. Once the propagation to the data servers is complete, the 15347 full effect of those changes MUST be in effect at the data servers. 15348 However, some state changes need not be propagated immediately, 15349 although all changes SHOULD be propagated promptly. These state 15350 propagations have an impact on the design of the control protocol, 15351 even though the control protocol is outside of the scope of this 15352 specification. Immediate propagation refers to the synchronous 15353 propagation of state from the metadata server to the data server(s); 15354 the propagation must be complete before returning to the client. 15356 13.9.2.1. Lock State Propagation 15358 If the pNFS server supports mandatory locking, any mandatory locks on 15359 a file MUST be made effective at the data servers before the request 15360 that establishes them returns to the caller. The effect MUST be the 15361 same as if the mandatory lock state were synchronously propagated to 15362 the data servers, even though the details of the control protocol may 15363 avoid actual transfer of the state under certain circumstances. 15365 On the other hand, since advisory lock state is not used for checking 15366 I/O accesses at the data servers, there is no semantic reason for 15367 propagating advisory lock state to the data servers. Since updates 15368 to advisory locks neither confer nor remove privileges, these changes 15369 need not be propagated immediately, and may not need to be propagated 15370 promptly. The updates to advisory locks need only be propagated when 15371 the data server needs to resolve a question about a stateid. In 15372 fact, if byte-range locking is not mandatory (i.e., is advisory) the 15373 clients are advised not to use the lock-based stateids for I/O at 15374 all. The stateids returned by open are sufficient and eliminate 15375 overhead for this kind of state propagation. 15377 If a client gets back an NFS4ERR_LOCKED error from a data server, 15378 this is an indication that mandatory byte-range locking is in force. 15379 The client recovers from this by getting a byte-range lock that 15380 covers the affected range and re-sends the I/O with the stateid of 15381 the byte-range lock. 15383 13.9.2.2. Open and Deny Mode Validation 15385 Open and deny mode validation MUST be performed against the open and 15386 deny mode(s) held by the data servers. When access is reduced or a 15387 deny mode made more restrictive (because of CLOSE or DOWNGRADE) the 15388 data server MUST prevent any I/Os that would be denied if performed 15389 on the metadata server. When access is expanded, the data server 15390 MUST make sure that no requests are subsequently rejected because of 15391 open or deny issues that no longer apply, given the previous 15392 relaxation. 15394 13.9.2.3. 
File Attributes 15396 Since the SETATTR operation has the ability to modify state that is 15397 visible on both the metadata and data servers (e.g., the size), care 15398 must be taken to ensure that the resultant state across the set of 15399 data servers is consistent; especially when truncating or growing the 15400 file. 15402 As described earlier, the LAYOUTCOMMIT operation is used to ensure 15403 that the metadata is synchronized with changes made to the data 15404 servers. For the NFSv4.1-based data storage protocol, it is 15405 necessary to re-synchronize state such as the size attribute, and the 15406 setting of mtime/change/atime. See Section 12.5.4 for a full 15407 description of the semantics regarding LAYOUTCOMMIT and attribute 15408 synchronization. It should be noted, that by using an NFSv4.1-based 15409 layout type, it is possible to synchronize this state before 15410 LAYOUTCOMMIT occurs. For example, the control protocol can be used 15411 to query the attributes present on the data servers. 15413 Any changes to file attributes that control authorization or access 15414 as reflected by ACCESS calls or READs and WRITEs on the metadata 15415 server, MUST be propagated to the data servers for enforcement on 15416 READ and WRITE I/O calls. If the changes made on the metadata server 15417 result in more restrictive access permissions for any user, those 15418 changes MUST be propagated to the data servers synchronously. 15420 The OPEN operation (Section 18.16.4) does not impose any requirement 15421 that I/O operations on an open file have the same credentials as the 15422 OPEN itself (unless EXCHGID4_FLAG_BIND_PRINC_STATEID is set when 15423 EXCHANGE_ID creates the client ID) and so requires the server's READ 15424 and WRITE operations to perform appropriate access checking. Changes 15425 to ACLs also require new access checking by READ and WRITE on the 15426 server. The propagation of access right changes due to changes in 15427 ACLs may be asynchronous only if the server implementation is able to 15428 determine that the updated ACL is not more restrictive for any user 15429 specified in the old ACL. Due to the relative infrequency of ACL 15430 updates, it is suggested that all changes be propagated 15431 synchronously. 15433 13.10. Data Server Component File Size 15435 A potential problem exists when a component data file on a particular 15436 data server is grown past EOF; the problem exists for both dense and 15437 sparse layouts. Imagine the following scenario: a client creates a 15438 new file (size == 0) and writes to byte 131072; the client then seeks 15439 to the beginning of the file and reads byte 100. The client should 15440 receive 0s back as a result of the READ. However, if the READ falls 15441 on a data server other than the one that received client's original 15442 WRITE, the data server servicing the READ may still believe that the 15443 file's size is at 0 and return no data with the EOF flag set. The 15444 data server can only return 0s if it knows that the file's size has 15445 been extended. This would require the immediate propagation of the 15446 file's size to all data servers, which is potentially very costly. 15447 Therefore, the client that has initiated the extension of the file's 15448 size MUST be prepared to deal with these EOF conditions; the EOF'ed 15449 or short READs will be treated as a hole in the file and the NFS 15450 client will substitute 0s for the data when the offset is less than 15451 the client's view of the file size. 
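The zero substitution described above can be sketched as a small
post-processing step on the client's read path.  This is a non-normative
illustration; the function and parameter names are invented, and a real
client would apply the equivalent logic to the data, count, and eof flag
returned in each READ reply from a data server.

   #include <stdint.h>
   #include <string.h>

   /* buf has room for 'requested' bytes; 'returned' is the number of
    * bytes the data server actually sent; 'eof' is the EOF flag from
    * the reply; 'client_size' is the client's own view of the file
    * size (e.g., after its extending WRITEs).  Returns the number of
    * valid bytes in buf after zero-filling. */
   size_t
   example_fixup_short_read(uint8_t *buf, size_t requested,
                            size_t returned, int eof,
                            uint64_t offset, uint64_t client_size)
   {
       if ((returned < requested || eof) &&
           offset + returned < client_size) {
           /* Treat the missing range as a hole: substitute zeros up
            * to the smaller of the request length and the client's
            * view of the file size. */
           uint64_t limit = client_size - offset;
           size_t valid = (limit < requested) ? (size_t)limit : requested;

           memset(buf + returned, 0, valid - returned);
           return valid;
       }
       return returned;
   }

Only offsets below the client's own view of the file size are zero-filled;
beyond that point, the EOF indication from the data server is genuine.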
15453 The NFSv4.1 protocol only provides close to open file data cache 15454 semantics; meaning that when the file is closed all modified data is 15455 written to the server. When a subsequent OPEN of the file is done, 15456 the change attribute is inspected for a difference from a cached 15457 value for the change attribute. For the case above, this means that 15458 a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and 15459 will update the file's size and change attribute. Access from 15460 another client after that point will result in the appropriate size 15461 being returned. 15463 13.11. Layout Revocation and Fencing 15465 As described in Section 12.7, the layout type-specific storage 15466 protocol is responsible for handling the effects of I/Os started 15467 before lease expiration, extending through lease expiration. The 15468 LAYOUT4_NFSV4_1_FILES layout type can prevent all I/Os to data 15469 servers from being executed after lease expiration, without relying 15470 on a precise client lease timer and without requiring data servers to 15471 maintain lease timers. However, while LAYOUT4_NFSV4_1_FILES pNFS 15472 server is free to deny the client all access to the data servers, 15473 because it supports revocation of layouts, it is also free to perform 15474 a denial on a per file basis only when revoking a layout. 15476 In addition to lease expiration, the reasons a layout can be revoked 15477 include: client fails to respond to a CB_LAYOUTRECALL, the metadata 15478 server restarts, or administrative intervention. Regardless of the 15479 reason, once a client's layout has been revoked, the pNFS server MUST 15480 prevent the client from issuing I/O for the affected file from and to 15481 all data servers, in other words, it MUST fence the client from the 15482 affected file on the data servers. 15484 Fencing works as follows. As described in Section 13.1, in COMPOUND 15485 procedure requests to the data server, the data filehandle provided 15486 by the PUTFH operation and the stateid in the READ or WRITE operation 15487 are used to validate that the client has a valid layout for the I/O 15488 being performed, if it does not, the I/O is rejected with 15489 NFS4ERR_PNFS_NO_LAYOUT. The server can simply check the stateid, and 15490 additionally, make the data filehandle stale if the layout specified 15491 a data filehandle that is different from the metadata server's 15492 filehandle for the file (see the nfl_fh_list description in 15493 Section 13.3). 15495 Before the metadata server takes any action to invalidate layout 15496 state given out by a previous instance, it must make sure that all 15497 layout state from that previous instance are invalidated at the data 15498 servers. This means that a metadata server may not restripe a file 15499 until it has contacted all of the data servers to invalidate the 15500 layouts from the previous instance nor may it give out mandatory 15501 locks that conflict with layouts from the previous instance without 15502 either doing a specific invalidation (as it would have to do anyway) 15503 or doing a global data server invalidation. 15505 13.12. Security Considerations for the File Layout Type 15507 The NFSv4.1 file layout type MUST adhere to the security 15508 considerations outlined in Section 12.9. NFSv4.1 data servers MUST 15509 make all of the required access checks on each READ or WRITE I/O as 15510 determined by the NFSv4.1 protocol. 
If the metadata server would 15511 deny READ or WRITE operation on a given file due its ACL, mode 15512 attribute, open mode, open deny mode, mandatory lock state, or any 15513 other attributes and state, the data server MUST also deny the READ 15514 or WRITE operation. This impacts the control protocol and the 15515 propagation of state from the metadata server to the data servers; 15516 see Section 13.9.2 for more details. 15518 The methods for authentication, integrity, and privacy for file 15519 layout-based data servers are the same as those used by metadata 15520 servers. Metadata and data servers use ONC RPC security flavors to 15521 authenticate, and SECINFO and SECINFO_NO_NAME to negotiate the 15522 security mechanism and services to be used. Thus when using the 15523 LAYOUT4_NFSV4_1_FILES layout type, the impact on the RPC-based 15524 security model due to pNFS (as alluded to in Section 1.6.1 and 15525 Section 1.6.2.2) is zero. 15527 For a given file object, a metadata server MAY require different 15528 security parameters (secinfo4 value) than the data server. For a 15529 given file object with multiple data servers, the secinfo4 value 15530 SHOULD be the same across all data servers. If the secinfo4 values 15531 across a metadata server and its data servers differ for a specific 15532 file, the mapping of the principal to the server's internal user 15533 identifier MUST be the same in order for the access control checks 15534 based on ACL, mode, open and deny mode, and mandatory locking to be 15535 consistent across on the pNFS server. 15537 If an NFSv4.1 implementation supports pNFS and supports NFSv4.1 file 15538 layouts, then the implementation MUST support the SECINFO_NO_NAME 15539 operation, on both the metadata and data servers. 15541 14. Internationalization 15543 The primary issue in which NFSv4.1 needs to deal with 15544 internationalization, or I18N, is with respect to file names and 15545 other strings as used within the protocol. The choice of string 15546 representation must allow reasonable name/string access to clients 15547 which use various languages. The UTF-8 encoding of the UCS as 15548 defined by ISO10646 [20] allows for this type of access and follows 15549 the policy described in "IETF Policy on Character Sets and 15550 Languages", RFC2277 [21]. 15552 RFC3454 [18], otherwise know as "stringprep", documents a framework 15553 for using Unicode/UTF-8 in networking protocols, so as "to increase 15554 the likelihood that string input and string comparison work in ways 15555 that make sense for typical users throughout the world." A protocol 15556 must define a profile of stringprep "in order to fully specify the 15557 processing options." The remainder of this Internationalization 15558 section defines the NFSv4.1 stringprep profiles. Much of terminology 15559 used for the remainder of this section comes from stringprep. 15561 There are three UTF-8 string types defined for NFSv4.1: utf8str_cs, 15562 utf8str_cis, and utf8str_mixed. Separate profiles are defined for 15563 each. Each profile defines the following, as required by stringprep: 15565 o The intended applicability of the profile 15567 o The character repertoire that is the input and output to 15568 stringprep (which is Unicode 3.2 for referenced version of 15569 stringprep). However, NFSv4.1 implementations are not limited to 15570 3.2. 
15572 o The mapping tables from stringprep used (as described in section 3 15573 of stringprep) 15575 o Any additional mapping tables specific to the profile 15577 o The Unicode normalization used, if any (as described in section 4 15578 of stringprep) 15580 o The tables from stringprep listing of characters that are 15581 prohibited as output (as described in section 5 of stringprep) 15583 o The bidirectional string testing used, if any (as described in 15584 section 6 of stringprep) 15586 o Any additional characters that are prohibited as output specific 15587 to the profile 15589 Stringprep discusses Unicode characters, whereas NFSv4.1 renders 15590 UTF-8 characters. Since there is a one-to-one mapping from UTF-8 to 15591 Unicode, when the remainder of this document refers to Unicode, the 15592 reader should assume UTF-8. 15594 Much of the text for the profiles comes from RFC3491 [22]. 15596 14.1. Stringprep profile for the utf8str_cs type 15598 Every use of the utf8str_cs type definition in the NFSv4 protocol 15599 specification follows the profile named nfs4_cs_prep. 15601 14.1.1. Intended applicability of the nfs4_cs_prep profile 15603 The utf8str_cs type is a case sensitive string of UTF-8 characters. 15604 Its primary use in NFSv4.1 is for naming components and pathnames. 15605 Components and pathnames are stored on the server's file system. Two 15606 valid distinct UTF-8 strings might be the same after processing via 15607 the utf8str_cs profile. If the strings are two names inside a 15608 directory, the NFSv4.1 server will need to either: 15610 o disallow the creation of a second name if its post processed form 15611 collides with that of an existing name, or 15613 o allow the creation of the second name, but arrange so that after 15614 post processing, the second name is different than the post 15615 processed form of the first name. 15617 14.1.2. Character repertoire of nfs4_cs_prep 15619 The nfs4_cs_prep profile uses Unicode 3.2, as defined in stringprep's 15620 Appendix A.1. However, NFSv4.1 implementations are not limited to 15621 3.2. 15623 14.1.3. Mapping used by nfs4_cs_prep 15625 The nfs4_cs_prep profile specifies mapping using the following tables 15626 from stringprep: 15628 Table B.1 15630 Table B.2 is normally not part of the nfs4_cs_prep profile as it is 15631 primarily for dealing with case-insensitive comparisons. However, if 15632 the NFSv4.1 file server supports the case_insensitive file system 15633 attribute, and if case_insensitive is TRUE, the NFSv4.1 server MUST 15634 use Table B.2 (in addition to Table B1) when processing utf8str_cs 15635 strings, and the NFSv4.1 client MUST assume Table B.2 (in addition to 15636 Table B.1) are being used. 15638 If the case_preserving attribute is present and set to FALSE, then 15639 the NFSv4.1 server MUST use table B.2 to map case when processing 15640 utf8str_cs strings. Whether the server maps from lower to upper case 15641 or the upper to lower case is an implementation dependency. 15643 14.1.4. Normalization used by nfs4_cs_prep 15645 The nfs4_cs_prep profile does not specify a normalization form. A 15646 later revision of this specification may specify a particular 15647 normalization form. Therefore, the server and client can expect that 15648 they may receive unnormalized characters within protocol requests and 15649 responses. 
If the operating environment requires normalization, then 15650 the implementation must normalize utf8str_cs strings within the 15651 protocol before presenting the information to an application (at the 15652 client) or local file system (at the server). 15654 14.1.5. Prohibited output for nfs4_cs_prep 15656 The nfs4_cs_prep profile RECOMMENDS prohibiting the use of the 15657 following tables from stringprep: 15659 Table C.5 15661 Table C.6 15663 14.1.6. Bidirectional output for nfs4_cs_prep 15665 The nfs4_cs_prep profile does not specify any checking of 15666 bidirectional strings. 15668 14.2. Stringprep profile for the utf8str_cis type 15670 Every use of the utf8str_cis type definition in the NFSv4.1 protocol 15671 specification follows the profile named nfs4_cis_prep. 15673 14.2.1. Intended applicability of the nfs4_cis_prep profile 15675 The utf8str_cis type is a case insensitive string of UTF-8 15676 characters. Its primary use in NFSv4.1 is for naming NFS servers. 15678 14.2.2. Character repertoire of nfs4_cis_prep 15680 The nfs4_cis_prep profile uses Unicode 3.2, as defined in 15681 stringprep's Appendix A.1. However, NFSv4.1 implementations are not 15682 limited to 3.2. 15684 14.2.3. Mapping used by nfs4_cis_prep 15686 The nfs4_cis_prep profile specifies mapping using the following 15687 tables from stringprep: 15689 Table B.1 15691 Table B.2 15693 14.2.4. Normalization used by nfs4_cis_prep 15695 The nfs4_cis_prep profile specifies using Unicode normalization form 15696 KC, as described in stringprep. 15698 14.2.5. Prohibited output for nfs4_cis_prep 15700 The nfs4_cis_prep profile specifies prohibiting using the following 15701 tables from stringprep: 15703 Table C.1.2 15705 Table C.2.2 15707 Table C.3 15709 Table C.4 15711 Table C.5 15713 Table C.6 15715 Table C.7 15717 Table C.8 15719 Table C.9 15721 14.2.6. Bidirectional output for nfs4_cis_prep 15723 The nfs4_cis_prep profile specifies checking bidirectional strings as 15724 described in stringprep's section 6. 15726 14.3. Stringprep profile for the utf8str_mixed type 15728 Every use of the utf8str_mixed type definition in the NFSv4.1 15729 protocol specification follows the profile named nfs4_mixed_prep. 15731 14.3.1. Intended applicability of the nfs4_mixed_prep profile 15733 The utf8str_mixed type is a string of UTF-8 characters, with a prefix 15734 that is case sensitive, a separator equal to '@', and a suffix that 15735 is fully qualified domain name. Its primary use in NFSv4.1 is for 15736 naming principals identified in an Access Control Entry. 15738 14.3.2. Character repertoire of nfs4_mixed_prep 15740 The nfs4_mixed_prep profile uses Unicode 3.2, as defined in 15741 stringprep's Appendix A.1. However, NFSv4.1 implementations are not 15742 limited to 3.2. 15744 14.3.3. Mapping used by nfs4_cis_prep 15746 For the prefix and the separator of a utf8str_mixed string, the 15747 nfs4_mixed_prep profile specifies mapping using the following table 15748 from stringprep: 15750 Table B.1 15752 For the suffix of a utf8str_mixed string, the nfs4_mixed_prep profile 15753 specifies mapping using the following tables from stringprep: 15755 Table B.1 15757 Table B.2 15759 14.3.4. Normalization used by nfs4_mixed_prep 15761 The nfs4_mixed_prep profile specifies using Unicode normalization 15762 form KC, as described in stringprep. 15764 14.3.5. 
Prohibited output for nfs4_mixed_prep 15766 The nfs4_mixed_prep profile specifies prohibiting using the following 15767 tables from stringprep: 15769 Table C.1.2 15771 Table C.2.2 15773 Table C.3 15775 Table C.4 15777 Table C.5 15779 Table C.6 15781 Table C.7 15783 Table C.8 15785 Table C.9 15787 14.3.6. Bidirectional output for nfs4_mixed_prep 15789 The nfs4_mixed_prep profile specifies checking bidirectional strings 15790 as described in stringprep's section 6. 15792 14.4. UTF-8 Capabilities 15794 const FSCHARSET_CAP4_CONTAINS_NON_UTF8 = 0x1; 15795 const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 = 0x2; 15797 typedef uint32_t fs_charset_cap4; 15798 Because some operating environments and file systems do not enforce 15799 character set encodings, NFSv4.1 supports the fs_charset_cap 15800 attribute (Section 5.8.2.11) that indicates to the client a file 15801 system's UTF-8 capabilities. The attribute is an integer containing 15802 a pair of flags. The first flag is FSCHARSET_CAP4_CONTAINS_NON_UTF8, 15803 which, if set to one tells the client the file system contains non- 15804 UTF-8 characters, and the server will not convert non-UTF characters 15805 to UTF-8 if the client reads a symlink or directory, nor will 15806 operations with component names or pathnames in the arguments convert 15807 the strings to UTF-8. The second flag is 15808 FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 which if set to one, indicates that 15809 the server will accept (and generate) only UTF-8 characters on the 15810 file system. If FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to one, 15811 FSCHARSET_CAP4_CONTAINS_NON_UTF8 MUST be set to zero. 15812 FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 SHOULD always be set to one. 15814 14.5. UTF-8 Related Errors 15816 Where the client sends an invalid UTF-8 string, the server should 15817 return NFS4ERR_INVAL (see Table 5). This includes cases in which 15818 inappropriate prefixes are detected and where the count includes 15819 trailing bytes that do not constitute a full UCS character. 15821 Where the client supplied string is valid UTF-8 but contains 15822 characters that are not supported by the server as a value for that 15823 string (e.g. names containing characters outside of Unicode plane 0 15824 on filesystems that fail to support such characters despite their 15825 presence in the Unicode standard), the server should return 15826 NFS4ERR_BADCHAR. 15828 Where a UTF-8 string is used as a file name, and the file system, 15829 while supporting all of the characters within the name, does not 15830 allow that particular name to be used, the server should return the 15831 error NFS4ERR_BADNAME (Table 5). This includes situations in which 15832 the server file system imposes a normalization constraint on name 15833 strings, but will also include such situations as file system 15834 prohibitions of "." and ".." as file names for certain operations, 15835 and other such constraints. 15837 15. Error Values 15839 NFS error numbers are assigned to failed operations within a Compound 15840 (COMPOUND or CB_COMPOUND) request. A Compound request contains a 15841 number of NFS operations that have their results encoded in sequence 15842 in a Compound reply. The results of successful operations will 15843 consist of an NFS4_OK status followed by the encoded results of the 15844 operation. If an NFS operation fails, an error status will be 15845 entered in the reply and the Compound request will be terminated. 15847 15.1. 
Error Definitions 15849 Protocol Error Definitions 15851 +-----------------------------------+--------+-------------------+ 15852 | Error | Number | Description | 15853 +-----------------------------------+--------+-------------------+ 15854 | NFS4_OK | 0 | Section 15.1.3.1 | 15855 | NFS4ERR_ACCESS | 13 | Section 15.1.6.1 | 15856 | NFS4ERR_ATTRNOTSUPP | 10032 | Section 15.1.15.1 | 15857 | NFS4ERR_ADMIN_REVOKED | 10047 | Section 15.1.5.1 | 15858 | NFS4ERR_BACK_CHAN_BUSY | 10057 | Section 15.1.12.1 | 15859 | NFS4ERR_BADCHAR | 10040 | Section 15.1.7.1 | 15860 | NFS4ERR_BADHANDLE | 10001 | Section 15.1.2.1 | 15861 | NFS4ERR_BADIOMODE | 10049 | Section 15.1.10.1 | 15862 | NFS4ERR_BADLAYOUT | 10050 | Section 15.1.10.2 | 15863 | NFS4ERR_BADNAME | 10041 | Section 15.1.7.2 | 15864 | NFS4ERR_BADOWNER | 10039 | Section 15.1.15.2 | 15865 | NFS4ERR_BADSESSION | 10052 | Section 15.1.11.1 | 15866 | NFS4ERR_BADSLOT | 10053 | Section 15.1.11.2 | 15867 | NFS4ERR_BADTYPE | 10007 | Section 15.1.4.1 | 15868 | NFS4ERR_BADXDR | 10036 | Section 15.1.1.1 | 15869 | NFS4ERR_BAD_COOKIE | 10003 | Section 15.1.1.2 | 15870 | NFS4ERR_BAD_HIGH_SLOT | 10077 | Section 15.1.11.3 | 15871 | NFS4ERR_BAD_RANGE | 10042 | Section 15.1.8.1 | 15872 | NFS4ERR_BAD_SEQID | 10026 | Section 15.1.16.1 | 15873 | NFS4ERR_BAD_SESSION_DIGEST | 10051 | Section 15.1.12.2 | 15874 | NFS4ERR_BAD_STATEID | 10025 | Section 15.1.5.2 | 15875 | NFS4ERR_CB_PATH_DOWN | 10048 | Section 15.1.11.4 | 15876 | NFS4ERR_CLID_INUSE | 10017 | Section 15.1.13.2 | 15877 | NFS4ERR_CLIENTID_BUSY | 10074 | Section 15.1.13.1 | 15878 | NFS4ERR_COMPLETE_ALREADY | 10054 | Section 15.1.9.1 | 15879 | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | 10055 | Section 15.1.11.6 | 15880 | NFS4ERR_DEADLOCK | 10045 | Section 15.1.8.2 | 15881 | NFS4ERR_DEADSESSION | 10078 | Section 15.1.11.5 | 15882 | NFS4ERR_DELAY | 10008 | Section 15.1.1.3 | 15883 | NFS4ERR_DELEG_ALREADY_WANTED | 10056 | Section 15.1.14.1 | 15884 | NFS4ERR_DELEG_REVOKED | 10087 | Section 15.1.5.3 | 15885 | NFS4ERR_DENIED | 10010 | Section 15.1.8.3 | 15886 | NFS4ERR_DIRDELEG_UNAVAIL | 10084 | Section 15.1.14.2 | 15887 | NFS4ERR_DQUOT | 69 | Section 15.1.4.2 | 15888 | NFS4ERR_ENCR_ALG_UNSUPP | 10079 | Section 15.1.13.3 | 15889 | NFS4ERR_EXIST | 17 | Section 15.1.4.3 | 15890 | NFS4ERR_EXPIRED | 10011 | Section 15.1.5.4 | 15891 | NFS4ERR_FBIG | 27 | Section 15.1.4.4 | 15892 | NFS4ERR_FHEXPIRED | 10014 | Section 15.1.2.2 | 15893 | NFS4ERR_FILE_OPEN | 10046 | Section 15.1.4.5 | 15894 | NFS4ERR_GRACE | 10013 | Section 15.1.9.2 | 15895 | NFS4ERR_HASH_ALG_UNSUPP | 10072 | Section 15.1.13.4 | 15896 | NFS4ERR_INVAL | 22 | Section 15.1.1.4 | 15897 | NFS4ERR_IO | 5 | Section 15.1.4.6 | 15898 | NFS4ERR_ISDIR | 21 | Section 15.1.2.3 | 15899 | NFS4ERR_LAYOUTTRYLATER | 10058 | Section 15.1.10.3 | 15900 | NFS4ERR_LAYOUTUNAVAILABLE | 10059 | Section 15.1.10.4 | 15901 | NFS4ERR_LEASE_MOVED | 10031 | Section 15.1.16.2 | 15902 | NFS4ERR_LOCKED | 10012 | Section 15.1.8.4 | 15903 | NFS4ERR_LOCKS_HELD | 10037 | Section 15.1.8.5 | 15904 | NFS4ERR_LOCK_NOTSUPP | 10043 | Section 15.1.8.6 | 15905 | NFS4ERR_LOCK_RANGE | 10028 | Section 15.1.8.7 | 15906 | NFS4ERR_MINOR_VERS_MISMATCH | 10021 | Section 15.1.3.2 | 15907 | NFS4ERR_MLINK | 31 | Section 15.1.4.7 | 15908 | NFS4ERR_MOVED | 10019 | Section 15.1.2.4 | 15909 | NFS4ERR_NAMETOOLONG | 63 | Section 15.1.7.3 | 15910 | NFS4ERR_NOENT | 2 | Section 15.1.4.8 | 15911 | NFS4ERR_NOFILEHANDLE | 10020 | Section 15.1.2.5 | 15912 | NFS4ERR_NOMATCHING_LAYOUT | 10060 | Section 15.1.10.5 | 15913 | NFS4ERR_NOSPC | 28 | 
Section 15.1.4.9 | 15914 | NFS4ERR_NOTDIR | 20 | Section 15.1.2.6 | 15915 | NFS4ERR_NOTEMPTY | 66 | Section 15.1.4.10 | 15916 | NFS4ERR_NOTSUPP | 10004 | Section 15.1.1.5 | 15917 | NFS4ERR_NOT_ONLY_OP | 10081 | Section 15.1.3.3 | 15918 | NFS4ERR_NOT_SAME | 10027 | Section 15.1.15.3 | 15919 | NFS4ERR_NO_GRACE | 10033 | Section 15.1.9.3 | 15920 | NFS4ERR_NXIO | 6 | Section 15.1.16.3 | 15921 | NFS4ERR_OLD_STATEID | 10024 | Section 15.1.5.5 | 15922 | NFS4ERR_OPENMODE | 10038 | Section 15.1.8.8 | 15923 | NFS4ERR_OP_ILLEGAL | 10044 | Section 15.1.3.4 | 15924 | NFS4ERR_OP_NOT_IN_SESSION | 10071 | Section 15.1.3.5 | 15925 | NFS4ERR_PERM | 1 | Section 15.1.6.2 | 15926 | NFS4ERR_PNFS_IO_HOLE | 10075 | Section 15.1.10.6 | 15927 | NFS4ERR_PNFS_NO_LAYOUT | 10080 | Section 15.1.10.7 | 15928 | NFS4ERR_RECALLCONFLICT | 10061 | Section 15.1.14.3 | 15929 | NFS4ERR_RECLAIM_BAD | 10034 | Section 15.1.9.4 | 15930 | NFS4ERR_RECLAIM_CONFLICT | 10035 | Section 15.1.9.5 | 15931 | NFS4ERR_REJECT_DELEG | 10085 | Section 15.1.14.4 | 15932 | NFS4ERR_REP_TOO_BIG | 10066 | Section 15.1.3.6 | 15933 | NFS4ERR_REP_TOO_BIG_TO_CACHE | 10067 | Section 15.1.3.7 | 15934 | NFS4ERR_REQ_TOO_BIG | 10065 | Section 15.1.3.8 | 15935 | NFS4ERR_RESTOREFH | 10030 | Section 15.1.16.4 | 15936 | NFS4ERR_RETRY_UNCACHED_REP | 10068 | Section 15.1.3.9 | 15937 | NFS4ERR_RETURNCONFLICT | 10086 | Section 15.1.10.8 | 15938 | NFS4ERR_ROFS | 30 | Section 15.1.4.11 | 15939 | NFS4ERR_SAME | 10009 | Section 15.1.15.4 | 15940 | NFS4ERR_SHARE_DENIED | 10015 | Section 15.1.8.9 | 15941 | NFS4ERR_SEQUENCE_POS | 10064 | Section 15.1.3.10 | 15942 | NFS4ERR_SEQ_FALSE_RETRY | 10076 | Section 15.1.11.7 | 15943 | NFS4ERR_SEQ_MISORDERED | 10063 | Section 15.1.11.8 | 15944 | NFS4ERR_SERVERFAULT | 10006 | Section 15.1.1.6 | 15945 | NFS4ERR_STALE | 70 | Section 15.1.2.7 | 15946 | NFS4ERR_STALE_CLIENTID | 10022 | Section 15.1.13.5 | 15947 | NFS4ERR_STALE_STATEID | 10023 | Section 15.1.16.5 | 15948 | NFS4ERR_SYMLINK | 10029 | Section 15.1.2.8 | 15949 | NFS4ERR_TOOSMALL | 10005 | Section 15.1.1.7 | 15950 | NFS4ERR_TOO_MANY_OPS | 10070 | Section 15.1.3.11 | 15951 | NFS4ERR_UNKNOWN_LAYOUTTYPE | 10062 | Section 15.1.10.9 | 15952 | NFS4ERR_UNSAFE_COMPOUND | 10069 | Section 15.1.3.12 | 15953 | NFS4ERR_WRONGSEC | 10016 | Section 15.1.6.3 | 15954 | NFS4ERR_WRONG_CRED | 10082 | Section 15.1.6.4 | 15955 | NFS4ERR_WRONG_TYPE | 10083 | Section 15.1.2.9 | 15956 | NFS4ERR_XDEV | 18 | Section 15.1.4.12 | 15957 +-----------------------------------+--------+-------------------+ 15959 Table 5 15961 15.1.1. General Errors 15963 This section deals with errors that are applicable to a broad set of 15964 different purposes. 15966 15.1.1.1. NFS4ERR_BADXDR (Error Code 10036) 15968 The arguments for this operation do not match those specified in the 15969 XDR definition. This includes situations in which the request ends 15970 before all the arguments have been seen. Note that this error 15971 applies when fixed enumerations (these include booleans) have a value 15972 within the input stream which is not valid for the enum. A replier 15973 may pre-parse all operations for a Compound procedure before doing 15974 any operation execution and return RPC-level XDR errors in that case. 15976 15.1.1.2. NFS4ERR_BAD_COOKIE (Error Code 10003) 15978 Used for operations that provide a set of information indexed by some 15979 quantity provided by the client or cookie sent by the server for an 15980 earlier invocation. Where the value cannot be used for its intended 15981 purpose, this error results. 
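As an illustration of the cookie handling described above, the following C sketch shows a client-side enumeration loop that resumes from a saved cookie and recovers from NFS4ERR_BAD_COOKIE by restarting from cookie zero (which designates the start of the directory). This sketch is illustrative only and is not part of the protocol; readdir_chunk() and struct readdir_result are hypothetical stand-ins for a real NFSv4.1 client interface.

   /*
    * Hypothetical sketch: resuming a directory enumeration from a saved
    * cookie and restarting from zero when the server reports
    * NFS4ERR_BAD_COOKIE.  readdir_chunk() stands in for an actual
    * READDIR request; here it simply pretends the directory holds three
    * entries so that the program terminates.
    */
   #include <stdio.h>
   #include <stdint.h>
   #include <stdbool.h>

   #define NFS4_OK            0
   #define NFS4ERR_BAD_COOKIE 10003

   struct readdir_result {
       int      status;   /* NFS4_OK or an NFS4ERR_* value               */
       uint64_t cookie;   /* cookie to pass on the next call             */
       bool     eof;      /* true once the whole directory has been read */
   };

   static struct readdir_result readdir_chunk(uint64_t cookie)
   {
       struct readdir_result r = { NFS4_OK, cookie + 1, cookie + 1 >= 3 };
       printf("READDIR with cookie %llu\n", (unsigned long long)cookie);
       return r;
   }

   int main(void)
   {
       uint64_t cookie = 0;          /* zero means "start of directory" */

       for (;;) {
           struct readdir_result r = readdir_chunk(cookie);

           if (r.status == NFS4ERR_BAD_COOKIE) {
               /* The saved cookie can no longer be used for its
                * intended purpose; discard cached entries and restart. */
               cookie = 0;
               continue;
           }
           if (r.status != NFS4_OK || r.eof)
               break;
           cookie = r.cookie;        /* remember where to resume */
       }
       return 0;
   }

The same recovery pattern applies to other operations that return server-provided cookies: when a saved cookie is rejected, a simple recovery is to discard cached results and restart the enumeration from the beginning.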
15983 15.1.1.3. NFS4ERR_DELAY (Error Code 10008) 15985 For any of a number of reasons, the replier could not process this 15986 operation in what was deemed a reasonable time. The client should 15987 wait and then try the request with a new slot and sequence value. 15989 Some examples of situations that might lead to this error: 15991 o A server that supports hierarchical storage receives a request to 15992 process a file that had been migrated. 15994 o An operation requires a delegation recall to proceed, and waiting 15995 for this delegation recall makes processing this request in a 15996 timely fashion impossible. 15998 In such cases, the error NFS4ERR_DELAY allows these preparatory 15999 operations to proceed without holding up client resources such as a 16000 session slot. After delaying for a period of time, the client can then 16001 re-send the operation in question (but not with the same slot ID and 16002 sequence ID; one or both MUST be different on the re-send). 16004 Note that without the ability to return NFS4ERR_DELAY and the 16005 client's willingness to re-send when receiving it, deadlock might 16006 well result. E.g., if a recall is done, and if the delegation return 16007 or operations preparatory to delegation return are held up by other 16008 operations that need the delegation to be returned, session slots 16009 might not be available. The result could be deadlock.
16011 15.1.1.4. NFS4ERR_INVAL (Error Code 22) 16013 The arguments for this operation are not valid for some reason, even 16014 though they do match those specified in the XDR definition for the 16015 request. 16017 15.1.1.5. NFS4ERR_NOTSUPP (Error Code 10004) 16019 Operation not supported, either because the operation is an OPTIONAL 16020 one and is not supported by this server or because the operation MUST 16021 NOT be implemented in the current minor version. 16023 15.1.1.6. NFS4ERR_SERVERFAULT (Error Code 10006) 16025 An error occurred on the server which does not map to any of the 16026 specific legal NFSv4.1 protocol error values. The client should 16027 translate this into an appropriate error. UNIX clients may choose to 16028 translate this to EIO. 16030 15.1.1.7. NFS4ERR_TOOSMALL (Error Code 10005) 16032 Used where an operation returns a variable amount of data, with a 16033 limit specified by the client. Where the data returned cannot be fit 16034 within the limit specified by the client, this error results.
16036 15.1.2. Filehandle Errors 16038 These errors deal with the situation in which the current or saved 16039 filehandle, or the filehandle passed to PUTFH intended to become the 16040 current filehandle, is invalid in some way. This includes situations 16041 in which the filehandle is a valid filehandle in general but is not 16042 of the appropriate object type for the current operation. 16044 Where the error description indicates a problem with the current or 16045 saved filehandle, it is to be understood that filehandles are only 16046 checked for the condition if they are implicit arguments of the 16047 operation in question. 16049 15.1.2.1. NFS4ERR_BADHANDLE (Error Code 10001) 16051 Illegal NFS filehandle for the current server. The current file 16052 handle failed internal consistency checks. Once accepted as valid 16053 (by PUTFH), no subsequent status change can cause the filehandle to 16054 generate this error. 16056 15.1.2.2.
NFS4ERR_FHEXPIRED (Error Code 10014) 16058 A current or saved filehandle which is an argument to the current 16059 operation is volatile and has expired at the server. 16061 15.1.2.3. NFS4ERR_ISDIR (Error Code 21) 16063 The current or saved filehandle designates a directory when the 16064 current operation does not allow a directory to be accepted as the 16065 target of this operation. 16067 15.1.2.4. NFS4ERR_MOVED (Error Code 10019) 16069 The file system which contains the current filehandle object is not 16070 present at the server. It may have been relocated, migrated to 16071 another server or may have never been present. The client may obtain 16072 the new file system location by obtaining the "fs_locations" or 16073 "fs_locations_info" attribute for the current filehandle. For 16074 further discussion, refer to Section 11.2. 16076 15.1.2.5. NFS4ERR_NOFILEHANDLE (Error Code 10020) 16078 The logical current or saved filehandle value is required by the 16079 current operation and is not set. This may be a result of a 16080 malformed COMPOUND operation (i.e. no PUTFH or PUTROOTFH before an 16081 operation that requires the current filehandle be set). 16083 15.1.2.6. NFS4ERR_NOTDIR (Error Code 20) 16085 The current (or saved) filehandle designates an object which is not a 16086 directory for an operation in which a directory is required. 16088 15.1.2.7. NFS4ERR_STALE (Error Code 70) 16090 The current or saved filehandle value designating an argument to the 16091 current operation is invalid. The file referred to by that filehandle 16092 no longer exists or access to it has been revoked. 16094 15.1.2.8. NFS4ERR_SYMLINK (Error Code 10029) 16096 The current filehandle designates a symbolic link when the current 16097 operation does not allow a symbolic link as the target. 16099 15.1.2.9. NFS4ERR_WRONG_TYPE (Error Code 10083) 16101 The current (or saved) filehandle designates an object which is of an 16102 invalid type for the current operation and there is no more specific 16103 error (such as NFS4ERR_ISDIR or NFS4ERR_SYMLINK) that applies. Note 16104 that in NFSv4.0, such situations generally resulted in the less 16105 specific error NFS4ERR_INVAL.
16107 15.1.3. Compound Structure Errors 16109 This section deals with errors that relate to the overall structure of a 16110 Compound request (by which we mean to include both COMPOUND and 16111 CB_COMPOUND), rather than to particular operations. 16113 There are a number of basic constraints on the operations that may 16114 appear in a Compound request. Sessions adds to these basic 16115 constraints by requiring a Sequence operation (either SEQUENCE or 16116 CB_SEQUENCE) at the start of the Compound. 16118 15.1.3.1. NFS4_OK (Error Code 0) 16120 Indicates the operation completed successfully, in that all of the 16121 constituent operations completed without error. 16123 15.1.3.2. NFS4ERR_MINOR_VERS_MISMATCH (Error Code 10021) 16125 The minor version specified is not one that the current listener 16126 supports. This value is returned in the overall status for the 16127 Compound but is not associated with a specific operation since the 16128 results will specify a result count of zero. 16130 15.1.3.3. NFS4ERR_NOT_ONLY_OP (Error Code 10081) 16132 Certain operations, which are allowed to be executed outside of a 16133 session, MUST be the only operation within a COMPOUND. This error 16134 results when that constraint is not met. 16136 15.1.3.4.
NFS4ERR_OP_ILLEGAL (Error Code 10044) 16138 The operation code is not a valid one for the current Compound 16139 procedure. The opcode in the result stream matched with this error 16140 is the ILLEGAL value, although the value that appears in the request 16141 stream may be different. Where an illegal value appears and the 16142 replier pre-parses all operations for a Compound procedure before 16143 doing any operation execution, an RPC-level XDR error may be returned 16144 in this case. 16146 15.1.3.5. NFS4ERR_OP_NOT_IN_SESSION (Error Code 10071) 16148 Most forward operations and all callback operations are only valid 16149 within the context of a session, so that the Compound request in 16150 question MUST begin with a Sequence operation. If an attempt is made 16151 to execute these operations outside the context of a session, this 16152 error results. 16154 15.1.3.6. NFS4ERR_REP_TOO_BIG (Error Code 10066) 16156 The reply to a Compound would exceed the channel's negotiated maximum 16157 response size. 16159 15.1.3.7. NFS4ERR_REP_TOO_BIG_TO_CACHE (Error Code 10067) 16161 The reply to a Compound would exceed the channel's negotiated maximum 16162 size for replies cached in the reply cache when the Sequence for the 16163 current request specifies that this request is to be cached. 16165 15.1.3.8. NFS4ERR_REQ_TOO_BIG (Error Code 10065) 16167 The Compound request exceeds the channel's negotiated maximum size 16168 for requests. 16170 15.1.3.9. NFS4ERR_RETRY_UNCACHED_REP (Error Code 10068) 16172 The requester has attempted a retry of a Compound which it previously 16173 requested not be placed in the reply cache. 16175 15.1.3.10. NFS4ERR_SEQUENCE_POS (Error Code 10064) 16177 A Sequence operation appeared in a position other than the first 16178 operation of a Compound request. 16180 15.1.3.11. NFS4ERR_TOO_MANY_OPS (Error Code 10070) 16182 The Compound request has too many operations, exceeding the count 16183 negotiated when the session was created. 16185 15.1.3.12. NFS4ERR_UNSAFE_COMPOUND (Error Code 10069) 16187 The client has sent a COMPOUND request with an unsafe mix of 16188 operations, specifically with a non-idempotent operation changing the 16189 current filehandle which is not followed by a GETFH.
16191 15.1.4. File System Errors 16193 These errors describe situations which occurred in the underlying 16194 file system implementation rather than in the protocol or any NFSv4.x 16195 feature. 16197 15.1.4.1. NFS4ERR_BADTYPE (Error Code 10007) 16199 An attempt was made to create an object with an inappropriate type 16200 specified to CREATE. This may be because the type is undefined, 16201 because it is a type not supported by the server, or because it is a 16202 type for which create is not intended, such as a regular file or named 16203 attribute, for which OPEN is used to do the file creation. 16205 15.1.4.2. NFS4ERR_DQUOT (Error Code 69) 16207 Resource (quota) hard limit exceeded. The user's resource limit on 16208 the server has been exceeded. 16210 15.1.4.3. NFS4ERR_EXIST (Error Code 17) 16212 A file of the specified target name (when creating, renaming or 16213 linking) already exists. 16215 15.1.4.4. NFS4ERR_FBIG (Error Code 27) 16217 File too large. The operation would have caused a file to grow 16218 beyond the server's limit. 16220 15.1.4.5. NFS4ERR_FILE_OPEN (Error Code 10046) 16222 The operation is not allowed because a file involved in the operation 16223 is currently open. Servers may, but are not required to, disallow 16224 linking to, removing, or renaming open files.
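The file system errors in this section correspond closely to conditions that a server's underlying file system reports locally. As a purely illustrative aid (not part of the protocol), the following C sketch shows one plausible way a server built over a POSIX file system might map local errno values to NFSv4.1 status codes from Table 5, falling back to NFS4ERR_SERVERFAULT when no specific protocol error applies; the function name and the particular selection of mappings are assumptions of this sketch.

   /*
    * Hypothetical sketch: mapping POSIX errno values to NFSv4.1 error
    * codes.  The numeric values are those assigned in Table 5; the
    * choice of mappings is illustrative only.
    */
   #include <errno.h>

   #define NFS4ERR_PERM         1
   #define NFS4ERR_NOENT        2
   #define NFS4ERR_IO           5
   #define NFS4ERR_ACCESS      13
   #define NFS4ERR_EXIST       17
   #define NFS4ERR_XDEV        18
   #define NFS4ERR_NOTDIR      20
   #define NFS4ERR_ISDIR       21
   #define NFS4ERR_FBIG        27
   #define NFS4ERR_NOSPC       28
   #define NFS4ERR_ROFS        30
   #define NFS4ERR_MLINK       31
   #define NFS4ERR_NOTEMPTY    66
   #define NFS4ERR_DQUOT       69
   #define NFS4ERR_SERVERFAULT 10006

   int errno_to_nfs4err(int err)
   {
       switch (err) {
       case EPERM:     return NFS4ERR_PERM;
       case ENOENT:    return NFS4ERR_NOENT;
       case EIO:       return NFS4ERR_IO;
       case EACCES:    return NFS4ERR_ACCESS;
       case EEXIST:    return NFS4ERR_EXIST;
       case EXDEV:     return NFS4ERR_XDEV;
       case ENOTDIR:   return NFS4ERR_NOTDIR;
       case EISDIR:    return NFS4ERR_ISDIR;
       case EFBIG:     return NFS4ERR_FBIG;
       case ENOSPC:    return NFS4ERR_NOSPC;
       case EROFS:     return NFS4ERR_ROFS;
       case EMLINK:    return NFS4ERR_MLINK;
       case ENOTEMPTY: return NFS4ERR_NOTEMPTY;
       case EDQUOT:    return NFS4ERR_DQUOT;
       default:        return NFS4ERR_SERVERFAULT;
       }
   }

Note that some of the codes used above (for example, NFS4ERR_ACCESS and NFS4ERR_PERM) are defined in other subsections of Section 15.1; the fallback to NFS4ERR_SERVERFAULT reflects the description in Section 15.1.1.6, which covers server errors that do not map to any of the specific legal NFSv4.1 protocol error values.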
16226 15.1.4.6. NFS4ERR_IO (Error Code 5) 16228 Indicates that an I/O error occurred for which the file system was 16229 unable to provide recovery. 16231 15.1.4.7. NFS4ERR_MLINK (Error Code 31) 16233 The request would have caused the server's limit for the number of 16234 hard links a file may have to be exceeded. 16236 15.1.4.8. NFS4ERR_NOENT (Error Code 2) 16238 Indicates no such file or directory. The file or directory name 16239 specified does not exist. 16241 15.1.4.9. NFS4ERR_NOSPC (Error Code 28) 16243 Indicates no space left on device. The operation would have caused 16244 the server's file system to exceed its limit. 16246 15.1.4.10. NFS4ERR_NOTEMPTY (Error Code 66) 16248 An attempt was made to remove a directory that was not empty. 16250 15.1.4.11. NFS4ERR_ROFS (Error Code 30) 16252 Indicates a read-only file system. A modifying operation was 16253 attempted on a read-only file system. 16255 15.1.4.12. NFS4ERR_XDEV (Error Code 18) 16257 Indicates an attempt to do an operation, such as linking, that 16258 inappropriately crosses a boundary. This may be due to such 16259 boundaries as: 16261 o That between file systems (where the fsids are different). 16263 o That between different named attribute directories or between a 16264 named attribute directory and an ordinary directory. 16266 o That between regions of a file system that the file system 16267 implementation treats as separate (for example for space 16268 accounting purposes), and where cross-connection between the 16269 regions is not allowed.
16271 15.1.5. State Management Errors 16273 These errors indicate problems with the stateid (or one of the 16274 stateids) passed to a given operation. This includes situations in 16275 which the stateid is invalid as well as situations in which the 16276 stateid is valid but designates revoked locking state. Depending on 16277 the operation, the stateid when valid may designate opens, byte-range 16278 locks, file or directory delegations, layouts, or device maps. 16280 15.1.5.1. NFS4ERR_ADMIN_REVOKED (Error Code 10047) 16282 A stateid designates locking state of any type that has been revoked 16283 due to administrative interaction, possibly while the lease is valid. 16285 15.1.5.2. NFS4ERR_BAD_STATEID (Error Code 10025) 16287 A stateid does not properly designate any valid state. See 16288 Section 8.2.4 and Section 8.2.3 for a discussion of how stateids are 16289 validated. 16291 15.1.5.3. NFS4ERR_DELEG_REVOKED (Error Code 10087) 16293 A stateid designates recallable locking state of any type (delegation 16294 or layout) that has been revoked due to the failure of the client to 16295 return the lock when it was recalled. 16297 15.1.5.4. NFS4ERR_EXPIRED (Error Code 10011) 16299 A stateid designates locking state of any type that has been revoked 16300 due to expiration of the client's lease, either immediately upon 16301 lease expiration, or following a later request for a conflicting 16302 lock. 16304 15.1.5.5. NFS4ERR_OLD_STATEID (Error Code 10024) 16306 A stateid with a non-zero seqid value does not match the current seqid 16307 for the state designated by the user.
16309 15.1.6. Security Errors 16311 These are the various permission-related errors in NFSv4.1. 16313 15.1.6.1. NFS4ERR_ACCESS (Error Code 13) 16315 Indicates permission denied. The caller does not have the correct 16316 permission to perform the requested operation.
Contrast this with 16317 NFS4ERR_PERM (Section 15.1.6.2), which restricts itself to owner or 16318 privileged user permission failures, and NFS4ERR_WRONG_CRED 16319 (Section 15.1.6.4), which deals with appropriate permission to delete 16320 or modify transient objects, based on the credentials of the user 16321 that created them. 16323 15.1.6.2. NFS4ERR_PERM (Error Code 1) 16325 Indicates the requester is not the owner. The operation was not allowed 16326 because the caller is neither a privileged user (root) nor the owner 16327 of the target of the operation. 16329 15.1.6.3. NFS4ERR_WRONGSEC (Error Code 10016) 16331 Indicates that the security mechanism being used by the client for 16332 the operation does not match the server's security policy. The 16333 client should change the security mechanism being used and re-send 16334 the operation (but not with the same slot ID and sequence ID; one or 16335 both MUST be different on the re-send). SECINFO and SECINFO_NO_NAME 16336 can be used to determine the appropriate mechanism. 16338 15.1.6.4. NFS4ERR_WRONG_CRED (Error Code 10082) 16340 An operation manipulating state was attempted by a principal that was 16341 not allowed to modify that piece of state.
16343 15.1.7. Name Errors 16345 Names in NFSv4 are UTF-8 strings. When the strings are not valid 16346 UTF-8 or are of length zero, the error NFS4ERR_INVAL results. 16347 Besides this, there are a number of other errors to indicate specific 16348 problems with names. 16350 15.1.7.1. NFS4ERR_BADCHAR (Error Code 10040) 16352 A UTF-8 string contains a character which is not supported by the 16353 server in the context in which it is being used. 16355 15.1.7.2. NFS4ERR_BADNAME (Error Code 10041) 16357 A name string in a request consisted of valid UTF-8 characters 16358 supported by the server but the name is not supported by the server 16359 as a valid name for the current operation. An example might be creating 16360 a file or directory named ".." on a server whose file system uses 16361 that name for links to parent directories. 16363 15.1.7.3. NFS4ERR_NAMETOOLONG (Error Code 63) 16365 Returned when the filename in an operation exceeds the server's 16366 implementation limit.
16368 15.1.8. Locking Errors 16370 This section deals with errors related to locking, both as to share 16371 reservations and byte-range locking. It does not deal with errors 16372 specific to the process of reclaiming locks. Those are dealt with in 16373 the next section. 16375 15.1.8.1. NFS4ERR_BAD_RANGE (Error Code 10042) 16377 The range for a LOCK, LOCKT, or LOCKU operation is not appropriate to 16378 the allowable range of offsets for the server. E.g., this error 16379 results when a server which only supports 32-bit ranges receives a 16380 range that cannot be handled by that server. (See Section 18.10.3). 16382 15.1.8.2. NFS4ERR_DEADLOCK (Error Code 10045) 16384 The server has been able to determine a file locking deadlock 16385 condition for a blocking lock request. 16387 15.1.8.3. NFS4ERR_DENIED (Error Code 10010) 16389 An attempt to lock a file is denied. Since this may be a temporary 16390 condition, the client is encouraged to re-send the lock request (but 16391 not with the same slot ID and sequence ID; one or both MUST be 16392 different on the re-send) until the lock is accepted. See 16393 Section 9.6 for a discussion of the re-send. 16395 15.1.8.4.
NFS4ERR_LOCKED (Error Code 10012) 16397 A read or write operation was attempted on a file where there was a 16398 conflict between the I/O and an existing lock: 16400 o There is a share reservation inconsistent with the I/O being done. 16402 o The range to be read or written intersects an existing mandatory 16403 byte range lock. 16405 15.1.8.5. NFS4ERR_LOCKS_HELD (Error Code 10037) 16407 An operation was prevented by the unexpected presence of locks. 16409 15.1.8.6. NFS4ERR_LOCK_NOTSUPP (Error Code 10043) 16411 A locking request was attempted which would require the upgrade or 16412 downgrade of a lock range already held by the owner when the server 16413 does not support atomic upgrade or downgrade of locks. 16415 15.1.8.7. NFS4ERR_LOCK_RANGE (Error Code 10028) 16417 A lock request is operating on a range that overlaps in part a 16418 currently held lock for the current lock-owner and does not precisely 16419 match a single such lock where the server does not support this type 16420 of request, and thus does not implement POSIX locking semantics [23]. 16421 See Section 18.10.4, Section 18.11.4, and Section 18.12.4 for a 16422 discussion of how this applies to LOCK, LOCKT, and LOCKU 16423 respectively. 16425 15.1.8.8. NFS4ERR_OPENMODE (Error Code 10038) 16427 The client attempted a READ, WRITE, LOCK or other operation not 16428 sanctioned by the stateid passed (e.g. writing to a file opened only 16429 for read). 16431 15.1.8.9. NFS4ERR_SHARE_DENIED (Error Code 10015) 16433 An attempt to OPEN a file with a share reservation has failed because 16434 of a share conflict. 16436 15.1.9. Reclaim Errors 16438 These errors relate to the process of reclaiming locks after a server 16439 restart. 16441 15.1.9.1. NFS4ERR_COMPLETE_ALREADY (Error Code 10054) 16443 The client previously sent a successful RECLAIM_COMPLETE operation. 16444 An additional RECLAIM_COMPLETE operation is not necessary and results 16445 in this error. 16447 15.1.9.2. NFS4ERR_GRACE (Error Code 10013) 16449 The server is in its recovery or grace period which should at least 16450 match the lease period of the server. A locking request other than a 16451 reclaim could not be granted during that period. 16453 15.1.9.3. NFS4ERR_NO_GRACE (Error Code 10033) 16455 A reclaim of client state was attempted in circumstances in which the 16456 server cannot guarantee that conflicting state has not been provided 16457 to another client. This can occur because the reclaim has been done 16458 outside of the grace period of the server, after the client has done 16459 a RECLAIM_COMPLETE operation, or because previous operations have 16460 created a situation in which the server is not able to determine that 16461 a reclaim-interfering edge condition does not exist. 16463 15.1.9.4. NFS4ERR_RECLAIM_BAD (Error Code 10034) 16465 A reclaim attempted by the client does not match the server's state 16466 consistency checks and has been rejected therefore as invalid. 16468 15.1.9.5. NFS4ERR_RECLAIM_CONFLICT (Error Code 10035) 16470 The reclaim attempted by the client has encountered a conflict and 16471 cannot be satisfied. Potentially indicates a misbehaving client, 16472 although not necessarily the one receiving the error. The 16473 misbehavior might be on the part of the client that established the 16474 lock with which this client conflicted. 16476 15.1.10. pNFS Errors 16478 This section deals with pNFS-related errors including those that are 16479 associated with using NFSv4.1 to communicate with a data server. 16481 15.1.10.1. 
NFS4ERR_BADIOMODE (Error Code 10049) 16483 An invalid or inappropriate layout iomode was specified. 16485 15.1.10.2. NFS4ERR_BADLAYOUT (Error Code 10050) 16487 The layout specified is invalid in some way. For LAYOUTCOMMIT, this 16488 indicates that the specified layout is not held by the client or is 16489 not of mode LAYOUTIOMODE4_RW. For LAYOUTGET, it indicates that a 16490 layout matching the client's specification as to minimum length 16491 cannot be granted. 16493 15.1.10.3. NFS4ERR_LAYOUTTRYLATER (Error Code 10058) 16495 Layouts are temporarily unavailable for the file. The client should 16496 re-send later (but not with the same slot ID and sequence ID; one or 16497 both MUST be different on the re-send). 16499 15.1.10.4. NFS4ERR_LAYOUTUNAVAILABLE (Error Code 10059) 16501 Returned when layouts are not available for the current file system 16502 or the particular specified file. 16504 15.1.10.5. NFS4ERR_NOMATCHING_LAYOUT (Error Code 10060) 16506 Returned when layouts are recalled and the client has no layouts 16507 matching the specification of the layouts being recalled. 16509 15.1.10.6. NFS4ERR_PNFS_IO_HOLE (Error Code 10075) 16511 The pNFS client has attempted to read from or write to an illegal 16512 hole of a file of a data server that is using sparse packing. See 16513 Section 13.4.4. 16515 15.1.10.7. NFS4ERR_PNFS_NO_LAYOUT (Error Code 10080) 16517 The pNFS client has attempted to read from or write to a file (using 16518 a request to a data server) without holding a valid layout. This 16519 includes the case where the client had a layout, but the iomode does 16520 not allow a WRITE. 16522 15.1.10.8. NFS4ERR_RETURNCONFLICT (Error Code 10086) 16524 A layout is unavailable due to an attempt to perform the LAYOUTGET 16525 before a pending LAYOUTRETURN on the file has been received. See 16526 Section 12.5.5.2.1.3. 16528 15.1.10.9. NFS4ERR_UNKNOWN_LAYOUTTYPE (Error Code 10062) 16530 The client has specified a layout type which is not supported by the 16531 server. 16533 15.1.11. Session Use Errors 16535 This section deals with errors encountered in using sessions, that 16536 is, in issuing requests over them using the Sequence (i.e. either 16537 SEQUENCE or CB_SEQUENCE) operations. 16539 15.1.11.1. NFS4ERR_BADSESSION (Error Code 10052) 16541 The specified session ID is unknown to the server to which the 16542 operation is addressed. 16544 15.1.11.2. NFS4ERR_BADSLOT (Error Code 10053) 16546 The requester sent a Sequence operation that attempted to use a slot 16547 the replier does not have in its slot table. It is possible the slot 16548 may have been retired. 16550 15.1.11.3. NFS4ERR_BAD_HIGH_SLOT (Error Code 10077) 16552 The highest_slot argument in a Sequence operation exceeds the 16553 replier's enforced highest_slotid. 16555 15.1.11.4. NFS4ERR_CB_PATH_DOWN (Error Code 10048) 16557 There is a problem contacting the client via the callback path. The 16558 function of this error has been mostly superseded by the use of 16559 status flags in the reply to the SEQUENCE operation (see 16560 Section 18.46). 16562 15.1.11.5. NFS4ERR_DEADSESSION (Error Code 10078) 16564 The specified session is a persistent session which is dead and does 16565 not accept new requests or perform new operations on existing 16566 requests (in the case in which a request was partially executed 16567 before server restart). 16569 15.1.11.6. 
NFS4ERR_CONN_NOT_BOUND_TO_SESSION (Error Code 10055) 16571 A Sequence operation was sent on a connection that has not been 16572 associated with the specified session, where the client specified 16573 that connection association was to be enforced with SP4_MACH_CRED or 16574 SP4_SSV state protection. 16576 15.1.11.7. NFS4ERR_SEQ_FALSE_RETRY (Error Code 10076) 16578 The requester sent a Sequence operation with a slot ID and sequence 16579 ID that are in the reply cache, but the replier has detected that the 16580 retried request is not the same as the original request. 16582 15.1.11.8. NFS4ERR_SEQ_MISORDERED (Error Code 10063) 16584 The requester sent a Sequence operation with an invalid sequence ID.
16586 15.1.12. Session Management Errors 16588 This section deals with errors associated with requests used in 16589 session management. 16591 15.1.12.1. NFS4ERR_BACK_CHAN_BUSY (Error Code 10057) 16593 An attempt was made to destroy a session when the session cannot be 16594 destroyed because the server has callback requests outstanding. 16596 15.1.12.2. NFS4ERR_BAD_SESSION_DIGEST (Error Code 10051) 16598 The digest used in a SET_SSV request is not valid.
16600 15.1.13. Client Management Errors 16602 This section deals with errors associated with requests used to 16603 create and manage client IDs. 16605 15.1.13.1. NFS4ERR_CLIENTID_BUSY (Error Code 10074) 16607 The DESTROY_CLIENTID operation has found there are sessions and/or 16608 unexpired state associated with the client ID to be destroyed. 16610 15.1.13.2. NFS4ERR_CLID_INUSE (Error Code 10017) 16612 While processing an EXCHANGE_ID operation, the server was presented 16613 with a co_ownerid field that matches an existing client with valid leased 16614 state, but the principal issuing the EXCHANGE_ID is different from 16615 the one that established the existing client. This indicates a (most likely 16616 due to chance) collision between clients. The client should recover 16617 by changing the co_ownerid and re-sending EXCHANGE_ID (but not with 16618 the same slot ID and sequence ID; one or both MUST be different on 16619 the re-send). 16621 15.1.13.3. NFS4ERR_ENCR_ALG_UNSUPP (Error Code 10079) 16623 An EXCHANGE_ID was sent which specified state protection via SSV, and 16624 where the set of encryption algorithms presented by the client did 16625 not include any supported by the server. 16627 15.1.13.4. NFS4ERR_HASH_ALG_UNSUPP (Error Code 10072) 16629 An EXCHANGE_ID was sent which specified state protection via SSV, and 16630 where the set of hashing algorithms presented by the client did not 16631 include any supported by the server. 16633 15.1.13.5. NFS4ERR_STALE_CLIENTID (Error Code 10022) 16635 A client ID not recognized by the server was passed to an operation. 16636 Note that unlike the case of NFSv4.0, client IDs are not passed 16637 explicitly to the server in ordinary locking operations and cannot 16638 result in this error. Instead, when there is a server restart, it is 16639 first manifested through an error on the associated session and the 16640 staleness of the client ID is detected when trying to associate a 16641 client ID with a new session.
16643 15.1.14. Delegation Errors 16645 This section deals with errors associated with requesting and 16646 returning delegations. 16648 15.1.14.1. NFS4ERR_DELEG_ALREADY_WANTED (Error Code 10056) 16650 The client has requested a delegation when it had already registered 16651 that it wants that same delegation. 16653 15.1.14.2.
NFS4ERR_DIRDELEG_UNAVAIL (Error Code 10084) 16655 This error is returned when the server is unable or unwilling to 16656 provide a requested directory delegation. 16658 15.1.14.3. NFS4ERR_RECALLCONFLICT (Error Code 10061) 16660 A recallable object (i.e. a layout or delegation) is unavailable due 16661 to a conflicting recall operation for that object that is currently 16662 in progress. 16664 15.1.14.4. NFS4ERR_REJECT_DELEG (Error Code 10085) 16666 The callback operation invoked to deal with a new delegation has 16667 rejected it.
16669 15.1.15. Attribute Handling Errors 16671 This section deals with errors specific to attribute handling within 16672 NFSv4. 16674 15.1.15.1. NFS4ERR_ATTRNOTSUPP (Error Code 10032) 16676 An attribute specified is not supported by the server. This error 16677 MUST NOT be returned by the GETATTR operation. 16679 15.1.15.2. NFS4ERR_BADOWNER (Error Code 10039) 16681 Returned when an owner or owner_group attribute value or the who 16682 field of an ace within an ACL attribute value cannot be translated to 16683 a local representation. 16685 15.1.15.3. NFS4ERR_NOT_SAME (Error Code 10027) 16687 This error is returned by the VERIFY operation to signify that the 16688 attributes compared were not the same as those provided in the 16689 client's request. 16691 15.1.15.4. NFS4ERR_SAME (Error Code 10009) 16693 This error is returned by the NVERIFY operation to signify that the 16694 attributes compared were the same as those provided in the client's 16695 request.
16697 15.1.16. Obsoleted Errors 16699 These errors MUST NOT be generated by any NFSv4.1 operation. This 16700 can be for a number of reasons. 16702 o The function provided by the error has been superseded by one of 16703 the status bits returned by the SEQUENCE operation. 16705 o The new session structure and associated change in locking have 16706 made the error unnecessary. 16708 o There has been a restructuring of some errors for NFSv4.1 which 16709 resulted in the elimination of certain errors. 16711 15.1.16.1. NFS4ERR_BAD_SEQID (Error Code 10026) 16713 The sequence number (seqid) in a locking request is neither the next 16714 expected number nor the last number processed. These seqids are 16715 ignored in NFSv4.1. 16717 15.1.16.2. NFS4ERR_LEASE_MOVED (Error Code 10031) 16719 A lease being renewed is associated with a file system that has been 16720 migrated to a new server. The error has been superseded by the 16721 SEQ4_STATUS_LEASE_MOVED status bit (see Section 18.46). 16723 15.1.16.3. NFS4ERR_NXIO (Error Code 6) 16725 I/O error. No such device or address. This error is for errors 16726 involving block and character device access, but NFSv4.1 is not a 16727 device access protocol. 16729 15.1.16.4. NFS4ERR_RESTOREFH (Error Code 10030) 16731 The RESTOREFH operation does not have a saved filehandle (identified 16732 by SAVEFH) to operate upon. In NFSv4.1, this error has been 16733 superseded by NFS4ERR_NOFILEHANDLE. 16735 15.1.16.5. NFS4ERR_STALE_STATEID (Error Code 10023) 16737 A stateid generated by an earlier server instance was used. This 16738 error is moot in NFSv4.1 because all operations that take a stateid 16739 MUST be preceded by the SEQUENCE operation, and the earlier server 16740 instance is detected by the session infrastructure that supports 16741 SEQUENCE.
16743 15.2. Operations and their valid errors 16745 This section contains a table which gives the valid error returns for 16746 each protocol operation.
The error code NFS4_OK (indicating no 16747 error) is not listed but should be understood to be returnable by all 16748 operations with two important exceptions: 16750 o The operations which MUST NOT be implemented: OPEN_CONFIRM, 16751 RELEASE_LOCKOWNER, RENEW, SETCLIENTID, and SETCLIENTID_CONFIRM. 16753 o The invalid operation: ILLEGAL. 16755 Valid error returns for each protocol operation 16757 +----------------------+--------------------------------------------+ 16758 | Operation | Errors | 16759 +----------------------+--------------------------------------------+ 16760 | ACCESS | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 16761 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16762 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 16763 | | NFS4ERR_IO, NFS4ERR_MOVED, | 16764 | | NFS4ERR_NOFILEHANDLE, | 16765 | | NFS4ERR_OP_NOT_IN_SESSION, | 16766 | | NFS4ERR_REP_TOO_BIG, | 16767 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16768 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16769 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS | 16770 | BACKCHANNEL_CTL | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 16771 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 16772 | | NFS4ERR_NOENT, NFS4ERR_OP_NOT_IN_SESSION, | 16773 | | NFS4ERR_REP_TOO_BIG, | 16774 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16775 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS | 16776 | BIND_CONN_TO_SESSION | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | 16777 | | NFS4ERR_BAD_SESSION_DIGEST, | 16778 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16779 | | NFS4ERR_INVAL, NFS4ERR_NOT_ONLY_OP, | 16780 | | NFS4ERR_REP_TOO_BIG, | 16781 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16782 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16783 | | NFS4ERR_TOO_MANY_OPS | 16784 | CLOSE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 16785 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 16786 | | NFS4ERR_DELAY, NFS4ERR_EXPIRED, | 16787 | | NFS4ERR_FHEXPIRED, NFS4ERR_LOCKS_HELD, | 16788 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 16789 | | NFS4ERR_OLD_STATEID, | 16790 | | NFS4ERR_OP_NOT_IN_SESSION, | 16791 | | NFS4ERR_REP_TOO_BIG, | 16792 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16793 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16794 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 16795 | | NFS4ERR_WRONG_CRED | 16796 | COMMIT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 16797 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16798 | | NFS4ERR_FHEXPIRED, NFS4ERR_IO, | 16799 | | NFS4ERR_ISDIR, NFS4ERR_MOVED, | 16800 | | NFS4ERR_NOFILEHANDLE, | 16801 | | NFS4ERR_OP_NOT_IN_SESSION, | 16802 | | NFS4ERR_REP_TOO_BIG, | 16803 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16804 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16805 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 16806 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE | 16807 | CREATE | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 16808 | | NFS4ERR_BADCHAR, NFS4ERR_BADNAME, | 16809 | | NFS4ERR_BADOWNER, NFS4ERR_BADTYPE, | 16810 | | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 16811 | | NFS4ERR_DELAY, NFS4ERR_DQUOT, | 16812 | | NFS4ERR_EXIST, NFS4ERR_FHEXPIRED, | 16813 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MLINK, | 16814 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 16815 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 16816 | | NFS4ERR_NOTDIR, NFS4ERR_OP_NOT_IN_SESSION, | 16817 | | NFS4ERR_PERM, NFS4ERR_REP_TOO_BIG, | 16818 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16819 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 16820 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 16821 | | NFS4ERR_TOO_MANY_OPS, | 16822 | | NFS4ERR_UNSAFE_COMPOUND | 16823 | CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 16824 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 16825 | | 
NFS4ERR_NOENT, NFS4ERR_NOT_ONLY_OP, | 16826 | | NFS4ERR_NOSPC, NFS4ERR_REP_TOO_BIG, | 16827 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16828 | | NFS4ERR_REQ_TOO_BIG, | 16829 | | NFS4ERR_SEQ_MISORDERED, | 16830 | | NFS4ERR_SERVERFAULT, | 16831 | | NFS4ERR_STALE_CLIENTID, NFS4ERR_TOOSMALL, | 16832 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 16833 | DELEGPURGE | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 16834 | | NFS4ERR_DELAY, NFS4ERR_NOTSUPP, | 16835 | | NFS4ERR_OP_NOT_IN_SESSION, | 16836 | | NFS4ERR_REP_TOO_BIG, | 16837 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16838 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16839 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 16840 | DELEGRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 16841 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 16842 | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, | 16843 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 16844 | | NFS4ERR_INVAL, NFS4ERR_MOVED, | 16845 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 16846 | | NFS4ERR_OLD_STATEID, | 16847 | | NFS4ERR_OP_NOT_IN_SESSION, | 16848 | | NFS4ERR_REP_TOO_BIG, | 16849 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16850 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16851 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 16852 | | NFS4ERR_WRONG_CRED | 16853 | DESTROY_CLIENTID | NFS4ERR_BADXDR, NFS4ERR_CLIENTID_BUSY, | 16854 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16855 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_REP_TOO_BIG, | 16856 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16857 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16858 | | NFS4ERR_STALE_CLIENTID, | 16859 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 16860 | DESTROY_SESSION | NFS4ERR_BACK_CHAN_BUSY, | 16861 | | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | 16862 | | NFS4ERR_CB_PATH_DOWN, | 16863 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 16864 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16865 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_REP_TOO_BIG, | 16866 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16867 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16868 | | NFS4ERR_STALE_CLIENTID, | 16869 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 16870 | EXCHANGE_ID | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 16871 | | NFS4ERR_CLID_INUSE, NFS4ERR_DEADSESSION, | 16872 | | NFS4ERR_DELAY, NFS4ERR_ENCR_ALG_UNSUPP, | 16873 | | NFS4ERR_HASH_ALG_UNSUPP, NFS4ERR_INVAL, | 16874 | | NFS4ERR_NOENT, NFS4ERR_NOT_ONLY_OP, | 16875 | | NFS4ERR_NOT_SAME, NFS4ERR_REP_TOO_BIG, | 16876 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16877 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16878 | | NFS4ERR_TOO_MANY_OPS | 16879 | FREE_STATEID | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 16880 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16881 | | NFS4ERR_LOCKS_HELD, NFS4ERR_OLD_STATEID, | 16882 | | NFS4ERR_OP_NOT_IN_SESSION, | 16883 | | NFS4ERR_REP_TOO_BIG, | 16884 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16885 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16886 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 16887 | GET_DIR_DELEGATION | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 16888 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16889 | | NFS4ERR_DIRDELEG_UNAVAIL, | 16890 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 16891 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 16892 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 16893 | | NFS4ERR_NOTSUPP, | 16894 | | NFS4ERR_OP_NOT_IN_SESSION, | 16895 | | NFS4ERR_REP_TOO_BIG, | 16896 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16897 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16898 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS | 16899 | GETATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 16900 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16901 | | 
NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 16902 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 16903 | | NFS4ERR_NOFILEHANDLE, | 16904 | | NFS4ERR_OP_NOT_IN_SESSION, | 16905 | | NFS4ERR_REP_TOO_BIG, | 16906 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16907 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16908 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 16909 | | NFS4ERR_WRONG_TYPE | 16910 | GETDEVICEINFO | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 16911 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 16912 | | NFS4ERR_NOENT, NFS4ERR_NOTSUPP, | 16913 | | NFS4ERR_OP_NOT_IN_SESSION, | 16914 | | NFS4ERR_REP_TOO_BIG, | 16915 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16916 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16917 | | NFS4ERR_TOOSMALL, NFS4ERR_TOO_MANY_OPS, | 16918 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 16919 | GETDEVICELIST | NFS4ERR_BADXDR, NFS4ERR_BAD_COOKIE, | 16920 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16921 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 16922 | | NFS4ERR_IO, NFS4ERR_NOFILEHANDLE, | 16923 | | NFS4ERR_NOTSUPP, NFS4ERR_NOT_SAME, | 16924 | | NFS4ERR_OP_NOT_IN_SESSION, | 16925 | | NFS4ERR_REP_TOO_BIG, | 16926 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16927 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16928 | | NFS4ERR_TOO_MANY_OPS, | 16929 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 16930 | GETFH | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | 16931 | | NFS4ERR_NOFILEHANDLE, | 16932 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_STALE | 16933 | ILLEGAL | NFS4ERR_BADXDR NFS4ERR_OP_ILLEGAL | 16934 | LAYOUTCOMMIT | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 16935 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADIOMODE, | 16936 | | NFS4ERR_BADLAYOUT, NFS4ERR_BADXDR, | 16937 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16938 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 16939 | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | 16940 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 16941 | | NFS4ERR_ISDIR NFS4ERR_MOVED, | 16942 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 16943 | | NFS4ERR_NO_GRACE, | 16944 | | NFS4ERR_OP_NOT_IN_SESSION, | 16945 | | NFS4ERR_RECLAIM_BAD, | 16946 | | NFS4ERR_RECLAIM_CONFLICT, | 16947 | | NFS4ERR_REP_TOO_BIG, | 16948 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16949 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16950 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 16951 | | NFS4ERR_TOO_MANY_OPS, | 16952 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 16953 | | NFS4ERR_WRONG_CRED | 16954 | LAYOUTGET | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 16955 | | NFS4ERR_BADIOMODE, NFS4ERR_BADLAYOUT, | 16956 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 16957 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16958 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 16959 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 16960 | | NFS4ERR_INVAL, NFS4ERR_IO, | 16961 | | NFS4ERR_LAYOUTTRYLATER, | 16962 | | NFS4ERR_LAYOUTUNAVAILABLE, NFS4ERR_LOCKED, | 16963 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 16964 | | NFS4ERR_NOSPC, NFS4ERR_NOTSUPP, | 16965 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 16966 | | NFS4ERR_OP_NOT_IN_SESSION, | 16967 | | NFS4ERR_RECALLCONFLICT, | 16968 | | NFS4ERR_REP_TOO_BIG, | 16969 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16970 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16971 | | NFS4ERR_STALE, NFS4ERR_TOOSMALL, | 16972 | | NFS4ERR_TOO_MANY_OPS, | 16973 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 16974 | | NFS4ERR_WRONG_TYPE | 16975 | LAYOUTRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 16976 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 16977 | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, | 16978 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 16979 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 16980 | | NFS4ERR_ISDIR, 
NFS4ERR_MOVED, | 16981 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 16982 | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | 16983 | | NFS4ERR_OP_NOT_IN_SESSION, | 16984 | | NFS4ERR_REP_TOO_BIG, | 16985 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 16986 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 16987 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 16988 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 16989 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 16990 | LINK | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 16991 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 16992 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 16993 | | NFS4ERR_DQUOT, NFS4ERR_EXIST, | 16994 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 16995 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 16996 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_MLINK, | 16997 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 16998 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 16999 | | NFS4ERR_NOTDIR, NFS4ERR_NOTSUPP, | 17000 | | NFS4ERR_OP_NOT_IN_SESSION, | 17001 | | NFS4ERR_REP_TOO_BIG, | 17002 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17003 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17004 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17005 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 17006 | | NFS4ERR_WRONGSEC, NFS4ERR_WRONG_TYPE, | 17007 | | NFS4ERR_XDEV | 17008 | LOCK | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 17009 | | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE, | 17010 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADLOCK, | 17011 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17012 | | NFS4ERR_DENIED, NFS4ERR_EXPIRED, | 17013 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 17014 | | NFS4ERR_INVAL, NFS4ERR_ISDIR, | 17015 | | NFS4ERR_LOCK_NOTSUPP, NFS4ERR_LOCK_RANGE, | 17016 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 17017 | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | 17018 | | NFS4ERR_OPENMODE, | 17019 | | NFS4ERR_OP_NOT_IN_SESSION, | 17020 | | NFS4ERR_RECLAIM_BAD, | 17021 | | NFS4ERR_RECLAIM_CONFLICT, | 17022 | | NFS4ERR_REP_TOO_BIG, | 17023 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17024 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17025 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17026 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 17027 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 17028 | LOCKT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 17029 | | NFS4ERR_BAD_RANGE, NFS4ERR_DEADSESSION, | 17030 | | NFS4ERR_DELAY, NFS4ERR_DENIED, | 17031 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 17032 | | NFS4ERR_INVAL, NFS4ERR_ISDIR, | 17033 | | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED, | 17034 | | NFS4ERR_NOFILEHANDLE, | 17035 | | NFS4ERR_OP_NOT_IN_SESSION, | 17036 | | NFS4ERR_REP_TOO_BIG, | 17037 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17038 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17039 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 17040 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED, | 17041 | | NFS4ERR_WRONG_TYPE | 17042 | LOCKU | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 17043 | | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE, | 17044 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 17045 | | NFS4ERR_DELAY, NFS4ERR_EXPIRED, | 17046 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 17047 | | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED, | 17048 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_OLD_STATEID, | 17049 | | NFS4ERR_OP_NOT_IN_SESSION, | 17050 | | NFS4ERR_REP_TOO_BIG, | 17051 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17052 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17053 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 17054 | | NFS4ERR_WRONG_CRED | 17055 | LOOKUP | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 17056 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 17057 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17058 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 17059 | | NFS4ERR_IO, 
NFS4ERR_MOVED, | 17060 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 17061 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 17062 | | NFS4ERR_OP_NOT_IN_SESSION, | 17063 | | NFS4ERR_REP_TOO_BIG, | 17064 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17065 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17066 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 17067 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 17068 | LOOKUPP | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION, | 17069 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 17070 | | NFS4ERR_IO, NFS4ERR_MOVED, NFS4ERR_NOENT, | 17071 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 17072 | | NFS4ERR_OP_NOT_IN_SESSION, | 17073 | | NFS4ERR_REP_TOO_BIG, | 17074 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17075 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17076 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 17077 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 17078 | NVERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 17079 | | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 17080 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17081 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 17082 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 17083 | | NFS4ERR_NOFILEHANDLE, | 17084 | | NFS4ERR_OP_NOT_IN_SESSION, | 17085 | | NFS4ERR_REP_TOO_BIG, | 17086 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17087 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SAME, | 17088 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17089 | | NFS4ERR_TOO_MANY_OPS, | 17090 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 17091 | | NFS4ERR_WRONG_TYPE | 17092 | OPEN | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 17093 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | 17094 | | NFS4ERR_BADNAME, NFS4ERR_BADOWNER, | 17095 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 17096 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17097 | | NFS4ERR_DELEG_ALREADY_WANTED, | 17098 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 17099 | | NFS4ERR_EXIST, NFS4ERR_EXPIRED, | 17100 | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | 17101 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 17102 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_MOVED, | 17103 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 17104 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 17105 | | NFS4ERR_NOTDIR, NFS4ERR_NO_GRACE, | 17106 | | NFS4ERR_OLD_STATEID, | 17107 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PERM, | 17108 | | NFS4ERR_RECLAIM_BAD, | 17109 | | NFS4ERR_RECLAIM_CONFLICT, | 17110 | | NFS4ERR_REP_TOO_BIG, | 17111 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17112 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17113 | | NFS4ERR_SERVERFAULT, NFS4ERR_SHARE_DENIED, | 17114 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 17115 | | NFS4ERR_TOO_MANY_OPS, | 17116 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_WRONGSEC, | 17117 | | NFS4ERR_WRONG_TYPE | 17118 | OPEN_CONFIRM | NFS4ERR_NOTSUPP | 17119 | OPEN_DOWNGRADE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 17120 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 17121 | | NFS4ERR_DELAY, NFS4ERR_EXPIRED, | 17122 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 17123 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 17124 | | NFS4ERR_OLD_STATEID, | 17125 | | NFS4ERR_OP_NOT_IN_SESSION, | 17126 | | NFS4ERR_REP_TOO_BIG, | 17127 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17128 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17129 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17130 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 17131 | OPENATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 17132 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17133 | | NFS4ERR_DQUOT, NFS4ERR_FHEXPIRED, | 17134 | | NFS4ERR_IO, NFS4ERR_MOVED, NFS4ERR_NOENT, | 17135 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 17136 | | NFS4ERR_NOTSUPP, | 17137 | | NFS4ERR_OP_NOT_IN_SESSION, | 17138 | | 
NFS4ERR_REP_TOO_BIG, | 17139 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17140 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17141 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17142 | | NFS4ERR_TOO_MANY_OPS, | 17143 | | NFS4ERR_UNSAFE_COMPOUND, | 17144 | | NFS4ERR_WRONG_TYPE | 17145 | PUTFH | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 17146 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17147 | | NFS4ERR_MOVED, NFS4ERR_OP_NOT_IN_SESSION, | 17148 | | NFS4ERR_REP_TOO_BIG, | 17149 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17150 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17151 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 17152 | | NFS4ERR_WRONGSEC | 17153 | PUTPUBFH | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17154 | | NFS4ERR_OP_NOT_IN_SESSION, | 17155 | | NFS4ERR_REP_TOO_BIG, | 17156 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17157 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17158 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 17159 | PUTROOTFH | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17160 | | NFS4ERR_OP_NOT_IN_SESSION, | 17161 | | NFS4ERR_REP_TOO_BIG, | 17162 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17163 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17164 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 17165 | READ | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 17166 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 17167 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17168 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 17169 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 17170 | | NFS4ERR_INVAL, NFS4ERR_ISDIR, NFS4ERR_IO, | 17171 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 17172 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_OLD_STATEID, | 17173 | | NFS4ERR_OPENMODE, | 17174 | | NFS4ERR_OP_NOT_IN_SESSION, | 17175 | | NFS4ERR_PNFS_IO_HOLE, | 17176 | | NFS4ERR_PNFS_NO_LAYOUT, | 17177 | | NFS4ERR_REP_TOO_BIG, | 17178 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17179 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17180 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 17181 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE | 17182 | READDIR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 17183 | | NFS4ERR_BAD_COOKIE, NFS4ERR_DEADSESSION, | 17184 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 17185 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 17186 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 17187 | | NFS4ERR_NOT_SAME, | 17188 | | NFS4ERR_OP_NOT_IN_SESSION, | 17189 | | NFS4ERR_REP_TOO_BIG, | 17190 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17191 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17192 | | NFS4ERR_STALE, NFS4ERR_TOOSMALL, | 17193 | | NFS4ERR_TOO_MANY_OPS | 17194 | READLINK | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION, | 17195 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 17196 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 17197 | | NFS4ERR_NOFILEHANDLE, | 17198 | | NFS4ERR_OP_NOT_IN_SESSION, | 17199 | | NFS4ERR_REP_TOO_BIG, | 17200 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17201 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17202 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 17203 | | NFS4ERR_WRONG_TYPE | 17204 | RECLAIM_COMPLETE | NFS4ERR_BADXDR, NFS4ERR_COMPLETE_ALREADY, | 17205 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17206 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 17207 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 17208 | | NFS4ERR_OP_NOT_IN_SESSION, | 17209 | | NFS4ERR_REP_TOO_BIG, | 17210 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17211 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17212 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 17213 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 17214 | RELEASE_LOCKOWNER | NFS4ERR_NOTSUPP | 17215 | REMOVE | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 17216 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 17217 | | 
NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17218 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 17219 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 17220 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 17221 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 17222 | | NFS4ERR_NOTDIR, NFS4ERR_NOTEMPTY, | 17223 | | NFS4ERR_OP_NOT_IN_SESSION, | 17224 | | NFS4ERR_REP_TOO_BIG, | 17225 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17226 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17227 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17228 | | NFS4ERR_TOO_MANY_OPS | 17229 | RENAME | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 17230 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 17231 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17232 | | NFS4ERR_DQUOT, NFS4ERR_EXIST, | 17233 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 17234 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 17235 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 17236 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 17237 | | NFS4ERR_NOSPC, NFS4ERR_NOTDIR, | 17238 | | NFS4ERR_NOTEMPTY, | 17239 | | NFS4ERR_OP_NOT_IN_SESSION, | 17240 | | NFS4ERR_REP_TOO_BIG, | 17241 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17242 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17243 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17244 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC, | 17245 | | NFS4ERR_XDEV | 17246 | RENEW | NFS4ERR_NOTSUPP | 17247 | RESTOREFH | NFS4ERR_DEADSESSION, NFS4ERR_FHEXPIRED, | 17248 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 17249 | | NFS4ERR_OP_NOT_IN_SESSION, | 17250 | | NFS4ERR_REP_TOO_BIG, | 17251 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17252 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17253 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 17254 | | NFS4ERR_WRONGSEC | 17255 | SAVEFH | NFS4ERR_DEADSESSION, NFS4ERR_FHEXPIRED, | 17256 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 17257 | | NFS4ERR_OP_NOT_IN_SESSION, | 17258 | | NFS4ERR_REP_TOO_BIG, | 17259 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17260 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17261 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS | 17262 | SECINFO | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 17263 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 17264 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17265 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 17266 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 17267 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 17268 | | NFS4ERR_NOTDIR, NFS4ERR_OP_NOT_IN_SESSION, | 17269 | | NFS4ERR_REP_TOO_BIG, | 17270 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17271 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17272 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS | 17273 | SECINFO_NO_NAME | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 17274 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17275 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 17276 | | NFS4ERR_MOVED, NFS4ERR_NOENT, | 17277 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 17278 | | NFS4ERR_NOTSUPP, | 17279 | | NFS4ERR_OP_NOT_IN_SESSION, | 17280 | | NFS4ERR_REP_TOO_BIG, | 17281 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17282 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17283 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS | 17284 | SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | 17285 | | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT, | 17286 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 17287 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17288 | | NFS4ERR_REP_TOO_BIG, | 17289 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17290 | | NFS4ERR_REQ_TOO_BIG, | 17291 | | NFS4ERR_RETRY_UNCACHED_REP, | 17292 | | NFS4ERR_SEQUENCE_POS, | 17293 | | NFS4ERR_SEQ_FALSE_RETRY, | 17294 | | NFS4ERR_SEQ_MISORDERED, | 17295 | | NFS4ERR_TOO_MANY_OPS | 17296 | SET_SSV | NFS4ERR_BADXDR, | 17297 | | 
NFS4ERR_BAD_SESSION_DIGEST, | 17298 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17299 | | NFS4ERR_INVAL, NFS4ERR_OP_NOT_IN_SESSION, | 17300 | | NFS4ERR_REP_TOO_BIG, | 17301 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17302 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS | 17303 | SETATTR | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 17304 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | 17305 | | NFS4ERR_BADOWNER, NFS4ERR_BADXDR, | 17306 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 17307 | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, | 17308 | | NFS4ERR_DQUOT, NFS4ERR_EXPIRED, | 17309 | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | 17310 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 17311 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 17312 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 17313 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 17314 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PERM, | 17315 | | NFS4ERR_REP_TOO_BIG, | 17316 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17317 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17318 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17319 | | NFS4ERR_TOO_MANY_OPS, | 17320 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 17321 | | NFS4ERR_WRONG_TYPE | 17322 | SETCLIENTID | NFS4ERR_NOTSUPP | 17323 | SETCLIENTID_CONFIRM | NFS4ERR_NOTSUPP | 17324 | TEST_STATEID | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 17325 | | NFS4ERR_DELAY, NFS4ERR_OP_NOT_IN_SESSION, | 17326 | | NFS4ERR_REP_TOO_BIG, | 17327 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17328 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17329 | | NFS4ERR_TOO_MANY_OPS | 17330 | VERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 17331 | | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 17332 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17333 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 17334 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 17335 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOT_SAME, | 17336 | | NFS4ERR_OP_NOT_IN_SESSION, | 17337 | | NFS4ERR_REP_TOO_BIG, | 17338 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17339 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17340 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 17341 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 17342 | | NFS4ERR_WRONG_TYPE | 17343 | WANT_DELEGATION | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 17344 | | NFS4ERR_DELAY, | 17345 | | NFS4ERR_DELEG_ALREADY_WANTED, | 17346 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 17347 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 17348 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 17349 | | NFS4ERR_NO_GRACE, | 17350 | | NFS4ERR_OP_NOT_IN_SESSION, | 17351 | | NFS4ERR_RECALLCONFLICT, | 17352 | | NFS4ERR_RECLAIM_BAD, | 17353 | | NFS4ERR_RECLAIM_CONFLICT, | 17354 | | NFS4ERR_REP_TOO_BIG, | 17355 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17356 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | 17357 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 17358 | | NFS4ERR_WRONG_TYPE | 17359 | WRITE | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 17360 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 17361 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17362 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 17363 | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | 17364 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 17365 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR, | 17366 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 17367 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 17368 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 17369 | | NFS4ERR_OP_NOT_IN_SESSION, | 17370 | | NFS4ERR_PNFS_IO_HOLE, | 17371 | | NFS4ERR_PNFS_NO_LAYOUT, | 17372 | | NFS4ERR_REP_TOO_BIG, | 17373 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17374 | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_ROFS, | 17375 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17376 | | NFS4ERR_SYMLINK, 
NFS4ERR_TOO_MANY_OPS, | 17377 | | NFS4ERR_WRONG_TYPE | 17378 +----------------------+--------------------------------------------+ 17380 Table 6 17382 15.3. Callback operations and their valid errors 17384 This section contains a table which gives the valid error returns for 17385 each callback operation. The error code NFS4_OK (indicating no 17386 error) is not listed but should be understood to be returnable by all 17387 callback operations with the exception of CB_ILLEGAL. 17389 Valid error returns for each protocol callback operation 17391 +-------------------------+-----------------------------------------+ 17392 | Callback Operation | Errors | 17393 +-------------------------+-----------------------------------------+ 17394 | CB_GETATTR | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 17395 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 17396 | | NFS4ERR_OP_NOT_IN_SESSION, | 17397 | | NFS4ERR_REP_TOO_BIG, | 17398 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17399 | | NFS4ERR_REQ_TOO_BIG, | 17400 | | NFS4ERR_SERVERFAULT, | 17401 | | NFS4ERR_TOO_MANY_OPS, | 17402 | CB_ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL | 17403 | CB_LAYOUTRECALL | NFS4ERR_BADHANDLE, NFS4ERR_BADIOMODE, | 17404 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 17405 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 17406 | | NFS4ERR_NOMATCHING_LAYOUT, | 17407 | | NFS4ERR_NOTSUPP, | 17408 | | NFS4ERR_OP_NOT_IN_SESSION, | 17409 | | NFS4ERR_REP_TOO_BIG, | 17410 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17411 | | NFS4ERR_REQ_TOO_BIG, | 17412 | | NFS4ERR_TOO_MANY_OPS, | 17413 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 17414 | | NFS4ERR_WRONG_TYPE | 17415 | CB_NOTIFY | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 17416 | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | 17417 | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, | 17418 | | NFS4ERR_OP_NOT_IN_SESSION, | 17419 | | NFS4ERR_REP_TOO_BIG, | 17420 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17421 | | NFS4ERR_REQ_TOO_BIG, | 17422 | | NFS4ERR_SERVERFAULT, | 17423 | | NFS4ERR_TOO_MANY_OPS | 17424 | CB_NOTIFY_DEVICEID | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 17425 | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, | 17426 | | NFS4ERR_OP_NOT_IN_SESSION, | 17427 | | NFS4ERR_REP_TOO_BIG, | 17428 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17429 | | NFS4ERR_REQ_TOO_BIG, | 17430 | | NFS4ERR_SERVERFAULT, | 17431 | | NFS4ERR_TOO_MANY_OPS | 17432 | CB_NOTIFY_LOCK | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 17433 | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | 17434 | | NFS4ERR_NOTSUPP, | 17435 | | NFS4ERR_OP_NOT_IN_SESSION, | 17436 | | NFS4ERR_REP_TOO_BIG, | 17437 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17438 | | NFS4ERR_REQ_TOO_BIG, | 17439 | | NFS4ERR_SERVERFAULT, | 17440 | | NFS4ERR_TOO_MANY_OPS | 17441 | CB_PUSH_DELEG | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 17442 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 17443 | | NFS4ERR_NOTSUPP, | 17444 | | NFS4ERR_OP_NOT_IN_SESSION, | 17445 | | NFS4ERR_REJECT_DELEG, | 17446 | | NFS4ERR_REP_TOO_BIG, | 17447 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17448 | | NFS4ERR_REQ_TOO_BIG, | 17449 | | NFS4ERR_SERVERFAULT, | 17450 | | NFS4ERR_TOO_MANY_OPS, | 17451 | | NFS4ERR_WRONG_TYPE | 17452 | CB_RECALL | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 17453 | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | 17454 | | NFS4ERR_OP_NOT_IN_SESSION, | 17455 | | NFS4ERR_REP_TOO_BIG, | 17456 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17457 | | NFS4ERR_REQ_TOO_BIG, | 17458 | | NFS4ERR_SERVERFAULT, | 17459 | | NFS4ERR_TOO_MANY_OPS | 17460 | CB_RECALL_ANY | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 17461 | | NFS4ERR_INVAL, | 17462 | | NFS4ERR_OP_NOT_IN_SESSION, | 17463 | | NFS4ERR_REP_TOO_BIG, | 17464 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17465 | | 
NFS4ERR_REQ_TOO_BIG, | 17466 | | NFS4ERR_TOO_MANY_OPS | 17467 | CB_RECALLABLE_OBJ_AVAIL | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 17468 | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, | 17469 | | NFS4ERR_OP_NOT_IN_SESSION, | 17470 | | NFS4ERR_REP_TOO_BIG, | 17471 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17472 | | NFS4ERR_REQ_TOO_BIG, | 17473 | | NFS4ERR_SERVERFAULT, | 17474 | | NFS4ERR_TOO_MANY_OPS | 17475 | CB_RECALL_SLOT | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT, | 17476 | | NFS4ERR_DELAY, | 17477 | | NFS4ERR_OP_NOT_IN_SESSION, | 17478 | | NFS4ERR_REP_TOO_BIG, | 17479 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17480 | | NFS4ERR_REQ_TOO_BIG, | 17481 | | NFS4ERR_TOO_MANY_OPS | 17482 | CB_SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | 17483 | | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT, | 17484 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 17485 | | NFS4ERR_DELAY, NFS4ERR_REP_TOO_BIG, | 17486 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17487 | | NFS4ERR_REQ_TOO_BIG, | 17488 | | NFS4ERR_RETRY_UNCACHED_REP, | 17489 | | NFS4ERR_SEQUENCE_POS, | 17490 | | NFS4ERR_SEQ_FALSE_RETRY, | 17491 | | NFS4ERR_SEQ_MISORDERED, | 17492 | | NFS4ERR_TOO_MANY_OPS | 17493 | CB_WANTS_CANCELLED | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 17494 | | NFS4ERR_NOTSUPP, | 17495 | | NFS4ERR_OP_NOT_IN_SESSION, | 17496 | | NFS4ERR_REP_TOO_BIG, | 17497 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17498 | | NFS4ERR_REQ_TOO_BIG, | 17499 | | NFS4ERR_SERVERFAULT, | 17500 | | NFS4ERR_TOO_MANY_OPS | 17501 +-------------------------+-----------------------------------------+ 17503 Table 7 17505 15.4. Errors and the operations that use them 17506 +-----------------------------------+-------------------------------+ 17507 | Error | Operations | 17508 +-----------------------------------+-------------------------------+ 17509 | NFS4ERR_ACCESS | ACCESS, COMMIT, CREATE, | 17510 | | GETATTR, GET_DIR_DELEGATION, | 17511 | | LAYOUTCOMMIT, LAYOUTGET, | 17512 | | LINK, LOCK, LOCKT, LOCKU, | 17513 | | LOOKUP, LOOKUPP, NVERIFY, | 17514 | | OPEN, OPENATTR, READ, | 17515 | | READDIR, READLINK, REMOVE, | 17516 | | RENAME, SECINFO, | 17517 | | SECINFO_NO_NAME, SETATTR, | 17518 | | VERIFY, WRITE | 17519 | NFS4ERR_ADMIN_REVOKED | CLOSE, DELEGRETURN, | 17520 | | LAYOUTCOMMIT, LAYOUTGET, | 17521 | | LAYOUTRETURN, LOCK, LOCKU, | 17522 | | OPEN, OPEN_DOWNGRADE, READ, | 17523 | | SETATTR, WRITE | 17524 | NFS4ERR_ATTRNOTSUPP | CREATE, LAYOUTCOMMIT, | 17525 | | NVERIFY, OPEN, SETATTR, | 17526 | | VERIFY | 17527 | NFS4ERR_BACK_CHAN_BUSY | DESTROY_SESSION | 17528 | NFS4ERR_BADCHAR | CREATE, EXCHANGE_ID, LINK, | 17529 | | LOOKUP, NVERIFY, OPEN, | 17530 | | REMOVE, RENAME, SECINFO, | 17531 | | SETATTR, VERIFY | 17532 | NFS4ERR_BADHANDLE | CB_GETATTR, CB_LAYOUTRECALL, | 17533 | | CB_NOTIFY, CB_NOTIFY_LOCK, | 17534 | | CB_PUSH_DELEG, CB_RECALL, | 17535 | | PUTFH | 17536 | NFS4ERR_BADIOMODE | CB_LAYOUTRECALL, | 17537 | | LAYOUTCOMMIT, LAYOUTGET | 17538 | NFS4ERR_BADLAYOUT | LAYOUTCOMMIT, LAYOUTGET | 17539 | NFS4ERR_BADNAME | CREATE, LINK, LOOKUP, OPEN, | 17540 | | REMOVE, RENAME, SECINFO | 17541 | NFS4ERR_BADOWNER | CREATE, OPEN, SETATTR | 17542 | NFS4ERR_BADSESSION | BIND_CONN_TO_SESSION, | 17543 | | CB_SEQUENCE, DESTROY_SESSION, | 17544 | | SEQUENCE | 17545 | NFS4ERR_BADSLOT | CB_SEQUENCE, SEQUENCE | 17546 | NFS4ERR_BADTYPE | CREATE | 17547 | NFS4ERR_BADXDR | ACCESS, BACKCHANNEL_CTL, | 17548 | | BIND_CONN_TO_SESSION, | 17549 | | CB_GETATTR, CB_ILLEGAL, | 17550 | | CB_LAYOUTRECALL, CB_NOTIFY, | 17551 | | CB_NOTIFY_DEVICEID, | 17552 | | CB_NOTIFY_LOCK, | 17553 | | CB_PUSH_DELEG, CB_RECALL, | 17554 | | CB_RECALLABLE_OBJ_AVAIL, 
| 17555 | | CB_RECALL_ANY, | 17556 | | CB_RECALL_SLOT, CB_SEQUENCE, | 17557 | | CB_WANTS_CANCELLED, CLOSE, | 17558 | | COMMIT, CREATE, | 17559 | | CREATE_SESSION, DELEGPURGE, | 17560 | | DELEGRETURN, | 17561 | | DESTROY_CLIENTID, | 17562 | | DESTROY_SESSION, EXCHANGE_ID, | 17563 | | FREE_STATEID, GETATTR, | 17564 | | GETDEVICEINFO, GETDEVICELIST, | 17565 | | GET_DIR_DELEGATION, ILLEGAL, | 17566 | | LAYOUTCOMMIT, LAYOUTGET, | 17567 | | LAYOUTRETURN, LINK, LOCK, | 17568 | | LOCKT, LOCKU, LOOKUP, | 17569 | | NVERIFY, OPEN, OPENATTR, | 17570 | | OPEN_DOWNGRADE, PUTFH, READ, | 17571 | | READDIR, RECLAIM_COMPLETE, | 17572 | | REMOVE, RENAME, SECINFO, | 17573 | | SECINFO_NO_NAME, SEQUENCE, | 17574 | | SETATTR, SET_SSV, | 17575 | | TEST_STATEID, VERIFY, | 17576 | | WANT_DELEGATION, WRITE | 17577 | NFS4ERR_BAD_COOKIE | GETDEVICELIST, READDIR | 17578 | NFS4ERR_BAD_HIGH_SLOT | CB_RECALL_SLOT, CB_SEQUENCE, | 17579 | | SEQUENCE | 17580 | NFS4ERR_BAD_RANGE | LOCK, LOCKT, LOCKU | 17581 | NFS4ERR_BAD_SESSION_DIGEST | BIND_CONN_TO_SESSION, SET_SSV | 17582 | NFS4ERR_BAD_STATEID | CB_LAYOUTRECALL, CB_NOTIFY, | 17583 | | CB_NOTIFY_LOCK, CB_RECALL, | 17584 | | CLOSE, DELEGRETURN, | 17585 | | FREE_STATEID, LAYOUTGET, | 17586 | | LAYOUTRETURN, LOCK, LOCKU, | 17587 | | OPEN, OPEN_DOWNGRADE, READ, | 17588 | | SETATTR, WRITE | 17589 | NFS4ERR_CB_PATH_DOWN | DESTROY_SESSION | 17590 | NFS4ERR_CLID_INUSE | EXCHANGE_ID | 17591 | NFS4ERR_CLIENTID_BUSY | DESTROY_CLIENTID | 17592 | NFS4ERR_COMPLETE_ALREADY | RECLAIM_COMPLETE | 17593 | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | CB_SEQUENCE, DESTROY_SESSION, | 17594 | | SEQUENCE | 17595 | NFS4ERR_DEADLOCK | LOCK | 17596 | NFS4ERR_DEADSESSION | ACCESS, BACKCHANNEL_CTL, | 17597 | | BIND_CONN_TO_SESSION, CLOSE, | 17598 | | COMMIT, CREATE, | 17599 | | CREATE_SESSION, DELEGPURGE, | 17600 | | DELEGRETURN, | 17601 | | DESTROY_CLIENTID, | 17602 | | DESTROY_SESSION, EXCHANGE_ID, | 17603 | | FREE_STATEID, GETATTR, | 17604 | | GETDEVICEINFO, GETDEVICELIST, | 17605 | | GET_DIR_DELEGATION, | 17606 | | LAYOUTCOMMIT, LAYOUTGET, | 17607 | | LAYOUTRETURN, LINK, LOCK, | 17608 | | LOCKT, LOCKU, LOOKUP, | 17609 | | LOOKUPP, NVERIFY, OPEN, | 17610 | | OPENATTR, OPEN_DOWNGRADE, | 17611 | | PUTFH, PUTPUBFH, PUTROOTFH, | 17612 | | READ, READDIR, READLINK, | 17613 | | RECLAIM_COMPLETE, REMOVE, | 17614 | | RENAME, RESTOREFH, SAVEFH, | 17615 | | SECINFO, SECINFO_NO_NAME, | 17616 | | SEQUENCE, SETATTR, SET_SSV, | 17617 | | TEST_STATEID, VERIFY, | 17618 | | WANT_DELEGATION, WRITE | 17619 | NFS4ERR_DELAY | ACCESS, BACKCHANNEL_CTL, | 17620 | | BIND_CONN_TO_SESSION, | 17621 | | CB_GETATTR, CB_LAYOUTRECALL, | 17622 | | CB_NOTIFY, | 17623 | | CB_NOTIFY_DEVICEID, | 17624 | | CB_NOTIFY_LOCK, | 17625 | | CB_PUSH_DELEG, CB_RECALL, | 17626 | | CB_RECALLABLE_OBJ_AVAIL, | 17627 | | CB_RECALL_ANY, | 17628 | | CB_RECALL_SLOT, CB_SEQUENCE, | 17629 | | CB_WANTS_CANCELLED, CLOSE, | 17630 | | COMMIT, CREATE, | 17631 | | CREATE_SESSION, DELEGPURGE, | 17632 | | DELEGRETURN, | 17633 | | DESTROY_CLIENTID, | 17634 | | DESTROY_SESSION, EXCHANGE_ID, | 17635 | | FREE_STATEID, GETATTR, | 17636 | | GETDEVICEINFO, GETDEVICELIST, | 17637 | | GET_DIR_DELEGATION, | 17638 | | LAYOUTCOMMIT, LAYOUTGET, | 17639 | | LAYOUTRETURN, LINK, LOCK, | 17640 | | LOCKT, LOCKU, LOOKUP, | 17641 | | LOOKUPP, NVERIFY, OPEN, | 17642 | | OPENATTR, OPEN_DOWNGRADE, | 17643 | | PUTFH, PUTPUBFH, PUTROOTFH, | 17644 | | READ, READDIR, READLINK, | 17645 | | RECLAIM_COMPLETE, REMOVE, | 17646 | | RENAME, SECINFO, | 17647 | | SECINFO_NO_NAME, SEQUENCE, | 17648 | | 
SETATTR, SET_SSV, | 17649 | | TEST_STATEID, VERIFY, | 17650 | | WANT_DELEGATION, WRITE | 17651 | NFS4ERR_DELEG_ALREADY_WANTED | OPEN, WANT_DELEGATION | 17652 | NFS4ERR_DELEG_REVOKED | DELEGRETURN, LAYOUTCOMMIT, | 17653 | | LAYOUTGET, LAYOUTRETURN, | 17654 | | OPEN, READ, SETATTR, WRITE | 17655 | NFS4ERR_DENIED | LOCK, LOCKT | 17656 | NFS4ERR_DIRDELEG_UNAVAIL | GET_DIR_DELEGATION | 17657 | NFS4ERR_DQUOT | CREATE, LAYOUTGET, LINK, | 17658 | | OPEN, OPENATTR, RENAME, | 17659 | | SETATTR, WRITE | 17660 | NFS4ERR_ENCR_ALG_UNSUPP | EXCHANGE_ID | 17661 | NFS4ERR_EXIST | CREATE, LINK, OPEN, RENAME | 17662 | NFS4ERR_EXPIRED | CLOSE, DELEGRETURN, | 17663 | | LAYOUTCOMMIT, LAYOUTRETURN, | 17664 | | LOCK, LOCKU, OPEN, | 17665 | | OPEN_DOWNGRADE, READ, | 17666 | | SETATTR, WRITE | 17667 | NFS4ERR_FBIG | LAYOUTCOMMIT, OPEN, SETATTR, | 17668 | | WRITE | 17669 | NFS4ERR_FHEXPIRED | ACCESS, CLOSE, COMMIT, | 17670 | | CREATE, DELEGRETURN, GETATTR, | 17671 | | GETDEVICELIST, GETFH, | 17672 | | GET_DIR_DELEGATION, | 17673 | | LAYOUTCOMMIT, LAYOUTGET, | 17674 | | LAYOUTRETURN, LINK, LOCK, | 17675 | | LOCKT, LOCKU, LOOKUP, | 17676 | | LOOKUPP, NVERIFY, OPEN, | 17677 | | OPENATTR, OPEN_DOWNGRADE, | 17678 | | READ, READDIR, READLINK, | 17679 | | RECLAIM_COMPLETE, REMOVE, | 17680 | | RENAME, RESTOREFH, SAVEFH, | 17681 | | SECINFO, SECINFO_NO_NAME, | 17682 | | SETATTR, VERIFY, | 17683 | | WANT_DELEGATION, WRITE | 17684 | NFS4ERR_FILE_OPEN | LINK, REMOVE, RENAME | 17685 | NFS4ERR_GRACE | GETATTR, GET_DIR_DELEGATION, | 17686 | | LAYOUTCOMMIT, LAYOUTGET, | 17687 | | LAYOUTRETURN, LINK, LOCK, | 17688 | | LOCKT, NVERIFY, OPEN, READ, | 17689 | | REMOVE, RENAME, SETATTR, | 17690 | | VERIFY, WANT_DELEGATION, | 17691 | | WRITE | 17692 | NFS4ERR_HASH_ALG_UNSUPP | EXCHANGE_ID | 17693 | NFS4ERR_INVAL | ACCESS, BACKCHANNEL_CTL, | 17694 | | BIND_CONN_TO_SESSION, | 17695 | | CB_GETATTR, CB_LAYOUTRECALL, | 17696 | | CB_NOTIFY, | 17697 | | CB_NOTIFY_DEVICEID, | 17698 | | CB_PUSH_DELEG, | 17699 | | CB_RECALLABLE_OBJ_AVAIL, | 17700 | | CB_RECALL_ANY, CREATE, | 17701 | | CREATE_SESSION, DELEGRETURN, | 17702 | | EXCHANGE_ID, GETATTR, | 17703 | | GETDEVICEINFO, GETDEVICELIST, | 17704 | | GET_DIR_DELEGATION, | 17705 | | LAYOUTCOMMIT, LAYOUTGET, | 17706 | | LAYOUTRETURN, LINK, LOCK, | 17707 | | LOCKT, LOCKU, LOOKUP, | 17708 | | NVERIFY, OPEN, | 17709 | | OPEN_DOWNGRADE, READ, | 17710 | | READDIR, READLINK, | 17711 | | RECLAIM_COMPLETE, REMOVE, | 17712 | | RENAME, SECINFO, | 17713 | | SECINFO_NO_NAME, SETATTR, | 17714 | | SET_SSV, VERIFY, | 17715 | | WANT_DELEGATION, WRITE | 17716 | NFS4ERR_IO | ACCESS, COMMIT, CREATE, | 17717 | | GETATTR, GETDEVICELIST, | 17718 | | GET_DIR_DELEGATION, | 17719 | | LAYOUTCOMMIT, LAYOUTGET, | 17720 | | LINK, LOOKUP, LOOKUPP, | 17721 | | NVERIFY, OPEN, OPENATTR, | 17722 | | READ, READDIR, READLINK, | 17723 | | REMOVE, RENAME, SETATTR, | 17724 | | VERIFY, WANT_DELEGATION, | 17725 | | WRITE | 17726 | NFS4ERR_ISDIR | COMMIT, LAYOUTCOMMIT, | 17727 | | LAYOUTRETURN, LINK, LOCK, | 17728 | | LOCKT, OPEN, READ, WRITE | 17729 | NFS4ERR_LAYOUTTRYLATER | LAYOUTGET | 17730 | NFS4ERR_LAYOUTUNAVAILABLE | LAYOUTGET | 17731 | NFS4ERR_LOCKED | LAYOUTGET, READ, SETATTR, | 17732 | | WRITE | 17733 | NFS4ERR_LOCKS_HELD | CLOSE, FREE_STATEID | 17734 | NFS4ERR_LOCK_NOTSUPP | LOCK | 17735 | NFS4ERR_LOCK_RANGE | LOCK, LOCKT, LOCKU | 17736 | NFS4ERR_MLINK | CREATE, LINK | 17737 | NFS4ERR_MOVED | ACCESS, CLOSE, COMMIT, | 17738 | | CREATE, DELEGRETURN, GETATTR, | 17739 | | GETFH, GET_DIR_DELEGATION, | 17740 | | LAYOUTCOMMIT, LAYOUTGET, 
| 17741 | | LAYOUTRETURN, LINK, LOCK, | 17742 | | LOCKT, LOCKU, LOOKUP, | 17743 | | LOOKUPP, NVERIFY, OPEN, | 17744 | | OPENATTR, OPEN_DOWNGRADE, | 17745 | | PUTFH, READ, READDIR, | 17746 | | READLINK, RECLAIM_COMPLETE, | 17747 | | REMOVE, RENAME, RESTOREFH, | 17748 | | SAVEFH, SECINFO, | 17749 | | SECINFO_NO_NAME, SETATTR, | 17750 | | VERIFY, WANT_DELEGATION, | 17751 | | WRITE | 17752 | NFS4ERR_NAMETOOLONG | CREATE, LINK, LOOKUP, OPEN, | 17753 | | REMOVE, RENAME, SECINFO | 17754 | NFS4ERR_NOENT | BACKCHANNEL_CTL, | 17755 | | CREATE_SESSION, EXCHANGE_ID, | 17756 | | GETDEVICEINFO, LOOKUP, | 17757 | | LOOKUPP, OPEN, OPENATTR, | 17758 | | REMOVE, RENAME, SECINFO, | 17759 | | SECINFO_NO_NAME | 17760 | NFS4ERR_NOFILEHANDLE | ACCESS, CLOSE, COMMIT, | 17761 | | CREATE, DELEGRETURN, GETATTR, | 17762 | | GETDEVICELIST, GETFH, | 17763 | | GET_DIR_DELEGATION, | 17764 | | LAYOUTCOMMIT, LAYOUTGET, | 17765 | | LAYOUTRETURN, LINK, LOCK, | 17766 | | LOCKT, LOCKU, LOOKUP, | 17767 | | LOOKUPP, NVERIFY, OPEN, | 17768 | | OPENATTR, OPEN_DOWNGRADE, | 17769 | | READ, READDIR, READLINK, | 17770 | | RECLAIM_COMPLETE, REMOVE, | 17771 | | RENAME, RESTOREFH, SAVEFH, | 17772 | | SECINFO, SECINFO_NO_NAME, | 17773 | | SETATTR, VERIFY, | 17774 | | WANT_DELEGATION, WRITE | 17775 | NFS4ERR_NOMATCHING_LAYOUT | CB_LAYOUTRECALL | 17776 | NFS4ERR_NOSPC | CREATE, CREATE_SESSION, | 17777 | | LAYOUTGET, LINK, OPEN, | 17778 | | OPENATTR, RENAME, SETATTR, | 17779 | | WRITE | 17780 | NFS4ERR_NOTDIR | CREATE, GET_DIR_DELEGATION, | 17781 | | LINK, LOOKUP, LOOKUPP, OPEN, | 17782 | | READDIR, REMOVE, RENAME, | 17783 | | SECINFO, SECINFO_NO_NAME | 17784 | NFS4ERR_NOTEMPTY | REMOVE, RENAME | 17785 | NFS4ERR_NOTSUPP | CB_LAYOUTRECALL, CB_NOTIFY, | 17786 | | CB_NOTIFY_DEVICEID, | 17787 | | CB_NOTIFY_LOCK, | 17788 | | CB_PUSH_DELEG, | 17789 | | CB_RECALLABLE_OBJ_AVAIL, | 17790 | | CB_WANTS_CANCELLED, | 17791 | | DELEGPURGE, DELEGRETURN, | 17792 | | GETDEVICEINFO, GETDEVICELIST, | 17793 | | GET_DIR_DELEGATION, | 17794 | | LAYOUTCOMMIT, LAYOUTGET, | 17795 | | LAYOUTRETURN, LINK, OPENATTR, | 17796 | | OPEN_CONFIRM, | 17797 | | RELEASE_LOCKOWNER, RENEW, | 17798 | | SECINFO_NO_NAME, SETCLIENTID, | 17799 | | SETCLIENTID_CONFIRM, | 17800 | | WANT_DELEGATION | 17801 | NFS4ERR_NOT_ONLY_OP | BIND_CONN_TO_SESSION, | 17802 | | CREATE_SESSION, | 17803 | | DESTROY_CLIENTID, | 17804 | | DESTROY_SESSION, EXCHANGE_ID | 17805 | NFS4ERR_NOT_SAME | EXCHANGE_ID, GETDEVICELIST, | 17806 | | READDIR, VERIFY | 17807 | NFS4ERR_NO_GRACE | LAYOUTCOMMIT, LAYOUTRETURN, | 17808 | | LOCK, OPEN, WANT_DELEGATION | 17809 | NFS4ERR_OLD_STATEID | CLOSE, DELEGRETURN, | 17810 | | FREE_STATEID, LAYOUTGET, | 17811 | | LAYOUTRETURN, LOCK, LOCKU, | 17812 | | OPEN, OPEN_DOWNGRADE, READ, | 17813 | | SETATTR, WRITE | 17814 | NFS4ERR_OPENMODE | LAYOUTGET, LOCK, READ, | 17815 | | SETATTR, WRITE | 17816 | NFS4ERR_OP_ILLEGAL | CB_ILLEGAL, ILLEGAL | 17817 | NFS4ERR_OP_NOT_IN_SESSION | ACCESS, BACKCHANNEL_CTL, | 17818 | | CB_GETATTR, CB_LAYOUTRECALL, | 17819 | | CB_NOTIFY, | 17820 | | CB_NOTIFY_DEVICEID, | 17821 | | CB_NOTIFY_LOCK, | 17822 | | CB_PUSH_DELEG, CB_RECALL, | 17823 | | CB_RECALLABLE_OBJ_AVAIL, | 17824 | | CB_RECALL_ANY, | 17825 | | CB_RECALL_SLOT, | 17826 | | CB_WANTS_CANCELLED, CLOSE, | 17827 | | COMMIT, CREATE, DELEGPURGE, | 17828 | | DELEGRETURN, FREE_STATEID, | 17829 | | GETATTR, GETDEVICEINFO, | 17830 | | GETDEVICELIST, GETFH, | 17831 | | GET_DIR_DELEGATION, | 17832 | | LAYOUTCOMMIT, LAYOUTGET, | 17833 | | LAYOUTRETURN, LINK, LOCK, | 17834 | | LOCKT, LOCKU, LOOKUP, | 17835 
| | LOOKUPP, NVERIFY, OPEN, | 17836 | | OPENATTR, OPEN_DOWNGRADE, | 17837 | | PUTFH, PUTPUBFH, PUTROOTFH, | 17838 | | READ, READDIR, READLINK, | 17839 | | RECLAIM_COMPLETE, REMOVE, | 17840 | | RENAME, RESTOREFH, SAVEFH, | 17841 | | SECINFO, SECINFO_NO_NAME, | 17842 | | SETATTR, SET_SSV, | 17843 | | TEST_STATEID, VERIFY, | 17844 | | WANT_DELEGATION, WRITE | 17845 | NFS4ERR_PERM | CREATE, OPEN, SETATTR | 17846 | NFS4ERR_PNFS_IO_HOLE | READ, WRITE | 17847 | NFS4ERR_PNFS_NO_LAYOUT | READ, WRITE | 17848 | NFS4ERR_RECALLCONFLICT | LAYOUTGET, WANT_DELEGATION | 17849 | NFS4ERR_RECLAIM_BAD | LAYOUTCOMMIT, LOCK, OPEN, | 17850 | | WANT_DELEGATION | 17851 | NFS4ERR_RECLAIM_CONFLICT | LAYOUTCOMMIT, LOCK, OPEN, | 17852 | | WANT_DELEGATION | 17853 | NFS4ERR_REJECT_DELEG | CB_PUSH_DELEG | 17854 | NFS4ERR_REP_TOO_BIG | ACCESS, BACKCHANNEL_CTL, | 17855 | | BIND_CONN_TO_SESSION, | 17856 | | CB_GETATTR, CB_LAYOUTRECALL, | 17857 | | CB_NOTIFY, | 17858 | | CB_NOTIFY_DEVICEID, | 17859 | | CB_NOTIFY_LOCK, | 17860 | | CB_PUSH_DELEG, CB_RECALL, | 17861 | | CB_RECALLABLE_OBJ_AVAIL, | 17862 | | CB_RECALL_ANY, | 17863 | | CB_RECALL_SLOT, CB_SEQUENCE, | 17864 | | CB_WANTS_CANCELLED, CLOSE, | 17865 | | COMMIT, CREATE, | 17866 | | CREATE_SESSION, DELEGPURGE, | 17867 | | DELEGRETURN, | 17868 | | DESTROY_CLIENTID, | 17869 | | DESTROY_SESSION, EXCHANGE_ID, | 17870 | | FREE_STATEID, GETATTR, | 17871 | | GETDEVICEINFO, GETDEVICELIST, | 17872 | | GET_DIR_DELEGATION, | 17873 | | LAYOUTCOMMIT, LAYOUTGET, | 17874 | | LAYOUTRETURN, LINK, LOCK, | 17875 | | LOCKT, LOCKU, LOOKUP, | 17876 | | LOOKUPP, NVERIFY, OPEN, | 17877 | | OPENATTR, OPEN_DOWNGRADE, | 17878 | | PUTFH, PUTPUBFH, PUTROOTFH, | 17879 | | READ, READDIR, READLINK, | 17880 | | RECLAIM_COMPLETE, REMOVE, | 17881 | | RENAME, RESTOREFH, SAVEFH, | 17882 | | SECINFO, SECINFO_NO_NAME, | 17883 | | SEQUENCE, SETATTR, SET_SSV, | 17884 | | TEST_STATEID, VERIFY, | 17885 | | WANT_DELEGATION, WRITE | 17886 | NFS4ERR_REP_TOO_BIG_TO_CACHE | ACCESS, BACKCHANNEL_CTL, | 17887 | | BIND_CONN_TO_SESSION, | 17888 | | CB_GETATTR, CB_LAYOUTRECALL, | 17889 | | CB_NOTIFY, | 17890 | | CB_NOTIFY_DEVICEID, | 17891 | | CB_NOTIFY_LOCK, | 17892 | | CB_PUSH_DELEG, CB_RECALL, | 17893 | | CB_RECALLABLE_OBJ_AVAIL, | 17894 | | CB_RECALL_ANY, | 17895 | | CB_RECALL_SLOT, CB_SEQUENCE, | 17896 | | CB_WANTS_CANCELLED, CLOSE, | 17897 | | COMMIT, CREATE, | 17898 | | CREATE_SESSION, DELEGPURGE, | 17899 | | DELEGRETURN, | 17900 | | DESTROY_CLIENTID, | 17901 | | DESTROY_SESSION, EXCHANGE_ID, | 17902 | | FREE_STATEID, GETATTR, | 17903 | | GETDEVICEINFO, GETDEVICELIST, | 17904 | | GET_DIR_DELEGATION, | 17905 | | LAYOUTCOMMIT, LAYOUTGET, | 17906 | | LAYOUTRETURN, LINK, LOCK, | 17907 | | LOCKT, LOCKU, LOOKUP, | 17908 | | LOOKUPP, NVERIFY, OPEN, | 17909 | | OPENATTR, OPEN_DOWNGRADE, | 17910 | | PUTFH, PUTPUBFH, PUTROOTFH, | 17911 | | READ, READDIR, READLINK, | 17912 | | RECLAIM_COMPLETE, REMOVE, | 17913 | | RENAME, RESTOREFH, SAVEFH, | 17914 | | SECINFO, SECINFO_NO_NAME, | 17915 | | SEQUENCE, SETATTR, SET_SSV, | 17916 | | TEST_STATEID, VERIFY, | 17917 | | WANT_DELEGATION, WRITE | 17918 | NFS4ERR_REQ_TOO_BIG | ACCESS, BACKCHANNEL_CTL, | 17919 | | BIND_CONN_TO_SESSION, | 17920 | | CB_GETATTR, CB_LAYOUTRECALL, | 17921 | | CB_NOTIFY, | 17922 | | CB_NOTIFY_DEVICEID, | 17923 | | CB_NOTIFY_LOCK, | 17924 | | CB_PUSH_DELEG, CB_RECALL, | 17925 | | CB_RECALLABLE_OBJ_AVAIL, | 17926 | | CB_RECALL_ANY, | 17927 | | CB_RECALL_SLOT, CB_SEQUENCE, | 17928 | | CB_WANTS_CANCELLED, CLOSE, | 17929 | | COMMIT, CREATE, | 17930 | | CREATE_SESSION, 
DELEGPURGE, | 17931 | | DELEGRETURN, | 17932 | | DESTROY_CLIENTID, | 17933 | | DESTROY_SESSION, EXCHANGE_ID, | 17934 | | FREE_STATEID, GETATTR, | 17935 | | GETDEVICEINFO, GETDEVICELIST, | 17936 | | GET_DIR_DELEGATION, | 17937 | | LAYOUTCOMMIT, LAYOUTGET, | 17938 | | LAYOUTRETURN, LINK, LOCK, | 17939 | | LOCKT, LOCKU, LOOKUP, | 17940 | | LOOKUPP, NVERIFY, OPEN, | 17941 | | OPENATTR, OPEN_DOWNGRADE, | 17942 | | PUTFH, PUTPUBFH, PUTROOTFH, | 17943 | | READ, READDIR, READLINK, | 17944 | | RECLAIM_COMPLETE, REMOVE, | 17945 | | RENAME, RESTOREFH, SAVEFH, | 17946 | | SECINFO, SECINFO_NO_NAME, | 17947 | | SEQUENCE, SETATTR, SET_SSV, | 17948 | | TEST_STATEID, VERIFY, | 17949 | | WANT_DELEGATION, WRITE | 17950 | NFS4ERR_RETRY_UNCACHED_REP | CB_SEQUENCE, SEQUENCE | 17951 | NFS4ERR_ROFS | CREATE, LINK, LOCK, LOCKT, | 17952 | | OPEN, OPENATTR, | 17953 | | OPEN_DOWNGRADE, REMOVE, | 17954 | | RENAME, SETATTR, WRITE | 17955 | NFS4ERR_SAME | NVERIFY | 17956 | NFS4ERR_SEQUENCE_POS | CB_SEQUENCE, SEQUENCE | 17957 | NFS4ERR_SEQ_FALSE_RETRY | CB_SEQUENCE, SEQUENCE | 17958 | NFS4ERR_SEQ_MISORDERED | CB_SEQUENCE, CREATE_SESSION, | 17959 | | SEQUENCE | 17960 | NFS4ERR_SERVERFAULT | ACCESS, BIND_CONN_TO_SESSION, | 17961 | | CB_GETATTR, CB_NOTIFY, | 17962 | | CB_NOTIFY_DEVICEID, | 17963 | | CB_NOTIFY_LOCK, | 17964 | | CB_PUSH_DELEG, CB_RECALL, | 17965 | | CB_RECALLABLE_OBJ_AVAIL, | 17966 | | CB_WANTS_CANCELLED, CLOSE, | 17967 | | COMMIT, CREATE, | 17968 | | CREATE_SESSION, DELEGPURGE, | 17969 | | DELEGRETURN, | 17970 | | DESTROY_CLIENTID, | 17971 | | DESTROY_SESSION, EXCHANGE_ID, | 17972 | | FREE_STATEID, GETATTR, | 17973 | | GETDEVICEINFO, GETDEVICELIST, | 17974 | | GET_DIR_DELEGATION, | 17975 | | LAYOUTCOMMIT, LAYOUTGET, | 17976 | | LAYOUTRETURN, LINK, LOCK, | 17977 | | LOCKU, LOOKUP, LOOKUPP, | 17978 | | NVERIFY, OPEN, OPENATTR, | 17979 | | OPEN_DOWNGRADE, PUTFH, | 17980 | | PUTPUBFH, PUTROOTFH, READ, | 17981 | | READDIR, READLINK, | 17982 | | RECLAIM_COMPLETE, REMOVE, | 17983 | | RENAME, RESTOREFH, SAVEFH, | 17984 | | SECINFO, SECINFO_NO_NAME, | 17985 | | SETATTR, TEST_STATEID, | 17986 | | VERIFY, WANT_DELEGATION, | 17987 | | WRITE | 17988 | NFS4ERR_SHARE_DENIED | OPEN | 17989 | NFS4ERR_STALE | ACCESS, CLOSE, COMMIT, | 17990 | | CREATE, DELEGRETURN, GETATTR, | 17991 | | GETFH, GET_DIR_DELEGATION, | 17992 | | LAYOUTCOMMIT, LAYOUTGET, | 17993 | | LAYOUTRETURN, LINK, LOCK, | 17994 | | LOCKT, LOCKU, LOOKUP, | 17995 | | LOOKUPP, NVERIFY, OPEN, | 17996 | | OPENATTR, OPEN_DOWNGRADE, | 17997 | | PUTFH, READ, READDIR, | 17998 | | READLINK, RECLAIM_COMPLETE, | 17999 | | REMOVE, RENAME, RESTOREFH, | 18000 | | SAVEFH, SECINFO, | 18001 | | SECINFO_NO_NAME, SETATTR, | 18002 | | VERIFY, WANT_DELEGATION, | 18003 | | WRITE | 18004 | NFS4ERR_STALE_CLIENTID | CREATE_SESSION, | 18005 | | DESTROY_CLIENTID, | 18006 | | DESTROY_SESSION | 18007 | NFS4ERR_SYMLINK | COMMIT, LAYOUTCOMMIT, LINK, | 18008 | | LOCK, LOCKT, LOOKUP, LOOKUPP, | 18009 | | OPEN, READ, WRITE | 18010 | NFS4ERR_TOOSMALL | CREATE_SESSION, | 18011 | | GETDEVICEINFO, LAYOUTGET, | 18012 | | READDIR | 18013 | NFS4ERR_TOO_MANY_OPS | ACCESS, BACKCHANNEL_CTL, | 18014 | | BIND_CONN_TO_SESSION, | 18015 | | CB_GETATTR, CB_LAYOUTRECALL, | 18016 | | CB_NOTIFY, | 18017 | | CB_NOTIFY_DEVICEID, | 18018 | | CB_NOTIFY_LOCK, | 18019 | | CB_PUSH_DELEG, CB_RECALL, | 18020 | | CB_RECALLABLE_OBJ_AVAIL, | 18021 | | CB_RECALL_ANY, | 18022 | | CB_RECALL_SLOT, CB_SEQUENCE, | 18023 | | CB_WANTS_CANCELLED, CLOSE, | 18024 | | COMMIT, CREATE, | 18025 | | CREATE_SESSION, DELEGPURGE, | 18026 | | 
DELEGRETURN, | 18027 | | DESTROY_CLIENTID, | 18028 | | DESTROY_SESSION, EXCHANGE_ID, | 18029 | | FREE_STATEID, GETATTR, | 18030 | | GETDEVICEINFO, GETDEVICELIST, | 18031 | | GET_DIR_DELEGATION, | 18032 | | LAYOUTCOMMIT, LAYOUTGET, | 18033 | | LAYOUTRETURN, LINK, LOCK, | 18034 | | LOCKT, LOCKU, LOOKUP, | 18035 | | LOOKUPP, NVERIFY, OPEN, | 18036 | | OPENATTR, OPEN_DOWNGRADE, | 18037 | | PUTFH, PUTPUBFH, PUTROOTFH, | 18038 | | READ, READDIR, READLINK, | 18039 | | RECLAIM_COMPLETE, REMOVE, | 18040 | | RENAME, RESTOREFH, SAVEFH, | 18041 | | SECINFO, SECINFO_NO_NAME, | 18042 | | SEQUENCE, SETATTR, SET_SSV, | 18043 | | TEST_STATEID, VERIFY, | 18044 | | WANT_DELEGATION, WRITE | 18045 | NFS4ERR_UNKNOWN_LAYOUTTYPE | CB_LAYOUTRECALL, | 18046 | | GETDEVICEINFO, GETDEVICELIST, | 18047 | | LAYOUTCOMMIT, LAYOUTGET, | 18048 | | LAYOUTRETURN, NVERIFY, | 18049 | | SETATTR, VERIFY | 18050 | NFS4ERR_UNSAFE_COMPOUND | CREATE, OPEN, OPENATTR | 18051 | NFS4ERR_WRONGSEC | LINK, LOOKUP, LOOKUPP, OPEN, | 18052 | | PUTFH, PUTPUBFH, PUTROOTFH, | 18053 | | RENAME, RESTOREFH | 18054 | NFS4ERR_WRONG_CRED | CLOSE, CREATE_SESSION, | 18055 | | DELEGPURGE, DELEGRETURN, | 18056 | | DESTROY_CLIENTID, | 18057 | | DESTROY_SESSION, | 18058 | | FREE_STATEID, LAYOUTCOMMIT, | 18059 | | LAYOUTRETURN, LOCK, LOCKT, | 18060 | | LOCKU, OPEN_DOWNGRADE, | 18061 | | RECLAIM_COMPLETE | 18062 | NFS4ERR_WRONG_TYPE | CB_LAYOUTRECALL, | 18063 | | CB_PUSH_DELEG, COMMIT, | 18064 | | GETATTR, LAYOUTGET, | 18065 | | LAYOUTRETURN, LINK, LOCK, | 18066 | | LOCKT, NVERIFY, OPEN, | 18067 | | OPENATTR, READ, READLINK, | 18068 | | RECLAIM_COMPLETE, SETATTR, | 18069 | | VERIFY, WANT_DELEGATION, | 18070 | | WRITE | 18071 | NFS4ERR_XDEV | LINK, RENAME | 18072 +-----------------------------------+-------------------------------+ 18074 Table 8 18076 16. NFSv4.1 Procedures 18078 Both procedures, NULL and COMPOUND, MUST be implemented. 18080 16.1. Procedure 0: NULL - No Operation 18082 16.1.1. ARGUMENTS 18084 void; 18086 16.1.2. RESULTS 18088 void; 18090 16.1.3. DESCRIPTION 18092 This is the standard NULL procedure with the standard void argument 18093 and void response. This procedure has no functionality associated 18094 with it. Because of this it is sometimes used to measure the 18095 overhead of processing a service request. Therefore, the server 18096 SHOULD ensure that no unnecessary work is done in servicing this 18097 procedure. 18099 16.1.4. ERRORS 18101 None. 18103 16.2. Procedure 1: COMPOUND - Compound Operations 18105 16.2.1. 
ARGUMENTS 18107 enum nfs_opnum4 { 18108 OP_ACCESS = 3, 18109 OP_CLOSE = 4, 18110 OP_COMMIT = 5, 18111 OP_CREATE = 6, 18112 OP_DELEGPURGE = 7, 18113 OP_DELEGRETURN = 8, 18114 OP_GETATTR = 9, 18115 OP_GETFH = 10, 18116 OP_LINK = 11, 18117 OP_LOCK = 12, 18118 OP_LOCKT = 13, 18119 OP_LOCKU = 14, 18120 OP_LOOKUP = 15, 18121 OP_LOOKUPP = 16, 18122 OP_NVERIFY = 17, 18123 OP_OPEN = 18, 18124 OP_OPENATTR = 19, 18125 OP_OPEN_CONFIRM = 20, /* Mandatory not-to-implement */ 18126 OP_OPEN_DOWNGRADE = 21, 18127 OP_PUTFH = 22, 18128 OP_PUTPUBFH = 23, 18129 OP_PUTROOTFH = 24, 18130 OP_READ = 25, 18131 OP_READDIR = 26, 18132 OP_READLINK = 27, 18133 OP_REMOVE = 28, 18134 OP_RENAME = 29, 18135 OP_RENEW = 30, /* Mandatory not-to-implement */ 18136 OP_RESTOREFH = 31, 18137 OP_SAVEFH = 32, 18138 OP_SECINFO = 33, 18139 OP_SETATTR = 34, 18140 OP_SETCLIENTID = 35, /* Mandatory not-to-implement */ 18141 OP_SETCLIENTID_CONFIRM = 36, /* Mandatory not-to-implement */ 18142 OP_VERIFY = 37, 18143 OP_WRITE = 38, 18144 OP_RELEASE_LOCKOWNER = 39, /* Mandatory not-to-implement */ 18146 /* new operations for NFSv4.1 */ 18147 OP_BACKCHANNEL_CTL = 40, 18148 OP_BIND_CONN_TO_SESSION = 41, 18149 OP_EXCHANGE_ID = 42, 18150 OP_CREATE_SESSION = 43, 18151 OP_DESTROY_SESSION = 44, 18152 OP_FREE_STATEID = 45, 18153 OP_GET_DIR_DELEGATION = 46, 18154 OP_GETDEVICEINFO = 47, 18155 OP_GETDEVICELIST = 48, 18156 OP_LAYOUTCOMMIT = 49, 18157 OP_LAYOUTGET = 50, 18158 OP_LAYOUTRETURN = 51, 18159 OP_SECINFO_NO_NAME = 52, 18160 OP_SEQUENCE = 53, 18161 OP_SET_SSV = 54, 18162 OP_TEST_STATEID = 55, 18163 OP_WANT_DELEGATION = 56, 18164 OP_DESTROY_CLIENTID = 57, 18165 OP_RECLAIM_COMPLETE = 58, 18166 OP_ILLEGAL = 10044 18167 }; 18169 union nfs_argop4 switch (nfs_opnum4 argop) { 18170 case OP_ACCESS: ACCESS4args opaccess; 18171 case OP_CLOSE: CLOSE4args opclose; 18172 case OP_COMMIT: COMMIT4args opcommit; 18173 case OP_CREATE: CREATE4args opcreate; 18174 case OP_DELEGPURGE: DELEGPURGE4args opdelegpurge; 18175 case OP_DELEGRETURN: DELEGRETURN4args opdelegreturn; 18176 case OP_GETATTR: GETATTR4args opgetattr; 18177 case OP_GETFH: void; 18178 case OP_LINK: LINK4args oplink; 18179 case OP_LOCK: LOCK4args oplock; 18180 case OP_LOCKT: LOCKT4args oplockt; 18181 case OP_LOCKU: LOCKU4args oplocku; 18182 case OP_LOOKUP: LOOKUP4args oplookup; 18183 case OP_LOOKUPP: void; 18184 case OP_NVERIFY: NVERIFY4args opnverify; 18185 case OP_OPEN: OPEN4args opopen; 18186 case OP_OPENATTR: OPENATTR4args opopenattr; 18188 /* Not for NFSv4.1 */ 18189 case OP_OPEN_CONFIRM: OPEN_CONFIRM4args opopen_confirm; 18191 case OP_OPEN_DOWNGRADE: 18192 OPEN_DOWNGRADE4args opopen_downgrade; 18194 case OP_PUTFH: PUTFH4args opputfh; 18195 case OP_PUTPUBFH: void; 18196 case OP_PUTROOTFH: void; 18197 case OP_READ: READ4args opread; 18198 case OP_READDIR: READDIR4args opreaddir; 18199 case OP_READLINK: void; 18200 case OP_REMOVE: REMOVE4args opremove; 18201 case OP_RENAME: RENAME4args oprename; 18203 /* Not for NFSv4.1 */ 18204 case OP_RENEW: RENEW4args oprenew; 18206 case OP_RESTOREFH: void; 18207 case OP_SAVEFH: void; 18208 case OP_SECINFO: SECINFO4args opsecinfo; 18209 case OP_SETATTR: SETATTR4args opsetattr; 18211 /* Not for NFSv4.1 */ 18212 case OP_SETCLIENTID: SETCLIENTID4args opsetclientid; 18214 /* Not for NFSv4.1 */ 18215 case OP_SETCLIENTID_CONFIRM: SETCLIENTID_CONFIRM4args 18216 opsetclientid_confirm; 18217 case OP_VERIFY: VERIFY4args opverify; 18218 case OP_WRITE: WRITE4args opwrite; 18220 /* Not for NFSv4.1 */ 18221 case OP_RELEASE_LOCKOWNER: 18222 RELEASE_LOCKOWNER4args 18223 
oprelease_lockowner; 18225 /* Operations new to NFSv4.1 */ 18226 case OP_BACKCHANNEL_CTL: 18227 BACKCHANNEL_CTL4args opbackchannel_ctl; 18229 case OP_BIND_CONN_TO_SESSION: 18230 BIND_CONN_TO_SESSION4args 18231 opbind_conn_to_session; 18233 case OP_EXCHANGE_ID: EXCHANGE_ID4args opexchange_id; 18235 case OP_CREATE_SESSION: 18236 CREATE_SESSION4args opcreate_session; 18238 case OP_DESTROY_SESSION: 18239 DESTROY_SESSION4args opdestroy_session; 18241 case OP_FREE_STATEID: FREE_STATEID4args opfree_stateid; 18242 case OP_GET_DIR_DELEGATION: 18243 GET_DIR_DELEGATION4args 18244 opget_dir_delegation; 18246 case OP_GETDEVICEINFO: GETDEVICEINFO4args opgetdeviceinfo; 18247 case OP_GETDEVICELIST: GETDEVICELIST4args opgetdevicelist; 18248 case OP_LAYOUTCOMMIT: LAYOUTCOMMIT4args oplayoutcommit; 18249 case OP_LAYOUTGET: LAYOUTGET4args oplayoutget; 18250 case OP_LAYOUTRETURN: LAYOUTRETURN4args oplayoutreturn; 18252 case OP_SECINFO_NO_NAME: 18253 SECINFO_NO_NAME4args opsecinfo_no_name; 18255 case OP_SEQUENCE: SEQUENCE4args opsequence; 18256 case OP_SET_SSV: SET_SSV4args opset_ssv; 18257 case OP_TEST_STATEID: TEST_STATEID4args optest_stateid; 18259 case OP_WANT_DELEGATION: 18260 WANT_DELEGATION4args opwant_delegation; 18262 case OP_DESTROY_CLIENTID: 18263 DESTROY_CLIENTID4args 18264 opdestroy_clientid; 18266 case OP_RECLAIM_COMPLETE: 18267 RECLAIM_COMPLETE4args 18268 opreclaim_complete; 18270 /* Operations not new to NFSv4.1 */ 18271 case OP_ILLEGAL: void; 18272 }; 18274 struct COMPOUND4args { 18275 utf8str_cs tag; 18276 uint32_t minorversion; 18277 nfs_argop4 argarray<>; 18278 }; 18280 16.2.2. RESULTS 18282 union nfs_resop4 switch (nfs_opnum4 resop) { 18283 case OP_ACCESS: ACCESS4res opaccess; 18284 case OP_CLOSE: CLOSE4res opclose; 18285 case OP_COMMIT: COMMIT4res opcommit; 18286 case OP_CREATE: CREATE4res opcreate; 18287 case OP_DELEGPURGE: DELEGPURGE4res opdelegpurge; 18288 case OP_DELEGRETURN: DELEGRETURN4res opdelegreturn; 18289 case OP_GETATTR: GETATTR4res opgetattr; 18290 case OP_GETFH: GETFH4res opgetfh; 18291 case OP_LINK: LINK4res oplink; 18292 case OP_LOCK: LOCK4res oplock; 18293 case OP_LOCKT: LOCKT4res oplockt; 18294 case OP_LOCKU: LOCKU4res oplocku; 18295 case OP_LOOKUP: LOOKUP4res oplookup; 18296 case OP_LOOKUPP: LOOKUPP4res oplookupp; 18297 case OP_NVERIFY: NVERIFY4res opnverify; 18298 case OP_OPEN: OPEN4res opopen; 18299 case OP_OPENATTR: OPENATTR4res opopenattr; 18300 /* Not for NFSv4.1 */ 18301 case OP_OPEN_CONFIRM: OPEN_CONFIRM4res opopen_confirm; 18303 case OP_OPEN_DOWNGRADE: 18304 OPEN_DOWNGRADE4res 18305 opopen_downgrade; 18307 case OP_PUTFH: PUTFH4res opputfh; 18308 case OP_PUTPUBFH: PUTPUBFH4res opputpubfh; 18309 case OP_PUTROOTFH: PUTROOTFH4res opputrootfh; 18310 case OP_READ: READ4res opread; 18311 case OP_READDIR: READDIR4res opreaddir; 18312 case OP_READLINK: READLINK4res opreadlink; 18313 case OP_REMOVE: REMOVE4res opremove; 18314 case OP_RENAME: RENAME4res oprename; 18315 /* Not for NFSv4.1 */ 18316 case OP_RENEW: RENEW4res oprenew; 18317 case OP_RESTOREFH: RESTOREFH4res oprestorefh; 18318 case OP_SAVEFH: SAVEFH4res opsavefh; 18319 case OP_SECINFO: SECINFO4res opsecinfo; 18320 case OP_SETATTR: SETATTR4res opsetattr; 18321 /* Not for NFSv4.1 */ 18322 case OP_SETCLIENTID: SETCLIENTID4res opsetclientid; 18324 /* Not for NFSv4.1 */ 18325 case OP_SETCLIENTID_CONFIRM: 18326 SETCLIENTID_CONFIRM4res 18327 opsetclientid_confirm; 18328 case OP_VERIFY: VERIFY4res opverify; 18329 case OP_WRITE: WRITE4res opwrite; 18331 /* Not for NFSv4.1 */ 18332 case OP_RELEASE_LOCKOWNER: 18333 
RELEASE_LOCKOWNER4res 18334 oprelease_lockowner; 18336 /* Operations new to NFSv4.1 */ 18337 case OP_BACKCHANNEL_CTL: 18338 BACKCHANNEL_CTL4res 18339 opbackchannel_ctl; 18341 case OP_BIND_CONN_TO_SESSION: 18342 BIND_CONN_TO_SESSION4res 18343 opbind_conn_to_session; 18345 case OP_EXCHANGE_ID: EXCHANGE_ID4res opexchange_id; 18347 case OP_CREATE_SESSION: 18348 CREATE_SESSION4res 18349 opcreate_session; 18351 case OP_DESTROY_SESSION: 18352 DESTROY_SESSION4res 18353 opdestroy_session; 18355 case OP_FREE_STATEID: FREE_STATEID4res 18356 opfree_stateid; 18358 case OP_GET_DIR_DELEGATION: 18359 GET_DIR_DELEGATION4res 18360 opget_dir_delegation; 18362 case OP_GETDEVICEINFO: GETDEVICEINFO4res 18363 opgetdeviceinfo; 18365 case OP_GETDEVICELIST: GETDEVICELIST4res 18366 opgetdevicelist; 18368 case OP_LAYOUTCOMMIT: LAYOUTCOMMIT4res oplayoutcommit; 18369 case OP_LAYOUTGET: LAYOUTGET4res oplayoutget; 18370 case OP_LAYOUTRETURN: LAYOUTRETURN4res oplayoutreturn; 18372 case OP_SECINFO_NO_NAME: 18373 SECINFO_NO_NAME4res 18374 opsecinfo_no_name; 18376 case OP_SEQUENCE: SEQUENCE4res opsequence; 18377 case OP_SET_SSV: SET_SSV4res opset_ssv; 18378 case OP_TEST_STATEID: TEST_STATEID4res optest_stateid; 18380 case OP_WANT_DELEGATION: 18381 WANT_DELEGATION4res 18382 opwant_delegation; 18384 case OP_DESTROY_CLIENTID: 18386 DESTROY_CLIENTID4res 18387 opdestroy_clientid; 18389 case OP_RECLAIM_COMPLETE: 18390 RECLAIM_COMPLETE4res 18391 opreclaim_complete; 18393 /* Operations not new to NFSv4.1 */ 18394 case OP_ILLEGAL: ILLEGAL4res opillegal; 18395 }; 18397 struct COMPOUND4res { 18398 nfsstat4 status; 18399 utf8str_cs tag; 18400 nfs_resop4 resarray<>; 18401 }; 18403 16.2.3. DESCRIPTION 18405 The COMPOUND procedure is used to combine one or more of the NFS 18406 operations into a single RPC request. The NFS RPC program has two 18407 main procedures: NULL and COMPOUND. All other operations use the 18408 COMPOUND procedure as a wrapper. 18410 The COMPOUND procedure is used to combine individual operations into 18411 a single RPC request. The server interprets each of the operations 18412 in turn. If an operation is executed by the server and the status of 18413 that operation is NFS4_OK, then the next operation in the COMPOUND 18414 procedure is executed. The server continues this process until there 18415 are no more operations to be executed or one of the operations has a 18416 status value other than NFS4_OK. 18418 In the processing of the COMPOUND procedure, the server may find that 18419 it does not have the available resources to execute any or all of the 18420 operations within the COMPOUND sequence. See Section 2.10.6.4 for a 18421 more detailed discussion. 18423 The server will generally choose between two methods of decoding the 18424 client's request. The first would be the traditional one-pass XDR 18425 decode. If there is an XDR decoding error in this case, the RPC XDR 18426 decode error would be returned. The second method would be to make 18427 an initial pass to decode the basic COMPOUND request and then to XDR 18428 decode the individual operations; the most interesting is the decode 18429 of attributes. In this case, the server may encounter an XDR decode 18430 error during the second pass, in which case the server would return 18431 the error NFS4ERR_BADXDR to signify the decode error.
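The per-operation evaluation rule described above can be illustrated with the following non-normative C sketch. The types shown are simplified stand-ins for the XDR-derived structures, and evaluate_op() is a hypothetical per-operation dispatcher; the sketch captures only the stop-on-first-error behavior and the fact that the status of the last operation evaluated becomes the status of the COMPOUND itself.

   /* Non-normative sketch; simplified stand-ins for the XDR types. */
   typedef int nfsstat4;
   #define NFS4_OK 0

   struct op_arg { int argop;  /* operation-specific arguments omitted */ };
   struct op_res { int resop; nfsstat4 status;  /* results omitted */ };

   /* Hypothetical dispatcher: evaluates one operation, fills in *res
      (including res->status), and returns that status. */
   extern nfsstat4 evaluate_op(const struct op_arg *arg, struct op_res *res);

   /* Evaluate the operations of a COMPOUND in order, stopping after
      the first operation whose status is not NFS4_OK.  The value
      returned is the status of the last operation evaluated, which is
      also the status of the COMPOUND itself; *nres is the number of
      entries placed in resarray (the length of the reply's result
      array). */
   nfsstat4
   compound_eval(const struct op_arg *argarray, unsigned int argarray_len,
                 struct op_res *resarray, unsigned int *nres)
   {
           nfsstat4 status = NFS4_OK;
           unsigned int i;

           *nres = 0;
           for (i = 0; i < argarray_len; i++) {
                   status = evaluate_op(&argarray[i], &resarray[i]);
                   (*nres)++;
                   if (status != NFS4_OK)
                           break;  /* remaining operations are not evaluated */
           }
           return status;
   }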
18433 The COMPOUND arguments contain a "minorversion" field. For NFSv4.1, 18434 the value for this field is 1. If the server receives a COMPOUND 18435 procedure with a minorversion field value that it does not support, 18436 the server MUST return an error of NFS4ERR_MINOR_VERS_MISMATCH and a 18437 zero-length resultdata array. 18439 Contained within the COMPOUND results is a "status" field. If the 18440 results array length is non-zero, this status must be equivalent to 18441 the status of the last operation that was executed within the 18442 COMPOUND procedure. Therefore, if an operation incurred an error 18443 then the "status" value will be the same error value as is being 18444 returned for the operation that failed. 18446 Note that operations 0 (zero) and 1 (one) are not defined for the 18447 COMPOUND procedure. Operation 2 is not defined and is reserved for 18448 future definition and use with minor versioning. If the server 18449 receives an operation array that contains operation 2 and the 18450 minorversion field has a value of 0 (zero), an error of 18451 NFS4ERR_OP_ILLEGAL, as described in the next paragraph, is returned 18452 to the client. If an operation array contains an operation 2 and the 18453 minorversion field is non-zero and the server does not support the 18454 minor version, the server returns an error of 18455 NFS4ERR_MINOR_VERS_MISMATCH. Therefore, the 18456 NFS4ERR_MINOR_VERS_MISMATCH error takes precedence over all other 18457 errors. 18459 It is possible that the server receives a request that contains an 18460 operation that is less than the first legal operation (OP_ACCESS) or 18461 greater than the last legal operation (OP_RECLAIM_COMPLETE). In 18462 this case, the server's response will encode the opcode OP_ILLEGAL 18463 rather than the illegal opcode of the request. The status field in 18464 the ILLEGAL return results will be set to NFS4ERR_OP_ILLEGAL. The 18465 COMPOUND procedure's return results will also be NFS4ERR_OP_ILLEGAL. 18467 The definition of the "tag" in the request is left to the 18468 implementor. It may be used to summarize the content of the compound 18469 request for the benefit of packet sniffers and engineers debugging 18470 implementations. However, the value of "tag" in the response SHOULD 18471 be the same value as provided in the request. This applies to the 18472 tag field of the CB_COMPOUND procedure as well. 18474 16.2.3.1. Current Filehandle and Stateid 18476 The COMPOUND procedure offers a simple environment for the execution 18477 of the operations specified by the client. Four items of state are maintained on behalf of the request: the current and saved filehandles, and the current and saved stateids. The first two relate to 18478 the filehandle while the second two relate to the current stateid. 18480 16.2.3.1.1. Current Filehandle 18482 The current and saved filehandles are used throughout the protocol. 18483 Most operations implicitly use the current filehandle as an argument 18484 and many set the current filehandle as part of the results. The 18485 combination of client-specified sequences of operations and current 18486 and saved filehandle arguments and results allows for greater 18487 protocol flexibility. A simple example of current 18488 filehandle usage is a sequence like the following: 18490 PUTFH fh1 {fh1} 18491 LOOKUP "compA" {fh2} 18492 GETATTR {fh2} 18493 LOOKUP "compB" {fh3} 18494 GETATTR {fh3} 18495 LOOKUP "compC" {fh4} 18496 GETATTR {fh4} 18497 GETFH 18499 Figure 2 18501 In this example, the PUTFH (Section 18.19) operation explicitly sets 18502 the current filehandle value while the result of each LOOKUP 18503 operation sets the current filehandle value to the resultant file 18504 system object. Also, the client is able to insert GETATTR operations 18505 using the current filehandle as an argument.
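As a further non-normative illustration, a client might construct the argument array for the sequence shown in Figure 2 as in the following C sketch. The structure below is a simplified stand-in for nfs_argop4 (real code would use bindings generated from the XDR in Section 16.2.1), and an actual NFSv4.1 request would also begin with a SEQUENCE operation (Section 18.46), which Figure 2 omits.

   #include <string.h>

   /* Operation numbers from the nfs_opnum4 enumeration above. */
   enum { OP_GETATTR = 9, OP_GETFH = 10, OP_LOOKUP = 15, OP_PUTFH = 22 };

   /* Simplified stand-in for nfs_argop4; only the fields used here. */
   struct sketch_op {
           int         argop;      /* OP_* value */
           const void *fh;         /* PUTFH: filehandle to make current */
           const char *component;  /* LOOKUP: component name */
   };

   /* Fill ops[] with PUTFH fh1, a LOOKUP/GETATTR pair per component,
      and a final GETFH, mirroring Figure 2.  The caller must provide
      room for 2 * ncomps + 2 entries; that count is returned. */
   static unsigned int
   build_figure2_args(struct sketch_op *ops, const void *fh1,
                      const char *const *comps, unsigned int ncomps)
   {
           unsigned int n = 0, i;

           memset(ops, 0, (2 * ncomps + 2) * sizeof(*ops));

           ops[n].argop = OP_PUTFH;             /* sets the current filehandle */
           ops[n++].fh = fh1;

           for (i = 0; i < ncomps; i++) {
                   ops[n].argop = OP_LOOKUP;    /* replaces the current filehandle */
                   ops[n++].component = comps[i];
                   ops[n++].argop = OP_GETATTR; /* uses the current filehandle */
           }

           ops[n++].argop = OP_GETFH;           /* returns the final filehandle */
           return n;
   }

For Figure 2 itself, comps would be { "compA", "compB", "compC" } and the resulting array would contain eight operations.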
18507 The PUTROOTFH (Section 18.21) and PUTPUBFH (Section 18.20) operations 18508 also set the current filehandle. The example in Figure 2 would replace 18509 "PUTFH fh1" with PUTROOTFH or PUTPUBFH with no filehandle argument in 18510 order to achieve the same effect (on the assumption that "compA" is 18511 directly below the root of the namespace). 18513 Along with the current filehandle, there is a saved filehandle. 18514 While the current filehandle is set as the result of operations like 18515 LOOKUP, the saved filehandle must be set directly with the use of the 18516 SAVEFH operation. The SAVEFH operation copies the current 18517 filehandle value to the saved value. The saved filehandle value is 18518 used in combination with the current filehandle value for the LINK 18519 and RENAME operations. The RESTOREFH operation will copy the saved 18520 filehandle value to the current filehandle value; as a result, the 18521 saved filehandle value may be used as a sort of "scratch" area for the 18522 client's series of operations. 18524 16.2.3.1.2. Current Stateid 18526 With NFSv4.1, additions of a current stateid and a saved stateid have 18527 been made to the COMPOUND processing environment; this allows for the 18528 passing of stateids between operations. There are no changes to the 18529 syntax of the protocol, only changes to the semantics of a few 18530 operations. 18532 A "current stateid" is the stateid that is associated with the 18533 current filehandle. The current stateid may only be changed by an 18534 operation that modifies the current filehandle or returns a stateid. 18535 If an operation returns a stateid it MUST set the current stateid to 18536 the returned value. If an operation sets the current filehandle but 18537 does not return a stateid, the current stateid MUST be set to the 18538 all-zeros special stateid, i.e., (seqid, other) = (0, 0). If an 18539 operation uses a stateid as an argument but does not return a 18540 stateid, the current stateid MUST NOT be changed. E.g., PUTFH, 18541 PUTROOTFH, and PUTPUBFH will change the current server state from 18542 {ocfh, (osid)} to {cfh, (0, 0)} while LOCK will change the current 18543 state from {cfh, (osid)} to {cfh, (nsid)}. Operations like LOOKUP 18544 that transform a current filehandle and component name into a new 18545 current filehandle will also change the current stateid to (0, 0). 18546 The SAVEFH and RESTOREFH operations will save and restore both the 18547 current filehandle and the current stateid as a set. 18549 The following example is the common case of a simple READ operation 18550 with a supplied stateid showing that the PUTFH initializes the 18551 current stateid to (0, 0). The subsequent READ with stateid (sid1) 18552 leaves the current stateid unchanged, but does evaluate the 18553 operation. 18555 PUTFH fh1 - -> {fh1, (0, 0)} 18556 READ (sid1), 0, 1024 {fh1, (0, 0)} -> {fh1, (0, 0)} 18558 Figure 3 18560 This next example performs an OPEN with the root filehandle and as a 18561 result generates stateid (sid1). The next operation specifies the 18562 READ with the argument stateid set such that (seqid, other) are equal 18563 to (1, 0), but the current stateid set by the previous operation is 18564 actually used when the operation is evaluated. This allows correct 18565 interaction with any existing, potentially conflicting, locks.
18567 PUTROOTFH - -> {fh1, (0, 0)} 18568 OPEN "compA" {fh1, (0, 0)} -> {fh2, (sid1)} 18569 READ (1, 0), 0, 1024 {fh2, (sid1)} -> {fh2, (sid1)} 18570 CLOSE (1, 0) {fh2, (sid1)} -> {fh2, (sid2)} 18571 Figure 4 18573 This next example is similar to the second in how it passes the 18574 stateid sid2 generated by the LOCK operation to the next READ 18575 operation. This allows the client to explicitly surround a single 18576 I/O operation with a lock and its appropriate stateid to guarantee 18577 correctness with other client locks. The example also shows how 18578 SAVEFH and RESTOREFH can save and later re-use a filehandle and 18579 stateid, passing them as the current filehandle and stateid to a READ 18580 operation. 18582 PUTFH fh1 - -> {fh1, (0, 0)} 18583 LOCK 0, 1024, (sid1) {fh1, (sid1)} -> {fh1, (sid2)} 18584 READ (1, 0), 0, 1024 {fh1, (sid2)} -> {fh1, (sid2)} 18585 LOCKU 0, 1024, (1, 0) {fh1, (sid2)} -> {fh1, (sid3)} 18586 SAVEFH {fh1, (sid3)} -> {fh1, (sid3)} 18588 PUTFH fh2 {fh1, (sid3)} -> {fh2, (0, 0)} 18589 WRITE (1, 0), 0, 1024 {fh2, (0, 0)} -> {fh2, (0, 0)} 18591 RESTOREFH {fh2, (0, 0)} -> {fh1, (sid3)} 18592 READ (1, 0), 1024, 1024 {fh1, (sid3)} -> {fh1, (sid3)} 18594 Figure 5 18596 The final example shows a disallowed use of the current stateid. The 18597 client is attempting to implicitly pass the anonymous special stateid 18598 (0, 0) to the READ operation. The server MUST return 18599 NFS4ERR_BAD_STATEID in the reply to the READ operation. 18601 PUTFH fh1 - -> {fh1, (0, 0)} 18602 READ (1, 0), 0, 1024 {fh1, (0, 0)} -> NFS4ERR_BAD_STATEID 18604 Figure 6 18606 16.2.4. ERRORS 18608 COMPOUND will of course return every error that each operation on the 18609 fore channel can return (see Table 6). However, if COMPOUND returns 18610 zero operations, obviously the error returned by COMPOUND has nothing 18611 to do with an error returned by an operation. The list of errors 18612 COMPOUND will return if it processes zero operations includes: 18614 COMPOUND error returns 18616 +------------------------------+------------------------------------+ 18617 | Error | Notes | 18618 +------------------------------+------------------------------------+ 18619 | NFS4ERR_BADCHAR | The tag argument has a character | 18620 | | the replier does not support. | 18621 | NFS4ERR_BADXDR | | 18622 | NFS4ERR_DELAY | | 18623 | NFS4ERR_INVAL | The tag argument is not in UTF-8 | 18624 | | encoding. | 18625 | NFS4ERR_MINOR_VERS_MISMATCH | | 18626 | NFS4ERR_SERVERFAULT | | 18627 | NFS4ERR_TOO_MANY_OPS | | 18628 | NFS4ERR_REP_TOO_BIG | | 18629 | NFS4ERR_REP_TOO_BIG_TO_CACHE | | 18630 | NFS4ERR_REQ_TOO_BIG | | 18631 +------------------------------+------------------------------------+ 18633 Table 9 18635 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 18637 The following tables summarize the operations of the NFSv4.1 protocol 18638 and the corresponding designation of REQUIRED, RECOMMENDED, or OPTIONAL 18639 to implement, or MUST NOT implement. The designation of MUST NOT 18640 implement is reserved for those operations that were defined in 18641 NFSv4.0 and MUST NOT be implemented in NFSv4.1. 18643 For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation 18644 for operations sent by the client is for the server implementation. 18645 The client is generally required to implement the operations needed 18646 for the operating environment that it serves. For example, a 18647 read-only NFSv4.1 client would have no need to implement the WRITE 18648 operation and is not required to do so.
18650 The REQUIRED or OPTIONAL designation for callback operations sent by 18651 the server is for both the client and server. Generally, the client 18652 has the option of creating the backchannel and sending the operations 18653 on the fore channel that will be a catalyst for the server sending 18654 callback operations. A partial exception is CB_RECALL_SLOT; the only 18655 way the client can avoid supporting this operation is by not creating 18656 a backchannel. 18658 Since this is a summary of the operations and their designation, 18659 there are subtleties that are not presented here. Therefore, if 18660 there is a question of the requirements of implementation, the 18661 operation descriptions themselves must be consulted along with other 18662 relevant explanatory text within this specification. 18664 The abbreviations used in the second and third columns of the table 18665 are defined as follows. 18667 REQ REQUIRED to implement 18669 REC RECOMMEND to implement 18671 OPT OPTIONAL to implement 18673 MNI MUST NOT implement 18675 For the NFSv4.1 features that are OPTIONAL, the operations that 18676 support those features are OPTIONAL and the server would return 18677 NFS4ERR_NOTSUPP in response to the client's use of those operations. 18678 If an OPTIONAL feature is supported, it is possible that a set of 18679 operations related to the feature become REQUIRED to implement. The 18680 third column of the table designates the feature(s) and if the 18681 operation is REQUIRED or OPTIONAL in the presence of support for the 18682 feature. 18684 The OPTIONAL features identified and their abbreviations are as 18685 follows: 18687 pNFS Parallel NFS 18689 FDELG File Delegations 18691 DDELG Directory Delegations 18693 Operations 18695 +----------------------+------------+--------------+----------------+ 18696 | Operation | REQ, REC, | Feature | Definition | 18697 | | OPT, or | (REQ, REC, | | 18698 | | MNI | or OPT) | | 18699 +----------------------+------------+--------------+----------------+ 18700 | ACCESS | REQ | | Section 18.1 | 18701 | BACKCHANNEL_CTL | REQ | | Section 18.33 | 18702 | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | 18703 | CLOSE | REQ | | Section 18.2 | 18704 | COMMIT | REQ | | Section 18.3 | 18705 | CREATE | REQ | | Section 18.4 | 18706 | CREATE_SESSION | REQ | | Section 18.36 | 18707 | DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | 18708 | DELEGRETURN | OPT | FDELG, | Section 18.6 | 18709 | | | DDELG, pNFS | | 18710 | | | (REQ) | | 18711 | DESTROY_CLIENTID | REQ | | Section 18.50 | 18712 | DESTROY_SESSION | REQ | | Section 18.37 | 18713 | EXCHANGE_ID | REQ | | Section 18.35 | 18714 | FREE_STATEID | REQ | | Section 18.38 | 18715 | GETATTR | REQ | | Section 18.7 | 18716 | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | 18717 | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | 18718 | GETFH | REQ | | Section 18.8 | 18719 | GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | 18720 | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | 18721 | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | 18722 | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | 18723 | LINK | OPT | | Section 18.9 | 18724 | LOCK | REQ | | Section 18.10 | 18725 | LOCKT | REQ | | Section 18.11 | 18726 | LOCKU | REQ | | Section 18.12 | 18727 | LOOKUP | REQ | | Section 18.13 | 18728 | LOOKUPP | REQ | | Section 18.14 | 18729 | NVERIFY | REQ | | Section 18.15 | 18730 | OPEN | REQ | | Section 18.16 | 18731 | OPENATTR | OPT | | Section 18.17 | 18732 | OPEN_CONFIRM | MNI | | N/A | 18733 | OPEN_DOWNGRADE | REQ | 
| Section 18.18 | 18734 | PUTFH | REQ | | Section 18.19 | 18735 | PUTPUBFH | REQ | | Section 18.20 | 18736 | PUTROOTFH | REQ | | Section 18.21 | 18737 | READ | REQ | | Section 18.22 | 18738 | READDIR | REQ | | Section 18.23 | 18739 | READLINK | OPT | | Section 18.24 | 18740 | RECLAIM_COMPLETE | REQ | | Section 18.51 | 18741 | RELEASE_LOCKOWNER | MNI | | N/A | 18742 | REMOVE | REQ | | Section 18.25 | 18743 | RENAME | REQ | | Section 18.26 | 18744 | RENEW | MNI | | N/A | 18745 | RESTOREFH | REQ | | Section 18.27 | 18746 | SAVEFH | REQ | | Section 18.28 | 18747 | SECINFO | REQ | | Section 18.29 | 18748 | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | 18749 | | | layout (REQ) | Section 13.12 | 18750 | SEQUENCE | REQ | | Section 18.46 | 18751 | SETATTR | REQ | | Section 18.30 | 18752 | SETCLIENTID | MNI | | N/A | 18753 | SETCLIENTID_CONFIRM | MNI | | N/A | 18754 | SET_SSV | REQ | | Section 18.47 | 18755 | TEST_STATEID | REQ | | Section 18.48 | 18756 | VERIFY | REQ | | Section 18.31 | 18757 | WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | 18758 | WRITE | REQ | | Section 18.32 | 18759 +----------------------+------------+--------------+----------------+ 18761 Callback Operations: 18763 Callback Operations 18765 +-------------------------+-----------+-------------+---------------+ 18766 | Operation | REQ, REC, | Feature | Definition | 18767 | | OPT, or | (REQ, REC, | | 18768 | | MNI | or OPT) | | 18769 +-------------------------+-----------+-------------+---------------+ 18770 | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | 18771 | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | 18772 | CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | 18773 | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | 18774 | CB_NOTIFY_LOCK | OPT | | Section 20.11 | 18775 | CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | 18776 | CB_RECALL | OPT | FDELG, | Section 20.2 | 18777 | | | DDELG, pNFS | | 18778 | | | (REQ) | | 18779 | CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | 18780 | | | DDELG, pNFS | | 18781 | | | (REQ) | | 18782 | CB_RECALL_SLOT | REQ | | Section 20.8 | 18783 | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | 18784 | | | (REQ) | | 18785 | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | 18786 | | | DDELG, pNFS | | 18787 | | | (REQ) | | 18788 | CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | 18789 | | | DDELG, pNFS | | 18790 | | | (REQ) | | 18791 +-------------------------+-----------+-------------+---------------+ 18793 18. NFSv4.1 Operations 18795 18.1. Operation 3: ACCESS - Check Access Rights 18796 18.1.1. ARGUMENTS 18798 const ACCESS4_READ = 0x00000001; 18799 const ACCESS4_LOOKUP = 0x00000002; 18800 const ACCESS4_MODIFY = 0x00000004; 18801 const ACCESS4_EXTEND = 0x00000008; 18802 const ACCESS4_DELETE = 0x00000010; 18803 const ACCESS4_EXECUTE = 0x00000020; 18805 struct ACCESS4args { 18806 /* CURRENT_FH: object */ 18807 uint32_t access; 18808 }; 18810 18.1.2. RESULTS 18812 struct ACCESS4resok { 18813 uint32_t supported; 18814 uint32_t access; 18815 }; 18817 union ACCESS4res switch (nfsstat4 status) { 18818 case NFS4_OK: 18819 ACCESS4resok resok4; 18820 default: 18821 void; 18822 }; 18824 18.1.3. DESCRIPTION 18826 ACCESS determines the access rights that a user, as identified by the 18827 credentials in the RPC request, has with respect to the file system 18828 object specified by the current filehandle. The client encodes the 18829 set of access rights that are to be checked in the bit mask "access". 18830 The server checks the permissions encoded in the bit mask. 
If a 18831 status of NFS4_OK is returned, two bit masks are included in the 18832 response. The first, "supported", represents the access rights for 18833 which the server can verify reliably. The second, "access", 18834 represents the access rights available to the user for the filehandle 18835 provided. On success, the current filehandle retains its value. 18837 Note that the reply's supported and access fields MUST NOT contain 18838 more values than originally set in the request's access field. For 18839 example, if the client sends an ACCESS operation with just the 18840 ACCESS4_READ value set and the server supports this value, the server 18841 MUST NOT set more than ACCESS4_READ in the supported field even if it 18842 could have reliably checked other values. 18844 The reply's access field MUST NOT contain more values than the 18845 supported field. 18847 The results of this operation are necessarily advisory in nature. A 18848 return status of NFS4_OK and the appropriate bit set in the bit mask 18849 does not imply that such access will be allowed to the file system 18850 object in the future. This is because access rights can be revoked 18851 by the server at any time. 18853 The following access permissions may be requested: 18855 ACCESS4_READ Read data from file or read a directory. 18857 ACCESS4_LOOKUP Look up a name in a directory (no meaning for non- 18858 directory objects). 18860 ACCESS4_MODIFY Rewrite existing file data or modify existing 18861 directory entries. 18863 ACCESS4_EXTEND Write new data or add directory entries. 18865 ACCESS4_DELETE Delete an existing directory entry. 18867 ACCESS4_EXECUTE Execute a regular file (no meaning for a directory). 18869 On success, the current filehandle retains its value. 18871 ACCESS4_EXECUTE is a challenging semantic to implement because NFS 18872 provides remote file access, not remote execution. This leads to the 18873 following: 18875 o Whether a regular file is executable or not ought to be the 18876 responsibility of the NFS client and not the server. And yet the 18877 ACCESS operation is specified to seemingly require a server to own 18878 that responsibility. 18880 o When a client executes a regular file, it has to read the file 18881 from the server. Strictly speaking, the server should not allow 18882 the client to read a file being executed unless the user has read 18883 permissions on the file. Requiring users and administrators to set 18884 read permissions on executable files in order to access them over 18885 NFS is not going to be acceptable to some people. Historically, 18886 NFS servers have allowed a user to READ a file if the user has 18887 execute access to the file. 18889 As a practical example, the UNIX specification [51] states that an 18890 implementation claiming conformance to UNIX may indicate in the 18891 access() programming interface's result that a privileged user has 18892 execute rights, even if no execute permission bits are set on the 18893 regular file's attributes. It is possible to claim conformance to 18894 the UNIX specification and instead not indicate execute rights in 18895 that situation, which is true for some operating environments. 18896 Suppose the operating environments of the client and server are 18897 implementing the access() semantics for privileged users differently, 18898 and the ACCESS operation implementations of the client and server 18899 follow their respective access() semantics.
This can cause undesired 18900 behavior: 18902 o Suppose the client's access() interface returns X_OK if the user 18903 is privileged and no execute permission bits are set on the 18904 regular file's attribute, and the server's access() interface does 18905 not return X_OK in that situation. Then the client will be unable 18906 to execute files stored on the NFS server that could be executed 18907 if stored on a non-NFS file system. 18909 o Suppose the client's access() interface does not return X_OK if 18910 the user is privileged, and no execute permission bits are set on 18911 the regular file's attribute, and the server's access() interface 18912 does return X_OK in that situation. Then: 18914 * The client will be able to execute files stored on the NFS 18915 server that could be executed if stored on a non-NFS file 18916 system, unless the client's execution subsystem also checks for 18917 execute permission bits. 18919 * Even if the execution subsystem is checking for execute 18920 permission bits, there are more potential issues. E.g. suppose 18921 the client is invoking access() to build a "path search table" 18922 of all executable files in the user's "search path", where the 18923 path is a list of directories each containing executable files. 18924 Suppose there are two files each in separate directories of the 18925 search path, such that the files have the same component name. In 18926 the first directory the file has no execute permission bits 18927 set, and in the second directory the file has execute bits set. 18928 The path search table will indicate that the first directory 18929 has the executable file, but the execute subsystem will fail to 18930 execute it. The command shell might fail to try the second 18931 file in the second directory. And even if it did, this is a 18932 potential performance issue. Clearly the desired outcome for 18933 the client is for the path search table to not contain the 18934 first file. 18936 To deal with the problems described above, the smart client, stupid server 18937 principle is used. The client owns overall responsibility for 18938 determining execute access and relies on the server to parse the 18939 execution permissions within the file's mode, acl, and dacl 18940 attributes. The rules for the client and server follow: 18942 o If the client is sending ACCESS in order to determine if the user 18943 can read the file, the client SHOULD set ACCESS4_READ in the 18944 request's access field. 18946 o If the client's operating environment only grants execution to the 18947 user if the user has execute access according to the execute 18948 permissions in the mode, acl, and dacl attributes, then if the 18949 client wants to determine execute access, the client SHOULD send 18950 an ACCESS request with the ACCESS4_EXECUTE bit set in the request's 18951 access field. 18953 o If the client's operating environment grants execution to the user 18954 even if the user does not have execute access according to the 18955 execute permissions in the mode, acl, and dacl attributes, then if 18956 the client wants to determine execute access, it SHOULD send an 18957 ACCESS request with both the ACCESS4_EXECUTE and ACCESS4_READ bits 18958 set in the request's access field.
This way, if any read or 18959 execute permission grants the user read or execute access (or if 18960 the server interprets the user as privileged), as indicated by the 18961 presence of ACCESS4_EXECUTE and/or ACCESS4_READ in the reply's 18962 access field, the client will be able to grant the user execute 18963 access to the file. 18965 o If the server supports execute permission bits, or some other 18966 method for denoting executability (e.g. the suffix of the name of 18967 the file might indicate execute), it MUST check only execute 18968 permissions, not read permissions, when determining whether the 18969 reply will have ACCESS4_EXECUTE set in the access field or not. 18970 The server MUST NOT also examine read permission bits when 18971 determining whether the reply will have ACCESS4_EXECUTE set in the 18972 access field or not. Even if the server's operating environment 18973 would grant execute access to the user (e.g., the user is 18974 privileged), the server MUST NOT reply with ACCESS4_EXECUTE set in 18975 the reply's access field, unless there is at least one execute 18976 permission bit set in the mode, acl, or dacl attributes. In the 18977 case of acl and dacl, the "one execute permission bit" MUST be an 18978 ACE4_EXECUTE bit set in an ALLOW ACE. 18980 o If the server does not support execute permission bits or some 18981 other method for denoting executability, it MUST NOT set 18982 ACCESS4_EXECUTE in the reply's supported and access fields. If 18983 the client set ACCESS4_EXECUTE in the ACCESS request's access 18984 field, and ACCESS4_EXECUTE is not set in the reply's supported 18985 field, then the client will have to send an ACCESS request with 18986 the ACCESS4_READ bit set in the request's access field. 18988 o If the server supports read permission bits, it MUST only check 18989 for read permissions in the mode, acl, and dacl attributes when it 18990 receives an ACCESS request with ACCESS4_READ set in the access field. 18991 The server MUST NOT also examine execute permission bits when 18992 determining whether the reply will have ACCESS4_READ set in the 18993 access field or not. 18995 Note that if the ACCESS reply has ACCESS4_READ or ACCESS4_EXECUTE set, 18996 then the user also has permissions to OPEN (Section 18.16) or READ 18997 (Section 18.22) the file. I.e., if the client sends an ACCESS request 18998 with the ACCESS4_READ and ACCESS4_EXECUTE set in the access field (or 18999 two separate requests, one with ACCESS4_READ set, and the other with 19000 ACCESS4_EXECUTE set), and the reply has just ACCESS4_EXECUTE set in 19001 the access field (or just one reply has ACCESS4_EXECUTE set), then 19002 the user has authorization to OPEN or READ the file. 19004 18.1.4. IMPLEMENTATION 19006 In general, it is not sufficient for the client to attempt to deduce 19007 access permissions by inspecting the uid, gid, and mode fields in the 19008 file attributes or by attempting to interpret the contents of the ACL 19009 attribute. This is because the server may perform uid or gid mapping 19010 or enforce additional access control restrictions. It is also 19011 possible that the server may not be in the same ID space as the 19012 client. In these cases (and perhaps others), the client cannot 19013 reliably perform an access check with only current file attributes. 19015 In the NFSv2 protocol, the only reliable way to determine whether an 19016 operation was allowed was to try it and see if it succeeded or 19017 failed.
Using the ACCESS operation in the NFSv4.1 protocol, the 19018 client can ask the server to indicate whether or not one or more 19019 classes of operations are permitted. The ACCESS operation is 19020 provided to allow clients to check before doing a series of 19021 operations which will result in an access failure. The OPEN 19022 operation provides a point where the server can verify access to the 19023 file object and a method to return that information to the client. The 19024 ACCESS operation is still useful for directory operations or for use 19025 in the case that the UNIX interface access() is used on the client. 19027 The information returned by the server in response to an ACCESS call 19028 is not permanent. It was correct at the exact time that the server 19029 performed the checks, but not necessarily afterwards. The server can 19030 revoke access permission at any time. 19032 The client should use the effective credentials of the user to build 19033 the authentication information in the ACCESS request used to 19034 determine access rights. It is the effective user and group 19035 credentials that are used in subsequent read and write operations. 19037 Many implementations do not directly support the ACCESS4_DELETE 19038 permission. Operating systems like UNIX will ignore the 19039 ACCESS4_DELETE bit if set on an access request on a non-directory 19040 object. In these systems, delete permission on a file is determined 19041 by the access permissions on the directory in which the file resides, 19042 instead of being determined by the permissions of the file itself. 19043 Therefore, the mask returned enumerating which access rights can be 19044 determined will have the ACCESS4_DELETE value set to 0. This 19045 indicates to the client that the server was unable to check that 19046 particular access right. The ACCESS4_DELETE bit in the access mask 19047 returned will then be ignored by the client. 19049 18.2. Operation 4: CLOSE - Close File 19051 18.2.1. ARGUMENTS 19053 struct CLOSE4args { 19054 /* CURRENT_FH: object */ 19055 seqid4 seqid; 19056 stateid4 open_stateid; 19057 }; 19059 18.2.2. RESULTS 19061 union CLOSE4res switch (nfsstat4 status) { 19062 case NFS4_OK: 19063 stateid4 open_stateid; 19064 default: 19065 void; 19066 }; 19068 18.2.3. DESCRIPTION 19070 The CLOSE operation releases share reservations for the regular or 19071 named attribute file as specified by the current filehandle. The 19072 share reservations and other state information released at the server 19073 as a result of this CLOSE is only that associated with the supplied 19074 stateid. State associated with other OPENs is not affected. 19076 If byte-range locks are held, the client SHOULD release all locks 19077 before issuing a CLOSE. The server MAY free all outstanding locks on 19078 CLOSE but some servers may not support the CLOSE of a file that still 19079 has byte-range locks held. The server MUST return failure if any 19080 locks would exist after the CLOSE. 19082 The argument seqid MAY have any value and the server MUST ignore 19083 seqid. 19085 On success, the current filehandle retains its value. 19087 The server MAY require that the principal, security flavor, and if 19088 applicable, the GSS mechanism, combination that sent the OPEN request 19089 also be the one to CLOSE the file. This might not be possible if 19090 credentials for the principal are no longer available. The server 19091 MAY allow the machine credential or SSV credential (see 19092 Section 18.35) to send CLOSE. 19094 18.2.4.
IMPLEMENTATION 19096 Even though CLOSE returns a stateid, this stateid is not useful to 19097 the client and should be treated as deprecated. CLOSE "shuts down" 19098 the state associated with all OPENs for the file by a single open- 19099 owner. As noted above, CLOSE will either release all file locking 19100 state or return an error. Therefore, the stateid returned by CLOSE 19101 is not useful for operations that follow. To help find any uses of 19102 this stateid by clients, the server SHOULD return the invalid special 19103 stateid (the "other" value is zero and the "seqid" field is 19104 NFS4_UINT32_MAX, see Section 8.2.3). 19106 A CLOSE operation may make delegations grantable where they were not 19107 previously. Servers may choose to respond immediately if there are 19108 pending delegation want requests or may respond to the situation at a 19109 later time. 19111 18.3. Operation 5: COMMIT - Commit Cached Data 19113 18.3.1. ARGUMENTS 19115 struct COMMIT4args { 19116 /* CURRENT_FH: file */ 19117 offset4 offset; 19118 count4 count; 19119 }; 19121 18.3.2. RESULTS 19123 struct COMMIT4resok { 19124 verifier4 writeverf; 19125 }; 19127 union COMMIT4res switch (nfsstat4 status) { 19128 case NFS4_OK: 19129 COMMIT4resok resok4; 19130 default: 19131 void; 19132 }; 19134 18.3.3. DESCRIPTION 19136 The COMMIT operation forces or flushes uncommitted, modified data to 19137 stable storage for the file specified by the current filehandle. The 19138 flushed data is that which was previously written with a WRITE 19139 operation which had the stable field set to UNSTABLE4. 19141 The offset specifies the position within the file where the flush is 19142 to begin. An offset value of 0 (zero) means to flush data starting 19143 at the beginning of the file. The count specifies the number of 19144 bytes of data to flush. If count is 0 (zero), a flush from offset to 19145 the end of the file is done. 19147 The server returns a write verifier upon successful completion of the 19148 COMMIT. The write verifier is used by the client to determine if the 19149 server has restarted between the initial WRITE(s) and the COMMIT. 19150 The client does this by comparing the write verifier returned from 19151 the initial writes and the verifier returned by the COMMIT operation. 19152 The server must vary the value of the write verifier at each server 19153 event or instantiation that may lead to a loss of uncommitted data. 19154 Most commonly this occurs when the server is restarted; however, 19155 other events at the server may result in uncommitted data loss as 19156 well. 19158 On success, the current filehandle retains its value. 19160 18.3.4. IMPLEMENTATION 19162 The COMMIT operation is similar in operation and semantics to the 19163 POSIX fsync() [24] system interface that synchronizes a file's state 19164 with the disk (file data and metadata is flushed to disk or stable 19165 storage). COMMIT performs the same operation for a client, flushing 19166 any unsynchronized data and metadata on the server to the server's 19167 disk or stable storage for the specified file. Like fsync(2), it may 19168 be that there is some modified data or no modified data to 19169 synchronize. The data may have been synchronized by the server's 19170 normal periodic buffer synchronization activity. COMMIT should 19171 return NFS4_OK, unless there has been an unexpected error.
19173 COMMIT differs from fsync(2) in that it is possible for the client to 19174 flush a range of the file (most likely triggered by a buffer- 19175 reclamation scheme on the client before the file has been completely 19176 written). 19178 The server implementation of COMMIT is reasonably simple. If the 19179 server receives a full file COMMIT request, that is starting at 19180 offset 0 and count 0, it should do the equivalent of fsync()'ing the 19181 file. Otherwise, it should arrange to have the modified data in the 19182 range specified by offset and count to be flushed to stable storage. 19183 In both cases, any metadata associated with the file must be flushed 19184 to stable storage before returning. It is not an error for there to 19185 be nothing to flush on the server. This means that the data and 19186 metadata that needed to be flushed have already been flushed or lost 19187 during the last server failure. 19189 The client implementation of COMMIT is a little more complex. There 19190 are two reasons for wanting to commit a client buffer to stable 19191 storage. The first is that the client wants to reuse a buffer. In 19192 this case, the offset and count of the buffer are sent to the server 19193 in the COMMIT request. The server then flushes any modified data 19194 based on the offset and count, and flushes any modified metadata 19195 associated with the file. It then returns the status of the flush 19196 and the write verifier. The other reason for the client to generate 19197 a COMMIT is for a full file flush, such as may be done at close. In 19198 this case, the client would gather all of the buffers for this file 19199 that contain uncommitted data, do the COMMIT operation with an offset 19200 of 0 and count of 0, and then free all of those buffers. Any other 19201 dirty buffers would be sent to the server in the normal fashion. 19203 After a buffer is written by the client with the stable parameter set 19204 to UNSTABLE4, the buffer must be considered as modified by the client 19205 until the buffer has either been flushed via a COMMIT operation or 19206 written via a WRITE operation with stable parameter set to FILE_SYNC4 19207 or DATA_SYNC4. This is done to prevent the buffer from being freed 19208 and reused before the data can be flushed to stable storage on the 19209 server. 19211 When a response is returned from either a WRITE or a COMMIT operation 19212 and it contains a write verifier that is different than previously 19213 returned by the server, the client will need to retransmit all of the 19214 buffers containing uncommitted data to the server. How this is to be 19215 done is up to the implementor. If there is only one buffer of 19216 interest, then it should be sent in a WRITE request with the FILE_SYNC4 19217 stable parameter. If there is more than one buffer, it might be 19218 worthwhile retransmitting all of the buffers in WRITE requests with 19219 the stable parameter set to UNSTABLE4 and then retransmitting the 19220 COMMIT operation to flush all of the data on the server to stable 19221 storage. However, if the server repeatedly returns from COMMIT a 19222 verifier that differs from that returned by WRITE, the only way to 19223 ensure progress is to retransmit all of the buffers with WRITE 19224 requests with the FILE_SYNC4 stable parameter. 19226 The above description applies to page-cache-based systems as well as 19227 buffer-cache-based systems. In those systems, the virtual memory 19228 system will need to be modified instead of the buffer cache.
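The verifier comparison described above can be illustrated by the following non-normative C sketch. It assumes the C definitions produced (for example, by rpcgen) from the XDR in this document, including verifier4 and NFS4_VERIFIER_SIZE; resend_uncommitted_buffers() is a hypothetical client routine.

      /* Non-normative sketch of the client-side verifier check
       * described above.  Assumes the C definitions generated from
       * the XDR in this document; resend_uncommitted_buffers() is a
       * hypothetical routine that retransmits all buffers still
       * marked as modified. */
      #include <string.h>

      void resend_uncommitted_buffers(void);   /* hypothetical */

      void
      check_commit_verifier(const verifier4 saved_write_verf,
                            const verifier4 commit_verf)
      {
              if (memcmp(saved_write_verf, commit_verf,
                         NFS4_VERIFIER_SIZE) != 0) {
                      /* A server event may have lost uncommitted
                       * data: retransmit every buffer not yet
                       * committed or written with FILE_SYNC4 or
                       * DATA_SYNC4. */
                      resend_uncommitted_buffers();
              }
      }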
19230 18.4. Operation 6: CREATE - Create a Non-Regular File Object 19232 18.4.1. ARGUMENTS 19234 union createtype4 switch (nfs_ftype4 type) { 19235 case NF4LNK: 19236 linktext4 linkdata; 19237 case NF4BLK: 19238 case NF4CHR: 19239 specdata4 devdata; 19240 case NF4SOCK: 19241 case NF4FIFO: 19242 case NF4DIR: 19243 void; 19244 default: 19245 void; /* server should return NFS4ERR_BADTYPE */ 19246 }; 19248 struct CREATE4args { 19249 /* CURRENT_FH: directory for creation */ 19250 createtype4 objtype; 19251 component4 objname; 19252 fattr4 createattrs; 19253 }; 19255 18.4.2. RESULTS 19257 struct CREATE4resok { 19258 change_info4 cinfo; 19259 bitmap4 attrset; /* attributes set */ 19260 }; 19262 union CREATE4res switch (nfsstat4 status) { 19263 case NFS4_OK: 19264 /* new CURRENTFH: created object */ 19265 CREATE4resok resok4; 19266 default: 19267 void; 19268 }; 19270 18.4.3. DESCRIPTION 19272 The CREATE operation creates a file object other than an ordinary 19273 file in a directory with a given name. The OPEN operation MUST be 19274 used to create a regular file or a named attribute. 19276 The current filehandle must be a directory: an object of type NF4DIR. 19277 If the current filehandle is an attribute directory (type 19278 NF4ATTRDIR), the error NFS4ERR_WRONG_TYPE is returned. If the 19279 current file handle designates any other type of object, the error 19280 NFS4ERR_NOTDIR results. 19282 The objname specifies the name for the new object. The objtype 19283 determines the type of object to be created: directory, symlink, etc. 19284 If the object type specified is that of an ordinary file, a named 19285 attribute, or a named attribute directory, the error NFS4ERR_BADTYPE 19286 results. 19288 If an object of the same name already exists in the directory, the 19289 server will return the error NFS4ERR_EXIST. 19291 For the directory where the new file object was created, the server 19292 returns change_info4 information in cinfo. With the atomic field of 19293 the change_info4 data type, the server will indicate if the before 19294 and after change attributes were obtained atomically with respect to 19295 the file object creation. 19297 If the objname has a length of 0 (zero), or if objname does not obey 19298 the UTF-8 definition, the error NFS4ERR_INVAL will be returned. 19300 The current filehandle is replaced by that of the new object. 19302 The createattrs specifies the initial set of attributes for the 19303 object. The set of attributes may include any writable attribute 19304 valid for the object type. When the operation is successful, the 19305 server will return to the client an attribute mask signifying which 19306 attributes were successfully set for the object. 19308 If createattrs includes neither the owner attribute nor an ACL with 19309 an ACE for the owner, and if the server's file system both supports 19310 and requires an owner attribute (or an owner ACE) then the server 19311 MUST derive the owner (or the owner ACE). This would typically be 19312 from the principal indicated in the RPC credentials of the call, but 19313 the server's operating environment or file system semantics may 19314 dictate other methods of derivation. Similarly, if createattrs 19315 includes neither the group attribute nor a group ACE, and if the 19316 server's file system both supports and requires the notion of a group 19317 attribute (or group ACE), the server MUST derive the group attribute 19318 (or the corresponding owner ACE) for the file. 
This could be from 19319 the RPC call's credentials, such as the group principal if the 19320 credentials include it (such as with AUTH_SYS), from the group 19321 identifier associated with the principal in the credentials (e.g., 19322 POSIX systems have a user database [25] that has a group identifier 19323 for every user identifier), inherited from the directory the object is 19324 created in, or whatever else the server's operating environment or 19325 file system semantics dictate. This applies to the OPEN operation 19326 too. 19328 Conversely, it is possible that the client will specify in createattrs an 19329 owner attribute, group attribute, or ACL that the principal indicated 19330 in the RPC call's credentials does not have permissions to create files 19331 for. The error to be returned in this instance is NFS4ERR_PERM. 19332 This applies to the OPEN operation too. 19334 If the current filehandle designates a directory for which another 19335 client holds a directory delegation, then, unless the delegation is 19336 such that the situation can be resolved by sending a notification, 19337 the delegation MUST be recalled, and the CREATE operation MUST NOT 19338 proceed until the delegation is returned or revoked. Except where 19339 this happens very quickly, one or more NFS4ERR_DELAY errors will be 19340 returned to requests made while delegation remains outstanding. 19342 When the current filehandle designates a directory for which one or 19343 more directory delegations exist, then, when those delegations 19344 request such notifications, NOTIFY4_ADD_ENTRY will be generated as a 19345 result of this operation. 19347 If the capability FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set 19348 (Section 14.4), and a symbolic link is being created, then the 19349 content of the symbolic link MUST be in UTF-8 encoding. 19351 18.4.4. IMPLEMENTATION 19353 If the client desires to set attribute values after the create, a 19354 SETATTR operation can be added to the COMPOUND request so that the 19355 appropriate attributes will be set. 19357 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery 19359 18.5.1. ARGUMENTS 19361 struct DELEGPURGE4args { 19362 clientid4 clientid; 19363 }; 19365 18.5.2. RESULTS 19367 struct DELEGPURGE4res { 19368 nfsstat4 status; 19369 }; 19371 18.5.3. DESCRIPTION 19373 Purges all of the delegations awaiting recovery for a given client. 19374 This is useful for clients which do not commit delegation information 19375 to stable storage to indicate that conflicting requests need not be 19376 delayed by the server awaiting recovery of delegation information. 19378 The client is NOT specified by the clientid field of the request. 19379 The client SHOULD set the client field to zero and the server MUST 19380 ignore the clientid field. Instead the server MUST derive the client 19381 ID from the value of the session ID in the arguments of the SEQUENCE 19382 operation that precedes DELEGPURGE in the COMPOUND request. 19384 This operation should be used by clients that record delegation 19385 information on stable storage on the client. In this case, 19386 DELEGPURGE should be sent immediately after doing delegation recovery 19387 on all delegations known to the client. Doing so will notify the 19388 server that no additional delegations for the client will be 19389 recovered, allowing it to free resources, and avoid delaying other 19390 clients which make requests that conflict with the unrecovered 19391 delegations.
The set of delegations known to the server and the 19392 client may be different. The reason for this is that a client may 19393 fail after making a request which resulted in delegation but before 19394 it received the results and committed them to the client's stable 19395 storage. 19397 The server MAY support DELEGPURGE, but if it does not, it MUST NOT 19398 support CLAIM_DELEGATE_PREV. 19400 18.6. Operation 8: DELEGRETURN - Return Delegation 19402 18.6.1. ARGUMENTS 19404 struct DELEGRETURN4args { 19405 /* CURRENT_FH: delegated object */ 19406 stateid4 deleg_stateid; 19407 }; 19409 18.6.2. RESULTS 19411 struct DELEGRETURN4res { 19412 nfsstat4 status; 19413 }; 19415 18.6.3. DESCRIPTION 19417 Returns the delegation represented by the current filehandle and 19418 stateid. 19420 Delegations may be returned when recalled or voluntarily (i.e. before 19421 the server has recalled them). In either case the client must 19422 properly propagate state changed under the context of the delegation 19423 to the server before returning the delegation. 19425 The server MAY require that the principal, security flavor, and if 19426 applicable, the GSS mechanism, combination that acquired the 19427 delegation also be the one to send DELEGRETURN on the file. This 19428 might not be possible if credentials for the principal are no longer 19429 available. The server MAY allow the machine credential or SSV 19430 credential (see Section 18.35) to send DELEGRETURN. 19432 18.7. Operation 9: GETATTR - Get Attributes 19434 18.7.1. ARGUMENTS 19436 struct GETATTR4args { 19437 /* CURRENT_FH: object */ 19438 bitmap4 attr_request; 19439 }; 19441 18.7.2. RESULTS 19443 struct GETATTR4resok { 19444 fattr4 obj_attributes; 19445 }; 19447 union GETATTR4res switch (nfsstat4 status) { 19448 case NFS4_OK: 19449 GETATTR4resok resok4; 19450 default: 19451 void; 19452 }; 19454 18.7.3. DESCRIPTION 19456 The GETATTR operation will obtain attributes for the file system 19457 object specified by the current filehandle. The client sets a bit in 19458 the bitmap argument for each attribute value that it would like the 19459 server to return. The server returns an attribute bitmap that 19460 indicates the attribute values which it was able to return, which 19461 will include all attributes requested by the client which are 19462 attributes supported by the server for the target file system. This 19463 bitmap is followed by the attribute values ordered lowest attribute 19464 number first. 19466 The server MUST return a value for each attribute that the client 19467 requests if the attribute is supported by the server for the target 19468 file system. If the server does not support a particular attribute 19469 on the target file system then it MUST NOT return the attribute value 19470 and MUST NOT set the attribute bit in the result bitmap. The server 19471 MUST return an error if it supports an attribute on the target but 19472 cannot obtain its value. In that case, no attribute values will be 19473 returned. 19475 File systems which are absent should be treated as having support for 19476 a very small set of attributes as described in GETATTR Within an 19477 Absent File System (Section 5), even if previously, when the file 19478 system was present, more attributes were supported. 19480 All servers MUST support the REQUIRED attributes as specified in File 19481 Attributes (Section 11.3.1), for all file systems, with the exception 19482 of absent file systems. 19484 On success, the current filehandle retains its value. 19486 18.7.4. 
IMPLEMENTATION 19488 Suppose there is a write delegation held by another client for the file 19489 in question and size and/or change are among the set of attributes 19490 being interrogated. The server has two choices. First, the server 19491 can obtain the actual current value of these attributes from the 19492 client holding the delegation by using the CB_GETATTR callback. 19493 Second, the server, particularly when the delegated client is 19494 unresponsive, can recall the delegation in question. The GETATTR 19495 MUST NOT proceed until one of the following occurs: 19497 o The requested attribute values are returned in the response to 19498 CB_GETATTR. 19500 o The write delegation is returned. 19502 o The write delegation is revoked. 19504 Unless one of the above happens very quickly, one or more 19505 NFS4ERR_DELAY errors will be returned while a delegation is 19506 outstanding. 19508 18.8. Operation 10: GETFH - Get Current Filehandle 19510 18.8.1. ARGUMENTS 19512 /* CURRENT_FH: */ 19513 void; 19515 18.8.2. RESULTS 19517 struct GETFH4resok { 19518 nfs_fh4 object; 19519 }; 19521 union GETFH4res switch (nfsstat4 status) { 19522 case NFS4_OK: 19523 GETFH4resok resok4; 19524 default: 19525 void; 19526 }; 19528 18.8.3. DESCRIPTION 19530 This operation returns the current filehandle value. 19532 On success, the current filehandle retains its value. 19534 As described in Section 2.10.6.4, GETFH is REQUIRED or RECOMMENDED to 19535 immediately follow certain operations, and servers are free to reject 19536 such operations if the client fails to insert GETFH in the request as 19537 REQUIRED or RECOMMENDED. Section 18.16.4.1 provides additional 19538 justification for why GETFH MUST follow OPEN. 19540 18.8.4. IMPLEMENTATION 19542 Operations that change the current filehandle like LOOKUP or CREATE 19543 do not automatically return the new filehandle as a result. For 19544 instance, if a client needs to look up a directory entry and obtain 19545 its filehandle, then the following request is needed. 19547 PUTFH (directory filehandle) 19549 LOOKUP (entry name) 19551 GETFH 19553 18.9. Operation 11: LINK - Create Link to a File 19555 18.9.1. ARGUMENTS 19557 struct LINK4args { 19558 /* SAVED_FH: source object */ 19559 /* CURRENT_FH: target directory */ 19560 component4 newname; 19561 }; 19563 18.9.2. RESULTS 19565 struct LINK4resok { 19566 change_info4 cinfo; 19567 }; 19569 union LINK4res switch (nfsstat4 status) { 19570 case NFS4_OK: 19571 LINK4resok resok4; 19572 default: 19573 void; 19574 }; 19576 18.9.3. DESCRIPTION 19578 The LINK operation creates an additional newname for the file 19579 represented by the saved filehandle, as set by the SAVEFH operation, 19580 in the directory represented by the current filehandle. The existing 19581 file and the target directory must reside within the same file system 19582 on the server. On success, the current filehandle will continue to 19583 be the target directory. If an object exists in the target directory 19584 with the same name as newname, the server must return NFS4ERR_EXIST. 19586 For the target directory, the server returns change_info4 information 19587 in cinfo. With the atomic field of the change_info4 data type, the 19588 server will indicate if the before and after change attributes were 19589 obtained atomically with respect to the link creation. 19591 If the newname has a length of 0 (zero), or if newname does not obey 19592 the UTF-8 definition, the error NFS4ERR_INVAL will be returned. 19594 18.9.4.
IMPLEMENTATION 19596 The server MAY impose restrictions on the LINK operation such that 19597 LINK may not be done when the file is open or when that open is done 19598 by particular protocols, or with particular options or access modes. 19599 When LINK is rejected because of such restrictions, the error 19600 NFS4ERR_FILE_OPEN is returned. 19602 If a server does implement such restrictions and those restrictions 19603 include cases of NFSv4 opens preventing successful execution of a 19604 link, the server needs to recall any delegations which could hide the 19605 existence of opens relevant to that decision. The reason is that 19606 when a client holds a delegation, the server might not have an 19607 accurate account of the opens for that client, since the client may 19608 execute OPENs and CLOSEs locally. The LINK operation must be delayed 19609 only until a definitive result can be obtained. E.g., suppose there 19610 are multiple delegations and one of them establishes an open whose 19611 presence would prevent the link. Given the server's semantics, 19612 NFS4ERR_FILE_OPEN may be returned to the caller as soon as that 19613 delegation is returned without waiting for other delegations to be 19614 returned. Similarly, if such opens are not associated with 19615 delegations, NFS4ERR_FILE_OPEN can be returned immediately with no 19616 delegation recall being done. 19618 If the current filehandle designates a directory for which another 19619 client holds a directory delegation, then, unless the delegation is 19620 such that the situation can be resolved by sending a notification, 19621 the delegation MUST be recalled, and the operation cannot be 19622 performed successfully until the delegation is returned or revoked. 19623 Except where this happens very quickly, one or more NFS4ERR_DELAY 19624 errors will be returned to requests made while delegation remains 19625 outstanding. 19627 When the current filehandle designates a directory for which one or 19628 more directory delegations exist, then, when those delegations 19629 request such notifications, instead of a recall, NOTIFY4_ADD_ENTRY 19630 will be generated as a result of the LINK operation. 19632 If the current file system supports the numlinks attribute, and other 19633 clients have delegations to the file being linked, then those 19634 delegations MUST be recalled and the LINK operation MUST NOT proceed 19635 until all delegations are returned or revoked. Except where this 19636 happens very quickly, one or more NFS4ERR_DELAY errors will be 19637 returned to requests made while delegation remains outstanding. 19639 Changes to any property of the "hard" linked files are reflected in 19640 all of the linked files. When a link is made to a file, the 19641 attributes for the file should have a value for numlinks that is one 19642 greater than the value before the LINK operation. 19644 The statement "file and the target directory must reside within the 19645 same file system on the server" means that the fsid fields in the 19646 attributes for the objects are the same. If they reside on different 19647 file systems, the error NFS4ERR_XDEV is returned. This error may be 19648 returned by some servers when there is an internal partitioning of a 19649 file system which the LINK operation would violate. 19651 On some servers, "." and ".." are illegal values for newname and the 19652 error NFS4ERR_BADNAME will be returned if they are specified.
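As a non-normative illustration of how the saved and current filehandles are established for LINK (see the argument comments in Section 18.9.1), a client might build its COMPOUND request along the lines of the following C sketch. The compound_t type and the compound_add_*() helpers are hypothetical stand-ins for whatever COMPOUND-building mechanism a client uses; the filehandles are assumed to have been obtained earlier (e.g., via LOOKUP and GETFH).

      /* Non-normative sketch.  nfs_fh4 comes from the XDR in this
       * document; compound_t and the compound_add_*() helpers are
       * hypothetical. */
      typedef struct compound compound_t;
      void compound_add_putfh(compound_t *, nfs_fh4 *);
      void compound_add_savefh(compound_t *);
      void compound_add_link(compound_t *, const char *newname);

      void
      build_link_request(compound_t *c, nfs_fh4 *source_fh,
                         nfs_fh4 *target_dir_fh, const char *newname)
      {
              /* A real request would begin with SEQUENCE (Section 18.46). */
              compound_add_putfh(c, source_fh);     /* CURRENT_FH: source file   */
              compound_add_savefh(c);               /* SAVED_FH: source file     */
              compound_add_putfh(c, target_dir_fh); /* CURRENT_FH: target dir    */
              compound_add_link(c, newname);        /* LINK creates newname      */
      }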
19654 When the current filehandle designates a named attribute directory 19655 and the object to be linked (the saved filehandle) is not a named 19656 attribute for the same object, the error NFS4ERR_XDEV MUST be 19657 returned. When the saved filehandle designates a named attribute and 19658 the current filehandle is not the appropriate named attribute 19659 directory, the error NFS4ERR_XDEV MUST also be returned. 19661 When the current filehandle designates a named attribute directory 19662 and the object to be linked (the saved filehandle) is a named 19663 attribute within that directory, the server may return the error 19664 NFS4ERR_NOTSUPP. 19666 In the case that newname is already linked to the file represented by 19667 the saved filehandle, the server will return NFS4ERR_EXIST. 19669 Note that symbolic links are created with the CREATE operation. 19671 18.10. Operation 12: LOCK - Create Lock 19673 18.10.1. ARGUMENTS 19675 /* 19676 * For LOCK, transition from open_stateid and lock_owner 19677 * to a lock stateid. 19678 */ 19679 struct open_to_lock_owner4 { 19680 seqid4 open_seqid; 19681 stateid4 open_stateid; 19682 seqid4 lock_seqid; 19683 lock_owner4 lock_owner; 19684 }; 19686 /* 19687 * For LOCK, existing lock stateid continues to request new 19688 * file lock for the same lock_owner and open_stateid. 19689 */ 19690 struct exist_lock_owner4 { 19691 stateid4 lock_stateid; 19692 seqid4 lock_seqid; 19693 }; 19695 union locker4 switch (bool new_lock_owner) { 19696 case TRUE: 19697 open_to_lock_owner4 open_owner; 19698 case FALSE: 19699 exist_lock_owner4 lock_owner; 19700 }; 19702 /* 19703 * LOCK/LOCKT/LOCKU: Record lock management 19704 */ 19705 struct LOCK4args { 19706 /* CURRENT_FH: file */ 19707 nfs_lock_type4 locktype; 19708 bool reclaim; 19709 offset4 offset; 19710 length4 length; 19711 locker4 locker; 19712 }; 19714 18.10.2. RESULTS 19716 struct LOCK4denied { 19717 offset4 offset; 19718 length4 length; 19719 nfs_lock_type4 locktype; 19720 lock_owner4 owner; 19721 }; 19723 struct LOCK4resok { 19724 stateid4 lock_stateid; 19725 }; 19727 union LOCK4res switch (nfsstat4 status) { 19728 case NFS4_OK: 19729 LOCK4resok resok4; 19730 case NFS4ERR_DENIED: 19731 LOCK4denied denied; 19732 default: 19733 void; 19734 }; 19736 18.10.3. DESCRIPTION 19738 The LOCK operation requests a byte-range lock for the byte range 19739 specified by the offset and length parameters, and lock type 19740 specified in the locktype parameter. If this is a reclaim request, 19741 the reclaim parameter will be TRUE. 19743 Bytes in a file may be locked even if those bytes are not currently 19744 allocated to the file. To lock the file from a specific offset 19745 through the end-of-file (no matter how long the file actually is) use 19746 a length field equal to NFS4_UINT64_MAX. The server MUST return 19747 NFS4ERR_INVAL under the following combinations of length and offset: 19749 o Length is equal to zero. 19751 o Length is not equal to NFS4_UINT64_MAX, and the sum of length and 19752 offset exceeds NFS4_UINT64_MAX. 19754 32-bit servers are servers that support locking for byte offsets that 19755 fit within 32 bits (i.e. less than or equal to NFS4_UINT32_MAX). If 19756 the client specifies a range that overlaps one or more bytes beyond 19757 offset NFS4_UINT32_MAX, but does not end at offset NFS4_UINT64_MAX, 19758 then such a 32-bit server MUST return the error NFS4ERR_BAD_RANGE. 19760 If the server returns NFS4ERR_DENIED, owner, offset, and length of a 19761 conflicting lock are returned. 
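The offset and length rules above can be summarized by the following non-normative C sketch of the checks a server (or a client wishing to avoid a round trip) might apply. It assumes the NFS4_UINT64_MAX and NFS4_UINT32_MAX constants and the nfsstat4 values defined earlier in this document; the server_is_32bit flag is an assumption standing in for however an implementation records that property.

      /* Non-normative sketch of the offset/length checks described
       * above.  Assumes the constants and nfsstat4 values defined in
       * this document's XDR. */
      #include <stdbool.h>
      #include <stdint.h>

      nfsstat4
      check_lock_range(uint64_t offset, uint64_t length,
                       bool server_is_32bit)
      {
              if (length == 0)
                      return NFS4ERR_INVAL;
              if (length != NFS4_UINT64_MAX &&
                  offset > NFS4_UINT64_MAX - length)
                      return NFS4ERR_INVAL;  /* offset + length overflows */

              if (server_is_32bit) {
                      /* Last byte of the requested range. */
                      uint64_t last = (length == NFS4_UINT64_MAX) ?
                          NFS4_UINT64_MAX : offset + length - 1;
                      /* Range reaches beyond NFS4_UINT32_MAX but does
                       * not end at NFS4_UINT64_MAX. */
                      if (last > NFS4_UINT32_MAX && last != NFS4_UINT64_MAX)
                              return NFS4ERR_BAD_RANGE;
              }
              return NFS4_OK;
      }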
19763 The locker argument specifies the lock-owner that is associated with 19764 the LOCK request. The locker4 structure is a switched union that 19765 indicates whether the client has already created byte-range locking 19766 state associated with the current open file and lock-owner. In the 19767 case in which it has, the argument is just a stateid representing the 19768 set of locks associated with that open file and lock-owner, together 19769 with a lock_seqid value which MAY be any value and MUST be ignored by 19770 the server. In the case where no byte-range locking state has been 19771 established, or the client does not have the stateid available, the 19772 argument contains the stateid of the open file with which this lock 19773 is to be associated, together with the lock-owner with which the lock 19774 is to be associated. The open_to_lock_owner case covers the very 19775 first lock done by a lock-owner for a given open file and offers a 19776 method to use the established state of the open_stateid to transition 19777 to the use of a lock stateid. 19779 The following fields of the locker parameter MAY be set to any value 19780 by the client and MUST be ignored by the server: 19782 o The clientid field of the lock_owner field of the open_owner field 19783 (locker.open_owner.lock_owner.clientid). The reason the server 19784 MUST ignore the clientid field is that the server MUST derive the 19785 client ID from the session ID from the SEQUENCE operation of the 19786 COMPOUND request. 19788 o The open_seqid and lock_seqid fields of the open_owner field 19789 (locker.open_owner.open_seqid and locker.open_owner.lock_seqid). 19791 o The lock_seqid field of the lock_owner field 19792 (locker.lock_owner.lock_seqid). 19794 Note that the client ID appearing in a LOCK4denied structure is the 19795 actual client associated with the conflicting lock, whether this is 19796 the client ID associated with the current session, or a different 19797 one. Thus if the server returns NFS4ERR_DENIED, it MUST set the 19798 clientid field of the owner field of the denied field. 19800 If the current filehandle is not an ordinary file, an error will be 19801 returned to the client. In the case that the current filehandle 19802 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 19803 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 19804 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 19806 On success, the current filehandle retains its value. 19808 18.10.4. IMPLEMENTATION 19810 If the server is unable to determine the exact offset and length of 19811 the conflicting lock, the same offset and length that were provided 19812 in the arguments should be returned in the denied results. 19814 LOCK operations are subject to permission checks and to checks 19815 against the access type of the associated file. However, the 19816 specific rights and modes required for various types of locks reflect 19817 the semantics of the server-exported file system, and are not 19818 specified by the protocol. For example, Windows 2000 allows a write 19819 lock of a file open for READ, while a POSIX-compliant system does 19820 not.
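To illustrate the locker4 switched union discussed in the DESCRIPTION above, the following non-normative C sketch fills in the two cases. It assumes rpcgen-style C definitions generated from the XDR in this document (in particular, that the union arm is reached through a locker4_u member); the stateid and owner values are assumed to have come from earlier OPEN and LOCK results.

      /* Non-normative sketch of filling the locker4 argument.
       * Assumes rpcgen-style generated definitions; field and union
       * member names may differ in a given implementation. */
      void
      fill_locker(locker4 *l, int first_lock_for_owner,
                  stateid4 *open_stateid, lock_owner4 *owner,
                  stateid4 *lock_stateid)
      {
              if (first_lock_for_owner) {
                      /* Transition from the open stateid to a lock stateid. */
                      l->new_lock_owner = 1;  /* TRUE */
                      l->locker4_u.open_owner.open_stateid = *open_stateid;
                      l->locker4_u.open_owner.lock_owner   = *owner;
                      /* open_seqid and lock_seqid MAY be any value;
                       * the server MUST ignore them. */
                      l->locker4_u.open_owner.open_seqid = 0;
                      l->locker4_u.open_owner.lock_seqid = 0;
              } else {
                      /* Existing byte-range locking state for this
                       * open file and lock-owner. */
                      l->new_lock_owner = 0;  /* FALSE */
                      l->locker4_u.lock_owner.lock_stateid = *lock_stateid;
                      l->locker4_u.lock_owner.lock_seqid   = 0; /* ignored */
              }
      }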
19822 When the client makes a lock request that corresponds to a range that 19823 the lock-owner has locked already (with the same or different lock 19824 type), or to a sub-region of such a range, or to a region which 19825 includes multiple locks already granted to that lock-owner, in whole 19826 or in part, and the server does not support such locking operations 19827 (i.e. does not support POSIX locking semantics), the server will 19828 return the error NFS4ERR_LOCK_RANGE. In that case, the client may 19829 return an error, or it may emulate the required operations, using 19830 only LOCK for ranges that do not include any bytes already locked by 19831 that lock-owner and LOCKU of locks held by that lock-owner 19832 (specifying an exactly-matching range and type). Similarly, when the 19833 client makes a lock request that amounts to upgrading (changing from 19834 a read lock to a write lock) or downgrading (changing from write lock 19835 to a read lock) an existing byte-range lock, and the server does not 19836 support such a lock, the server will return NFS4ERR_LOCK_NOTSUPP. 19837 Such operations may not perfectly reflect the required semantics in 19838 the face of conflicting lock requests from other clients. 19840 When a client holds a write delegation, the client holding that 19841 delegation is assured that there are no opens by other clients. 19842 Thus, there can be no conflicting LOCK requests from such clients. 19843 Therefore, the client may be handling locking requests locally, 19844 without doing LOCK operations on the server. If it does that, it 19845 must be prepared to update the lock status on the server, by doing 19846 appropriate LOCK and LOCKU requests before returning the delegation. 19848 When one or more clients hold read delegations, any LOCK request 19849 where the server is implementing mandatory locking semantics MUST 19850 result in the recall of all such delegations. The LOCK request may 19851 not be granted until all such delegations are returned or revoked. 19852 Except where this happens very quickly, one or more NFS4ERR_DELAY 19853 errors will be returned to requests made while the delegation remains 19854 outstanding. 19856 18.11. Operation 13: LOCKT - Test For Lock 19858 18.11.1. ARGUMENTS 19860 struct LOCKT4args { 19861 /* CURRENT_FH: file */ 19862 nfs_lock_type4 locktype; 19863 offset4 offset; 19864 length4 length; 19865 lock_owner4 owner; 19866 }; 19868 18.11.2. RESULTS 19870 union LOCKT4res switch (nfsstat4 status) { 19871 case NFS4ERR_DENIED: 19872 LOCK4denied denied; 19873 case NFS4_OK: 19874 void; 19875 default: 19876 void; 19877 }; 19879 18.11.3. DESCRIPTION 19881 The LOCKT operation tests the lock as specified in the arguments. If 19882 a conflicting lock exists, the owner, offset, length, and type of the 19883 conflicting lock are returned. The owner field in the results 19884 includes the client ID of the owner of the conflicting lock, whether this 19885 is the client ID associated with the current session or a different 19886 client ID. If no lock is held, nothing other than NFS4_OK is 19887 returned. Lock types READ_LT and READW_LT are processed in the same 19888 way in that a conflicting lock test is done without regard to 19889 blocking or non-blocking. The same is true for WRITE_LT and 19890 WRITEW_LT. 19892 The ranges are specified as for LOCK. The NFS4ERR_INVAL and 19893 NFS4ERR_BAD_RANGE errors are returned under the same circumstances as 19894 for LOCK.
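The possible outcomes of the test just described can be summarized by the following non-normative C sketch of a client interpreting LOCKT4res. It assumes rpcgen-style definitions generated from the XDR above (in particular, that the denied arm is reached through a LOCKT4res_u member).

      /* Non-normative sketch of interpreting a LOCKT result.
       * Assumes rpcgen-style generated definitions. */
      #include <stdbool.h>

      bool
      lock_would_conflict(const LOCKT4res *res)
      {
              switch (res->status) {
              case NFS4_OK:
                      return false;   /* no conflicting lock is held */
              case NFS4ERR_DENIED:
                      /* res->LOCKT4res_u.denied carries the owner,
                       * offset, length, and type of a conflicting
                       * lock. */
                      return true;
              default:
                      /* Some other error (e.g., NFS4ERR_INVAL); the
                       * test itself did not complete, so treat the
                       * range as unavailable. */
                      return true;
              }
      }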
19896 The clientid field of the owner MAY be set to any value by the client 19897 and MUST be ignored by the server. The reason the server MUST ignore 19898 the clientid field is that the server MUST derive the client ID from 19899 the session ID from the SEQUENCE operation of the COMPOUND request. 19901 If the current filehandle is not an ordinary file, an error will be 19902 returned to the client. In the case that the current filehandle 19903 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 19904 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 19905 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 19907 On success, the current filehandle retains its value. 19909 18.11.4. IMPLEMENTATION 19911 If the server is unable to determine the exact offset and length of 19912 the conflicting lock, the same offset and length that were provided 19913 in the arguments should be returned in the denied results. 19915 LOCKT uses a lock_owner4 rather than a stateid4, as is used in LOCK to 19916 identify the owner. This is because the client does not have to open 19917 the file to test for the existence of a lock, so a stateid might not 19918 be available. 19920 As noted in Section 18.10.4, some servers may return 19921 NFS4ERR_LOCK_RANGE to certain (otherwise non-conflicting) lock 19922 requests that overlap ranges already granted to the current lock- 19923 owner. 19925 The LOCKT operation's test for conflicting locks SHOULD exclude locks 19926 for the current lock-owner, and thus should return NFS4_OK in such 19927 cases. Note that this means that a server might return NFS4_OK to a 19928 LOCKT request even though a LOCK request for the same range and lock- 19929 owner would fail with NFS4ERR_LOCK_RANGE. 19931 When a client holds a write delegation, it may choose (see 19932 Section 18.10.4) to handle LOCK requests locally. In such a case, 19933 LOCKT requests will similarly be handled locally. 19935 18.12. Operation 14: LOCKU - Unlock File 19937 18.12.1. ARGUMENTS 19939 struct LOCKU4args { 19940 /* CURRENT_FH: file */ 19941 nfs_lock_type4 locktype; 19942 seqid4 seqid; 19943 stateid4 lock_stateid; 19944 offset4 offset; 19945 length4 length; 19946 }; 19948 18.12.2. RESULTS 19950 union LOCKU4res switch (nfsstat4 status) { 19951 case NFS4_OK: 19952 stateid4 lock_stateid; 19953 default: 19954 void; 19955 }; 19957 18.12.3. DESCRIPTION 19959 The LOCKU operation unlocks the byte-range lock specified by the 19960 parameters. The client may set the locktype field to any value that 19961 is legal for the nfs_lock_type4 enumerated type, and the server MUST 19962 accept any legal value for locktype. Any legal value for locktype 19963 has no effect on the success or failure of the LOCKU operation. 19965 The ranges are specified as for LOCK. The NFS4ERR_INVAL and 19966 NFS4ERR_BAD_RANGE errors are returned under the same circumstances as 19967 for LOCK. 19969 The seqid parameter MAY be any value and the server MUST ignore it. 19971 If the current filehandle is not an ordinary file, an error will be 19972 returned to the client. In the case that the current filehandle 19973 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 19974 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 19975 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 19977 On success, the current filehandle retains its value.
19979 The server MAY require that the principal, security flavor, and if 19980 applicable, the GSS mechanism, combination that sent a LOCK request 19981 also be the one to send LOCKU on the file. This might not be 19982 possible if credentials for the principal are no longer available. 19983 The server MAY allow the machine credential or SSV credential (see 19984 Section 18.35) to send LOCKU. 19986 18.12.4. IMPLEMENTATION 19988 If the area to be unlocked does not correspond exactly to a lock 19989 actually held by the lock-owner, the server may return the error 19990 NFS4ERR_LOCK_RANGE. This includes the case in which the area is not 19991 locked, where the area is a sub-range of the area locked, where it 19992 overlaps the area locked without matching exactly, or where the area 19993 specified includes multiple locks held by the lock-owner. In all of 19994 these cases, allowed by POSIX locking [23] semantics, a client 19995 receiving this error should, if it desires support for such 19996 operations, simulate the operation using LOCKU on ranges 19997 corresponding to locks it actually holds, possibly followed by LOCK 19998 requests for the sub-ranges not being unlocked. 20000 When a client holds a write delegation, it may choose (see 20001 Section 18.10.4) to handle LOCK requests locally. In such a case, 20002 LOCKU requests will similarly be handled locally. 20004 18.13. Operation 15: LOOKUP - Lookup Filename 20006 18.13.1. ARGUMENTS 20008 struct LOOKUP4args { 20009 /* CURRENT_FH: directory */ 20010 component4 objname; 20011 }; 20013 18.13.2. RESULTS 20015 struct LOOKUP4res { 20016 /* New CURRENT_FH: object */ 20017 nfsstat4 status; 20018 }; 20020 18.13.3. DESCRIPTION 20022 This operation LOOKUPs or finds a file system object using the 20023 directory specified by the current filehandle. LOOKUP evaluates the 20024 component and if the object exists the current filehandle is replaced 20025 with the component's filehandle. 20027 If the component cannot be evaluated either because it does not exist 20028 or because the client does not have permission to evaluate the 20029 component, then an error will be returned and the current filehandle 20030 will be unchanged. 20032 If the component is a zero length string or if any component does not 20033 obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned. 20035 18.13.4. IMPLEMENTATION 20037 If the client wants to achieve the effect of a multi-component 20038 lookup, it may construct a COMPOUND request such as (and obtain each 20039 filehandle): 20041 PUTFH (directory filehandle) 20042 LOOKUP "pub" 20043 GETFH 20044 LOOKUP "foo" 20045 GETFH 20046 LOOKUP "bar" 20047 GETFH 20049 Unlike NFSv3, NFSv4.1 allows LOOKUP requests to cross mountpoints on 20050 the server. The client can detect a mountpoint crossing by comparing 20051 the fsid attribute of the directory with the fsid attribute of the 20052 directory looked up. If the fsids are different, then the new 20053 directory is a server mountpoint. UNIX clients that detect a 20054 mountpoint crossing will need to mount the server's file system. 20055 This needs to be done to maintain the file object identity checking 20056 mechanisms common to UNIX clients. 20058 Servers that limit NFS access to "shares" or "exported" file systems 20059 should provide a pseudo file system into which the exported file 20060 systems can be integrated, so that clients can browse the server's 20061 name space. The client's view of a pseudo file system will be limited 20062 to paths that lead to exported file systems.
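The mountpoint-crossing check described above can be expressed as the following non-normative C sketch: compare the fsid attribute of the parent directory with that of the object just looked up. It assumes the fsid4 type from the XDR in this document; how a client obtains and caches the fsid values is left unspecified.

      /* Non-normative sketch of the mountpoint-crossing check
       * described above.  Assumes the fsid4 type from this
       * document's XDR. */
      #include <stdbool.h>

      bool
      crossed_mountpoint(const fsid4 *parent_fsid, const fsid4 *child_fsid)
      {
              return parent_fsid->major != child_fsid->major ||
                     parent_fsid->minor != child_fsid->minor;
      }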
20064 Note: previous versions of the protocol assigned special semantics to 20065 the names "." and "..". NFSv4.1 assigns no special semantics to 20066 these names. The LOOKUPP operator must be used to lookup a parent 20067 directory. 20069 Note that this operation does not follow symbolic links. The client 20070 is responsible for all parsing of filenames including filenames that 20071 are modified by symbolic links encountered during the lookup process. 20073 If the current filehandle supplied is not a directory but a symbolic 20074 link, the error NFS4ERR_SYMLINK is returned as the error. For all 20075 other non-directory file types, the error NFS4ERR_NOTDIR is returned. 20077 18.14. Operation 16: LOOKUPP - Lookup Parent Directory 20079 18.14.1. ARGUMENTS 20081 /* CURRENT_FH: object */ 20082 void; 20084 18.14.2. RESULTS 20086 struct LOOKUPP4res { 20087 /* new CURRENT_FH: parent directory */ 20088 nfsstat4 status; 20089 }; 20091 18.14.3. DESCRIPTION 20093 The current filehandle is assumed to refer to a regular directory or 20094 a named attribute directory. LOOKUPP assigns the filehandle for its 20095 parent directory to be the current filehandle. If there is no parent 20096 directory an NFS4ERR_NOENT error must be returned. Therefore, 20097 NFS4ERR_NOENT will be returned by the server when the current 20098 filehandle is at the root or top of the server's file tree. 20100 As is the case with LOOKUP, LOOKUPP will also cross mountpoints. 20102 If the current filehandle is not a directory or named attribute 20103 directory, the error NFS4ERR_NOTDIR is returned. 20105 If the requester's security flavor does not match that configured for 20106 the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC 20107 (a future minor revision of NFSv4 may upgrade this to MUST) in the 20108 LOOKUPP response. However, if the server does so, it MUST support 20109 the SECINFO_NO_NAME operation (Section 18.45), so that the client can 20110 gracefully determine the correct security flavor. 20112 If the current filehandle is a named attribute directory that is 20113 associated with a file system object via OPENATTR (i.e. not a sub- 20114 directory of a named attribute directory) LOOKUPP SHOULD return the 20115 filehandle of the associated file system object. 20117 18.14.4. IMPLEMENTATION 20119 An issue to note is upward navigation from named attribute 20120 directories. The named attribute directories are essentially 20121 detached from the namespace and this property should be safely 20122 represented in the client operating environment. LOOKUPP on a named 20123 attribute directory may return the filehandle of the associated file 20124 and conveying this to applications might be unsafe as many 20125 applications expect the parent of an object to always be a directory. 20126 Therefore the client may want to hide the parent of named attribute 20127 directories (represented as ".." in UNIX) or represent the named 20128 attribute directory as its own parent (as typically done for the file 20129 system root directory in UNIX). 20131 18.15. Operation 17: NVERIFY - Verify Difference in Attributes 20133 18.15.1. ARGUMENTS 20135 struct NVERIFY4args { 20136 /* CURRENT_FH: object */ 20137 fattr4 obj_attributes; 20138 }; 20140 18.15.2. RESULTS 20142 struct NVERIFY4res { 20143 nfsstat4 status; 20144 }; 20146 18.15.3. DESCRIPTION 20148 This operation is used to prefix a sequence of operations to be 20149 performed if one or more attributes have changed on some file system 20150 object. 
If all the attributes match then the error NFS4ERR_SAME MUST 20151 be returned. 20153 On success, the current filehandle retains its value. 20155 18.15.4. IMPLEMENTATION 20157 This operation is useful as a cache validation operator. If the 20158 object to which the attributes belong has changed then the following 20159 operations may obtain new data associated with that object. For 20160 instance, to check if a file has been changed and obtain new data if 20161 it has: 20163 SEQUENCE 20164 PUTFH fh 20165 NVERIFY attrbits attrs 20166 READ 0 32767 20168 Contrast this with NFSv3, which would first send a GETATTR in one 20169 request/reply round trip, and then if attributes indicated that the 20170 client's cache was stale, then send a READ in another request/reply 20171 round trip. 20173 In the case that a RECOMMENDED attribute is specified in the NVERIFY 20174 operation and the server does not support that attribute for the file 20175 system object, the error NFS4ERR_ATTRNOTSUPP is returned to the 20176 client. 20178 When the attribute rdattr_error or any set-only attribute (e.g. 20179 time_modify_set) is specified, the error NFS4ERR_INVAL is returned to 20180 the client. 20182 18.16. Operation 18: OPEN - Open a Regular File 20184 18.16.1. ARGUMENTS 20186 /* 20187 * Various definitions for OPEN 20188 */ 20189 enum createmode4 { 20190 UNCHECKED4 = 0, 20191 GUARDED4 = 1, 20192 /* Deprecated in NFSv4.1. */ 20193 EXCLUSIVE4 = 2, 20194 /* 20195 * New to NFSv4.1. If session is persistent, 20196 * GUARDED4 MUST be used. Otherwise, use 20197 * EXCLUSIVE4_1 instead of EXCLUSIVE4. 20198 */ 20199 EXCLUSIVE4_1 = 3 20200 }; 20202 struct creatverfattr { 20203 verifier4 cva_verf; 20204 fattr4 cva_attrs; 20205 }; 20207 union createhow4 switch (createmode4 mode) { 20208 case UNCHECKED4: 20209 case GUARDED4: 20210 fattr4 createattrs; 20211 case EXCLUSIVE4: 20212 verifier4 createverf; 20213 case EXCLUSIVE4_1: 20214 creatverfattr ch_createboth; 20215 }; 20217 enum opentype4 { 20218 OPEN4_NOCREATE = 0, 20219 OPEN4_CREATE = 1 20220 }; 20222 union openflag4 switch (opentype4 opentype) { 20223 case OPEN4_CREATE: 20224 createhow4 how; 20225 default: 20227 void; 20228 }; 20230 /* Next definitions used for OPEN delegation */ 20231 enum limit_by4 { 20232 NFS_LIMIT_SIZE = 1, 20233 NFS_LIMIT_BLOCKS = 2 20234 /* others as needed */ 20235 }; 20237 struct nfs_modified_limit4 { 20238 uint32_t num_blocks; 20239 uint32_t bytes_per_block; 20240 }; 20242 union nfs_space_limit4 switch (limit_by4 limitby) { 20243 /* limit specified as file size */ 20244 case NFS_LIMIT_SIZE: 20245 uint64_t filesize; 20246 /* limit specified by number of blocks */ 20247 case NFS_LIMIT_BLOCKS: 20248 nfs_modified_limit4 mod_blocks; 20249 } ; 20251 /* 20252 * Share Access and Deny constants for open argument 20253 */ 20254 const OPEN4_SHARE_ACCESS_READ = 0x00000001; 20255 const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; 20256 const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; 20258 const OPEN4_SHARE_DENY_NONE = 0x00000000; 20259 const OPEN4_SHARE_DENY_READ = 0x00000001; 20260 const OPEN4_SHARE_DENY_WRITE = 0x00000002; 20261 const OPEN4_SHARE_DENY_BOTH = 0x00000003; 20263 /* new flags for share_access field of OPEN4args */ 20264 const OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00; 20265 const OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE = 0x0000; 20266 const OPEN4_SHARE_ACCESS_WANT_READ_DELEG = 0x0100; 20267 const OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200; 20268 const OPEN4_SHARE_ACCESS_WANT_ANY_DELEG = 0x0300; 20269 const OPEN4_SHARE_ACCESS_WANT_NO_DELEG = 0x0400; 
20270 const OPEN4_SHARE_ACCESS_WANT_CANCEL = 0x0500; 20272 const 20273 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 20274 = 0x10000; 20276 const 20277 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 20278 = 0x20000; 20280 enum open_delegation_type4 { 20281 OPEN_DELEGATE_NONE = 0, 20282 OPEN_DELEGATE_READ = 1, 20283 OPEN_DELEGATE_WRITE = 2, 20284 OPEN_DELEGATE_NONE_EXT = 3 /* new to v4.1 */ 20285 }; 20287 enum open_claim_type4 { 20288 /* 20289 * Not a reclaim. 20290 */ 20291 CLAIM_NULL = 0, 20293 CLAIM_PREVIOUS = 1, 20294 CLAIM_DELEGATE_CUR = 2, 20295 CLAIM_DELEGATE_PREV = 3, 20297 /* 20298 * Not a reclaim. 20299 * 20300 * Like CLAIM_NULL, but object identified 20301 * by the current filehandle. 20302 */ 20303 CLAIM_FH = 4, /* new to v4.1 */ 20305 /* 20306 * Like CLAIM_DELEGATE_CUR, but object identified 20307 * by current filehandle. 20308 */ 20309 CLAIM_DELEG_CUR_FH = 5, /* new to v4.1 */ 20311 /* 20312 * Like CLAIM_DELEGATE_PREV, but object identified 20313 * by current filehandle. 20314 */ 20315 CLAIM_DELEG_PREV_FH = 6 /* new to v4.1 */ 20316 }; 20318 struct open_claim_delegate_cur4 { 20319 stateid4 delegate_stateid; 20320 component4 file; 20321 }; 20322 union open_claim4 switch (open_claim_type4 claim) { 20323 /* 20324 * No special rights to file. 20325 * Ordinary OPEN of the specified file. 20326 */ 20327 case CLAIM_NULL: 20328 /* CURRENT_FH: directory */ 20329 component4 file; 20330 /* 20331 * Right to the file established by an 20332 * open previous to server reboot. File 20333 * identified by filehandle obtained at 20334 * that time rather than by name. 20335 */ 20336 case CLAIM_PREVIOUS: 20337 /* CURRENT_FH: file being reclaimed */ 20338 open_delegation_type4 delegate_type; 20340 /* 20341 * Right to file based on a delegation 20342 * granted by the server. File is 20343 * specified by name. 20344 */ 20345 case CLAIM_DELEGATE_CUR: 20346 /* CURRENT_FH: directory */ 20347 open_claim_delegate_cur4 delegate_cur_info; 20349 /* 20350 * Right to file based on a delegation 20351 * granted to a previous boot instance 20352 * of the client. File is specified by name. 20353 */ 20354 case CLAIM_DELEGATE_PREV: 20355 /* CURRENT_FH: directory */ 20356 component4 file_delegate_prev; 20358 /* 20359 * Like CLAIM_NULL. No special rights 20360 * to file. Ordinary OPEN of the 20361 * specified file by current filehandle. 20362 */ 20363 case CLAIM_FH: /* new to v4.1 */ 20364 /* CURRENT_FH: regular file to open */ 20365 void; 20367 /* 20368 * Like CLAIM_DELEGATE_PREV. Right to file based on a 20369 * delegation granted to a previous boot 20370 * instance of the client. File is identified by 20371 * by filehandle. 20372 */ 20373 case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ 20374 /* CURRENT_FH: file being opened */ 20375 void; 20377 /* 20378 * Like CLAIM_DELEGATE_CUR. Right to file based on 20379 * a delegation granted by the server. 20380 * File is identified by filehandle. 20381 */ 20382 case CLAIM_DELEG_CUR_FH: /* new to v4.1 */ 20383 /* CURRENT_FH: file being opened */ 20384 stateid4 oc_delegate_stateid; 20386 }; 20388 /* 20389 * OPEN: Open a file, potentially receiving an open delegation 20390 */ 20391 struct OPEN4args { 20392 seqid4 seqid; 20393 uint32_t share_access; 20394 uint32_t share_deny; 20395 open_owner4 owner; 20396 openflag4 openhow; 20397 open_claim4 claim; 20398 }; 20400 18.16.2. 
RESULTS 20402 struct open_read_delegation4 { 20403 stateid4 stateid; /* Stateid for delegation*/ 20404 bool recall; /* Pre-recalled flag for 20405 delegations obtained 20406 by reclaim (CLAIM_PREVIOUS) */ 20408 nfsace4 permissions; /* Defines users who don't 20409 need an ACCESS call to 20410 open for read */ 20411 }; 20413 struct open_write_delegation4 { 20414 stateid4 stateid; /* Stateid for delegation */ 20415 bool recall; /* Pre-recalled flag for 20416 delegations obtained 20417 by reclaim 20418 (CLAIM_PREVIOUS) */ 20420 nfs_space_limit4 20421 space_limit; /* Defines condition that 20422 the client must check to 20423 determine whether the 20424 file needs to be flushed 20425 to the server on close. */ 20427 nfsace4 permissions; /* Defines users who don't 20428 need an ACCESS call as 20429 part of a delegated 20430 open. */ 20431 }; 20433 enum why_no_delegation4 { /* new to v4.1 */ 20434 WND4_NOT_WANTED = 0, 20435 WND4_CONTENTION = 1, 20436 WND4_RESOURCE = 2, 20437 WND4_NOT_SUPP_FTYPE = 3, 20438 WND4_WRITE_DELEG_NOT_SUPP_FTYPE = 4, 20439 WND4_NOT_SUPP_UPGRADE = 5, 20440 WND4_NOT_SUPP_DOWNGRADE = 6, 20441 WND4_CANCELED = 7, 20442 WND4_IS_DIR = 8 20443 }; 20445 union open_none_delegation4 /* new to v4.1 */ 20446 switch (why_no_delegation4 ond_why) { 20447 case WND4_CONTENTION: 20448 bool ond_server_will_push_deleg; 20449 case WND4_RESOURCE: 20450 bool ond_server_will_signal_avail; 20451 default: 20452 void; 20453 }; 20455 union open_delegation4 20456 switch (open_delegation_type4 delegation_type) { 20457 case OPEN_DELEGATE_NONE: 20458 void; 20459 case OPEN_DELEGATE_READ: 20460 open_read_delegation4 read; 20461 case OPEN_DELEGATE_WRITE: 20462 open_write_delegation4 write; 20463 case OPEN_DELEGATE_NONE_EXT: /* new to v4.1 */ 20464 open_none_delegation4 od_whynone; 20465 }; 20467 /* 20468 * Result flags 20469 */ 20471 /* Client must confirm open */ 20472 const OPEN4_RESULT_CONFIRM = 0x00000002; 20473 /* Type of file locking behavior at the server */ 20474 const OPEN4_RESULT_LOCKTYPE_POSIX = 0x00000004; 20475 /* Server will preserve file if removed while open */ 20476 const OPEN4_RESULT_PRESERVE_UNLINKED = 0x00000008; 20478 /* 20479 * Server may use CB_NOTIFY_LOCK on locks 20480 * derived from this open 20481 */ 20482 const OPEN4_RESULT_MAY_NOTIFY_LOCK = 0x00000020; 20484 struct OPEN4resok { 20485 stateid4 stateid; /* Stateid for open */ 20486 change_info4 cinfo; /* Directory Change Info */ 20487 uint32_t rflags; /* Result flags */ 20488 bitmap4 attrset; /* attribute set for create*/ 20489 open_delegation4 delegation; /* Info on any open 20490 delegation */ 20491 }; 20493 union OPEN4res switch (nfsstat4 status) { 20494 case NFS4_OK: 20495 /* New CURRENT_FH: opened file */ 20496 OPEN4resok resok4; 20497 default: 20498 void; 20499 }; 20501 18.16.3. DESCRIPTION 20503 The OPEN operation opens a regular file in a directory with the 20504 provided name or filehandle. OPEN can also create a file if a name 20505 is provided, and the client specifies it wants to create a file. 20506 Specification whether a file is be created or not, and the method of 20507 creation is via the openhow parameter. The openhow parameter 20508 consists of a switched union (data type opengflag4), which switches 20509 on the value of opentype (OPEN4_NOCREATE or OPEN4_CREATE). If 20510 OPEN4_CREATE is specified, this leads to another switched union (data 20511 type createhow4) that supports four cases of creation methods: 20512 UNCHECKED4, GUARDED4, EXCLUSIVE4, or EXCLUSIVE4_1. 
If opentype is 20513 OPEN4_CREATE, then the claim type in the claim field MUST be 20514 one of CLAIM_NULL, CLAIM_DELEGATE_CUR, or CLAIM_DELEGATE_PREV, 20515 because these claim methods include a component of a file name. 20517 Upon success (which might entail creation of a new file), the current 20518 filehandle is replaced by that of the created or existing object. 20520 If the current filehandle is a named attribute directory, OPEN will 20521 then create or open a named attribute file. Note that exclusive 20522 create of a named attribute is not supported. If the createmode is 20523 EXCLUSIVE4 or EXCLUSIVE4_1 and the current filehandle is a named 20524 attribute directory, the server will return NFS4ERR_INVAL. 20526 UNCHECKED4 means that the file should be created if a file of that 20527 name does not exist and encountering an existing regular file of that 20528 name is not an error. For this type of create, createattrs specifies 20529 the initial set of attributes for the file. The set of attributes 20530 may include any writable attribute valid for regular files. When an 20531 UNCHECKED4 create encounters an existing file, the attributes 20532 specified by createattrs are not used, except that when createattrs 20533 specifies the size attribute with a size of zero, the existing file 20534 is truncated. 20536 If GUARDED4 is specified, the server checks for the presence of a 20537 duplicate object by name before performing the create. If a 20538 duplicate exists, NFS4ERR_EXIST is returned. If the object does not 20539 exist, the request is performed as described for UNCHECKED4. 20541 For the UNCHECKED4 and GUARDED4 cases, where the operation is 20542 successful, the server will return to the client an attribute mask 20543 signifying which attributes were successfully set for the object. 20545 EXCLUSIVE4_1 and EXCLUSIVE4 specify that the server is to follow 20546 exclusive creation semantics, using the verifier to ensure exclusive 20547 creation of the target. The server should check for the presence of 20548 a duplicate object by name. If the object does not exist, the server 20549 creates the object and stores the verifier with the object. If the 20550 object does exist and the stored verifier matches the client-provided 20551 verifier, the server uses the existing object as the newly created 20552 object. If the stored verifier does not match, then an error of 20553 NFS4ERR_EXIST is returned. 20555 If using EXCLUSIVE4, and if the server uses attributes to store the 20556 exclusive create verifier, the server will signify which attributes 20557 it used by setting the appropriate bits in the attribute mask that is 20558 returned in the results. Unlike UNCHECKED4, GUARDED4, and 20559 EXCLUSIVE4_1, EXCLUSIVE4 does not support the setting of attributes 20560 at file creation, and after a successful OPEN via EXCLUSIVE4, the 20561 client MUST send a SETATTR to set attributes to a known state. 20563 In NFSv4.1, EXCLUSIVE4 has been deprecated in favor of EXCLUSIVE4_1. 20564 Unlike EXCLUSIVE4, attributes may be provided in the EXCLUSIVE4_1 20565 case, but because the server may use attributes of the target object 20566 to store the verifier, the set of allowable attributes may be smaller 20567 than the set of attributes SETATTR allows. The allowable attributes 20568 for EXCLUSIVE4_1 are indicated in the suppattr_exclcreat 20569 (Section 5.8.1.14) attribute. If the client attempts to set in 20570 cva_attrs an attribute that is not in suppattr_exclcreat, the server 20571 MUST return NFS4ERR_INVAL.
The response field, attrset indicates 20572 both which attributes the server set from cva_attrs, and which 20573 attributes the server used to store the verifier. As described in 20574 Section 18.16.4, the client can compare cva_attrs.attrmask with 20575 attrset to determine which attributes were used to store the 20576 verifier. 20578 With the addition of persistent sessions and pNFS, under some 20579 conditions EXCLUSIVE4 MUST NOT be used by the client or supported by 20580 the server. The following table summarizes the appropriate and 20581 mandated exclusive create methods for implementations of NFSv4.1: 20583 Required methods for exclusive create 20585 +----------------+-----------+---------------+----------------------+ 20586 | Persistent | Server | Server | Client Allowed | 20587 | Reply Cache | Supports | REQUIRED | | 20588 | Enabled | pNFS | | | 20589 +----------------+-----------+---------------+----------------------+ 20590 | no | no | EXCLUSIVE4_1 | EXCLUSIVE4_1 | 20591 | | | and | (SHOULD) or | 20592 | | | EXCLUSIVE4 | EXCLUSIVE4 (SHOULD | 20593 | | | | NOT) | 20594 | no | yes | EXCLUSIVE4_1 | EXCLUSIVE4_1 | 20595 | yes | no | GUARDED4 | GUARDED4 | 20596 | yes | yes | GUARDED4 | GUARDED4 | 20597 +----------------+-----------+---------------+----------------------+ 20599 Table 10 20601 If CREATE_SESSION4_FLAG_PERSIST is set in the results of 20602 CREATE_SESSION the reply cache is persistent (see Section 18.36). If 20603 the EXCHGID4_FLAG_USE_PNFS_MDS flag is set in the results from 20604 EXCHANGE_ID, the server is a pNFS server (see Section 18.35). If the 20605 client attempts to use EXCLUSIVE4 on a persistent session, or a 20606 session derived from a EXCHGID4_FLAG_USE_PNFS_MDS client ID, the 20607 server MUST return NFS4ERR_INVAL. 20609 With persistent sessions, exclusive create semantics are fully 20610 achievable via GUARDED4, and so EXCLUSIVE4 or EXCLUSIVE4_1 MUST NOT 20611 be used. When pNFS is being used, the layout_hint attribute might 20612 not be supported after the file is created. Only the EXCLUSIVE4_1 20613 and GUARDED methods of exclusive file creation allow the atomic 20614 setting of attributes. 20616 For the target directory, the server returns change_info4 information 20617 in cinfo. With the atomic field of the change_info4 data type, the 20618 server will indicate if the before and after change attributes were 20619 obtained atomically with respect to the link creation. 20621 The OPEN operation provides for Windows share reservation capability 20622 with the use of the share_access and share_deny fields of the OPEN 20623 arguments. The client specifies at OPEN the required share_access 20624 and share_deny modes. For clients that do not directly support 20625 SHAREs (i.e. UNIX), the expected deny value is DENY_NONE. In the 20626 case that there is a existing SHARE reservation that conflicts with 20627 the OPEN request, the server returns the error NFS4ERR_SHARE_DENIED. 20628 For additional discussion of SHARE semantics see Section 9.7. 20630 For each OPEN, the client provides a value for the owner field of the 20631 OPEN argument. The owner field is of data type open_owner4, and 20632 contains a field called clientid and a field called owner. The 20633 client can set the clientid field to any value and the server MUST 20634 ignore it. Instead the server MUST derive the client ID from the 20635 session ID of the SEQUENCE operation of the COMPOUND request. 
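The SHARE reservation check referred to above (and specified in Section 9.7) can be summarized by the following non-normative C sketch. The OPEN4_SHARE_* constant values are those defined in the ARGUMENTS section of this operation; the MODE_MASK name and the helper itself are illustrative only and not part of the protocol.

   #include <stdbool.h>
   #include <stdint.h>

   #define OPEN4_SHARE_ACCESS_READ   0x00000001
   #define OPEN4_SHARE_ACCESS_WRITE  0x00000002
   #define OPEN4_SHARE_ACCESS_BOTH   0x00000003

   /* Only the two low-order bits of share_access carry the access
    * mode; the WANT_* delegation flags occupy higher bits and are
    * ignored for share reservation purposes.  (This mask name is
    * illustrative, not a protocol constant.) */
   #define OPEN4_SHARE_ACCESS_MODE_MASK 0x00000003

   /* A new OPEN conflicts with an existing open when the requested
    * access intersects the held deny mode, or the requested deny
    * mode intersects the held access mode.  The READ and WRITE bits
    * have the same numeric values in the access and deny encodings,
    * which is what makes the bitwise test valid. */
   static bool
   share_conflict(uint32_t req_access, uint32_t req_deny,
                  uint32_t held_access, uint32_t held_deny)
   {
           req_access  &= OPEN4_SHARE_ACCESS_MODE_MASK;
           held_access &= OPEN4_SHARE_ACCESS_MODE_MASK;
           return (req_access & held_deny) != 0 ||
                  (req_deny & held_access) != 0;
   }

When such a conflict exists with an OPEN from a different open-owner, the server returns NFS4ERR_SHARE_DENIED as described above.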
20637 The seqid field of the request is not used in NFSv4.1, but it MAY be 20638 any value and the server MUST ignore it. 20640 In the case that the client is recovering state from a server 20641 failure, the claim field of the OPEN argument is used to signify that 20642 the request is meant to reclaim state previously held. 20644 The "claim" field of the OPEN argument is used to specify the file to 20645 be opened and the state information which the client claims to 20646 possess. There are seven claim types as follows: 20648 +----------------------+--------------------------------------------+ 20649 | open type | description | 20650 +----------------------+--------------------------------------------+ 20651 | CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN request | 20652 | | and there is no previous state associate | 20653 | | with the file for the client. With | 20654 | | CLAIM_NULL the file is identified by the | 20655 | | current filehandle and the specified | 20656 | | component name. With CLAIM_FH (new to | 20657 | | NFSv4.1) the file is identified by just | 20658 | | the current filehandle. | 20659 | CLAIM_PREVIOUS | The client is claiming basic OPEN state | 20660 | | for a file that was held previous to a | 20661 | | server restart. Generally used when a | 20662 | | server is returning persistent | 20663 | | filehandles; the client may not have the | 20664 | | file name to reclaim the OPEN. | 20665 | CLAIM_DELEGATE_CUR, | The client is claiming a delegation for | 20666 | CLAIM_DELEG_CUR_FH | OPEN as granted by the server. Generally | 20667 | | this is done as part of recalling a | 20668 | | delegation. With CLAIM_DELEGATE_CUR, the | 20669 | | file is identified by the current | 20670 | | filehandle and the specified component | 20671 | | name. With CLAIM_DELEG_CUR_FH (new to | 20672 | | NFSv4.1), the file is identified by just | 20673 | | the current filehandle. | 20674 | CLAIM_DELEGATE_PREV, | The client is claiming a delegation | 20675 | CLAIM_DELEG_PREV_FH | granted to a previous client instance; | 20676 | | used after the client restarts. The server | 20677 | | MAY support CLAIM_DELEGATE_PREV or | 20678 | | CLAIM_DELEG_PREV_FH (new to NFSv4.1). If | 20679 | | it does support either open type, | 20680 | | CREATE_SESSION MUST NOT remove the | 20681 | | client's delegation state, and the server | 20682 | | MUST support the DELEGPURGE operation. | 20683 +----------------------+--------------------------------------------+ 20685 For OPEN requests that reach the server during the grace period, the 20686 server returns an error of NFS4ERR_GRACE. The following claim types 20687 are exceptions: 20689 o OPEN requests specifying the claim type CLAIM_PREVIOUS are devoted 20690 to reclaiming opens after a server restart and are typically only 20691 valid during the grace period. 20693 o OPEN requests specifying the claim types CLAIM_DELEGATE_CUR and 20694 CLAIM_DELEG_CUR_FH are valid both during and after the grace 20695 period. Since the granting of the delegation that they are 20696 subordinate to assures that there is no conflict with locks to be 20697 reclaimed by other clients, the server need not return 20698 NFS4ERR_GRACE when these are received during the grace period. 20700 For any OPEN request, the server may return an open delegation, which 20701 allows further opens and closes to be handled locally on the client 20702 as described in Section 10.4. Note that delegation is up to the 20703 server to decide. 
The client should never assume that delegation 20704 will or will not be granted in a particular instance. It should 20705 always be prepared for either case. A partial exception is the 20706 reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed. 20707 In this case, delegation will always be granted, although the server 20708 may specify an immediate recall in the delegation structure. 20710 The rflags returned by a successful OPEN allow the server to return 20711 information governing how the open file is to be handled. 20713 o OPEN4_RESULT_CONFIRM is deprecated and MUST NOT be returned by an 20714 NFSv4.1 server. 20716 o OPEN4_RESULT_LOCKTYPE_POSIX indicates that the server's file locking 20717 behavior supports the complete set of POSIX locking techniques 20718 [23]. From this, the client can choose to manage file locking 20719 state in a way to handle a mismatch of file locking management. 20721 o OPEN4_RESULT_PRESERVE_UNLINKED indicates that the server will preserve 20722 the open file if the client (or any other client) removes the file 20723 as long as it is open. Furthermore, the server promises to 20724 preserve the file through the grace period after server restart, 20725 thereby giving the client the opportunity to reclaim its open. 20727 o OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt 20728 CB_NOTIFY_LOCK callbacks for locks on this file. This flag is a 20729 hint only, and may be safely ignored by the client. 20731 If the component is of zero length, NFS4ERR_INVAL will be returned. 20732 The component is also subject to the normal UTF-8, character support, 20733 and name checks. See Section 14.5 for further discussion. 20735 When an OPEN is done and the specified open-owner already has the 20736 resulting filehandle open, the result is to "OR" together the new 20737 share and deny status with the existing status. In this 20738 case, only a single CLOSE need be done, even though multiple OPENs 20739 were completed. When such an OPEN is done, checking of share 20740 reservations for the new OPEN proceeds normally, with no exception 20741 for the existing OPEN held by the same open-owner. In this case, the 20742 stateid returned has an "other" field that matches that of the 20743 previous open, while the "seqid" field is incremented to reflect the 20744 changed status due to the new open. 20746 If the underlying file system at the server is only accessible in a 20747 read-only mode and the OPEN request has specified ACCESS_WRITE or 20748 ACCESS_BOTH, the server will return NFS4ERR_ROFS to indicate a read- 20749 only file system. 20751 As with the CREATE operation, the server MUST derive the owner, owner 20752 ACE, group, or group ACE if any of the four attributes are required 20753 and supported by the server's file system. For an OPEN with the 20754 EXCLUSIVE4 createmode, the server has no choice, since such OPEN 20755 calls do not include the createattrs field. Conversely, if 20756 createattrs (UNCHECKED4 or GUARDED4) or cva_attrs (EXCLUSIVE4_1) is 20757 specified, and includes an owner, owner_group, or ACE that the 20758 principal in the RPC call's credentials does not have authorization 20759 to create files for, then the server may return NFS4ERR_PERM. 20761 In the case of an OPEN which specifies a size of zero (e.g. 20762 truncation) and the file has named attributes, the named attributes 20763 are left as is and are not removed.
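The treatment of a second OPEN by the same open-owner described above (ORing the new share and deny modes into the existing ones and incrementing the stateid seqid while keeping its "other" field) might be recorded by a server along the following non-normative lines. The structure and field names below are illustrative server-side bookkeeping, with stateid4 mirroring its XDR definition.

   #include <stdint.h>

   #define NFS4_OTHER_SIZE 12

   /* Mirrors the stateid4 XDR type. */
   typedef struct {
           uint32_t seqid;
           uint8_t  other[NFS4_OTHER_SIZE];
   } stateid4;

   /* Illustrative per-(open-owner, file) open state. */
   typedef struct {
           uint32_t share_access;   /* access mode bits, WANT_* excluded */
           uint32_t share_deny;
           stateid4 stateid;
   } open_state;

   /* When an open-owner re-OPENs a file it already has open, the
    * server ORs the new modes into the existing ones and returns a
    * stateid whose "other" field is unchanged and whose seqid is
    * incremented to reflect the changed status. */
   static void
   merge_open(open_state *existing, uint32_t new_access, uint32_t new_deny)
   {
           existing->share_access |= new_access;
           existing->share_deny   |= new_deny;
           existing->stateid.seqid += 1;   /* "other" stays the same */
   }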
20765 NFSv4.1 gives more precise control to clients over acquisition of 20766 delegations via the following new flags for the share_access field of 20767 OPEN4args: 20769 OPEN4_SHARE_ACCESS_WANT_READ_DELEG 20771 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 20773 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 20775 OPEN4_SHARE_ACCESS_WANT_NO_DELEG 20777 OPEN4_SHARE_ACCESS_WANT_CANCEL 20779 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 20781 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 20783 If (share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is not zero, 20784 then the client will have specified one and only one of: 20786 OPEN4_SHARE_ACCESS_WANT_READ_DELEG 20788 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 20789 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 20791 OPEN4_SHARE_ACCESS_WANT_NO_DELEG 20793 OPEN4_SHARE_ACCESS_WANT_CANCEL 20795 Otherwise, the client is indicating no desire for a delegation and the 20796 server MAY or may not return a delegation in the OPEN response. 20798 If the server supports the new _WANT_ flags and the client sends one 20799 or more of the new flags, then in the event the server does not 20800 return a delegation, it MUST return a delegation type of 20801 OPEN_DELEGATE_NONE_EXT. The field od_whynone in the reply indicates 20802 why no delegation was returned and will be one of: 20804 WND4_NOT_WANTED The client specified 20805 OPEN4_SHARE_ACCESS_WANT_NO_DELEG. 20807 WND4_CONTENTION There is a conflicting delegation or open on the 20808 file. 20810 WND4_RESOURCE Resource limitations prevent the server from granting 20811 a delegation. 20813 WND4_NOT_SUPP_FTYPE The server does not support delegations on this 20814 file type. 20816 WND4_WRITE_DELEG_NOT_SUPP_FTYPE The server does not support write 20817 delegations on this file type. 20819 WND4_NOT_SUPP_UPGRADE The server does not support atomic upgrade of 20820 a read delegation to a write delegation. 20822 WND4_NOT_SUPP_DOWNGRADE The server does not support atomic downgrade 20823 of a write delegation to a read delegation. 20825 WND4_CANCELED The client specified OPEN4_SHARE_ACCESS_WANT_CANCEL 20826 and now any "want" for this file object is cancelled. 20828 WND4_IS_DIR The specified file object is a directory, and the 20829 operation is OPEN or WANT_DELEGATION, which do not support 20830 delegations on directories. 20832 OPEN4_SHARE_ACCESS_WANT_READ_DELEG, 20833 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG, or 20834 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG mean, respectively, that the client wants 20835 a read, write, or any delegation regardless of which of 20836 OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or 20837 OPEN4_SHARE_ACCESS_BOTH is set. If the client has a read delegation 20838 on a file, and requests a write delegation, then the client is 20839 requesting atomic upgrade of its read delegation to a write 20840 delegation. If the client has a write delegation on a file, and 20841 requests a read delegation, then the client is requesting atomic 20842 downgrade to a read delegation. A server MAY support atomic upgrade 20843 or downgrade. If it does, then a returned delegation_type of 20844 OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE that is different from the 20845 delegation type the client currently has indicates a successful 20846 upgrade or downgrade. If it does not support atomic delegation 20847 upgrade or downgrade, then od_whynone will be WND4_NOT_SUPP_UPGRADE 20848 or WND4_NOT_SUPP_DOWNGRADE. 20850 OPEN4_SHARE_ACCESS_WANT_NO_DELEG means the client wants no 20851 delegation.
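Note that the WANT_ values listed above are carried as a code in the OPEN4_SHARE_ACCESS_WANT_DELEG_MASK field of share_access rather than as independent bits, which is why at most one of them can be present. A non-normative C sketch of the corresponding validity check follows; the constants are those defined in the ARGUMENTS section of this operation, and the helper name is illustrative only.

   #include <stdbool.h>
   #include <stdint.h>

   #define OPEN4_SHARE_ACCESS_WANT_DELEG_MASK     0xFF00
   #define OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE  0x0000
   #define OPEN4_SHARE_ACCESS_WANT_READ_DELEG     0x0100
   #define OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG    0x0200
   #define OPEN4_SHARE_ACCESS_WANT_ANY_DELEG      0x0300
   #define OPEN4_SHARE_ACCESS_WANT_NO_DELEG       0x0400
   #define OPEN4_SHARE_ACCESS_WANT_CANCEL         0x0500

   /* The delegation "want" is a code occupying the 0xFF00 field of
    * share_access.  A request is well formed with respect to that
    * field when it holds one of the defined codes. */
   static bool
   valid_deleg_want(uint32_t share_access)
   {
           uint32_t want = share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK;

           switch (want) {
           case OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE:
           case OPEN4_SHARE_ACCESS_WANT_READ_DELEG:
           case OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG:
           case OPEN4_SHARE_ACCESS_WANT_ANY_DELEG:
           case OPEN4_SHARE_ACCESS_WANT_NO_DELEG:
           case OPEN4_SHARE_ACCESS_WANT_CANCEL:
                   return true;
           default:
                   return false;
           }
   }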
20853 OPEN4_SHARE_ACCESS_WANT_CANCEL means the client wants no delegation 20854 and wants to cancel any previously registered "want" for a 20855 delegation. 20857 The client may set one or both of 20858 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and 20859 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED. However, they 20860 will have no effect unless one of the following is set: 20862 o OPEN4_SHARE_ACCESS_WANT_READ_DELEG 20864 o OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 20866 o OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 20868 If the client specifies 20869 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL, then it wishes 20870 to register a "want" for a delegation, in the event the OPEN results 20871 do not include a delegation. If so, and the server denies the 20872 delegation due to insufficient resources, the server MAY later inform 20873 the client, via the CB_RECALLABLE_OBJ_AVAIL operation, that the 20874 resource limitation condition has eased. The server will tell the 20875 client that it intends to send a future CB_RECALLABLE_OBJ_AVAIL 20876 operation by setting delegation_type in the results to 20877 OPEN_DELEGATE_NONE_EXT, ond_why to WND4_RESOURCE, and 20878 ond_server_will_signal_avail set to TRUE. If 20879 ond_server_will_signal_avail is set to TRUE, the server MUST later 20880 send a CB_RECALLABLE_OBJ_AVAIL operation. 20882 If the client specifies 20883 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED, then it wishes 20884 to register a "want" for a delegation, in the event the OPEN results 20885 do not include a delegation. If so, and the server denies the 20886 delegation due to contention, the server MAY later inform the client, 20887 via the CB_PUSH_DELEG operation, that the contention condition has 20888 eased. The server will tell the client that it intends to send a 20889 future CB_PUSH_DELEG operation by setting delegation_type in the 20890 results to OPEN_DELEGATE_NONE_EXT, ond_why to WND4_CONTENTION, and 20891 ond_server_will_push_deleg to TRUE. If ond_server_will_push_deleg is 20892 TRUE, the server MUST later send a CB_PUSH_DELEG operation. 20894 If the client has previously registered a want for a delegation on a 20895 file, and then sends a request to register a want for a delegation on 20896 the same file, the server MUST return the error 20897 NFS4ERR_DELEG_ALREADY_WANTED. If the client wishes to register a 20898 different type of delegation want for the same file, it MUST cancel 20899 the existing delegation WANT. 20901 18.16.4. IMPLEMENTATION 20903 In the absence of a persistent session, the client invokes exclusive 20904 create by setting the how parameter to EXCLUSIVE4 or EXCLUSIVE4_1. 20905 In these cases, the client provides a verifier that can reasonably be 20906 expected to be unique. A combination of a client identifier, perhaps 20907 the client network address, and a unique number generated by the 20908 client, perhaps the RPC transaction identifier, may be appropriate. 20910 If the object does not exist, the server creates the object and 20911 stores the verifier in stable storage. For file systems that do not 20912 provide a mechanism for the storage of arbitrary file attributes, the 20913 server may use one or more elements of the object metadata to store 20914 the verifier. The verifier MUST be stored in stable storage to 20915 prevent erroneous failure on retransmission of the request. It is 20916 assumed that an exclusive create is being performed because exclusive 20917 semantics are critical to the application.
Because of the expected 20918 usage, exclusive CREATE does not rely solely on the server's reply 20919 cache for storage of the verifier. A nonpersistent reply cache does 20920 not survive a crash, and the session and reply cache may be deleted 20921 after a network partition that exceeds the lease time, thus opening 20922 failure windows. 20924 An NFSv4.1 server SHOULD NOT store the verifier in any of the file's 20925 RECOMMENDED or REQUIRED attributes. If it does, the server SHOULD 20926 use time_modify_set or time_access_set to store the verifier. The 20927 server SHOULD NOT store the verifier in the following attributes: acl 20928 (it is desirable for access control to be established at creation), 20929 dacl (ditto), mode (ditto), owner (ditto), owner_group (ditto), 20930 retentevt_set (it may be desired to establish retention at creation), 20931 retention_hold (ditto), retention_set (ditto), sacl (it is desirable 20932 for auditing control to be established at creation), size (on some 20933 servers, size may have a limited range of values), mode_set_masked 20934 (as with mode), and time_create (a meaningful file creation time should 20935 be set when the file is created). Another alternative for the server 20936 is to use a named attribute to store the verifier. 20938 Because the EXCLUSIVE4 create method does not specify initial 20939 attributes, when processing an EXCLUSIVE4 create, the server 20941 o SHOULD set the owner of the file to that corresponding to the 20942 credential of the request's RPC header. 20944 o SHOULD NOT leave the file's access control to anyone but the owner 20945 of the file. 20947 If the server cannot support exclusive create semantics, possibly 20948 because of the requirement to commit the verifier to stable storage, 20949 it should fail the OPEN request with the error NFS4ERR_NOTSUPP. 20951 During an exclusive CREATE request, if the object already exists, the 20952 server reconstructs the object's verifier and compares it with the 20953 verifier in the request. If they match, the server treats the 20954 request as a success. The request is presumed to be a duplicate of 20955 an earlier, successful request for which the reply was lost and that 20956 the server duplicate request cache mechanism did not detect. If the 20957 verifiers do not match, the request is rejected with the status 20958 NFS4ERR_EXIST. 20960 After the client has performed a successful exclusive create, the 20961 attrset response indicates which attributes were used to store the 20962 verifier. If EXCLUSIVE4 was used, the attributes set in attrset were 20963 used for the verifier. If EXCLUSIVE4_1 was used, the client 20964 determines the attributes used for the verifier by comparing attrset 20965 with cva_attrs.attrmask; any bits set in the former but not the 20966 latter identify the attributes used to store the verifier. The client 20967 MUST immediately send a SETATTR to set attributes used to store the 20968 verifier. Until it does so, the attributes used to store the 20969 verifier cannot be relied upon. The subsequent SETATTR MUST NOT 20970 occur in the same COMPOUND request as the OPEN. 20972 Unless a persistent session is used, use of the GUARDED4 attribute 20973 does not provide exactly-once semantics. In particular, if a reply 20974 is lost and the server does not detect the retransmission of the 20975 request, the operation can fail with NFS4ERR_EXIST, even though the 20976 create was performed successfully.
The client would use this 20977 behavior in the case that the application has not requested an 20978 exclusive create but has asked to have the file truncated when the 20979 file is opened. In the case of the client timing out and 20980 retransmitting the create request, the client can use GUARDED4 to 20981 prevent a sequence such as create, write, create 20982 (retransmitted) from occurring. 20984 For SHARE reservations, the client MUST specify a value for 20985 share_access that is one of READ, WRITE, or BOTH. For share_deny, 20986 the client MUST specify one of NONE, READ, WRITE, or BOTH. If the 20987 client fails to do this, the server MUST return NFS4ERR_INVAL. 20989 Based on the share_access value (READ, WRITE, or BOTH), the client 20990 should check that the requester has the proper access rights to 20991 perform the specified operation. This would generally be the result 20992 of applying the ACL access rules to the file for the current 20993 requester. However, just as with the ACCESS operation, the client 20994 should not attempt to second-guess the server's decisions, as access 20995 rights may change and may be subject to server administrative 20996 controls outside the ACL framework. If the requester is not 20997 authorized to READ or WRITE (depending on the share_access value), 20998 the server MUST return NFS4ERR_ACCESS. 21000 Note that if the client ID was not created with 21001 EXCHGID4_FLAG_BIND_PRINC_STATEID set in the reply to EXCHANGE_ID, 21002 then the server MUST NOT impose any requirement that READs and WRITEs 21003 sent for an open file have the same credentials as the OPEN itself, 21004 and the server is REQUIRED to perform access checking on the READs 21005 and WRITEs themselves. Otherwise, if the reply to EXCHANGE_ID did 21006 have EXCHGID4_FLAG_BIND_PRINC_STATEID set, then with one exception, 21007 the credentials used in the OPEN request MUST match those used in the 21008 READs and WRITEs, and the stateids in the READs and WRITEs MUST 21009 match, or be derived from the stateid from the reply to OPEN. The 21010 exception is if SP4_SSV or SP4_MACH_CRED state protection is used, 21011 and the spo_must_allow result of EXCHANGE_ID includes the READ and/or 21012 WRITE operations. In that case, the machine or SSV credential will 21013 be allowed to issue READ and/or WRITE. See Section 18.35. 21015 If the component provided to OPEN is a symbolic link, the error 21016 NFS4ERR_SYMLINK will be returned to the client, while if it is a 21017 directory, the error NFS4ERR_ISDIR is returned. If the component is neither of 21018 those types and is not an ordinary file, the error NFS4ERR_WRONG_TYPE is 21019 returned. If the current filehandle is not a directory, the error 21020 NFS4ERR_NOTDIR will be returned. 21022 The use of the OPEN4_RESULT_PRESERVE_UNLINKED result flag allows a 21023 client to avoid the common implementation practice of renaming an open 21024 file to ".nfs" after it removes the file. After the 21025 server returns OPEN4_RESULT_PRESERVE_UNLINKED, if a client sends a 21026 REMOVE operation that would reduce the file's link count to zero, the 21027 server SHOULD report a value of zero for the numlinks attribute on 21028 the file. 21030 If another client has a delegation of the file being opened that 21031 conflicts with the open being done (sometimes depending on the 21032 share_access or share_deny value specified), the delegation(s) MUST 21033 be recalled, and the operation cannot proceed until each such 21034 delegation is returned or revoked.
Except where this happens very 21035 quickly, one or more NFS4ERR_DELAY errors will be returned to 21036 requests made while delegation remains outstanding. In the case of a 21037 write delegation, any open by a different client will conflict, while 21038 for a read delegation only opens with one of the following 21039 characteristics will be considered conflicting: 21041 o The value of share_access includes the bit 21042 OPEN4_SHARE_ACCESS_WRITE. 21044 o The value of share_deny specifies READ or BOTH. 21046 o OPEN4_CREATE is specified together with UNCHECKED4, the size 21047 attribute is specified as zero (for truncation) and an existing 21048 file is truncated. 21050 If OPEN4_CREATE is specified and the file does not exist and the 21051 current filehandle designates a directory for which another client 21052 holds a directory delegation, then, unless the delegation is such 21053 that the situation can be resolved by sending a notification, the 21054 delegation MUST be recalled, and the operation cannot proceed until 21055 the delegation is returned or revoked. Except where this happens 21056 very quickly, one or more NFS4ERR_DELAY errors will be returned to 21057 requests made while delegation remains outstanding. 21059 If OPEN4_CREATE is specified and the file does not exist and the 21060 current filehandle designates a directory for which one or more 21061 directory delegations exist, then, when those delegations request 21062 such notifications, NOTIFY4_ADD_ENTRY will be generated as a result 21063 of this operation. 21065 18.16.4.1. WARNING TO CLIENT IMPLEMENTORS 21067 OPEN resembles LOOKUP in that it generates a filehandle for the 21068 client to use. Unlike LOOKUP though, OPEN creates server state on 21069 the filehandle. In normal circumstances, the client can only release 21070 this state with a CLOSE operation. CLOSE uses the current filehandle 21071 to determine which file to close. Therefore the client MUST follow 21072 every OPEN operation with a GETFH operation in the same COMPOUND 21073 procedure. This will supply the client with the filehandle such that 21074 CLOSE can be used appropriately. 21076 Simply waiting for the lease on the file to expire is insufficient 21077 because the server may maintain the state indefinitely as long as 21078 another client does not attempt to make a conflicting access to the 21079 same file. 21081 See also Section 2.10.6.4. 21083 18.17. Operation 19: OPENATTR - Open Named Attribute Directory 21085 18.17.1. ARGUMENTS 21087 struct OPENATTR4args { 21088 /* CURRENT_FH: object */ 21089 bool createdir; 21090 }; 21092 18.17.2. RESULTS 21094 struct OPENATTR4res { 21095 /* 21096 * If status is NFS4_OK, 21097 * new CURRENT_FH: named attribute 21098 * directory 21099 */ 21100 nfsstat4 status; 21101 }; 21103 18.17.3. DESCRIPTION 21105 The OPENATTR operation is used to obtain the filehandle of the named 21106 attribute directory associated with the current filehandle. The 21107 result of the OPENATTR will be a filehandle to an object of type 21108 NF4ATTRDIR. From this filehandle, READDIR and LOOKUP operations can 21109 be used to obtain filehandles for the various named attributes 21110 associated with the original file system object. Filehandles 21111 returned within the named attribute directory will designate objects 21112 of type of NF4NAMEDATTR. 21114 The createdir argument allows the client to signify if a named 21115 attribute directory should be created as a result of the OPENATTR 21116 operation. 
Some clients may use the OPENATTR operation with a value 21117 of FALSE for createdir to determine if any named attributes exist for 21118 the object. If none exist, then NFS4ERR_NOENT will be returned. If 21119 createdir has a value of TRUE and no named attribute directory 21120 exists, one is created and its filehandle becomes the current 21121 filehandle. On the other hand, if createdir has a value of TRUE and 21122 the named attribute directory already exists, no error results and 21123 the filehandle of the existing directory becomes the current 21124 filehandle. The creation of a named attribute directory assumes that 21125 the server has implemented named attribute support in this fashion 21126 and is not required to do so by this definition. 21128 If the current file handle designates an object of type NF4NAMEDATTR 21129 (a named attribute) or NF4ATTRDIR (a named attribute directory), an 21130 error of NFS4ERR_WRONG_TYPE is returned to the client. Name 21131 attributes or a named attribute directory may have their own named 21132 attributes. 21134 18.17.4. IMPLEMENTATION 21136 If the server does not support named attributes for the current 21137 filehandle, an error of NFS4ERR_NOTSUPP will be returned to the 21138 client. 21140 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access 21142 18.18.1. ARGUMENTS 21144 struct OPEN_DOWNGRADE4args { 21145 /* CURRENT_FH: opened file */ 21146 stateid4 open_stateid; 21147 seqid4 seqid; 21148 uint32_t share_access; 21149 uint32_t share_deny; 21150 }; 21152 18.18.2. RESULTS 21154 struct OPEN_DOWNGRADE4resok { 21155 stateid4 open_stateid; 21156 }; 21158 union OPEN_DOWNGRADE4res switch(nfsstat4 status) { 21159 case NFS4_OK: 21160 OPEN_DOWNGRADE4resok resok4; 21161 default: 21162 void; 21163 }; 21165 18.18.3. DESCRIPTION 21167 This operation is used to adjust the access and deny states for a 21168 given open. This is necessary when a given open-owner opens the same 21169 file multiple times with different access and deny values. In this 21170 situation, a close of one of the opens may change the appropriate 21171 share_access and share_deny flags to remove bits associated with 21172 opens no longer in effect. 21174 Valid values for the share_access field are: OPEN4_SHARE_ACCESS_READ, 21175 OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH. If the client 21176 specifies other values, the server MUST reply with NFS4ERR_INVAL. 21178 Valid values for the share_deny field are: OPEN4_SHARE_DENY_NONE, 21179 OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or 21180 OPEN4_SHARE_DENY_BOTH. If the client specifies other values, the 21181 server MUST reply with NFS4ERR_INVAL. 21183 After checking for valid values of share_access and share_deny, the 21184 server replaces the current access and deny modes on the file with 21185 share_access and share_deny subject to the following constraints: 21187 o The bits in share_access SHOULD equal the union of the 21188 share_access bits (not including OPEN4_SHARE_WANT_* bits) 21189 specified for some subset of the OPENs in effect for the current 21190 open-owner on the current file. 21192 o The bits in share_deny SHOULD equal the union of the share_deny 21193 bits specified for some subset of the OPENs in effect for the 21194 current open-owner on the current file. 21196 If the above constraints are not respected, the server SHOULD return 21197 the error NFS4ERR_INVAL. 
Since share_access and share_deny bits 21198 should be subsets of those already granted, short of a defect in the 21199 client or server implementation, it is not possible for the 21200 OPEN_DOWNGRADE request to be denied because of conflicting share 21201 reservations. 21203 The seqid argument is not used in NFSv4.1, MAY be any value, and MUST 21204 be ignored by the server. 21206 On success, the current filehandle retains its value. 21208 18.18.4. IMPLEMENTATION 21210 An OPEN_DOWNGRADE operation may make read delegations grantable where 21211 they were not previously. Servers may choose to respond immediately 21212 if there are pending delegation want requests or may respond to the 21213 situation at a later time. 21215 18.19. Operation 22: PUTFH - Set Current Filehandle 21217 18.19.1. ARGUMENTS 21219 struct PUTFH4args { 21220 nfs_fh4 object; 21221 }; 21223 18.19.2. RESULTS 21225 struct PUTFH4res { 21226 /* 21227 * If status is NFS4_OK, 21228 * new CURRENT_FH: argument to PUTFH 21229 */ 21230 nfsstat4 status; 21231 }; 21233 18.19.3. DESCRIPTION 21235 Replaces the current filehandle with the filehandle provided as an 21236 argument. Clears the current stateid. 21238 If the security mechanism used by the requester does not meet the 21239 requirements of the filehandle provided to this operation, the server 21240 MUST return NFS4ERR_WRONGSEC. 21242 See Section 16.2.3.1.1 for more details on the current filehandle. 21244 See Section 16.2.3.1.2 for more details on the current stateid. 21246 18.19.4. IMPLEMENTATION 21248 Commonly used as the second operator (after SEQUENCE) in a COMPOUND 21249 request to set the context for following operations. 21251 18.20. Operation 23: PUTPUBFH - Set Public Filehandle 21253 18.20.1. ARGUMENT 21255 void; 21257 18.20.2. RESULT 21259 struct PUTPUBFH4res { 21260 /* 21261 * If status is NFS4_OK, 21262 * new CURRENT_FH: public fh 21263 */ 21264 nfsstat4 status; 21265 }; 21267 18.20.3. DESCRIPTION 21269 Replaces the current filehandle with the filehandle that represents 21270 the public filehandle of the server's name space. This filehandle 21271 may be different from the "root" filehandle which may be associated 21272 with some other directory on the server. 21274 PUTPUBFH also clears the current stateid. 21276 The public filehandle represents the concepts embodied in RFC2054 21277 [41], RFC2055 [42], and RFC2224 [52]. The intent for NFSv4.1 is that 21278 the public filehandle (represented by the PUTPUBFH operation) be used 21279 as a method of providing WebNFS server compatibility with NFSv3. 21281 The public filehandle and the root filehandle (represented by the 21282 PUTROOTFH operation) SHOULD be equivalent. If the public and root 21283 filehandles are not equivalent, then the directory corresponding to 21284 the public filehandle MUST be a descendant of the directory 21285 corresponding to the root filehandle. 21287 See Section 16.2.3.1.1 for more details on the current filehandle. 21289 See Section 16.2.3.1.2 for more details on the current stateid. 21291 18.20.4. IMPLEMENTATION 21293 Used as the second operator (after SEQUENCE) in an NFS request to set 21294 the context for file accessing operations that follow in the same 21295 COMPOUND request. 21297 With the NFSv3 public filehandle, the client is able to specify 21298 whether the path name provided in the LOOKUP should be evaluated as 21299 either an absolute path relative to the server's root or relative to 21300 the public filehandle. 
RFC2224 [52] contains further discussion of 21301 the functionality. With NFSv4.1, that type of specification is not 21302 directly available in the LOOKUP operation. The reason for this is 21303 that the component separators needed to specify absolute vs. 21305 relative are not allowed in NFSv4. Therefore, the client is 21306 responsible for constructing its request such that either 21307 PUTROOTFH or PUTPUBFH is used to signify absolute or relative 21308 evaluation of an NFS URL, respectively. 21310 Note that there are warnings mentioned in RFC2224 [52] with respect 21311 to the use of absolute evaluation and the restrictions the server may 21312 place on that evaluation with respect to how much of its namespace 21313 has been made available. These same warnings apply to NFSv4.1. It 21314 is likely, therefore, that because of server implementation details, 21315 an NFSv3 absolute public filehandle lookup may behave differently 21316 than an NFSv4.1 absolute resolution. 21318 There is a form of security negotiation as described in RFC2755 [53] 21319 that uses the public filehandle and an overloading of the pathname. 21320 This method is not available with NFSv4.1, as filehandles are not 21321 overloaded with special meaning and therefore do not provide the same 21322 framework as NFSv3. Clients should therefore use the security 21323 negotiation mechanisms described in Section 2.6. 21325 18.21. Operation 24: PUTROOTFH - Set Root Filehandle 21327 18.21.1. ARGUMENTS 21329 void; 21331 18.21.2. RESULTS 21333 struct PUTROOTFH4res { 21334 /* 21335 * If status is NFS4_OK, 21336 * new CURRENT_FH: root fh 21337 */ 21338 nfsstat4 status; 21339 }; 21341 18.21.3. DESCRIPTION 21343 Replaces the current filehandle with the filehandle that represents 21344 the root of the server's name space. From this filehandle, a LOOKUP 21345 operation can locate any other filehandle on the server. This 21346 filehandle may be different from the "public" filehandle, which may be 21347 associated with some other directory on the server. 21349 PUTROOTFH also clears the current stateid. 21351 See Section 16.2.3.1.1 for more details on the current filehandle. 21353 See Section 16.2.3.1.2 for more details on the current stateid. 21355 18.21.4. IMPLEMENTATION 21357 Commonly used as the second operator (after SEQUENCE) in an NFS 21358 request to set the context for file accessing operations that follow 21359 in the same COMPOUND request. 21361 18.22. Operation 25: READ - Read from File 21363 18.22.1. ARGUMENTS 21365 struct READ4args { 21366 /* CURRENT_FH: file */ 21367 stateid4 stateid; 21368 offset4 offset; 21369 count4 count; 21370 }; 21372 18.22.2. RESULTS 21374 struct READ4resok { 21375 bool eof; 21376 opaque data<>; 21377 }; 21379 union READ4res switch (nfsstat4 status) { 21380 case NFS4_OK: 21381 READ4resok resok4; 21382 default: 21383 void; 21384 }; 21386 18.22.3. DESCRIPTION 21388 The READ operation reads data from the regular file identified by the 21389 current filehandle. 21391 The client provides an offset of where the READ is to start and a 21392 count of how many bytes are to be read. An offset of 0 (zero) means 21393 to read data starting at the beginning of the file. If offset is 21394 greater than or equal to the size of the file, the status NFS4_OK 21395 is returned with a data length set to 0 (zero) and eof is set to 21396 TRUE. The READ is subject to access permissions checking.
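The offset/count/eof rules above and in the following paragraphs can be summarized by a short non-normative sketch. It computes the largest number of bytes a server could return for a READ and the matching eof value; a server remains free to return fewer bytes, in which case eof would be FALSE unless the shortened read still reaches end-of-file. The helper name is illustrative, and the integer widths follow the offset4 (uint64) and count4 (uint32) XDR types.

   #include <stdbool.h>
   #include <stdint.h>

   /* Given the file size and the READ arguments, compute the maximum
    * number of bytes the server could return and whether eof should
    * be TRUE for that reply. */
   static uint64_t
   read_result(uint64_t filesize, uint64_t offset, uint32_t count,
               bool *eof)
   {
           uint64_t avail, len;

           if (offset >= filesize) {
                   *eof = true;          /* includes the empty-file case */
                   return 0;
           }
           avail = filesize - offset;
           len = (count < avail) ? count : avail;

           /* eof is TRUE when the read ends at or beyond end-of-file. */
           *eof = (offset + len >= filesize);
           return len;
   }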
21398 If the client specifies a count value of 0 (zero), the READ succeeds 21399 and returns 0 (zero) bytes of data, again subject to access 21400 permissions checking. The server may choose to return fewer bytes 21401 than specified by the client. The client needs to check for this 21402 condition and handle the condition appropriately. 21404 Except when special stateids are used, the stateid value for a READ 21405 request represents a value returned from a previous byte-range lock 21406 or share reservation request or the stateid associated with a 21407 delegation. The stateid identifies the associated owners, if any, and 21408 is used by the server to verify that the associated locks are still 21409 valid (e.g. have not been revoked). 21411 If the read ended at the end-of-file (formally, in a correctly formed 21412 READ request, if offset + count is equal to the size of the file), or 21413 the read request extends beyond the size of the file (if offset + 21414 count is greater than the size of the file), eof is returned as TRUE; 21415 otherwise it is FALSE. A successful READ of an empty file will 21416 always return eof as TRUE. 21418 If the current filehandle is not an ordinary file, an error will be 21419 returned to the client. In the case that the current filehandle 21420 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 21421 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 21422 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 21424 For a READ with a stateid value of all bits 0, the server MAY allow 21425 the READ to be serviced subject to mandatory file locks or the 21426 current share deny modes for the file. For a READ with a stateid 21427 value of all bits 1, the server MAY allow READ operations to bypass 21428 locking checks at the server. 21430 On success, the current filehandle retains its value. 21432 18.22.4. IMPLEMENTATION 21434 It is possible for the server to return fewer than count bytes of 21435 data. If the server returns less than the count requested and eof is 21436 set to FALSE, the client should send another READ to get the 21437 remaining data. A server may return less data than requested under 21438 several circumstances. The file may have been truncated by another 21439 client or perhaps on the server itself, changing the file size from 21440 what the requesting client believes to be the case. This would 21441 reduce the actual amount of data available to the client. It is 21442 possible that the server may back off the transfer size and reduce 21443 the read request return. Server resource exhaustion may also occur, 21444 necessitating a smaller read return. 21446 If mandatory file locking is in effect for the file, and if the 21447 region corresponding to the data to be read from the file is write-locked 21448 by an owner not associated with the stateid, the server will return the 21449 NFS4ERR_LOCKED error. The client should try to get the appropriate 21450 read byte-range lock via the LOCK operation before re-attempting the 21451 READ. When the READ completes, the client should release the byte- 21452 range lock via LOCKU. 21454 If another client has a write delegation for the file being read, the 21455 delegation must be recalled, and the operation cannot proceed until 21456 that delegation is returned or revoked. Except where this happens 21457 very quickly, one or more NFS4ERR_DELAY errors will be returned to 21458 requests made while the delegation remains outstanding.
Normally, 21459 delegations will not be recalled as a result of a READ operation 21460 since the recall will occur as a result of an earlier OPEN. However, 21461 since it is possible for a READ to be done with a special stateid, 21462 the server needs to check for this case even though the client should 21463 have done an OPEN previously. 21465 18.23. Operation 26: READDIR - Read Directory 21467 18.23.1. ARGUMENTS 21469 struct READDIR4args { 21470 /* CURRENT_FH: directory */ 21471 nfs_cookie4 cookie; 21472 verifier4 cookieverf; 21473 count4 dircount; 21474 count4 maxcount; 21475 bitmap4 attr_request; 21476 }; 21478 18.23.2. RESULTS 21480 struct entry4 { 21481 nfs_cookie4 cookie; 21482 component4 name; 21483 fattr4 attrs; 21484 entry4 *nextentry; 21485 }; 21487 struct dirlist4 { 21488 entry4 *entries; 21489 bool eof; 21490 }; 21492 struct READDIR4resok { 21493 verifier4 cookieverf; 21494 dirlist4 reply; 21495 }; 21497 union READDIR4res switch (nfsstat4 status) { 21498 case NFS4_OK: 21499 READDIR4resok resok4; 21500 default: 21501 void; 21502 }; 21504 18.23.3. DESCRIPTION 21506 The READDIR operation retrieves a variable number of entries from a 21507 file system directory and returns client requested attributes for 21508 each entry along with information to allow the client to request 21509 additional directory entries in a subsequent READDIR. 21511 The arguments contain a cookie value that represents where the 21512 READDIR should start within the directory. A value of 0 (zero) for 21513 the cookie is used to start reading at the beginning of the 21514 directory. For subsequent READDIR requests, the client specifies a 21515 cookie value that is provided by the server on a previous READDIR 21516 request. 21518 The request's cookieverf field should be set to 0 (zero) when the 21519 request's cookie field is 0 (zero) (first directory read). On 21520 subsequent requests, the cookieverf field must match the cookieverf 21521 returned by the READDIR in which the cookie was acquired. If the 21522 server determines that the cookieverf is no longer valid for the 21523 directory, the error NFS4ERR_NOT_SAME must be returned. 21525 The dircount field of the request is a hint of the maximum number of 21526 bytes of directory information that should be returned. This value 21527 represents the total length of the names of the directory entries and 21528 the cookie value for these entries. This length represents the XDR 21529 encoding of the data (names and cookies) and not the length in the 21530 native format of the server. 21532 The maxcount field of the request represents the maximum total size 21533 of all of the data being returned within the READDIR4resok structure 21534 and includes the XDR overhead. The server MAY return less data. If 21535 the server is unable to return a single directory entry within the 21536 maxcount limit, the error NFS4ERR_TOOSMALL MUST be returned to the 21537 client. 21539 Finally, the request's attr_request field represents the list of 21540 attributes to be returned for each directory entry supplied by the 21541 server. 21543 A successful reply consists of a list of directory entries. Each of 21544 these entries contains the name of the directory entry, a cookie 21545 value for that entry, and the associated attributes as requested. 21546 The "eof" flag has a value of TRUE if there are no more entries in 21547 the directory. 21549 The cookie value is only meaningful to the server and is used as a 21550 cursor for the directory entry. 
As mentioned, this cookie is used by 21551 the client for subsequent READDIR operations so that it may continue 21552 reading a directory. The cookie is similar in concept to a READ 21553 offset but MUST NOT be interpreted as such by the client. Ideally, 21554 the cookie value SHOULD NOT change if the directory is modified since 21555 the client may be caching these values. 21557 In some cases, the server may encounter an error while obtaining the 21558 attributes for a directory entry. Instead of returning an error for 21559 the entire READDIR operation, the server can instead return the 21560 attribute rdattr_error (Section 5.8.1.12). With this, the server is 21561 able to communicate the failure to the client and not fail the entire 21562 operation in the instance of what might be a transient failure. 21563 Obviously, the client must request the fattr4_rdattr_error attribute 21564 for this method to work properly. If the client does not request the 21565 attribute, the server has no choice but to return failure for the 21566 entire READDIR operation. 21568 For some file system environments, the directory entries "." and ".." 21569 have special meaning, and in other environments they do not. If the 21570 server supports these special entries within a directory, they SHOULD 21571 NOT be returned to the client as part of the READDIR response. To 21572 enable some client environments, the cookie values of 0, 1, and 2 are 21573 to be considered reserved. Note that the UNIX client will use these 21574 values when combining the server's response and local representations 21575 to enable a fully formed UNIX directory presentation to the 21576 application. 21578 For READDIR arguments, cookie values of 1 and 2 SHOULD NOT be used 21579 and for READDIR results cookie values of 0, 1, and 2 SHOULD NOT be 21580 returned. 21582 On success, the current filehandle retains its value. 21584 18.23.4. IMPLEMENTATION 21586 The server's file system directory representations can differ 21587 greatly. A client's programming interfaces may also be bound to the 21588 local operating environment in a way that does not translate well 21589 into the NFS protocol. Therefore, the dircount and 21590 maxcount fields are provided to enable the client to give hints to 21591 the server. If the client is aggressive about attribute collection 21592 during a READDIR, the server has an idea of how to limit the encoded 21593 response. 21595 If dircount is zero, the server bounds the reply's size based on 21596 the request's maxcount field. 21598 The cookieverf may be used by the server to help manage cookie values 21599 that may become stale. It should be a rare occurrence that a server 21600 is unable to continue properly reading a directory with the provided 21601 cookie/cookieverf pair. The server SHOULD make every effort to avoid 21602 this condition since the application at the client might be unable to 21603 properly handle this type of failure. 21605 The use of the cookieverf will also protect the client from using 21606 READDIR cookie values that might be stale. For example, if the file 21607 system has been migrated, the server might or might not be able to 21608 use the same cookie values to service READDIR as the previous server 21609 used. With the client providing the cookieverf, the server is able 21610 to provide the appropriate response to the client.
This prevents the 21611 case where the server accepts a cookie value but the underlying 21612 directory has changed and the response is invalid from the client's 21613 context of its previous READDIR. 21615 Since some servers will not be returning "." and ".." entries as has 21616 been done with previous versions of the NFS protocol, the client that 21617 requires these entries be present in READDIR responses must fabricate 21618 them. 21620 18.24. Operation 27: READLINK - Read Symbolic Link 21622 18.24.1. ARGUMENTS 21624 /* CURRENT_FH: symlink */ 21625 void; 21627 18.24.2. RESULTS 21629 struct READLINK4resok { 21630 linktext4 link; 21631 }; 21633 union READLINK4res switch (nfsstat4 status) { 21634 case NFS4_OK: 21635 READLINK4resok resok4; 21636 default: 21637 void; 21638 }; 21640 18.24.3. DESCRIPTION 21642 READLINK reads the data associated with a symbolic link. Depending 21643 on the value of the UTF-8 capability attribute (Section 14.4), the 21644 data is encoded in UTF-8. Whether created by an NFS client or 21645 created locally on the server, the data in a symbolic link is not 21646 interpreted (except possibly to check for proper UTF-8 encoding) when 21647 created, but is simply stored. 21649 On success, the current filehandle retains its value. 21651 18.24.4. IMPLEMENTATION 21653 A symbolic link is nominally a pointer to another file. The data is 21654 not necessarily interpreted by the server, just stored in the file. 21655 It is possible for a client implementation to store a path name that 21656 is not meaningful to the server operating system in a symbolic link. 21657 A READLINK operation returns the data to the client for 21658 interpretation. If different implementations want to share access to 21659 symbolic links, then they must agree on the interpretation of the 21660 data in the symbolic link. 21662 The READLINK operation is only allowed on objects of type NF4LNK. 21663 The server should return the error NFS4ERR_WRONG_TYPE if the object 21664 is not of type NF4LNK. 21666 18.25. Operation 28: REMOVE - Remove File System Object 21668 18.25.1. ARGUMENTS 21670 struct REMOVE4args { 21671 /* CURRENT_FH: directory */ 21672 component4 target; 21673 }; 21675 18.25.2. RESULTS 21677 struct REMOVE4resok { 21678 change_info4 cinfo; 21679 }; 21681 union REMOVE4res switch (nfsstat4 status) { 21682 case NFS4_OK: 21683 REMOVE4resok resok4; 21684 default: 21685 void; 21686 }; 21688 18.25.3. DESCRIPTION 21690 The REMOVE operation removes (deletes) a directory entry named by 21691 filename from the directory corresponding to the current filehandle. 21692 If the entry in the directory was the last reference to the 21693 corresponding file system object, the object may be destroyed. The 21694 directory may be either of type NF4DIR or NF4ATTRDIR. 21696 For the directory where the filename was removed, the server returns 21697 change_info4 information in cinfo. With the atomic field of the 21698 change_info4 data type, the server will indicate if the before and 21699 after change attributes were obtained atomically with respect to the 21700 removal. 21702 If the target has a length of 0 (zero), or if target does not obey 21703 the UTF-8 definition (and the server is enforcing UTF-8 encoding, see 21704 Section 14.4), the error NFS4ERR_INVAL will be returned. 21706 On success, the current filehandle retains its value. 21708 18.25.4. IMPLEMENTATION 21710 NFSv3 required a different operator RMDIR for directory removal and 21711 REMOVE for non-directory removal. 
This allowed clients to skip 21712 checking the file type when being passed a non-directory delete 21713 system call (e.g. unlink() [26] in POSIX) to remove a directory, as 21714 well as the converse (e.g. a rmdir() on a non-directory) because they 21715 knew the server would check the file type. NFSv4.1 REMOVE can be 21716 used to delete any directory entry independent of its file type. The 21717 implementor of an NFSv4.1 client's entry points from the unlink() and 21718 rmdir() system calls should first check the file type against the 21719 types the system call is allowed to remove before issuing a REMOVE. 21720 Alternatively, the implementor can produce a COMPOUND call that 21721 includes a LOOKUP/VERIFY sequence to verify the file type before a 21722 REMOVE operation in the same COMPOUND call. 21724 The concept of last reference is server specific. However, if the 21725 numlinks field in the previous attributes of the object had the value 21726 1, the client should not rely on referring to the object via a 21727 filehandle. Likewise, the client should not rely on the resources 21728 (disk space, directory entry, and so on) formerly associated with the 21729 object becoming immediately available. Thus, if a client needs to be 21730 able to continue to access a file after using REMOVE to remove it, 21731 the client should take steps to make sure that the file will still be 21732 accessible. While the traditional mechanism used is to RENAME the 21733 file from its old name to a new hidden name, the NFSv4.1 OPEN 21734 operation MAY return a result flag, OPEN4_RESULT_PRESERVE_UNLINKED, 21735 which indicates to the client that the file will be preserved if the 21736 file has an outstanding open (see Section 18.16). 21738 If the server finds that the file is still open when the REMOVE 21739 arrives: 21741 o The server SHOULD NOT delete the file's directory entry if the 21742 file was opened with OPEN4_SHARE_DENY_WRITE or 21743 OPEN4_SHARE_DENY_BOTH. 21745 o If the file was not opened with OPEN4_SHARE_DENY_WRITE or 21746 OPEN4_SHARE_DENY_BOTH, the server SHOULD delete the file's 21747 directory entry. However, until the last CLOSE of the file, the 21748 server MAY continue to allow access to the file via its 21749 filehandle. 21751 o The server MUST NOT delete the directory entry if the reply from 21752 OPEN had the flag OPEN4_RESULT_PRESERVE_UNLINKED set. 21754 The server MAY implement its own restrictions on removal of a file 21755 while it is open. The server might disallow such a REMOVE (or a 21756 removal that occurs as part of RENAME). The conditions that 21757 influence the restrictions on removal of a file while it is still 21758 open include: 21760 o Whether certain access protocols (i.e. not just NFS) are holding 21761 the file open. 21763 o Whether particular options, access modes, or policies on the 21764 server are enabled. 21766 In all cases in which a decision is made not to allow the file's 21767 directory entry to be removed because of an open, the error 21768 NFS4ERR_FILE_OPEN is returned. 21770 Where the determination above cannot be made definitively because 21771 delegations are being held, they MUST be recalled to allow processing 21772 of the REMOVE to continue.
When a delegation is held, the server's 21773 knowledge of the status of opens for that client is not to be relied 21774 on, so that, unless there are files opened with the particular deny 21775 modes by clients without delegations, the determination cannot be 21776 made until delegations are recalled, and the operation cannot proceed 21777 until sufficient delegations have been returned or revoked to 21778 allow the server to make a correct determination. 21780 In all cases in which delegations are recalled, the server is likely 21781 to return one or more NFS4ERR_DELAY errors while delegations remain 21782 outstanding. 21784 If the current filehandle designates a directory for which another 21785 client holds a directory delegation, then, unless the situation can 21786 be resolved by sending a notification, the directory delegation MUST 21787 be recalled, and the operation MUST NOT proceed until the delegation 21788 is returned or revoked. Except where this happens very quickly, one 21789 or more NFS4ERR_DELAY errors will be returned to requests made while 21790 the delegation remains outstanding. 21792 When the current filehandle designates a directory for which one or 21793 more directory delegations exist, then, when those delegations 21794 request such notifications, NOTIFY4_REMOVE_ENTRY will be generated as 21795 a result of this operation. 21797 Note that when a remove occurs as a result of a RENAME, 21798 NOTIFY4_REMOVE_ENTRY will only be generated if the removal happens as 21799 a separate operation. In the case in which the removal is integrated 21800 and atomic with RENAME, the notification of the removal is integrated 21801 with notification for the RENAME. See the discussion of the 21802 NOTIFY4_RENAME_ENTRY notification in Section 20.4. 21804 18.26. Operation 29: RENAME - Rename Directory Entry 21805 18.26.1. ARGUMENTS 21807 struct RENAME4args { 21808 /* SAVED_FH: source directory */ 21809 component4 oldname; 21810 /* CURRENT_FH: target directory */ 21811 component4 newname; 21812 }; 21814 18.26.2. RESULTS 21816 struct RENAME4resok { 21817 change_info4 source_cinfo; 21818 change_info4 target_cinfo; 21819 }; 21821 union RENAME4res switch (nfsstat4 status) { 21822 case NFS4_OK: 21823 RENAME4resok resok4; 21824 default: 21825 void; 21826 }; 21828 18.26.3. DESCRIPTION 21830 The RENAME operation renames the object identified by oldname in the 21831 source directory corresponding to the saved filehandle, as set by the 21832 SAVEFH operation, to newname in the target directory corresponding to 21833 the current filehandle. The operation is required to be atomic to 21834 the client. Source and target directories MUST reside on the same 21835 file system on the server. On success, the current filehandle will 21836 continue to be the target directory. 21838 If the target directory already contains an entry with the name, 21839 newname, the source object MUST be compatible with the target: either 21840 both are non-directories or both are directories and the target MUST 21841 be empty. If compatible, the existing target is removed before the 21842 rename occurs or preferably as part of the rename and atomic with it. 21843 See Section 18.25.4 for client and server actions whenever a target 21844 is removed. Note, however, that when the removal is performed 21845 atomically with the rename, certain parts of the removal described 21846 there are integrated with the rename.
For example, notification of 21847 the removal will not be via a NOTIFY4_REMOVE_ENTRY but will be 21848 indicated as part of the NOTIFY4_ADD_ENTRY or NOTIFY4_RENAME_ENTRY 21849 generated by the rename. 21851 If the source object and the target are not compatible or if the 21852 target is a directory but not empty, the server will return the 21853 error NFS4ERR_EXIST. 21855 If oldname and newname both refer to the same file (e.g. they might 21856 be hard links of each other), then unless the file is open (see 21857 Section 18.26.4), RENAME MUST perform no action and return NFS4_OK. 21859 For both directories involved in the RENAME, the server returns 21860 change_info4 information. With the atomic field of the change_info4 21861 data type, the server will indicate if the before and after change 21862 attributes were obtained atomically with respect to the rename. 21864 If oldname refers to a named attribute and the saved and current 21865 filehandles refer to different file system objects, the server will 21866 return NFS4ERR_XDEV just as if the saved and current filehandles 21867 represented directories on different file systems. 21869 If oldname or newname has a length of 0 (zero), or if oldname or 21870 newname do not obey the UTF-8 definition, the error NFS4ERR_INVAL 21871 will be returned. 21873 18.26.4. IMPLEMENTATION 21875 The server MAY impose restrictions on the RENAME operation such that 21876 RENAME may not be done when the file being renamed is open or when 21877 that open is done by particular protocols, or with particular options 21878 or access modes. Similar restrictions may be applied when a file 21879 exists with the target name and is open. When RENAME is rejected 21880 because of such restrictions, the error NFS4ERR_FILE_OPEN is 21881 returned. 21883 When oldname and newname refer to the same file and that file is open 21884 in a fashion such that RENAME would normally be rejected with 21885 NFS4ERR_FILE_OPEN if oldname and newname were different files, then 21886 RENAME SHOULD be rejected with NFS4ERR_FILE_OPEN. 21888 If a server does implement such restrictions and those restrictions 21889 include cases of NFSv4 opens preventing successful execution of a 21890 rename, the server needs to recall any delegations which could hide 21891 the existence of opens relevant to that decision. This is because 21892 when a client holds a delegation, the server might not have an 21893 accurate account of the opens for that client, since the client may 21894 execute OPENs and CLOSEs locally. The RENAME operation need only be 21895 delayed until a definitive result can be obtained. For example, if 21896 there are multiple delegations and one of them establishes an open 21897 whose presence would prevent the rename, given the server's 21898 semantics, NFS4ERR_FILE_OPEN may be returned to the caller as soon as 21899 that delegation is returned without waiting for other delegations to 21900 be returned. Similarly, if such opens are not associated with 21901 delegations, NFS4ERR_FILE_OPEN can be returned immediately with no 21902 delegation recall being done. 21904 If the current filehandle or the saved filehandle designates a 21905 directory for which another client holds a directory delegation, 21906 then, unless the situation can be resolved by sending a notification, 21907 the delegation MUST be recalled, and the operation cannot proceed 21908 until the delegation is returned or revoked.
Except where this 21909 happens very quickly, one or more NFS4ERR_DELAY errors will be 21910 returned to requests made while the delegation remains outstanding. 21912 When the current and saved filehandles are the same and they 21913 designate a directory for which one or more directory delegations 21914 exist, then, when those delegations request such notifications, a 21915 notification of type NOTIFY4_RENAME_ENTRY will be generated as a 21916 result of this operation. When oldname and newname refer to the same 21917 file, no notification is generated (because, as Section 18.26.3 21918 states, the server MUST take no action). When a file is removed 21919 because it has the same name as the target, if that removal is done 21920 atomically with the rename, a NOTIFY4_REMOVE_ENTRY notification will 21921 not be generated. Instead, the deletion of the file will be reported 21922 as part of the NOTIFY4_RENAME_ENTRY notification. 21924 When the current and saved filehandles are not the same: 21926 o If the current filehandle designates a directory for which one or 21927 more directory delegations exist, then, when those delegations 21928 request such notifications, NOTIFY4_ADD_ENTRY will be generated as 21929 a result of this operation. When a file is removed because it has 21930 the same name as the target, if that removal is done atomically 21931 with the rename, a NOTIFY4_REMOVE_ENTRY notification will not be 21932 generated. Instead, the deletion of the file will be reported as 21933 part of the NOTIFY4_ADD_ENTRY notification. 21935 o If the saved filehandle designates a directory for which one or 21936 more directory delegations exist, then, when those delegations 21937 request such notifications, NOTIFY4_REMOVE_ENTRY will be generated 21938 as a result of this operation. 21940 If the object being renamed has file delegations held by clients 21941 other than the one doing the RENAME, the delegations MUST be 21942 recalled, and the operation cannot proceed until each such delegation 21943 is returned or revoked. Note that in the case of multiply linked 21944 files, the delegation recall requirement applies even if the 21945 delegation was obtained through a different name than the one being 21946 renamed. In all cases in which delegations are recalled, the server 21947 is likely to return one or more NFS4ERR_DELAY errors while the 21948 delegation(s) remains outstanding, although it may, if the returns 21949 happen quickly, not do that. 21951 The RENAME operation must be atomic to the client. The statement 21952 "source and target directories MUST reside on the same file system on 21953 the server" means that the fsid fields in the attributes for the 21954 directories are the same. If they reside on different file systems, 21955 the error NFS4ERR_XDEV is returned. 21957 Based on the value of the fh_expire_type attribute for the object, 21958 the filehandle may or may not expire on a RENAME. However, server 21959 implementors are strongly encouraged to attempt to keep filehandles 21960 from expiring in this fashion. 21962 On some servers, the file names "." and ".." are illegal as either 21963 oldname or newname, and will result in the error NFS4ERR_BADNAME. In 21964 addition, on many servers the case of oldname or newname being an 21965 alias for the source directory will be checked for. Such servers 21966 will return the error NFS4ERR_INVAL in these cases. 21968 If either the source or the target filehandle does not designate a directory, 21969 the server will return NFS4ERR_NOTDIR. 21971 18.27.
Operation 31: RESTOREFH - Restore Saved Filehandle 21973 18.27.1. ARGUMENTS 21975 /* SAVED_FH: */ 21976 void; 21978 18.27.2. RESULTS 21980 struct RESTOREFH4res { 21981 /* 21982 * If status is NFS4_OK, 21983 * new CURRENT_FH: value of saved fh 21984 */ 21985 nfsstat4 status; 21986 }; 21988 18.27.3. DESCRIPTION 21990 Set the current filehandle and stateid to the values in the saved 21991 filehandle and stateid. If there is no saved filehandle then the 21992 server will return the error NFS4ERR_NOFILEHANDLE. 21994 See Section 16.2.3.1.1 for more details on the current filehandle. 21996 See Section 16.2.3.1.2 for more details on the current stateid. 21998 18.27.4. IMPLEMENTATION 22000 Operations like OPEN and LOOKUP use the current filehandle to 22001 represent a directory and replace it with a new filehandle. Assuming 22002 the previous filehandle was saved with a SAVEFH operator, the 22003 previous filehandle can be restored as the current filehandle. This 22004 is commonly used to obtain post-operation attributes for the 22005 directory, e.g. 22007 PUTFH (directory filehandle) 22008 SAVEFH 22009 GETATTR attrbits (pre-op dir attrs) 22010 CREATE optbits "foo" attrs 22011 GETATTR attrbits (file attributes) 22012 RESTOREFH 22013 GETATTR attrbits (post-op dir attrs) 22015 18.28. Operation 32: SAVEFH - Save Current Filehandle 22017 18.28.1. ARGUMENTS 22019 /* CURRENT_FH: */ 22020 void; 22022 18.28.2. RESULTS 22024 struct SAVEFH4res { 22025 /* 22026 * If status is NFS4_OK, 22027 * new SAVED_FH: value of current fh 22028 */ 22029 nfsstat4 status; 22030 }; 22032 18.28.3. DESCRIPTION 22034 Save the current filehandle and stateid. If a previous filehandle 22035 was saved then it is no longer accessible. The saved filehandle can 22036 be restored as the current filehandle with the RESTOREFH operator. 22038 On success, the current filehandle retains its value. 22040 See Section 16.2.3.1.1 for more details on the current filehandle. 22042 See Section 16.2.3.1.2 for more details on the current stateid. 22044 18.28.4. IMPLEMENTATION 22046 18.29. Operation 33: SECINFO - Obtain Available Security 22048 18.29.1. ARGUMENTS 22050 struct SECINFO4args { 22051 /* CURRENT_FH: directory */ 22052 component4 name; 22053 }; 22055 18.29.2. RESULTS 22057 /* 22058 * From RFC 2203 22059 */ 22060 enum rpc_gss_svc_t { 22061 RPC_GSS_SVC_NONE = 1, 22062 RPC_GSS_SVC_INTEGRITY = 2, 22063 RPC_GSS_SVC_PRIVACY = 3 22064 }; 22066 struct rpcsec_gss_info { 22067 sec_oid4 oid; 22068 qop4 qop; 22069 rpc_gss_svc_t service; 22070 }; 22072 /* RPCSEC_GSS has a value of '6' - See RFC 2203 */ 22073 union secinfo4 switch (uint32_t flavor) { 22074 case RPCSEC_GSS: 22075 rpcsec_gss_info flavor_info; 22076 default: 22077 void; 22078 }; 22080 typedef secinfo4 SECINFO4resok<>; 22082 union SECINFO4res switch (nfsstat4 status) { 22083 case NFS4_OK: 22084 /* CURRENTFH: consumed */ 22085 SECINFO4resok resok4; 22086 default: 22087 void; 22088 }; 22090 18.29.3. DESCRIPTION 22092 The SECINFO operation is used by the client to obtain a list of valid 22093 RPC authentication flavors for a specific directory filehandle, file 22094 name pair. SECINFO should apply the same access methodology used for 22095 LOOKUP when evaluating the name. Therefore, if the requester does 22096 not have the appropriate access to LOOKUP the name then SECINFO MUST 22097 behave the same way and return NFS4ERR_ACCESS. 
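The paragraphs that follow describe the reply as an array of secinfo4 entries listed in the server's order of preference, from which the client picks a flavor it both supports and is willing to use. As a non-normative illustration of that selection policy, the C sketch below walks such an array; the secinfo4 layout is abbreviated, and client_supports_flavor() is a hypothetical helper standing in for whatever security configuration a real client implementation consults.

   /*
    * Non-normative sketch: choose an RPC security flavor from a
    * SECINFO reply, assuming the entries are in the server's
    * preference order (most preferred first).
    */
   #include <stdbool.h>
   #include <stdint.h>

   struct secinfo4 {          /* abbreviated form of the XDR secinfo4 */
       uint32_t flavor;       /* AUTH_NONE, AUTH_SYS, RPCSEC_GSS, ... */
       /* for RPCSEC_GSS, the OID/QOP/service triple would follow */
   };

   /* Hypothetical helper: can this client use the given entry? */
   extern bool client_supports_flavor(const struct secinfo4 *si);

   /* Return the index of the first usable entry, or -1 if none. */
   static int choose_flavor(const struct secinfo4 *list, unsigned int n)
   {
       for (unsigned int i = 0; i < n; i++)
           if (client_supports_flavor(&list[i]))
               return (int)i;
       return -1;
   }

If no entry is usable, the client and server share no acceptable flavor for that name, and the client cannot proceed with the access it was attempting.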
22099 The result will contain an array which represents the security 22100 mechanisms available, with an order corresponding to the server's 22101 preferences, the most preferred being first in the array. The client 22102 is free to pick whatever security mechanism it both desires and 22103 supports, or to pick in the server's preference order the first one 22104 it supports. The array entries are represented by the secinfo4 22105 structure. The field 'flavor' will contain a value of AUTH_NONE, 22106 AUTH_SYS (as defined in RFC1831 [3]), or RPCSEC_GSS (as defined in 22107 RFC2203 [4]). The field flavor can also be any other security flavor 22108 registered with IANA. 22110 For the flavors AUTH_NONE and AUTH_SYS, no additional security 22111 information is returned. The same is true of many (if not most) 22112 other security flavors, including AUTH_DH. For a return value of 22113 RPCSEC_GSS, a security triple is returned that contains the mechanism 22114 object identifier (OID, as defined in RFC2743 [7]), the quality of 22115 protection (as defined in RFC2743 [7]), and the service type (as 22116 defined in RFC2203 [4]). It is possible for SECINFO to return 22117 multiple entries with flavor equal to RPCSEC_GSS with different 22118 security triple values. 22120 On success, the current filehandle is consumed (see 22121 Section 2.6.3.1.1.8), and if the next operation after SECINFO tries 22122 to use the current filehandle, that operation will fail with the 22123 status NFS4ERR_NOFILEHANDLE. 22125 If the name has a length of 0 (zero), or if name does not obey the 22126 UTF-8 definition (assuming UTF-8 capabilities are enabled, see 22127 Section 14.4), the error NFS4ERR_INVAL will be returned. 22129 See Section 2.6 for additional information on the use of SECINFO. 22131 18.29.4. IMPLEMENTATION 22133 The SECINFO operation is expected to be used by the NFS client when 22134 the error value of NFS4ERR_WRONGSEC is returned from another NFS 22135 operation. This signifies to the client that the server's security 22136 policy is different from what the client is currently using. At this 22137 point, the client is expected to obtain a list of possible security 22138 flavors and choose what best suits its policies. 22140 As mentioned, the server's security policies will determine when a 22141 client request receives NFS4ERR_WRONGSEC. See Table 8 for a list of 22142 operations that can return NFS4ERR_WRONGSEC. In addition, when 22143 READDIR returns attributes, the rdattr_error (Section 5.8.1.12) can 22144 contain NFS4ERR_WRONGSEC. Note that CREATE and REMOVE MUST NOT 22145 return NFS4ERR_WRONGSEC. The rationale for CREATE is that unless the 22146 target name exists it cannot have a separate security policy from the 22147 parent directory, and the security policy of the parent was checked 22148 when its filehandle was injected into the COMPOUND request's 22149 operations stream (for similar reasons, an OPEN operation that 22150 creates the target MUST NOT return NFS4ERR_WRONGSEC). If the target 22151 name exists, while it might have a separate security policy, that is 22152 irrelevant because CREATE MUST return NFS4ERR_EXIST. The rationale 22153 for REMOVE is that while that target might have a separate security 22154 policy, the target is going to be removed, and so the security policy 22155 of the parent trumps that of the object being removed. RENAME and 22156 LINK MAY return NFS4ERR_WRONGSEC, but the NFS4ERR_WRONGSEC error 22157 applies only to the saved filehandle (see Section 2.6.3.1.2).
Any 22158 NFS4ERR_WRONGSEC error on the current filehandle used by LINK and 22159 RENAME MUST be returned by the PUTFH, PUTPUBFH, PUTROOTFH, or 22160 RESTOREFH operation that injected the current filehandle. 22162 With the exception of LINK and RENAME, the set of operations that can 22163 return NFS4ERR_WRONGSEC represents the point at which the client can 22164 inject a filehandle into the "current filehandle" at the server. The 22165 filehandle is either provided by the client (PUTFH, PUTPUBFH, 22166 PUTROOTFH), generated as a result of a name-to-filehandle translation 22167 (LOOKUP and OPEN), or generated from the saved filehandle via 22168 RESTOREFH. As Section 2.6.3.1.1.1 states, a put filehandle operation 22169 followed by SAVEFH MUST NOT return NFS4ERR_WRONGSEC. Thus, the 22170 RESTOREFH operation, under certain conditions (see Section 2.6.3.1.1), 22171 is permitted to return NFS4ERR_WRONGSEC so that security policies can 22172 be honored. 22174 The READDIR operation will not directly return the NFS4ERR_WRONGSEC 22175 error. However, if the READDIR request included a request for 22176 attributes, it is possible that the READDIR request's security triple 22177 did not match that of a directory entry. If this is the case and the 22178 client has requested the rdattr_error attribute, the server will 22179 return the NFS4ERR_WRONGSEC error in rdattr_error for the entry. 22181 To resolve an error return of NFS4ERR_WRONGSEC, the client does the 22182 following: 22184 o For LOOKUP and OPEN, the client will use SECINFO with the same 22185 current filehandle and name as provided in the original LOOKUP or 22186 OPEN to enumerate the available security triples. 22188 o For the rdattr_error, the client will use SECINFO with the same 22189 current filehandle as provided in the original READDIR. The name 22190 passed to SECINFO will be that of the directory entry (as returned 22191 from READDIR) that had the NFS4ERR_WRONGSEC error in the 22192 rdattr_error attribute. 22194 o For PUTFH, PUTROOTFH, PUTPUBFH, RESTOREFH, LINK, and RENAME, the 22195 client will use SECINFO_NO_NAME { style = 22196 SECINFO_STYLE4_CURRENT_FH }. The client will prefix the 22197 SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or 22198 PUTROOTFH operation that provides the filehandle originally 22199 provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH 22200 operation. 22202 NOTE: In NFSv4.0, the client was required to use SECINFO, and had 22203 to reconstruct the parent of the original filehandle, and the 22204 component name of the original filehandle. The introduction in 22205 NFSv4.1 of SECINFO_NO_NAME obviates the need for reconstruction. 22207 o For LOOKUPP, the client will use SECINFO_NO_NAME { style = 22208 SECINFO_STYLE4_PARENT } and provide the filehandle which equals 22209 the filehandle originally provided to LOOKUPP. 22211 See Section 21 for a discussion of the recommendations for the 22212 security flavor used by SECINFO and SECINFO_NO_NAME. 22214 18.30. Operation 34: SETATTR - Set Attributes 22216 18.30.1. ARGUMENTS 22218 struct SETATTR4args { 22219 /* CURRENT_FH: target object */ 22220 stateid4 stateid; 22221 fattr4 obj_attributes; 22222 }; 22224 18.30.2. RESULTS 22226 struct SETATTR4res { 22227 nfsstat4 status; 22228 bitmap4 attrsset; 22229 }; 22231 18.30.3. DESCRIPTION 22233 The SETATTR operation changes one or more of the attributes of a file 22234 system object. The new attributes are specified with a bitmap and 22235 the attributes that follow the bitmap in bit order.
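Since the attribute values appear in the opaque attribute data in ascending order of their bit numbers, a decoder can simply scan the bitmap words from word 0 upward and, within each word, from bit 0 upward. The following non-normative C sketch illustrates this; the xdr_stream type and the decode_attr_value() helper are hypothetical stand-ins for an implementation's own XDR machinery.

   /*
    * Non-normative sketch: attribute number n is carried in word n/32
    * of the bitmap at bit position n mod 32, and the attribute values
    * follow the bitmap in ascending order of attribute number.
    */
   #include <stdint.h>

   struct xdr_stream;                     /* hypothetical decode context */
   extern int decode_attr_value(struct xdr_stream *xdr, unsigned int attrno);

   static int decode_fattr4(struct xdr_stream *xdr,
                            const uint32_t *bitmap, unsigned int words)
   {
       for (unsigned int w = 0; w < words; w++) {
           for (unsigned int b = 0; b < 32; b++) {
               if (bitmap[w] & (UINT32_C(1) << b)) {
                   int err = decode_attr_value(xdr, w * 32 + b);
                   if (err)
                       return err;
               }
           }
       }
       return 0;
   }

An encoder follows the same order when building the obj_attributes argument, so the server can decode the attribute data with the equivalent walk over the request's bitmap.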
22237 The stateid argument for SETATTR is used to provide file locking 22238 context that is necessary for SETATTR requests that set the size 22239 attribute. Since setting the size attribute modifies the file's 22240 data, it has the same locking requirements as a corresponding WRITE. 22241 Any SETATTR that sets the size attribute is incompatible with a share 22242 reservation that specifies DENY_WRITE. The area between the old end- 22243 of-file and the new end-of-file is considered to be modified just as 22244 would have been the case had the area in question been specified as 22245 the target of WRITE, for the purpose of checking conflicts with byte- 22246 range locks, for those cases in which a server is implementing 22247 mandatory byte-range locking behavior. A valid stateid SHOULD always 22248 be specified. When the file size attribute is not set, the special 22249 stateid consisting of all bits zero MAY be passed. 22251 On either success or failure of the operation, the server will return 22252 the attrsset bitmask to represent what (if any) attributes were 22253 successfully set. The attrsset in the response is a subset of the 22254 attrmask field of the obj_attributes field in the argument. 22256 On success, the current filehandle retains its value. 22258 18.30.4. IMPLEMENTATION 22260 If the request specifies the owner attribute to be set, the server 22261 SHOULD allow the operation to succeed if the current owner of the 22262 object matches the value specified in the request. Some servers may 22263 be implemented in such a way as to prohibit the setting of the owner 22264 attribute unless the requester has privilege to do so. If the server 22265 is lenient in this one case of matching owner values, the client 22266 implementation may be simplified in cases of creation of an object 22267 (e.g. an exclusive create via OPEN) followed by a SETATTR. 22269 The file size attribute is used to request changes to the size of a 22270 file. A value of zero causes the file to be truncated, a value less 22271 than the current size of the file causes data from the new size to the 22272 end of the file to be discarded, and a size greater than the current 22273 size of the file causes logically zeroed data bytes to be added to 22274 the end of the file. Servers are free to implement this using 22275 unallocated bytes (holes) or allocated data bytes set to zero. 22276 Clients should not make any assumptions regarding a server's 22277 implementation of this feature, beyond that the bytes in the affected 22278 region returned by READ will be zeroed. Servers MUST support 22279 extending the file size via SETATTR. 22281 SETATTR is not guaranteed to be atomic. A failed SETATTR may 22282 partially change a file's attributes, hence the reply 22283 always includes the status and the list of attributes that were set. 22285 If the object whose attributes are being changed has a file 22286 delegation which is held by a client other than the one doing the 22287 SETATTR, the delegation(s) must be recalled, and the operation cannot 22288 proceed to actually change an attribute until each such delegation is 22289 returned or revoked. In all cases in which delegations are recalled, 22290 the server is likely to return one or more NFS4ERR_DELAY errors while 22291 the delegation(s) remains outstanding, although it may, if the 22292 returns happen quickly, not do that.
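In practice, the client's side of this exchange is a retry loop: a SETATTR (like the READ, WRITE, REMOVE, and RENAME cases discussed elsewhere in this section) that is answered with NFS4ERR_DELAY is simply re-sent after a pause until the recall completes. The following non-normative C sketch shows one such loop; nfs_send_setattr() and the backoff policy are hypothetical and stand in for an implementation's own RPC layer and retry strategy.

   /*
    * Non-normative sketch: retry a request that the server answers
    * with NFS4ERR_DELAY while it recalls conflicting delegations.
    */
   #include <time.h>

   #define NFS4ERR_DELAY 10008          /* protocol error code for "delay" */

   extern int nfs_send_setattr(void *req);   /* hypothetical RPC helper */

   static int setattr_with_delay_retry(void *req)
   {
       long wait_ms = 100;              /* hypothetical backoff: 100 ms ... */
       int status;

       while ((status = nfs_send_setattr(req)) == NFS4ERR_DELAY) {
           struct timespec ts = { wait_ms / 1000,
                                  (wait_ms % 1000) * 1000000L };
           nanosleep(&ts, NULL);        /* give the recall time to finish */
           if (wait_ms < 5000)          /* ... doubling up to a 5 s cap   */
               wait_ms *= 2;
       }
       return status;
   }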
22294 If the object whose attributes are being set is a directory and 22295 another client holds a directory delegation for that directory, then, 22296 if enabled, asynchronous notifications will be generated when the set 22297 of attributes changed has a non-null intersection with the set of 22298 attributes for which notification is requested. Notifications of 22299 type NOTIFY4_CHANGE_DIR_ATTRS will be sent to the appropriate 22300 client(s), but the SETATTR is not delayed by waiting for these 22301 notifications to be sent. 22303 If the object whose attributes are being set is a member of a directory 22304 for which another client holds a directory delegation, then 22305 asynchronous notifications will be generated when the set of 22306 attributes changed has a non-null intersection with the set of 22307 attributes for which notification is requested. Notifications of 22308 type NOTIFY4_CHANGE_CHILD_ATTRS will be sent to the appropriate 22309 clients, but the SETATTR is not delayed by waiting for these 22310 notifications to be sent. 22312 Changing the size of a file with SETATTR indirectly changes the 22313 time_modify and change attributes. A client must account for this as 22314 size changes can result in data deletion. 22316 The attributes time_access_set and time_modify_set are write-only 22317 attributes constructed as a switched union so the client can direct 22318 the server in setting the time values. If the switched union 22319 specifies SET_TO_CLIENT_TIME4, the client has provided an nfstime4 to 22320 be used for the operation. If the switched union does not specify 22321 SET_TO_CLIENT_TIME4, the server is to use its current time for the 22322 SETATTR operation. 22324 If server and client times differ, programs that compare client time 22325 to file times can break. A time synchronization protocol should be 22326 used to limit client/server time skew. 22328 Use of a COMPOUND containing a VERIFY operation specifying only the 22329 change attribute, immediately followed by a SETATTR, provides a means 22330 whereby a client may specify a request that emulates the 22331 functionality of the SETATTR guard mechanism of NFSv3. Since the 22332 function of the guard mechanism is to avoid changes to the file 22333 attributes based on stale information, delays between checking of the 22334 guard condition and the setting of the attributes have the potential 22335 to compromise this function, as would the corresponding delay in the 22336 NFSv4 emulation. Therefore, NFSv4.1 servers SHOULD take care to 22337 avoid such delays, to the degree possible, when executing such a 22338 request. 22340 If the server does not support an attribute as requested by the 22341 client, the server SHOULD return NFS4ERR_ATTRNOTSUPP. 22343 A mask of the attributes actually set is returned by SETATTR in all 22344 cases. That mask MUST NOT include attribute bits not requested to 22345 be set by the client. If the attribute masks in the request and 22346 reply are equal, the status field in the reply MUST be NFS4_OK. 22348 18.31. Operation 37: VERIFY - Verify Same Attributes 22350 18.31.1. ARGUMENTS 22352 struct VERIFY4args { 22353 /* CURRENT_FH: object */ 22354 fattr4 obj_attributes; 22355 }; 22357 18.31.2. RESULTS 22359 struct VERIFY4res { 22360 nfsstat4 status; 22361 }; 22363 18.31.3. DESCRIPTION 22365 The VERIFY operation is used to verify that attributes have the value 22366 assumed by the client before proceeding with the following operations in 22367 the COMPOUND request.
If any of the attributes do not match then the 22368 error NFS4ERR_NOT_SAME must be returned. The current filehandle 22369 retains its value after successful completion of the operation. 22371 18.31.4. IMPLEMENTATION 22373 One possible use of the VERIFY operation is the following series of 22374 operations. With this the client is attempting to verify that the 22375 file being removed will match what the client expects to be removed. 22376 This series can help prevent the unintended deletion of a file. 22378 PUTFH (directory filehandle) 22379 LOOKUP (file name) 22380 VERIFY (filehandle == fh) 22381 PUTFH (directory filehandle) 22382 REMOVE (file name) 22384 This series does not prevent a second client from removing and 22385 creating a new file in the middle of this sequence but it does help 22386 avoid the unintended result. 22388 In the case that a RECOMMENDED attribute is specified in the VERIFY 22389 operation and the server does not support that attribute for the file 22390 system object, the error NFS4ERR_ATTRNOTSUPP is returned to the 22391 client. 22393 When the attribute rdattr_error or any set-only attribute (e.g. 22394 time_modify_set) is specified, the error NFS4ERR_INVAL is returned to 22395 the client. 22397 18.32. Operation 38: WRITE - Write to File 22399 18.32.1. ARGUMENTS 22401 enum stable_how4 { 22402 UNSTABLE4 = 0, 22403 DATA_SYNC4 = 1, 22404 FILE_SYNC4 = 2 22405 }; 22407 struct WRITE4args { 22408 /* CURRENT_FH: file */ 22409 stateid4 stateid; 22410 offset4 offset; 22411 stable_how4 stable; 22412 opaque data<>; 22413 }; 22415 18.32.2. RESULTS 22417 struct WRITE4resok { 22418 count4 count; 22419 stable_how4 committed; 22420 verifier4 writeverf; 22421 }; 22423 union WRITE4res switch (nfsstat4 status) { 22424 case NFS4_OK: 22425 WRITE4resok resok4; 22426 default: 22427 void; 22428 }; 22430 18.32.3. DESCRIPTION 22432 The WRITE operation is used to write data to a regular file. The 22433 target file is specified by the current filehandle. The offset 22434 specifies the offset where the data should be written. An offset of 22435 0 (zero) specifies that the write should start at the beginning of 22436 the file. The count, as encoded as part of the opaque data 22437 parameter, represents the number of bytes of data that are to be 22438 written. If the count is 0 (zero), the WRITE will succeed and return 22439 a count of 0 (zero) subject to permissions checking. The server MAY 22440 write fewer bytes than requested by the client. 22442 The client specifies with the stable parameter the method of how the 22443 data is to be processed by the server. If stable is FILE_SYNC4, the 22444 server MUST commit the data written plus all file system metadata to 22445 stable storage before returning results. This corresponds to the 22446 NFSv2 protocol semantics. Any other behavior constitutes a protocol 22447 violation. If stable is DATA_SYNC4, then the server MUST commit all 22448 of the data to stable storage and enough of the metadata to retrieve 22449 the data before returning. The server implementor is free to 22450 implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a 22451 possible performance drop. If stable is UNSTABLE4, the server is 22452 free to commit any part of the data and the metadata to stable 22453 storage, including all or none, before returning a reply to the 22454 client. There is no guarantee whether or when any uncommitted data 22455 will subsequently be committed to stable storage. 
The only 22456 guarantees made by the server are that it will not destroy any data 22457 without changing the value of writeverf and that it will not commit 22458 the data and metadata at a level less than that requested by the 22459 client. 22461 Except when special stateids are used, the stateid value for a WRITE 22462 request represents a value returned from a previous byte-range LOCK 22463 or OPEN request or the stateid associated with a delegation. The 22464 stateid identifies the associated owners, if any, and is used by the 22465 server to verify that the associated locks are still valid (e.g. have 22466 not been revoked). 22468 Upon successful completion, the following results are returned. The 22469 count result is the number of bytes of data written to the file. The 22470 server may write fewer bytes than requested. If so, the actual 22471 number of bytes written starting at location, offset, is returned. 22473 The server also returns an indication of the level of commitment of 22474 the data and metadata via committed. Per Table 11, 22476 o The server MAY commit the data at a stronger level than requested. 22478 o The server MUST commit the data at a level at least as high as 22479 that committed. 22481 The valid combinations of the field stable in the request and the field committed 22482 in the reply are shown in Table 11.

22484 +------------+-----------------------------------+
22485 | stable     | committed                         |
22486 +------------+-----------------------------------+
22487 | UNSTABLE4  | FILE_SYNC4, DATA_SYNC4, UNSTABLE4 |
22488 | DATA_SYNC4 | FILE_SYNC4, DATA_SYNC4            |
22489 | FILE_SYNC4 | FILE_SYNC4                        |
22490 +------------+-----------------------------------+

22492 Table 11

22494 The final portion of the result is the field writeverf. This field 22495 is the write verifier and is a cookie that the client can use to 22496 determine whether a server has changed instance state (e.g. server 22497 restart) between a call to WRITE and a subsequent call to either 22498 WRITE or COMMIT. This cookie MUST be unchanged during a single 22499 instance of the NFSv4.1 server and MUST be unique between instances 22500 of the NFSv4.1 server. If the cookie changes, then the client MUST 22501 assume that any data written with an UNSTABLE4 value for committed 22502 and an old writeverf in the reply has been lost and will need to be 22503 recovered. 22505 If a client writes data to the server with the stable argument set to 22506 UNSTABLE4 and the reply yields a committed response of DATA_SYNC4 or 22507 UNSTABLE4, the client will follow up some time in the future with a 22508 COMMIT operation to synchronize outstanding asynchronous data and 22509 metadata with the server's stable storage, barring client error. It 22510 is possible that, due to client crash or other error, a subsequent 22511 COMMIT will not be received by the server. 22513 For a WRITE with a stateid value of all bits 0, the server MAY allow 22514 the WRITE to be serviced subject to mandatory file locks or the 22515 current share deny modes for the file. For a WRITE with a stateid 22516 value of all bits 1, the server MUST NOT allow the WRITE operation to 22517 bypass locking checks at the server and otherwise is treated as if a 22518 stateid of all bits 0 were used. 22520 On success, the current filehandle retains its value. 22522 18.32.4. IMPLEMENTATION 22524 It is possible for the server to write fewer bytes of data than 22525 requested by the client. In this case, the server SHOULD NOT return 22526 an error unless no data was written at all.
If the server writes 22527 less than the number of bytes specified, the client will need to send 22528 another WRITE to write the remaining data. 22530 It is assumed that the act of writing data to a file will cause the 22531 time_modify and change attributes of the file to be updated. 22532 However, these attributes SHOULD NOT be changed unless the contents 22533 of the file are changed. Thus, a WRITE request with count set to 0 22534 SHOULD NOT cause the time_modify and change attributes of the file 22535 to be updated. 22537 Stable storage is persistent storage that survives: 22539 1. Repeated power failures. 22541 2. Hardware failures (of any board, power supply, etc.). 22543 3. Repeated software crashes and restarts. 22545 This definition does not address failure of the stable storage module 22546 itself. 22548 The verifier is defined to allow a client to detect different 22549 instances of an NFSv4.1 protocol server over which cached, 22550 uncommitted data may be lost. In the most likely case, the verifier 22551 allows the client to detect server restarts. This information is 22552 required so that the client can safely determine whether the server 22553 could have lost cached data. If the server fails unexpectedly and 22554 the client has uncommitted data from previous WRITE requests (done 22555 with the stable argument set to UNSTABLE4 and in which the result 22556 committed was returned as UNSTABLE4 as well), the server might not 22557 have flushed cached data to stable storage. The burden of recovery 22558 is on the client and the client will need to retransmit the data to 22559 the server. 22561 A suggested verifier would be to use the time that the server was 22562 last started (if restarting the server results in lost buffers). 22564 The reply's committed field allows the client to do more effective 22565 caching. If the server is committing all WRITE requests to stable 22566 storage, then it SHOULD return with committed set to FILE_SYNC4, 22567 regardless of the value of the stable field in the arguments. A 22568 server that uses an NVRAM accelerator may choose to implement this 22569 policy. The client can use this to increase the effectiveness of the 22570 cache by discarding cached data that has already been committed on 22571 the server. 22573 Some implementations may return NFS4ERR_NOSPC instead of 22574 NFS4ERR_DQUOT when a user's quota is exceeded. 22576 In the case that the current filehandle is of type NF4DIR, the server 22577 will return NFS4ERR_ISDIR. If the current file is a symbolic link, 22578 the error NFS4ERR_SYMLINK will be returned. Otherwise, if the 22579 current filehandle does not designate an ordinary file, the server 22580 will return NFS4ERR_WRONG_TYPE. 22582 If mandatory file locking is in effect for the file, and the 22583 corresponding byte-range of the data to be written to the file is 22584 read or write locked by an owner that is not associated with the 22585 stateid, the server MUST return NFS4ERR_LOCKED. If so, the client 22586 MUST check if the owner corresponding to the stateid used with the 22587 WRITE operation has a conflicting read lock that overlaps with the 22588 region that was to be written. If the stateid's owner has no 22589 conflicting read lock, then the client SHOULD try to get the 22590 appropriate write byte-range lock via the LOCK operation before re- 22591 attempting the WRITE. When the WRITE completes, the client SHOULD 22592 release the byte-range lock via LOCKU.
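The decision just described, together with the conflicting-read-lock case discussed in the next paragraph, can be summarized in a short non-normative C sketch; owner_holds_conflicting_read_lock(), nfs_lock_write(), nfs_write(), and nfs_locku() are hypothetical helpers representing the client's own lock bookkeeping and RPC layer.

   /*
    * Non-normative sketch: client reaction to NFS4ERR_LOCKED on a
    * WRITE when mandatory byte-range locking is in effect.
    */
   #include <stdbool.h>
   #include <stdint.h>
   #include <errno.h>

   /* Hypothetical helpers supplied by the client implementation. */
   extern bool owner_holds_conflicting_read_lock(void *st,
                                                 uint64_t off, uint32_t len);
   extern int  nfs_lock_write(void *st, uint64_t off, uint32_t len); /* LOCK  */
   extern int  nfs_write(void *st, uint64_t off, uint32_t len);      /* WRITE */
   extern int  nfs_locku(void *st, uint64_t off, uint32_t len);      /* LOCKU */

   static int retry_locked_write(void *st, uint64_t off, uint32_t len)
   {
       /* The owner already holds an overlapping read lock: upgrading
        * cannot safely be retried (see the discussion that follows),
        * so the error is passed back to the application. */
       if (owner_holds_conflicting_read_lock(st, off, len))
           return -EDEADLK;

       /* Otherwise: take a write byte-range lock, retry the WRITE,
        * and release the lock once the WRITE completes. */
       int err = nfs_lock_write(st, off, len);
       if (err)
           return err;
       err = nfs_write(st, off, len);
       nfs_locku(st, off, len);
       return err;
   }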
22594 If the stateid's owner had a conflicting read lock, then the client 22595 has no choice but to return an error to the application that 22596 attempted the WRITE. The reason is that since the stateid's owner 22597 had a read lock, the server either attempted to, in effect, temporarily 22598 upgrade this read lock to a write lock, or the server has 22599 no upgrade capability. If the server attempted to upgrade the read 22600 lock and failed, it is pointless for the client to re-attempt the 22601 upgrade via the LOCK operation, because there might be another client 22602 also trying to upgrade. If two clients are blocked trying to upgrade 22603 the same lock, the clients deadlock. If the server has no upgrade 22604 capability, then it is pointless to try a LOCK operation to upgrade. 22606 If one or more other clients have delegations for the file being 22607 written, those delegations MUST be recalled, and the operation cannot 22608 proceed until those delegations are returned or revoked. Except 22609 where this happens very quickly, one or more NFS4ERR_DELAY errors 22610 will be returned to requests made while the delegation remains 22611 outstanding. Normally, delegations will not be recalled as a result 22612 of a WRITE operation since the recall will occur as a result of an 22613 earlier OPEN. However, since it is possible for a WRITE to be done 22614 with a special stateid, the server needs to check for this case even 22615 though the client should have done an OPEN previously. 22617 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control 22618 18.33.1. ARGUMENT 22620 typedef opaque gsshandle4_t<>; 22622 struct gss_cb_handles4 { 22623 rpc_gss_svc_t gcbp_service; /* RFC 2203 */ 22624 gsshandle4_t gcbp_handle_from_server; 22625 gsshandle4_t gcbp_handle_from_client; 22626 }; 22628 union callback_sec_parms4 switch (uint32_t cb_secflavor) { 22629 case AUTH_NONE: 22630 void; 22631 case AUTH_SYS: 22632 authsys_parms cbsp_sys_cred; /* RFC 1831 */ 22633 case RPCSEC_GSS: 22634 gss_cb_handles4 cbsp_gss_handles; 22635 }; 22637 struct BACKCHANNEL_CTL4args { 22638 uint32_t bca_cb_program; 22639 callback_sec_parms4 bca_sec_parms<>; 22640 }; 22642 18.33.2. RESULT 22644 struct BACKCHANNEL_CTL4res { 22645 nfsstat4 bcr_status; 22646 }; 22648 18.33.3. DESCRIPTION 22650 The BACKCHANNEL_CTL operation replaces the backchannel's callback 22651 program number and adds (not replaces) RPCSEC_GSS contexts for use by 22652 the backchannel. 22654 The arguments of the BACKCHANNEL_CTL call are a subset of the 22655 CREATE_SESSION parameters. In the arguments of BACKCHANNEL_CTL, the 22656 bca_cb_program field and bca_sec_parms fields correspond respectively 22657 to the csa_cb_program and csa_sec_parms fields of the arguments of 22658 CREATE_SESSION (Section 18.36). 22660 BACKCHANNEL_CTL MUST appear in a COMPOUND that starts with SEQUENCE. 22662 If the RPCSEC_GSS handle identified by gcbp_handle_from_server does 22663 not exist on the server, the server MUST return NFS4ERR_NOENT. 22665 18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection with 22666 Session 22668 18.34.1. ARGUMENT 22670 enum channel_dir_from_client4 { 22671 CDFC4_FORE = 0x1, 22672 CDFC4_BACK = 0x2, 22673 CDFC4_FORE_OR_BOTH = 0x3, 22674 CDFC4_BACK_OR_BOTH = 0x7 22675 }; 22677 struct BIND_CONN_TO_SESSION4args { 22678 sessionid4 bctsa_sessid; 22680 channel_dir_from_client4 22681 bctsa_dir; 22683 bool bctsa_use_conn_in_rdma_mode; 22684 }; 22686 18.34.2.
RESULT 22688 enum channel_dir_from_server4 { 22689 CDFS4_FORE = 0x1, 22690 CDFS4_BACK = 0x2, 22691 CDFS4_BOTH = 0x3 22692 }; 22694 struct BIND_CONN_TO_SESSION4resok { 22695 sessionid4 bctsr_sessid; 22697 channel_dir_from_server4 22698 bctsr_dir; 22700 bool bctsr_use_conn_in_rdma_mode; 22701 }; 22703 union BIND_CONN_TO_SESSION4res 22704 switch (nfsstat4 bctsr_status) { 22706 case NFS4_OK: 22707 BIND_CONN_TO_SESSION4resok 22708 bctsr_resok4; 22710 default: void; 22711 }; 22713 18.34.3. DESCRIPTION 22715 BIND_CONN_TO_SESSION is used to associate additional connections with 22716 a session. It MUST be used on the connection being associated with 22717 the session. It MUST be the only operation in the COMPOUND 22718 procedure. If SP4_NONE (Section 18.35) state protection is used, any 22719 principal, security flavor, or RPCSEC_GSS context MAY be used to 22720 invoke the operation. If SP4_MACH_CRED is used, RPCSEC_GSS MUST be 22721 used with the integrity or privacy services, using the principal that 22722 created the client ID. If SP4_SSV is used, RPCSEC_GSS with the SSV 22723 GSS mechanism (Section 2.10.9) and integrity or privacy MUST be used. 22725 If, when the client ID was created, the client opted for SP4_NONE 22726 state protection, the client is not required to use 22727 BIND_CONN_TO_SESSION to associate the connection with the session, 22728 unless the client wishes to associate the connection with the 22729 backchannel. When SP4_NONE protection is used, simply sending a 22730 COMPOUND request with a SEQUENCE operation is sufficient to associate 22731 the connection with the session specified in SEQUENCE. 22733 The field bctsa_dir indicates whether the client wants to associate 22734 the connection with the fore channel or the backchannel or both 22735 channels. The value CDFC4_FORE_OR_BOTH indicates the client wants to 22736 associate the connection with both the fore channel and backchannel, 22737 but will accept the connection being associated to just the fore 22738 channel. The value CDFC4_BACK_OR_BOTH indicates the client wants to 22739 associate with both the fore and backchannel, but will accept the 22740 connection being associated with just the backchannel. The server 22741 replies in bctsr_dir which channel(s) the connection is associated 22742 with. If the client specified CDFC4_FORE, the server MUST return 22743 CDFS4_FORE. If the client specified CDFC4_BACK, the server MUST 22744 return CDFS4_BACK. If the client specified CDFC4_FORE_OR_BOTH, the 22745 server MUST return CDFS4_FORE or CDFS4_BOTH. If the client specified 22746 CDFC4_BACK_OR_BOTH, the server MUST return CDFS4_BACK or CDFS4_BOTH. 22748 See the CREATE_SESSION operation (Section 18.36), and the description 22749 of the argument csa_use_conn_in_rdma_mode to understand 22750 bctsa_use_conn_in_rdma_mode, and the description of 22751 csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode. 22753 Invoking BIND_CONN_TO_SESSION on a connection already associated with 22754 the specified session has no effect, and the server MUST respond with 22755 NFS4_OK, unless the client is demanding changes to the set of 22756 channels the connection is associated with. If so, the server MUST 22757 return NFS4ERR_INVAL. 22759 18.34.4. IMPLEMENTATION 22761 If a session's channel loses all connections, depending on the client 22762 ID's state protection and type of channel, the client might need to 22763 use BIND_CONN_TO_SESSION to associate a new connection. 
If the 22764 server restarted and does not keep the reply cache in stable storage, 22765 the server will not recognize the session ID. The client will 22766 ultimately have to invoke EXCHANGE_ID to create a new client ID and 22767 session. 22769 Suppose SP4_SSV state protection is being used, and 22770 BIND_CONN_TO_SESSION is among the operations included in the 22771 spo_must_enforce set when the client ID was created (Section 18.35). 22772 If so, there is an issue if SET_SSV is sent, no response is returned, 22773 and the last connection associated with the client ID drops. The 22774 client, per the sessions model, MUST retry the SET_SSV. But it needs 22775 a new connection to do so, and MUST associate that connection with 22776 the session via a BIND_CONN_TO_SESSION authenticated with the SSV GSS 22777 mechanism. The problem is that the RPCSEC_GSS message integrity 22778 codes use a subkey derived from the SSV as the key and the SSV may 22779 have changed. While there are multiple recovery strategies, a 22780 single, general strategy is described here. 22782 o The client reconnects. 22784 o The client assumes the SET_SSV was executed, and so sends 22785 BIND_CONN_TO_SESSION with the subkey (derived from the new SSV, 22786 i.e., what SET_SSV would have set the SSV to) used as the key for 22787 the RPCSEC_GSS credential message integrity codes. 22789 o If the request succeeds, this means the original attempted SET_SSV 22790 did execute successfully. The client re-sends the original 22791 SET_SSV, which the server will reply to via the reply cache. 22793 o If the server returns an RPC authentication error, this means the 22794 server's current SSV was not changed, (and the SET_SSV was likely 22795 not executed). The client then tries BIND_CONN_TO_SESSION with 22796 the subkey derived from the old SSV as the key for the RPCSEC_GSS 22797 message integrity codes. 22799 o The attempted BIND_CONN_TO_SESSION with the old SSV should 22800 succeed. If so the client re-sends the original SET_SSV. If the 22801 original SET_SSV was not executed, then the server executes it. 22802 If the original SET_SSV was executed, but failed, the server will 22803 return the SET_SSV from the reply cache. 22805 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID 22807 Exchange long hand client and server identifiers (owners), and create 22808 a client ID 22810 18.35.1. 
ARGUMENT 22811 const EXCHGID4_FLAG_SUPP_MOVED_REFER = 0x00000001; 22812 const EXCHGID4_FLAG_SUPP_MOVED_MIGR = 0x00000002; 22814 const EXCHGID4_FLAG_BIND_PRINC_STATEID = 0x00000100; 22816 const EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000; 22817 const EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000; 22818 const EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000; 22820 const EXCHGID4_FLAG_MASK_PNFS = 0x00070000; 22822 const EXCHGID4_FLAG_UPD_CONFIRMED_REC_A = 0x40000000; 22823 const EXCHGID4_FLAG_CONFIRMED_R = 0x80000000; 22825 struct state_protect_ops4 { 22826 bitmap4 spo_must_enforce; 22827 bitmap4 spo_must_allow; 22828 }; 22830 struct ssv_sp_parms4 { 22831 state_protect_ops4 ssp_ops; 22832 sec_oid4 ssp_hash_algs<>; 22833 sec_oid4 ssp_encr_algs<>; 22834 uint32_t ssp_window; 22835 uint32_t ssp_num_gss_handles; 22836 }; 22838 enum state_protect_how4 { 22839 SP4_NONE = 0, 22840 SP4_MACH_CRED = 1, 22841 SP4_SSV = 2 22842 }; 22844 union state_protect4_a switch(state_protect_how4 spa_how) { 22845 case SP4_NONE: 22846 void; 22847 case SP4_MACH_CRED: 22848 state_protect_ops4 spa_mach_ops; 22849 case SP4_SSV: 22850 ssv_sp_parms4 spa_ssv_parms; 22851 }; 22853 struct EXCHANGE_ID4args { 22854 client_owner4 eia_clientowner; 22855 uint32_t eia_flags; 22856 state_protect4_a eia_state_protect; 22857 nfs_impl_id4 eia_client_impl_id<1>; 22858 }; 22860 18.35.2. RESULT 22862 struct ssv_prot_info4 { 22863 state_protect_ops4 spi_ops; 22864 uint32_t spi_hash_alg; 22865 uint32_t spi_encr_alg; 22866 uint32_t spi_ssv_len; 22867 uint32_t spi_window; 22868 gsshandle4_t spi_handles<>; 22869 }; 22871 union state_protect4_r switch(state_protect_how4 spr_how) { 22872 case SP4_NONE: 22873 void; 22874 case SP4_MACH_CRED: 22875 state_protect_ops4 spr_mach_ops; 22876 case SP4_SSV: 22877 ssv_prot_info4 spr_ssv_info; 22878 }; 22880 struct EXCHANGE_ID4resok { 22881 clientid4 eir_clientid; 22882 sequenceid4 eir_sequenceid; 22883 uint32_t eir_flags; 22884 state_protect4_r eir_state_protect; 22885 server_owner4 eir_server_owner; 22886 opaque eir_server_scope; 22887 nfs_impl_id4 eir_server_impl_id<1>; 22888 }; 22890 union EXCHANGE_ID4res switch (nfsstat4 eir_status) { 22891 case NFS4_OK: 22892 EXCHANGE_ID4resok eir_resok4; 22894 default: 22895 void; 22896 }; 22898 18.35.3. DESCRIPTION 22900 The client uses the EXCHANGE_ID operation to register a particular 22901 client owner with the server. The client ID returned from this 22902 operation will be necessary for requests that create state on the 22903 server and will serve as a parent object to sessions created by the 22904 client. In order to confirm the client ID it must first be used, 22905 along with the returned eir_sequenceid, as arguments to 22906 CREATE_SESSION. If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the 22907 result, eir_flags, then eir_sequenceid MUST be ignored, as it has no 22908 relevancy. 22910 EXCHANGE_ID MAY be sent in a COMPOUND procedure that starts with 22911 SEQUENCE. However, when a client communicates with a server for the 22912 first time, it will not have a session, so using SEQUENCE will not be 22913 possible. If EXCHANGE_ID is sent without a preceding SEQUENCE, then 22914 it MUST be the only operation in the COMPOUND procedure's request. 22915 If is not, the server MUST return NFS4ERR_NOT_ONLY_OP. 22917 The eia_clientowner field is composed of a co_verifier field and a 22918 co_ownerid string. As noted in Section 2.4, the co_ownerid describes 22919 the client, and the co_verifier is the incarnation of the client. 
An 22920 EXCHANGE_ID sent with a new incarnation of the client will lead to 22921 the server removing lock state of the old incarnation. Whereas an 22922 EXCHANGE_ID sent with the current incarnation and co_ownerid will 22923 result in an error or an update of the client ID's properties, 22924 depending on the arguments to EXCHANGE_ID. 22926 A server MUST NOT use the same client ID for two different 22927 incarnations of an eir_clientowner. 22929 In addition to the client ID and sequence ID, the server returns a 22930 server owner (eir_server_owner) and server scope (eir_server_scope). 22931 The former field is used for network trunking as described in 22932 Section 2.10.5. The latter field is used to allow clients to 22933 determine when client IDs sent by one server may be recognized by 22934 another in the event of file system migration (see Section 11.7.7). 22936 The client ID returned by EXCHANGE_ID is only unique relative to the 22937 combination of eir_server_owner.so_major_id and eir_server_scope. 22938 Thus if two servers return the same client ID, the onus is on the 22939 client to distinguish the client IDs on the basis of 22940 eir_server_owner.so_major_id and eir_server_scope. In the event two 22941 different server's claim matching server_owner.so_major_id and 22942 eir_server_scope, the client can use the verification techniques 22943 discussed in Section 2.10.5 to determine if the servers are distinct. 22944 If they are distinct, then the client will need to note the 22945 destination network addresses of the connections used with each 22946 server, and use the network address as the final discriminator. 22948 The server, as defined by the unique identity expressed in the 22949 so_major_id of the server owner and the server scope, needs to track 22950 several properties of each client ID it hands out. The properties 22951 apply to the client ID and all sessions associated with the client 22952 ID. The properties are derived from the arguments and results of 22953 EXCHANGE_ID. The client ID properties include: 22955 o The capabilities expressed by the following bits, which come from 22956 the results of EXCHANGE_ID: 22958 * EXCHGID4_FLAG_SUPP_MOVED_REFER 22960 * EXCHGID4_FLAG_SUPP_MOVED_MIGR 22962 * EXCHGID4_FLAG_BIND_PRINC_STATEID 22964 * EXCHGID4_FLAG_USE_NON_PNFS 22966 * EXCHGID4_FLAG_USE_PNFS_MDS 22968 * EXCHGID4_FLAG_USE_PNFS_DS 22970 These properties may be updated by subsequent EXCHANGE_ID requests 22971 on confirmed client IDs though the server MAY refuse to change 22972 them. 22974 o The state protection method used, one of SP4_NONE, SP4_MACH_CRED, 22975 or SP4_SSV, as set by the spa_how field of the arguments to 22976 EXCHANGE_ID. Once the client ID is confirmed, this property 22977 cannot be updated by subsequent EXCHANGE_ID requests. 22979 o For SP4_MACH_CRED or SP4_SSV state protection: 22981 * The list of operations that MUST use the specified state 22982 protection: spo_must_enforce, which come from the results of 22983 EXCHANGE_ID. 22985 * The list of operations that MAY use the specified state 22986 protection: spo_must_allow, which come from the results of 22987 EXCHANGE_ID. 22989 Once the client ID is confirmed, these properties cannot be 22990 updated by subsequent EXCHANGE_ID requests. 22992 o For SP4_SSV protection: 22994 * The OID of the hash algorithm. This property is represented by 22995 one of the algorithms in the ssp_hash_algs field of the 22996 EXCHANGE_ID arguments. 
Once the client ID is confirmed, this 22997 property cannot be updated by subsequent EXCHANGE_ID requests. 22999 * The OID of the encryption algorithm. This property is 23000 represented by one of the algorithms in the ssp_encr_algs field 23001 of the EXCHANGE_ID arguments. Once the client ID is confirmed, 23002 this property cannot be updated by subsequent EXCHANGE_ID 23003 requests. 23005 * The length of the SSV. This property is represented by the 23006 spi_ssv_len field in the EXCHANGE_ID results. Once the client 23007 ID is confirmed, this property cannot be updated by subsequent 23008 EXCHANGE_ID requests. 23010 There are REQUIRED and RECOMMENDED relationships among the 23011 length of the key of the encryption algorithm ("key length"), 23012 the length of the output of hash algorithm ("hash length"), and 23013 the length of the SSV ("SSV length"). 23015 + key length MUST be <= hash length. This is because the keys 23016 used for the encryption algorithm are actually subkeys 23017 derived from the SSV, and the derivation is via the hash 23018 algorithm. The selection of an encryption algorithm with a 23019 key length that exceeded the length of the output of the 23020 hash algorithm would require padding, and thus weaken the 23021 use of the encryption algorithm. 23023 + hash length SHOULD be <= SSV length. This is because the 23024 SSV is a key used to derive subkeys via an HMAC, and it is 23025 recommended that the key used as input to an HMAC be at 23026 least as long as the length of the HMAC's hash algorithm's 23027 output (see Section 3 of RFC2104 [11]). 23029 + key length SHOULD be <= SSV length. This is a transitive 23030 result of the above two invariants. 23032 + key length SHOULD be >= hash length / 2. This is because 23033 the subkey derivation is via an HMAC and it is recommended 23034 that if the HMAC has to be truncated, it should not be 23035 truncated to less than half the hash length (see Section 4 23036 of RFC2104 [11]). 23038 * Number of concurrent versions of the SSV the client and server 23039 will support (Section 2.10.9). This property is represented by 23040 spi_window, in the EXCHANGE_ID results. The property may be 23041 updated by subsequent EXCHANGE_ID requests. 23043 o The client's implementation ID as represented by the 23044 eia_client_impl_id field of the arguments. The property may be 23045 updated by subsequent EXCHANGE_ID requests. 23047 o The server's implementation ID as represented by the 23048 eir_server_impl_id field of the reply. The property may be 23049 updated by replies to subsequent EXCHANGE_ID requests. 23051 The eia_flags passed as part of the arguments and the eir_flags 23052 results allow the client and server to inform each other of their 23053 capabilities as well as indicate how the client ID will be used. 23054 Whether a bit is set or cleared on the arguments' flags does not 23055 force the server to set or clear the same bit on the results' side. 23056 Bits not defined above cannot be set in the eia_flags field. If they 23057 are, the server MUST reject the operation with NFS4ERR_INVAL. 23059 The EXCHGID4_FLAG_UPD_CONFIRMED_REC_A bit can only be set in 23060 eia_flags; it is always off in eir_flags. The 23061 EXCHGID4_FLAG_CONFIRMED_R bit can only be set in eir_flags; it is 23062 always off in eia_flags. If the server recognizes the co_ownerid and 23063 co_verifier as mapping to a confirmed client ID, it sets 23064 EXCHGID4_FLAG_CONFIRMED_R in eir_flags. 
The 23065 EXCHGID4_FLAG_CONFIRMED_R flag allows a client to tell if the client 23066 ID it is trying to create already exists and is confirmed. 23068 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set in eia_flags, this means 23069 the client is attempting to update properties of an existing 23070 confirmed client ID (if the client wants to update properties of an 23071 unconfirmed client ID, it MUST NOT set 23072 EXCHGID4_FLAG_UPD_CONFIRMED_REC_A). If so, it is RECOMMENDED the 23073 client send the update EXCHANGE_ID operation in the same COMPOUND as 23074 a SEQUENCE so that the EXCHANGE_ID is executed exactly once. Whether 23075 the client can update the properties of client ID depends on the 23076 state protection it selected when the client ID was created, and the 23077 principal and security flavor it uses when sending the EXCHANGE_ID 23078 request. The situations described in Sub-Paragraph 6, Sub- 23079 Paragraph 7, Sub-Paragraph 8, or Sub-Paragraph 9, of Paragraph 6 in 23080 Section 18.35.4 will apply. Note that if the operation succeeds and 23081 returns a client ID that is already confirmed, the server MUST set 23082 the EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags. 23084 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in eia_flags, this 23085 means the client is trying to establish a new client ID; it is 23086 attempting to trunk data communication to the server 23087 (Section 2.10.5); or it is attempting to update properties of an 23088 unconfirmed client ID. The situations described in Sub-Paragraph 1, 23089 Sub-Paragraph 2, Sub-Paragraph 3, Sub-Paragraph 4, or Sub-Paragraph 5 23090 of Paragraph 6 in Section 18.35.4 will apply. Note that if the 23091 operation succeeds and returns a client ID that was previously 23092 confirmed, the server MUST set the EXCHGID4_FLAG_CONFIRMED_R bit in 23093 eir_flags. 23095 When the EXCHGID4_FLAG_SUPP_MOVED_REFER flag bit is set, the client 23096 indicates that it is capable of dealing with an NFS4ERR_MOVED error 23097 as part of a referral sequence. When this bit is not set, it is 23098 still legal for the server to perform a referral sequence. However, 23099 a server may use the fact that the client is incapable of correctly 23100 responding to a referral, by avoiding it for that particular client. 23101 It may, for instance, act as a proxy for that particular file system, 23102 at some cost in performance, although it is not obligated to do so. 23103 If the server will potentially perform a referral, it MUST set 23104 EXCHGID4_FLAG_SUPP_MOVED_REFER in eir_flags. 23106 When the EXCHGID4_FLAG_SUPP_MOVED_MIGR is set, the client indicates 23107 that it is capable of dealing with an NFS4ERR_MOVED error as part of 23108 a file system migration sequence. When this bit is not set, it is 23109 still legal for the server to indicate that a file system has moved, 23110 when this in fact happens. However, a server may use the fact that 23111 the client is incapable of correctly responding to a migration in its 23112 scheduling of file systems to migrate so as to avoid migration of 23113 file systems being actively used. It may also hide actual migrations 23114 from clients unable to deal with them by acting as a proxy for a 23115 migrated file system for particular clients, at some cost in 23116 performance, although it is not obligated to do so. If the server 23117 will potentially perform a migration, it MUST set 23118 EXCHGID4_FLAG_SUPP_MOVED_MIGR in eir_flags. 
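A non-normative sketch (in Python, for illustration only) of the eia_flags screening and of the setting of EXCHGID4_FLAG_CONFIRMED_R described above follows.  The function name screen_eia_flags and the maps_to_confirmed_client_id parameter are illustrative and not part of the protocol; the flag values are those given in the ARGUMENT section.

   EXCHGID4_FLAG_SUPP_MOVED_REFER    = 0x00000001
   EXCHGID4_FLAG_SUPP_MOVED_MIGR     = 0x00000002
   EXCHGID4_FLAG_BIND_PRINC_STATEID  = 0x00000100
   EXCHGID4_FLAG_USE_NON_PNFS        = 0x00010000
   EXCHGID4_FLAG_USE_PNFS_MDS        = 0x00020000
   EXCHGID4_FLAG_USE_PNFS_DS         = 0x00040000
   EXCHGID4_FLAG_UPD_CONFIRMED_REC_A = 0x40000000
   EXCHGID4_FLAG_CONFIRMED_R         = 0x80000000

   # All flag bits defined for EXCHANGE_ID; any other bit in eia_flags is
   # rejected with NFS4ERR_INVAL.
   DEFINED_EXCHGID4_FLAGS = (
       EXCHGID4_FLAG_SUPP_MOVED_REFER | EXCHGID4_FLAG_SUPP_MOVED_MIGR |
       EXCHGID4_FLAG_BIND_PRINC_STATEID | EXCHGID4_FLAG_USE_NON_PNFS |
       EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS |
       EXCHGID4_FLAG_UPD_CONFIRMED_REC_A | EXCHGID4_FLAG_CONFIRMED_R)

   def screen_eia_flags(eia_flags, maps_to_confirmed_client_id):
       """Reject undefined bits and compute the confirmation-related output."""
       if eia_flags & ~DEFINED_EXCHGID4_FLAGS:
           raise ValueError("NFS4ERR_INVAL")  # undefined bit set by the client
       # A conforming client leaves EXCHGID4_FLAG_CONFIRMED_R off in
       # eia_flags; only the server sets it, and only in eir_flags.
       eir_flags = 0
       if maps_to_confirmed_client_id:
           eir_flags |= EXCHGID4_FLAG_CONFIRMED_R
       # EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is never set in eir_flags.
       return eir_flags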
23120 When EXCHGID4_FLAG_BIND_PRINC_STATEID is set, the client indicates it 23121 wants the server to bind the stateid to the principal. This means 23122 that when a principal creates a stateid, it has to be the one to use 23123 the stateid. If the server will perform binding it will return 23124 EXCHGID4_FLAG_BIND_PRINC_STATEID. The server MAY return 23125 EXCHGID4_FLAG_BIND_PRINC_STATEID even if the client does not request 23126 it. If an update to the client ID changes the value of 23127 EXCHGID4_FLAG_BIND_PRINC_STATEID's client ID property, the effect 23128 applies only to new stateids. Existing stateids (and all stateids 23129 with the same "other" field) that were created with stateid to 23130 principal binding in force will continue to have binding in force. 23131 Existing stateids (and all stateids with same "other" field) that 23132 were created with stateid to principal not in force will continue to 23133 have binding not in force. 23135 The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and 23136 EXCHGID4_FLAG_USE_PNFS_DS bits are described in Section 13.1 and 23137 convey roles the client ID is to be used for in a pNFS environment. 23138 The server MUST set one of the acceptable combinations of these bits 23139 (roles) in eir_flags, as specified in Section 13.1. Note that the 23140 same client owner/server owner pair can have multiple roles. 23141 Multiple roles can be associated with the same client ID or with 23142 different client IDs. Thus, if a client sends EXCHANGE_ID from the 23143 same client owner to the same server owner multiple times, but 23144 specifies different pNFS roles each time, the server might return 23145 different client IDs. Given that different pNFS roles might have 23146 different client IDs, the client may ask for different properties for 23147 each role/client ID. 23149 The spa_how field of the eia_state_protect field specifies how the 23150 client wants to protect its client, locking and session state from 23151 unauthorized changes (Section 2.10.8.3): 23153 o SP4_NONE. The client does not request the NFSv4.1 server to 23154 enforce state protection. The NFSv4.1 server MUST NOT enforce 23155 state protection for the returned client ID. 23157 o SP4_MACH_CRED. If spa_how is SP4_MACH_CRED, then the client MUST 23158 send the EXCHANGE_ID request with RPCSEC_GSS as the security 23159 flavor, and with a service of RPC_GSS_SVC_INTEGRITY or 23160 RPC_GSS_SVC_PRIVACY. If SP4_MACH_CRED is specified, then the 23161 client wants to use an RPCSEC_GSS-based machine credential to 23162 protect its state. The server MUST note the principal the 23163 EXCHANGE_ID operation was sent with, and the GSS mechanism used. 23164 These notes collectively comprise the machine credential. 23166 After the client ID is confirmed, as long as the lease associated 23167 with the client ID is unexpired, a subsequent EXCHANGE_ID 23168 operation that uses the same eia_clientowner.co_owner as the first 23169 EXCHANGE_ID, MUST also use the same machine credential as the 23170 first EXCHANGE_ID. The server returns the same client ID for the 23171 subsequent EXCHANGE_ID as that returned from the first 23172 EXCHANGE_ID. 23174 o SP4_SSV. If spa_how is SP4_SSV, then the client MUST send the 23175 EXCHANGE_ID request with RPCSEC_GSS as the security flavor, and 23176 with a service of RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY. 23177 If SP4_SSV is specified, then the client wants to use the SSV to 23178 protect its state. 
The server records the credential used in the 23179 request as the machine credential (as defined above) for the 23180 eia_clientowner.co_owner. The CREATE_SESSION operation that 23181 confirms the client ID MUST use the same machine credential. 23183 When a client specifies SP4_MACH_CRED or SP4_SSV, it also provides 23184 two lists of operations (each expressed as a bit map). The first 23185 list is spo_must_enforce and consists of those operations the client 23186 MUST send (subject to the server confirming the list of operations in 23187 the result of EXCHANGE_ID) with the machine credential (if 23188 SP4_MACH_CRED protection is specified) or the SSV-based credential 23189 (if SP4_SSV protection is used). The client MUST send the operations 23190 with RPCSEC_GSS credentials that specify the RPC_GSS_SVC_INTEGRITY or 23191 RPC_GSS_SVC_PRIVACY security service. Typically the first list of 23192 operations includes EXCHANGE_ID, CREATE_SESSION, DELEGPURGE, 23193 DESTROY_SESSION, BIND_CONN_TO_SESSION, and DESTROY_CLIENTID. The 23194 client SHOULD NOT specify in this list any operations that require a 23195 filehandle because the server's access policies MAY conflict with the 23196 client's choice, and thus the client would then be unable to access a 23197 subset of the server's namespace. 23199 Note that if SP4_SSV protection is specified, and the client 23200 indicates that CREATE_SESSION must be protected with SP4_SSV, because 23201 the SSV cannot exist without a confirmed client ID, the first 23202 CREATE_SESSION MUST instead be sent using the machine credential, and 23203 the server MUST accept the machine credential. 23205 There is a corresponding result, also called spo_must_enforce, of the 23206 operations the server will require SP4_MACH_CRED or SP4_SSV 23207 protection for. Normally the server's result equals the client's 23208 argument, but the result MAY be different. If the client requests 23209 one or more operations in the set { EXCHANGE_ID, CREATE_SESSION, 23210 DELEGPURGE, DESTROY_SESSION, BIND_CONN_TO_SESSION, DESTROY_CLIENTID 23211 }, then the result spo_must_enforce MUST include the operations the 23212 client requested from that set. 23214 If spo_must_enforce in the results has BIND_CONN_TO_SESSION set, then 23215 connection binding enforcement is enabled, and the client MUST use 23216 the machine (if SP4_MACH_CRED protection is used) or SSV (if SP4_SSV 23217 protection is used) credential on calls to BIND_CONN_TO_SESSION. 23219 The second list is spo_must_allow and consists of those operations 23220 the client wants to have the option of issuing with the machine 23221 credential or the SSV-based credential, even if the object the 23222 operations are performed on is not owned by the machine or SSV 23223 credential. 23225 The corresponding result, also called spo_must_allow, consists of the 23226 operations the server will allow the client to use SP4_SSV or 23227 SP4_MACH_CRED credentials with. Normally the server's result equals 23228 the client's argument, but the result MAY be different. 23230 The purpose of spo_must_allow is to allow clients to solve the 23231 following conundrum. Suppose the client ID is confirmed with 23232 EXCHGID4_FLAG_BIND_PRINC_STATEID, and it calls OPEN with the 23233 RPCSEC_GSS credentials of a normal user. Now suppose the user's 23234 credentials expire, and cannot be renewed (e.g. a Kerberos ticket 23235 granting ticket expires, and the user has logged off and will not be 23236 acquiring a new ticket granting ticket). 
The client will be unable 23237 to send CLOSE without the user's credentials, which is to say the 23238 client has to either leave the state on the server, or it has to re- 23239 send EXCHANGE_ID with a new verifier to clear all state. That is, 23240 unless the client includes CLOSE on the list of operations in 23241 spo_must_allow and the server agrees. 23243 The SP4_SSV protection parameters also have: 23245 ssp_hash_algs: 23247 This is the set of algorithms the client supports for the purpose 23248 of computing the digests needed for the internal SSV GSS mechanism 23249 and for the SET_SSV operation. Each algorithm is specified as an 23250 object identifier (OID). The REQUIRED algorithms for a server are 23251 id-sha1, id-sha224, id-sha256, id-sha384, and id-sha512 [27]. The 23252 algorithm the server selects among the set is indicated in 23253 spi_hash_alg, a field of spr_ssv_prot_info. The field 23254 spi_hash_alg is an index into the array ssp_hash_algs. If the 23255 server does not support any of the offered algorithms, it returns 23256 NFS4ERR_HASH_ALG_UNSUPP. If ssp_hash_algs is empty, the server 23257 MUST return NFS4ERR_INVAL. 23259 ssp_encr_algs: 23261 This is the set of algorithms the client supports for the purpose 23262 of providing privacy protection for the internal SSV GSS 23263 mechanism. Each algorithm is specified as an OID. The REQUIRED 23264 algorithm for a server is id-aes256-CBC. The RECOMMENDED 23265 algorithms are id-aes192-CBC and id-aes128-CBC [28]. The selected 23266 algorithm is returned in spi_encr_alg, an index into 23267 ssp_encr_algs. If the server does not support any of the offered 23268 algorithms, it returns NFS4ERR_ENCR_ALG_UNSUPP. If ssp_encr_algs 23269 is empty, the server MUST return NFS4ERR_INVAL. Note that due to 23270 previously stated requirements and recommendations on the 23271 relationships between key length and hash length, some 23272 combinations of RECOMMENDED and REQUIRED encryption algorithm and 23273 hash algorithm either SHOULD NOT or MUST NOT be used. Table 12 23274 summarizes the illegal and discouraged combinations. 23276 ssp_window: 23278 This is the number of SSV versions the client wants the server to 23279 maintain (i.e. each successful call to SET_SSV produces a new 23280 version of the SSV). If ssp_window is zero, the server MUST 23281 return NFS4ERR_INVAL. The server responds with spi_window, which 23282 MUST NOT exceed ssp_window, and MUST be at least one (1). Any 23283 requests on the backchannel or fore channel that are using a 23284 version of the SSV that is outside the window will fail with an 23285 ONC RPC authentication error, and the requester will have to retry 23286 them with the same slot ID and sequence ID. 23288 ssp_num_gss_handles: 23290 This is the number of RPCSEC_GSS handles the server should create 23291 that are based on the GSS SSV mechanism (Section 2.10.9). It is 23292 not the total number of RPCSEC_GSS handles for the client ID. 23293 Indeed, subsequent calls to EXCHANGE_ID will add RPCSEC_GSS 23294 handles. The server responds with a list of handles in 23295 spi_handles. If the client asks for at least one handle and the 23296 server cannot create it, the server MUST return an error. The 23297 handles in spi_handles are not available for use until the client 23298 ID is confirmed, which could be immediately if EXCHANGE_ID returns 23299 EXCHGID4_FLAG_CONFIRMED_R, or upon successful confirmation from 23300 CREATE_SESSION. 
While a client ID can span all the connections 23301 that are connected to a server sharing the same 23302 eir_server_owner.so_major_id, the RPCSEC_GSS handles returned in 23303 spi_handles can only be used on connections connected to a server 23304 that returns the same the eir_server_owner.so_major_id and 23305 eir_server_owner.so_minor_id on each connection. It is 23306 permissible for the client to set ssp_num_gss_handles to zero (0); 23307 the client can create more handles with another EXCHANGE_ID call. 23309 The seq_window (see Section 5.2.3.1 of RFC2203 [4]) of each 23310 RPCSEC_GSS handle in spi_handle MUST be the same as the seq_window 23311 of the RPCSEC_GSS handle used for the credential of the RPC 23312 request that the EXCHANGE_ID request was sent with. 23314 +-------------------+----------------------+------------------------+ 23315 | Encryption | MUST NOT be combined | SHOULD NOT be combined | 23316 | Algorithm | with | with | 23317 +-------------------+----------------------+------------------------+ 23318 | id-aes128-CBC | | id-sha384, id-sha512 | 23319 | id-aes192-CBC | id-sha1 | id-sha512 | 23320 | id-aes256-CBC | id-sha1, id-sha224 | | 23321 +-------------------+----------------------+------------------------+ 23323 Table 12 23325 The arguments include an array of up to one element in length called 23326 eia_client_impl_id. If eia_client_impl_id is present it contains the 23327 information identifying the implementation of the client. Similarly, 23328 the results include an array of up to one element in length called 23329 eir_server_impl_id that identifies the implementation of the server. 23330 Servers MUST accept a zero length eia_client_impl_id array, and 23331 clients MUST accept a zero length eir_server_impl_id array. 23333 An example use for implementation identifiers would be diagnostic 23334 software that extract this information in an attempt to identify 23335 interoperability problems, performance workload behaviors or general 23336 usage statistics. Since the intent of having access to this 23337 information is for planning or general diagnosis only, the client and 23338 server MUST NOT interpret this implementation identity information in 23339 a way that affects interoperational behavior of the implementation. 23340 The reason is that if clients and servers did such a thing, they 23341 might use fewer capabilities of the protocol than the peer can 23342 support, or the client and server might refuse to interoperate. 23344 Because it is possible some implementations will violate the protocol 23345 specification and interpret the identity information, implementations 23346 MUST allow the users of the NFSv4 client and server to set the 23347 contents of the sent nfs_impl_id structure to any value. 23349 18.35.4. IMPLEMENTATION 23351 A server's client record is a 5-tuple: 23353 1. co_ownerid 23355 The client identifier string, from the eia_clientowner 23356 structure of the EXCHANGE_ID4args structure 23358 2. co_verifier: 23360 A client-specific value used to indicate incarnations (where a 23361 client restart represents a new incarnation), from the 23362 eia_clientowner structure of the EXCHANGE_ID4args structure 23364 3. principal: 23366 The principal that was defined in the RPC header's credential 23367 and/or verifier at the time the client record was established. 23369 4. client ID: 23371 The shorthand client identifier, generated by the server and 23372 returned via the eir_clientid field in the EXCHANGE_ID4resok 23373 structure 23375 5. 
confirmed:

      A private field on the server indicating whether or not a client
      record has been confirmed.  A client record is confirmed if there
      has been a successful CREATE_SESSION operation to confirm it.
      Otherwise it is unconfirmed.  An unconfirmed record is
      established by an EXCHANGE_ID call.  Any unconfirmed record that
      is not confirmed within a lease period SHOULD be removed.

The following identifiers represent special values for the fields in the records.

   ownerid_arg:  The value of the eia_clientowner.co_ownerid subfield
      of the EXCHANGE_ID4args structure of the current request.

   verifier_arg:  The value of the eia_clientowner.co_verifier subfield
      of the EXCHANGE_ID4args structure of the current request.

   old_verifier_arg:  A value of the eia_clientowner.co_verifier field
      of a client record received in a previous request; this is
      distinct from verifier_arg.

   principal_arg:  The value of the RPCSEC_GSS principal for the
      current request.

   old_principal_arg:  A value of the principal of a client record as
      defined by the RPC header's credential or verifier of a previous
      request.  This is distinct from principal_arg.

   clientid_ret:  The value of the eir_clientid field the server will
      return in the EXCHANGE_ID4resok structure for the current
      request.

   old_clientid_ret:  The value of the eir_clientid field the server
      returned in the EXCHANGE_ID4resok structure for a previous
      request.  This is distinct from clientid_ret.

   confirmed:  The client ID has been confirmed.

   unconfirmed:  The client ID has not been confirmed.

Since EXCHANGE_ID is a non-idempotent operation, we must consider the possibility that retries occur as a result of a client restart, network partition, malfunctioning router, etc.  Retries are identified by the value of the eia_clientowner field of EXCHANGE_ID4args, and the method for dealing with them is outlined in the scenarios below.

The scenarios are described in terms of the client record(s) a server has for a given co_ownerid.  Note that if the client ID was created specifying SP4_SSV state protection and EXCHANGE_ID as one of the operations in spo_must_allow, then the server MUST authorize EXCHANGE_IDs with the SSV principal in addition to the principal that created the client ID.

   1.  New Owner ID

       If the server has no client records with
       eia_clientowner.co_ownerid matching ownerid_arg, and
       EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in the EXCHANGE_ID,
       then a new shorthand client ID (let us call it clientid_ret) is
       generated, and the following unconfirmed record is added to the
       server's state.

          { ownerid_arg, verifier_arg, principal_arg, clientid_ret,
            unconfirmed }

       Subsequently, the server returns clientid_ret.

   2.  Non-Update on Existing Client ID

       If the server has the following confirmed record, and the
       request does not have EXCHGID4_FLAG_UPD_CONFIRMED_REC_A set,
       then the request is the result of a retried request due to a
       faulty router or lost connection, or the client is trying to
       determine if it can perform trunking.
23469 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 23470 confirmed } 23472 Since the record has been confirmed, the client must have 23473 received the server's reply from the initial EXCHANGE_ID 23474 request. Since the server has a confirmed record, and since 23475 EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, with the 23476 possible exception of eir_server_owner.so_minor_id, the server 23477 returns the same result it did when the client ID's properties 23478 were last updated (or if never updated, the result when the 23479 client ID was created). The confirmed record is unchanged. 23481 3. Client Collision 23483 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the 23484 server has the following confirmed record, then this request 23485 is likely the result of a chance collision between the values 23486 of the eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args 23487 for two different clients. 23489 { ownerid_arg, *, old_principal_arg, old_clientid_ret, 23490 confirmed } 23492 If there is currently no state associated with 23493 old_clientid_ret, or if there is state but the lease has 23494 expired, then this case is effectively equivalent to the New 23495 Owner ID case of Paragraph 1. The confirmed record is 23496 deleted, the old_clientid_ret and its lock state are deleted, 23497 a new shorthand client ID is generated, and the following 23498 unconfirmed record is added to the server's state. 23500 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 23501 unconfirmed } 23503 Subsequently, the server returns clientid_ret. 23505 If old_clientid_ret has an unexpired lease with state, then no 23506 state of old_clientid_ret is changed or deleted. The server 23507 returns NFS4ERR_CLID_INUSE to indicate the client should retry 23508 with a different value for the eia_clientowner.co_ownerid 23509 subfield of EXCHANGE_ID4args. The client record is not 23510 changed. 23512 4. Replacement of Unconfirmed Record 23514 If the EXCHGID4_FLAG_UPD_CONFIRMED_REC_A flag is not set, and 23515 the server has the following unconfirmed record then the 23516 client is attempting EXCHANGE_ID again on an unconfirmed 23517 client ID, perhaps due to a retry, or perhaps due to a client 23518 restart before client ID confirmation (i.e. before 23519 CREATE_SESSION was called), or some other reason. 23521 { ownerid_arg, *, *, old_clientid_ret, unconfirmed } 23523 It is possible the properties of old_clientid_ret are 23524 different than those specified in the current EXCHANGE_ID. 23525 Whether the properties are being updated or not, to eliminate 23526 ambiguity, the server deletes the unconfirmed record, 23527 generates a new client ID (clientid_ret) and establishes the 23528 following unconfirmed record: 23530 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 23531 unconfirmed } 23533 5. Client Restart 23535 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the 23536 server has the following confirmed client record, then this 23537 request is likely from a previously confirmed client which has 23538 restarted. 23540 { ownerid_arg, old_verifier_arg, principal_arg, 23541 old_clientid_ret, confirmed } 23543 Since the previous incarnation of the same client will no 23544 longer be making requests, once the new client ID is confirmed 23545 by CREATE_SESSION, lock and share reservations should be 23546 released immediately rather than forcing the new incarnation 23547 to wait for the lease time on the previous incarnation to 23548 expire. 
Furthermore, session state should be removed since, if the client had maintained that information across restart, this request would not have been sent.  If the server does not support the CLAIM_DELEGATE_PREV claim type, associated delegations should be purged as well; otherwise, delegations are retained and recovery proceeds according to Section 10.2.1.

       After processing, clientid_ret is returned to the client and
       this client record is added:

          { ownerid_arg, verifier_arg, principal_arg, clientid_ret,
            unconfirmed }

       The previously described confirmed record continues to exist,
       and thus the same ownerid_arg exists in both a confirmed and
       unconfirmed state at the same time.  The number of states can
       collapse to one once the server receives an applicable
       CREATE_SESSION or EXCHANGE_ID.

       +  If the server subsequently receives a successful
          CREATE_SESSION that confirms clientid_ret, then the server
          atomically destroys the confirmed record and makes the
          unconfirmed record confirmed as described in Section 18.36.4.

       +  If the server instead subsequently receives an EXCHANGE_ID
          with the client owner equal to ownerid_arg, one strategy is
          to simply delete the unconfirmed record, and process the
          EXCHANGE_ID as described in the entirety of Section 18.35.4.

   6.  Update

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has
       the following confirmed record, then this request is an attempt
       at an update.

          { ownerid_arg, verifier_arg, principal_arg, clientid_ret,
            confirmed }

       Since the record has been confirmed, the client must have
       received the server's reply from the initial EXCHANGE_ID
       request.  The server allows the update, and the client record is
       left intact.

   7.  Update but No Confirmed Record

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has
       no confirmed record corresponding to ownerid_arg, then the
       server returns NFS4ERR_NOENT and leaves any unconfirmed record
       intact.

   8.  Update but Wrong Verifier

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has
       the following confirmed record, then this request is an illegal
       attempt at an update, perhaps because of a retry from a previous
       client incarnation.

          { ownerid_arg, old_verifier_arg, *, clientid_ret, confirmed }

       The server returns NFS4ERR_NOT_SAME and leaves the client record
       intact.

   9.  Update but Wrong Principal

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has
       the following confirmed record, then this request is an illegal
       attempt at an update by an unauthorized principal.

          { ownerid_arg, verifier_arg, old_principal_arg, clientid_ret,
            confirmed }

       The server returns NFS4ERR_PERM and leaves the client record
       intact.

18.36.  Operation 43: CREATE_SESSION - Create New Session and Confirm Client ID

18.36.1.
ARGUMENT

   struct channel_attrs4 {
           count4                  ca_headerpadsize;
           count4                  ca_maxrequestsize;
           count4                  ca_maxresponsesize;
           count4                  ca_maxresponsesize_cached;
           count4                  ca_maxoperations;
           count4                  ca_maxrequests;
           uint32_t                ca_rdma_ird<1>;
   };

   const CREATE_SESSION4_FLAG_PERSIST        = 0x00000001;
   const CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002;
   const CREATE_SESSION4_FLAG_CONN_RDMA      = 0x00000004;

   struct CREATE_SESSION4args {
           clientid4               csa_clientid;
           sequenceid4             csa_sequence;

           uint32_t                csa_flags;

           channel_attrs4          csa_fore_chan_attrs;
           channel_attrs4          csa_back_chan_attrs;

           uint32_t                csa_cb_program;
           callback_sec_parms4     csa_sec_parms<>;
   };

18.36.2.  RESULT

   struct CREATE_SESSION4resok {
           sessionid4              csr_sessionid;
           sequenceid4             csr_sequence;

           uint32_t                csr_flags;

           channel_attrs4          csr_fore_chan_attrs;
           channel_attrs4          csr_back_chan_attrs;
   };

   union CREATE_SESSION4res switch (nfsstat4 csr_status) {
           case NFS4_OK:
                   CREATE_SESSION4resok csr_resok4;
           default:
                   void;
   };

18.36.3.  DESCRIPTION

This operation is used by the client to create new session objects on the server.

CREATE_SESSION can be sent with or without a preceding SEQUENCE operation in the same COMPOUND procedure.  If CREATE_SESSION is sent with a preceding SEQUENCE operation, any session created by CREATE_SESSION has no direct relation to the session specified in the SEQUENCE operation, although the two sessions might be associated with the same client ID.  If CREATE_SESSION is sent without a preceding SEQUENCE, then it MUST be the only operation in the COMPOUND procedure's request.  If it is not, the server MUST return NFS4ERR_NOT_ONLY_OP.

In addition to creating a session, CREATE_SESSION has the following effects:

   o  The first session created with a new client ID serves to confirm
      the creation of that client's state on the server.  The server
      returns the parameter values for the new session.

   o  The connection CREATE_SESSION is sent over is associated with the
      session's fore channel.

The arguments and results of CREATE_SESSION are described as follows:

   csa_clientid:

      This is the client ID the new session will be associated with.
      The corresponding result is csr_sessionid, the session ID of the
      new session.

   csa_sequence:

      Each client ID serializes CREATE_SESSION via a per client ID
      sequence number (see Section 18.36.4).  The corresponding result
      is csr_sequence, which MUST be equal to csa_sequence.

In the next three arguments, the client offers a value that is to be a property of the session.  It is RECOMMENDED that the server accept the value.  If it is not acceptable, the server MAY use a different value.  Regardless, the server MUST return the value the session will use (which will be either what the client offered, or what the server is insisting on) to the client.  These parameters have the following interpretation.

   csa_flags:

      The csa_flags field contains a list of the following flag bits:

      CREATE_SESSION4_FLAG_PERSIST:

         If CREATE_SESSION4_FLAG_PERSIST is set, the client wants the
         server to provide a persistent reply cache.  For sessions in
         which only idempotent operations will be used (e.g.
a read-only session), clients SHOULD NOT set CREATE_SESSION4_FLAG_PERSIST.  If the server does not or cannot provide a persistent reply cache, the server MUST NOT set CREATE_SESSION4_FLAG_PERSIST in the field csr_flags.

         If the server is a pNFS metadata server, for reasons described
         in Section 12.5.2 it SHOULD support
         CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint
         (Section 5.12.4) attribute.

      CREATE_SESSION4_FLAG_CONN_BACK_CHAN:

         If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is set in csa_flags,
         the client is requesting that the server use the connection
         CREATE_SESSION is called over for the backchannel as well as
         the fore channel.  The server sets
         CREATE_SESSION4_FLAG_CONN_BACK_CHAN in the result field
         csr_flags if it agrees.  If CREATE_SESSION4_FLAG_CONN_BACK_CHAN
         is not set in csa_flags, then
         CREATE_SESSION4_FLAG_CONN_BACK_CHAN MUST NOT be set in
         csr_flags.

      CREATE_SESSION4_FLAG_CONN_RDMA:

         If CREATE_SESSION4_FLAG_CONN_RDMA is set in csa_flags, and if
         the connection CREATE_SESSION is called over is currently in
         non-RDMA mode, but has the capability to operate in RDMA mode,
         then the client is requesting that the server agree to "step
         up" to RDMA mode on the connection.  The server sets
         CREATE_SESSION4_FLAG_CONN_RDMA in the result field csr_flags
         if it agrees.  If CREATE_SESSION4_FLAG_CONN_RDMA is not set in
         csa_flags, then CREATE_SESSION4_FLAG_CONN_RDMA MUST NOT be set
         in csr_flags.  Note that once the server agrees to step up, it
         and the client MUST exchange all future traffic on the
         connection with RPC RDMA framing and not Record Marking ([8]).

   csa_fore_chan_attrs, csa_back_chan_attrs:

      The csa_fore_chan_attrs and csa_back_chan_attrs fields apply to
      attributes of the fore channel (which conveys requests
      originating from the client to the server), and the backchannel
      (the channel that conveys callback requests originating from the
      server to the client), respectively.  The results are in
      corresponding structures called csr_fore_chan_attrs and
      csr_back_chan_attrs.  The results establish attributes for each
      channel, and on all subsequent use of each channel of the
      session.  Each structure has the following fields:

      ca_headerpadsize:

         The maximum amount of padding the requester is willing to
         apply to ensure that write payloads are aligned on some
         boundary at the replier.  The replier should reply in
         ca_headerpadsize with its preferred value, or zero if padding
         is not in use.  The replier may decrease this value but MUST
         NOT increase it.

      ca_maxrequestsize:

         The maximum size of a COMPOUND or CB_COMPOUND request that
         will be sent.  This size represents the XDR encoded size of
         the request, including the RPC headers (including security
         flavor credentials and verifiers) but excludes any RPC
         transport framing headers.  Imagine a request coming over a
         non-RDMA TCP/IP connection, and that it has a single Record
         Marking header preceding it.  The maximum allowable count
         encoded in the header will be ca_maxrequestsize.  If a
         requester sends a request that exceeds ca_maxrequestsize, the
         error NFS4ERR_REQ_TOO_BIG will be returned per the description
         in Section 2.10.6.4.
23802 ca_maxresponsesize: 23804 The maximum size of a COMPOUND or CB_COMPOUND reply that the 23805 requester will accept from the replier including RPC headers 23806 (see the ca_maxrequestsize definition). The NFSv4.1 server 23807 MUST NOT increase the value of this parameter in the 23808 CREATE_SESSION results. However, if the client selects a value 23809 for ca_maxresponsesize such that a replier on a channel could 23810 never send a response, the server SHOULD return 23811 NFS4ERR_TOOSMALL in the CREATE_SESSION reply. If a requester 23812 sends a request for which the size of the reply would exceed 23813 this value, the replier will return NFS4ERR_REP_TOO_BIG, per 23814 the description in Section 2.10.6.4. 23816 ca_maxresponsesize_cached: 23818 Like ca_maxresponsesize, but the maximum size of a reply that 23819 will be stored in the reply cache (Section 2.10.6.1). If the 23820 reply to CREATE_SESSION has ca_maxresponsesize_cached less than 23821 ca_maxresponsesize, then this is an indication to the requester 23822 on the channel that it needs to be selective about which 23823 replies it directs the replier to cache; for example large 23824 replies from nonidempotent operations (e.g. COMPOUND requests 23825 with a READ operation), should not be cached. The requester 23826 decides which replies to cache via an argument to the SEQUENCE 23827 (the sa_cachethis field, see Section 18.46) or CB_SEQUENCE (the 23828 csa_cachethis field, see Section 20.9) operations. If a 23829 requester sends a request for which the size of the reply would 23830 exceed this value, the replier will return 23831 NFS4ERR_REP_TOO_BIG_TO_CACHE, per the description in 23832 Section 2.10.6.4. 23834 ca_maxoperations: 23836 The maximum number of operations the replier will accept in a 23837 COMPOUND or CB_COMPOUND. The server MUST NOT increase 23838 ca_maxoperations in the reply to CREATE_SESSION. If the 23839 requester sends a COMPOUND or CB_COMPOUND with more operations 23840 than ca_maxoperations, the replier MUST return 23841 NFS4ERR_TOO_MANY_OPS. 23843 ca_maxrequests: 23845 The maximum number of concurrent COMPOUND or CB_COMPOUND 23846 requests the requester will send on the session. Subsequent 23847 requests will each be assigned a slot identifier by the 23848 requester within the range 0 to ca_maxrequests - 1 inclusive. 23850 ca_rdma_ird: 23852 This array has a maximum of one element. If this array has one 23853 element, then the element contains the inbound RDMA read queue 23854 depth (IRD). 23856 csa_cb_program 23858 This is the ONC RPC program number the server MUST use in any 23859 callbacks sent through the backchannel to the client. The server 23860 MUST specify an ONC RPC program number equal to csa_cb_program and 23861 an ONC RPC version number equal to 4 in callbacks sent to the 23862 client. If a CB_COMPOUND is sent to the client, the server MUST 23863 use a minor version number of 1. There is no corresponding 23864 result. 23866 csa_sec_parms 23868 The field csa_sec_parms is an array of acceptable security 23869 credentials the server can use on the session's backchannel. 23870 Three security flavors are supported: AUTH_NONE, AUTH_SYS, and 23871 RPCSEC_GSS. If AUTH_NONE is specified for a credential, then this 23872 says the client is authorizing the server to use AUTH_NONE on all 23873 callbacks for the session. If AUTH_SYS is specified, then the 23874 client is authorizing the server to use AUTH_SYS on all callbacks, 23875 using the credential specified cbsp_sys_cred. 
If RPCSEC_GSS is 23876 specified, then the server is allowed to use the RPCSEC_GSS 23877 context specified in cbsp_gss_parms as the RPCSEC_GSS context in 23878 the credential of the RPC header of callbacks to the client. 23879 There is no corresponding result. 23881 The RPCSEC_GSS context for the backchannel is specified via a pair 23882 of values of data type gsshandle4_t. The data type gsshandle4_t 23883 represents an RPCSEC_GSS handle, and is precisely the same as the 23884 data type of the "handle" field of the rpc_gss_init_res data type 23885 defined in Section 5.2.3.1, "Context Creation Response - 23886 Successful Acceptance" of [4]. 23888 The first RPCSEC_GSS handle, gcbp_handle_from_server, is the fore 23889 handle the server returned to the client (either in the handle 23890 field of data type rpc_gss_init_res or one of the elements of the 23891 spi_handles field returned in the reply to EXCHANGE_ID) when the 23892 RPCSEC_GSS context was created on the server. The second handle, 23893 gcbp_handle_from_client, is the back handle the client will map 23894 the RPCSEC_GSS context to. The server can immediately use the 23895 value of gcbp_handle_from_client in the RPCSEC_GSS credential in 23896 callback RPCs. I.e., the value in gcbp_handle_from_client can be 23897 used as the value of the field "handle" in data type 23898 rpc_gss_cred_t (see Section 5, "Elements of the RPCSEC_GSS 23899 Security Protocol" of [4]) in callback RPCs. The server MUST use 23900 the RPCSEC_GSS security service specified in gcbp_service, i.e. it 23901 MUST set the "service" field of the rpc_gss_cred_t data type in 23902 RPCSEC_GSS credential to the value of gcbp_service (see Section 23903 5.3.1, "RPC Request Header", of [4]). 23905 If the RPCSEC_GSS handle identified by gcbp_handle_from_server 23906 does not exist on the server, the server will return 23907 NFS4ERR_NOENT. 23909 Within each element of csa_sec_parms, the fore and back RPCSEC_GSS 23910 contexts MUST share the same GSS context and MUST have the same 23911 seq_window (see Section 5.2.3.1 of RFC2203 [4]). The fore and 23912 back RPCSEC_GSS context state are independent of each other as far 23913 as the RPCSEC_GSS sequence number (see the seq_num field in the 23914 rpc_gss_cred_t data type of Section 5 and of Section 5.3.1, "RPC 23915 Request Header", of RFC2203). 23917 Once the session is created, the first SEQUENCE or CB_SEQUENCE 23918 received on a slot MUST have a sequence ID equal to 1; if not the 23919 server MUST return NFS4ERR_SEQ_MISORDERED. 23921 18.36.4. IMPLEMENTATION 23923 To describe a possible implementation, the same notation for client 23924 records introduced in the description of EXCHANGE_ID is used with the 23925 following addition: 23927 clientid_arg: The value of the csa_clientid field of the 23928 CREATE_SESSION4args structure of the current request. 23930 Since CREATE_SESSION is a non-idempotent operation, we need to 23931 consider the possibility that retries may occur as a result of a 23932 client restart, network partition, malfunctioning router, etc. For 23933 each client ID created by EXCHANGE_ID, the server maintains a 23934 separate reply cache (called the CREATE_SESSION reply cache) similar 23935 to the session reply cache used for SEQUENCE operations, with two 23936 distinctions. 23938 o First this is a reply cache just for detecting and processing 23939 CREATE_SESSION requests for a given client ID. 
23941 o Second, the size of the client ID reply cache is of one slot (and 23942 as a result, the CREATE_SESSION request does not carry a slot 23943 number). This means that at most one CREATE_SESSION request for a 23944 given client ID can be outstanding. 23946 As previously stated, CREATE_SESSION can be sent with or without a 23947 preceding SEQUENCE operation. Even if SEQUENCE precedes 23948 CREATE_SESSION, the server MUST maintain the CREATE_SESSION reply 23949 cache, which is separate from the reply cache for the session 23950 associated with SEQUENCE. If CREATE_SESSION was originally sent by 23951 itself, the client MAY send a retry of the CREATE_SESSION operation 23952 within a COMPOUND preceded by SEQUENCE. If CREATE_SESSION was 23953 originally sent in a COMPOUND that started with SEQUENCE, then the 23954 client SHOULD send a retry in a COMPOUND that starts with SEQUENCE 23955 that has the same session ID as the SEQUENCE of the original request. 23956 However, the client MAY send a retry in a COMPOUND that either has no 23957 preceding SEQUENCE, or has a preceding SEQUENCE that refers to a 23958 different session than the original CREATE_SESSION. This might be 23959 necessary if the client sends a CREATE_SESSION in a COMPOUND preceded 23960 by a SEQUENCE with session ID X, and session X no longer exists. 23961 Regardless, any retry of CREATE_SESSION, with or without a preceding 23962 SEQUENCE, MUST use the same value of csa_sequence as the original. 23964 When a client sends a successful EXCHANGE_ID and it is returned an 23965 unconfirmed client ID, the client is also returned eir_sequenceid, 23966 and the client is expected to set the value of csa_sequenceid in the 23967 client ID-confirming-CREATE_SESSION it sends with that client ID to 23968 the value of eir_sequenceid. When EXCHANGE_ID returns a new, 23969 unconfirmed client ID, the server initializes the client ID slot to 23970 be equal to eir_sequenceid - 1 (accounting for underflow), and 23971 records a contrived CREATE_SESSION result with a "cached" result of 23972 NFS4ERR_SEQ_MISORDERED. With the slot thus initialized, the 23973 processing of the CREATE_SESSION operation is divided into four 23974 phases: 23976 1. Client record lookup. The server looks up the client ID in its 23977 client record table. If the server contains no records with 23978 client ID equal to clientid_arg, then most likely the client's 23979 state has been purged during a period of inactivity, possibly due 23980 to a loss of connectivity. NFS4ERR_STALE_CLIENTID is returned, 23981 and no changes are made to any client records on the server. 23982 Otherwise, the server goes to phase 2. 23984 2. Sequence ID processing. If csa_sequenceid is equal to the 23985 sequence ID in the client ID's slot, then this is a replay of the 23986 previous CREATE_SESSION request, and the server returns the 23987 cached result. If csa_sequenceid is not equal to the sequence ID 23988 in the slot, and is more than one greater (accounting for 23989 wraparound), then the server returns the error 23990 NFS4ERR_SEQ_MISORDERED, and does not change the slot. If 23991 csa_sequenceid is equal to the slot's sequence ID + 1 (accounting 23992 for wraparound), then the slot's sequence ID is set to 23993 csa_sequenceid, and the CREATE_SESSION processing goes to the 23994 next phase. A subsequent new CREATE_SESSION call over the same 23995 client ID MUST use a csa_sequenceid that is one greater than the 23996 sequence ID in the slot. 23998 3. Client ID confirmation. 
If this would be the first session for 23999 the client ID, the CREATE_SESSION operation serves to confirm the 24000 client ID. Otherwise the client ID confirmation phase is skipped 24001 and only the session creation phase occurs. Any case in which 24002 there is more than one record with identical values for client ID 24003 represents a server implementation error. Operation in the 24004 potential valid cases is summarized as follows. 24006 * Successful Confirmation 24008 If the server has the following unconfirmed record, then 24009 this is the expected confirmation of an unconfirmed record. 24011 { ownerid, verifier, principal_arg, clientid_arg, 24012 unconfirmed } 24014 As noted in Section 18.35.4, the server might also have the 24015 following confirmed record. 24017 { ownerid, old_verifier, principal_arg, old_clientid, 24018 confirmed } 24020 The server schedules the replacement of both records with: 24022 { ownerid, verifier, principal_arg, clientid_arg, confirmed 24023 } 24025 The processing of CREATE_SESSION continues on to session 24026 creation. Once the session is successfully created, the 24027 scheduled client record replacement is committed. If the 24028 session is not successfully created, then no changes are 24029 made to any client records on the server. 24031 * Unsuccessful Confirmation 24033 If the server has the following record, then the client has 24034 changed principals after the previous EXCHANGE_ID request, 24035 or there has been a chance collision between shorthand 24036 client identifiers. 24038 { *, *, old_principal_arg, clientid_arg, * } 24040 Neither of these cases are permissible. Processing stops 24041 and NFS4ERR_CLID_INUSE is returned to the client. No 24042 changes are made to any client records on the server. 24044 4. Session creation. The server confirmed the client ID, either in 24045 this CREATE_SESSION operation, or a previous CREATE_SESSION 24046 operation. The server examines the remaining fields of the 24047 arguments. 24049 5. The server creates the session by recording the parameter values 24050 used (including whether the CREATE_SESSION4_FLAG_PERSIST flag is 24051 set and has been accepted by the server) and allocating space for 24052 the session reply cache (if there is not enough space, the server 24053 returns NFS4ERR_NOSPC). For each slot in the reply cache, the 24054 server sets the sequence ID to zero (0), and records an entry 24055 containing a COMPOUND reply with zero operations and the error 24056 NFS4ERR_SEQ_MISORDERED. This way, if the first SEQUENCE request 24057 sent has a sequence ID equal to zero, the server can simply 24058 return what is in the reply cache: NFS4ERR_SEQ_MISORDERED. The 24059 client initializes its reply cache for receiving callbacks in the 24060 same way, and similarly, the first CB_SEQUENCE operation on a 24061 slot after session creation MUST have a sequence ID of one. 24063 6. If the session state is created successfully, the server 24064 associates the session with the client ID provided by the client. 24066 7. When a request that had CREATE_SESSION4_FLAG_CONN_RDMA set needs 24067 to be retried, the retry MUST be done on a new connection that is 24068 in non-RDMA mode. If properties of the new connection are 24069 different enough that the arguments to CREATE_SESSION need to 24070 change, then a non-retry MUST be sent. The server will 24071 eventually dispose of any session that was created on the 24072 original connection. 
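The per-client-ID slot behavior of phases 1 and 2 above can be illustrated by the following non-normative Python sketch.  The class and parameter names (CreateSessionSlot, build_reply) are illustrative only; build_reply stands in for the remaining phases (client ID confirmation and session creation).

   SEQ_MOD = 2**32     # sequence IDs are 32-bit values that wrap around

   class CreateSessionSlot(object):
       """One-entry CREATE_SESSION reply cache for a single client ID."""

       def __init__(self, eir_sequenceid):
           # Initialized to eir_sequenceid - 1 (accounting for underflow)
           # with a contrived cached result of NFS4ERR_SEQ_MISORDERED.
           self.seqid = (eir_sequenceid - 1) % SEQ_MOD
           self.cached_reply = "NFS4ERR_SEQ_MISORDERED"

       def process(self, csa_sequence, build_reply):
           if csa_sequence == self.seqid:
               # Replay of the previous CREATE_SESSION request: return the
               # cached result.
               return self.cached_reply
           if csa_sequence != (self.seqid + 1) % SEQ_MOD:
               # Neither a replay nor the next expected value: misordered,
               # and the slot is left unchanged.
               return "NFS4ERR_SEQ_MISORDERED"
           # Next expected value: advance the slot, run the remaining
           # phases, and cache whatever result they produce.
           self.seqid = csa_sequence
           self.cached_reply = build_reply()
           return self.cached_reply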
On the backchannel, the client and server might wish to have many slots, in some cases perhaps more than in the fore channel, in order to deal with situations where the network link has high latency and is the primary bottleneck for response to recalls.  If so, and if the client provides too few slots to the backchannel, the server might limit the number of recallable objects it gives to the client.

Implementing RPCSEC_GSS callback support requires that the client and server change their RPCSEC_GSS implementations.  One possible set of changes includes:

o  Adding a data structure that wraps the GSS-API context with a reference count.

o  New functions to increment and decrement the reference count.  If the reference count is decremented to zero, the wrapper data structure and the GSS-API context it refers to would be freed.

o  Changing RPCSEC_GSS to create the wrapper data structure upon receiving a GSS-API context from gss_accept_sec_context() and gss_init_sec_context().  The reference count would be initialized to 1.

o  Adding a function to map an existing RPCSEC_GSS handle to a pointer to the wrapper data structure.  The reference count would be incremented.

o  Adding a function to create a new RPCSEC_GSS handle from a pointer to the wrapper data structure.  The reference count would be incremented.

o  Replacing calls from RPCSEC_GSS that free GSS-API contexts with calls to decrement the reference count on the wrapper data structure.

18.37.  Operation 44: DESTROY_SESSION - Destroy a Session

18.37.1.  ARGUMENT

   struct DESTROY_SESSION4args {
           sessionid4      dsa_sessionid;
   };

18.37.2.  RESULT

   struct DESTROY_SESSION4res {
           nfsstat4        dsr_status;
   };

18.37.3.  DESCRIPTION

The DESTROY_SESSION operation closes the session and discards the session's reply cache, if any.  Any remaining connections associated with the session are immediately disassociated.  If the connection has no remaining associated sessions, the connection MAY be closed by the server.  Locks, delegations, layouts, wants, and the lease, which are all tied to the client ID, are not affected by DESTROY_SESSION.

DESTROY_SESSION MUST be invoked on a connection that is associated with the session being destroyed.  In addition, if SP4_MACH_CRED state protection was specified when the client ID was created, the RPCSEC_GSS principal that created the session MUST be the one that destroys the session, using RPCSEC_GSS privacy or integrity.  If SP4_SSV state protection was specified when the client ID was created, RPCSEC_GSS using the SSV mechanism (Section 2.10.9) MUST be used, with integrity or privacy.

If the COMPOUND request starts with SEQUENCE, and if the session IDs specified in SEQUENCE and DESTROY_SESSION are the same, then

o  DESTROY_SESSION MUST be the final operation in the COMPOUND request.

o  It is advisable not to place DESTROY_SESSION in a COMPOUND request with other state-modifying operations, because the DESTROY_SESSION will destroy the reply cache.

DESTROY_SESSION MAY be the only operation in a COMPOUND request.

Because the session is destroyed, a client that retries the request may receive an error in reply to the retry, even though the original request was successful.
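Returning to the requirement that DESTROY_SESSION be the final operation when the COMPOUND is led by a SEQUENCE naming the same session, a minimal, non-normative check is sketched below. The decoded-COMPOUND representation and the helper name are assumptions of this example; the operation code values are those given in the corresponding section headings of this document, and sessionid4 is a 16-octet opaque value.

   #include <stdbool.h>
   #include <stddef.h>
   #include <string.h>

   #define NFS4_SESSIONID_SIZE 16

   /* Simplified view of a decoded COMPOUND, for this example only. */
   struct op {
       int           opcode;   /* e.g., OP_SEQUENCE, OP_DESTROY_SESSION */
       unsigned char sessionid[NFS4_SESSIONID_SIZE];
   };

   enum { OP_DESTROY_SESSION = 44, OP_SEQUENCE = 53 };

   /* Returns true if the placement rule is satisfied: a DESTROY_SESSION
    * naming the same session as the leading SEQUENCE may only appear as
    * the final operation of the COMPOUND. */
   static bool destroy_session_placement_ok(const struct op *ops, size_t nops)
   {
       if (nops == 0 || ops[0].opcode != OP_SEQUENCE)
           return true;   /* rule only applies to SEQUENCE-led COMPOUNDs */
       for (size_t i = 1; i < nops; i++) {
           if (ops[i].opcode == OP_DESTROY_SESSION &&
               memcmp(ops[i].sessionid, ops[0].sessionid,
                      NFS4_SESSIONID_SIZE) == 0 &&
               i != nops - 1)
               return false;
       }
       return true;
   }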
24157 If there is a backchannel on the session and the server has 24158 outstanding CB_COMPOUND operations for the session which have not 24159 been replied to, then the server MAY refuse to destroy the session 24160 and return an error. In the event the backchannel is down, the 24161 server SHOULD return NFS4ERR_CB_PATH_DOWN to inform the client that 24162 the backchannel needs to repaired before the server will allow the 24163 session to be destroyed. Otherwise, the error CB_BACK_CHAN_BUSY 24164 SHOULD be returned to indicate that there are CB_COMPOUNDs that need 24165 to be replied to. The client SHOULD reply to all outstanding 24166 CB_COMPOUNDs before re-sending DESTROY_SESSION. 24168 18.38. Operation 45: FREE_STATEID - Free Stateid with No Locks 24170 18.38.1. ARGUMENT 24172 struct FREE_STATEID4args { 24173 stateid4 fsa_stateid; 24174 }; 24176 18.38.2. RESULT 24178 struct FREE_STATEID4res { 24179 nfsstat4 fsr_status; 24180 }; 24182 18.38.3. DESCRIPTION 24184 The FREE_STATEID operation is used to free a stateid which no longer 24185 has any associated locks (including opens, byte-range locks, 24186 delegations, layouts). This may be because of client unlock 24187 operations or because of server revocation. If there are valid locks 24188 (of any kind) associated with the stateid in question, the error 24189 NFS4ERR_LOCKS_HELD will be returned, and the associated stateid will 24190 not be freed. 24192 When a stateid is freed which had been associated with revoked locks, 24193 the client, by doing the FREE_STATEID acknowledges the loss of those 24194 locks. This allows the server, once all such revoked state is 24195 acknowledged, to allow that client again to reclaim locks, without 24196 encountering the edge conditions discussed in Section 8.4.2. 24198 Once a successful FREE_STATEID is done for a given stateid, any 24199 subsequent use of that stateid will result in an NFS4ERR_BAD_STATEID 24200 error. 24202 18.39. Operation 46: GET_DIR_DELEGATION - Get a directory delegation 24204 18.39.1. ARGUMENT 24206 typedef nfstime4 attr_notice4; 24208 struct GET_DIR_DELEGATION4args { 24209 /* CURRENT_FH: delegated directory */ 24210 bool gdda_signal_deleg_avail; 24211 bitmap4 gdda_notification_types; 24212 attr_notice4 gdda_child_attr_delay; 24213 attr_notice4 gdda_dir_attr_delay; 24214 bitmap4 gdda_child_attributes; 24215 bitmap4 gdda_dir_attributes; 24216 }; 24218 18.39.2. RESULT 24220 struct GET_DIR_DELEGATION4resok { 24221 verifier4 gddr_cookieverf; 24222 /* Stateid for get_dir_delegation */ 24223 stateid4 gddr_stateid; 24224 /* Which notifications can the server support */ 24225 bitmap4 gddr_notification; 24226 bitmap4 gddr_child_attributes; 24227 bitmap4 gddr_dir_attributes; 24228 }; 24230 enum gddrnf4_status { 24231 GDD4_OK = 0, 24232 GDD4_UNAVAIL = 1 24233 }; 24235 union GET_DIR_DELEGATION4res_non_fatal 24236 switch (gddrnf4_status gddrnf_status) { 24237 case GDD4_OK: 24238 GET_DIR_DELEGATION4resok gddrnf_resok4; 24239 case GDD4_UNAVAIL: 24240 bool gddrnf_will_signal_deleg_avail; 24241 }; 24243 union GET_DIR_DELEGATION4res 24244 switch (nfsstat4 gddr_status) { 24245 case NFS4_OK: 24246 GET_DIR_DELEGATION4res_non_fatal gddr_res_non_fatal4; 24247 default: 24248 void; 24249 }; 24251 18.39.3. DESCRIPTION 24253 The GET_DIR_DELEGATION operation is used by a client to request a 24254 directory delegation. The directory is represented by the current 24255 filehandle. 
The client also specifies whether it wants the server to 24256 notify it when the directory changes in certain ways by setting one 24257 or more bits in a bitmap. The server may choose not to grant the 24258 delegation. In that case the server will return 24259 NFS4ERR_DIRDELEG_UNAVAIL. If the server decides to hand out the 24260 delegation, it will return a cookie verifier for that directory. If 24261 the cookie verifier changes when the client is holding the 24262 delegation, the delegation will be recalled unless the client has 24263 asked for notification for this event. 24265 The server will also return a directory delegation stateid, 24266 gddr_stateid, as a result of the GET_DIR_DELEGATION operation. This 24267 stateid will appear in callback messages related to the delegation, 24268 such as notifications and delegation recalls. The client will use 24269 this stateid to return the delegation voluntarily or upon recall. A 24270 delegation is returned by calling the DELEGRETURN operation. 24272 The server might not be able to support notifications of certain 24273 events. If the client asks for such notifications, the server MUST 24274 inform the client of its inability to do so as part of the 24275 GET_DIR_DELEGATION reply by not setting the appropriate bits in the 24276 supported notifications bitmask, gddr_notification, contained in the 24277 reply. The server MUST NOT add bits to gddr_notification that the 24278 client did not request. 24280 The GET_DIR_DELEGATION operation can be used for both normal and 24281 named attribute directories. 24283 If client sets gdda_signal_deleg_avail to TRUE, then it is 24284 registering with the client a "want" for a directory delegation. If 24285 the delegation is not available, and the server supports and will 24286 honor the "want", the results will have 24287 gddrnf_will_signal_deleg_avail set to TRUE and no error will be 24288 indicated on return. If so the client should expect a future 24289 CB_RECALLABLE_OBJ_AVAIL operation to indicate that a directory 24290 delegation is available. If the server does not wish to honor the 24291 "want" or is not able to do so, it returns the error 24292 NFS4ERR_DIRDELEG_UNAVAIL. If the delegation is immediately 24293 available, the server SHOULD return it with the response to the 24294 operation, rather than via a callback. 24296 18.39.4. IMPLEMENTATION 24298 Directory delegations provide the benefit of improving cache 24299 consistency of namespace information. This is done through 24300 synchronous callbacks. A server must support synchronous callbacks 24301 in order to support directory delegations. In addition to that, 24302 asynchronous notifications provide a way to reduce network traffic as 24303 well as improve client performance in certain conditions. 24305 Notifications are specified in terms of potential changes to the 24306 directory. A client can ask to be notified of events by setting one 24307 or more bits in gdda_notification_types. The client can ask for 24308 notifications on addition of entries to a directory (by setting the 24309 NOTIFY4_ADD_ENTRY in gdda_notification_types), notifications on entry 24310 removal (NOTIFY4_REMOVE_ENTRY), renames (NOTIFY4_RENAME_ENTRY), 24311 directory attribute changes (NOTIFY4_CHANGE_DIR_ATTRIBUTES), and 24312 cookie verifier changes (NOTIFY4_CHANGE_COOKIE_VERIFIER) by setting 24313 one or more corresponding bits in the gdda_notification_types field. 
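Since gdda_notification_types is a bitmap4, it is composed the same way as an attribute bitmap (Section 3.3.7): bit number n is carried in 32-bit word n / 32 at bit position n mod 32. The following non-normative sketch shows one way a client might build the request bitmap; the fixed word count and helper names are assumptions of this example, and the actual bit numbers are those assigned to the NOTIFY4_* values in the protocol's XDR description.

   #include <stdint.h>
   #include <string.h>

   #define BITMAP4_WORDS 2   /* enough for the small notification bit numbers */

   struct bitmap4_buf {
       uint32_t count;                  /* number of 32-bit words in use */
       uint32_t words[BITMAP4_WORDS];
   };

   /* Set bit "n": word n / 32, bit position n mod 32, exactly as for
    * the file attribute bitmap (Section 3.3.7). */
   static void bitmap4_set(struct bitmap4_buf *bm, unsigned n)
   {
       unsigned word = n / 32;
       if (word >= BITMAP4_WORDS)
           return;                      /* out of range for this sketch */
       bm->words[word] |= (uint32_t)1 << (n % 32);
       if (bm->count < word + 1)
           bm->count = word + 1;
   }

   /* Example: ask for entry addition and removal notifications.  The
    * bit numbers passed in are those assigned to NOTIFY4_ADD_ENTRY and
    * NOTIFY4_REMOVE_ENTRY in the XDR description. */
   static void build_notification_request(struct bitmap4_buf *bm,
                                          unsigned bit_add_entry,
                                          unsigned bit_remove_entry)
   {
       memset(bm, 0, sizeof(*bm));
       bitmap4_set(bm, bit_add_entry);
       bitmap4_set(bm, bit_remove_entry);
   }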
24315 The client can also ask for notifications of changes to attributes of 24316 directory entries (NOTIFY4_CHANGE_CHILD_ATTRIBUTES) in order to keep 24317 its attribute cache up to date. However any changes made to child 24318 attributes do not cause the delegation to be recalled. If a client 24319 is interested in directory entry caching, or negative name caching, 24320 it can set the gdda_notification_types appropriately to its 24321 particular need and the server will notify it of all changes that 24322 would otherwise invalidate its name cache. The kind of notification 24323 a client asks for may depend on the directory size, its rate of 24324 change and the applications being used to access that directory. The 24325 enumeration of the conditions under which a client might ask for a 24326 notification is out of the scope of this specification. 24328 For attribute notifications, the client will set bits in the 24329 gdda_dir_attributes bitmap to indicate which attributes it wants to 24330 be notified of. If the server does not support notifications for 24331 changes to a certain attribute, it SHOULD NOT set that attribute in 24332 the supported attribute bitmap specified in the reply 24333 (gddr_dir_attributes). The client will also set in the 24334 gdda_child_attributes bitmap the attributes of directory entries it 24335 wants to be notified of, and the server will indicate in 24336 gddr_child_attributes which attributes of directory entries it will 24337 notify the client of. 24339 The client will also let the server know if it wants to get the 24340 notification as soon as the attribute change occurs or after a 24341 certain delay by setting a delay factor; gdda_child_attr_delay is for 24342 attribute changes to directory entries and gdda_dir_attr_delay is for 24343 attribute changes to the directory. If this delay factor is set to 24344 zero, that indicates to the server that the client wants to be 24345 notified of any attribute changes as soon as they occur. If the 24346 delay factor is set to N seconds, the server will make a best effort 24347 guarantee that attribute updates are synchronized within N seconds. 24348 If the client asks for a delay factor that the server does not 24349 support or that may cause significant resource consumption on the 24350 server by causing the server to send a lot of notifications, the 24351 server should not commit to sending out notifications for attributes 24352 and therefore must not set the appropriate bit in the 24353 gddr_child_attributes and gddr_dir_attributes bitmaps in the 24354 response. 24356 The client MUST use a security tuple (Section 2.6.1) that the 24357 directory or its applicable ancestor (Section 2.6) is exported with. 24358 If not, the server MUST return NFS4ERR_WRONGSEC to the operation that 24359 both precedes GET_DIR_DELEGATION and sets the current filehandle (see 24360 Section 2.6.3.1). 24362 The directory delegation covers all the entries in the directory 24363 except the parent entry. That means if a directory and its parent 24364 both hold directory delegations, any changes to the parent will not 24365 cause a notification to be sent for the child even though the child's 24366 parent entry points to the parent directory. 24368 18.40. Operation 47: GETDEVICEINFO - Get Device Information 24370 18.40.1. ARGUMENT 24372 struct GETDEVICEINFO4args { 24373 deviceid4 gdia_device_id; 24374 layouttype4 gdia_layout_type; 24375 count4 gdia_maxcount; 24376 bitmap4 gdia_notify_types; 24377 }; 24379 18.40.2. 
RESULT 24381 struct GETDEVICEINFO4resok { 24382 device_addr4 gdir_device_addr; 24383 bitmap4 gdir_notification; 24384 }; 24386 union GETDEVICEINFO4res switch (nfsstat4 gdir_status) { 24387 case NFS4_OK: 24388 GETDEVICEINFO4resok gdir_resok4; 24389 case NFS4ERR_TOOSMALL: 24390 count4 gdir_mincount; 24391 default: 24392 void; 24393 }; 24395 18.40.3. DESCRIPTION 24397 Returns pNFS storage device address information for the specified 24398 device ID. The client identifies the device information to be 24399 returned by providing the gdia_device_id and gdia_layout_type that 24400 uniquely identify the device. The client provides gdia_maxcount to 24401 limit the number of bytes for the result. This maximum size 24402 represents all of the data being returned within the 24403 GETDEVICEINFO4resok structure and includes the XDR overhead. The 24404 server may return less data. If the server is unable to return any 24405 information within the gdia_maxcount limit, the error 24406 NFS4ERR_TOOSMALL will be returned. However, if gdia_maxcount is 24407 zero, NFS4ERR_TOOSMALL MUST NOT be returned. 24409 The da_layout_type field of the gdir_device_addr returned by the 24410 server MUST be equal to the gdia_layout_type specified by the client. 24411 If it is not equal, the client SHOULD ignore the response as invalid 24412 and behave as if the server returned an error, even if the client 24413 does have support for the layout type returned. 24415 The client also provides a notification bitmap, gdia_notify_types for 24416 the device ID mapping notification for which it is interested in 24417 receiving; the server must support device ID notifications for the 24418 notification request to have affect. The notification mask is 24419 composed in the same manner as the bitmap for file attributes 24420 (Section 3.3.7). The numbers of bit positions are listed in the 24421 notify_device_type4 enumeration type (Section 20.12). Only two 24422 enumerated values of notify_device_type4 currently apply to 24423 GETDEVICEINFO: NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE 24424 (see Section 20.12). 24426 The notification bitmap applies only to the specified device ID. If 24427 a client issues GETDEVICEINFO on a deviceID multiple times, the last 24428 notification bitmap is used by the server for subsequent 24429 notifications. If the bitmap is zero or empty, then the device ID's 24430 notifications are turned off. 24432 If the client wants to just update or turn off notifications, it MAY 24433 issue GETDEVICEINFO with gdia_maxcount set to zero. In that event, 24434 if the device ID is valid, the reply's da_addr_body field of the 24435 gdir_device_addr field will be of zero length. 24437 If an unknown device ID is given in gdia_device_id, the server 24438 returns NFS4ERR_NOENT. Otherwise, the device address information is 24439 returned in gdir_device_addr. Finally, if the server supports 24440 notifications for device ID mappings, the gdir_notification result 24441 will contain a bitmap of which notifications it will actually send to 24442 the client (via CB_NOTIFY_DEVICEID, see Section 20.12). 24444 If NFS4ERR_TOOSMALL is returned, the results also contain 24445 gdir_mincount. The value of gdir_mincount represents the minimum 24446 size necessary to obtain the device information. 24448 18.40.4. IMPLEMENTATION 24450 Aside from updating or turning off notifications, another use case 24451 for gdia_maxcount being set to zero is to validate a device ID. 
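As a non-normative illustration of the gdia_maxcount and NFS4ERR_TOOSMALL handling described above, a client might retry with the server-provided minimum as sketched below. The send_getdeviceinfo() transport hook and the reply structure are hypothetical and exist only for this example; the error values are those assigned elsewhere in this document.

   #include <stdint.h>

   enum { NFS4_OK = 0, NFS4ERR_TOOSMALL = 10005 };

   /* Simplified view of the reply fields used by this sketch. */
   struct gdi_reply {
       int      gdir_status;
       uint32_t gdir_mincount;  /* valid when status is NFS4ERR_TOOSMALL */
   };

   /* Hypothetical transport hook: sends GETDEVICEINFO with the given
    * gdia_maxcount and fills in *reply.  Not part of the protocol. */
   extern int send_getdeviceinfo(uint32_t gdia_maxcount,
                                 struct gdi_reply *reply);

   /* Issue GETDEVICEINFO, growing gdia_maxcount once if the server
    * indicates the requested limit was too small. */
   static int fetch_device_info(uint32_t first_maxcount)
   {
       struct gdi_reply reply;
       int rc = send_getdeviceinfo(first_maxcount, &reply);
       if (rc != 0)
           return rc;
       if (reply.gdir_status == NFS4ERR_TOOSMALL)
           rc = send_getdeviceinfo(reply.gdir_mincount, &reply);
       return rc == 0 ? reply.gdir_status : rc;
   }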
The client SHOULD request a notification for changes or deletion of a device ID to device address mapping so that the server can allow the client to gracefully use a new mapping, without having pending I/O fail abruptly, or force layouts using the device ID to be recalled or revoked.

It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the client gets and processes the response to GETDEVICEINFO or GETDEVICELIST.  The analysis of the race leverages the fact that the server MUST NOT delete a device ID that is referred to by a layout the client has.

o  CB_NOTIFY_DEVICEID deletes a device ID.  If the client believes it has layouts that refer to the device ID, then it is possible the layouts have been revoked.  The client should send a TEST_STATEID request using the stateid for each layout that might have been revoked.  If TEST_STATEID indicates any layouts have been revoked, the client must recover from layout revocation as described in Section 12.5.6.  If TEST_STATEID indicates at least one layout has not been revoked, the client should send a GETDEVICEINFO on the device ID to verify that the device ID has been deleted.  If GETDEVICEINFO indicates the device ID does not exist, the client then assumes the server is faulty, and recovers by issuing EXCHANGE_ID.  If the client does not have layouts that refer to the device ID, no harm is done.  The client should mark the device ID as deleted, and when the GETDEVICEINFO or GETDEVICELIST results are finally received for the device ID, delete the device ID from the client's cache.

o  CB_NOTIFY_DEVICEID indicates a device ID's device addressing mappings have changed.  The client should assume that the results from the in-progress GETDEVICEINFO will be stale for the device ID once received, and so it should send another GETDEVICEINFO on the device ID.

18.41.  Operation 48: GETDEVICELIST - Get All Device Mappings for a File System

18.41.1.  ARGUMENT

   struct GETDEVICELIST4args {
       /* CURRENT_FH: object belonging to the file system */
       layouttype4 gdla_layout_type;

       /* number of deviceIDs to return */
       count4      gdla_maxdevices;

       nfs_cookie4 gdla_cookie;
       verifier4   gdla_cookieverf;
   };

18.41.2.  RESULT

   struct GETDEVICELIST4resok {
       nfs_cookie4 gdlr_cookie;
       verifier4   gdlr_cookieverf;
       deviceid4   gdlr_deviceid_list<>;
       bool        gdlr_eof;
   };

   union GETDEVICELIST4res switch (nfsstat4 gdlr_status) {
   case NFS4_OK:
       GETDEVICELIST4resok gdlr_resok4;
   default:
       void;
   };

18.41.3.  DESCRIPTION

This operation is used by the client to enumerate all of the device IDs a server's file system uses.

The client provides a current filehandle of a file object that belongs to the file system (i.e., all file objects sharing the same fsid as that of the current filehandle), and the layout type in gdla_layout_type.  Since this operation might require multiple calls to enumerate all the device IDs (and is thus similar to the READDIR (Section 18.23) operation), the client also provides gdla_cookie and gdla_cookieverf to specify the current cursor position in the list.
24533 When the client wants to read from the beginning of the file system's 24534 device mappings, it sets gdla_cookie to zero. The field 24535 gdla_cookieverf MUST be ignored by the server when gdla_cookie is 24536 zero. The client provides gdla_maxdevices to limit the number of 24537 device IDs in the result. If gdla_maxdevices is zero, the server 24538 MUST return NFS4ERR_INVAL. The server MAY return fewer device IDs. 24540 The successful response to the operation will contain the cookie, 24541 gdlr_cookie, and cookie verifier, gdlr_cookieverf, to be used on the 24542 subsequent GETDEVICELIST. A gdlr_eof value of TRUE signifies that 24543 there are no remaining entries in the server's device list. Each 24544 element of gdlr_deviceid_list contains a device ID. 24546 18.41.4. IMPLEMENTATION 24548 An example of the use of this operation is for pNFS clients and 24549 servers that use LAYOUT4_BLOCK_VOLUME layouts. In these environments 24550 it may be helpful for a client to determine device accessibility upon 24551 first file system access. 24553 18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a Layout 24555 18.42.1. ARGUMENT 24557 union newtime4 switch (bool nt_timechanged) { 24558 case TRUE: 24559 nfstime4 nt_time; 24560 case FALSE: 24561 void; 24562 }; 24564 union newoffset4 switch (bool no_newoffset) { 24565 case TRUE: 24566 offset4 no_offset; 24567 case FALSE: 24568 void; 24569 }; 24571 struct LAYOUTCOMMIT4args { 24572 /* CURRENT_FH: file */ 24573 offset4 loca_offset; 24574 length4 loca_length; 24575 bool loca_reclaim; 24576 stateid4 loca_stateid; 24577 newoffset4 loca_last_write_offset; 24578 newtime4 loca_time_modify; 24579 layoutupdate4 loca_layoutupdate; 24580 }; 24582 18.42.2. RESULT 24584 union newsize4 switch (bool ns_sizechanged) { 24585 case TRUE: 24586 length4 ns_size; 24587 case FALSE: 24588 void; 24589 }; 24591 struct LAYOUTCOMMIT4resok { 24592 newsize4 locr_newsize; 24593 }; 24595 union LAYOUTCOMMIT4res switch (nfsstat4 locr_status) { 24596 case NFS4_OK: 24597 LAYOUTCOMMIT4resok locr_resok4; 24598 default: 24599 void; 24600 }; 24602 18.42.3. DESCRIPTION 24604 Commits changes in the layout represented by the current filehandle, 24605 client ID (derived from the session ID in the preceding SEQUENCE 24606 operation), byte range, and stateid. Since layouts are sub- 24607 dividable, a smaller portion of a layout, retrieved via LAYOUTGET, 24608 can be committed. The region being committed is specified through 24609 the byte range (loca_offset and loca_length). This region MUST 24610 overlap with one or more existing layouts previously granted via 24611 LAYOUTGET (Section 18.43), each with an iomode of LAYOUTIOMODE4_RW. 24612 In the case where the iomode of any held layout segment is not 24613 LAYOUTIOMODE4_RW, the server should return the error 24614 NFS4ERR_BAD_IOMODE. For the case where the client does not hold 24615 matching layout segment(s) for the defined region, the server should 24616 return the error NFS4ERR_BAD_LAYOUT. 24618 The LAYOUTCOMMIT operation indicates that the client has completed 24619 writes using a layout obtained by a previous LAYOUTGET. The client 24620 may have only written a subset of the data range it previously 24621 requested. LAYOUTCOMMIT allows it to commit or discard provisionally 24622 allocated space and to update the server with a new end of file. 
The 24623 layout referenced by LAYOUTCOMMIT is still valid after the operation 24624 completes and can be continued to be referenced by the client ID, 24625 filehandle, byte range, layout type, and stateid. 24627 If the loca_reclaim field is set to TRUE, this indicates that the 24628 client is attempting to commit changes to a layout after the restart 24629 of the metadata server during the metadata server's recovery grace 24630 period (see Section 12.7.4). This type of request may be necessary 24631 when the client has uncommitted writes to provisionally allocated 24632 regions of a file which were sent to the storage devices before the 24633 restart of the metadata server. In this case the layout provided by 24634 the client MUST be a subset of a writable layout that the client held 24635 immediately before the restart of the metadata server. The metadata 24636 server is free to accept or reject this request based on its own 24637 internal metadata consistency checks. If the metadata server finds 24638 that the layout provided by the client does not pass its consistency 24639 checks, it MUST reject the request with the status 24640 NFS4ERR_RECLAIM_BAD. The successful completion of the LAYOUTCOMMIT 24641 request with loca_reclaim set to TRUE does NOT provide the client 24642 with a layout for the file. It simply commits the changes to the 24643 layout specified in the loca_layoutupdate field. To obtain a layout 24644 for the file the client must send a LAYOUTGET request to the server 24645 after the server's grace period has expired. If the metadata server 24646 receives a LAYOUTCOMMIT request with loca_reclaim set to TRUE when 24647 the metadata server is not in its recovery grace period, it MUST 24648 reject the request with the status NFS4ERR_NO_GRACE. 24650 Setting the loca_reclaim field to TRUE is required if and only if the 24651 committed layout was acquired before the metadata server restart. If 24652 the client is committing a layout that was acquired during the 24653 metadata server's grace period, it MUST set the "reclaim" field to 24654 FALSE. 24656 The loca_stateid is a layout stateid value as returned by previously 24657 successful layout operations (see Section 12.5.3). 24659 The loca_last_write_offset field specifies the offset of the last 24660 byte written by the client previous to the LAYOUTCOMMIT. Note that 24661 this value is never equal to the file's size (at most it is one byte 24662 less than the file's size) and MUST be less than or equal to 24663 NFS4_MAXFILEOFF. Also, loca_last_write_offset MUST overlap the range 24664 described by loca_offset and loca_length. The metadata server may 24665 use this information to determine whether the file's size needs to be 24666 updated. If the metadata server updates the file's size as the 24667 result of the LAYOUTCOMMIT operation, it must return the new size 24668 (locr_newsize.ns_size) as part of the results. 24670 The loca_time_modify field allows the client to suggest a 24671 modification time it would like the metadata server to set. The 24672 metadata server may use the suggestion or it may use the time of the 24673 LAYOUTCOMMIT operation to set the modification time. If the metadata 24674 server uses the client provided modification time, it should ensure 24675 time does not flow backwards. If the client wants to force the 24676 metadata server to set an exact time, the client should use a SETATTR 24677 operation in a COMPOUND right after LAYOUTCOMMIT. See Section 12.5.4 24678 for more details. 
If the client desires the resultant modification 24679 time it should construct the COMPOUND so that a GETATTR follows the 24680 LAYOUTCOMMIT. 24682 The loca_layoutupdate argument to LAYOUTCOMMIT provides a mechanism 24683 for a client to provide layout specific updates to the metadata 24684 server. For example, the layout update can describe what regions of 24685 the original layout have been used and what regions can be 24686 deallocated. There is no NFSv4.1 file layout-specific layoutupdate4 24687 structure. 24689 The layout information is more verbose for block devices than for 24690 objects and files because the latter two hide the details of block 24691 allocation behind their storage protocols. At the minimum, the 24692 client needs to communicate changes to the end of file location back 24693 to the server, and, if desired, its view of the file's modification 24694 time. For block/volume layouts, it needs to specify precisely which 24695 blocks have been used. 24697 If the layout identified in the arguments does not exist, the error 24698 NFS4ERR_BADLAYOUT is returned. The layout being committed may also 24699 be rejected if it does not correspond to an existing layout with an 24700 iomode of LAYOUTIOMODE4_RW. 24702 On success, the current filehandle retains its value and the current 24703 stateid retains its value. 24705 18.42.4. IMPLEMENTATION 24707 The client MAY also use LAYOUTCOMMIT with the loca_reclaim field set 24708 to TRUE to convey hints to modified file attributes or to report 24709 layout-type specific information such as I/O errors for object-based 24710 storage layouts, as normally done during normal operation. Doing so 24711 may help the metadata server to recover files more efficiently after 24712 restart. For example, some file system implementations may require 24713 expansive recovery of file system objects if the metadata server does 24714 not get a positive indication from all clients holding a write layout 24715 that they have successfully completed all their writes. Sending a 24716 LAYOUTCOMMIT (if required) and then following with LAYOUTRETURN can 24717 provide such an indication and allow for graceful and efficient 24718 recovery. 24720 18.43. Operation 50: LAYOUTGET - Get Layout Information 24721 18.43.1. ARGUMENT 24723 struct LAYOUTGET4args { 24724 /* CURRENT_FH: file */ 24725 bool loga_signal_layout_avail; 24726 layouttype4 loga_layout_type; 24727 layoutiomode4 loga_iomode; 24728 offset4 loga_offset; 24729 length4 loga_length; 24730 length4 loga_minlength; 24731 stateid4 loga_stateid; 24732 count4 loga_maxcount; 24733 }; 24735 18.43.2. RESULT 24737 struct LAYOUTGET4resok { 24738 bool logr_return_on_close; 24739 stateid4 logr_stateid; 24740 layout4 logr_layout<>; 24741 }; 24743 union LAYOUTGET4res switch (nfsstat4 logr_status) { 24744 case NFS4_OK: 24745 LAYOUTGET4resok logr_resok4; 24746 case NFS4ERR_LAYOUTTRYLATER: 24747 bool logr_will_signal_layout_avail; 24748 default: 24749 void; 24750 }; 24752 18.43.3. DESCRIPTION 24754 Requests a layout from the metadata server for reading or writing the 24755 file given by the filehandle at the byte range specified by offset 24756 and length. Layouts are identified by the client ID (derived from 24757 the session ID in the preceding SEQUENCE operation), current 24758 filehandle, layout type (loga_layout_type), and the layout stateid 24759 (loga_stateid). The use of the loga_iomode field depends upon the 24760 layout type, but should reflect the client's data access intent. 
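As a non-normative example of expressing that intent, a client that has just opened a file for reading and wants whatever whole-file layout the server can readily provide might fill the arguments as sketched below (this corresponds to the first strategy discussed in Section 18.43.4). The reduced argument structure and helper are local to this example; loga_stateid would be set to the appropriate open, delegation, or layout stateid as described later in this section.

   #include <stdint.h>

   #define NFS4_UINT64_MAX 0xffffffffffffffffULL

   enum { LAYOUTIOMODE4_READ = 1, LAYOUTIOMODE4_RW = 2 };

   /* Simplified mirror of LAYOUTGET4args, reduced for illustration. */
   struct layoutget_args {
       int      loga_signal_layout_avail;  /* bool */
       int      loga_layout_type;          /* e.g., the files layout type */
       int      loga_iomode;
       uint64_t loga_offset;
       uint64_t loga_length;
       uint64_t loga_minlength;
       /* loga_stateid: set to the current/open stateid (not shown) */
       uint32_t loga_maxcount;
   };

   /* Whole-file read layout request issued right after OPEN: the
    * desired range covers the whole file, and nothing is strictly
    * required (loga_minlength == 0). */
   static void fill_read_layoutget(struct layoutget_args *a,
                                   int layout_type,
                                   uint32_t max_layout_bytes)
   {
       a->loga_signal_layout_avail = 0;
       a->loga_layout_type = layout_type;
       a->loga_iomode      = LAYOUTIOMODE4_READ;
       a->loga_offset      = 0;
       a->loga_minlength   = 0;
       a->loga_length      = NFS4_UINT64_MAX;
       a->loga_maxcount    = max_layout_bytes;
   }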
24762 If the metadata server is in a grace period, and does not persist 24763 layouts and device ID to device address mappings, then it MUST return 24764 NFS4ERR_GRACE (see Section 8.4.2.1). 24766 The LAYOUTGET operation returns layout information for the specified 24767 byte range: a layout. The client actually specifies two ranges, both 24768 starting at the offset in the loga_offset field. The first range is 24769 between loga_offset and loga_offset + loga_length - 1 inclusive. 24770 This range indicates the desired range the client wants the layout to 24771 cover. The second range is between loga_offset and loga_offset + 24772 loga_minlength - 1 inclusive. This range indicates the required 24773 range the client needs the layout to cover. Thus, loga_minlength 24774 MUST be less than or equal to loga_length. 24776 When a length field is set to NFS4_UINT64_MAX, this indicates a 24777 desire (when loga_length is NFS4_UINT64_MAX) or requirement (when 24778 loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset 24779 through the end-of-file, regardless of the file's length. 24781 The following rules govern the relationships among, and the minima of 24782 loga_length, loga_minlength, and loga_offset. 24784 o If loga_length is less than loga_minlength, the metadata server 24785 MUST return NFS4ERR_INVAL. 24787 o If loga_minlength is zero, this is an indication to the metadata 24788 server that the client desires any layout at offset loga_offset or 24789 less that the metadata server has "readily available". Readily is 24790 subjective, and depends on the layout type and the pNFS server 24791 implementation. For example, some metadata servers might have to 24792 pre-allocate stable storage when they receive a request for a 24793 range of a file that goes beyond the file's current length. If 24794 loga_minlength is zero and loga_length is greater than zero, this 24795 tells the metadata server what range of the layout the client 24796 would prefer to have. If loga_length and loga_minlength are both 24797 zero, then the client is indicating it desires a layout of any 24798 length with the ending offset of the range no less than specified 24799 loga_offset, and the starting offset at or below loga_offset. If 24800 the metadata server does not have a layout that is readily 24801 available, then it MUST return NFS4ERR_LAYOUTTRYLATER. 24803 o If the sum of loga_offset and loga_minlength exceeds 24804 NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the 24805 error NFS4ERR_INVAL MUST result. 24807 o If the sum of loga_offset and loga_length exceeds NFS4_UINT64_MAX, 24808 and loga_length is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL 24809 MUST result. 24811 After the metadata server has performed the above checks on 24812 loga_offset, loga_minlength, and loga_offset, the metadata server 24813 MUST return a layout according to the rules in Table 13. 24815 Acceptable layouts based on loga_minlength. Note: u64m = 24816 NFS4_UINT64_MAX; a_off = loga_offset; a_minlen = loga_minlength. 
24818 +-----------+-----------+----------+----------+---------------------+ 24819 | Layout | Layout | Layout | Layout | Layout length of | 24820 | iomode of | a_minlen | iomode | offset | reply | 24821 | request | of | of reply | of reply | | 24822 | | request | | | | 24823 +-----------+-----------+----------+----------+---------------------+ 24824 | _READ | u64m | MAY be | MUST be | MUST be >= file | 24825 | | | _READ | <= a_off | length - layout | 24826 | | | | | offset | 24827 | _READ | u64m | MAY be | MUST be | MUST be u64m | 24828 | | | _RW | <= a_off | | 24829 | _READ | > 0 and < | MAY be | MUST be | MUST be >= MIN(file | 24830 | | u64m | _READ | <= a_off | length, a_minlen + | 24831 | | | | | a_off) - layout | 24832 | | | | | offset | 24833 | _READ | > 0 and < | MAY be | MUST be | MUST be >= a_off - | 24834 | | u64m | _RW | <= a_off | layout offset + | 24835 | | | | | a_minlen | 24836 | _READ | 0 | MAY be | MUST be | MUST be > 0 | 24837 | | | _READ | <= a_off | | 24838 | _READ | 0 | MAY be | MUST be | MUST be > 0 | 24839 | | | _RW | <= a_off | | 24840 | _RW | u64m | MUST be | MUST be | MUST be u64m | 24841 | | | _RW | <= a_off | | 24842 | _RW | > 0 and < | MUST be | MUST be | MUST be >= a_off - | 24843 | | u64m | _RW | <= a_off | layout offset + | 24844 | | | | | a_minlen | 24845 | _RW | 0 | MUST be | MUST be | MUST be > 0 | 24846 | | | _RW | <= a_off | | 24847 +-----------+-----------+----------+----------+---------------------+ 24849 Table 13 24851 If loga_minlength is not zero and the metadata server cannot return a 24852 layout according to the rules in Table 13, then the metadata server 24853 MUST return the error NFS4ERR_BADLAYOUT. If loga_minlength is zero 24854 and the metadata server cannot or will not return a layout according 24855 to the rules in Table 13, then the metadata server MUST return the 24856 error NFS4ERR_LAYOUTTRYLATER. Assuming loga_length is greater than 24857 loga_minlength or equal to zero, the metadata server SHOULD return a 24858 layout according to the rules in Table 14. 24860 Desired layouts based on loga_length. The rules of Table 13 MUST be 24861 applied first. Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; 24862 a_len = loga_length. 
24864 +------------+------------+-----------+-----------+-----------------+ 24865 | Layout | Layout | Layout | Layout | Layout length | 24866 | iomode of | a_len of | iomode of | offset of | of reply | 24867 | request | request | reply | reply | | 24868 +------------+------------+-----------+-----------+-----------------+ 24869 | _READ | u64m | MAY be | MUST be | SHOULD be u64m | 24870 | | | _READ | <= a_off | | 24871 | _READ | u64m | MAY be | MUST be | SHOULD be u64m | 24872 | | | _RW | <= a_off | | 24873 | _READ | > 0 and < | MAY be | MUST be | SHOULD be >= | 24874 | | u64m | _READ | <= a_off | a_off - layout | 24875 | | | | | offset + a_len | 24876 | _READ | > 0 and < | MAY be | MUST be | SHOULD be >= | 24877 | | u64m | _RW | <= a_off | a_off - layout | 24878 | | | | | offset + a_len | 24879 | _READ | 0 | MAY be | MUST be | SHOULD be > | 24880 | | | _READ | <= a_off | a_off - layout | 24881 | | | | | offset | 24882 | _READ | 0 | MAY be | MUST be | SHOULD be > | 24883 | | | _READ | <= a_off | a_off - layout | 24884 | | | | | offset | 24885 | _RW | u64m | MUST be | MUST be | SHOULD be u64m | 24886 | | | _RW | <= a_off | | 24887 | _RW | > 0 and < | MUST be | MUST be | SHOULD be >= | 24888 | | u64m | _RW | <= a_off | a_off - layout | 24889 | | | | | offset + a_len | 24890 | _RW | 0 | MUST be | MUST be | SHOULD be > | 24891 | | | _RW | <= a_off | a_off - layout | 24892 | | | | | offset | 24893 +------------+------------+-----------+-----------+-----------------+ 24895 Table 14 24897 The loga_stateid field specifies a valid stateid. If a layout is not 24898 currently held by the client, the loga_stateid field represents a 24899 stateid reflecting the correspondingly valid open, byte-range lock, 24900 or delegation stateid. Once a layout is held on the file by the 24901 client, the loga_stateid field MUST be a stateid as returned from a 24902 previous LAYOUTGET or LAYOUTRETURN operation or provided by a 24903 CB_LAYOUTRECALL operation (see Section 12.5.3). 24905 The loga_maxcount field specifies the maximum layout size (in bytes) 24906 that the client can handle. If the size of the layout structure 24907 exceeds the size specified by maxcount, the metadata server will 24908 return the NFS4ERR_TOOSMALL error. 24910 The returned layout is expressed as an array, logr_layout, with each 24911 element of type layout4. If a file has a single striping pattern, 24912 then logr_layout SHOULD contain just one entry. Otherwise, if the 24913 requested range overlaps more than one striping pattern, logr_layout 24914 will contain the required number of entries. The elements of 24915 logr_layout MUST be sorted in ascending order of the value of the 24916 lo_offset field of each element. There MUST be no gaps or overlaps 24917 in the range between two successive elements of logr_layout. The 24918 lo_iomode field in each element of logr_layout MUST be the same. 24920 Table 13 and Table 14 both refer to a returned layout iomode, offset, 24921 and length. Because the returned layout is encoded in the 24922 logr_layout array, more description is required. 24924 iomode 24926 The value of the returned layout iomode listed in Table 13 and 24927 Table 14 is equal to the value of the lo_iomode field in each 24928 element of logr_layout. As shown in Table 13 and Table 14, the 24929 metadata server MAY return a layout with an lo_iomode different 24930 from the requested iomode (field loga_iomode of the request). 
If 24931 it does so, it MUST ensure that the lo_iomode is more permissive 24932 than the loga_iomode requested. For example, this behavior allows 24933 an implementation to upgrade read-only requests to read/write 24934 requests at its discretion, within the limits of the layout type 24935 specific protocol. A lo_iomode of either LAYOUTIOMODE4_READ or 24936 LAYOUTIOMODE4_RW MUST be returned. 24938 offset 24940 The value of the returned layout offset listed in Table 13 and 24941 Table 14 is always equal to the lo_offset field of the first 24942 element logr_layout. 24944 length 24946 When setting the value of the returned layout length, the 24947 situation is complicated by the possibility that the special 24948 layout length value NFS4_UINT64_MAX is involved. For a 24949 logr_layout array of N elements, the lo_length field in the first 24950 N-1 elements MUST NOT be NFS4_UINT64_MAX. The lo_length field of 24951 the last element of logr_layout can be NFS4_UINT64_MAX under some 24952 conditions as described in the following list. 24954 * If an applicable rule of Table 13 states the metadata server 24955 MUST return a layout of length NFS4_UINT64_MAX, then lo_length 24956 field of the last element of logr_layout MUST be 24957 NFS4_UINT64_MAX. 24959 * If an applicable rule of Table 13 states the metadata server 24960 MUST NOT return a layout of length NFS4_UINT64_MAX, then 24961 lo_length field of the last element of logr_layout MUST NOT be 24962 NFS4_UINT64_MAX. 24964 * If an applicable rule of Table 14 states the metadata server 24965 SHOULD return a layout of length NFS4_UINT64_MAX, then 24966 lo_length field of the last element of logr_layout SHOULD be 24967 NFS4_UINT64_MAX. 24969 * When the value of the returned layout length of Table 13 and 24970 Table 14 is not NFS4_UINT64_MAX, then the returned layout 24971 length is equal to the sum of the lo_length fields of each 24972 element of logr_layout. 24974 The logr_return_on_close result field is a directive to return the 24975 layout before closing the file. When the metadata server sets this 24976 return value to TRUE, it MUST be prepared to recall the layout in the 24977 case the client fails to return the layout before close. For the 24978 metadata server that knows a layout must be returned before a close 24979 of the file, this return value can be used to communicate the desired 24980 behavior to the client and thus remove one extra step from the 24981 client's and metadata server's interaction. 24983 The logr_stateid stateid is returned to the client for use in 24984 subsequent layout related operations. See Section 8.2, 24985 Section 12.5.3, and Section 12.5.5.2 for a further discussion and 24986 requirements. 24988 The format of the returned layout (lo_content) is specific to the 24989 layout type. The value of the layout type (lo_content.loc_type) for 24990 each of the elements of the array of layouts returned by the metadata 24991 server (logr_layout) MUST be equal to the loga_layout_type specified 24992 by the client. If it is not equal, the client SHOULD ignore the 24993 response as invalid and behave as if the metadata server returned an 24994 error, even if the client does have support for the layout type 24995 returned. 24997 If layouts are not supported for the requested file or its containing 24998 file system the metadata server MUST return 24999 NFS4ERR_LAYOUTUNAVAILABLE. If the layout type is not supported, the 25000 metadata server MUST return NFS4ERR_UNKNOWN_LAYOUTTYPE. 
If layouts 25001 are supported but no layout matches the client provided layout 25002 identification, the metadata server MUST return NFS4ERR_BADLAYOUT. 25003 If an invalid loga_iomode is specified, or a loga_iomode of 25004 LAYOUTIOMODE4_ANY is specified, the metadata server MUST return 25005 NFS4ERR_BADIOMODE. 25007 If the layout for the file is unavailable due to transient 25008 conditions, e.g. file sharing prohibits layouts, the metadata server 25009 MUST return NFS4ERR_LAYOUTTRYLATER. 25011 If the layout request is rejected due to an overlapping layout 25012 recall, the metadata server MUST return NFS4ERR_RECALLCONFLICT. See 25013 Section 12.5.5.2 for details. 25015 If the layout conflicts with a mandatory byte range lock held on the 25016 file, and if the storage devices have no method of enforcing 25017 mandatory locks, other than through the restriction of layouts, the 25018 metadata server SHOULD return NFS4ERR_LOCKED. 25020 If client sets loga_signal_layout_avail to TRUE, then it is 25021 registering with the client a "want" for a layout in the event the 25022 layout cannot be obtained due to resource exhaustion. If the 25023 metadata server supports and will honor the "want", the results will 25024 have logr_will_signal_layout_avail set to TRUE. If so the client 25025 should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a 25026 layout is available. 25028 On success, the current filehandle retains its value and the current 25029 stateid is updated to match the value as returned in the results. 25031 18.43.4. IMPLEMENTATION 25033 Typically, LAYOUTGET will be called as part of a COMPOUND request 25034 after an OPEN operation and results in the client having location 25035 information for the file; this requires that loga_stateid be set to 25036 the special stateid that tells the metadata server to use the current 25037 stateid, which is set by OPEN (see Section 16.2.3.1.2) . A client 25038 may also hold a layout across multiple OPENs. The client specifies a 25039 layout type that limits what kind of layout the metadata server will 25040 return. This prevents metadata servers from granting layouts that 25041 are unusable by the client. 25043 As indicated by Table 13 and Table 14 the specification of LAYOUTGET 25044 allows a pNFS client and server considerable flexibility. A pNFS 25045 client can take several strategies for sending LAYOUTGET. Some 25046 examples are as follows. 25048 o If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and 25049 the OPEN requests read access, the client might opt to request a 25050 _READ layout with loga_offset set to zero, loga_minlength set to 25051 zero, and loga_length set to NFS4_UINT64_MAX. If the file has 25052 space allocated to it, that space is striped over one or more 25053 storage devices, and there is either no conflicting layout, or the 25054 concept of a conflicting layout does not apply to the pNFS 25055 server's layout type or implementation, then the metadata server 25056 might return a layout with a starting offset of zero, and a length 25057 equal to the length of the file, if not NFS4_UINT64_MAX. If the 25058 length of the file is not a multiple of the pNFS server's stripe 25059 width (see Section 13.2 for a formal definition), the metadata 25060 server might round the returned layout's length up. 
o  If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and the OPEN does not truncate the file, and requests write access, the client might opt to request a _RW layout with loga_offset set to zero, loga_minlength set to zero, and loga_length set to the file's current length (if known), or NFS4_UINT64_MAX.  As with the previous case, under some conditions the metadata server might return a layout that covers the entire length of the file or beyond.

o  As above, but the OPEN truncates the file.  In this case, the client might anticipate it will be writing to the file from offset zero, and so loga_offset and loga_minlength are set to zero, and loga_length is set to the value of threshold4_write_iosize.  The metadata server might return a layout from offset zero with a length at least as long as threshold4_write_iosize.

o  A process on the client invokes a request to read from offset 10000 for length 50000.  The client is using buffered I/O, and has buffer sizes of 4096 bytes.  The client intends to map the request of the process into a series of READ requests starting at offset 8192.  The end offset needs to be higher than 10000 + 50000 = 60000, and the next offset that is a multiple of 4096 is 61440.  The difference between 61440 and the starting offset of the layout (8192) is 53248, which is 13 * 4096.  The value of threshold4_read_iosize is less than 53248, so the client sends a LAYOUTGET request with loga_offset set to 8192, loga_minlength set to 53248, and loga_length set to the file's length (if known) minus 8192 or NFS4_UINT64_MAX (if the file's length is not known).  Since this LAYOUTGET request exceeds the metadata server's threshold, it grants the layout, possibly with an initial offset of 0, with an end offset of at least 8192 + 53248 - 1 = 61439, but preferably a layout with an offset aligned on the stripe width and a length that is a multiple of the stripe width.

o  As above, but the client is not using buffered I/O, and instead all internal I/O requests are sent directly to the server.  The LAYOUTGET request has loga_offset equal to 10000, and loga_minlength set to 50000.  The value of loga_length is set to the length of the file.  The metadata server is free to return a layout that fully overlaps the requested range, with a starting offset and length aligned on the stripe width.

o  Again, a process on the client invokes a request to read from offset 10000 for length 50000, and buffered I/O is in use.  The client expects that the server might not be able to return a layout for the full I/O range, i.e., the range starting at offset 8192 with length 53248.  The client intends to map the request of the process into a series of READ requests starting at offset 8192, each with length 4096, with a total length of 53248 (which equals 13 * 4096).  Because the value of threshold4_read_iosize is equal to 4096, it is practical and reasonable for the client to use several LAYOUTGETs to complete the series of READs.  The client sends a LAYOUTGET request with loga_offset set to 8192, loga_minlength set to 4096, and loga_length set to 53248 or higher.
The server will grant a 25117 layout possibly with an initial offset of 0, with an end offset of 25118 at least 8192 + 4096 - 1 = 12287, but preferably a layout with an 25119 offset aligned on the stripe width and a length that is a multiple 25120 of the stripe width. This will allow the client to make forward 25121 progress, possibly having to issue more LAYOUTGET requests for the 25122 remainder of the range. 25124 o An NFS client detects a sequential read pattern, and so issues a 25125 LAYOUTGET that goes well beyond any current or pending read 25126 requests to the server. The server might likewise detect this 25127 pattern, and grant the LAYOUTGET request. The client continues to 25128 send LAYOUTGET requests once it has read from an offset of the 25129 file that represents 50% of the way through the last layout it 25130 received. 25132 o As above but the client fails to detect the pattern, but the 25133 server does. The next time the metadata server gets a LAYOUTGET, 25134 it returns a layout with a length that is well beyond 25135 loga_minlength. 25137 o A client is using buffered I/O, and has a long queue of write 25138 behinds to process and also detects a sequential write pattern. 25139 It issues a LAYOUTGET for a layout that spans the range of the 25140 queued write behinds and well beyond, including ranges beyond the 25141 filer's current length. The client continues to issue LAYOUTGETs 25142 once the write behind queue reaches 50% of the maximum queue 25143 length. 25145 Once the client has obtained a layout referring to a particular 25146 device ID, the metadata server MUST NOT delete the device ID until 25147 the layout is returned or revoked. 25149 CB_NOTIFY_DEVICEID can race with LAYOUTGET. One race scenario is 25150 that LAYOUTGET returns a device ID the client does not have device 25151 address mappings for, and the metadata server sends a 25152 CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and 25153 meanwhile the client sends GETDEVICEINFO on the device ID. This 25154 scenario is discussed in Section 18.40.4. Another scenario is that 25155 the CB_NOTIFY_DEVICEID is processed by the client before it processes 25156 the results from LAYOUTGET. The client will send a GETDEVICEINFO on 25157 the device ID. If the results from GETDEVICEINFO are received before 25158 the client gets results from LAYOUTGET, then there is no longer a 25159 race. If the results from LAYOUTGET are received before the results 25160 from GETDEVICEINFO, the client can either wait for results of 25161 GETDEVICEINFO, or send another one to get possibly more up to date 25162 device address mappings for the device ID. 25164 18.44. Operation 51: LAYOUTRETURN - Release Layout Information 25166 18.44.1. 
ARGUMENT 25168 /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ 25169 const LAYOUT4_RET_REC_FILE = 1; 25170 const LAYOUT4_RET_REC_FSID = 2; 25171 const LAYOUT4_RET_REC_ALL = 3; 25173 enum layoutreturn_type4 { 25174 LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, 25175 LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, 25176 LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL 25177 }; 25179 struct layoutreturn_file4 { 25180 offset4 lrf_offset; 25181 length4 lrf_length; 25182 stateid4 lrf_stateid; 25183 /* layouttype4 specific data */ 25184 opaque lrf_body<>; 25185 }; 25187 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 25188 case LAYOUTRETURN4_FILE: 25189 layoutreturn_file4 lr_layout; 25190 default: 25191 void; 25192 }; 25193 struct LAYOUTRETURN4args { 25194 /* CURRENT_FH: file */ 25195 bool lora_reclaim; 25196 layouttype4 lora_layout_type; 25197 layoutiomode4 lora_iomode; 25198 layoutreturn4 lora_layoutreturn; 25199 }; 25201 18.44.2. RESULT 25203 union layoutreturn_stateid switch (bool lrs_present) { 25204 case TRUE: 25205 stateid4 lrs_stateid; 25206 case FALSE: 25207 void; 25208 }; 25210 union LAYOUTRETURN4res switch (nfsstat4 lorr_status) { 25211 case NFS4_OK: 25212 layoutreturn_stateid lorr_stateid; 25213 default: 25214 void; 25215 }; 25217 18.44.3. DESCRIPTION 25219 This operation returns from the client to the server one or more 25220 layouts represented by the client ID (derived from the session ID in 25221 the preceding SEQUENCE operation), lora_layout_type, and lora_iomode. 25222 When lr_returntype is LAYOUTRETURN4_FILE, the returned layout is 25223 further identified by the current filehandle, lrf_offset, lrf_length, 25224 and lrf_stateid. If the lrf_length field is NFS4_UINT64_MAX, all 25225 bytes of the layout, starting at lrf_offset are returned. When 25226 lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used 25227 to identify the file system and all layouts matching the client ID, 25228 the fsid of the file system, lora_layout_type, and lora_iomode are 25229 returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts 25230 matching the client ID, lora_layout_type, and lora_iomode are 25231 returned and the current filehandle is not used. After this call, 25232 the client MUST NOT use the returned layout(s) and the associated 25233 storage protocol to access the file data. 25235 If the set of layouts designated in the case of LAYOUTRETURN4_FSID or 25236 LAYOUTRETURN4_ALL is empty, then no error results. In the case of 25237 LAYOUTRETURN4_FILE, the byte range specified is returned even if it 25238 is a subdivision of a layout previously obtained with LAYOUTGET, a 25239 combination of multiple layouts previously obtained with LAYOUTGET, 25240 or a combination including some layouts previously obtained with 25241 LAYOUTGET, and one or more subdivisions of such layouts. When the 25242 byte range does not designate any bytes for which a layout is held 25243 for the specified file, client ID, layout type and mode, no error 25244 results. See Section 12.5.5.2.1.5 for considerations with "bulk" 25245 return of layouts. 25247 The layout being returned may be a subset or superset of a layout 25248 specified by CB_LAYOUTRECALL. However, if it is a subset, the recall 25249 is not complete until the full recalled scope has been returned. 25250 Recalled scope refers to the byte range in the case of 25251 LAYOUTRETURN4_FILE, use of LAYOUTRETURN4_FSID, or the use of 25252 LAYOUTRETURN4_ALL. 
There must be a LAYOUTRETURN with a matching 25253 scope to complete the return even if all current layout ranges have 25254 been previously individually returned. 25256 For all lr_returntype values, an iomode of LAYOUTIOMODE4_ANY 25257 specifies that all layouts that match the other arguments to 25258 LAYOUTRETURN (i.e., client ID, lora_layout_type, and one of current 25259 filehandle and range; fsid derived from current filehandle; or 25260 LAYOUTRETURN4_ALL) are being returned. 25262 In the case that lr_returntype is LAYOUTRETURN4_FILE, the lrf_stateid 25263 provided by the client is a layout stateid as returned from previous 25264 layout operations. Note that the "seqid" field of lrf_stateid MUST 25265 NOT be zero. See Section 8.2, Section 12.5.3, and Section 12.5.5.2 25266 for a further discussion and requirements. 25268 Return of a layout or all layouts does not invalidate the mapping of 25269 storage device ID to storage device address which remains in effect 25270 until specifically changed or deleted via device ID notification 25271 callbacks. 25273 If the lora_reclaim field is set to TRUE, the client is attempting to 25274 return a layout that was acquired before the restart of the metadata 25275 server during the metadata server's grace period. When returning 25276 layouts that were acquired during the metadata server's grace period, 25277 the client MUST set the lora_reclaim field to FALSE. The 25278 lora_reclaim field MUST be set to FALSE also when lr_layoutreturn is 25279 LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See LAYOUTCOMMIT 25280 (Section 18.42) for more details. 25282 Layouts may be returned when recalled or voluntarily (i.e., before 25283 the server has recalled them). In either case the client must 25284 properly propagate state changed under the context of the layout to 25285 the storage device(s) or to the metadata server before returning the 25286 layout. 25288 If the client returns the layout in response to a CB_LAYOUTRECALL 25289 where the lor_recalltype field of the clora_recall field was 25290 LAYOUTRECALL4_FILE, the client should use the lor_stateid value from 25291 CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should 25292 use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid 25293 (from a previous LAYRETURN result). This is done to indicate the 25294 point in time (in terms of layout stateid transitions) when the 25295 recall was sent. The client uses the precise lora_recallstateid 25296 value and MUST NOT set the stateid's seqid to zero; otherwise 25297 NFS4ERR_BAD_STATEID MUST be returned. NFS4ERR_OLD_STATEID can be 25298 returned if the client is using an old seqid, and the server knows 25299 the client should not be using the old seqid. E.g. the client uses 25300 the seqid on slot 1 of the session, received the response with the 25301 new seqid, and uses the slot to send another request with the old 25302 seqid. 25304 If a client fails to return a layout in a timely manner, then the 25305 metadata server SHOULD use its control protocol with the storage 25306 devices to fence the client from accessing the data referenced by the 25307 layout. See Section 12.5.5 for more details. 25309 If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after 25310 the metadata server's grace period, NFS4ERR_NO_GRACE is returned. 25312 If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and 25313 lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, 25314 NFS4ERR_INVAL is returned. 
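The lora_reclaim rules in the preceding paragraphs can be summarized by a short, non-normative check such as the following sketch; the function and parameter names are local to this example, and the error values are those assigned in this document's error definitions.

   #include <stdbool.h>

   enum {
       LAYOUTRETURN4_FILE = 1,
       LAYOUTRETURN4_FSID = 2,
       LAYOUTRETURN4_ALL  = 3
   };

   enum { NFS4_OK = 0, NFS4ERR_INVAL = 22, NFS4ERR_NO_GRACE = 10013 };

   /* Non-normative sketch of the lora_reclaim checks described above. */
   static int check_layoutreturn_reclaim(bool lora_reclaim,
                                         int  lr_returntype,
                                         bool server_in_grace)
   {
       if (!lora_reclaim)
           return NFS4_OK;
       if (lr_returntype == LAYOUTRETURN4_FSID ||
           lr_returntype == LAYOUTRETURN4_ALL)
           return NFS4ERR_INVAL;      /* reclaim only valid for _FILE     */
       if (!server_in_grace)
           return NFS4ERR_NO_GRACE;   /* reclaim outside the grace period */
       return NFS4_OK;                /* proceed with reclaim processing  */
   }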
25316 If the client sets the lr_returntype field to LAYOUTRETURN4_FILE, 25317 then the lrs_stateid field will represent the layout stateid as 25318 updated for this operation's processing; the current stateid will 25319 also be updated to match the returned value. If the last byte of any 25320 layout for the current file, client ID, and layout type is being 25321 returned and there are no remaining pending CB_LAYOUTRECALL 25322 operations for which a LAYOUTRETURN operation must be done, 25323 lrs_present MUST be FALSE, and no stateid will be returned. In 25324 addition, the COMPOUND request's current stateid will be set to all- 25325 zeroes special stateid (see Section 16.2.3.1.2). The server MUST 25326 reject with NFS4ERR_BAD_STATEID any further use of the current 25327 stateid in that COMPOUND until the current stateid is re-established 25328 by a later stateid-returning operation. 25330 On success, the current filehandle retains its value. 25332 If the EXCHGID4_FLAG_BIND_PRINC_STATEID capability is set on the 25333 client ID (see Section 18.35), the server will require that the 25334 principal, security flavor, and if applicable, the GSS mechanism, 25335 combination that acquired the layout also be the one to send 25336 LAYOUTRETURN. This might not be possible if credentials for the 25337 principal are no longer available. The server will allow the machine 25338 credential or SSV credential (see Section 18.35) to send LAYOUTRETURN 25339 if LAYOUTRETURN's operation code was set in the spo_must_allow result 25340 of EXCHANGE_ID. 25342 18.44.4. IMPLEMENTATION 25344 The final LAYOUTRETURN operation in response to a CB_LAYOUTRECALL 25345 callback MUST be serialized with any outstanding, intersecting 25346 LAYOUTRETURN operations. Note that it is possible that while a 25347 client is returning the layout for some recalled range the server may 25348 recall a superset of that range (e.g. LAYOUTRECALL4_ALL); the final 25349 return operation for the latter must block until the former layout 25350 recall is done. 25352 Returning all layouts in a file system using LAYOUTRETURN4_FSID is 25353 typically done in response to a CB_LAYOUTRECALL for that file system 25354 as the final return operation. Similarly, LAYOUTRETURN4_ALL is used 25355 in response to a recall callback for all layouts. It is possible 25356 that the client already returned some outstanding layouts via 25357 individual LAYOUTRETURN calls and the call for LAYOUTRETURN4_FSID or 25358 LAYOUTRETURN4_ALL marks the end of the LAYOUTRETURN sequence. See 25359 Section 12.5.5.1 for more details. 25361 Once the client has returned all layouts referring to a particular 25362 device ID, the server MAY delete the device ID. 25364 18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed Object 25366 18.45.1. ARGUMENT 25368 enum secinfo_style4 { 25369 SECINFO_STYLE4_CURRENT_FH = 0, 25370 SECINFO_STYLE4_PARENT = 1 25371 }; 25373 /* CURRENT_FH: object or child directory */ 25374 typedef secinfo_style4 SECINFO_NO_NAME4args; 25376 18.45.2. RESULT 25378 /* CURRENTFH: consumed if status is NFS4_OK */ 25379 typedef SECINFO4res SECINFO_NO_NAME4res; 25381 18.45.3. DESCRIPTION 25383 Like the SECINFO operation, SECINFO_NO_NAME is used by the client to 25384 obtain a list of valid RPC authentication flavors for a specific file 25385 object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that 25386 are accessed by filehandle. 25388 There are two styles of SECINFO_NO_NAME, as determined by the value 25389 of the secinfo_style4 enumeration. 
If SECINFO_STYLE4_CURRENT_FH is 25390 passed, then SECINFO_NO_NAME is querying for the required security 25391 for the current filehandle. If SECINFO_STYLE4_PARENT is passed, then 25392 SECINFO_NO_NAME is querying for the required security of the current 25393 filehandle's parent. If the style selected is SECINFO_STYLE4_PARENT, 25394 then SECINFO should apply the same access methodology used for 25395 LOOKUPP when evaluating the traversal to the parent directory. 25396 Therefore, if the requester does not have the appropriate access to 25397 LOOKUPP the parent then SECINFO_NO_NAME must behave the same way and 25398 return NFS4ERR_ACCESS. 25400 If PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH return NFS4ERR_WRONGSEC, 25401 then the client resolves the situation by sending a COMPOUND request 25402 that consists of PUTFH, PUTPUBFH, or PUTROOTFH immediately followed 25403 by SECINFO_NO_NAME, style SECINFO_STYLE4_CURRENT_FH. See Section 2.6 25404 for instructions on dealing with NFS4ERR_WRONGSEC error returns from 25405 PUTFH, PUTROOTFH, PUTPUBFH, or RESTOREFH. 25407 If SECINFO_STYLE4_PARENT is specified and there is no parent 25408 directory, SECINFO_NO_NAME MUST return NFS4ERR_NOENT. 25410 On success, the current filehandle is consumed (see 25411 Section 2.6.3.1.1.8), and if the next operation after SECINFO_NO_NAME 25412 tries to use the current filehandle, that operation will fail with 25413 the status NFS4ERR_NOFILEHANDLE. 25415 Everything else about SECINFO_NO_NAME is the same as SECINFO. See 25416 the discussion on SECINFO (Section 18.29.3). 25418 18.45.4. IMPLEMENTATION 25420 See the discussion on SECINFO (Section 18.29.4). 25422 18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing and 25423 Control 25425 18.46.1. ARGUMENT 25427 struct SEQUENCE4args { 25428 sessionid4 sa_sessionid; 25429 sequenceid4 sa_sequenceid; 25430 slotid4 sa_slotid; 25431 slotid4 sa_highest_slotid; 25432 bool sa_cachethis; 25433 }; 25435 18.46.2. RESULT 25437 const SEQ4_STATUS_CB_PATH_DOWN = 0x00000001; 25438 const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING = 0x00000002; 25439 const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED = 0x00000004; 25440 const SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED = 0x00000008; 25441 const SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010; 25442 const SEQ4_STATUS_ADMIN_STATE_REVOKED = 0x00000020; 25443 const SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040; 25444 const SEQ4_STATUS_LEASE_MOVED = 0x00000080; 25445 const SEQ4_STATUS_RESTART_RECLAIM_NEEDED = 0x00000100; 25446 const SEQ4_STATUS_CB_PATH_DOWN_SESSION = 0x00000200; 25447 const SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400; 25448 const SEQ4_STATUS_DEVID_CHANGED = 0x00000800; 25449 const SEQ4_STATUS_DEVID_DELETED = 0x00001000; 25451 struct SEQUENCE4resok { 25452 sessionid4 sr_sessionid; 25453 sequenceid4 sr_sequenceid; 25454 slotid4 sr_slotid; 25455 slotid4 sr_highest_slotid; 25456 slotid4 sr_target_highest_slotid; 25457 uint32_t sr_status_flags; 25458 }; 25460 union SEQUENCE4res switch (nfsstat4 sr_status) { 25461 case NFS4_OK: 25462 SEQUENCE4resok sr_resok4; 25463 default: 25464 void; 25465 }; 25467 18.46.3. DESCRIPTION 25469 The SEQUENCE operation is used by the server to implement session 25470 request control and the reply cache semantics. 25472 This operation MUST appear as the first operation of any COMPOUND in 25473 which it appears. The error NFS4ERR_SEQUENCE_POS will be returned 25474 when it is found in any position in a COMPOUND beyond the first. 
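As a non-normative illustration of the positioning rule just stated, the following Python sketch (the function name is hypothetical) scans an ordered list of operation codes for a COMPOUND and reports NFS4ERR_SEQUENCE_POS, using the operation number 53 assigned to SEQUENCE by this document:

   # Non-normative illustration of the SEQUENCE positioning rule above.
   OP_SEQUENCE = 53   # operation number assigned to SEQUENCE

   def check_sequence_position(op_codes):
       """Report NFS4ERR_SEQUENCE_POS if SEQUENCE appears beyond slot 0."""
       for position, op in enumerate(op_codes):
           if op == OP_SEQUENCE and position != 0:
               return "NFS4ERR_SEQUENCE_POS"
       return "NFS4_OK"

   # 22 and 9 below are arbitrary stand-ins for other forechannel operations.
   assert check_sequence_position([OP_SEQUENCE, 22, 9]) == "NFS4_OK"
   assert check_sequence_position([22, OP_SEQUENCE]) == "NFS4ERR_SEQUENCE_POS"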
25475 Operations other than SEQUENCE, BIND_CONN_TO_SESSION, EXCHANGE_ID, 25476 CREATE_SESSION, and DESTROY_SESSION, MUST NOT appear as the first 25477 operation in a COMPOUND. Such operations MUST yield the error 25478 NFS4ERR_OP_NOT_IN_SESSION if they do appear at the start of a 25479 COMPOUND. 25481 If SEQUENCE is received on a connection not associated with the 25482 session via CREATE_SESSION or BIND_CONN_TO_SESSION, and connection 25483 association enforcement is enabled (see Section 18.35), then the 25484 server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION. 25486 The sa_sessionid argument identifies the session this request applies 25487 to. The sr_sessionid result MUST equal sa_sessionid. 25489 The sa_slotid argument is the index in the reply cache for the 25490 request. The sa_sequenceid field is the sequence number of the 25491 request for the reply cache entry (slot). The sr_slotid result MUST 25492 equal sa_slotid. The sr_sequenceid result MUST equal sa_sequenceid. 25494 The sa_highest_slotid argument is the highest slot ID the client has 25495 a request outstanding for; it could be equal to sa_slotid. The 25496 server returns two "highest_slotid" values: sr_highest_slotid, and 25497 sr_target_highest_slotid. The former is the highest slot ID the 25498 server will accept in future SEQUENCE operation, and SHOULD NOT be 25499 less than the value of sa_highest_slotid. (but see Section 2.10.6.1 25500 for an exception). The latter is the highest slot ID the server 25501 would prefer the client use on a future SEQUENCE operation. 25503 If sa_cachethis is TRUE, then the client is requesting that the 25504 server cache the entire reply in the server's reply cache; therefore 25505 the server MUST cache the reply (see Section 2.10.6.1.3). The server 25506 MAY cache the reply if sa_cachethis is FALSE. If the server does not 25507 cache the entire reply, it MUST still record that it executed the 25508 request at the specified slot and sequence ID. 25510 The response to the SEQUENCE operation contains a word of status 25511 flags (sr_status_flags) that can provide to the client information 25512 related to the status of the client's lock state and communications 25513 paths. Note that any status bits relating to lock state MAY be reset 25514 when lock state is lost due to a server restart (even if the session 25515 is persistent across restarts; session persistence does not imply 25516 lock state persistence) or the establishment of a new client 25517 instance. 25519 SEQ4_STATUS_CB_PATH_DOWN 25520 When set, indicates that the client has no operational backchannel 25521 path for any session associated with the client ID, making it 25522 necessary for the client to re-establish one. This bit remains 25523 set on all SEQUENCE responses on all sessions associated with the 25524 client ID until at least one backchannel is available on any 25525 session associated with the client ID. If the client fails to re- 25526 establish a backchannel for the client ID, it is subject to having 25527 recallable state revoked. 25529 SEQ4_STATUS_CB_PATH_DOWN_SESSION 25530 When set, indicates that the session has no operational 25531 backchannel. There are two reasons why 25532 SEQ4_STATUS_CB_PATH_DOWN_SESSION may be set and not 25533 SEQ4_STATUS_CB_PATH_DOWN. First is that a callback operation that 25534 applies specifically to the session (e.g. CB_RECALL_SLOT, see 25535 Section 20.8) needs to be sent. Second is that the server did 25536 send a callback operation, but the connection was lost before the 25537 reply. 
The server cannot be sure whether the client received the 25538 callback operation or not, and so, per rules on request retry, the 25539 server MUST retry the callback operation over the same session. 25540 The SEQ4_STATUS_CB_PATH_DOWN_SESSION bit is the indication to the 25541 client that it needs to associate a connection to the session's 25542 backchannel. This bit remains set on all SEQUENCE responses on 25543 the session until a backchannel for the session is 25544 available. If the client fails to re-establish a backchannel for 25545 the session, it is subject to having recallable state revoked. 25547 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING 25548 When set, indicates that all GSS contexts assigned to the 25549 session's backchannel will expire within a period equal to the 25550 lease time. This bit remains set on all SEQUENCE replies until 25551 the expiration time of at least one context is beyond the lease 25552 period from the current time (relative to the time of when a 25553 SEQUENCE response was sent) or until all GSS contexts for the 25554 session's backchannel have expired. 25556 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED 25557 When set, indicates all GSS contexts assigned to the session's 25558 backchannel have expired. This bit remains set on all SEQUENCE 25559 replies until at least one non-expired context for the session's 25560 backchannel has been established. 25562 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED 25563 When set, indicates that the lease has expired and as a result the 25564 server released all of the client's locking state. This status 25565 bit remains set on all SEQUENCE replies until the loss of all such 25566 locks has been acknowledged by use of FREE_STATEID (see 25567 Section 18.38), or by establishing a new client instance by 25568 destroying all sessions (via DESTROY_SESSION), the client ID (via 25569 DESTROY_CLIENTID), and then invoking EXCHANGE_ID and 25570 CREATE_SESSION to establish a new client ID. 25572 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED 25573 When set indicates that some subset of the client's locks have 25574 been revoked due to expiration of the lease period followed by 25575 another client's conflicting lock request. This status bit 25576 remains set on all SEQUENCE replies until the loss of all such 25577 locks has been acknowledged by use of FREE_STATEID. 25579 SEQ4_STATUS_ADMIN_STATE_REVOKED 25580 When set indicates that one or more locks have been revoked 25581 without expiration of the lease period, due to administrative 25582 action. This status bit remains set on all SEQUENCE replies until 25583 the loss of all such locks has been acknowledged by use of 25584 FREE_STATEID. 25586 SEQ4_STATUS_RECALLABLE_STATE_REVOKED 25587 When set indicates that one or more recallable objects have been 25588 revoked without expiration of the lease period, due to the 25589 client's failure to return them when recalled which may be a 25590 consequence of there being no working backchannel and the client 25591 failing to reestablish a backchannel per the 25592 SEQ4_STATUS_CB_PATH_DOWN, SEQ4_STATUS_CB_PATH_DOWN_SESSION, or 25593 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED status flags. This status bit 25594 remains set on all SEQUENCE replies until the loss of all such 25595 locks has been acknowledged by use of FREE_STATEID. 25597 SEQ4_STATUS_LEASE_MOVED 25598 When set indicates that responsibility for lease renewal has been 25599 transferred to one or more new servers.
This condition will 25600 continue until the client receives an NFS4ERR_MOVED error and the 25601 server receives the subsequent GETATTR for the fs_locations or 25602 fs_locations_info attribute for an access to each file system for 25603 which a lease has been moved to a new server. See 25604 Section 11.7.7.1. 25606 SEQ4_STATUS_RESTART_RECLAIM_NEEDED 25607 When set indicates that due to server restart the client must 25608 reclaim locking state. Until the client sends a global 25609 RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation will 25610 return SEQ4_STATUS_RESTART_RECLAIM_NEEDED. 25612 SEQ4_STATUS_BACKCHANNEL_FAULT 25613 The server has encountered an unrecoverable fault with the 25614 backchannel (e.g. it has lost track of the sequence ID for a slot 25615 in the backchannel). The client MUST stop sending more requests 25616 on the session's fore channel, wait for all outstanding requests 25617 to complete on the fore and back channel, and then destroy the 25618 session. 25620 SEQ4_STATUS_DEVID_CHANGED 25621 The client is using device ID notifications and the server has 25622 changed a device ID mapping held by the client. This flag will 25623 stay present until the client has obtained the new mapping with 25624 GETDEVICEINFO. 25626 SEQ4_STATUS_DEVID_DELETED 25627 The client is using device ID notifications and the server has 25628 deleted a device ID mapping held by the client. This flag will 25629 stay in effect until the client sends a GETDEVICEINFO on the 25630 device ID with a null value in the argument gdia_notify_types. 25632 The value of the sa_sequenceid argument relative to the cached 25633 sequence ID on the slot falls into one of three cases. 25635 o If the difference between sa_sequenceid and the server's cached 25636 sequence ID at the slot ID is two (2) or more, or if sa_sequenceid 25637 is less than the cached sequence ID (accounting for wraparound of 25638 the unsigned sequence ID value), then the server MUST return 25639 NFS4ERR_SEQ_MISORDERED. 25641 o If sa_sequenceid and the cached sequence ID are the same, this is 25642 a retry, and the server replies with the COMPOUND reply that is 25643 stored the reply cache. The lease is possibly renewed as 25644 described below. 25646 o If sa_sequenceid is one greater (accounting for wraparound) than 25647 the cached sequence ID, then this is a new request, and the slot's 25648 sequence ID is incremented. The operations subsequent to 25649 SEQUENCE, if any, are processed. If there are no other 25650 operations, the only other effects are to cache the SEQUENCE reply 25651 in the slot, maintain the session's activity, and possibly renew 25652 the lease. 25654 If the client reuses a slot ID and sequence ID for a completely 25655 different request, the server MAY treat the request as if it is retry 25656 of what it has already executed. The server MAY however detect the 25657 client's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY. 25659 If SEQUENCE returns an error, then the state of the slot (sequence 25660 ID, cached reply) MUST NOT change, and the associated lease MUST NOT 25661 be renewed. 25663 If SEQUENCE returns NFS4_OK, then the associated lease MUST be 25664 renewed (see Section 8.3), except if 25665 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags. 25667 18.46.4. IMPLEMENTATION 25669 The server MUST maintain a mapping of session ID to client ID in 25670 order to validate any operations that follow SEQUENCE that take a 25671 stateid as an argument and/or result. 
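The three sa_sequenceid cases above reduce to unsigned 32-bit modular arithmetic on the sequence IDs. The following non-normative Python sketch (the function name and the return labels are illustrative only) classifies an incoming sa_sequenceid against the sequence ID cached in the slot:

   # Non-normative illustration of the three sa_sequenceid cases above.
   MASK32 = 0xFFFFFFFF   # sequenceid4 is an unsigned 32-bit value

   def classify_sequenceid(sa_sequenceid, cached_sequenceid):
       delta = (sa_sequenceid - cached_sequenceid) & MASK32
       if delta == 0:
           return "retry"               # replay the cached COMPOUND reply
       if delta == 1:
           return "new request"         # advance the slot and execute
       return "NFS4ERR_SEQ_MISORDERED"  # two or more ahead, or behind the cache

   assert classify_sequenceid(7, 7) == "retry"
   assert classify_sequenceid(8, 7) == "new request"
   assert classify_sequenceid(0, 0xFFFFFFFF) == "new request"   # wraparound
   assert classify_sequenceid(6, 7) == "NFS4ERR_SEQ_MISORDERED"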
25673 If the client establishes a persistent session, then a SEQUENCE done 25674 after a server restart may encounter requests performed and recorded 25675 in a persistent reply cache before the server restart. In this case, 25676 SEQUENCE will be processed successfully, while requests which were 25677 not processed previously are rejected with NFS4ERR_DEADSESSION. 25679 Depending on which of the operations within the COMPOUND were 25680 successfully performed before the server restart, these operations 25681 will also have replies sent from the server reply cache. Note that 25682 when these operations establish locking state it is locking state 25683 that applies to the previous server instance and to the previous 25684 client ID, even though the server restart, which logically happened 25685 after these operations, eliminated that state. In the case of a 25686 partially executed COMPOUND, processing may reach an operation not 25687 processed during the earlier server instance, making this operation a 25688 new one and not performable on the existing session. In this case, 25689 NFS4ERR_DEADSESSION will be returned from that operation. 25691 18.47. Operation 54: SET_SSV - Update SSV for a Client ID 25693 18.47.1. ARGUMENT 25695 struct ssa_digest_input4 { 25696 SEQUENCE4args sdi_seqargs; 25697 }; 25699 struct SET_SSV4args { 25700 opaque ssa_ssv<>; 25701 opaque ssa_digest<>; 25702 }; 25704 18.47.2. RESULT 25706 struct ssr_digest_input4 { 25707 SEQUENCE4res sdi_seqres; 25708 }; 25710 struct SET_SSV4resok { 25711 opaque ssr_digest<>; 25712 }; 25714 union SET_SSV4res switch (nfsstat4 ssr_status) { 25715 case NFS4_OK: 25716 SET_SSV4resok ssr_resok4; 25717 default: 25718 void; 25719 }; 25721 18.47.3. DESCRIPTION 25723 This operation is used to update the SSV for a client ID. Before 25724 SET_SSV is called the first time on a client ID, the SSV is zero (0). 25725 The SSV is the key used for the SSV GSS mechanism (Section 2.10.9) 25727 SET_SSV MUST be preceded by a SEQUENCE operation in the same 25728 COMPOUND. It MUST NOT be used if the client did not opt for SP4_SSV 25729 state protection when the client ID was created (see Section 18.35); 25730 the server returns NFS4ERR_INVAL in that case. 25732 The field ssa_digest is computed as the output of the HMAC RFC2104 25733 [11] using the subkey derived from the SSV4_SUBKEY_MIC_I2T and 25734 current SSV as the key (See Section 2.10.9 for a description of 25735 subkeys), and an XDR encoded value of data type ssa_digest_input4. 25736 The field sdi_seqargs is equal to the arguments of the SEQUENCE 25737 operation for the COMPOUND procedure that SET_SSV is within. 25739 The argument ssa_ssv is XORed with the current SSV to produce the new 25740 SSV. The argument ssa_ssv SHOULD be generated randomly. 25742 In the response, ssr_digest is the output of the HMAC using the 25743 subkey derived from SSV4_SUBKEY_MIC_T2I and new SSV as the key, and 25744 an XDR encoded value of data type ssr_digest_input4. The field 25745 sdi_seqres is equal to the results of the SEQUENCE operation for the 25746 COMPOUND procedure that SET_SSV is within. 25748 As noted in Section 18.35, the client and server can maintain 25749 multiple concurrent versions of the SSV. The client and server each 25750 MUST maintain an internal SSV version number, which is set to one (1) 25751 the first time SET_SSV executes on the server and the client receives 25752 the first SET_SSV reply. Each subsequent SET_SSV increases the 25753 internal SSV version number by one (1). 
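The digest and SSV computations described for SET_SSV can be illustrated with a short, non-normative Python sketch. The hash below is simply assumed to be SHA-256 (the actual algorithm is whichever one was negotiated for the SSV GSS mechanism), derive_subkey() is a placeholder for the subkey derivation of Section 2.10.9, and xdr_encode() stands in for a real XDR encoder of ssa_digest_input4; none of these helper names are part of the protocol:

   # Non-normative sketch of the SET_SSV digest and SSV update.
   import hashlib
   import hmac

   def derive_subkey(ssv, usage):
       # Placeholder only; the real derivation is defined in Section 2.10.9.
       return hmac.new(ssv, usage, hashlib.sha256).digest()

   def xdr_encode(sequence_args_bytes):
       # Placeholder for XDR encoding of ssa_digest_input4.
       return sequence_args_bytes

   def make_ssa_digest(current_ssv, sequence_args_bytes):
       """ssa_digest: an HMAC keyed with the SSV4_SUBKEY_MIC_I2T subkey."""
       subkey = derive_subkey(current_ssv, b"SSV4_SUBKEY_MIC_I2T")
       return hmac.new(subkey, xdr_encode(sequence_args_bytes),
                       hashlib.sha256).digest()

   def apply_set_ssv(current_ssv, ssa_ssv):
       """The new SSV is the current SSV XORed with ssa_ssv."""
       return bytes(a ^ b for a, b in zip(current_ssv, ssa_ssv))

   ssv = bytes(32)                        # the SSV starts out as zero
   digest = make_ssa_digest(ssv, b"example SEQUENCE arguments")
   ssv = apply_set_ssv(ssv, b"\x5a" * 32)
   assert ssv == b"\x5a" * 32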
The value of this version 25754 number corresponds to the smpt_ssv_seq, smt_ssv_seq, sspt_ssv_seq, 25755 and ssct_ssv_seq fields of the SSV GSS mechanism tokens (see 25756 Section 2.10.9). 25758 18.47.4. IMPLEMENTATION 25760 When the server receives ssa_digest, it MUST verify the digest by 25761 computing the digest the same way the client did and comparing it 25762 with ssa_digest. If the server gets a different result, this is an 25763 error, NFS4ERR_BAD_SESSION_DIGEST. This error might be the result of 25764 another SET_SSV from the same client ID changing the SSV. If so, the 25765 client recovers by issuing SET_SSV again with a recomputed digest 25766 based on the subkey of the new SSV. If the transport connection is 25767 dropped after the SET_SSV request is sent, but before the SET_SSV 25768 reply is received, then there are special considerations for recovery 25769 if the client has no more connections associated with sessions 25770 associated with the client ID of the SSV. See Section 18.34.4. 25772 Clients SHOULD NOT send an ssa_ssv that is equal to a previous 25773 ssa_ssv, nor equal to a previous or current SSV (including an ssa_ssv 25774 equal to zero since the SSV is initialized to zero when the client ID 25775 is created). 25777 Clients SHOULD send SET_SSV with RPCSEC_GSS privacy. Servers MUST 25778 support RPCSEC_GSS with privacy for any COMPOUND that has { SEQUENCE, 25779 SET_SSV }. 25781 A client SHOULD NOT send SET_SSV with the SSV GSS mechanism's 25782 credential because the purpose of SET_SSV is to seed the SSV from 25783 non-SSV credentials. Instead, SET_SSV SHOULD be sent with the 25784 credential of a user that is accessing the client ID for the first 25785 time (Section 2.10.8.3). However, if the client does send SET_SSV 25786 with SSV credentials, the digest protecting the arguments uses the 25787 value of the SSV before ssa_ssv is XORed in, and the digest 25788 protecting the results uses the value of the SSV after the ssa_ssv is 25789 XORed in. 25791 18.48. Operation 55: TEST_STATEID - Test Stateids for Validity 25793 18.48.1. ARGUMENT 25795 struct TEST_STATEID4args { 25796 stateid4 ts_stateids<>; 25797 }; 25799 18.48.2. RESULT 25801 struct TEST_STATEID4resok { 25802 nfsstat4 tsr_status_codes<>; 25803 }; 25805 union TEST_STATEID4res switch (nfsstat4 tsr_status) { 25806 case NFS4_OK: 25807 TEST_STATEID4resok tsr_resok4; 25808 default: 25809 void; 25810 }; 25812 18.48.3. DESCRIPTION 25814 The TEST_STATEID operation is used to check the validity of a set of 25815 stateids. It can be used at any time, but the client should 25816 definitely use it when it receives an indication that one or more of 25817 its stateids have been invalidated due to lock revocation. This 25818 occurs when the SEQUENCE operation returns with one of the following 25819 sr_status_flags set: 25821 o SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED 25823 o SEQ4_STATUS_ADMIN_STATE_REVOKED 25825 o SEQ4_STATUS_RECALLABLE_STATE_REVOKED 25827 The client can use TEST_STATEID one or more times to test the 25828 validity of its stateids. Each use of TEST_STATEID allows a large 25829 set of such stateids to be tested and allows problems with earlier 25830 stateids not to interfere with checking of subsequent ones as would 25831 happen if individual stateids are tested by operation in a COMPOUND. 25833 For each stateid, the server returns the status code that would be 25834 returned if that stateid were to be used in normal operation.
25835 Returning such a status indication is not an error and does not cause 25836 compound processing to terminate. Checks for the validity of the 25837 stateid proceed as they would for normal operations with a number of 25838 exceptions: 25840 o There is no check for the type of stateid object, as would be the 25841 case for normal use of a stateid. 25843 o There is no reference to the current filehandle. 25845 o Special stateids are always considered invalid (they result in the 25846 error code NFS4ERR_BAD_STATEID). 25848 All stateids are interpreted as being associated with the client for 25849 the current session. Any possible association with a previous 25850 instance of the client (as stale stateids) is not considered. 25852 The errors which are validly returned within the status_code array 25853 are: NFS4ERR_OK, NFS4ERR_BAD_STATEID, NFS4ERR_OLD_STATEID, 25854 NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED. 25856 18.48.4. IMPLEMENTATION 25858 See Section 8.2.2 and Section 8.2.4 for a discussion of stateid 25859 structure, lifetime, and validation. 25861 18.49. Operation 56: WANT_DELEGATION - Request Delegation 25862 18.49.1. ARGUMENT 25864 union deleg_claim4 switch (open_claim_type4 dc_claim) { 25865 /* 25866 * No special rights to object. Ordinary delegation 25867 * request of the specified object. Object identified 25868 * by filehandle. 25869 */ 25870 case CLAIM_FH: /* new to v4.1 */ 25871 /* CURRENT_FH: object being delegated */ 25872 void; 25874 /* 25875 * Right to file based on a delegation granted 25876 * to a previous boot instance of the client. 25877 * File is specified by filehandle. 25878 */ 25879 case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ 25880 /* CURRENT_FH: object being delegated */ 25881 void; 25883 /* 25884 * Right to the file established by an open previous 25885 * to server reboot. File identified by filehandle. 25886 * Used during server reclaim grace period. 25887 */ 25888 case CLAIM_PREVIOUS: 25889 /* CURRENT_FH: object being reclaimed */ 25890 open_delegation_type4 dc_delegate_type; 25891 }; 25893 struct WANT_DELEGATION4args { 25894 uint32_t wda_want; 25895 deleg_claim4 wda_claim; 25896 }; 25898 18.49.2. RESULT 25900 union WANT_DELEGATION4res switch (nfsstat4 wdr_status) { 25901 case NFS4_OK: 25902 open_delegation4 wdr_resok4; 25903 default: 25904 void; 25905 }; 25907 18.49.3. DESCRIPTION 25909 Where this description mandates the return of a specific error code 25910 for a specific condition, and where multiple conditions apply, the 25911 server MAY return any of the mandated error codes. 25913 This operation allows a client to: 25915 o Get a delegation on all types of files except directories. 25917 o Register a "want" for a delegation for the specified file object, 25918 and be notified via a callback when the delegation is available. 25919 The server MAY support notifications of availability via 25920 callbacks. If the server does not support registration of wants 25921 it MUST NOT return an error to indicate that, and instead MUST 25922 return with ond_why set to WND4_CONTENTION or WND4_RESOURCE and 25923 ond_server_will_push_deleg or ond_server_will_signal_avail set to 25924 FALSE. When the server indicates that it will notify the client 25925 by means of a callback, it will either provide the delegation 25926 using a CB_PUSH_DELEG operation, or cancel its promise by sending 25927 a CB_WANTS_CANCELLED operation. 25929 o Cancel a want for a delegation. 
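As a non-normative illustration of the fallback described in the second bullet above, the following Python sketch (the class and function names are hypothetical stand-ins for the open_delegation4 and open_none_delegation4 XDR structures) shows a server that cannot register wants still answering NFS4_OK while promising no callback:

   # Non-normative sketch: declining to register a want without signalling
   # an error, per the second bullet above.
   from dataclasses import dataclass

   @dataclass
   class NoDelegation:
       ond_why: str        # "WND4_CONTENTION" or "WND4_RESOURCE"
       will_notify: bool   # stands in for ond_server_will_push_deleg /
                           # ond_server_will_signal_avail

   def want_delegation_fallback(reason="WND4_RESOURCE"):
       assert reason in ("WND4_CONTENTION", "WND4_RESOURCE")
       # No error is returned; the client simply learns that no availability
       # callback will be forthcoming for this want.
       return "NFS4_OK", NoDelegation(ond_why=reason, will_notify=False)

   status, why_none = want_delegation_fallback()
   assert status == "NFS4_OK" and why_none.will_notify is False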
25931 The client SHOULD NOT set OPEN4_SHARE_ACCESS_READ and SHOULD NOT set 25932 OPEN4_SHARE_ACCESS_WRITE in wda_want. If it does, the server MUST 25933 ignore them. 25935 The meanings of the following flags in wda_want are the same as they 25936 are in OPEN: 25938 o OPEN4_SHARE_ACCESS_WANT_READ_DELEG 25940 o OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 25942 o OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 25944 o OPEN4_SHARE_ACCESS_WANT_NO_DELEG 25946 o OPEN4_SHARE_ACCESS_WANT_CANCEL 25948 o OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 25950 o OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 25952 The handling of the above flags in WANT_DELEGATION is the same as in 25953 OPEN. Information about the delegation and/or the promises the 25954 server is making regarding future callbacks is the same as that 25955 described in the open_delegation4 structure. 25957 The successful results of WANT_DELEGATION are of type open_delegation4, 25958 which is the same type as the "delegation" field in the results of 25959 the OPEN operation (see Section 18.16.3). The server constructs 25960 wdr_resok4 the same way it constructs OPEN's "delegation" with one 25961 difference: WANT_DELEGATION MUST NOT return a delegation type of 25962 OPEN_DELEGATE_NONE. 25964 If (wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is zero, then the 25965 client is indicating no desire for a delegation and the server MUST 25966 return NFS4ERR_INVAL. 25968 The client uses the OPEN4_SHARE_ACCESS_WANT_NO_DELEG flag in the 25969 WANT_DELEGATION operation to cancel a previously requested want for a 25970 delegation. Note that if the server is in the process of sending the 25971 delegation (via CB_PUSH_DELEG) at the time the client sends a 25972 cancellation of the want, the delegation might still be pushed to the 25973 client. 25975 If WANT_DELEGATION fails to return a delegation, and the server 25976 returns NFS4_OK, the server MUST set the delegation type to 25977 OPEN_DELEGATE_NONE_EXT, and set od_whynone, as described in 25978 Section 18.16. Write delegations are not available for file types 25979 that are not writable. This includes file objects of types: NF4BLK, 25980 NF4CHR, NF4LNK, NF4SOCK, and NF4FIFO. If the client requests 25981 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG without 25982 OPEN4_SHARE_ACCESS_WANT_READ_DELEG on an object with one of the 25983 aforementioned file types, the server must set ond_why to 25984 WND4_WRITE_DELEG_NOT_SUPP_FTYPE. 25986 18.49.4. IMPLEMENTATION 25988 A request for a conflicting delegation is not normally intended to 25989 trigger the recall of the existing delegation. Servers may choose to 25990 treat some clients as having higher priority such that their wants 25991 will trigger recall of an existing delegation, although that is 25992 expected to be an unusual situation. 25994 Servers will generally recall delegations assigned by WANT_DELEGATION 25995 on the same basis as those assigned by OPEN. CB_RECALL will 25996 generally be done only when other clients perform operations 25997 inconsistent with the delegation. The normal response to aging of 25998 delegations is to use CB_RECALL_ANY, in order to give the client the 25999 opportunity to keep the delegations most useful from its point of 26000 view. 26002 18.50. Operation 57: DESTROY_CLIENTID - Destroy a Client ID 26004 18.50.1. ARGUMENT 26006 struct DESTROY_CLIENTID4args { 26007 clientid4 dca_clientid; 26008 }; 26010 18.50.2. RESULT 26012 struct DESTROY_CLIENTID4res { 26013 nfsstat4 dcr_status; 26014 }; 26016 18.50.3.
DESCRIPTION 26018 The DESTROY_CLIENTID operation destroys the client ID. If there are 26019 sessions (both idle and non-idle), opens, locks, delegations, 26020 layouts, and/or wants (Section 18.49) associated with the unexpired 26021 lease of the client ID, the server MUST return NFS4ERR_CLIENTID_BUSY. 26022 DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as 26023 the client ID derived from the session ID of SEQUENCE is not the same 26024 as the client ID to be destroyed. If the client IDs are the same, 26025 then the server MUST return NFS4ERR_CLIENTID_BUSY. 26027 If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only 26028 operation in the COMPOUND request (otherwise the server MUST return 26029 NFS4ERR_NOT_ONLY_OP). If the operation is sent without a SEQUENCE 26030 preceding it, a client that retransmits the request may receive an 26031 error in response, because the original request might have been 26032 successfully executed. 26034 18.50.4. IMPLEMENTATION 26036 DESTROY_CLIENTID allows a server to immediately reclaim the resources 26037 consumed by an unused client ID, and also to forget that it ever 26038 generated the client ID. By forgetting it ever generated the client 26039 ID the server can safely reuse the client ID on a future EXCHANGE_ID 26040 operation. 26042 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims Finished 26043 18.51.1. ARGUMENT 26045 struct RECLAIM_COMPLETE4args { 26046 /* 26047 * If rca_one_fs TRUE, 26048 * 26049 * CURRENT_FH: object in 26050 * filesystem reclaim is 26051 * complete for. 26052 */ 26053 bool rca_one_fs; 26054 }; 26056 18.51.2. RESULTS 26058 struct RECLAIM_COMPLETE4res { 26059 nfsstat4 rcr_status; 26060 }; 26062 18.51.3. DESCRIPTION 26064 A RECLAIM_COMPLETE operation is used to indicate that the client has 26065 reclaimed all of the locking state that it will recover, when it is 26066 recovering state due to either a server restart or the transfer of a 26067 file system to another server. There are two types of 26068 RECLAIM_COMPLETE operations: 26070 o When rca_one_fs is FALSE, a global RECLAIM_COMPLETE is being done. 26071 This indicates that recovery of all locks that the client held on 26072 the previous server instance have been completed. 26074 o When rca_one_fs is TRUE, a file system-specific RECLAIM_COMPLETE 26075 is being done. This indicates that recovery of locks for a single 26076 fs (the one designated by the current filehandle) due to a file 26077 system transition have been completed. Presence of a current 26078 filehandle is only required when rca_one_fs is set to TRUE. 26080 Once a RECLAIM_COMPLETE is done, there can be no further reclaim 26081 operations for locks whose scope is defined as having completed 26082 recovery. Once the client sends RECLAIM_COMPLETE, the server will 26083 not allow the client to do subsequent reclaims of locking state for 26084 that scope and if these are attempted, will return NFS4ERR_NO_GRACE. 26086 Whenever a client establishes a new client ID and before it does the 26087 first non-reclaim operation that obtains a lock, it MUST send a 26088 RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no 26089 locks to reclaim. If non-reclaim locking operations are done before 26090 the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned. 
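The ordering requirement just described is illustrated by the following non-normative Python outline. Everything other than the ordering itself is a hypothetical placeholder for the client's internal steps:

   # Non-normative outline of recovery after a new client ID is established.
   def recover_after_server_restart(session, locks_to_reclaim):
       for lock in locks_to_reclaim:
           reclaim_lock(session, lock)      # e.g., an OPEN with CLAIM_PREVIOUS
       # Global RECLAIM_COMPLETE, sent even when there was nothing to reclaim.
       reclaim_complete(session, rca_one_fs=False)
       # Only now may ordinary, non-reclaim locking operations be sent;
       # sending them earlier would draw NFS4ERR_GRACE.

   def reclaim_lock(session, lock):
       pass   # placeholder

   def reclaim_complete(session, rca_one_fs):
       pass   # placeholder

   recover_after_server_restart(session=None, locks_to_reclaim=[])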
26092 Similarly, when the client accesses a file system on a new server, 26093 before it sends the first non-reclaim operation that obtains a lock 26094 on this new server, it MUST send a RECLAIM_COMPLETE with rca_one_fs 26095 set to TRUE and current filehandle within that file system, even if 26096 there are no locks to reclaim. If non-reclaim locking operations are 26097 done on that file system before the RECLAIM_COMPLETE, an 26098 NFS4ERR_GRACE error will be returned. 26100 Any locks not reclaimed at the point at which RECLAIM_COMPLETE is 26101 done become non-reclaimable. The client MUST NOT attempt to reclaim 26102 them, either during the current server instance or in any subsequent 26103 server instance, or on another server to which responsibility for 26104 that file system is transferred. If the client were to do so, it 26105 would be violating the protocol by representing itself as owning 26106 locks that it does not own, and so has no right to reclaim. See 26107 Section 8.4.3 for a discussion of edge conditions related to lock 26108 reclaim. 26110 By sending a RECLAIM_COMPLETE, the client indicates readiness to 26111 proceed to do normal non-reclaim locking operations. The client 26112 should be aware that such operations may temporarily result in 26113 NFS4ERR_GRACE errors until the server is ready to terminate its grace 26114 period. 26116 18.51.4. IMPLEMENTATION 26118 Servers will typically use the information as to when reclaim 26119 activity is complete to reduce the length of the grace period. When 26120 the server maintains in persistent storage a list of clients that 26121 might have had locks, it is in a position to use the fact that all 26122 such clients have done a RECLAIM_COMPLETE to terminate the grace 26123 period and begin normal operations (i.e. grant requests for new 26124 locks) sooner than it might otherwise. 26126 Latency can be minimized by doing a RECLAIM_COMPLETE as part of the 26127 COMPOUND request in which the last lock-reclaiming operation is done. 26128 When there are no reclaims to be done, RECLAIM_COMPLETE should be 26129 done immediately in order to allow the grace period to end as soon as 26130 possible. 26132 RECLAIM_COMPLETE should only be done once for each server instance, 26133 or occasion of the transition of a file system. If it is done a 26134 second time, the error NFS4ERR_COMPLETE_ALREADY will result. Note 26135 that because of the session feature's retry protection, retries of 26136 COMPOUND requests containing RECLAIM_COMPLETE operation will not 26137 result in this error. 26139 When a RECLAIM_COMPLETE is done, the client effectively acknowledges 26140 any locks not yet reclaimed as lost. This allows the server to again 26141 mark this client as able to subsequently recover locks if it had been 26142 prevented from doing so, be by logic to prevent the occurrence of 26143 edge conditions, as described in Section 8.4.3. 26145 18.52. Operation 10044: ILLEGAL - Illegal operation 26147 18.52.1. ARGUMENTS 26149 void; 26151 18.52.2. RESULTS 26153 struct ILLEGAL4res { 26154 nfsstat4 status; 26155 }; 26157 18.52.3. DESCRIPTION 26159 This operation is a placeholder for encoding a result to handle the 26160 case of the client sending an operation code within COMPOUND that is 26161 not supported. See the COMPOUND procedure description for more 26162 details. 26164 The status field of ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL. 26166 18.52.4. 
IMPLEMENTATION 26168 A client will probably not send an operation with code OP_ILLEGAL but 26169 if it does, the response will be ILLEGAL4res just as it would be with 26170 any other invalid operation code. Note that if the server gets an 26171 illegal operation code that is not OP_ILLEGAL, and if the server 26172 checks for legal operation codes during the XDR decode phase, then 26173 the ILLEGAL4res would not be returned. 26175 19. NFSv4.1 Callback Procedures 26177 The procedures used for callbacks are defined in the following 26178 sections. In the interest of clarity, the terms "client" and 26179 "server" refer to NFS clients and servers, despite the fact that for 26180 an individual callback RPC, the sense of these terms would be 26181 precisely the opposite. 26183 Both procedures, CB_NULL and CB_COMPOUND, MUST be implemented. 26185 19.1. Procedure 0: CB_NULL - No Operation 26187 19.1.1. ARGUMENTS 26189 void; 26191 19.1.2. RESULTS 26193 void; 26195 19.1.3. DESCRIPTION 26197 CB_NULL is the standard ONC RPC NULL procedure, with the standard 26198 void argument and void response. Even though there is no direct 26199 functionality associated with this procedure, the server will use 26200 CB_NULL to confirm the existence of a path for RPCs from the server 26201 to client. 26203 19.1.4. ERRORS 26205 None. 26207 19.2. Procedure 1: CB_COMPOUND - Compound Operations 26209 19.2.1. ARGUMENTS 26211 enum nfs_cb_opnum4 { 26212 OP_CB_GETATTR = 3, 26213 OP_CB_RECALL = 4, 26214 /* Callback operations new to NFSv4.1 */ 26215 OP_CB_LAYOUTRECALL = 5, 26216 OP_CB_NOTIFY = 6, 26217 OP_CB_PUSH_DELEG = 7, 26218 OP_CB_RECALL_ANY = 8, 26219 OP_CB_RECALLABLE_OBJ_AVAIL = 9, 26220 OP_CB_RECALL_SLOT = 10, 26221 OP_CB_SEQUENCE = 11, 26222 OP_CB_WANTS_CANCELLED = 12, 26223 OP_CB_NOTIFY_LOCK = 13, 26224 OP_CB_NOTIFY_DEVICEID = 14, 26226 OP_CB_ILLEGAL = 10044 26227 }; 26228 union nfs_cb_argop4 switch (unsigned argop) { 26229 case OP_CB_GETATTR: 26230 CB_GETATTR4args opcbgetattr; 26231 case OP_CB_RECALL: 26232 CB_RECALL4args opcbrecall; 26233 case OP_CB_LAYOUTRECALL: 26234 CB_LAYOUTRECALL4args opcblayoutrecall; 26235 case OP_CB_NOTIFY: 26236 CB_NOTIFY4args opcbnotify; 26237 case OP_CB_PUSH_DELEG: 26238 CB_PUSH_DELEG4args opcbpush_deleg; 26239 case OP_CB_RECALL_ANY: 26240 CB_RECALL_ANY4args opcbrecall_any; 26241 case OP_CB_RECALLABLE_OBJ_AVAIL: 26242 CB_RECALLABLE_OBJ_AVAIL4args opcbrecallable_obj_avail; 26243 case OP_CB_RECALL_SLOT: 26244 CB_RECALL_SLOT4args opcbrecall_slot; 26245 case OP_CB_SEQUENCE: 26246 CB_SEQUENCE4args opcbsequence; 26247 case OP_CB_WANTS_CANCELLED: 26248 CB_WANTS_CANCELLED4args opcbwants_cancelled; 26249 case OP_CB_NOTIFY_LOCK: 26250 CB_NOTIFY_LOCK4args opcbnotify_lock; 26251 case OP_CB_NOTIFY_DEVICEID: 26252 CB_NOTIFY_DEVICEID4args opcbnotify_deviceid; 26253 case OP_CB_ILLEGAL: void; 26254 }; 26256 struct CB_COMPOUND4args { 26257 utf8str_cs tag; 26258 uint32_t minorversion; 26259 uint32_t callback_ident; 26260 nfs_cb_argop4 argarray<>; 26261 }; 26263 19.2.2. 
RESULTS 26265 union nfs_cb_resop4 switch (unsigned resop) { 26266 case OP_CB_GETATTR: CB_GETATTR4res opcbgetattr; 26267 case OP_CB_RECALL: CB_RECALL4res opcbrecall; 26269 /* new NFSv4.1 operations */ 26270 case OP_CB_LAYOUTRECALL: 26271 CB_LAYOUTRECALL4res 26272 opcblayoutrecall; 26274 case OP_CB_NOTIFY: CB_NOTIFY4res opcbnotify; 26276 case OP_CB_PUSH_DELEG: CB_PUSH_DELEG4res 26277 opcbpush_deleg; 26279 case OP_CB_RECALL_ANY: CB_RECALL_ANY4res 26280 opcbrecall_any; 26282 case OP_CB_RECALLABLE_OBJ_AVAIL: 26283 CB_RECALLABLE_OBJ_AVAIL4res 26284 opcbrecallable_obj_avail; 26286 case OP_CB_RECALL_SLOT: 26287 CB_RECALL_SLOT4res 26288 opcbrecall_slot; 26290 case OP_CB_SEQUENCE: CB_SEQUENCE4res opcbsequence; 26292 case OP_CB_WANTS_CANCELLED: 26293 CB_WANTS_CANCELLED4res 26294 opcbwants_cancelled; 26296 case OP_CB_NOTIFY_LOCK: 26297 CB_NOTIFY_LOCK4res 26298 opcbnotify_lock; 26300 case OP_CB_NOTIFY_DEVICEID: 26301 CB_NOTIFY_DEVICEID4res 26302 opcbnotify_deviceid; 26304 /* Not new operation */ 26305 case OP_CB_ILLEGAL: CB_ILLEGAL4res opcbillegal; 26306 }; 26307 struct CB_COMPOUND4res { 26308 nfsstat4 status; 26309 utf8str_cs tag; 26310 nfs_cb_resop4 resarray<>; 26311 }; 26313 19.2.3. DESCRIPTION 26315 The CB_COMPOUND procedure is used to combine one or more of the 26316 callback procedures into a single RPC request. The main callback RPC 26317 program has two main procedures: CB_NULL and CB_COMPOUND. All other 26318 operations use the CB_COMPOUND procedure as a wrapper. 26320 During the processing of the CB_COMPOUND procedure, the client may 26321 find that it does not have the available resources to execute any or 26322 all of the operations within the CB_COMPOUND sequence. Refer to 26323 Section 2.10.6.4 for details. 26325 The minorversion field of the arguments MUST be the same as the 26326 minorversion of the COMPOUND procedure used to created the client ID 26327 and session. For NFSv4.1, minorversion MUST be set to 1. 26329 Contained within the CB_COMPOUND results is a 'status' field. This 26330 status MUST be equal to the status of the last operation that was 26331 executed within the CB_COMPOUND procedure. Therefore, if an 26332 operation incurred an error then the 'status' value will be the same 26333 error value as is being returned for the operation that failed. 26335 The "tag" field is handled the same way as that of COMPOUND procedure 26336 (see Section 16.2.3). 26338 Illegal operation codes are handled in the same way as they are 26339 handled for the COMPOUND procedure. 26341 19.2.4. IMPLEMENTATION 26343 The CB_COMPOUND procedure is used to combine individual operations 26344 into a single RPC request. The client interprets each of the 26345 operations in turn. If an operation is executed by the client and 26346 the status of that operation is NFS4_OK, then the next operation in 26347 the CB_COMPOUND procedure is executed. The client continues this 26348 process until there are no more operations to be executed or one of 26349 the operations has a status value other than NFS4_OK. 26351 19.2.5. ERRORS 26353 CB_COMPOUND will of course return every error that each operation on 26354 the backchannel can return (see Table 7). However if CB_COMPOUND 26355 returns zero operations, obviously the error returned by COMPOUND has 26356 nothing to do with an error returned by an operation. 
The list of 26357 errors CB_COMPOUND will return if it processes zero operations 26358 include: 26360 CB_COMPOUND error returns 26362 +------------------------------+------------------------------------+ 26363 | Error | Notes | 26364 +------------------------------+------------------------------------+ 26365 | NFS4ERR_BADCHAR | The tag argument has a character | 26366 | | the replier does not support. | 26367 | NFS4ERR_BADXDR | | 26368 | NFS4ERR_DELAY | | 26369 | NFS4ERR_INVAL | The tag argument is not in UTF-8 | 26370 | | encoding. | 26371 | NFS4ERR_MINOR_VERS_MISMATCH | | 26372 | NFS4ERR_SERVERFAULT | | 26373 | NFS4ERR_TOO_MANY_OPS | | 26374 | NFS4ERR_REP_TOO_BIG | | 26375 | NFS4ERR_REP_TOO_BIG_TO_CACHE | | 26376 | NFS4ERR_REQ_TOO_BIG | | 26377 +------------------------------+------------------------------------+ 26379 Table 15 26381 20. NFSv4.1 Callback Operations 26383 20.1. Operation 3: CB_GETATTR - Get Attributes 26385 20.1.1. ARGUMENT 26387 struct CB_GETATTR4args { 26388 nfs_fh4 fh; 26389 bitmap4 attr_request; 26390 }; 26392 20.1.2. RESULT 26394 struct CB_GETATTR4resok { 26395 fattr4 obj_attributes; 26396 }; 26398 union CB_GETATTR4res switch (nfsstat4 status) { 26399 case NFS4_OK: 26400 CB_GETATTR4resok resok4; 26401 default: 26402 void; 26403 }; 26405 20.1.3. DESCRIPTION 26407 The CB_GETATTR operation is used by the server to obtain the current 26408 modified state of a file that has been write delegated. The 26409 attributes size and change are the only ones guaranteed to be 26410 serviced by the client. See Section 10.4.3 for a full description of 26411 how the client and server are to interact with the use of CB_GETATTR. 26413 If the filehandle specified is not one for which the client holds a 26414 write delegation, an NFS4ERR_BADHANDLE error is returned. 26416 20.1.4. IMPLEMENTATION 26418 The client returns attrmask bits and the associated attribute values 26419 only for the change attribute, and attributes that it may change 26420 (time_modify, and size). 26422 20.2. Operation 4: CB_RECALL - Recall a Delegation 26424 20.2.1. ARGUMENT 26426 struct CB_RECALL4args { 26427 stateid4 stateid; 26428 bool truncate; 26429 nfs_fh4 fh; 26430 }; 26432 20.2.2. RESULT 26434 struct CB_RECALL4res { 26435 nfsstat4 status; 26436 }; 26438 20.2.3. DESCRIPTION 26440 The CB_RECALL operation is used to begin the process of recalling a 26441 delegation and returning it to the server. 26443 The truncate flag is used to optimize recall for a file object which 26444 is a regular file and is about to be truncated to zero. When it is 26445 TRUE, the client is freed of the obligation to propagate modified 26446 data for the file to the server, since this data is irrelevant. 26448 If the handle specified is not one for which the client holds a 26449 delegation, an NFS4ERR_BADHANDLE error is returned. 26451 If the stateid specified is not one corresponding to an open 26452 delegation for the file specified by the filehandle, an 26453 NFS4ERR_BAD_STATEID is returned. 26455 20.2.4. IMPLEMENTATION 26457 The client SHOULD reply to the callback immediately. Replying does 26458 not complete the recall except when the value of the reply's status 26459 field is neither NFS4ERR_DELAY nor NFS4_OK. The recall is not 26460 complete until the delegation is returned using a DELEGRETURN 26461 operation. 26463 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from Client 26464 20.3.1. 
ARGUMENT 26466 /* 26467 * NFSv4.1 callback arguments and results 26468 */ 26470 enum layoutrecall_type4 { 26471 LAYOUTRECALL4_FILE = LAYOUT4_RET_REC_FILE, 26472 LAYOUTRECALL4_FSID = LAYOUT4_RET_REC_FSID, 26473 LAYOUTRECALL4_ALL = LAYOUT4_RET_REC_ALL 26474 }; 26476 struct layoutrecall_file4 { 26477 nfs_fh4 lor_fh; 26478 offset4 lor_offset; 26479 length4 lor_length; 26480 stateid4 lor_stateid; 26481 }; 26483 union layoutrecall4 switch(layoutrecall_type4 lor_recalltype) { 26484 case LAYOUTRECALL4_FILE: 26485 layoutrecall_file4 lor_layout; 26486 case LAYOUTRECALL4_FSID: 26487 fsid4 lor_fsid; 26488 case LAYOUTRECALL4_ALL: 26489 void; 26490 }; 26492 struct CB_LAYOUTRECALL4args { 26493 layouttype4 clora_type; 26494 layoutiomode4 clora_iomode; 26495 bool clora_changed; 26496 layoutrecall4 clora_recall; 26497 }; 26499 20.3.2. RESULT 26501 struct CB_LAYOUTRECALL4res { 26502 nfsstat4 clorr_status; 26503 }; 26505 20.3.3. DESCRIPTION 26507 The CB_LAYOUTRECALL operation is used by the server to recall layouts 26508 from the client; as a result, the client will begin the process of 26509 returning layouts via LAYOUTRETURN. The CB_LAYOUTRECALL operation 26510 specifies one of three forms of recall processing with the value of 26511 layoutrecall_type4. The recall is either for a specific layout (by 26512 file), for an entire file system (FSID), or for all file systems 26513 (ALL). 26515 The behavior of the operation varies based on the value of the 26516 layoutrecall_type4. The value and behaviors are: 26518 LAYOUTRECALL4_FILE 26520 For a layout to match the recall request, the values of the 26521 following fields must match those of the layout: clora_type, 26522 clora_iomode, lor_fh, and the byte range specified by lor_offset 26523 and lor_length. The clora_iomode field may have a special value 26524 of LAYOUTIOMODE4_ANY. The special value LAYOUTIOMODE4_ANY will 26525 match any iomode originally returned in a layout; therefore it 26526 acts as a wild card. The other special value used is for 26527 lor_length. If lor_length has a value of NFS4_UINT64_MAX, the 26528 lor_length field means the maximum possible file size. If a 26529 matching layout is found, it MUST be returned using the 26530 LAYOUTRETURN operation (see Section 18.44). An example of the 26531 field's special value use is if clora_iomode is LAYOUTIOMODE4_ANY, 26532 lor_offset is zero, and lor_length is NFS4_UINT64_MAX, then the 26533 entire layout is to be returned. 26535 The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the 26536 client does not hold layouts for the file or if the client does 26537 not have any overlapping layouts for the specification in the 26538 layout recall. 26540 LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL 26542 If LAYOUTRECALL4_FSID is specified, the fsid specifies the file 26543 system for which any outstanding layouts MUST be returned. If 26544 LAYOUTRECALL4_ALL is specified, all outstanding layouts MUST be 26545 returned. In addition, LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL 26546 specify that all the storage device ID to storage device address 26547 mappings in the affected file system(s) are also recalled. The 26548 respective LAYOUTRETURN with either LAYOUTRETURN4_FSID or 26549 LAYOUTRETURN4_ALL acknowledges to the server that the client 26550 invalidated the said device mappings. See Section 12.5.5.2.1.5 26551 for considerations with "bulk" recall of layouts. 
26553 The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the 26554 client does not hold layouts and does not have valid deviceid 26555 mappings. 26557 In processing the layout recall request, the client also varies its 26558 behavior based on the value of the clora_changed field. This field 26559 is used by the server to provide additional context for the reason 26560 why the layout is being recalled. A FALSE value for clora_changed 26561 indicates that no change in the layout is expected and the client may 26562 write modified data to the storage devices involved; this must be 26563 done prior to returning the layout via LAYOUTRETURN. A TRUE value 26564 for clora_changed indicates that the server is changing the layout. 26565 Examples of layout changes and reasons for a TRUE indication are: the 26566 metadata server is restriping the file or a permanent error has 26567 occurred on a storage device and the metadata server would like to 26568 provide a new layout for the file. Therefore, a clora_changed value 26569 of TRUE indicates some level of change for the layout and the client 26570 SHOULD NOT write and commit modified data to the storage devices. In 26571 this case, the client writes and commits data through the metadata 26572 server. 26574 See Section 12.5.3 for a description of how the lor_stateid field in 26575 the arguments is to be constructed. Note that the "seqid" field of 26576 lor_stateid MUST NOT be zero. See Section 8.2, Section 12.5.3, and 26577 Section 12.5.5.2 for a further discussion and requirements. 26579 20.3.4. IMPLEMENTATION 26581 The client's processing for CB_LAYOUTRECALL is similar to CB_RECALL 26582 (recall of file delegations) in that the client responds to the 26583 request before actually returning layouts via the LAYOUTRETURN 26584 operation. While the client responds to the CB_LAYOUTRECALL 26585 immediately, the operation is not considered complete (i.e. 26586 considered pending) until all affected layouts are returned to the 26587 server via the LAYOUTRETURN operation. 26589 Before returning the layout to the server via LAYOUTRETURN, the 26590 client should wait for the response from in-process or in-flight 26591 READ, WRITE, or COMMIT operations that use the recalled layout. 26593 If the client is holding modified data which is affected by a 26594 recalled layout, the client has various options for writing the data 26595 to the server. As always, the client may write the data through the 26596 metadata server. In fact, the client may not have a choice other 26597 than writing to the metadata server when the clora_changed argument 26598 is TRUE and a new layout is unavailable from the server. However, 26599 the client may be able to write the modified data to the storage 26600 device if the clora_changed argument is FALSE; this needs to be done 26601 before returning the layout via LAYOUTRETURN. If the client were to 26602 obtain a new layout covering the modified data's range, then writing 26603 to the storage devices is an available alternative. Note that before 26604 obtaining a new layout, the client must first return the original 26605 layout. 26607 In the case of modified data being written while the layout is held, 26608 the client must use LAYOUTCOMMIT operations at the appropriate time; 26609 as required LAYOUTCOMMIT must be done before the LAYOUTRETURN. 
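The write-back obligations described above amount to a small, ordered set of steps. The following non-normative Python sketch (the function name and step labels are illustrative only) lists the steps a client would take for a recalled layout that covers modified data:

   # Non-normative sketch of the ordering obligations described above.
   def recall_response_steps(has_dirty_data, clora_changed, needs_layoutcommit):
       """Ordered client steps before completing the recall."""
       steps = []
       if has_dirty_data:
           if clora_changed:
               # The layout is changing: write through the metadata server.
               steps.append("WRITE modified data through the metadata server")
           else:
               steps.append("WRITE modified data to the storage devices")
               if needs_layoutcommit:
                   steps.append("LAYOUTCOMMIT")   # must precede the return
       steps.append("LAYOUTRETURN")
       return steps

   assert recall_response_steps(True, False, True) == [
       "WRITE modified data to the storage devices",
       "LAYOUTCOMMIT",
       "LAYOUTRETURN"]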
If a 26610 large amount of modified data is outstanding, the client may send 26611 LAYOUTRETURNs for portions of the recalled layout; this allows the 26612 server to monitor the client's progress and adherence to the original 26613 recall request. However, the last LAYOUTRETURN in a sequence of 26614 returns, MUST specify the full range being recalled (see 26615 Section 12.5.5.1 for details). 26617 If a server needs to delete a device ID, and there are layouts 26618 referring to the device ID, CB_LAYOUTRECALL MUST be invoked to cause 26619 the client to return all layouts referring to device ID before the 26620 server can delete the device ID. If the client does not return the 26621 affected layouts, the server MAY revoke the layouts. 26623 20.4. Operation 6: CB_NOTIFY - Notify Client of Directory Changes 26625 20.4.1. ARGUMENT 26627 /* 26628 * Directory notification types. 26629 */ 26630 enum notify_type4 { 26631 NOTIFY4_CHANGE_CHILD_ATTRS = 0, 26632 NOTIFY4_CHANGE_DIR_ATTRS = 1, 26633 NOTIFY4_REMOVE_ENTRY = 2, 26634 NOTIFY4_ADD_ENTRY = 3, 26635 NOTIFY4_RENAME_ENTRY = 4, 26636 NOTIFY4_CHANGE_COOKIE_VERIFIER = 5 26637 }; 26639 /* Changed entry information. */ 26640 struct notify_entry4 { 26641 component4 ne_file; 26642 fattr4 ne_attrs; 26643 }; 26645 /* Previous entry information */ 26646 struct prev_entry4 { 26647 notify_entry4 pe_prev_entry; 26648 /* what READDIR returned for this entry */ 26649 nfs_cookie4 pe_prev_entry_cookie; 26650 }; 26652 struct notify_remove4 { 26653 notify_entry4 nrm_old_entry; 26654 nfs_cookie4 nrm_old_entry_cookie; 26656 }; 26658 struct notify_add4 { 26659 /* 26660 * Information on object 26661 * possibly renamed over. 26662 */ 26663 notify_remove4 nad_old_entry<1>; 26664 notify_entry4 nad_new_entry; 26665 /* what READDIR would have returned for this entry */ 26666 nfs_cookie4 nad_new_entry_cookie<1>; 26667 prev_entry4 nad_prev_entry<1>; 26668 bool nad_last_entry; 26669 }; 26671 struct notify_attr4 { 26672 notify_entry4 na_changed_entry; 26673 }; 26675 struct notify_rename4 { 26676 notify_remove4 nrn_old_entry; 26677 notify_add4 nrn_new_entry; 26678 }; 26680 struct notify_verifier4 { 26681 verifier4 nv_old_cookieverf; 26682 verifier4 nv_new_cookieverf; 26683 }; 26685 /* 26686 * Objects of type notify_<>4 and 26687 * notify_device_<>4 are encoded in this. 26688 */ 26689 typedef opaque notifylist4<>; 26691 struct notify4 { 26692 /* composed from notify_type4 or notify_deviceid_type4 */ 26693 bitmap4 notify_mask; 26694 notifylist4 notify_vals; 26695 }; 26697 struct CB_NOTIFY4args { 26698 stateid4 cna_stateid; 26699 nfs_fh4 cna_fh; 26700 notify4 cna_changes<>; 26701 }; 26703 20.4.2. RESULT 26705 struct CB_NOTIFY4res { 26706 nfsstat4 cnr_status; 26707 }; 26709 20.4.3. DESCRIPTION 26711 The CB_NOTIFY operation is used by the server to send notifications 26712 to clients about changes to delegated directories. The registration 26713 of notifications for the directories occurs when the delegation is 26714 established using GET_DIR_DELEGATION. These notifications are sent 26715 over the backchannel. The notification is sent once the original 26716 request has been processed on the server. The server will send an 26717 array of notifications for changes that might have occurred in the 26718 directory. The notifications are sent as list of pairs of bitmaps 26719 and values. See Section 3.3.7 for a description of how NFSv4.1 26720 bitmaps work. 
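The notify_mask bitmaps follow the general NFSv4.1 bitmap encoding of Section 3.3.7: bit N of 32-bit word W corresponds to notification number W*32+N. The following non-normative Python sketch decodes such a mask into the notify_type4 values defined above (the decoder name is illustrative only):

   # Non-normative decoder for a notify_mask bitmap (see Section 3.3.7).
   NOTIFY4_NAMES = {
       0: "NOTIFY4_CHANGE_CHILD_ATTRS",
       1: "NOTIFY4_CHANGE_DIR_ATTRS",
       2: "NOTIFY4_REMOVE_ENTRY",
       3: "NOTIFY4_ADD_ENTRY",
       4: "NOTIFY4_RENAME_ENTRY",
       5: "NOTIFY4_CHANGE_COOKIE_VERIFIER",
   }

   def decode_notify_mask(words):
       """Expand a bitmap4 (list of 32-bit words) into notification numbers."""
       present = []
       for word_index, word in enumerate(words):
           for bit in range(32):
               if word & (1 << bit):
                   present.append(word_index * 32 + bit)
       return present

   # Bits 2 and 3 set: a remove notification and an add notification.
   assert [NOTIFY4_NAMES[n] for n in decode_notify_mask([0b1100])] == [
       "NOTIFY4_REMOVE_ENTRY", "NOTIFY4_ADD_ENTRY"]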
26722 If the server has more notifications than can fit in the CB_COMPOUND 26723 request, it SHOULD send a sequence of serial CB_COMPOUND requests so 26724 that the client's view of the directory does not become confused. 26725 For example, if the server indicates that a file named "foo" is added and that 26726 the file "foo" is removed, the order in which the client receives 26727 these notifications needs to be the same as the order in which the 26728 corresponding operations occurred on the server. 26730 If the client holding the delegation makes any changes in the 26731 directory that cause files or subdirectories to be added or removed, 26732 the server will notify that client of the resulting change(s). If 26733 the client holding the delegation is making attribute or cookie 26734 verifier changes only, the server does not need to send notifications 26735 to that client. The server will send the following information for 26736 each operation: 26738 NOTIFY4_ADD_ENTRY 26739 The server will send information about the new directory entry 26740 being created along with the cookie for that entry. The entry 26741 information (data type notify_add4) includes the component name of 26742 the entry and its attributes. The server will send this type of entry 26743 when a file is actually being created, when an entry is being 26744 added to a directory as a result of a rename across directories 26745 (see below), and when a hard link is being created to an existing 26746 file. If this entry is added to the end of the directory, the 26747 server will set the nad_last_entry flag to TRUE. If the file is 26748 added such that there is at least one entry before it, the server 26749 will also return the previous entry information (nad_prev_entry, a 26750 variable length array of up to one element. If the array is of 26751 zero length, there is no previous entry), along with its cookie. 26752 This is to help clients find the right location in their file name 26753 caches and directory caches where this entry should be cached. If 26754 the new entry's cookie is available, it will be in the 26755 nad_new_entry_cookie (another variable length array of up to one 26756 element) field. If the addition of the entry causes another entry 26757 to be deleted (which can only happen in the rename case) 26758 atomically with the addition, then information on this entry is 26759 reported in nad_old_entry. 26761 NOTIFY4_REMOVE_ENTRY 26762 The server will send information about the directory entry being 26763 deleted. The server will also send the cookie value for the 26764 deleted entry so that clients can get to the cached information 26765 for this entry. 26767 NOTIFY4_RENAME_ENTRY 26768 The server will send information about both the old entry and the 26769 new entry. This includes the name and attributes for each entry. In 26770 addition, if the rename causes the deletion of an entry (i.e. the 26771 case of a file being renamed over), then this is reported in 26772 nrn_new_entry.nad_old_entry. This notification is only sent 26773 if both entries are in the same directory. If the rename is 26774 across directories, the server will send a remove notification to 26775 one directory and an add notification to the other directory, 26776 assuming both have a directory delegation. 26778 NOTIFY4_CHANGE_CHILD_ATTRS/NOTIFY4_CHANGE_DIR_ATTRS 26779 The client will use the attribute mask to inform the server of the 26780 attributes for which it wants to receive notifications.
This 26781 change notification can be requested for both changes to the 26782 attributes of the directory and changes to any file's 26783 attributes in the directory by using two separate attribute masks. 26784 The client cannot ask for change attribute notification for a 26785 specific file. One attribute mask covers all the files in the 26786 directory. Upon any attribute change, the server will send back 26787 the values of the changed attributes. Notifications might not make 26788 sense for some file-system-wide attributes, and it is up to the 26789 server to decide which subset it wants to support. The client can 26790 negotiate the frequency of attribute notifications by letting the 26791 server know how often it wants to be notified of an attribute 26792 change. The server will return supported notification frequencies 26793 or an indication that no notification is permitted for directory 26794 or child attributes by setting the dir_notif_delay and 26795 dir_entry_notif_delay attributes, respectively. 26797 NOTIFY4_CHANGE_COOKIE_VERIFIER 26798 If the cookie verifier changes while a client is holding a 26799 delegation, the server will notify the client so that it can 26800 invalidate its cookies and re-send a READDIR to get the new set of 26801 cookies. 26803 20.5. Operation 7: CB_PUSH_DELEG - Offer Previously Requested 26804 Delegation to Client 26806 20.5.1. ARGUMENT 26808 struct CB_PUSH_DELEG4args { 26809 nfs_fh4 cpda_fh; 26810 open_delegation4 cpda_delegation; 26812 }; 26814 20.5.2. RESULT 26816 struct CB_PUSH_DELEG4res { 26817 nfsstat4 cpdr_status; 26818 }; 26820 20.5.3. DESCRIPTION 26822 CB_PUSH_DELEG is used by the server to both signal to the client that 26823 the delegation it wants (previously indicated via a want established 26824 from an OPEN or WANT_DELEGATION operation) is available and to 26825 simultaneously offer the delegation to the client. The client has 26826 the choice of accepting the delegation by returning NFS4_OK to the 26827 server, delaying the decision to accept the offered delegation by 26828 returning NFS4ERR_DELAY, or permanently rejecting the offer of the 26829 delegation by returning NFS4ERR_REJECT_DELEG. When a delegation is 26830 rejected in this fashion, the want previously established is 26831 permanently deleted and the delegation is subject to acquisition by 26832 another client. 26834 20.5.4. IMPLEMENTATION 26836 If the client does return NFS4ERR_DELAY and there is a conflicting 26837 delegation request, the server MAY process it at the expense of the 26838 client that returned NFS4ERR_DELAY. The client's want will not be 26839 cancelled, but MAY be processed behind other delegation requests or 26840 registered wants. 26842 When a client returns a status other than NFS4_OK, NFS4ERR_DELAY, or 26843 NFS4ERR_REJECT_DELEG, the want remains pending, although servers may 26844 decide to cancel the want by sending a CB_WANTS_CANCELLED. 26846 20.6. Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects 26848 20.6.1. ARGUMENT 26850 const RCA4_TYPE_MASK_RDATA_DLG = 0; 26851 const RCA4_TYPE_MASK_WDATA_DLG = 1; 26852 const RCA4_TYPE_MASK_DIR_DLG = 2; 26853 const RCA4_TYPE_MASK_FILE_LAYOUT = 3; 26854 const RCA4_TYPE_MASK_BLK_LAYOUT = 4; 26855 const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; 26856 const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; 26857 const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN = 12; 26858 const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX = 15; 26860 struct CB_RECALL_ANY4args { 26861 uint32_t craa_objects_to_keep; 26862 bitmap4 craa_type_mask; 26863 }; 26865 20.6.2.
RESULT 26867 struct CB_RECALL_ANY4res { 26868 nfsstat4 crar_status; 26869 }; 26871 20.6.3. DESCRIPTION 26873 The server may decide that it cannot hold all of the state for 26874 recallable objects, such as delegations and layouts, without running 26875 out of resources. In such a case, it is free to recall individual 26876 objects to reduce the load but this would be far from optimal. 26878 Because the general purpose of such recallable objects as delegations 26879 is to eliminate client interaction with the server, the server cannot 26880 interpret lack of recent use as indicating that the object is no 26881 longer useful. The absence of visible use may be the result of a 26882 large number of potential operations eliminated. In the case of 26883 layouts, the layout will be used explicitly but the metadata server 26884 does not have direct knowledge of such use. 26886 In order to implement an effective reclaim scheme for such objects, 26887 the server's knowledge of available resources must be used to 26888 determine when objects must be recalled with the clients selecting 26889 the actual objects to be returned. 26891 Server implementations may differ in their resource allocation 26892 requirements. For example, one server may share resources among all 26893 classes of recallable objects whereas another may use separate 26894 resource pools for layouts and for delegations, or further separate 26895 resources by types of delegations. 26897 When a given resource pool is over-utilized, the server can send a 26898 CB_RECALL_ANY to clients holding recallable objects of the types 26899 involved, allowing it to keep a certain number of such objects and 26900 return any excess. A mask specifies which types of objects are to be 26901 limited. The client chooses, based on its own knowledge of current 26902 usefulness, which of the objects in that class should be returned. 26904 A number of bits are defined. For some of these, ranges are defined 26905 and it is up to the definition of the storage protocol to specify how 26906 these are to be used. There are ranges reserved for object-based 26907 storage protocols and for other experimental storage protocols. An 26908 RFC defining such a storage protocol needs to specify how particular 26909 bits within its range are to be used. For example, it may specify a 26910 mapping between attributes of the layout (read vs. write, size of 26911 area) and the bit to be used or it may define a field in the layout 26912 where the associated bit position is made available by the server to 26913 the client. 26915 RCA4_TYPE_MASK_RDATA_DLG 26917 The client is to return read delegations on non-directory file 26918 objects. 26920 RCA4_TYPE_MASK_WDATA_DLG 26922 The client is to return write delegations on regular file objects. 26924 RCA4_TYPE_MASK_DIR_DLG 26926 The client is to return directory delegations. 26928 RCA4_TYPE_MASK_FILE_LAYOUT 26930 The client is to return layouts of type LAYOUT4_NFSV4_1_FILES. 26932 RCA4_TYPE_MASK_BLK_LAYOUT 26933 See [40] for a description. 26935 RCA4_TYPE_MASK_OBJ_LAYOUT_MIN to RCA4_TYPE_MASK_OBJ_LAYOUT_MAX 26937 See [39] for a description. 26939 RCA4_TYPE_MASK_OTHER_LAYOUT_MIN to RCA4_TYPE_MASK_OTHER_LAYOUT_MAX 26941 This range is reserved for telling the client to recall layouts of 26942 experimental or site specific layout types (see Section 3.3.13). 26944 When a bit is set in the type mask that corresponds to an undefined 26945 type of recallable object, NFS4ERR_INVAL MUST be returned. 
When a 26946 bit is set that corresponds to a defined type of object, but the 26947 client does not support an object of that type, NFS4ERR_INVAL MUST NOT 26948 be returned. Future minor versions of NFSv4 may expand the set of 26949 valid type mask bits. 26951 CB_RECALL_ANY specifies a count of objects that the client may keep 26952 as opposed to a count that the client must return. This is to avoid 26953 a potential race between a CB_RECALL_ANY that specified a count of objects to 26954 free and a set of client-originated operations to return layouts or 26955 delegations. As a result of such a race, the client and server would 26956 have differing ideas as to how many objects to return. Hence the 26957 client could mistakenly free too many. 26959 If resource demands prompt it, the server may send another 26960 CB_RECALL_ANY with a lower count, even if it has not yet received an 26961 acknowledgement from the client for a previous CB_RECALL_ANY with the 26962 same type mask. Although the possibility exists that these will be 26963 received by the client in an order different from the order in which 26964 they were sent, any such permutation of the callback stream is 26965 harmless. It is the job of the client to bring down the size of the 26966 recallable object set in line with each CB_RECALL_ANY received, and 26967 until that obligation is met, it cannot be canceled or modified by any 26968 subsequent CB_RECALL_ANY for the same type mask. Thus, if the server 26969 sends two CB_RECALL_ANY operations, the effect will be the same as if only the 26970 lower count had been sent, whatever the order of recall receipt. Note 26971 that this means that a server may not cancel the effect of a 26972 CB_RECALL_ANY by sending another recall with a higher count. When a 26973 CB_RECALL_ANY is received and the count is already within the limit 26974 set, or is above a limit that the client is working to get down to, 26975 that callback has no effect. 26977 Servers are generally free not to give out recallable objects when 26978 insufficient resources are available. Note that the effect of such a 26979 policy is implicitly to give precedence to existing objects relative 26980 to requested ones, with the result that resources might not be 26981 optimally used. To prevent this, servers are well advised to make 26982 the point at which they start issuing CB_RECALL_ANY callbacks 26983 somewhat below that at which they cease to give out new delegations 26984 and layouts. This allows the client to purge its less-used objects 26985 whenever appropriate and so continue to have its subsequent requests 26986 granted from the resources freed up by object returns. 26988 20.6.4. IMPLEMENTATION 26990 The client can choose to return any type of object specified by the 26991 mask. If a server wishes to limit the use of objects of a specific type, 26992 it should only specify that type in the mask sent. The client may 26993 choose not to return the requested objects, and it is up to the server to handle 26994 this situation, typically by doing specific recalls to properly limit 26995 resource usage. The server should give the client enough time to 26996 return objects before proceeding to specific recalls. This time 26997 should not be less than the lease period. 26999 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources for 27000 Recallable Objects 27002 20.7.1. ARGUMENT 27004 typedef CB_RECALL_ANY4args CB_RECALLABLE_OBJ_AVAIL4args; 27006 20.7.2. RESULT 27008 struct CB_RECALLABLE_OBJ_AVAIL4res { 27009 nfsstat4 croa_status; 27010 }; 27012 20.7.3.
DESCRIPTION 27014 CB_RECALLABLE_OBJ_AVAIL is used by the server to signal the client 27015 that the server has resources to grant recallable objects that might 27016 previously have been denied by OPEN, WANT_DELEGATION, GET_DIR_DELEG, 27017 or LAYOUTGET. 27019 The argument craa_objects_to_keep means the total number of 27020 recallable objects of the types indicated in the argument type_mask 27021 that the server believes it can allow the client to have, including 27022 the number of such objects the client already has. A client that 27023 tries to acquire more recallable objects than the server informs it 27024 can have runs the risk of having objects recalled. 27026 The server is not obligated to reserve the difference between the 27027 number of the objects the client currently has and the value of 27028 craa_objects_to_keep, nor does delaying the reply to 27029 CB_RECALLABLE_OBJ_AVAIL prevent the server from using the resources 27030 of the recallable objects for another purpose. Indeed, if a client 27031 responds slowly to CB_RECALLABLE_OBJ_AVAIL, the server might 27032 interpret the client as having reduced capability to manage 27033 recallable objects, and so cancel or reduce any reservation it is 27034 maintaining on behalf of the client. Thus if the client desires to 27035 acquire more recallable objects, it needs to reply quickly to 27036 CB_RECALLABLE_OBJ_AVAIL, and then send the appropriate operations to 27037 acquire recallable objects. 27039 20.8. Operation 10: CB_RECALL_SLOT - Change Flow Control Limits 27041 20.8.1. ARGUMENT 27043 struct CB_RECALL_SLOT4args { 27044 slotid4 rsa_target_highest_slotid; 27045 }; 27047 20.8.2. RESULT 27049 struct CB_RECALL_SLOT4res { 27050 nfsstat4 rsr_status; 27051 }; 27053 20.8.3. DESCRIPTION 27055 The CB_RECALL_SLOT operation requests the client to return session 27056 slots, and if applicable, transport credits (e.g. RDMA credits for 27057 connections associated with the operations channel) of the session's 27058 fore channel. CB_RECALL_SLOT specifies rsa_target_highest_slotid, 27059 the value of the target highest slot ID the server wants for the 27060 session. The client MUST then progress toward reducing the session's 27061 highest slot ID to the target value. 27063 If the session has only non-RDMA connections associated with its 27064 operations channel, then the client need only wait for all 27065 outstanding requests with a slot ID > rsa_target_highest_slotid to 27066 complete, then send a single COMPOUND consisting of a single SEQUENCE 27067 operation, with the sa_highestslot field set to 27068 rsa_target_highest_slotid. If there are RDMA-based connections 27069 associated with operation channel, then the client needs to also send 27070 enough zero-length RDMA Sends to take the total RDMA credit count to 27071 rsa_target_highest_slotid + 1 or below. 27073 20.8.4. IMPLEMENTATION 27075 If the client fails to reduce highest slot it has on the fore channel 27076 to what the server requests, the server can force the issue by 27077 asserting flow control on the receive side of all connections bound 27078 to the fore channel, and then finish servicing all outstanding 27079 requests that are in slots greater than rsa_target_highest_slotid. 27080 Once that is done, the server can then open the flow control, and any 27081 time the client sends a new request on a slot greater than 27082 rsa_target_highest_slotid, the server can return NFS4ERR_BADSLOT. 27084 20.9. 
Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing and 27085 Control 27087 20.9.1. ARGUMENT 27089 struct referring_call4 { 27090 sequenceid4 rc_sequenceid; 27091 slotid4 rc_slotid; 27092 }; 27094 struct referring_call_list4 { 27095 sessionid4 rcl_sessionid; 27096 referring_call4 rcl_referring_calls<>; 27097 }; 27099 struct CB_SEQUENCE4args { 27100 sessionid4 csa_sessionid; 27101 sequenceid4 csa_sequenceid; 27102 slotid4 csa_slotid; 27103 slotid4 csa_highest_slotid; 27104 bool csa_cachethis; 27105 referring_call_list4 csa_referring_call_lists<>; 27106 }; 27108 20.9.2. RESULT 27110 struct CB_SEQUENCE4resok { 27111 sessionid4 csr_sessionid; 27112 sequenceid4 csr_sequenceid; 27113 slotid4 csr_slotid; 27114 slotid4 csr_highest_slotid; 27115 slotid4 csr_target_highest_slotid; 27116 }; 27118 union CB_SEQUENCE4res switch (nfsstat4 csr_status) { 27119 case NFS4_OK: 27120 CB_SEQUENCE4resok csr_resok4; 27121 default: 27122 void; 27123 }; 27125 20.9.3. DESCRIPTION 27127 The CB_SEQUENCE operation is used to manage operational accounting 27128 for the backchannel of the session on which a request is sent. The 27129 contents include the session ID to which this request belongs, the 27130 slot ID and sequence ID used by the server to implement session 27131 request control and exactly once semantics, and exchanged slot ID 27132 maxima which are used to adjust the size of the reply cache. This 27133 operation will appear once as the first operation in each CB_COMPOUND 27134 request or a protocol error MUST result. See Section 18.46.3 for a 27135 description of how slots are processed. 27137 If csa_cachethis is TRUE, then the server is requesting that the 27138 client cache the reply in the callback reply cache. The client MUST 27139 cache the reply (see Section 2.10.6.1.3). 27141 The csa_referring_call_lists array is the list of COMPOUND requests, 27142 identified by session ID, slot ID and sequence ID. These are 27143 requests that the client previously sent to the server. These 27144 previous requests created state that some operation(s) in the same 27145 CB_COMPOUND as the csa_referring_call_lists are identifying. A 27146 session ID is included because leased state is tied to a client ID, 27147 and a client ID can have multiple sessions. See Section 2.10.6.3. 27149 The value of the csa_sequenceid argument relative to the cached 27150 sequence ID on the slot falls into one of three cases. 27152 o If the difference between csa_sequenceid and the client's cached 27153 sequence ID at the slot ID is two (2) or more, or if 27154 csa_sequenceid is less than the cached sequence ID (accounting for 27155 wraparound of the unsigned sequence ID value), then the client 27156 MUST return NFS4ERR_SEQ_MISORDERED. 27158 o If csa_sequenceid and the cached sequence ID are the same, this is 27159 a retry, and the client returns the CB_COMPOUND request's cached 27160 reply. 27162 o If csa_sequenceid is one greater (accounting for wraparound) than 27163 the cached sequence ID, then this is a new request, and the slot's 27164 sequence ID is incremented. The operations subsequent to 27165 CB_SEQUENCE, if any, are processed. If there are no other 27166 operations, the only other effects are to cache the CB_SEQUENCE 27167 reply in the slot, maintain the session's activity, and when the 27168 server receives the CB_SEQUENCE reply, renew the lease of state 27169 related to the client ID. 
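The three cases above amount to an unsigned comparison with wraparound. The following is an informal sketch of that classification, assuming the 32-bit sequenceid4 space; it merely restates the rules and is not normative pseudocode.

   # Illustrative classification of csa_sequenceid against the sequence ID
   # cached in the slot, using 32-bit unsigned wraparound arithmetic.
   def classify_cb_sequence(csa_sequenceid, cached_sequenceid):
       diff = (csa_sequenceid - cached_sequenceid) % (2 ** 32)
       if diff == 0:
           return "RETRY"        # return the cached CB_COMPOUND reply
       if diff == 1:
           return "NEW_REQUEST"  # bump the slot's sequence ID, process ops
       return "MISORDERED"       # reply with NFS4ERR_SEQ_MISORDERED

   assert classify_cb_sequence(5, 5) == "RETRY"
   assert classify_cb_sequence(6, 5) == "NEW_REQUEST"
   assert classify_cb_sequence(0, 0xFFFFFFFF) == "NEW_REQUEST"  # wraparound
   assert classify_cb_sequence(4, 5) == "MISORDERED"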
27171 If the server reuses a slot ID and sequence ID for a completely 27172 different request, the client MAY treat the request as if it is retry 27173 of what it has already executed. The client MAY however detect the 27174 server's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY. 27176 If CB_SEQUENCE returns an error, then the state of the slot (sequence 27177 ID, cached reply) MUST NOT change. 27179 The client returns two "highest_slotid" values: csr_highest_slotid, 27180 and csr_target_highest_slotid. The former is the highest slot ID the 27181 client will accept in a future CB_SEQUENCE operation, and SHOULD NOT 27182 be less than the value of csa_highest_slotid (but see 27183 Section 2.10.6.1 for an exception). The latter is the highest slot 27184 ID the client would prefer the server use on a future CB_SEQUENCE 27185 operation. 27187 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending Delegation 27188 Wants 27190 20.10.1. ARGUMENT 27192 struct CB_WANTS_CANCELLED4args { 27193 bool cwca_contended_wants_cancelled; 27194 bool cwca_resourced_wants_cancelled; 27195 }; 27197 20.10.2. RESULT 27199 struct CB_WANTS_CANCELLED4res { 27200 nfsstat4 cwcr_status; 27201 }; 27203 20.10.3. DESCRIPTION 27205 The CB_WANTS_CANCELLED operation is used to notify the client that 27206 the some or all wants it registered for recallable delegations and 27207 layouts have been canceled. 27209 If cwca_contended_wants_cancelled is TRUE, this indicates the server 27210 will not be pushing to the client any delegations that become 27211 available after contention passes. 27213 If cwca_resourced_wants_cancelled is TRUE, this indicates the server 27214 will not notify the client when there are resources on the server to 27215 grant delegations or layouts. 27217 After receiving a CB_WANTS_CANCELLED operation, the client is free to 27218 attempt to acquire the delegations or layouts it was waiting for, and 27219 possibly re-register wants. 27221 20.10.4. IMPLEMENTATION 27223 When a client has an OPEN, WANT_DELEGATION, or GET_DIR_DELEGATION 27224 request outstanding, when a CB_WANTS_CANCELLED is sent, the server 27225 may need to make clear to the client whether a promise to signal 27226 delegation availability happened before the CB_WANTS_CANCELLED and is 27227 thus covered by it, or after the CB_WANTS_CANCELLED in which case it 27228 was not covered by it. The server can make this distinction by 27229 putting the appropriate requests into the list of referring calls in 27230 the associated CB_SEQUENCE. 27232 20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible Lock 27233 Availability 27235 20.11.1. ARGUMENT 27237 struct CB_NOTIFY_LOCK4args { 27238 nfs_fh4 cnla_fh; 27239 lock_owner4 cnla_lock_owner; 27240 }; 27242 20.11.2. RESULT 27244 struct CB_NOTIFY_LOCK4res { 27245 nfsstat4 cnlr_status; 27246 }; 27248 20.11.3. DESCRIPTION 27250 The server can use this operation to indicate that a byte-range lock 27251 for the given file and lock-owner, previously requested by the client 27252 via an unsuccessful LOCK request, might be available. 27254 This callback is meant to be used by servers to help reduce the 27255 latency of blocking locks in the case where they recognize that a 27256 client which has been polling for a blocking lock may now be able to 27257 acquire the lock. If the server supports this callback for a given 27258 file, it MUST set the OPEN4_RESULT_MAY_NOTIFY_LOCK flag when 27259 responding to successful opens for that file. 
This does not commit 27260 the server to the use of CB_NOTIFY_LOCK, but the client may use this 27261 as a hint to decide how frequently to poll for locks derived from 27262 that open. 27264 If an OPEN operation results in an upgrade, in which the stateid 27265 returned has an "other" value matching that of a stateid already 27266 allocated, with a new "seqid" indicating a change in the lock being 27267 represented, then the value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag 27268 when responding to that new OPEN controls handling from that point 27269 going forward. When parallel OPENs are done on the same file and 27270 open-owner, the ordering of the "seqid" field of the returned stateid 27271 (subject to wraparound) are to be used to select the controlling 27272 value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag. 27274 20.11.4. IMPLEMENTATION 27276 The server MUST NOT grant the lock to the client unless and until it 27277 receives an actual LOCK request from the client. Similarly, the 27278 client receiving this callback cannot assume that it now has the 27279 lock, or that a subsequent LOCK request for the lock will be 27280 successful. 27282 The server is not required to implement this callback, and even if it 27283 does, it is not required to use it in any particular case. Therefore 27284 the client must still rely on polling for blocking locks, as 27285 described in Section 9.6. 27287 Similarly, the client is not required to implement this callback, and 27288 even it does, is still free to ignore it. Therefore the server MUST 27289 NOT assume that the client will act based on the callback. 27291 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device ID 27292 Changes 27294 20.12.1. ARGUMENT 27296 /* 27297 * Device notification types. 27298 */ 27299 enum notify_deviceid_type4 { 27300 NOTIFY_DEVICEID4_CHANGE = 1, 27301 NOTIFY_DEVICEID4_DELETE = 2 27302 }; 27304 /* For NOTIFY4_DEVICEID4_DELETE */ 27305 struct notify_deviceid_delete4 { 27306 layouttype4 ndd_layouttype; 27307 deviceid4 ndd_deviceid; 27308 }; 27310 /* For NOTIFY4_DEVICEID4_CHANGE */ 27311 struct notify_deviceid_change4 { 27312 layouttype4 ndc_layouttype; 27313 deviceid4 ndc_deviceid; 27314 bool ndc_immediate; 27315 }; 27317 struct CB_NOTIFY_DEVICEID4args { 27318 notify4 cnda_changes<>; 27319 }; 27321 20.12.2. RESULT 27323 struct CB_NOTIFY_DEVICEID4res { 27324 nfsstat4 cndr_status; 27325 }; 27327 20.12.3. DESCRIPTION 27329 The CB_NOTIFY_DEVICEID operation is used by the server to send 27330 notifications to clients about changes to pNFS device IDs. The 27331 registration of device ID notifications is optional and is done via 27332 GETDEVICEINFO. These notifications are sent over the backchannel 27333 once the original request has been processed on the server. The 27334 server will send an array of notifications, cnda_changes, as a list 27335 of pairs of bitmaps and values. See Section 3.3.7 for a description 27336 of how NFSv4.1 bitmaps work. 27338 As with CB_NOTIFY (Section 20.4.3), it is possible the server has 27339 more notifications than can fit in a CB_COMPOUND, thus requiring 27340 multiple CB_COMPOUNDs. Unlike CB_NOTIFY, serialization is not an 27341 issue because unlike directory entries, device IDs cannot be re-used 27342 after being deleted (Section 12.2.10). 27344 All device ID notifications contain a device ID and a layout type. 
27345 The layout type is necessary because two different layout types can 27346 share the same device ID, and the common device ID can have 27347 completely different mappings for each layout type. 27349 The server will send the following notifications: 27351 NOTIFY_DEVICEID4_CHANGE 27352 A previously provided device ID to device address mapping has 27353 changed and the client uses GETDEVICEINFO to obtain the updated 27354 mapping. The notification is encoded in a value of data type 27355 notify_deviceid_change4. This data type also contains a boolean 27356 field, ndc_immediate, which if TRUE indicates that the change will 27357 be enforced immediately, and so the client might not be able to 27358 complete any pending I/O to the device ID. If ndc_immediate is 27359 FALSE, then for an indefinite time, the client can complete 27360 pending I/O. After pending I/O is complete, the client SHOULD get 27361 the new device ID to device address mappings before issuing new 27362 I/O to the device ID. 27364 NOTIFY4_DEVICEID_DELETE 27365 Deletes a device ID from the mappings. This notification MUST NOT 27366 be sent if the client has a layout that refers to the device ID. 27367 In other words if the server is sending a delete device ID 27368 notification, one of the following is true for layouts associated 27369 with the layout type: 27371 * The client never had a layout referring to that device ID. 27373 * The client has returned all layouts referring to that device 27374 ID. 27376 * The server has revoked all layouts referring to that device ID. 27378 The notification is encoded in a value of data type 27379 notify_deviceid_delete4. After a server deletes a device ID, it 27380 MUST NOT reuse that device ID for the same layout type until the 27381 client ID is deleted. 27383 20.13. Operation 10044: CB_ILLEGAL - Illegal Callback Operation 27385 20.13.1. ARGUMENT 27387 void; 27389 20.13.2. RESULT 27391 /* 27392 * CB_ILLEGAL: Response for illegal operation numbers 27393 */ 27394 struct CB_ILLEGAL4res { 27395 nfsstat4 status; 27396 }; 27398 20.13.3. DESCRIPTION 27400 This operation is a placeholder for encoding a result to handle the 27401 case of the server sending an operation code within CB_COMPOUND that 27402 is not defined in the NFSv4.1 specification. See Section 19.2.3 for 27403 more details. 27405 The status field of CB_ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL. 27407 20.13.4. IMPLEMENTATION 27409 A server will probably not send an operation with code OP_CB_ILLEGAL 27410 but if it does, the response will be CB_ILLEGAL4res just as it would 27411 be with any other invalid operation code. Note that if the client 27412 gets an illegal operation code that is not OP_ILLEGAL, and if the 27413 client checks for legal operation codes during the XDR decode phase, 27414 then an instance of data type CB_ILLEGAL4res will not be returned. 27416 21. Security Considerations 27418 Historically the authentication of model of NFS had the entire 27419 machine being the NFS client, and the NFS server trusting the NFS 27420 client to authenticate the end-user. The NFS server in turn shared 27421 its files only to specific clients, as identified by the client's 27422 source network address. Given this model, the AUTH_SYS RPC security 27423 flavor simply identified the end-user using the client to the NFS 27424 server. When processing NFS responses, the client ensured that the 27425 responses came from the same network address and port number that the 27426 request was sent to. 
While such a model is easy to implement and 27427 simple to deploy and use, it is unsafe. Thus, NFSv4.1 27428 implementations are REQUIRED to support a security model that uses 27429 end-to-end authentication, where an end-user on a client mutually 27430 authenticates (via cryptographic schemes that do not expose passwords 27431 or keys in the clear on the network) to a principal on an NFS server. 27432 Consideration must also be given to the integrity and privacy of NFS 27433 requests and responses. The issues of end-to-end mutual 27434 authentication, integrity, and privacy are discussed in 27435 Section 2.2.1.1.1. There are specific considerations when using 27436 Kerberos V5, as described in Section 2.2.1.1.1.2.1.1. 27438 Note that being REQUIRED to implement does not mean REQUIRED to use; 27439 AUTH_SYS can be used by NFSv4.1 clients and servers. However, 27440 AUTH_SYS is merely an OPTIONAL security flavor in NFSv4.1, and so 27441 interoperability via AUTH_SYS is not assured. 27443 For reasons of reduced administration overhead, better performance, 27444 and/or reduction of CPU utilization, users of NFSv4.1 implementations 27445 may opt not to use security mechanisms that enable integrity 27446 protection on each remote procedure call and response. The use of 27447 mechanisms without integrity leaves the user vulnerable to an 27448 attacker positioned between the NFS client and server that modifies the 27449 RPC request and/or the response. While implementations are free to 27450 provide the option to use weaker security mechanisms, there are three 27451 operations in particular that warrant the implementation overriding 27452 user choices. 27454 o The first two such operations are SECINFO and SECINFO_NO_NAME. It 27455 is RECOMMENDED that the client send both operations such that they 27456 are protected with a security flavor that has integrity protection, 27457 such as RPCSEC_GSS with either the rpc_gss_svc_integrity or 27458 rpc_gss_svc_privacy service. Without integrity protection 27459 encapsulating SECINFO and SECINFO_NO_NAME and their results, an 27460 attacker in the middle could modify the results such that the client 27461 might select a weaker algorithm in the set allowed by the server, 27462 making the client and/or server vulnerable to further attacks. 27464 o The third operation that should definitely use integrity 27465 protection is any GETATTR for the fs_locations and 27466 fs_locations_info attributes. The attack has two steps. First, 27467 the attacker modifies the unprotected results of some operation to 27468 return NFS4ERR_MOVED. Second, when the client follows up with a 27469 GETATTR for the fs_locations or fs_locations_info attributes, the 27470 attacker modifies the results to cause the client to migrate its 27471 traffic to a server controlled by the attacker. 27473 Relative to previous NFS versions, NFSv4.1 has additional security 27474 considerations for pNFS (see Section 12.9 and Section 13.12), locking 27475 and session state (see Section 2.10.8.3), and state recovery during 27476 the grace period (see Section 8.4.2.1.1). 27478 22. IANA Considerations 27480 This section uses terms that are defined in [54]. 27482 22.1. Named Attribute Definitions 27484 IANA will create a registry called the "NFSv4 Named Attribute 27485 Definitions Registry". 27487 The NFSv4.1 protocol supports the association of a file with zero or 27488 more named attributes. The name space identifiers for these 27489 attributes are defined as string names.
The protocol does not define 27490 the specific assignment of the name space for these file attributes. 27491 An IANA registry will promote interoperability where common interests 27492 exist. While application developers are allowed to define and use 27493 attributes as needed, they are encouraged to register the attributes 27494 with IANA. 27496 Such registered named attributes are presumed to apply to all minor 27497 versions of NFSv4, including those defined subsequently to the 27498 registration. Where the named attribute is intended to be limited 27499 with regard to the minor versions for which they are not be used, the 27500 assignment in registry will clearly state the applicable limits. 27502 All assignments to the registry are made on a First Come First Served 27503 basis, per section 4.1 of [54]. The policy for each assignment is 27504 Specification Required, per section 4.1 of [54]. 27506 Under the NFSv4.1 specification, the name of a named attribute can in 27507 theory be up to 2^32 - 1 bytes in length, but in practice NFSv4.1 27508 clients and servers will be unable to a handle string that long. 27509 IANA should reject any assignment request with a named attribute that 27510 exceeds 128 UTF-8 characters. To give IESG the flexibility to set up 27511 bases of assignment of Experimental Use and Standards Action, the 27512 prefixes of "EXPE" and "STDS" are Reserved. The zero length named 27513 attribute name is Reserved. 27515 The prefix "PRIV" is allocated for Private Use. A site that wants to 27516 make use of unregistered named attributes without risk of conflicting 27517 with an assignment in IANA's registry should use the prefix "PRIV" in 27518 all of its named attributes. 27520 Because some NFSv4.1 clients and servers have case insensitive 27521 semantics, the fifteen additional lower case and mixed case 27522 permutations of each of "EXPE", "PRIV", and "STDS", are Reserved 27523 (e.g. "expe", "expE", "exPe", etc. are Reserved). Similarly, IANA 27524 must not allow two assignments that would conflict if both named 27525 attributes were converted to a common case. 27527 The registry of named attributes is a list of assignments, each 27528 containing three fields for each assignment. 27530 1. A US-ASCII string name that is the actual name of the attribute. 27531 This name must be unique. This string name can be 1 to 128 UTF-8 27532 characters long. 27534 2. A reference to the specification of the named attribute. The 27535 reference can consume up to 256 bytes (or more if IANA permits). 27537 3. The point of contact of the registrant. The point of contact can 27538 consume up to 256 bytes (or more if IANA permits). 27540 22.1.1. Initial Registry 27542 There is no initial registry. 27544 22.1.2. Updating Registrations 27546 The registrant is always permitted to update the point of contact 27547 field. To make any other change will require Expert Review or IESG 27548 Approval. 27550 22.2. Device ID Notifications 27552 IANA will create a registry called the "NFSv4.1 Device ID 27553 Notifications Registry". 27555 The potential exists for new notification types to be added to the 27556 CB_NOTIFY_DEVICEID operation Section 20.12. This can be done via 27557 changes to the operations that register notifications, or by adding 27558 new operations to NFSv4. This requires a new minor version of NFSv4, 27559 and requires a standards track document from IETF. Another way to 27560 add a notification is to specify a new layout type (see 27561 Section 22.4). 
27563 Hence all assignments to the registry are made on a Standards Action 27564 basis per section 4.1 of [54], with Expert Review required. 27566 The registry is a list of assignments, each containing five fields 27567 per assignment. 27569 1. The name of the notification type. This name must have the 27570 prefix: "NOTIFY_DEVICEID4_". This name must be unique. 27572 2. The value of the notification. IANA will assign this number, and 27573 the request from the registrant will use TBD1 instead of an 27574 actual value. IANA MUST use a whole number which can be no 27575 higher than 2^32-1, and should be the next available value. The 27576 value assigned must be unique. A Designated Expert must be used 27577 to ensure that when the name of the notification type and its 27578 value are added to the NFSv4.1 notify_deviceid_type4 enumerated 27579 data type in the NFSv4.1 XDR description ([12]), the result 27580 continues to be a valid XDR description. 27582 3. The Standards Track RFC(s) that describe the notification. If 27583 the RFC(s) have not yet been published, the registrant will use 27584 RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. 27586 4. How the RFC introduces the notification. This is indicated by a 27587 single US-ASCII value. If the value is N, it means a minor 27588 revision to the NFSv4 protocol. If the value is L, it means a 27589 new pNFS layout type. Other values can be used with IESG 27590 Approval. 27592 5. The minor versions of NFSv4 that are allowed to the use the 27593 notification. While these are numeric values, IANA will not 27594 allocate and assign them; the author of the relevant RFCs with 27595 IESG Approval assigns these numbers. Each time there is new 27596 minor version of NFSv4 approved, a Designated Expert should 27597 review the registry to make recommended updates as needed. 27599 22.2.1. Initial Registry 27601 The initial registry is in Table 16. Note that next available value 27602 is zero. 27604 +-------------------------+-------+----------+-----+----------------+ 27605 | Notification Name | Value | RFC | How | Minor Versions | 27606 +-------------------------+-------+----------+-----+----------------+ 27607 | NOTIFY_DEVICEID4_CHANGE | 1 | RFCTBD10 | N | 1 | 27608 | NOTIFY_DEVICEID4_DELETE | 2 | RFCTBD10 | N | 1 | 27609 +-------------------------+-------+----------+-----+----------------+ 27611 Table 16: Initial Device ID Notification Assignments 27613 22.2.2. Updating Registrations 27615 The update of a registration will require IESG Approval on the advice 27616 of a Designated Expert. 27618 22.3. Object Recall Types 27620 IANA will create a registry called the "NFSv4.1 Recallable Object 27621 Types Registry". 27623 The potential exists for new object types to be added to the 27624 CB_RECALL_ANY operation (see Section 20.6). This can be done via 27625 changes to the operations that add recallable types, or by adding new 27626 operations to NFSv4. This requires a new minor version of NFSv4, and 27627 requires a standards track document from IETF. Another way to add a 27628 new recallable object is to specify a new layout type (see 27629 Section 22.4). 27631 All assignments to the registry are made on a Standards Action basis 27632 per section 4.1 of [54], with Expert Review required. 27634 Recallable object types are 32 bit unsigned numbers. There are no 27635 Reserved values. Values in the range 12 through 15, inclusive, are 27636 for Private Use. 27638 The registry is a list of assignments, each containing five fields 27639 per assignment. 
27641 1. The name of the recallable object type. This name must have the 27642 prefix: "RCA4_TYPE_MASK_". The name must be unique. 27644 2. The value of the recallable object type. IANA will assign this 27645 number, and the request from the registrant will use TBD1 instead 27646 of an actual value. IANA MUST use a whole number which can be no 27647 higher than 2^32-1, and should be the next available value. The 27648 value must be unique. A Designated Expert must be used to ensure 27649 that when the name of the recallable type and its value are added 27650 to the NFSv4 XDR description [12], the result continues to be a 27651 valid XDR description. 27653 3. The Standards Track RFC(s) that describe the recallable object 27654 type. If the RFC(s) have not yet been published, the registrant 27655 will use RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. 27657 4. How the RFC introduces the recallable object type. This is 27658 indicated by a single US-ASCII value. If the value is N, it 27659 means a minor revision to the NFSv4 protocol. If the value is L, 27660 it means a new pNFS layout type. Other values can be used with 27661 IESG Approval. 27663 5. The minor versions of NFSv4 that are allowed to the use the 27664 recallable object type. While these are numeric values, IANA 27665 will not allocate and assign them; the author of the relevant 27666 RFCs with IESG Approval assigns these numbers. Each time there 27667 is new minor version of NFSv4 approved, a Designated Expert 27668 should review the registry to make recommended updates as needed. 27670 22.3.1. Initial Registry 27672 The initial registry is in Table 17. Note that next available value 27673 is five. 27675 +-------------------------------+-------+----------+-----+----------+ 27676 | Recallable Object Type Name | Value | RFC | How | Minor | 27677 | | | | | Versions | 27678 +-------------------------------+-------+----------+-----+----------+ 27679 | RCA4_TYPE_MASK_RDATA_DLG | 0 | RFCTBD10 | N | 1 | 27680 | RCA4_TYPE_MASK_WDATA_DLG | 1 | RFCTBD10 | N | 1 | 27681 | RCA4_TYPE_MASK_DIR_DLG | 2 | RFCTBD10 | N | 1 | 27682 | RCA4_TYPE_MASK_FILE_LAYOUT | 3 | RFCTBD10 | N | 1 | 27683 | RCA4_TYPE_MASK_BLK_LAYOUT | 4 | RFCTBD20 | L | 1 | 27684 | RCA4_TYPE_MASK_OBJ_LAYOUT_MIN | 8 | RFCTBD30 | L | 1 | 27685 | RCA4_TYPE_MASK_OBJ_LAYOUT_MAX | 9 | RFCTBD30 | L | 1 | 27686 +-------------------------------+-------+----------+-----+----------+ 27688 Table 17: Initial Recallable Object Type Assignments 27690 22.3.2. Updating Registrations 27692 The update of a registration will require IESG Approval on the advice 27693 of a Designated Expert. 27695 22.4. Layout Types 27697 IANA will create a registry called the "pNFS Layout Types Registry". 27699 All assignments to the registry are made on a Standards Action basis, 27700 with Expert Review required. 27702 Layout types are 32 bit numbers. The value zero is Reserved. Values 27703 in the range 0x80000000 to 0xFFFFFFFF inclusive are for Private Use. 27704 IANA will assign numbers from the range 0x00000001 to 0x7FFFFFFF 27705 inclusive. 27707 The registry is a list of assignments, each containing five fields. 27709 1. The name of the layout type. This name must have the prefix: 27710 "LAYOUT4_". The name must be unique. 27712 2. The value of the layout type. IANA will assign this number, and 27713 the request from the registrant will use TBD1 instead of an 27714 actual value. The value assigned must be unique. 
A Designated 27715 Expert must be used to ensure that when the name of the layout 27716 type and its value are added to the NFSv4.1 layouttype4 27717 enumerated data type in the NFSv4.1 XDR description ([12]), the 27718 result continues to be a valid XDR description. 27720 3. The Standards Track RFC(s) that describe the notification. If 27721 the RFC(s) have not yet been published, the registrant will use 27722 RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. 27723 Collectively, the RFC(s) must adhere to the guidelines listed in 27724 Section 22.4.3. 27726 4. How the RFC introduces the layout type. This is indicated by a 27727 single US-ASCII value. If the value is N, it means a minor 27728 revision to the NFSv4 protocol. If the value is L, it means a 27729 new pNFS layout type. Other values can be used with IESG 27730 Approval. 27732 5. The minor versions of NFSv4 that are allowed to the use the 27733 notification. While these are numeric values, IANA will not 27734 allocate and assign them; the author of the relevant RFCs with 27735 IESG Approval assigns these numbers. Each time there is new 27736 minor version of NFSv4 approved, a Designated Expert should 27737 review the registry to make recommended updates as needed. 27739 22.4.1. Initial Registry 27741 The initial registry is in Table 18. 27743 +-----------------------+-------+----------+-----+----------------+ 27744 | Layout Type Name | Value | RFC | How | Minor Versions | 27745 +-----------------------+-------+----------+-----+----------------+ 27746 | LAYOUT4_NFSV4_1_FILES | 0x1 | RFCTBD10 | N | 1 | 27747 | LAYOUT4_OSD2_OBJECTS | 0x2 | RFCTBD30 | L | 1 | 27748 | LAYOUT4_BLOCK_VOLUME | 0x3 | RFCTBD20 | L | 1 | 27749 +-----------------------+-------+----------+-----+----------------+ 27751 Table 18: Initial Layout Type Assignments 27753 22.4.2. Updating Registrations 27755 The update of a registration will require IESG Approval on the advice 27756 of a Designated Expert. 27758 22.4.3. Guidelines for Writing Layout Type Specifications 27760 The author of a new pNFS layout specification must follow these steps 27761 to obtain acceptance of the layout type as a Standards Track RFC: 27763 1. The author devises the new layout specification. 27765 2. The new layout type specification MUST, at a minimum: 27767 * Define the contents of the layout-type-specific fields of the 27768 following data types: 27770 + the da_addr_body field of the device_addr4 data type; 27772 + the loh_body field of the layouthint4 data type; 27774 + the loc_body field of layout_content4 data type (which in 27775 turn is the lo_content field of the layout4 data type); 27777 + the lou_body field of the layoutupdate4 data type; 27779 * Describe or define the storage access protocol used to access 27780 the storage devices. 27782 * Describe whether revocation of layouts is supported. 27784 * At a minimum, describe the methods of recovery from: 27786 1. Failure and restart for client, server, storage device. 27788 2. Lease expiration from perspective of the active client, 27789 server, storage device. 27791 3. Loss of layout state resulting in fencing of client access 27792 to storage devices (for an example, see Section 12.7.3). 27794 * Include an IANA considerations section, will in turn include: 27796 + A request to IANA for a new layout type per Section 22.4. 27798 + A list of requests to IANA for any new recallable object 27799 types for CB_RECALL_ANY; each entry is to presented in the 27800 form described in Section 22.3. 
27802 + A list of requests to IANA for any new notification values 27803 for CB_NOTIFY_DEVICEID; each entry is to presented in the 27804 form described in Section 22.2. 27806 * Include a security considerations section. This section MUST 27807 explain how the NFSv4.1 authentication, authorization, and 27808 access control models are preserved. I.e. if a metadata 27809 server would restrict a READ or WRITE operation, how would 27810 pNFS via the layout similarly restrict a corresponding input 27811 or output operation? 27813 3. The author documents the new layout specification as an Internet 27814 Draft. 27816 4. The author submits the Internet Draft for review through the IETF 27817 standards process as defined in "Internet Official Protocol 27818 Standards" (STD 1). The new layout specification will be 27819 submitted for eventual publication as a standards track RFC. 27821 5. The layout specification progresses through the IETF standards 27822 process; the new option will be reviewed by the NFSv4 Working 27823 Group (if that group still exists), or as an Internet Draft not 27824 submitted by an IETF working group. 27826 22.5. Path Variable Definitions 27828 This section deals with the IANA considerations associated with the 27829 variable substitution feature for location names as described in 27830 Section 11.10.3. As described there, variables subject to 27831 substitution consist of a domain name and a specific name within that 27832 domain, with two separated by a colon. There are two sets of IANA 27833 considerations here: 27835 1. The list of variable names. 27837 2. For each variable name, the list of possible values. 27839 Thus, there will be one registry for the list of variable names, and 27840 possibly one registry for listing the values of each variable name. 27842 22.5.1. Path Variables Registry 27844 IANA will create a registry called the "NFSv4 Path Variables 27845 Registry". 27847 22.5.1.1. Path Variable Values 27849 Variable names are of the form "${", followed by a domain name, 27850 followed by a colon (":"), followed by a domain-specific portion of 27851 the variable name, followed by "}". When the domain name is 27852 "ietf.org" all variables names must be registered with IANA on a 27853 Standards Action basis, with Expert Review required. Path variables 27854 with registered domain names neither part of nor equal to ietf.org 27855 are assigned on a Hierarchical Allocation basis (delegating to the 27856 domain owner) and thus of no concern to IANA, unless the domain owner 27857 chooses to register a variable name from his domain. If the domain 27858 owner chooses to do so, IANA will do so on a First Come First Serve 27859 basis. To accommodate registrants who do not have their own domain, 27860 IANA will accept requests to register variables with the prefix 27861 "${FCFS.ietf.org:" on a First Come First Served basis. Assignments 27862 on a First Come First Basis do not require Expert Review, unless the 27863 registrant also wants IANA to establish a registry for the values of 27864 the registered variable. 27866 The registry is a list of assignments, each containing three fields. 27868 1. The name of the variable. The name of this variable must start 27869 with a "${" followed by a registered domain name, followed by 27870 ":", or it must start with "${FCFS.ietf.org". The name must be 27871 no more than 64 UTF-8 characters long. The name must be unique. 27873 2. 
For assignments made on Standards Action basis, the Standards 27874 Track RFC(s) that describe the variable. If the RFC(s) have not 27875 yet been published, the registrant will use RFCTBD1, RFCTBD2, 27876 etc. instead of an actual RFC number. Note that the RFCs do not 27877 have to be a part of a NFS minor version. For assignments made 27878 on a First Come First Serve basis, an explanation (consuming no 27879 more than 1024 bytes, or more if IANA permits) of the purpose of 27880 the variable. A reference to the explanation can be substituted. 27882 3. The point of contact, including an email address. The point of 27883 contact can consume up to 256 bytes (or more if IANA permits). 27884 For assignments made on a Standards Action basis, the point of 27885 contact is always IESG. 27887 22.5.1.1.1. Initial Registry 27889 The initial registry is in Table 19. 27891 +------------------------+----------+------------------+ 27892 | Variable Name | RFC | Point of Contact | 27893 +------------------------+----------+------------------+ 27894 | ${ietf.org:CPU_ARCH} | RFCTBD10 | IESG | 27895 | ${ietf.org:OS_TYPE} | RFCTBD10 | IESG | 27896 | ${ietf.org:OS_VERSION} | RFCTBD10 | IESG | 27897 +------------------------+----------+------------------+ 27899 Table 19: Initial List of Path Variables 27901 IANA will need to create registries for the values of the variable 27902 names ${ietf.org:CPU_ARCH} and ${ietf.org:OS_TYPE}. See 27903 Section 22.5.2 and Section 22.5.3. 27905 For the values of the variable ${ietf.org:OS_VERSION}, no registry is 27906 needed as the specifics of the values of the variable will vary with 27907 the value of ${ietf.org:OS_TYPE}. Thus values for ${ietf.org: 27908 OS_VERSION} are on a Hierarchical Allocation basis and are of no 27909 concern to IANA. 27911 22.5.1.1.2. Updating Registrations 27913 The update of an assignment made on a Standards Action basis will 27914 require IESG Approval on the advice of a Designated Expert. 27916 The registrant can always updated the point of contact of an 27917 assignment made on a First Come First Serve basis. Any other update 27918 will require Expert Review. 27920 22.5.2. Values for the ${ietf.org:CPU_ARCH} Variable 27922 IANA will create a registry called the "NFSv4 ${ietf.org:CPU_ARCH} 27923 Value Registry". 27925 Assignments to the registry are made on a First Come First Serve 27926 basis. The zero length value of ${ietf.org:CPU_ARCH} is Reserved. 27927 Values with a prefix of "PRIV" are Reserved for Private Use. 27929 The registry is a list of assignments, each containing three fields. 27931 1. A value of the ${ietf.org:CPU_ARCH} variable. The value must be 27932 1 to 32 UTF-8 characters long. The value must be unique. 27934 2. An explanation (consuming no more than 1024 bytes, or more if 27935 IANA permits) of what CPU architecture the value denotes. A 27936 reference to the explanation can be substituted. 27938 3. The point of contact, including an email address. The point of 27939 contact can consume up to 256 bytes (or more if IANA permits). 27941 22.5.2.1. Initial Registry 27943 There is no initial registry. 27945 22.5.2.2. Updating Registrations 27947 The registrant is free to update the assignment, i.e. change the 27948 explanation and/or point of contact fields. 27950 22.5.3. Values for the ${ietf.org:OS_TYPE} Variable 27952 IANA will create a registry called the "NFSv4 ${ietf.org:OS_TYPE} 27953 Value Registry". 27955 Assignments to the registry are made on a First Come First Serve 27956 basis. 
The zero length value of ${ietf.org:OS_TYPE} is Reserved. 27958 Values with a prefix of "PRIV" are Reserved for Private Use. 27960 The registry is a list of assignments, each containing three fields. 27962 1. A value of the ${ietf.org:OS_TYPE} variable. The value must be 1 27963 to 32 UTF-8 characters long. The value must be unique. 27965 2. An explanation (consuming no more than 1024 bytes, or more if 27966 IANA permits) of what CPU architecture the value denotes. A 27967 reference to the explanation can be substituted. 27969 3. The point of contact, including an email address. The point of 27970 contact can consume up to 256 bytes (or more if IANA permits). 27972 22.5.3.1. Initial Registry 27974 There is no initial registry. 27976 22.5.3.2. Updating Registrations 27978 The registrant is free to update the assignment, i.e. change the 27979 explanation and/or point of contact fields. 27981 23. References 27983 23.1. Normative References 27985 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 27986 Levels", RFC 2119, March 1997. 27988 [2] Eisler, M., "XDR: External Data Representation Standard", 27989 STD 67, RFC 4506, May 2006. 27991 [3] Srinivasan, R., "RPC: Remote Procedure Call Protocol 27992 Specification Version 2", RFC 1831, August 1995. 27994 [4] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 27995 Specification", RFC 2203, September 1997. 27997 [5] Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos Version 27998 5 Generic Security Service Application Program Interface (GSS- 27999 API) Mechanism Version 2", RFC 4121, July 2005. 28001 [6] The Open Group, "Section 3.191 of Chapter 3 of Base Definitions 28002 of The Open Group Base Specifications Issue 6 IEEE Std 1003.1, 28003 2004 Edition, HTML Version (www.opengroup.org), ISBN 28004 1931624232", 2004. 28006 [7] Linn, J., "Generic Security Service Application Program 28007 Interface Version 2, Update 1", RFC 2743, January 2000. 28009 [8] Talpey, T. and B. Callaghan, "Remote Direct Memory Access 28010 Transport for Remote Procedure Call", 28011 draft-ietf-nfsv4-rpcrdma-09 (work in progress), December 2008. 28013 [9] Talpey, T., Callaghan, B., and I. Property, "NFS Direct Data 28014 Placement", draft-ietf-nfsv4-nfsdirect-08 (work in progress), 28015 April 2008. 28017 [10] Recio, P., Metzler, B., Culley, P., Hilland, J., and D. Garcia, 28018 "A Remote Direct Memory Access Protocol Specification", 28019 RFC 5040, October 2007. 28021 [11] Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed-Hashing 28022 for Message Authentication", RFC 2104, February 1997. 28024 [12] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version 1 28025 XDR Description", draft-ietf-nfsv4-minorversion1-dot-x-12 (work 28026 in progress), Dec 2008. 28028 [13] The Open Group, "Section 3.372 of Chapter 3 of Base Definitions 28029 of The Open Group Base Specifications Issue 6 IEEE Std 1003.1, 28030 2004 Edition, HTML Version (www.opengroup.org), ISBN 28031 1931624232", 2004. 28033 [14] Eisler, M., "IANA Considerations for RPC Net Identifiers and 28034 Universal Address Formats", draft-ietf-nfsv4-rpc-netid-04 (work 28035 in progress), December 2008. 28037 [15] The Open Group, "Section 'read()' of System Interfaces of The 28038 Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004 28039 Edition, HTML Version (www.opengroup.org), ISBN 1931624232", 28040 2004. 
28042 [16] The Open Group, "Section 'readdir()' of System Interfaces of 28043 The Open Group Base Specifications Issue 6 IEEE Std 1003.1, 28044 2004 Edition, HTML Version (www.opengroup.org), ISBN 28045 1931624232", 2004. 28047 [17] The Open Group, "Section 'write()' of System Interfaces of The 28048 Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004 28049 Edition, HTML Version (www.opengroup.org), ISBN 1931624232", 28050 2004. 28052 [18] Hoffman, P. and M. Blanchet, "Preparation of Internationalized 28053 Strings ("stringprep")", RFC 3454, December 2002. 28055 [19] The Open Group, "Section 'chmod()' of System Interfaces of The 28056 Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004 28057 Edition, HTML Version (www.opengroup.org), ISBN 1931624232", 28058 2004. 28060 [20] International Organization for Standardization, "Information 28061 Technology - Universal Multiple-octet coded Character Set (UCS) 28062 - Part 1: Architecture and Basic Multilingual Plane", 28063 ISO Standard 10646-1, May 1993. 28065 [21] Alvestrand, H., "IETF Policy on Character Sets and Languages", 28066 BCP 18, RFC 2277, January 1998. 28068 [22] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile 28069 for Internationalized Domain Names (IDN)", RFC 3491, 28070 March 2003. 28072 [23] The Open Group, "Section 'fcntl()' of System Interfaces of The 28073 Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004 28074 Edition, HTML Version (www.opengroup.org), ISBN 1931624232", 28075 2004. 28077 [24] The Open Group, "Section 'fsync()' of System Interfaces of The 28078 Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004 28079 Edition, HTML Version (www.opengroup.org), ISBN 1931624232", 28080 2004. 28082 [25] The Open Group, "Section 'getpwnam()' of System Interfaces of 28083 The Open Group Base Specifications Issue 6 IEEE Std 1003.1, 28084 2004 Edition, HTML Version (www.opengroup.org), ISBN 28085 1931624232", 2004. 28087 [26] The Open Group, "Section 'unlink()' of System Interfaces of The 28088 Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004 28089 Edition, HTML Version (www.opengroup.org), ISBN 1931624232", 28090 2004. 28092 [27] Schaad, J., Kaliski, B., and R. Housley, "Additional Algorithms 28093 and Identifiers for RSA Cryptography for use in the Internet 28094 X.509 Public Key Infrastructure Certificate and Certificate 28095 Revocation List (CRL) Profile", RFC 4055, June 2005. 28097 [28] National Institute of Standards and Technology, "Cryptographic 28098 Algorithm Object Registration", URL http://csrc.nist.gov/ 28099 groups/ST/crypto_apps_infra/csor/algorithms.html, 28100 November 2007. 28102 23.2. Informative References 28104 [29] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 28105 C., Eisler, M., and D. Noveck, "Network File System (NFS) 28106 version 4 Protocol", RFC 3530, April 2003. 28108 [30] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 28109 Protocol Specification", RFC 1813, June 1995. 28111 [31] Eisler, M., "LIPKEY - A Low Infrastructure Public Key Mechanism 28112 Using SPKM", RFC 2847, June 2000. 28114 [32] Eisler, M., "NFS Version 2 and Version 3 Security Issues and 28115 the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", 28116 RFC 2623, June 1999. 28118 [33] Juszczak, C., "Improving the Performance and Correctness of an 28119 NFS Server", USENIX Conference Proceedings , June 1990. 28121 [34] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On- 28122 line Database", RFC 3232, January 2002. 
28124 [35] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
28125 RFC 1833, August 1995.
28127 [36] Werme, R., "RPC XID Issues", USENIX Conference Proceedings ,
28128 February 1996.
28130 [37] Nowicki, B., "NFS: Network File System Protocol specification",
28131 RFC 1094, March 1989.
28133 [38] Bhide, A., Elnozahy, E., and S. Morgan, "A Highly Available
28134 Network Server", USENIX Conference Proceedings , January 1991.
28136 [39] Halevy, B., Welch, B., and J. Zelenka, "Object-based pNFS
28137 Operations", draft-ietf-nfsv4-pnfs-obj-11 (work in progress),
28138 December 2008.
28140 [40] Black, D., Fridella, S., and J. Glasgow, "pNFS Block/Volume
28141 Layout", draft-ietf-nfsv4-pnfs-block-11 (work in progress),
28142 December 2008.
28144 [41] Callaghan, B., "WebNFS Client Specification", RFC 2054,
28145 October 1996.
28147 [42] Callaghan, B., "WebNFS Server Specification", RFC 2055,
28148 October 1996.
28150 [43] IESG, "IESG Processing of RFC Errata for the IETF Stream",
28151 July 2008.
28153 [44] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624,
28154 June 1999.
28156 [45] The Open Group, "Protocols for Interworking: XNFS, Version 3W,
28157 ISBN 1-85912-184-5", February 1998.
28159 [46] Floyd, S. and V. Jacobson, "The Synchronization of Periodic
28160 Routing Messages", IEEE/ACM Transactions on Networking 2(2),
28161 pp. 122-136, April 1994.
28163 [47] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E.
28164 Zeidner, "Internet Small Computer Systems Interface (iSCSI)",
28165 RFC 3720, April 2004.
28167 [48] Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version
28168 (FCP-2)", ANSI/INCITS 350-2003, October 2003.
28170 [49] Weber, R., "Object-Based Storage Device Commands (OSD)", ANSI/
28171 INCITS 400-2004, July 2004.
28174 [50] Carns, P., Ligon III, W., Ross, R., and R. Thakur, "PVFS: A
28175 Parallel File System for Linux Clusters.", Proceedings of the
28176 4th Annual Linux Showcase and Conference , 2000.
28178 [51] The Open Group, "The Open Group Base Specifications Issue 6,
28179 IEEE Std 1003.1, 2004 Edition", 2004.
28181 [52] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997.
28183 [53] Chiu, A., Eisler, M., and B. Callaghan, "Security Negotiation
28184 for WebNFS", RFC 2755, January 2000.
28186 [54] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
28187 Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.
28189 Appendix A. Acknowledgments
28191 The initial drafts for the SECINFO extensions were edited by Mike
28192 Eisler with contributions from Peng Dai, Sergey Klyushin, and Carl
28193 Burnett.
28195 The initial drafts for the SESSIONS extensions were edited by Tom
28196 Talpey, Spencer Shepler, and Jon Bauman with contributions from
28197 Charles Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet
28198 Juszczak, Trond Myklebust, Dave Noveck, John Scott, Mike Stolarchuk
28199 and Mark Wittle.
28201 Initial drafts relating to multi-server namespace features, including
28202 the concept of referrals, were contributed by Dave Noveck, Carl
28203 Burnett, and Charles Fan with contributions from Ted Anderson, Neil
28204 Brown, and Jon Haswell.
28206 The initial drafts for the Directory Delegations support were
28207 contributed by Saadia Khan with input from Dave Noveck, Mike Eisler,
28208 Carl Burnett, Ted Anderson and Tom Talpey.
28210 The initial drafts for the ACL explanations were contributed by Sam
28211 Falkner and Lisa Week.
28213 The pNFS work was inspired by the NASD and OSD work done by Garth 28214 Gibson. Gary Grider has also been a champion of high-performance 28215 parallel I/O. Garth Gibson and Peter Corbett started the pNFS effort 28216 with a problem statement document for IETF that formed the basis for 28217 the pNFS work in NFSv4.1. 28219 The initial drafts for the parallel NFS support were edited by Brent 28220 Welch and Garth Goodson. Additional authors for those documents were 28221 Benny Halevy, David Black, and Andy Adamson. Additional input came 28222 from the informal group which contributed to the construction of the 28223 initial pNFS drafts; specific acknowledgement goes to Gary Grider, 28224 Peter Corbett, Dave Noveck, Peter Honeyman, and Stephen Fridella. 28226 Fredric Isaman found several errors in draft versions of the ONC RPC 28227 XDR description of the NFSv4.1 protocol. 28229 Audrey Van Belleghem provided, in numerous ways, essential co- 28230 ordination and management of the process of editing the specification 28231 drafts. 28233 Richard Jernigan gave feedback on the file layout's striping pattern 28234 design. 28236 Several formal inspection teams were formed to review various areas 28237 of the protocol. All the inspections found significant errors and 28238 room for improvement. NFSv4.1's inspection teams were: 28240 o ACLs, with the following inspectors: Sam Falkner, Bruce Fields, 28241 Rahul Iyer, Saadia Khan, Dave Noveck, Lisa Week, Mario Wurzl, and 28242 Alan Yoder. 28244 o Sessions, with the following inspectors: William Brown, Tom 28245 Doeppner, Robert Gordon, Benny Halevy, Fredric Isaman, Rick 28246 Macklem, Trond Myklebust, Dave Noveck, Karen Rochford, John Scott, 28247 and Peter Shah. 28249 o Initial pNFS inspection, with the following inspectors: Andy 28250 Adamson, David Black, Mike Eisler, Marc Eshel, Sam Falkner, Garth 28251 Goodson, Benny Halevy, Rahul Iyer, Trond Myklebust, Spencer 28252 Shepler, and Lisa Week. 28254 o Global namespace, with the following inspectors: Mike Eisler, Dan 28255 Ellard, Craig Everhart, Fred Isaman, Trond Myklebust, Dave Noveck, 28256 Theresa Raj, Spencer Shepler, Renu Tewari, and Robert Thurlow. 28258 o NFSv4.1 file layout type, with the following inspectors: Andy 28259 Adamson, Marc Eshel, Sam Falkner, Garth Goodson, Rahul Iyer, Trond 28260 Myklebust, and Lisa Week. 28262 o NFSv4.1 locking and directory delegations, with the following 28263 inspectors: Mike Eisler, Pranoop Erasani, Robert Gordon, Saadia 28264 Khan, Eric Kustarz, Dave Noveck, Spencer Shepler, and Amy Weaver. 28266 o EXCHANGE_ID and DESTROY_CLIENTID, with the following inspectors: 28267 Mike Eisler, Pranoop Erasani, Robert Gordon, Benny Halevy, Fred 28268 Isaman, Saadia Khan, Ricardo Labiaga, Rick Macklem, Trond 28269 Myklebust, Spencer Shepler, and Brent Welch. 28271 o Final pNFS inspection, with the following inspectors: Andy 28272 Adamson, Mike Eisler, Mark Eshel, Sam Falkner, Jason Glasgow, 28273 Garth Goodson, Robert Gordon, Benny Halevy, Dean Hildebrand, Rahul 28274 Iyer, Suchit Kaura, Trond Myklebust, Anatoly Pinchuk, Spencer 28275 Shepler, Renu Tewari, Lisa Week, and Brent Welch. 28277 A review team worked together to generate the tables of assignments 28278 of error sets to operations and make sure that each such assignment 28279 had two or more people validating it. 
Participating in the process
28280 were: Andy Adamson, Mike Eisler, Sam Falkner, Garth Goodson, Robert
28281 Gordon, Trond Myklebust, Dave Noveck, Spencer Shepler, Tom Talpey,
28282 Amy Weaver, and Lisa Week.
28284 Jari Arkko, David Black, Scott Bradner, Lisa Dusseault, Lars Eggert,
28285 Chris Newman, and Tim Polk provided valuable review and guidance.
28287 Olga Kornievskaia found several errors in the SSV specification.
28289 Ricardo Labiaga found several places where the use of RPCSEC_GSS was
28290 underspecified.
28292 Others who provided comments include: Sunil Bhargo, Jason
28293 Goldschmidt, Vijay K. Gurbani, James Lentini, Anshul Madan, Archana
28294 Ramani, Jim Rees, and Mahesh Siddheshwar.
28296 Appendix B. RFC Editor Notes
28298 [RFC Editor: please remove this section prior to publishing this
28299 document as an RFC]
28301 [RFC Editor: prior to publishing this document as an RFC, please
28302 replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
28303 RFC number of this document]
28305 [RFC Editor: prior to publishing this document as an RFC, please
28306 replace all occurrences of RFCTBD20 with RFCyyyy where yyyy is the
28307 RFC number of the document referenced in [40]]
28309 [RFC Editor: prior to publishing this document as an RFC, please
28310 replace all occurrences of RFCTBD30 with RFCzzzz where zzzz is the
28311 RFC number of the document referenced in [39]]
28313 [RFC Editor: prior to publishing this document as an RFC, please
28314 ensure all section references to [14], including the reference from
28315 Section 3.3.9, are accurate if the document referenced by [14] has
28316 been finalized for RFC publication. If not finalized for publication,
28317 please remove section number references to [14].]
28319 Authors' Addresses
28321 Spencer Shepler
28322 Storspeed, Inc.
28323 7808 Moonflower Drive
28324 Austin, TX 78750
28325 USA
28327 Phone: +1-512-402-5811 ext 8530
28328 Email: shepler@storspeed.com
28329 Mike Eisler
28330 NetApp
28331 5765 Chase Point Circle
28332 Colorado Springs, CO 80919
28333 USA
28335 Phone: +1-719-599-9026
28336 Email: mike@eisler.com
28337 URI: http://www.eisler.com
28339 David Noveck
28340 NetApp
28341 1601 Trapelo Road, Suite 16
28342 Waltham, MA 02454
28343 USA
28345 Phone: +1-781-768-5347
28346 Email: dnoveck@netapp.com