idnits 2.17.1 draft-dnoveck-nfsv4-rfc5661bis-base-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 3 instances of lines with non-RFC2606-compliant FQDNs in the document. == There are 53 instances of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. -- The abstract seems to indicate that this document obsoletes RFC5661, but the header doesn't have an 'Obsoletes:' line to match this. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1013 has weird spacing: '...privacy no ...' == Line 3211 has weird spacing: '...est|Pad bytes...' == Line 4501 has weird spacing: '... opaque devic...' == Line 4614 has weird spacing: '...str_cis nii_...' == Line 4615 has weird spacing: '...8str_cs nii...' == (28 more instances...) == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (4 July 2021) is 1027 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: '0' is mentioned on line 16691, but not defined -- Looks like a reference, but probably isn't: 'X' on line 16425 -- Looks like a reference, but probably isn't: 'Y' on line 16434 -- Possible downref: Non-RFC (?) normative reference: ref. '6' -- Possible downref: Non-RFC (?) normative reference: ref. '11' -- Possible downref: Non-RFC (?) normative reference: ref. '13' -- Possible downref: Non-RFC (?) normative reference: ref. '14' -- Possible downref: Non-RFC (?) normative reference: ref. '15' ** Obsolete normative reference: RFC 3454 (ref. '16') (Obsoleted by RFC 7564) -- Possible downref: Non-RFC (?) normative reference: ref. '17' -- Possible downref: Non-RFC (?) normative reference: ref. '18' ** Obsolete normative reference: RFC 3491 (ref. '20') (Obsoleted by RFC 5891) -- Possible downref: Non-RFC (?) normative reference: ref. '21' -- Possible downref: Non-RFC (?) normative reference: ref. '22' -- Possible downref: Non-RFC (?) normative reference: ref. '23' -- Possible downref: Non-RFC (?) 
normative reference: ref. '24' -- Possible downref: Non-RFC (?) normative reference: ref. '26' -- Obsolete informational reference (is this intentional?): RFC 3530 (ref. '37') (Obsoleted by RFC 7530) -- Obsolete informational reference (is this intentional?): RFC 5661 (ref. '64') (Obsoleted by RFC 8881) -- Duplicate reference: RFC5661, mentioned in '66', was also mentioned in '64'. -- Obsolete informational reference (is this intentional?): RFC 5661 (ref. '66') (Obsoleted by RFC 8881) Summary: 2 errors (**), 0 flaws (~~), 11 warnings (==), 21 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 NFSv4 D. Noveck, Ed. 3 Internet-Draft NetApp 4 Intended status: Standards Track C. Lever 5 Expires: 5 January 2022 ORACLE 6 4 July 2021 8 Network File System (NFS) Version 4 Minor Version 1 Protocol 9 draft-dnoveck-nfsv4-rfc5661bis-base-02 11 Abstract 13 This document describes the Network File System (NFS) version 4 minor 14 version 1, including features retained from the base protocol (NFS 15 version 4 minor version 0, which is specified in RFC 7530) and 16 protocol extensions made subsequently. The later minor version has 17 no dependencies on NFS version 4 minor version 0, and is considered a 18 separate protocol. 20 This document obsoletes RFC 5661. It substantially revises the 21 treatment of features relating to multi-server namespace, superseding 22 the description of those features appearing in RFC 5661. 24 The content of this document reflects that of RFC 8881, presented in 25 the form of an I-D. This is intended to provide a helpful point of 26 comparison for drafts leading to an eventual rfc5661bis document 27 suite. This document can serve as a baseline to enable use of 28 rfcdiff when reviewing such drafts. 30 Status of This Memo 32 This Internet-Draft is submitted in full conformance with the 33 provisions of BCP 78 and BCP 79. 35 Internet-Drafts are working documents of the Internet Engineering 36 Task Force (IETF). Note that other groups may also distribute 37 working documents as Internet-Drafts. The list of current Internet- 38 Drafts is at https://datatracker.ietf.org/drafts/current/. 40 Internet-Drafts are draft documents valid for a maximum of six months 41 and may be updated, replaced, or obsoleted by other documents at any 42 time. It is inappropriate to use Internet-Drafts as reference 43 material or to cite them other than as "work in progress." 45 This Internet-Draft will expire on 5 January 2022. 47 Copyright Notice 49 Copyright (c) 2021 IETF Trust and the persons identified as the 50 document authors. All rights reserved. 52 This document is subject to BCP 78 and the IETF Trust's Legal 53 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 54 license-info) in effect on the date of publication of this document. 55 Please review these documents carefully, as they describe your rights 56 and restrictions with respect to this document. Code Components 57 extracted from this document must include Simplified BSD License text 58 as described in Section 4.e of the Trust Legal Provisions and are 59 provided without warranty as described in the Simplified BSD License. 61 This document may contain material from IETF Documents or IETF 62 Contributions published or made publicly available before November 63 10, 2008.
The person(s) controlling the copyright in some of this 64 material may not have granted the IETF Trust the right to allow 65 modifications of such material outside the IETF Standards Process. 66 Without obtaining an adequate license from the person(s) controlling 67 the copyright in such materials, this document may not be modified 68 outside the IETF Standards Process, and derivative works of it may 69 not be created outside the IETF Standards Process, except to format 70 it for publication as an RFC or to translate it into languages other 71 than English. 73 Table of Contents 75 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 8 76 1.1. Introduction to This Update . . . . . . . . . . . . . . . 8 77 1.2. The NFS Version 4 Minor Version 1 Protocol . . . . . . . 10 78 1.3. Requirements Language . . . . . . . . . . . . . . . . . . 10 79 1.4. Scope of This Document . . . . . . . . . . . . . . . . . 10 80 1.5. NFSv4 Goals . . . . . . . . . . . . . . . . . . . . . . . 11 81 1.6. NFSv4.1 Goals . . . . . . . . . . . . . . . . . . . . . . 11 82 1.7. General Definitions . . . . . . . . . . . . . . . . . . . 12 83 1.8. Overview of NFSv4.1 Features . . . . . . . . . . . . . . 14 84 1.9. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 18 85 2. Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 20 86 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 20 87 2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . . 20 88 2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 23 89 2.4. Client Identifiers and Client Owners . . . . . . . . . . 24 90 2.5. Server Owners . . . . . . . . . . . . . . . . . . . . . . 30 91 2.6. Security Service Negotiation . . . . . . . . . . . . . . 31 92 2.7. Minor Versioning . . . . . . . . . . . . . . . . . . . . 36 93 2.8. Non-RPC-Based Security Services . . . . . . . . . . . . . 38 94 2.9. Transport Layers . . . . . . . . . . . . . . . . . . . . 39 95 2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . . 42 96 3. Protocol Constants and Data Types . . . . . . . . . . . . . . 90 97 3.1. Basic Constants . . . . . . . . . . . . . . . . . . . . . 90 98 3.2. Basic Data Types . . . . . . . . . . . . . . . . . . . . 91 99 3.3. Structured Data Types . . . . . . . . . . . . . . . . . . 94 100 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 103 101 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 103 102 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 104 103 4.3. One Method of Constructing a Volatile Filehandle . . . . 107 104 4.4. Client Recovery from Filehandle Expiration . . . . . . . 107 105 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 108 106 5.1. REQUIRED Attributes . . . . . . . . . . . . . . . . . . . 110 107 5.2. RECOMMENDED Attributes . . . . . . . . . . . . . . . . . 110 108 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 110 109 5.4. Classification of Attributes . . . . . . . . . . . . . . 112 110 5.5. Set-Only and Get-Only Attributes . . . . . . . . . . . . 113 111 5.6. REQUIRED Attributes - List and Definition References . . 113 112 5.7. RECOMMENDED Attributes - List and Definition 113 References . . . . . . . . . . . . . . . . . . . . . . . 114 114 5.8. Attribute Definitions . . . . . . . . . . . . . . . . . . 119 115 5.9. Interpreting owner and owner_group . . . . . . . . . . . 128 116 5.10. Character Case Attributes . . . . . . . . . . . . . . . . 130 117 5.11. Directory Notification Attributes . . . . . . . . . . . . 
130 118 5.12. pNFS Attribute Definitions . . . . . . . . . . . . . . . 131 119 5.13. Retention Attributes . . . . . . . . . . . . . . . . . . 132 120 6. Access Control Attributes . . . . . . . . . . . . . . . . . . 135 121 6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 135 122 6.2. File Attributes Discussion . . . . . . . . . . . . . . . 137 123 6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 153 124 6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 155 125 7. Single-Server Namespace . . . . . . . . . . . . . . . . . . . 162 126 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 162 127 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 162 128 7.3. Server Pseudo File System . . . . . . . . . . . . . . . . 163 129 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 164 130 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . . 164 131 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . . 164 132 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 164 133 7.8. Security Policy and Namespace Presentation . . . . . . . 165 134 8. State Management . . . . . . . . . . . . . . . . . . . . . . 166 135 8.1. Client and Session ID . . . . . . . . . . . . . . . . . . 167 136 8.2. Stateid Definition . . . . . . . . . . . . . . . . . . . 167 137 8.3. Lease Renewal . . . . . . . . . . . . . . . . . . . . . . 176 138 8.4. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 179 139 8.5. Server Revocation of Locks . . . . . . . . . . . . . . . 190 140 8.6. Short and Long Leases . . . . . . . . . . . . . . . . . . 191 141 8.7. Clocks, Propagation Delay, and Calculating Lease 142 Expiration . . . . . . . . . . . . . . . . . . . . . . . 191 144 8.8. Obsolete Locking Infrastructure from NFSv4.0 . . . . . . 192 145 9. File Locking and Share Reservations . . . . . . . . . . . . . 193 146 9.1. Opens and Byte-Range Locks . . . . . . . . . . . . . . . 193 147 9.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . . 197 148 9.3. Upgrading and Downgrading Locks . . . . . . . . . . . . . 197 149 9.4. Stateid Seqid Values and Byte-Range Locks . . . . . . . . 198 150 9.5. Issues with Multiple Open-Owners . . . . . . . . . . . . 198 151 9.6. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 199 152 9.7. Share Reservations . . . . . . . . . . . . . . . . . . . 200 153 9.8. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . . 201 154 9.9. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 202 155 9.10. Parallel OPENs . . . . . . . . . . . . . . . . . . . . . 203 156 9.11. Reclaim of Open and Byte-Range Locks . . . . . . . . . . 203 157 10. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 204 158 10.1. Performance Challenges for Client-Side Caching . . . . . 204 159 10.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 205 160 10.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 210 161 10.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 214 162 10.5. Data Caching and Revocation . . . . . . . . . . . . . . 226 163 10.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 227 164 10.7. Data and Metadata Caching and Memory Mapped Files . . . 229 165 10.8. Name and Directory Caching without Directory 166 Delegations . . . . . . . . . . . . . . . . . . . . . . 231 167 10.9. Directory Delegations . . . . . . . . . . . . . . . . . 234 168 11. Multi-Server Namespace . . . . . . . . . . . . . . . . . . . 237 169 11.1. Terminology . . . . . . . . . . . . . . . 
. . . . . . . 237 170 11.2. File System Location Attributes . . . . . . . . . . . . 241 171 11.3. File System Presence or Absence . . . . . . . . . . . . 242 172 11.4. Getting Attributes for an Absent File System . . . . . . 244 173 11.5. Uses of File System Location Information . . . . . . . . 246 174 11.6. Trunking without File System Location Information . . . 256 175 11.7. Users and Groups in a Multi-Server Namespace . . . . . . 256 176 11.8. Additional Client-Side Considerations . . . . . . . . . 257 177 11.9. Overview of File Access Transitions . . . . . . . . . . 258 178 11.10. Effecting Network Endpoint Transitions . . . . . . . . . 258 179 11.11. Effecting File System Transitions . . . . . . . . . . . 259 180 11.12. Transferring State upon Migration . . . . . . . . . . . 270 181 11.13. Client Responsibilities When Access Is Transitioned . . 272 182 11.14. Server Responsibilities Upon Migration . . . . . . . . . 282 183 11.15. Effecting File System Referrals . . . . . . . . . . . . 288 184 11.16. The Attribute fs_locations . . . . . . . . . . . . . . . 295 185 11.17. The Attribute fs_locations_info . . . . . . . . . . . . 298 186 11.18. The Attribute fs_status . . . . . . . . . . . . . . . . 312 187 12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 316 188 12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 316 189 12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 318 190 12.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 323 191 12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 324 192 12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 324 193 12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 341 194 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 342 195 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 348 196 12.9. Security Considerations for pNFS . . . . . . . . . . . . 348 197 13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout 198 Type . . . . . . . . . . . . . . . . . . . . . . . . . . 349 199 13.1. Client ID and Session Considerations . . . . . . . . . . 349 200 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 353 201 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 353 202 13.4. Interpreting the File Layout . . . . . . . . . . . . . . 358 203 13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 366 204 13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 367 205 13.7. COMMIT through Metadata Server . . . . . . . . . . . . . 370 206 13.8. The Layout Iomode . . . . . . . . . . . . . . . . . . . 371 207 13.9. Metadata and Data Server State Coordination . . . . . . 371 208 13.10. Data Server Component File Size . . . . . . . . . . . . 374 209 13.11. Layout Revocation and Fencing . . . . . . . . . . . . . 375 210 13.12. Security Considerations for the File Layout Type . . . . 376 211 14. Internationalization . . . . . . . . . . . . . . . . . . . . 377 212 14.1. Stringprep Profile for the utf8str_cs Type . . . . . . . 378 213 14.2. Stringprep Profile for the utf8str_cis Type . . . . . . 379 214 14.3. Stringprep Profile for the utf8str_mixed Type . . . . . 381 215 14.4. UTF-8 Capabilities . . . . . . . . . . . . . . . . . . . 382 216 14.5. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 383 217 15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 383 218 15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 383 219 15.2. Operations and Their Valid Errors . . . . . . . . . . . 
407 220 15.3. Callback Operations and Their Valid Errors . . . . . . . 424 221 15.4. Errors and the Operations That Use Them . . . . . . . . 427 222 16. NFSv4.1 Procedures . . . . . . . . . . . . . . . . . . . . . 442 223 16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 442 224 16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 442 225 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 454 226 18. NFSv4.1 Operations . . . . . . . . . . . . . . . . . . . . . 460 227 18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 460 228 18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 465 229 18.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 466 230 18.4. Operation 6: CREATE - Create a Non-Regular File 231 Object . . . . . . . . . . . . . . . . . . . . . . . . 469 232 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 233 Recovery . . . . . . . . . . . . . . . . . . . . . . . 472 234 18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 473 235 18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 473 236 18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 475 237 18.9. Operation 11: LINK - Create Link to a File . . . . . . . 476 238 18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 479 239 18.11. Operation 13: LOCKT - Test for Lock . . . . . . . . . . 483 240 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 484 241 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 486 242 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 487 243 18.15. Operation 17: NVERIFY - Verify Difference in 244 Attributes . . . . . . . . . . . . . . . . . . . . . . 489 245 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 490 246 18.17. Operation 19: OPENATTR - Open Named Attribute 247 Directory . . . . . . . . . . . . . . . . . . . . . . . 511 248 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File 249 Access . . . . . . . . . . . . . . . . . . . . . . . . 512 250 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 514 251 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 515 252 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 516 253 18.22. Operation 25: READ - Read from File . . . . . . . . . . 517 254 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 519 255 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 523 256 18.25. Operation 28: REMOVE - Remove File System Object . . . . 524 257 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 527 258 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 530 259 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 531 260 18.29. Operation 33: SECINFO - Obtain Available Security . . . 532 261 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 535 262 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 538 263 18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 540 264 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control . . 545 265 18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection 266 with Session . . . . . . . . . . . . . . . . . . . . . 546 267 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 549 268 18.36. Operation 43: CREATE_SESSION - Create New Session and 269 Confirm Client ID . . . . . . . . . . . . . . . . . . . 567 270 18.37. Operation 44: DESTROY_SESSION - Destroy a Session . . . 578 271 18.38. 
Operation 45: FREE_STATEID - Free Stateid with No 272 Locks . . . . . . . . . . . . . . . . . . . . . . . . . 579 273 18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory 274 Delegation . . . . . . . . . . . . . . . . . . . . . . 580 275 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 584 276 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for 277 a File System . . . . . . . . . . . . . . . . . . . . . 587 278 18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a 279 Layout . . . . . . . . . . . . . . . . . . . . . . . . 588 280 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 593 281 18.44. Operation 51: LAYOUTRETURN - Release Layout 282 Information . . . . . . . . . . . . . . . . . . . . . . 603 283 18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed 284 Object . . . . . . . . . . . . . . . . . . . . . . . . 608 285 18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing 286 and Control . . . . . . . . . . . . . . . . . . . . . . 609 287 18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 615 288 18.48. Operation 55: TEST_STATEID - Test Stateids for 289 Validity . . . . . . . . . . . . . . . . . . . . . . . 617 290 18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 619 291 18.50. Operation 57: DESTROY_CLIENTID - Destroy a Client ID . . 622 292 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims 293 Finished . . . . . . . . . . . . . . . . . . . . . . . 623 294 18.52. Operation 10044: ILLEGAL - Illegal Operation . . . . . . 627 295 19. NFSv4.1 Callback Procedures . . . . . . . . . . . . . . . . . 628 296 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 628 297 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 628 298 20. NFSv4.1 Callback Operations . . . . . . . . . . . . . . . . . 632 299 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 632 300 20.2. Operation 4: CB_RECALL - Recall a Delegation . . . . . . 633 301 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from 302 Client . . . . . . . . . . . . . . . . . . . . . . . . 634 303 20.4. Operation 6: CB_NOTIFY - Notify Client of Directory 304 Changes . . . . . . . . . . . . . . . . . . . . . . . . 638 305 20.5. Operation 7: CB_PUSH_DELEG - Offer Previously Requested 306 Delegation to Client . . . . . . . . . . . . . . . . . 642 307 20.6. Operation 8: CB_RECALL_ANY - Keep Any N Recallable 308 Objects . . . . . . . . . . . . . . . . . . . . . . . . 643 309 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources 310 for Recallable Objects . . . . . . . . . . . . . . . . 646 311 20.8. Operation 10: CB_RECALL_SLOT - Change Flow Control 312 Limits . . . . . . . . . . . . . . . . . . . . . . . . 647 313 20.9. Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing 314 and Control . . . . . . . . . . . . . . . . . . . . . . 648 315 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending 316 Delegation Wants . . . . . . . . . . . . . . . . . . . 651 317 20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible 318 Lock Availability . . . . . . . . . . . . . . . . . . . 652 319 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device 320 ID Changes . . . . . . . . . . . . . . . . . . . . . . 653 321 20.13. Operation 10044: CB_ILLEGAL - Illegal Callback 322 Operation . . . . . . . . . . . . . . . . . . . . . . . 655 323 21. Security Considerations . . . . . . . . . . . . . . . . . . . 656 324 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 660 325 22.1. 
IANA Actions . . . . . . . . . . . . . . . . . . . . . . 661 326 22.2. Named Attribute Definitions . . . . . . . . . . . . . . 661 327 22.3. Device ID Notifications . . . . . . . . . . . . . . . . 662 328 22.4. Object Recall Types . . . . . . . . . . . . . . . . . . 664 329 22.5. Layout Types . . . . . . . . . . . . . . . . . . . . . . 666 330 22.6. Path Variable Definitions . . . . . . . . . . . . . . . 668 331 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 672 332 23.1. Normative References . . . . . . . . . . . . . . . . . . 672 333 23.2. Informative References . . . . . . . . . . . . . . . . . 676 334 Appendix A. The Need for This Update . . . . . . . . . . . . . . 679 335 Appendix B. Changes in This Update . . . . . . . . . . . . . . . 682 336 B.1. Revisions Made to Section 11 of RFC 5661 . . . . . . . . 682 337 B.2. Revisions Made to Operations in RFC 5661 . . . . . . . . 685 338 B.3. Revisions Made to Error Definitions in RFC 5661 . . . . . 687 339 B.4. Other Revisions Made to RFC 5661 . . . . . . . . . . . . 688 340 Appendix C. Security Issues That Need to Be Addressed . . . . . 689 341 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 691 342 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 694 344 1. Introduction 346 1.1. Introduction to This Update 348 Two important features previously defined in minor version 0 but 349 never fully addressed in minor version 1 are trunking, which is the 350 simultaneous use of multiple connections between a client and server, 351 potentially to different network addresses, and Transparent State 352 Migration, which allows a file system to be transferred between 353 servers in a way that provides to the client the ability to maintain 354 its existing locking state across the transfer. 356 The revised description of the NFS version 4 minor version 1 357 (NFSv4.1) protocol presented in this update is necessary to enable 358 full use of these features together with other multi-server namespace 359 features. This document is in the form of an updated description of 360 the NFSv4.1 protocol previously defined in RFC 5661 [66]. RFC 5661 361 is obsoleted by this document. However, the update has a limited 362 scope and is focused on enabling full use of trunking and Transparent 363 State Migration. The need for these changes is discussed in 364 Appendix A. Appendix B describes the specific changes made to arrive 365 at the current text. 367 This limited-scope update replaces the current NFSv4.1 RFC with the 368 intention of providing an authoritative and complete specification, 369 the motivation for which is discussed in [36], addressing the issues 370 within the scope of the update. However, it will not address issues 371 that are known but outside of this limited scope as could be expected 372 by a full update of the protocol. Below are some areas that are 373 known to need addressing in a future update of the protocol: 375 * Work needs to be done with regard to RFC 8178 [67], which 376 establishes NFSv4-wide versioning rules. As RFC 5661 is currently 377 inconsistent with that document, changes are needed in order to 378 arrive at a situation in which there would be no need for RFC 8178 379 to update the NFSv4.1 specification. 381 * Work needs to be done with regard to RFC 8434 [70], which 382 establishes the requirements for parallel NFS (pNFS) layout types, 383 which are not clearly defined in RFC 5661. 
When that work is done 384 and the resulting documents approved, the new NFSv4.1 385 specification document will provide a clear set of requirements 386 for layout types and a description of the file layout type that 387 conforms to those requirements. Other layout types will have 388 their own specification documents that conform to those 389 requirements as well. 391 * Work needs to be done to address many errata reports relevant to 392 RFC 5661, other than errata report 2006 [64], which is addressed 393 in this document. Addressing that report was not deferrable 394 because of the interaction of the changes suggested there and the 395 newly described handling of state and session migration. 397 The errata reports that have been deferred and that will need to 398 be addressed in a later document include reports currently 399 assigned a range of statuses in the errata reporting system, 400 including reports marked Accepted and those marked Hold For 401 Document Update because the change was too minor to address 402 immediately. 404 In addition, there is a set of other reports, including at least 405 one in state Rejected, that will need to be addressed in a later 406 document. This will involve making changes to consensus decisions 407 reflected in RFC 5661, in situations in which the working group 408 has decided that the treatment in RFC 5661 is incorrect and needs 409 to be revised to reflect the working group's new consensus and to 410 ensure compatibility with existing implementations that do not 411 follow the handling described in RFC 5661. 413 Note that it is expected that all such errata reports will remain 414 relevant to implementors and the authors of an eventual 415 rfc5661bis, despite the fact that this document obsoletes RFC 5661 416 [66]. 418 * There is a need for a new approach to the description of 419 internationalization since the current internationalization 420 section (Section 14) has never been implemented and does not meet 421 the needs of the NFSv4 protocol. Possible solutions are to create 422 a new internationalization section modeled on that in [68] or to 423 create a new document describing internationalization for all 424 NFSv4 minor versions and reference that document in the RFCs 425 defining both NFSv4.0 and NFSv4.1. 427 * There is a need for a revised treatment of security in NFSv4.1. 428 The issues with the existing treatment are discussed in 429 Appendix C. 431 Until the above work is done, there will not be a consistent set of 432 documents that provides a description of the NFSv4.1 protocol, and 433 any full description would involve documents updating other documents 434 within the specification. The updates applied by RFC 8434 [70] and 435 RFC 8178 [67] to RFC 5661 also apply to this specification, and will 436 apply to any subsequent v4.1 specification until that work is done. 438 1.2. The NFS Version 4 Minor Version 1 Protocol 440 The NFS version 4 minor version 1 (NFSv4.1) protocol is the second 441 minor version of the NFS version 4 (NFSv4) protocol. The first minor 442 version, NFSv4.0, is now described in RFC 7530 [68]. It generally 443 follows the guidelines for minor versioning that are listed in 444 Section 10 of RFC 3530 [37]. However, it diverges from guidelines 11 445 ("a client and server that support minor version X must support minor 446 versions 0 through X-1") and 12 ("no new features may be introduced 447 as mandatory in a minor version"). 
These divergences are due to the 448 introduction of the sessions model for managing non-idempotent 449 operations and the RECLAIM_COMPLETE operation. These two new 450 features are infrastructural in nature and simplify implementation of 451 existing and other new features. Making them anything but REQUIRED 452 would add undue complexity to protocol definition and implementation. 453 NFSv4.1 accordingly updates the minor versioning guidelines 454 (Section 2.7). 456 As a minor version, NFSv4.1 is consistent with the overall goals for 457 NFSv4, but extends the protocol so as to better meet those goals, 458 based on experiences with NFSv4.0. In addition, NFSv4.1 has adopted 459 some additional goals, which motivate some of the major extensions in 460 NFSv4.1. 462 1.3. Requirements Language 464 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 465 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 466 document are to be interpreted as described in RFC 2119 [1]. 468 1.4. Scope of This Document 470 This document describes the NFSv4.1 protocol. With respect to 471 NFSv4.0, this document does not: 473 * describe the NFSv4.0 protocol, except where needed to contrast 474 with NFSv4.1. 476 * modify the specification of the NFSv4.0 protocol. 478 * clarify the NFSv4.0 protocol. 480 1.5. NFSv4 Goals 482 The NFSv4 protocol is a further revision of the NFS protocol defined 483 already by NFSv3 [38]. It retains the essential characteristics of 484 previous versions: easy recovery; independence of transport 485 protocols, operating systems, and file systems; simplicity; and good 486 performance. NFSv4 has the following goals: 488 * Improved access and good performance on the Internet 490 The protocol is designed to transit firewalls easily, perform well 491 where latency is high and bandwidth is low, and scale to very 492 large numbers of clients per server. 494 * Strong security with negotiation built into the protocol 496 The protocol builds on the work of the ONCRPC working group in 497 supporting the RPCSEC_GSS protocol. Additionally, the NFSv4.1 498 protocol provides a mechanism to allow clients and servers the 499 ability to negotiate security and require clients and servers to 500 support a minimal set of security schemes. 502 * Good cross-platform interoperability 504 The protocol features a file system model that provides a useful, 505 common set of features that does not unduly favor one file system 506 or operating system over another. 508 * Designed for protocol extensions 510 The protocol is designed to accept standard extensions within a 511 framework that enables and encourages backward compatibility. 513 1.6. NFSv4.1 Goals 515 NFSv4.1 has the following goals, within the framework established by 516 the overall NFSv4 goals. 518 * To correct significant structural weaknesses and oversights 519 discovered in the base protocol. 521 * To add clarity and specificity to areas left unaddressed or not 522 addressed in sufficient detail in the base protocol. However, as 523 stated in Section 1.4, it is not a goal to clarify the NFSv4.0 524 protocol in the NFSv4.1 specification. 526 * To add specific features based on experience with the existing 527 protocol and recent industry developments. 529 * To provide protocol support to take advantage of clustered server 530 deployments including the ability to provide scalable parallel 531 access to files distributed among multiple servers. 533 1.7. 
General Definitions 535 The following definitions provide an appropriate context for the 536 reader. 538 Byte: In this document, a byte is an octet, i.e., a datum exactly 8 539 bits in length. 541 Client: The client is the entity that accesses the NFS server's 542 resources. The client may be an application that contains the 543 logic to access the NFS server directly. The client may also be 544 the traditional operating system client that provides remote file 545 system services for a set of applications. 547 A client is uniquely identified by a client owner. 549 With reference to byte-range locking, the client is also the 550 entity that maintains a set of locks on behalf of one or more 551 applications. This client is responsible for crash or failure 552 recovery for those locks it manages. 554 Note that multiple clients may share the same transport and 555 connection and multiple clients may exist on the same network 556 node. 558 Client ID: The client ID is a 64-bit quantity used as a unique, 559 short-hand reference to a client-supplied verifier and client 560 owner. The server is responsible for supplying the client ID. 562 Client Owner: The client owner is a unique string, opaque to the 563 server, that identifies a client. Multiple network connections 564 and source network addresses originating from those connections 565 may share a client owner. The server is expected to treat 566 requests from connections with the same client owner as coming 567 from the same client. 569 File System: The file system is the collection of objects on a 570 server (as identified by the major identifier of a server owner, 571 which is defined later in this section) that share the same fsid 572 attribute (see Section 5.8.1.9). 574 Lease: A lease is an interval of time defined by the server for 575 which the client is irrevocably granted locks. At the end of a 576 lease period, locks may be revoked if the lease has not been 577 extended. A lock must be revoked if a conflicting lock has been 578 granted after the lease interval. 580 A server grants a client a single lease for all state. 582 Lock: The term "lock" is used to refer to byte-range (in UNIX 583 environments, also known as record) locks, share reservations, 584 delegations, or layouts unless specifically stated otherwise. 586 Secret State Verifier (SSV): The SSV is a unique secret key shared 587 between a client and server. The SSV serves as the secret key for 588 an internal (that is, internal to NFSv4.1) Generic Security 589 Services (GSS) mechanism (the SSV GSS mechanism; see 590 Section 2.10.9). The SSV GSS mechanism uses the SSV to compute 591 message integrity code (MIC) and Wrap tokens. See 592 Section 2.10.8.3 for more details on how NFSv4.1 uses the SSV and 593 the SSV GSS mechanism. 595 Server: The Server is the entity responsible for coordinating client 596 access to a set of file systems and is identified by a server 597 owner. A server can span multiple network addresses. 599 Server Owner: The server owner identifies the server to the client. 600 The server owner consists of a major identifier and a minor 601 identifier. When the client has two connections each to a peer 602 with the same major identifier, the client assumes that both peers 603 are the same server (the server namespace is the same via each 604 connection) and that lock state is shareable across both 605 connections. 
When each peer has both the same major and minor 606 identifiers, the client assumes that each connection might be 607 associable with the same session. 609 Stable Storage: Stable storage is storage from which data stored by 610 an NFSv4.1 server can be recovered without data loss from multiple 611 power failures (including cascading power failures, that is, 612 several power failures in quick succession), operating system 613 failures, and/or hardware failure of components other than the 614 storage medium itself (such as disk, nonvolatile RAM, flash 615 memory, etc.). 617 Some examples of stable storage that are allowable for an NFS 618 server include: 620 1. Media commit of data; that is, the modified data has been 621 successfully written to the disk media, for example, the disk 622 platter. 624 2. An immediate reply disk drive with battery-backed, on-drive 625 intermediate storage or uninterruptible power system (UPS). 627 3. Server commit of data with battery-backed intermediate storage 628 and recovery software. 630 4. Cache commit with uninterruptible power system (UPS) and 631 recovery software. 633 Stateid: A stateid is a 128-bit quantity returned by a server that 634 uniquely defines the open and locking states provided by the 635 server for a specific open-owner or lock-owner/open-owner pair for 636 a specific file and type of lock. 638 Verifier: A verifier is a 64-bit quantity generated by the client 639 that the server can use to determine if the client has restarted 640 and lost all previous lock state. 642 1.8. Overview of NFSv4.1 Features 644 The major features of the NFSv4.1 protocol will be reviewed in brief. 645 This will be done to provide an appropriate context for both the 646 reader who is familiar with the previous versions of the NFS protocol 647 and the reader who is new to the NFS protocols. For the reader new 648 to the NFS protocols, there is still a set of fundamental knowledge 649 that is expected. The reader should be familiar with the External 650 Data Representation (XDR) and Remote Procedure Call (RPC) protocols 651 as described in [2] and [3]. A basic knowledge of file systems and 652 distributed file systems is expected as well. 654 In general, this specification of NFSv4.1 will not distinguish those 655 features added in minor version 1 from those present in the base 656 protocol but will treat NFSv4.1 as a unified whole. See Section 1.9 657 for a summary of the differences between NFSv4.0 and NFSv4.1. 659 1.8.1. RPC and Security 661 As with previous versions of NFS, the External Data Representation 662 (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFSv4.1 663 protocol are those defined in [2] and [3]. To meet end-to-end 664 security requirements, the RPCSEC_GSS framework [4] is used to extend 665 the basic RPC security. With the use of RPCSEC_GSS, various 666 mechanisms can be provided to offer authentication, integrity, and 667 privacy to the NFSv4 protocol. Kerberos V5 is used as described in 668 [5] to provide one security framework. With the use of RPCSEC_GSS, 669 other mechanisms may also be specified and used for NFSv4.1 security. 671 To enable in-band security negotiation, the NFSv4.1 protocol has 672 operations that provide the client a method of querying the server 673 about its policies regarding which security mechanisms must be used 674 for access to the server's file system resources. With this, the 675 client can securely match the security mechanism that meets the 676 policies specified at both the client and server. 
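
As a non-normative illustration of this negotiation, the following Python sketch shows how a client might match the security policy a server advertises (for example, via SECINFO, Section 18.29) against its own preference order. The data model, preference list, and helper names are assumptions made for this example; they are not definitions taken from this specification.

# Illustrative sketch only: matching a server's advertised security
# policy against a client's local preference order.  The data model
# below is an assumption for the example, not a protocol definition.

from dataclasses import dataclass
from typing import Optional

KRB5_OID = "1.2.840.113554.1.2.2"   # Kerberos V5 GSS-API mechanism

@dataclass(frozen=True)
class SecFlavor:
    flavor: str                      # "AUTH_SYS" or "RPCSEC_GSS"
    mechanism: Optional[str] = None  # GSS mechanism OID, if RPCSEC_GSS
    service: Optional[str] = None    # "none" | "integrity" | "privacy"

# Client preference, strongest first (local policy, not mandated here).
CLIENT_PREFERENCE = [
    SecFlavor("RPCSEC_GSS", KRB5_OID, "privacy"),    # krb5p
    SecFlavor("RPCSEC_GSS", KRB5_OID, "integrity"),  # krb5i
    SecFlavor("RPCSEC_GSS", KRB5_OID, "none"),       # krb5
    SecFlavor("AUTH_SYS"),
]

def choose_flavor(server_policy: list) -> SecFlavor:
    """Pick the most-preferred flavor that the server also allows."""
    allowed = set(server_policy)
    for candidate in CLIENT_PREFERENCE:
        if candidate in allowed:
            return candidate
    raise PermissionError("no mutually acceptable security flavor")

if __name__ == "__main__":
    # Server (hypothetically) requires Kerberos V5 integrity or privacy.
    server = [SecFlavor("RPCSEC_GSS", KRB5_OID, "privacy"),
              SecFlavor("RPCSEC_GSS", KRB5_OID, "integrity")]
    print(choose_flavor(server))     # selects the krb5p (privacy) entry

A real client would drive this selection from actual SECINFO results and would repeat it when the server indicates a mismatch, for example by returning NFS4ERR_WRONGSEC.
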
678 NFSv4.1 introduces parallel access (see Section 1.8.2.2), which is 679 called pNFS. The security framework described in this section is 680 significantly modified by the introduction of pNFS (see 681 Section 12.9), because data access is sometimes not over RPC. The 682 level of significance varies with the storage protocol (see 683 Section 12.2.5) and can be as low as zero impact (see Section 13.12). 685 1.8.2. Protocol Structure 687 1.8.2.1. Core Protocol 689 Unlike NFSv3, which used a series of ancillary protocols (e.g., NLM, 690 NSM (Network Status Monitor), MOUNT), within all minor versions of 691 NFSv4 a single RPC protocol is used to make requests to the server. 692 Facilities that had been separate protocols, such as locking, are now 693 integrated within a single unified protocol. 695 1.8.2.2. Parallel Access 697 Minor version 1 supports high-performance data access to a clustered 698 server implementation by enabling a separation of metadata access and 699 data access, with the latter done to multiple servers in parallel. 701 Such parallel data access is controlled by recallable objects known 702 as "layouts", which are integrated into the protocol locking model. 703 Clients direct requests for data access to a set of data servers 704 specified by the layout via a data storage protocol which may be 705 NFSv4.1 or may be another protocol. 707 Because the protocols used for parallel data access are not 708 necessarily RPC-based, the RPC-based security model (Section 1.8.1) 709 is obviously impacted (see Section 12.9). The degree of impact 710 varies with the storage protocol (see Section 12.2.5) used for data 711 access, and can be as low as zero (see Section 13.12). 713 1.8.3. File System Model 715 The general file system model used for the NFSv4.1 protocol is the 716 same as previous versions. The server file system is hierarchical 717 with the regular files contained within being treated as opaque byte 718 streams. In a slight departure, file and directory names are encoded 719 with UTF-8 to deal with the basics of internationalization. 721 The NFSv4.1 protocol does not require a separate protocol to provide 722 for the initial mapping between path name and filehandle. All file 723 systems exported by a server are presented as a tree so that all file 724 systems are reachable from a special per-server global root 725 filehandle. This allows LOOKUP operations to be used to perform 726 functions previously provided by the MOUNT protocol. The server 727 provides any necessary pseudo file systems to bridge any gaps that 728 arise due to unexported gaps between exported file systems. 730 1.8.3.1. Filehandles 732 As in previous versions of the NFS protocol, opaque filehandles are 733 used to identify individual files and directories. Lookup-type and 734 create operations translate file and directory names to filehandles, 735 which are then used to identify objects in subsequent operations. 737 The NFSv4.1 protocol provides support for persistent filehandles, 738 guaranteed to be valid for the lifetime of the file system object 739 designated. In addition, it provides support to servers to provide 740 filehandles with more limited validity guarantees, called volatile 741 filehandles. 743 1.8.3.2. File Attributes 745 The NFSv4.1 protocol has a rich and extensible file object attribute 746 structure, which is divided into REQUIRED, RECOMMENDED, and named 747 attributes (see Section 5). 
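
To make the attribute model concrete, the short sketch below encodes the bitmap4 with which operations such as GETATTR identify attributes by number. The attribute numbers and the word/bit layout follow Sections 5.6 and 3.3 of the full specification; the Python itself is only an illustrative sketch, not part of the protocol.

# Illustrative sketch: encode the bitmap4 used to request attributes
# (e.g., in GETATTR).  The constants reproduced here follow the
# attribute tables in Sections 5.6/5.7 and are shown for illustration.

FATTR4_TYPE = 1      # REQUIRED: object type
FATTR4_CHANGE = 3    # REQUIRED: change attribute
FATTR4_SIZE = 4      # REQUIRED: object size

def encode_bitmap4(attr_numbers):
    """Return the list of 32-bit words for a bitmap4 covering attr_numbers.

    Attribute number n is represented by bit (n % 32) of word (n // 32).
    """
    words = []
    for n in sorted(attr_numbers):
        word, bit = divmod(n, 32)
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words

if __name__ == "__main__":
    # Request type, change, and size in a single GETATTR.
    print(encode_bitmap4({FATTR4_TYPE, FATTR4_CHANGE, FATTR4_SIZE}))
    # -> [26]  (bits 1, 3, and 4 set: 0b11010)
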
749 Several (but not all) of the REQUIRED attributes are derived from the 750 attributes of NFSv3 (see the definition of the fattr3 data type in 751 [38]). An example of a REQUIRED attribute is the file object's type 752 (Section 5.8.1.2) so that regular files can be distinguished from 753 directories (also known as folders in some operating environments) 754 and other types of objects. REQUIRED attributes are discussed in 755 Section 5.1. 757 An example of three RECOMMENDED attributes are acl, sacl, and dacl. 758 These attributes define an Access Control List (ACL) on a file object 759 (Section 6). An ACL provides directory and file access control 760 beyond the model used in NFSv3. The ACL definition allows for 761 specification of specific sets of permissions for individual users 762 and groups. In addition, ACL inheritance allows propagation of 763 access permissions and restrictions down a directory tree as file 764 system objects are created. RECOMMENDED attributes are discussed in 765 Section 5.2. 767 A named attribute is an opaque byte stream that is associated with a 768 directory or file and referred to by a string name. Named attributes 769 are meant to be used by client applications as a method to associate 770 application-specific data with a regular file or directory. NFSv4.1 771 modifies named attributes relative to NFSv4.0 by tightening the 772 allowed operations in order to prevent the development of non- 773 interoperable implementations. Named attributes are discussed in 774 Section 5.3. 776 1.8.3.3. Multi-Server Namespace 778 NFSv4.1 contains a number of features to allow implementation of 779 namespaces that cross server boundaries and that allow and facilitate 780 a nondisruptive transfer of support for individual file systems 781 between servers. They are all based upon attributes that allow one 782 file system to specify alternate, additional, and new location 783 information that specifies how the client may access that file 784 system. 786 These attributes can be used to provide for individual active file 787 systems: 789 * Alternate network addresses to access the current file system 790 instance. 792 * The locations of alternate file system instances or replicas to be 793 used in the event that the current file system instance becomes 794 unavailable. 796 These file system location attributes may be used together with the 797 concept of absent file systems, in which a position in the server 798 namespace is associated with locations on other servers without there 799 being any corresponding file system instance on the current server. 800 For example, 802 * These attributes may be used with absent file systems to implement 803 referrals whereby one server may direct the client to a file 804 system provided by another server. This allows extensive multi- 805 server namespaces to be constructed. 807 * These attributes may be provided when a previously present file 808 system becomes absent. This allows nondisruptive migration of 809 file systems to alternate servers. 811 1.8.4. Locking Facilities 813 As mentioned previously, NFSv4.1 is a single protocol that includes 814 locking facilities. These locking facilities include support for 815 many types of locks including a number of sorts of recallable locks. 816 Recallable locks such as delegations allow the client to be assured 817 that certain events will not occur so long as that lock is held. 818 When circumstances change, the lock is recalled via a callback 819 request. 
The assurances provided by delegations allow more extensive 820 caching to be done safely when circumstances allow it. 822 The types of locks are: 824 * Share reservations as established by OPEN operations. 826 * Byte-range locks. 828 * File delegations, which are recallable locks that assure the 829 holder that inconsistent opens and file changes cannot occur so 830 long as the delegation is held. 832 * Directory delegations, which are recallable locks that assure the 833 holder that inconsistent directory modifications cannot occur so 834 long as the delegation is held. 836 * Layouts, which are recallable objects that assure the holder that 837 direct access to the file data may be performed directly by the 838 client and that no change to the data's location that is 839 inconsistent with that access may be made so long as the layout is 840 held. 842 All locks for a given client are tied together under a single client- 843 wide lease. All requests made on sessions associated with the client 844 renew that lease. When the client's lease is not promptly renewed, 845 the client's locks are subject to revocation. In the event of server 846 restart, clients have the opportunity to safely reclaim their locks 847 within a special grace period. 849 1.9. Differences from NFSv4.0 851 The following summarizes the major differences between minor version 852 1 and the base protocol: 854 * Implementation of the sessions model (Section 2.10). 856 * Parallel access to data (Section 12). 858 * Addition of the RECLAIM_COMPLETE operation to better structure the 859 lock reclamation process (Section 18.51). 861 * Enhanced delegation support as follows. 863 - Delegations on directories and other file types in addition to 864 regular files (Section 18.39, Section 18.49). 866 - Operations to optimize acquisition of recalled or denied 867 delegations (Section 18.49, Section 20.5, Section 20.7). 869 - Notifications of changes to files and directories 870 (Section 18.39, Section 20.4). 872 - A method to allow a server to indicate that it is recalling one 873 or more delegations for resource management reasons, and thus a 874 method to allow the client to pick which delegations to return 875 (Section 20.6). 877 * Attributes can be set atomically during exclusive file create via 878 the OPEN operation (see the new EXCLUSIVE4_1 creation method in 879 Section 18.16). 881 * Open files can be preserved if removed and the hard link count 882 ("hard link" is defined in an Open Group [6] standard) goes to 883 zero, thus obviating the need for clients to rename deleted files 884 to partially hidden names -- colloquially called "silly rename" 885 (see the new OPEN4_RESULT_PRESERVE_UNLINKED reply flag in 886 Section 18.16). 888 * Improved compatibility with Microsoft Windows for Access Control 889 Lists (Section 6.2.3, Section 6.2.2, Section 6.4.3.2). 891 * Data retention (Section 5.13). 893 * Identification of the implementation of the NFS client and server 894 (Section 18.35). 896 * Support for notification of the availability of byte-range locks 897 (see the new OPEN4_RESULT_MAY_NOTIFY_LOCK reply flag in 898 Section 18.16 and see Section 20.11). 900 * In NFSv4.1, LIPKEY and SPKM-3 are not required security mechanisms 901 [39]. 903 2. Core Infrastructure 905 2.1. Introduction 907 NFSv4.1 relies on core infrastructure common to nearly every 908 operation. This core infrastructure is described in the remainder of 909 this section. 911 2.2. 
RPC and XDR 913 The NFSv4.1 protocol is a Remote Procedure Call (RPC) application 914 that uses RPC version 2 and the corresponding eXternal Data 915 Representation (XDR) as defined in [3] and [2]. 917 2.2.1. RPC-Based Security 919 Previous NFS versions have been thought of as having a host-based 920 authentication model, where the NFS server authenticates the NFS 921 client, and trusts the client to authenticate all users. Actually, 922 NFS has always depended on RPC for authentication. One of the first 923 forms of RPC authentication, AUTH_SYS, had no strong authentication 924 and required a host-based authentication approach. NFSv4.1 also 925 depends on RPC for basic security services and mandates RPC support 926 for a user-based authentication model. The user-based authentication 927 model has user principals authenticated by a server, and in turn the 928 server authenticated by user principals. RPC provides some basic 929 security services that are used by NFSv4.1. 931 2.2.1.1. RPC Security Flavors 933 As described in "Authentication", Section 7 of [3], RPC security is 934 encapsulated in the RPC header, via a security or authentication 935 flavor, and information specific to the specified security flavor. 936 Every RPC header conveys information used to identify and 937 authenticate a client and server. As discussed in Section 2.2.1.1.1, 938 some security flavors provide additional security services. 940 NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This 941 requirement to implement is not a requirement to use.) Other 942 flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well. 944 2.2.1.1.1. RPCSEC_GSS and Security Services 946 RPCSEC_GSS [4] uses the functionality of GSS-API [7]. This allows 947 for the use of various security mechanisms by the RPC layer without 948 the additional implementation overhead of adding RPC security 949 flavors. 951 2.2.1.1.1.1. Identification, Authentication, Integrity, Privacy 953 Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate 954 users on clients to servers, and servers to users. It can also 955 perform integrity checking on the entire RPC message, including the 956 RPC header, and on the arguments or results. Finally, privacy, 957 usually via encryption, is a service available with RPCSEC_GSS. 958 Privacy is performed on the arguments and results. Note that if 959 privacy is selected, integrity, authentication, and identification 960 are enabled. If privacy is not selected, but integrity is selected, 961 authentication and identification are enabled. If integrity and 962 privacy are not selected, but authentication is enabled, 963 identification is enabled. RPCSEC_GSS does not provide 964 identification as a separate service. 966 Although GSS-API has an authentication service distinct from its 967 privacy and integrity services, GSS-API's authentication service is 968 not used for RPCSEC_GSS's authentication service. Instead, each RPC 969 request and response header is integrity protected with the GSS-API 970 integrity service, and this allows RPCSEC_GSS to offer per-RPC 971 authentication and identity. See [4] for more information. 973 NFSv4.1 client and servers MUST support RPCSEC_GSS's integrity and 974 authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's 975 privacy service. NFSv4.1 clients SHOULD support RPCSEC_GSS's privacy 976 service. 978 2.2.1.1.1.2. Security Mechanisms for NFSv4.1 980 RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide 981 security services. 
Therefore, NFSv4.1 clients and servers MUST 982 support the Kerberos V5 security mechanism. 984 The use of RPCSEC_GSS requires selection of mechanism, quality of 985 protection (QOP), and service (authentication, integrity, privacy). 986 For the mandated security mechanisms, NFSv4.1 specifies that a QOP of 987 zero is used, leaving it up to the mechanism or the mechanism's 988 configuration to map QOP zero to an appropriate level of protection. 989 Each mandated mechanism specifies a minimum set of cryptographic 990 algorithms for implementing integrity and privacy. NFSv4.1 clients 991 and servers MUST be implemented on operating environments that comply 992 with the REQUIRED cryptographic algorithms of each REQUIRED 993 mechanism. 995 2.2.1.1.1.2.1. Kerberos V5 997 The Kerberos V5 GSS-API mechanism as described in [5] MUST be 998 implemented with the RPCSEC_GSS services as specified in the 999 following table: 1001 column descriptions: 1002 1 == number of pseudo flavor 1003 2 == name of pseudo flavor 1004 3 == mechanism's OID 1005 4 == RPCSEC_GSS service 1006 5 == NFSv4.1 clients MUST support 1007 6 == NFSv4.1 servers MUST support 1009 1 2 3 4 5 6 1010 ------------------------------------------------------------------ 1011 390003 krb5 1.2.840.113554.1.2.2 rpc_gss_svc_none yes yes 1012 390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes 1013 390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy no yes 1015 Note that the number and name of the pseudo flavor are presented here 1016 as a mapping aid to the implementor. Because the NFSv4.1 protocol 1017 includes a method to negotiate security and it understands the GSS- 1018 API mechanism, the pseudo flavor is not needed. The pseudo flavor is 1019 needed for the NFSv3 since the security negotiation is done via the 1020 MOUNT protocol as described in [40]. 1022 At the time NFSv4.1 was specified, the Advanced Encryption Standard 1023 (AES) with HMAC-SHA1 was a REQUIRED algorithm set for Kerberos V5. 1024 In contrast, when NFSv4.0 was specified, weaker algorithm sets were 1025 REQUIRED for Kerberos V5, and were REQUIRED in the NFSv4.0 1026 specification, because the Kerberos V5 specification at the time did 1027 not specify stronger algorithms. The NFSv4.1 specification does not 1028 specify REQUIRED algorithms for Kerberos V5, and instead, the 1029 implementor is expected to track the evolution of the Kerberos V5 1030 standard if and when stronger algorithms are specified. 1032 2.2.1.1.1.2.1.1. Security Considerations for Cryptographic Algorithms 1033 in Kerberos V5 1035 When deploying NFSv4.1, the strength of the security achieved depends 1036 on the existing Kerberos V5 infrastructure. The algorithms of 1037 Kerberos V5 are not directly exposed to or selectable by the client 1038 or server, so there is some due diligence required by the user of 1039 NFSv4.1 to ensure that security is acceptable where needed. 1041 2.2.1.1.1.3. GSS Server Principal 1043 Regardless of what security mechanism under RPCSEC_GSS is being used, 1044 the NFS server MUST identify itself in GSS-API via a 1045 GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE 1046 names are of the form: 1048 service@hostname 1050 For NFS, the "service" element is 1052 nfs 1054 Implementations of security mechanisms will convert nfs@hostname to 1055 various different forms. For Kerberos V5, the following form is 1056 RECOMMENDED: 1058 nfs/hostname 1060 2.3. 
COMPOUND and CB_COMPOUND 1062 A significant departure from the versions of the NFS protocol before 1063 NFSv4 is the introduction of the COMPOUND procedure. For the NFSv4 1064 protocol, in all minor versions, there are exactly two RPC 1065 procedures, NULL and COMPOUND. The COMPOUND procedure is defined as 1066 a series of individual operations and these operations perform the 1067 sorts of functions performed by traditional NFS procedures. 1069 The operations combined within a COMPOUND request are evaluated in 1070 order by the server, without any atomicity guarantees. A limited set 1071 of facilities exist to pass results from one operation to another. 1072 Once an operation returns a failing result, the evaluation ends and 1073 the results of all evaluated operations are returned to the client. 1075 With the use of the COMPOUND procedure, the client is able to build 1076 simple or complex requests. These COMPOUND requests allow for a 1077 reduction in the number of RPCs needed for logical file system 1078 operations. For example, multi-component look up requests can be 1079 constructed by combining multiple LOOKUP operations. Those can be 1080 further combined with operations such as GETATTR, READDIR, or OPEN 1081 plus READ to do more complicated sets of operation without incurring 1082 additional latency. 1084 NFSv4.1 also contains a considerable set of callback operations in 1085 which the server makes an RPC directed at the client. Callback RPCs 1086 have a similar structure to that of the normal server requests. In 1087 all minor versions of the NFSv4 protocol, there are two callback RPC 1088 procedures: CB_NULL and CB_COMPOUND. The CB_COMPOUND procedure is 1089 defined in an analogous fashion to that of COMPOUND with its own set 1090 of callback operations. 1092 The addition of new server and callback operations within the 1093 COMPOUND and CB_COMPOUND request framework provides a means of 1094 extending the protocol in subsequent minor versions. 1096 Except for a small number of operations needed for session creation, 1097 server requests and callback requests are performed within the 1098 context of a session. Sessions provide a client context for every 1099 request and support robust replay protection for non-idempotent 1100 requests. 1102 2.4. Client Identifiers and Client Owners 1104 For each operation that obtains or depends on locking state, the 1105 specific client needs to be identifiable by the server. 1107 Each distinct client instance is represented by a client ID. A 1108 client ID is a 64-bit identifier representing a specific client at a 1109 given time. The client ID is changed whenever the client re- 1110 initializes, and may change when the server re-initializes. Client 1111 IDs are used to support lock identification and crash recovery. 1113 During steady state operation, the client ID associated with each 1114 operation is derived from the session (see Section 2.10) on which the 1115 operation is sent. A session is associated with a client ID when the 1116 session is created. 1118 Unlike NFSv4.0, the only NFSv4.1 operations possible before a client 1119 ID is established are those needed to establish the client ID. 1121 A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION 1122 operation using that client ID (eir_clientid as returned from 1123 EXCHANGE_ID) is required to establish and confirm the client ID on 1124 the server. 
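The ordering constraint just described can be illustrated by the following non-normative Python sketch. It models only the state sequencing on the server side (a client ID produced by EXCHANGE_ID remains unconfirmed until a CREATE_SESSION naming it succeeds); the class and method names, such as ToyServer, are purely illustrative assumptions and are not drawn from the protocol or its XDR definitions.

   import itertools

   class ToyServer:
       """Minimal model of client ID establishment (not wire-accurate)."""

       def __init__(self):
           self._ids = itertools.count(0x1000)
           self._unconfirmed = {}   # plays the role of eir_clientid -> owner
           self._confirmed = {}

       def exchange_id(self, co_ownerid, co_verifier):
           clientid = next(self._ids)        # stands in for eir_clientid
           self._unconfirmed[clientid] = (co_ownerid, co_verifier)
           return clientid

       def create_session(self, clientid):
           if clientid in self._unconfirmed:
               # The first CREATE_SESSION using the client ID confirms it.
               self._confirmed[clientid] = self._unconfirmed.pop(clientid)
           elif clientid not in self._confirmed:
               raise RuntimeError("NFS4ERR_STALE_CLIENTID")
           return "session-%x" % clientid

   server = ToyServer()
   cid = server.exchange_id(b"host-a.example/nfs", b"boot-0001")
   print(server.create_session(cid))   # confirms cid and yields a session

In this toy model, a CREATE_SESSION naming an unknown client ID simply fails, loosely mirroring the NFS4ERR_STALE_CLIENTID behavior discussed later in this section.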
Establishment of identification by a new incarnation of 1125 the client also has the effect of immediately releasing any locking 1126 state that a previous incarnation of that same client might have had 1127 on the server. Such released state would include all byte-range 1128 lock, share reservation, layout state, and -- where the server 1129 supports neither the CLAIM_DELEGATE_PREV nor CLAIM_DELEG_CUR_FH claim 1130 types -- all delegation state associated with the same client with 1131 the same identity. For discussion of delegation state recovery, see 1132 Section 10.2.1. For discussion of layout state recovery, see 1133 Section 12.7.1. 1135 Releasing such state requires that the server be able to determine 1136 that one client instance is the successor of another. Where this 1137 cannot be done, for any of a number of reasons, the locking state 1138 will remain for a time subject to lease expiration (see Section 8.3) 1139 and the new client will need to wait for such state to be removed, if 1140 it makes conflicting lock requests. 1142 Client identification is encapsulated in the following client owner 1143 data type: 1145 struct client_owner4 { 1146 verifier4 co_verifier; 1147 opaque co_ownerid; 1148 }; 1150 The first field, co_verifier, is a client incarnation verifier, 1151 allowing the server to distinguish successive incarnations (e.g., 1152 reboots) of the same client. The server will start the process of 1153 canceling the client's leased state if co_verifier is different than 1154 what the server has previously recorded for the identified client (as 1155 specified in the co_ownerid field). 1157 The second field, co_ownerid, is a variable length string that 1158 uniquely defines the client so that subsequent instances of the same 1159 client bear the same co_ownerid with a different verifier. 1161 There are several considerations for how the client generates the 1162 co_ownerid string: 1164 * The string should be unique so that multiple clients do not 1165 present the same string. The consequences of two clients 1166 presenting the same string range from one client getting an error 1167 to one client having its leased state abruptly and unexpectedly 1168 cancelled. 1170 * The string should be selected so that subsequent incarnations 1171 (e.g., restarts) of the same client cause the client to present 1172 the same string. The implementor is cautioned from an approach 1173 that requires the string to be recorded in a local file because 1174 this precludes the use of the implementation in an environment 1175 where there is no local disk and all file access is from an 1176 NFSv4.1 server. 1178 * The string should be the same for each server network address that 1179 the client accesses. This way, if a server has multiple 1180 interfaces, the client can trunk traffic over multiple network 1181 paths as described in Section 2.10.5. (Note: the precise opposite 1182 was advised in the NFSv4.0 specification [37].) 1184 * The algorithm for generating the string should not assume that the 1185 client's network address will not change, unless the client 1186 implementation knows it is using statically assigned network 1187 addresses. This includes changes between client incarnations and 1188 even changes while the client is still running in its current 1189 incarnation. 
Thus, with dynamic address assignment, if the client 1190 includes just the client's network address in the co_ownerid 1191 string, there is a real risk that after the client gives up the 1192 network address, another client, using a similar algorithm for 1193 generating the co_ownerid string, would generate a conflicting 1194 co_ownerid string. 1196 Given the above considerations, an example of a well-generated 1197 co_ownerid string is one that includes: 1199 * If applicable, the client's statically assigned network address. 1201 * Additional information that tends to be unique, such as one or 1202 more of: 1204 - The client machine's serial number (for privacy reasons, it is 1205 best to perform some one-way function on the serial number). 1207 - A Media Access Control (MAC) address (again, a one-way function 1208 should be performed). 1210 - The timestamp of when the NFSv4.1 software was first installed 1211 on the client (though this is subject to the previously 1212 mentioned caution about using information that is stored in a 1213 file, because the file might only be accessible over NFSv4.1). 1215 - A true random number. However, since this number ought to be 1216 the same between client incarnations, this shares the same 1217 problem as that of using the timestamp of the software 1218 installation. 1220 * For a user-level NFSv4.1 client, it should contain additional 1221 information to distinguish the client from other user-level 1222 clients running on the same host, such as a process identifier or 1223 other unique sequence. 1225 The client ID is assigned by the server (the eir_clientid result from 1226 EXCHANGE_ID) and should be chosen so that it will not conflict with a 1227 client ID previously assigned by the server. This applies across 1228 server restarts. 1230 In the event of a server restart, a client may find out that its 1231 current client ID is no longer valid when it receives an 1232 NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on 1233 the characteristics of the sessions involved, specifically whether 1234 the session is persistent (see Section 2.10.6.5), but in each case 1235 the client will receive this error when it attempts to establish a 1236 new session with the existing client ID and receives the error 1237 NFS4ERR_STALE_CLIENTID, indicating that a new client ID needs to be 1238 obtained via EXCHANGE_ID and the new session established with that 1239 client ID. 1241 When a session is not persistent, the client will find out that it 1242 needs to create a new session as a result of getting an 1243 NFS4ERR_BADSESSION, since the session in question was lost as part of 1244 a server restart. When the existing client ID is presented to a 1245 server as part of creating a session and that client ID is not 1246 recognized, as would happen after a server restart, the server will 1247 reject the request with the error NFS4ERR_STALE_CLIENTID. 1249 In the case of the session being persistent, the client will re- 1250 establish communication using the existing session after the restart. 1251 This session will be associated with the existing client ID but may 1252 only be used to retransmit operations that the client previously 1253 transmitted and did not see replies to. Replies to operations that 1254 the server previously performed will come from the reply cache; 1255 otherwise, NFS4ERR_DEADSESSION will be returned. Hence, such a 1256 session is referred to as "dead". 
In this situation, in order to 1257 perform new operations, the client needs to establish a new session. 1258 If an attempt is made to establish this new session with the existing 1259 client ID, the server will reject the request with 1260 NFS4ERR_STALE_CLIENTID. 1262 When NFS4ERR_STALE_CLIENTID is received in either of these 1263 situations, the client needs to obtain a new client ID by use of the 1264 EXCHANGE_ID operation, then use that client ID as the basis of a new 1265 session, and then proceed to any other necessary recovery for the 1266 server restart case (see Section 8.4.2). 1268 See the descriptions of EXCHANGE_ID (Section 18.35) and 1269 CREATE_SESSION (Section 18.36) for a complete specification of these 1270 operations. 1272 2.4.1. Upgrade from NFSv4.0 to NFSv4.1 1274 To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a 1275 value of data type client_owner4 in an EXCHANGE_ID with a value of 1276 data type nfs_client_id4 that was established using the SETCLIENTID 1277 operation of NFSv4.0. A server that does so will allow an upgraded 1278 client to avoid waiting until the lease (i.e., the lease established 1279 by the NFSv4.0 instance client) expires. This requires that the 1280 value of data type client_owner4 be constructed the same way as the 1281 value of data type nfs_client_id4. If the latter's contents included 1282 the server's network address (per the recommendations of the NFSv4.0 1283 specification [37]), and the NFSv4.1 client does not wish to use a 1284 client ID that prevents trunking, it should send two EXCHANGE_ID 1285 operations. The first EXCHANGE_ID will have a client_owner4 equal to 1286 the nfs_client_id4. This will clear the state created by the NFSv4.0 1287 client. The second EXCHANGE_ID will not have the server's network 1288 address. The state created for the second EXCHANGE_ID will not have 1289 to wait for lease expiration, because there will be no state to 1290 expire. 1292 2.4.2. Server Release of Client ID 1294 NFSv4.1 introduces a new operation called DESTROY_CLIENTID 1295 (Section 18.50), which the client SHOULD use to destroy a client ID 1296 it no longer needs. This permits graceful, bilateral release of a 1297 client ID. The operation cannot be used if there are sessions 1298 associated with the client ID, or state with an unexpired lease. 1300 If the server determines that the client holds no associated state 1301 for its client ID (associated state includes unrevoked sessions, 1302 opens, locks, delegations, layouts, and wants), the server MAY choose 1303 to unilaterally release the client ID in order to conserve resources. 1304 If the client contacts the server after this release, the server MUST 1305 ensure that the client receives the appropriate error so that it will 1306 use the EXCHANGE_ID/CREATE_SESSION sequence to establish a new client 1307 ID. The server ought to be very hesitant to release a client ID 1308 since the resulting work on the client to recover from such an event 1309 will be the same burden as if the server had failed and restarted. 1311 Typically, a server would not release a client ID unless there had 1312 been no activity from that client for many minutes. As long as there 1313 are sessions, opens, locks, delegations, layouts, or wants, the 1314 server MUST NOT release the client ID. See Section 2.10.13.1.4 for 1315 discussion on releasing inactive sessions. 1317 2.4.3. 
Resolving Client Owner Conflicts 1319 When the server gets an EXCHANGE_ID for a client owner that currently 1320 has no state, or that has state but the lease has expired, the server 1321 MUST allow the EXCHANGE_ID and confirm the new client ID if followed 1322 by the appropriate CREATE_SESSION. 1324 When the server gets an EXCHANGE_ID for a new incarnation of a client 1325 owner that currently has an old incarnation with state and an 1326 unexpired lease, the server is allowed to dispose of the state of the 1327 previous incarnation of the client owner if one of the following is 1328 true: 1330 * The principal that created the client ID for the client owner is 1331 the same as the principal that is sending the EXCHANGE_ID 1332 operation. Note that if the client ID was created with 1333 SP4_MACH_CRED state protection (Section 18.35), the principal MUST 1334 be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used 1335 MUST be integrity or privacy, and the same GSS mechanism and 1336 principal MUST be used as that used when the client ID was 1337 created. 1339 * The client ID was established with SP4_SSV protection 1340 (Section 18.35, Section 2.10.8.3) and the client sends the 1341 EXCHANGE_ID with the security flavor set to RPCSEC_GSS using the 1342 GSS SSV mechanism (Section 2.10.9). 1344 * The client ID was established with SP4_SSV protection, and under 1345 the conditions described herein, the EXCHANGE_ID was sent with 1346 SP4_MACH_CRED state protection. Because the SSV might not persist 1347 across client and server restart, and because the first time a 1348 client sends EXCHANGE_ID to a server it does not have an SSV, the 1349 client MAY send the subsequent EXCHANGE_ID without an SSV 1350 RPCSEC_GSS handle. Instead, as with SP4_MACH_CRED protection, the 1351 principal MUST be based on RPCSEC_GSS authentication, the 1352 RPCSEC_GSS service used MUST be integrity or privacy, and the same 1353 GSS mechanism and principal MUST be used as that used when the 1354 client ID was created. 1356 If none of the above situations apply, the server MUST return 1357 NFS4ERR_CLID_INUSE. 1359 If the server accepts the principal and co_ownerid as matching that 1360 which created the client ID, and the co_verifier in the EXCHANGE_ID 1361 differs from the co_verifier used when the client ID was created, 1362 then after the server receives a CREATE_SESSION that confirms the 1363 client ID, the server deletes state. If the co_verifier values are 1364 the same (e.g., the client either is updating properties of the 1365 client ID (Section 18.35) or is attempting trunking (Section 2.10.5), 1366 the server MUST NOT delete state. 1368 2.5. Server Owners 1370 The server owner is similar to a client owner (Section 2.4), but 1371 unlike the client owner, there is no shorthand server ID. The server 1372 owner is defined in the following data type: 1374 struct server_owner4 { 1375 uint64_t so_minor_id; 1376 opaque so_major_id; 1377 }; 1379 The server owner is returned from EXCHANGE_ID. When the so_major_id 1380 fields are the same in two EXCHANGE_ID results, the connections that 1381 each EXCHANGE_ID were sent over can be assumed to address the same 1382 server (as defined in Section 1.7). If the so_minor_id fields are 1383 also the same, then not only do both connections connect to the same 1384 server, but the session can be shared across both connections. 
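As a non-normative illustration, the comparison just described might be coded as follows; the function names are hypothetical, and, as discussed in Section 2.10.5, the eir_server_scope values returned with the two results must also match before these comparisons can be relied upon.

   from collections import namedtuple

   ServerOwner = namedtuple("ServerOwner", ["so_major_id", "so_minor_id"])

   def same_server(owner_a, owner_b):
       """Both EXCHANGE_ID results appear to come from the same server."""
       return owner_a.so_major_id == owner_b.so_major_id

   def session_may_span(owner_a, owner_b):
       """A single session may be used over both connections."""
       return (same_server(owner_a, owner_b)
               and owner_a.so_minor_id == owner_b.so_minor_id)

   one = ServerOwner(so_major_id=b"cluster-7", so_minor_id=1)
   two = ServerOwner(so_major_id=b"cluster-7", so_minor_id=2)
   print(same_server(one, two))         # True
   print(session_may_span(one, two))    # False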
The 1385 reader is cautioned that multiple servers may deliberately or 1386 accidentally claim to have the same so_major_id or so_major_id/ 1387 so_minor_id; the reader should examine Sections 2.10.5 and 18.35 in 1388 order to avoid acting on falsely matching server owner values. 1390 The considerations for generating an so_major_id are similar to that 1391 for generating a co_ownerid string (see Section 2.4). The 1392 consequences of two servers generating conflicting so_major_id values 1393 are less dire than they are for co_ownerid conflicts because the 1394 client can use RPCSEC_GSS to compare the authenticity of each server 1395 (see Section 2.10.5). 1397 2.6. Security Service Negotiation 1399 With the NFSv4.1 server potentially offering multiple security 1400 mechanisms, the client needs a method to determine or negotiate which 1401 mechanism is to be used for its communication with the server. The 1402 NFS server may have multiple points within its file system namespace 1403 that are available for use by NFS clients. These points can be 1404 considered security policy boundaries, and, in some NFS 1405 implementations, are tied to NFS export points. In turn, the NFS 1406 server may be configured such that each of these security policy 1407 boundaries may have different or multiple security mechanisms in use. 1409 The security negotiation between client and server SHOULD be done 1410 with a secure channel to eliminate the possibility of a third party 1411 intercepting the negotiation sequence and forcing the client and 1412 server to choose a lower level of security than required or desired. 1413 See Section 21 for further discussion. 1415 2.6.1. NFSv4.1 Security Tuples 1417 An NFS server can assign one or more "security tuples" to each 1418 security policy boundary in its namespace. Each security tuple 1419 consists of a security flavor (see Section 2.2.1.1) and, if the 1420 flavor is RPCSEC_GSS, a GSS-API mechanism Object Identifier (OID), a 1421 GSS-API quality of protection, and an RPCSEC_GSS service. 1423 2.6.2. SECINFO and SECINFO_NO_NAME 1425 The SECINFO and SECINFO_NO_NAME operations allow the client to 1426 determine, on a per-filehandle basis, what security tuple is to be 1427 used for server access. In general, the client will not have to use 1428 either operation except during initial communication with the server 1429 or when the client crosses security policy boundaries at the server. 1430 However, the server's policies may also change at any time and force 1431 the client to negotiate a new security tuple. 1433 Where the use of different security tuples would affect the type of 1434 access that would be allowed if a request was sent over the same 1435 connection used for the SECINFO or SECINFO_NO_NAME operation (e.g., 1436 read-only vs. read-write) access, security tuples that allow greater 1437 access should be presented first. Where the general level of access 1438 is the same and different security flavors limit the range of 1439 principals whose privileges are recognized (e.g., allowing or 1440 disallowing root access), flavors supporting the greatest range of 1441 principals should be listed first. 1443 2.6.3. Security Error 1445 Based on the assumption that each NFSv4.1 client and server MUST 1446 support a minimum set of security (i.e., Kerberos V5 under 1447 RPCSEC_GSS), the NFS client will initiate file access to the server 1448 with one of the minimal security tuples. 
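One plausible, non-normative policy for choosing that first security tuple is sketched below; the function name, its parameter, and the preference order are illustrative assumptions made for this example only, not requirements of this specification.

   KRB5_OID = "1.2.840.113554.1.2.2"    # Kerberos V5 GSS-API mechanism

   def initial_security_tuple(client_supports_privacy=False):
       """Pick the Kerberos V5 tuple for the client's first attempt,
       preferring the strongest service the client implements."""
       services = ["rpc_gss_svc_integrity", "rpc_gss_svc_none"]
       if client_supports_privacy:
           services.insert(0, "rpc_gss_svc_privacy")
       # QOP 0 leaves the protection level to the mechanism's defaults.
       return ("RPCSEC_GSS", KRB5_OID, 0, services[0])

   print(initial_security_tuple())
   print(initial_security_tuple(client_supports_privacy=True))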
During communication with 1449 the server, the client may receive an NFS error of NFS4ERR_WRONGSEC. 1450 This error allows the server to notify the client that the security 1451 tuple currently being used contravenes the server's security policy. 1452 The client is then responsible for determining (see Section 2.6.3.1) 1453 what security tuples are available at the server and choosing one 1454 that is appropriate for the client. 1456 2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME 1458 This section explains the mechanics of NFSv4.1 security negotiation. 1460 2.6.3.1.1. Put Filehandle Operations 1462 The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH, 1463 PUTFH, and RESTOREFH. Each of the subsections herein describes how 1464 the server handles a subseries of operations that starts with a put 1465 filehandle operation. 1467 2.6.3.1.1.1. Put Filehandle Operation + SAVEFH 1469 The client is saving a filehandle for a future RESTOREFH, LINK, or 1470 RENAME. SAVEFH MUST NOT return NFS4ERR_WRONGSEC. To determine 1471 whether or not the put filehandle operation returns NFS4ERR_WRONGSEC, 1472 the server implementation pretends SAVEFH is not in the series of 1473 operations and examines which of the situations described in the 1474 other subsections of Section 2.6.3.1.1 apply. 1476 2.6.3.1.1.2. Two or More Put Filehandle Operations 1478 For a series of N put filehandle operations, the server MUST NOT 1479 return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations. 1480 The Nth put filehandle operation is handled as if it is the first in 1481 a subseries of operations. For example, if the server received a 1482 COMPOUND request with this series of operations -- PUTFH, PUTROOTFH, 1483 LOOKUP -- then the PUTFH operation is ignored for NFS4ERR_WRONGSEC 1484 purposes, and the PUTROOTFH, LOOKUP subseries is processed as 1485 according to Section 2.6.3.1.1.3. 1487 2.6.3.1.1.3. Put Filehandle Operation + LOOKUP (or OPEN of an Existing 1488 Name) 1490 This situation also applies to a put filehandle operation followed by 1491 a LOOKUP or an OPEN operation that specifies an existing component 1492 name. 1494 In this situation, the client is potentially crossing a security 1495 policy boundary, and the set of security tuples the parent directory 1496 supports may differ from those of the child. The server 1497 implementation may decide whether to impose any restrictions on 1498 security policy administration. There are at least three approaches 1499 (sec_policy_child is the tuple set of the child export, 1500 sec_policy_parent is that of the parent). 1502 (a) sec_policy_child <= sec_policy_parent (<= for subset). This 1503 means that the set of security tuples specified on the security 1504 policy of a child directory is always a subset of its parent 1505 directory. 1507 (b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection, 1508 {} for the empty set). This means that the set of security 1509 tuples specified on the security policy of a child directory 1510 always has a non-empty intersection with that of the parent. 1512 (c) sec_policy_child ^ sec_policy_parent == {}. This means that the 1513 set of security tuples specified on the security policy of a 1514 child directory may not intersect with that of the parent. In 1515 other words, there are no restrictions on how the system 1516 administrator may set up these tuples. 
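The three relationships above reduce to ordinary set operations, as the following non-normative sketch shows; the function name and the example pseudo flavors are illustrative only.

   def classify_policies(sec_policy_child, sec_policy_parent):
       """Return which of the cases (a), (b), or (c) describes the
       child/parent security policies (illustrative only)."""
       child, parent = set(sec_policy_child), set(sec_policy_parent)
       if child <= parent:
           return "a"          # child is a subset of the parent
       if child & parent:
           return "b"          # intersection is non-empty
       return "c"              # the policies are disjoint

   print(classify_policies({"krb5i", "krb5p"}, {"krb5", "krb5i", "krb5p"}))  # a
   print(classify_policies({"krb5p", "sys"},   {"krb5", "krb5p"}))           # b
   print(classify_policies({"sys"},            {"krb5i"}))                   # c

Note that case (a) is itself a special case of (b); the sketch simply reports the most specific case that applies.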
1518 In order for a server to support approaches (b) (for the case when a 1519 client chooses a flavor that is not a member of sec_policy_parent) 1520 and (c), the put filehandle operation cannot return NFS4ERR_WRONGSEC 1521 when there is a security tuple mismatch. Instead, it should be 1522 returned from the LOOKUP (or OPEN by existing component name) that 1523 follows. 1525 Since the above guideline does not contradict approach (a), it should 1526 be followed in general. Even if approach (a) is implemented, it is 1527 possible for the security tuple used to be acceptable for the target 1528 of LOOKUP but not for the filehandles used in the put filehandle 1529 operation. The put filehandle operation could be a PUTROOTFH or 1530 PUTPUBFH, where the client cannot know the security tuples for the 1531 root or public filehandle. Or the security policy for the filehandle 1532 used by the put filehandle operation could have changed since the 1533 time the filehandle was obtained. 1535 Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in 1536 response to the put filehandle operation if the operation is 1537 immediately followed by a LOOKUP or an OPEN by component name. 1539 2.6.3.1.1.4. Put Filehandle Operation + LOOKUPP 1541 Since SECINFO only works its way down, there is no way LOOKUPP can 1542 return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME 1543 solves this issue via style SECINFO_STYLE4_PARENT, which works in the 1544 opposite direction as SECINFO. As with Section 2.6.3.1.1.3, a put 1545 filehandle operation that is followed by a LOOKUPP MUST NOT return 1546 NFS4ERR_WRONGSEC. If the server does not support SECINFO_NO_NAME, 1547 the client's only recourse is to send the put filehandle operation, 1548 LOOKUPP, GETFH sequence of operations with every security tuple it 1549 supports. 1551 Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server 1552 MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle 1553 operation if the operation is immediately followed by a LOOKUPP. 1555 2.6.3.1.1.5. Put Filehandle Operation + SECINFO/SECINFO_NO_NAME 1557 A security-sensitive client is allowed to choose a strong security 1558 tuple when querying a server to determine a file object's permitted 1559 security tuples. The security tuple chosen by the client does not 1560 have to be included in the tuple list of the security policy of 1561 either the parent directory indicated in the put filehandle operation 1562 or the child file object indicated in SECINFO (or any parent 1563 directory indicated in SECINFO_NO_NAME). Of course, the server has 1564 to be configured for whatever security tuple the client selects; 1565 otherwise, the request will fail at the RPC layer with an appropriate 1566 authentication error. 1568 In theory, there is no connection between the security flavor used by 1569 SECINFO or SECINFO_NO_NAME and those supported by the security 1570 policy. But in practice, the client may start looking for strong 1571 flavors from those supported by the security policy, followed by 1572 those in the REQUIRED set. 1574 The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put 1575 filehandle operation that is immediately followed by SECINFO or 1576 SECINFO_NO_NAME. The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC 1577 from SECINFO or SECINFO_NO_NAME. 1579 2.6.3.1.1.6. Put Filehandle Operation + Nothing 1581 The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC. 1583 2.6.3.1.1.7. 
Put Filehandle Operation + Anything Else 1585 "Anything Else" includes OPEN by filehandle. 1587 The security policy enforcement applies to the filehandle specified 1588 in the put filehandle operation. Therefore, the put filehandle 1589 operation MUST return NFS4ERR_WRONGSEC when there is a security tuple 1590 mismatch. This avoids the complexity of adding NFS4ERR_WRONGSEC as 1591 an allowable error to every other operation. 1593 A COMPOUND containing the series put filehandle operation + 1594 SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way 1595 for the client to recover from NFS4ERR_WRONGSEC. 1597 The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation 1598 other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by 1599 component name). 1601 2.6.3.1.1.8. Operations after SECINFO and SECINFO_NO_NAME 1603 Suppose a client sends a COMPOUND procedure containing the series 1604 SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the security tuple 1605 used does not match that required for the target file. By rule (see 1606 Section 2.6.3.1.1.5), neither PUTFH nor SECINFO_NO_NAME can return 1607 NFS4ERR_WRONGSEC. By rule (see Section 2.6.3.1.1.7), READ cannot 1608 return NFS4ERR_WRONGSEC. The issue is resolved by the fact that 1609 SECINFO and SECINFO_NO_NAME consume the current filehandle (note that 1610 this is a change from NFSv4.0). This leaves no current filehandle 1611 for READ to use, and READ returns NFS4ERR_NOFILEHANDLE. 1613 2.6.3.1.2. LINK and RENAME 1615 The LINK and RENAME operations use both the current and saved 1616 filehandles. Technically, the server MAY return NFS4ERR_WRONGSEC 1617 from LINK or RENAME if the security policy of the saved filehandle 1618 rejects the security flavor used in the COMPOUND request's 1619 credentials. If the server does so, then if there is no intersection 1620 between the security policies of saved and current filehandles, this 1621 means that it will be impossible for the client to perform the 1622 intended LINK or RENAME operation. 1624 For example, suppose the client sends this COMPOUND request: 1625 SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", where 1626 filehandles bFH and aFH refer to different directories. Suppose no 1627 common security tuple exists between the security policies of aFH and 1628 bFH. If the client sends the request using credentials acceptable to 1629 bFH's security policy but not aFH's policy, then the PUTFH aFH 1630 operation will fail with NFS4ERR_WRONGSEC. After a SECINFO_NO_NAME 1631 request, the client sends SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, 1632 RENAME "c" "d", using credentials acceptable to aFH's security policy 1633 but not bFH's policy. The server returns NFS4ERR_WRONGSEC on the 1634 RENAME operation. 1636 To prevent a client from entering an endless sequence of a request containing 1637 LINK or RENAME, followed by a request containing SECINFO_NO_NAME or 1638 SECINFO, the server MUST detect when the security policies of the 1639 current and saved filehandles have no mutually acceptable security 1640 tuple, and MUST NOT return NFS4ERR_WRONGSEC from LINK or RENAME in 1641 that situation. Instead, the server MUST do one of two things: 1643 * The server can return NFS4ERR_XDEV. 1645 * The server can allow the security policy of the current filehandle 1646 to override that of the saved filehandle, and so return NFS4_OK. 1648 2.7.
Minor Versioning 1650 To address the requirement of an NFS protocol that can evolve as the 1651 need arises, the NFSv4.1 protocol contains the rules and framework to 1652 allow for future minor changes or versioning. 1654 The base assumption with respect to minor versioning is that any 1655 future accepted minor version will be documented in one or more 1656 Standards Track RFCs. Minor version 0 of the NFSv4 protocol is 1657 represented by [37], and minor version 1 is represented by this RFC. 1658 The COMPOUND and CB_COMPOUND procedures support the encoding of the 1659 minor version being requested by the client. 1661 The following items represent the basic rules for the development of 1662 minor versions. Note that a future minor version may modify or add 1663 to the following rules as part of the minor version definition. 1665 1. Procedures are not added or deleted. 1667 To maintain the general RPC model, NFSv4 minor versions will not 1668 add to or delete procedures from the NFS program. 1670 2. Minor versions may add operations to the COMPOUND and 1671 CB_COMPOUND procedures. 1673 The addition of operations to the COMPOUND and CB_COMPOUND 1674 procedures does not affect the RPC model. 1676 * Minor versions may append attributes to the bitmap4 that 1677 represents sets of attributes and to the fattr4 that 1678 represents sets of attribute values. 1680 This allows for the expansion of the attribute model to allow 1681 for future growth or adaptation. 1683 * Minor version X must append any new attributes after the last 1684 documented attribute. 1686 Since attribute results are specified as an opaque array of 1687 per-attribute, XDR-encoded results, the complexity of adding 1688 new attributes in the midst of the current definitions would 1689 be too burdensome. 1691 3. Minor versions must not modify the structure of an existing 1692 operation's arguments or results. 1694 Again, the complexity of handling multiple structure definitions 1695 for a single operation is too burdensome. New operations should 1696 be added instead of modifying existing structures for a minor 1697 version. 1699 This rule does not preclude the following adaptations in a minor 1700 version: 1702 * adding bits to flag fields, such as new attributes to 1703 GETATTR's bitmap4 data type, and providing corresponding 1704 variants of opaque arrays, such as a notify4 used together 1705 with such bitmaps 1707 * adding bits to existing attributes like ACLs that have flag 1708 words 1710 * extending enumerated types (including NFS4ERR_*) with new 1711 values 1713 * adding cases to a switched union 1715 4. Minor versions must not modify the structure of existing 1716 attributes. 1718 5. Minor versions must not delete operations. 1720 This prevents the potential reuse of a particular operation 1721 "slot" in a future minor version. 1723 6. Minor versions must not delete attributes. 1725 7. Minor versions must not delete flag bits or enumeration values. 1727 8. Minor versions may declare an operation MUST NOT be implemented. 1729 Specifying that an operation MUST NOT be implemented is 1730 equivalent to obsoleting an operation. For the client, it means 1731 that the operation MUST NOT be sent to the server. For the 1732 server, an NFS error can be returned as opposed to "dropping" 1733 the request as an XDR decode error. This approach allows for 1734 the obsolescence of an operation while maintaining its structure 1735 so that a future minor version can reintroduce the operation. 1737 1. 
Minor versions may declare that an attribute MUST NOT be 1738 implemented. 1740 2. Minor versions may declare that a flag bit or enumeration 1741 value MUST NOT be implemented. 1743 9. Minor versions may downgrade features from REQUIRED to 1744 RECOMMENDED, or RECOMMENDED to OPTIONAL. 1746 10. Minor versions may upgrade features from OPTIONAL to 1747 RECOMMENDED, or RECOMMENDED to REQUIRED. 1749 11. A client and server that support minor version X SHOULD support 1750 minor versions zero through X-1 as well. 1752 12. Except for infrastructural changes, a minor version must not 1753 introduce REQUIRED new features. 1755 This rule allows for the introduction of new functionality and 1756 forces the use of implementation experience before designating a 1757 feature as REQUIRED. On the other hand, some classes of 1758 features are infrastructural and have broad effects. Allowing 1759 infrastructural features to be RECOMMENDED or OPTIONAL 1760 complicates implementation of the minor version. 1762 13. A client MUST NOT attempt to use a stateid, filehandle, or 1763 similar returned object from the COMPOUND procedure with minor 1764 version X for another COMPOUND procedure with minor version Y, 1765 where X != Y. 1767 2.8. Non-RPC-Based Security Services 1769 As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for 1770 identification, authentication, integrity, and privacy. NFSv4.1 1771 itself provides or enables additional security services as described 1772 in the next several subsections. 1774 2.8.1. Authorization 1776 Authorization to access a file object via an NFSv4.1 operation is 1777 ultimately determined by the NFSv4.1 server. A client can 1778 predetermine its access to a file object via the OPEN (Section 18.16) 1779 and the ACCESS (Section 18.1) operations. 1781 Principals with appropriate access rights can modify the 1782 authorization on a file object via the SETATTR (Section 18.30) 1783 operation. Attributes that affect access rights include mode, owner, 1784 owner_group, acl, dacl, and sacl. See Section 5. 1786 2.8.2. Auditing 1788 NFSv4.1 provides auditing on a per-file object basis, via the acl and 1789 sacl attributes as described in Section 6. It is outside the scope 1790 of this specification to specify audit log formats or management 1791 policies. 1793 2.8.3. Intrusion Detection 1795 NFSv4.1 provides alarm control on a per-file object basis, via the 1796 acl and sacl attributes as described in Section 6. Alarms may serve 1797 as the basis for intrusion detection. It is outside the scope of 1798 this specification to specify heuristics for detecting intrusion via 1799 alarms. 1801 2.9. Transport Layers 1803 2.9.1. REQUIRED and RECOMMENDED Properties of Transports 1805 NFSv4.1 works over Remote Direct Memory Access (RDMA) and non-RDMA- 1806 based transports with the following attributes: 1808 * The transport supports reliable delivery of data, which NFSv4.1 1809 requires but neither NFSv4.1 nor RPC has facilities for ensuring 1810 [41]. 1812 * The transport delivers data in the order it was sent. Ordered 1813 delivery simplifies detection of transmit errors, and simplifies 1814 the sending of arbitrary sized requests and responses via the 1815 record marking protocol [3]. 1817 Where an NFSv4.1 implementation supports operation over the IP 1818 network protocol, any transport used between NFS and IP MUST be among 1819 the IETF-approved congestion control transport protocols. 
At the 1820 time this document was written, the only two transports that had the 1821 above attributes were TCP and the Stream Control Transmission 1822 Protocol (SCTP). To enhance the possibilities for interoperability, 1823 an NFSv4.1 implementation MUST support operation over the TCP 1824 transport protocol. 1826 Even if NFSv4.1 is used over a non-IP network protocol, it is 1827 RECOMMENDED that the transport support congestion control. 1829 It is permissible for a connectionless transport to be used under 1830 NFSv4.1; however, reliable and in-order delivery of data combined 1831 with congestion control by the connectionless transport is REQUIRED. 1832 As a consequence, UDP by itself MUST NOT be used as an NFSv4.1 1833 transport. NFSv4.1 assumes that a client transport address and 1834 server transport address used to send data over a transport together 1835 constitute a connection, even if the underlying transport eschews the 1836 concept of a connection. 1838 2.9.2. Client and Server Transport Behavior 1840 If a connection-oriented transport (e.g., TCP) is used, the client 1841 and server SHOULD use long-lived connections for at least three 1842 reasons: 1844 1. This will prevent the weakening of the transport's congestion 1845 control mechanisms via short-lived connections. 1847 2. This will improve performance for the WAN environment by 1848 eliminating the need for connection setup handshakes. 1850 3. The NFSv4.1 callback model differs from NFSv4.0, and requires the 1851 client and server to maintain a client-created backchannel (see 1852 Section 2.10.3.1) for the server to use. 1854 In order to reduce congestion, if a connection-oriented transport is 1855 used, and the request is not the NULL procedure: 1857 * A requester MUST NOT retry a request unless the connection the 1858 request was sent over was lost before the reply was received. 1860 * A replier MUST NOT silently drop a request, even if the request is 1861 a retry. (The silent drop behavior of RPCSEC_GSS [4] does not 1862 apply because this behavior happens at the RPCSEC_GSS layer, a 1863 lower layer in the request processing.) Instead, the replier 1864 SHOULD return an appropriate error (see Section 2.10.6.1), or it 1865 MAY disconnect the connection. 1867 When sending a reply, the replier MUST send the reply to the same 1868 full network address (e.g., if using an IP-based transport, the 1869 source port of the requester is part of the full network address) 1870 from which the requester sent the request. If using a connection- 1871 oriented transport, replies MUST be sent on the same connection from 1872 which the request was received. 1874 If a connection is dropped after the replier receives the request but 1875 before the replier sends the reply, the replier might have a pending 1876 reply. If a connection is established with the same source and 1877 destination full network address as the dropped connection, then the 1878 replier MUST NOT send the reply until the requester retries the 1879 request. The reason for this prohibition is that the requester MAY 1880 retry a request over a different connection (provided that connection 1881 is associated with the original request's session). 1883 When using RDMA transports, there are other reasons for not 1884 tolerating retries over the same connection: 1886 * RDMA transports use "credits" to enforce flow control, where a 1887 credit is a right to a peer to transmit a message. 
If one peer 1888 were to retransmit a request (or reply), it would consume an 1889 additional credit. If the replier retransmitted a reply, it would 1890 certainly result in an RDMA connection loss, since the requester 1891 would typically only post a single receive buffer for each 1892 request. If the requester retransmitted a request, the additional 1893 credit consumed on the server might lead to RDMA connection 1894 failure unless the client accounted for it and decreased its 1895 available credit, leading to wasted resources. 1897 * RDMA credits present a new issue to the reply cache in NFSv4.1. 1898 The reply cache may be used when a connection within a session is 1899 lost, such as after the client reconnects. Credit information is 1900 a dynamic property of the RDMA connection, and stale values must 1901 not be replayed from the cache. This implies that the reply cache 1902 contents must not be blindly used when replies are sent from it, 1903 and credit information appropriate to the channel must be 1904 refreshed by the RPC layer. 1906 In addition, as described in Section 2.10.6.2, while a session is 1907 active, the NFSv4.1 requester MUST NOT stop waiting for a reply. 1909 2.9.3. Ports 1911 Historically, NFSv3 servers have listened over TCP port 2049. The 1912 registered port 2049 [42] for the NFS protocol should be the default 1913 configuration. NFSv4.1 clients SHOULD NOT use the RPC binding 1914 protocols as described in [43]. 1916 2.10. Session 1918 NFSv4.1 clients and servers MUST support and MUST use the session 1919 feature as described in this section. 1921 2.10.1. Motivation and Overview 1923 Previous versions and minor versions of NFS have suffered from the 1924 following: 1926 * Lack of support for Exactly Once Semantics (EOS). This includes 1927 lack of support for EOS through server failure and recovery. 1929 * Limited callback support, including no support for sending 1930 callbacks through firewalls, and races between replies to normal 1931 requests and callbacks. 1933 * Limited trunking over multiple network paths. 1935 * Requiring machine credentials for fully secure operation. 1937 Through the introduction of a session, NFSv4.1 addresses the above 1938 shortfalls with practical solutions: 1940 * EOS is enabled by a reply cache with a bounded size, making it 1941 feasible to keep the cache in persistent storage and enable EOS 1942 through server failure and recovery. One reason that previous 1943 revisions of NFS did not support EOS was because some EOS 1944 approaches often limited parallelism. As will be explained in 1945 Section 2.10.6, NFSv4.1 supports both EOS and unlimited 1946 parallelism. 1948 * The NFSv4.1 client (defined in Section 1.7) creates transport 1949 connections and provides them to the server to use for sending 1950 callback requests, thus solving the firewall issue 1951 (Section 18.34). Races between responses from client requests and 1952 callbacks caused by the requests are detected via the session's 1953 sequencing properties that are a consequence of EOS 1954 (Section 2.10.6.3). 1956 * The NFSv4.1 client can associate an arbitrary number of 1957 connections with the session, and thus provide trunking 1958 (Section 2.10.5). 1960 * The NFSv4.1 client and server produce a session key independent of 1961 client and server machine credentials which can be used to compute 1962 a digest for protecting critical session management operations 1963 (Section 2.10.8.3). 
1965 * The NFSv4.1 client can also create secure RPCSEC_GSS contexts for 1966 use by the session's backchannel that do not require the server to 1967 authenticate to a client machine principal (Section 2.10.8.2). 1969 A session is a dynamically created, long-lived server object created 1970 by a client and used over time from one or more transport 1971 connections. Its function is to maintain the server's state relative 1972 to the connection(s) belonging to a client instance. This state is 1973 entirely independent of the connection itself, and indeed the state 1974 exists whether or not the connection exists. A client may have one 1975 or more sessions associated with it so that client-associated state 1976 may be accessed using any of the sessions associated with that 1977 client's client ID, when connections are associated with those 1978 sessions. When no connections are associated with any of a client 1979 ID's sessions for an extended time, such objects as locks, opens, 1980 delegations, layouts, etc. are subject to expiration. The session 1981 serves as an object representing a means of access by a client to the 1982 associated client state on the server, independent of the physical 1983 means of access to that state. 1985 A single client may create multiple sessions. A single session MUST 1986 NOT serve multiple clients. 1988 2.10.2. NFSv4 Integration 1990 Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major 1991 infrastructure change such as sessions would require a new major 1992 version number to an Open Network Computing (ONC) RPC program like 1993 NFS. However, because NFSv4 encapsulates its functionality in a 1994 single procedure, COMPOUND, and because COMPOUND can support an 1995 arbitrary number of operations, sessions have been added to NFSv4.1 1996 with little difficulty. COMPOUND includes a minor version number 1997 field, and for NFSv4.1 this minor version is set to 1. When the 1998 NFSv4 server processes a COMPOUND with the minor version set to 1, it 1999 expects a different set of operations than it does for NFSv4.0. 2000 NFSv4.1 defines the SEQUENCE operation, which is required for every 2001 COMPOUND that operates over an established session, with the 2002 exception of some session administration operations, such as 2003 DESTROY_SESSION (Section 18.37). 2005 2.10.2.1. SEQUENCE and CB_SEQUENCE 2007 In NFSv4.1, when the SEQUENCE operation is present, it MUST be the 2008 first operation in the COMPOUND procedure. The primary purpose of 2009 SEQUENCE is to carry the session identifier. The session identifier 2010 associates all other operations in the COMPOUND procedure with a 2011 particular session. SEQUENCE also contains required information for 2012 maintaining EOS (see Section 2.10.6). Session-enabled NFSv4.1 2013 COMPOUND requests thus have the form: 2015 +-----+--------------+-----------+------------+-----------+---- 2016 | tag | minorversion | numops |SEQUENCE op | op + args | ... 2017 | | (== 1) | (limited) | + args | | 2018 +-----+--------------+-----------+------------+-----------+---- 2020 and the replies have the form: 2022 +------------+-----+--------+-------------------------------+--// 2023 |last status | tag | numres |status + SEQUENCE op + results | // 2024 +------------+-----+--------+-------------------------------+--// 2025 //-----------------------+---- 2026 // status + op + results | ... 
2027 //-----------------------+---- 2029 A CB_COMPOUND procedure request and reply has a similar form to 2030 COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE 2031 operation. CB_COMPOUND also has an additional field called 2032 "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored 2033 by the client. CB_SEQUENCE has the same information as SEQUENCE, and 2034 also includes other information needed to resolve callback races 2035 (Section 2.10.6.3). 2037 2.10.2.2. Client ID and Session Association 2039 Each client ID (Section 2.4) can have zero or more active sessions. 2040 A client ID and associated session are required to perform file 2041 access in NFSv4.1. Each time a session is used (whether by a client 2042 sending a request to the server or the client replying to a callback 2043 request from the server), the state leased to its associated client 2044 ID is automatically renewed. 2046 State (which can consist of share reservations, locks, delegations, 2047 and layouts (Section 1.8.4)) is tied to the client ID. Client state 2048 is not tied to any individual session. Successive state changing 2049 operations from a given state owner MAY go over different sessions, 2050 provided the session is associated with the same client ID. A 2051 callback MAY arrive over a different session than that of the request 2052 that originally acquired the state pertaining to the callback. For 2053 example, if session A is used to acquire a delegation, a request to 2054 recall the delegation MAY arrive over session B if both sessions are 2055 associated with the same client ID. Sections 2.10.8.1 and 2.10.8.2 2056 discuss the security considerations around callbacks. 2058 2.10.3. Channels 2060 A channel is not a connection. A channel represents the direction 2061 ONC RPC requests are sent. 2063 Each session has one or two channels: the fore channel and the 2064 backchannel. Because there are at most two channels per session, and 2065 because each channel has a distinct purpose, channels are not 2066 assigned identifiers. 2068 The fore channel is used for ordinary requests from the client to the 2069 server, and carries COMPOUND requests and responses. A session 2070 always has a fore channel. 2072 The backchannel is used for callback requests from server to client, 2073 and carries CB_COMPOUND requests and responses. Whether or not there 2074 is a backchannel is decided by the client; however, many features of 2075 NFSv4.1 require a backchannel. NFSv4.1 servers MUST support 2076 backchannels. 2078 Each session has resources for each channel, including separate reply 2079 caches (see Section 2.10.6.1). Note that even the backchannel 2080 requires a reply cache (or, at least, a slot table in order to detect 2081 retries) because some callback operations are non-idempotent. 2083 2.10.3.1. Association of Connections, Channels, and Sessions 2085 Each channel is associated with zero or more transport connections 2086 (whether of the same transport protocol or different transport 2087 protocols). A connection can be associated with one channel or both 2088 channels of a session; the client and server negotiate whether a 2089 connection will carry traffic for one channel or both channels via 2090 the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION 2091 (Section 18.34) operations. 
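The association model can be pictured with the following non-normative sketch; the ToySession class and its bind method are illustrative stand-ins for the bookkeeping a real implementation might perform, not protocol elements.

   class ToySession:
       """Tracks which connections feed each channel of one session."""

       def __init__(self, session_id):
           self.session_id = session_id
           self.fore = set()        # connections carrying COMPOUND
           self.back = set()        # connections carrying CB_COMPOUND

       def bind(self, conn, fore=True, back=False):
           # Loosely mirrors the CREATE_SESSION / BIND_CONN_TO_SESSION
           # negotiation: a connection may serve one channel or both.
           if fore:
               self.fore.add(conn)
           if back:
               self.back.add(conn)

   s = ToySession("sess-1")
   s.bind("tcp-conn-1", fore=True, back=True)   # connection used at creation
   s.bind("rdma-conn-2", fore=True)             # fore channel only
   print(sorted(s.fore), sorted(s.back))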
When a session is created via 2092 CREATE_SESSION, the connection that transported the CREATE_SESSION 2093 request is automatically associated with the fore channel, and 2094 optionally the backchannel. If the client specifies no state 2095 protection (Section 18.35) when the session is created, then when 2096 SEQUENCE is transmitted on a different connection, the connection is 2097 automatically associated with the fore channel of the session 2098 specified in the SEQUENCE operation. 2100 A connection's association with a session is not exclusive. A 2101 connection associated with the channel(s) of one session may be 2102 simultaneously associated with the channel(s) of other sessions 2103 including sessions associated with other client IDs. 2105 It is permissible for connections of multiple transport types to be 2106 associated with the same channel. For example, both TCP and RDMA 2107 connections can be associated with the fore channel. In the event an 2108 RDMA and non-RDMA connection are associated with the same channel, 2109 the maximum number of slots SHOULD be at least one more than the 2110 total number of RDMA credits (Section 2.10.6.1). This way, if all 2111 RDMA credits are used, the non-RDMA connection can have at least one 2112 outstanding request. If a server supports multiple transport types, 2113 it MUST allow a client to associate connections from each transport 2114 to a channel. 2116 It is permissible for a connection of one type of transport to be 2117 associated with the fore channel, and a connection of a different 2118 type to be associated with the backchannel. 2120 2.10.4. Server Scope 2122 Servers each specify a server scope value in the form of an opaque 2123 string eir_server_scope returned as part of the results of an 2124 EXCHANGE_ID operation. The purpose of the server scope is to allow a 2125 group of servers to indicate to clients that a set of servers sharing 2126 the same server scope value has arranged to use distinct values of 2127 opaque identifiers so that the two servers never assign the same 2128 value to two distinct objects. Thus, the identifiers generated by 2129 two servers within that set can be assumed compatible so that, in 2130 certain important cases, identifiers generated by one server in that 2131 set may be presented to another server of the same scope. 2133 The use of such compatible values does not imply that a value 2134 generated by one server will always be accepted by another. In most 2135 cases, it will not. However, a server will not inadvertently accept 2136 a value generated by another server. When it does accept it, it will 2137 be because it is recognized as valid and carrying the same meaning as 2138 on another server of the same scope. 2140 When servers are of the same server scope, this compatibility of 2141 values applies to the following identifiers: 2143 * Filehandle values. A filehandle value accepted by two servers of 2144 the same server scope denotes the same object. A WRITE operation 2145 sent to one server is reflected immediately in a READ sent to the 2146 other. 2148 * Server owner values. When the server scope values are the same, 2149 server owner value may be validly compared. In cases where the 2150 server scope values are different, server owner values are treated 2151 as different even if they contain identical strings of bytes. 2153 The coordination among servers required to provide such compatibility 2154 can be quite minimal, and limited to a simple partition of the ID 2155 space. 
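As a purely illustrative example of such a partition (not a recommendation), two servers sharing a scope could interleave the identifier values they hand out, as in the following sketch; the class name and the striding scheme are assumptions made only for this example.

   class ToyScopedServer:
       """One member of a server group sharing a single server scope."""

       def __init__(self, scope, member, group_size):
           self.scope = scope
           self._next = member          # this member's slice of the space
           self._stride = group_size    # keeps the slices disjoint

       def new_object_id(self):
           oid = self._next
           self._next += self._stride
           return oid

   s1 = ToyScopedServer(b"cluster-7", member=0, group_size=2)
   s2 = ToyScopedServer(b"cluster-7", member=1, group_size=2)
   ids = [s1.new_object_id() for _ in range(4)] + \
         [s2.new_object_id() for _ in range(4)]
   assert len(set(ids)) == len(ids)   # no value issued twice within the scope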
The recognition of common values requires additional 2156 implementation, but this can be tailored to the specific situations 2157 in which that recognition is desired. 2159 Clients will have occasion to compare the server scope values of 2160 multiple servers under a number of circumstances, each of which will 2161 be discussed under the appropriate functional section: 2163 * When server owner values received in response to EXCHANGE_ID 2164 operations sent to multiple network addresses are compared for the 2165 purpose of determining the validity of various forms of trunking, 2166 as described in Section 11.5.2. 2168 * When network or server reconfiguration causes the same network 2169 address to possibly be directed to different servers, with the 2170 necessity for the client to determine when lock reclaim should be 2171 attempted, as described in Section 8.4.2.1. 2173 When two replies from EXCHANGE_ID, each from two different server 2174 network addresses, have the same server scope, there are a number of 2175 ways a client can validate that the common server scope is due to two 2176 servers cooperating in a group. 2178 * If both EXCHANGE_ID requests were sent with RPCSEC_GSS ([4], [9], 2179 [27]) authentication and the server principal is the same for both 2180 targets, the equality of server scope is validated. It is 2181 RECOMMENDED that two servers intending to share the same server 2182 scope and server_owner major_id also share the same principal 2183 name. In some cases, this simplifies the client's task of 2184 validating server scope. 2186 * The client may accept the appearance of the second server in the 2187 fs_locations or fs_locations_info attribute for a relevant file 2188 system. For example, if there is a migration event for a 2189 particular file system or there are locks to be reclaimed on a 2190 particular file system, the attributes for that particular file 2191 system may be used. The client sends the GETATTR request to the 2192 first server for the fs_locations or fs_locations_info attribute 2193 with RPCSEC_GSS authentication. It may need to do this in advance 2194 of the need to verify the common server scope. If the client 2195 successfully authenticates the reply to GETATTR, and the GETATTR 2196 request and reply containing the fs_locations or fs_locations_info 2197 attribute refers to the second server, then the equality of server 2198 scope is supported. A client may choose to limit the use of this 2199 form of support to information relevant to the specific file 2200 system involved (e.g. a file system being migrated). 2202 2.10.5. Trunking 2204 Trunking is the use of multiple connections between a client and 2205 server in order to increase the speed of data transfer. NFSv4.1 2206 supports two types of trunking: session trunking and client ID 2207 trunking. 2209 In the context of a single server network address, it can be assumed 2210 that all connections are accessing the same server, and NFSv4.1 2211 servers MUST support both forms of trunking. When multiple 2212 connections use a set of network addresses to access the same server, 2213 the server MUST support both forms of trunking. NFSv4.1 servers in a 2214 clustered configuration MAY allow network addresses for different 2215 servers to use client ID trunking. 2217 Clients may use either form of trunking as long as they do not, when 2218 trunking between different server network addresses, violate the 2219 servers' mandates as to the kinds of trunking to be allowed (see 2220 below). 
With regard to callback channels, the client MUST allow the 2221 server to choose among all callback channels valid for a given client 2222 ID and MUST support trunking when the connections supporting the 2223 backchannel allow session or client ID trunking to be used for 2224 callbacks. 2226 Session trunking is essentially the association of multiple 2227 connections, each with potentially different target and/or source 2228 network addresses, to the same session. When the target network 2229 addresses (server addresses) of the two connections are the same, the 2230 server MUST support such session trunking. When the target network 2231 addresses are different, the server MAY indicate such support using 2232 the data returned by the EXCHANGE_ID operation (see below). 2234 Client ID trunking is the association of multiple sessions to the 2235 same client ID. Servers MUST support client ID trunking for two 2236 target network addresses whenever they allow session trunking for 2237 those same two network addresses. In addition, a server MAY, by 2238 presenting the same major server owner ID (Section 2.5) and server 2239 scope (Section 2.10.4), allow an additional case of client ID 2240 trunking. When two servers return the same major server owner and 2241 server scope, it means that the two servers are cooperating on 2242 locking state management, which is a prerequisite for client ID 2243 trunking. 2245 Distinguishing when the client is allowed to use session and client 2246 ID trunking requires understanding how the results of the EXCHANGE_ID 2247 (Section 18.35) operation identify a server. Suppose a client sends 2248 EXCHANGE_IDs over two different connections, each with a possibly 2249 different target network address, but each EXCHANGE_ID operation has 2250 the same value in the eia_clientowner field. If the same NFSv4.1 2251 server is listening over each connection, then each EXCHANGE_ID 2252 result MUST return the same values of eir_clientid, 2253 eir_server_owner.so_major_id, and eir_server_scope. The client can 2254 then treat each connection as referring to the same server (subject 2255 to verification; see Section 2.10.5.1 below), and it can use each 2256 connection to trunk requests and replies. The client's choice is 2257 whether session trunking or client ID trunking applies. 2259 Session Trunking. If the eia_clientowner argument is the same in two 2260 different EXCHANGE_ID requests, and the eir_clientid, 2261 eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and 2262 eir_server_scope results match in both EXCHANGE_ID results, then 2263 the client is permitted to perform session trunking. If the 2264 client has no session mapping to the tuple of eir_clientid, 2265 eir_server_owner.so_major_id, eir_server_scope, and 2266 eir_server_owner.so_minor_id, then it creates the session via a 2267 CREATE_SESSION operation over one of the connections, which 2268 associates the connection to the session. If there is a session 2269 for the tuple, the client can send BIND_CONN_TO_SESSION to 2270 associate the connection to the session. 2272 Of course, if the client does not desire to use session trunking, 2273 it is not required to do so. It can invoke CREATE_SESSION on the 2274 connection. This will result in client ID trunking as described 2275 below. It can also decide to drop the connection if it does not 2276 choose to use trunking. 2278 Client ID Trunking. 
If the eia_clientowner argument is the same in 2279 two different EXCHANGE_ID requests, and the eir_clientid, 2280 eir_server_owner.so_major_id, and eir_server_scope results match 2281 in both EXCHANGE_ID results, then the client is permitted to 2282 perform client ID trunking (regardless of whether the 2283 eir_server_owner.so_minor_id results match). The client can 2284 associate each connection with different sessions, where each 2285 session is associated with the same server. 2287 The client completes the act of client ID trunking by invoking 2288 CREATE_SESSION on each connection, using the same client ID that 2289 was returned in eir_clientid. These invocations create two 2290 sessions and also associate each connection with its respective 2291 session. The client is free to decline to use client ID trunking 2292 by simply dropping the connection at this point. 2294 When doing client ID trunking, locking state is shared across 2295 sessions associated with that same client ID. This requires the 2296 server to coordinate state across sessions and the client to be 2297 able to associate the same locking state with multiple sessions. 2299 It is always possible that, as a result of various sorts of 2300 reconfiguration events, eir_server_scope and eir_server_owner values 2301 may be different on subsequent EXCHANGE_ID requests made to the same 2302 network address. 2304 In most cases, such reconfiguration events will be disruptive and 2305 indicate that an IP address formerly connected to one server is now 2306 connected to an entirely different one. 2308 Some guidelines on client handling of such situations follow: 2310 * When eir_server_scope changes, the client has no assurance that 2311 any IDs that it obtained previously (e.g., filehandles) can be 2312 validly used on the new server, and, even if the new server 2313 accepts them, there is no assurance that this is not due to 2314 accident. Thus, it is best to treat all such state as lost or 2315 stale, although a client may assume that the probability of 2316 inadvertent acceptance is low and treat this situation as within 2317 the next case. 2319 * When eir_server_scope remains the same and 2320 eir_server_owner.so_major_id changes, the client can use the 2321 filehandles it has, consider its locking state lost, and attempt 2322 to reclaim or otherwise re-obtain its locks. It might find that 2323 its filehandle is now stale. However, if NFS4ERR_STALE is not 2324 returned, it can proceed to reclaim or otherwise re-obtain its 2325 open locking state. 2327 * When eir_server_scope and eir_server_owner.so_major_id remain the 2328 same, the client has to use the now-current values of 2329 eir_server_owner.so_minor_id in deciding on appropriate forms of 2330 trunking. This may result in connections being dropped or new 2331 sessions being created. 2333 2.10.5.1. Verifying Claims of Matching Server Identity 2335 When the server responds using two different connections that claim 2336 matching or partially matching eir_server_owner, eir_server_scope, 2337 and eir_clientid values, the client does not have to trust the 2338 servers' claims. The client may verify these claims before trunking 2339 traffic in the following ways: 2341 * For session trunking, clients SHOULD reliably verify if 2342 connections between different network paths are in fact associated 2343 with the same NFSv4.1 server and usable on the same session, and 2344 servers MUST allow clients to perform reliable verification. 
When 2345 a client ID is created, the client SHOULD specify that 2346 BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or 2347 SP4_MACH_CRED (Section 18.35) state protection options. For 2348 SP4_SSV, reliable verification depends on a shared secret (the 2349 SSV) that is established via the SET_SSV (see Section 18.47) 2350 operation. 2352 When a new connection is associated with the session (via the 2353 BIND_CONN_TO_SESSION operation, see Section 18.34), if the client 2354 specified SP4_SSV state protection for the BIND_CONN_TO_SESSION 2355 operation, the client MUST send the BIND_CONN_TO_SESSION with 2356 RPCSEC_GSS protection, using integrity or privacy, and an 2357 RPCSEC_GSS handle created with the GSS SSV mechanism (see 2358 Section 2.10.9). 2360 If the client mistakenly tries to associate a connection to a 2361 session of a wrong server, the server will either reject the 2362 attempt because it is not aware of the session identifier of the 2363 BIND_CONN_TO_SESSION arguments, or it will reject the attempt 2364 because the RPCSEC_GSS authentication fails. Even if the server 2365 mistakenly or maliciously accepts the connection association 2366 attempt, the RPCSEC_GSS verifier it computes in the response will 2367 not be verified by the client, so the client will know it cannot 2368 use the connection for trunking the specified session. 2370 If the client specified SP4_MACH_CRED state protection, the 2371 BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or 2372 privacy, using the same credential that was used when the client 2373 ID was created. Mutual authentication via RPCSEC_GSS assures the 2374 client that the connection is associated with the correct session 2375 of the correct server. 2377 * For client ID trunking, the client has at least two options for 2378 verifying that the same client ID obtained from two different 2379 EXCHANGE_ID operations came from the same server. The first 2380 option is to use RPCSEC_GSS authentication when sending each 2381 EXCHANGE_ID operation. Each time an EXCHANGE_ID is sent with 2382 RPCSEC_GSS authentication, the client notes the principal name of 2383 the GSS target. If the EXCHANGE_ID results indicate that client 2384 ID trunking is possible, and the GSS targets' principal names are 2385 the same, the servers are the same and client ID trunking is 2386 allowed. 2388 The second option for verification is to use SP4_SSV protection. 2389 When the client sends EXCHANGE_ID, it specifies SP4_SSV 2390 protection. The first EXCHANGE_ID the client sends always has to 2391 be confirmed by a CREATE_SESSION call. The client then sends 2392 SET_SSV. Later, the client sends EXCHANGE_ID to a second 2393 destination network address different from the one the first 2394 EXCHANGE_ID was sent to. The client checks that each EXCHANGE_ID 2395 reply has the same eir_clientid, eir_server_owner.so_major_id, and 2396 eir_server_scope. If so, the client verifies the claim by sending 2397 a CREATE_SESSION operation to the second destination address, 2398 protected with RPCSEC_GSS integrity using an RPCSEC_GSS handle 2399 returned by the second EXCHANGE_ID. If the server accepts the 2400 CREATE_SESSION request, and if the client verifies the RPCSEC_GSS 2401 verifier and integrity codes, then the client has proof the second 2402 server knows the SSV, and thus the two servers are cooperating for 2403 the purposes of specifying server scope and client ID trunking. 2405 2.10.6. 
Exactly Once Semantics 2407 Via the session, NFSv4.1 offers exactly once semantics (EOS) for 2408 requests sent over a channel. EOS is supported on both the fore 2409 channel and backchannel. 2411 Each COMPOUND or CB_COMPOUND request that is sent with a leading 2412 SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver 2413 exactly once. This requirement holds regardless of whether the 2414 request is sent with reply caching specified (see 2415 Section 2.10.6.1.3). The requirement holds even if the requester is 2416 sending the request over a session created between a pNFS data client 2417 and pNFS data server. To understand the rationale for this 2418 requirement, divide the requests into three classifications: 2420 * Non-idempotent requests. 2422 * Idempotent modifying requests. 2424 * Idempotent non-modifying requests. 2426 An example of a non-idempotent request is RENAME. Obviously, if a 2427 replier executes the same RENAME request twice, and the first 2428 execution succeeds, the re-execution will fail. If the replier 2429 returns the result from the re-execution, this result is incorrect. 2430 Therefore, EOS is required for non-idempotent requests. 2432 An example of an idempotent modifying request is a COMPOUND request 2433 containing a WRITE operation. Repeated execution of the same WRITE 2434 has the same effect as execution of that WRITE a single time. 2435 Nevertheless, enforcing EOS for WRITEs and other idempotent modifying 2436 requests is necessary to avoid data corruption. 2438 Suppose a client sends WRITE A to a noncompliant server that does not 2439 enforce EOS, and receives no response, perhaps due to a network 2440 partition. The client reconnects to the server and re-sends WRITE A. 2441 Now, the server has outstanding two instances of A. The server can 2442 be in a situation in which it executes and replies to the retry of A, 2443 while the first A is still waiting in the server's internal I/O 2444 system for some resource. Upon receiving the reply to the second 2445 attempt of WRITE A, the client believes its WRITE is done so it is 2446 free to send WRITE B, which overlaps the byte-range of A. When the 2447 original A is dispatched from the server's I/O system and executed 2448 (thus the second time A will have been written), then what has been 2449 written by B can be overwritten and thus corrupted. 2451 An example of an idempotent non-modifying request is a COMPOUND 2452 containing SEQUENCE, PUTFH, READLINK, and nothing else. The re- 2453 execution of such a request will not cause data corruption or produce 2454 an incorrect result. Nonetheless, to keep the implementation simple, 2455 the replier MUST enforce EOS for all requests, whether or not 2456 idempotent and non-modifying. 2458 Note that true and complete EOS is not possible unless the server 2459 persists the reply cache in stable storage, and unless the server is 2460 somehow implemented to never require a restart (indeed, if such a 2461 server exists, the distinction between a reply cache kept in stable 2462 storage versus one that is not is one without meaning). See 2463 Section 2.10.6.5 for a discussion of persistence in the reply cache. 2464 Regardless, even if the server does not persist the reply cache, EOS 2465 improves robustness and correctness over previous versions of NFS 2466 because the legacy duplicate request/reply caches were based on the 2467 ONC RPC transaction identifier (XID). 
Section 2.10.6.1 explains the 2468 shortcomings of the XID as a basis for a reply cache and describes 2469 how NFSv4.1 sessions improve upon the XID. 2471 2.10.6.1. Slot Identifiers and Reply Cache 2473 The RPC layer provides a transaction ID (XID), which, while required 2474 to be unique, is not convenient for tracking requests for two 2475 reasons. First, the XID is only meaningful to the requester; it 2476 cannot be interpreted by the replier except to test for equality with 2477 previously sent requests. When consulting an RPC-based duplicate 2478 request cache, the opaqueness of the XID requires a computationally 2479 expensive lookup (often via a hash that includes XID and source 2480 address). NFSv4.1 requests use a non-opaque slot ID, which is an 2481 index into a slot table, which is far more efficient. Second, 2482 because RPC requests can be executed by the replier in any order, 2483 there is no bound on the number of requests that may be outstanding 2484 at any time. To achieve perfect EOS, using ONC RPC would require 2485 storing all replies in the reply cache. XIDs are 32 bits; storing 2486 over four billion (2^32) replies in the reply cache is not practical. 2487 In practice, previous versions of NFS have chosen to store a fixed 2488 number of replies in the cache, and to use a least recently used 2489 (LRU) approach to replacing cache entries with new entries when the 2490 cache is full. In NFSv4.1, the number of outstanding requests is 2491 bounded by the size of the slot table, and a sequence ID per slot is 2492 used to tell the replier when it is safe to delete a cached reply. 2494 In the NFSv4.1 reply cache, when the requester sends a new request, 2495 it selects a slot ID in the range 0..N, where N is the replier's 2496 current maximum slot ID granted to the requester on the session over 2497 which the request is to be sent. The value of N starts out as equal 2498 to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the 2499 response to SEQUENCE or CB_SEQUENCE as described later in this 2500 section. The slot ID must be unused by any of the requests that the 2501 requester has already active on the session. "Unused" here means the 2502 requester has no outstanding request for that slot ID. 2504 A slot contains a sequence ID and the cached reply corresponding to 2505 the request sent with that sequence ID. The sequence ID is a 32-bit 2506 unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 - 2507 1). The first time a slot is used, the requester MUST specify a 2508 sequence ID of one (Section 18.36). Each time a slot is reused, the 2509 request MUST specify a sequence ID that is one greater than that of 2510 the previous request on the slot. If the previous sequence ID was 2511 0xFFFFFFFF, then the next request for the slot MUST have the sequence 2512 ID set to zero (i.e., (2^32 - 1) + 1 mod 2^32). 2514 The sequence ID accompanies the slot ID in each request. It is for 2515 the critical check at the replier: it is used to efficiently determine 2516 whether a request using a certain slot ID is a retransmit or a new, 2517 never-before-seen request. It is not feasible for the requester to 2518 assert that it is retransmitting to implement this, because for any 2519 given request the requester cannot know whether the replier has seen 2520 it unless the replier actually replies. Of course, if the requester 2521 has seen the reply, the requester would not retransmit.
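The sequence ID arithmetic just described is simple enough to sketch directly. The following non-normative C fragment (function names are illustrative) shows the wraparound increment used when a slot is reused and the two basic checks the replier applies; the complete classification of incoming sequence IDs, including the misordered cases, appears in the list that follows.

   #include <stdbool.h>
   #include <stdint.h>

   /* Sequence IDs are 32-bit unsigned values.  A slot's first use
    * carries sequence ID 1; each reuse adds one, and 0xFFFFFFFF wraps
    * to 0, i.e., (2^32 - 1) + 1 mod 2^32.  Unsigned arithmetic in C
    * wraps the same way. */
   static uint32_t next_seqid(uint32_t seqid)
   {
       return seqid + 1;
   }

   /* Replier-side checks against the sequence ID recorded in the slot:
    * an equal sequence ID marks a retransmit; one exactly one greater
    * marks a new request. */
   static bool is_retransmit(uint32_t recorded, uint32_t received)
   {
       return received == recorded;
   }

   static bool is_new_request(uint32_t recorded, uint32_t received)
   {
       return received == next_seqid(recorded);
   }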
2523 The replier compares each received request's sequence ID with the 2524 last one previously received for that slot ID, to see if the new 2525 request is: 2527 * A new request, in which the sequence ID is one greater than that 2528 previously seen in the slot (accounting for sequence wraparound). 2529 The replier proceeds to execute the new request, and the replier 2530 MUST increase the slot's sequence ID by one. 2532 * A retransmitted request, in which the sequence ID is equal to that 2533 currently recorded in the slot. If the original request has 2534 executed to completion, the replier returns the cached reply. See 2535 Section 2.10.6.2 for direction on how the replier deals with 2536 retries of requests that are still in progress. 2538 * A misordered retry, in which the sequence ID is less than 2539 (accounting for sequence wraparound) that previously seen in the 2540 slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the 2541 result from SEQUENCE or CB_SEQUENCE). 2543 * A misordered new request, in which the sequence ID is two or more 2544 than (accounting for sequence wraparound) that previously seen in 2545 the slot. Note that because the sequence ID MUST wrap around to 2546 zero once it reaches 0xFFFFFFFF, a misordered new request and a 2547 misordered retry cannot be distinguished. Thus, the replier MUST 2548 return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or 2549 CB_SEQUENCE). 2551 Unlike the XID, the slot ID is always within a specific range; this 2552 has two implications. The first implication is that for a given 2553 session, the replier need only cache the results of a limited number 2554 of COMPOUND requests. The second implication derives from the first, 2555 which is that unlike XID-indexed reply caches (also known as 2556 duplicate request caches - DRCs), the slot ID-based reply cache 2557 cannot be overflowed. Through use of the sequence ID to identify 2558 retransmitted requests, the replier does not need to actually cache 2559 the request itself, reducing the storage requirements of the reply 2560 cache further. These facilities make it practical to maintain all 2561 the required entries for an effective reply cache. 2563 The slot ID, sequence ID, and session ID therefore take over the 2564 traditional role of the XID and source network address in the 2565 replier's reply cache implementation. This approach is considerably 2566 more portable and completely robust -- it is not subject to the 2567 reassignment of ports as clients reconnect over IP networks. In 2568 addition, the RPC XID is not used in the reply cache, enhancing 2569 robustness of the cache in the face of any rapid reuse of XIDs by the 2570 requester. While the replier does not care about the XID for the 2571 purposes of reply cache management (but the replier MUST return the 2572 same XID that was in the request), nonetheless there are 2573 considerations for the XID in NFSv4.1 that are the same as all other 2574 previous versions of NFS. The RPC XID remains in each message and 2575 needs to be formulated in NFSv4.1 requests as in any other ONC RPC 2576 request. The reasons include: 2578 * The RPC layer retains its existing semantics and implementation. 2580 * The requester and replier must be able to interoperate at the RPC 2581 layer, prior to the NFSv4.1 decoding of the SEQUENCE or 2582 CB_SEQUENCE operation. 
2584 * If an operation is being used that does not start with SEQUENCE or 2585 CB_SEQUENCE (e.g., BIND_CONN_TO_SESSION), then the RPC XID is 2586 needed for correct operation to match the reply to the request. 2588 * The SEQUENCE or CB_SEQUENCE operation may generate an error. If 2589 so, the embedded slot ID, sequence ID, and session ID (if present) 2590 in the request will not be in the reply, and the requester has 2591 only the XID to match the reply to the request. 2593 Given that well-formulated XIDs continue to be required, this raises 2594 the question: why do SEQUENCE and CB_SEQUENCE replies have a session 2595 ID, slot ID, and sequence ID? Having the session ID in the reply 2596 means that the requester does not have to use the XID to look up the 2597 session ID, which would be necessary if the connection were 2598 associated with multiple sessions. Having the slot ID and sequence 2599 ID in the reply means that the requester does not have to use the XID 2600 to look up the slot ID and sequence ID. Furthermore, since the XID 2601 is only 32 bits, it is too small to guarantee the re-association of a 2602 reply with its request [44]; having session ID, slot ID, and sequence 2603 ID in the reply allows the client to validate that the reply in fact 2604 belongs to the matched request. 2606 The SEQUENCE (and CB_SEQUENCE) operation also carries a 2607 "highest_slotid" value, which carries additional requester slot usage 2608 information. The requester MUST always indicate the slot ID 2609 representing the outstanding request with the highest-numbered slot 2610 value. The requester should in all cases provide the most 2611 conservative value possible, although it can be increased somewhat 2612 above the actual instantaneous usage to maintain some minimum or 2613 optimal level. This provides a way for the requester to yield unused 2614 request slots back to the replier, which in turn can use the 2615 information to reallocate resources. 2617 The replier responds with both a new target highest_slotid and an 2618 enforced highest_slotid, described as follows: 2620 * The target highest_slotid is an indication to the requester of the 2621 highest_slotid the replier wishes the requester to be using. This 2622 permits the replier to withdraw (or add) resources from a 2623 requester that has been found to not be using them, in order to 2624 more fairly share resources among a varying level of demand from 2625 other requesters. The requester must always comply with the 2626 replier's value updates, since they indicate newly established 2627 hard limits on the requester's access to session resources. 2628 However, because of request pipelining, the requester may have 2629 active requests in flight reflecting prior values; therefore, the 2630 replier must not immediately require the requester to comply. 2632 * The enforced highest_slotid indicates the highest slot ID the 2633 requester is permitted to use on a subsequent SEQUENCE or 2634 CB_SEQUENCE operation. The replier's enforced highest_slotid 2635 SHOULD be no less than the highest_slotid the requester indicated 2636 in the SEQUENCE or CB_SEQUENCE arguments. 2638 A requester can be intransigent with respect to lowering its 2639 highest_slotid argument to a Sequence operation, i.e. the 2640 requester continues to ignore the target highest_slotid in the 2641 response to a Sequence operation, and continues to set its 2642 highest_slotid argument to be higher than the target 2643 highest_slotid. 
This can be considered particularly egregious 2644 behavior when the replier knows there are no outstanding requests 2645 with slot IDs higher than its target highest_slotid. When faced 2646 with such intransigence, the replier is free to take more forceful 2647 action, and MAY reply with a new enforced highest_slotid that is 2648 less than its previous enforced highest_slotid. Thereafter, if 2649 the requester continues to send requests with a highest_slotid 2650 that is greater than the replier's new enforced highest_slotid, 2651 the server MAY return NFS4ERR_BAD_HIGH_SLOT, unless the slot ID in 2652 the request is greater than the new enforced highest_slotid and 2653 the request is a retry. 2655 The replier SHOULD retain the slots it wants to retire until the 2656 requester sends a request with a highest_slotid less than or equal 2657 to the replier's new enforced highest_slotid. 2659 The requester can also be intransigent with respect to sending 2660 non-retry requests that have a slot ID that exceeds the replier's 2661 highest_slotid. Once the replier has forcibly lowered the 2662 enforced highest_slotid, the requester is only allowed to send 2663 retries on slots that exceed the replier's highest_slotid. If a 2664 request is received with a slot ID that is higher than the new 2665 enforced highest_slotid, and the sequence ID is one higher than 2666 what is in the slot's reply cache, then the server can both retire 2667 the slot and return NFS4ERR_BADSLOT (however, the server MUST NOT 2668 do one and not the other). The reason it is safe to retire the 2669 slot is because by using the next sequence ID, the requester is 2670 indicating it has received the previous reply for the slot. 2672 * The requester SHOULD use the lowest available slot when sending a 2673 new request. This way, the replier may be able to retire slot 2674 entries faster. However, where the replier is actively adjusting 2675 its granted highest_slotid, it will not be able to use only the 2676 receipt of the slot ID and highest_slotid in the request. Neither 2677 the slot ID nor the highest_slotid used in a request may reflect 2678 the replier's current idea of the requester's session limit, 2679 because the request may have been sent from the requester before 2680 the update was received. Therefore, in the downward adjustment 2681 case, the replier may have to retain a number of reply cache 2682 entries at least as large as the old value of maximum requests 2683 outstanding, until it can infer that the requester has seen a 2684 reply containing the new granted highest_slotid. The replier can 2685 infer that the requester has seen such a reply when it receives a 2686 new request with the same slot ID as the request replied to and 2687 the next higher sequence ID. 2689 2.10.6.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies 2691 When a SEQUENCE or CB_SEQUENCE operation is successfully executed, 2692 its reply MUST always be cached. Specifically, session ID, sequence 2693 ID, and slot ID MUST be cached in the reply cache. The reply from 2694 SEQUENCE also includes the highest slot ID, target highest slot ID, 2695 and status flags. Instead of caching these values, the server MAY 2696 re-compute the values from the current state of the fore channel, 2697 session, and/or client ID as appropriate. Similarly, the reply from 2698 CB_SEQUENCE includes a highest slot ID and target highest slot ID. 2699 The client MAY re-compute the values from the current state of the 2700 session as appropriate. 
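A non-normative sketch of a per-slot reply cache entry may make this concrete. The field names below are illustrative; only the session ID, slot ID, sequence ID, and the cached results need to be stored verbatim, while the highest slot ID, target highest slot ID, and status flags can be re-computed when a cached reply is returned:

   #include <stddef.h>
   #include <stdint.h>

   #define NFS4_SESSIONID_SIZE 16   /* size of an NFSv4.1 session ID */

   /* Illustrative per-slot reply cache entry on the replier.  The
    * session ID, slot ID, and sequence ID are cached along with the
    * encoded results; the highest slot ID, target highest slot ID,
    * and status flags MAY instead be re-computed from current fore
    * channel, session, or client ID state when a retry is answered. */
   struct reply_cache_entry {
       unsigned char  sessionid[NFS4_SESSIONID_SIZE];
       uint32_t       slotid;
       uint32_t       seqid;
       void          *cached_reply;      /* encoded reply body */
       size_t         cached_reply_len;
   };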
2702 Regardless of whether or not a replier is re-computing highest slot 2703 ID, target slot ID, and status on replies to retries, the requester 2704 MUST NOT assume that the values are being re-computed whenever it 2705 receives a reply after a retry is sent, since it has no way of 2706 knowing whether the reply it has received was sent by the replier in 2707 response to the retry or is a delayed response to the original 2708 request. Therefore, it may be the case that highest slot ID, target 2709 slot ID, or status bits may reflect the state of affairs when the 2710 request was first executed. Although acting based on such delayed 2711 information is valid, it may cause the receiver of the reply to do 2712 unneeded work. Requesters MAY choose to send additional requests to 2713 get the current state of affairs or use the state of affairs reported 2714 by subsequent requests, in preference to acting immediately on data 2715 that might be out of date. 2717 2.10.6.1.2. Errors from SEQUENCE and CB_SEQUENCE 2719 Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of 2720 the slot MUST NOT change. The replier MUST NOT modify the reply 2721 cache entry for the slot whenever an error is returned from SEQUENCE 2722 or CB_SEQUENCE. 2724 2.10.6.1.3. Optional Reply Caching 2726 On a per-request basis, the requester can choose to direct the 2727 replier to cache the reply to all operations after the first 2728 operation (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or 2729 csa_cachethis fields of the arguments to SEQUENCE or CB_SEQUENCE. 2730 The reason it would not direct the replier to cache the entire reply 2731 is that the request is composed only of idempotent operations [41]. 2732 Caching the reply may offer little benefit. If the reply is too 2733 large (see Section 2.10.6.4), it may not be cacheable anyway. Even 2734 if the reply to an idempotent request is small enough to cache, 2735 unnecessarily caching the reply slows down the server and increases 2736 RPC latency. 2738 Whether or not the requester requests the reply to be cached has no 2739 effect on the slot processing. If the result of SEQUENCE or 2740 CB_SEQUENCE is NFS4_OK, then the slot's sequence ID MUST be 2741 incremented by one. If a requester does not direct the replier to 2742 cache the reply, the replier MUST do one of the following: 2744 * The replier can cache the entire original reply. Even though 2745 sa_cachethis or csa_cachethis is FALSE, the replier is always free 2746 to cache. It may choose this approach in order to simplify 2747 implementation. 2749 * The replier enters into its reply cache a reply consisting of the 2750 original results to the SEQUENCE or CB_SEQUENCE operation, and 2751 with the next operation in COMPOUND or CB_COMPOUND having the 2752 error NFS4ERR_RETRY_UNCACHED_REP. Thus, if the requester later 2753 retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP. If a 2754 replier receives a retried Sequence operation where the reply to 2755 the COMPOUND or CB_COMPOUND was not cached, then the replier, 2757 - MAY return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence 2758 operation if the Sequence operation is not the first operation 2759 (granted, a requester that does so is in violation of the 2760 NFSv4.1 protocol). 2762 - MUST NOT return NFS4ERR_RETRY_UNCACHED_REP in reply to a 2763 Sequence operation if the Sequence operation is the first 2764 operation.
2766 * If the second operation is an illegal operation, or an operation 2767 that was legal in a previous minor version of NFSv4 and MUST NOT 2768 be supported in the current minor version (e.g., SETCLIENTID), the 2769 replier MUST NOT ever return NFS4ERR_RETRY_UNCACHED_REP. Instead 2770 the replier MUST return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or 2771 NFS4ERR_NOTSUPP as appropriate. 2773 * If the second operation can result in another error status, the 2774 replier MAY return a status other than NFS4ERR_RETRY_UNCACHED_REP, 2775 provided the operation is not executed in such a way that the 2776 state of the replier is changed. Examples of such an error status 2777 include: NFS4ERR_NOTSUPP returned for an operation that is legal 2778 but not REQUIRED in the current minor versions, and thus not 2779 supported by the replier; NFS4ERR_SEQUENCE_POS; and 2780 NFS4ERR_REQ_TOO_BIG. 2782 The discussion above assumes that the retried request matches the 2783 original one. Section 2.10.6.1.3.1 discusses what the replier might 2784 do, and MUST do when original and retried requests do not match. 2785 Since the replier may only cache a small amount of the information 2786 that would be required to determine whether this is a case of a false 2787 retry, the replier may send to the client any of the following 2788 responses: 2790 * The cached reply to the original request (if the replier has 2791 cached it in its entirety and the users of the original request 2792 and retry match). 2794 * A reply that consists only of the Sequence operation with the 2795 error NFS4ERR_SEQ_FALSE_RETRY. 2797 * A reply consisting of the response to Sequence with the status 2798 NFS4_OK, together with the second operation as it appeared in the 2799 retried request with an error of NFS4ERR_RETRY_UNCACHED_REP or 2800 other error as described above. 2802 * A reply that consists of the response to Sequence with the status 2803 NFS4_OK, together with the second operation as it appeared in the 2804 original request with an error of NFS4ERR_RETRY_UNCACHED_REP or 2805 other error as described above. 2807 2.10.6.1.3.1. False Retry 2809 If a requester sent a Sequence operation with a slot ID and sequence 2810 ID that are in the reply cache but the replier detected that the 2811 retried request is not the same as the original request, including a 2812 retry that has different operations or different arguments in the 2813 operations from the original and a retry that uses a different 2814 principal in the RPC request's credential field that translates to a 2815 different user, then this is a false retry. When the replier detects 2816 a false retry, it is permitted (but not always obligated) to return 2817 NFS4ERR_SEQ_FALSE_RETRY in response to the Sequence operation when it 2818 detects a false retry. 2820 Translations of particularly privileged user values to other users 2821 due to the lack of appropriately secure credentials, as configured on 2822 the replier, should be applied before determining whether the users 2823 are the same or different. If the replier determines the users are 2824 different between the original request and a retry, then the replier 2825 MUST return NFS4ERR_SEQ_FALSE_RETRY. 
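Because a replier typically caches only a small summary of each request, false-retry detection can be sketched as below. This non-normative C fragment assumes the replier stores, alongside the cached reply, a digest of the operations and their arguments together with the user that the RPC credential mapped to after any server-side translation; all type, field, and function names are illustrative.

   #include <stdbool.h>
   #include <stdint.h>
   #include <sys/types.h>   /* uid_t */

   /* Illustrative summary kept with each cached reply. */
   struct request_summary {
       uint64_t body_digest;  /* digest over operations and arguments  */
       uid_t    mapped_user;  /* user after any credential translation */
   };

   /* A retry that does not match the original is a false retry.  When
    * the users differ, the replier MUST return NFS4ERR_SEQ_FALSE_RETRY;
    * when only the request body differs, it is permitted but not
    * obligated to do so. */
   static bool is_false_retry(const struct request_summary *orig,
                              const struct request_summary *retry,
                              bool *users_differ)
   {
       *users_differ = (orig->mapped_user != retry->mapped_user);
       return *users_differ || orig->body_digest != retry->body_digest;
   }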
2827 If an operation of the retry is an illegal operation, or an operation 2828 that was legal in a previous minor version of NFSv4 and MUST NOT be 2829 supported in the current minor version (e.g., SETCLIENTID), the 2830 replier MAY return NFS4ERR_SEQ_FALSE_RETRY (and MUST do so if the 2831 users of the original request and retry differ). Otherwise, the 2832 replier MAY return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or 2833 NFS4ERR_NOTSUPP as appropriate. Note that this handling is in 2834 contrast to how the replier deals with retried requests with no 2835 cached reply. The difference is due to NFS4ERR_SEQ_FALSE_RETRY being 2836 a valid error for only Sequence operations, whereas 2837 NFS4ERR_RETRY_UNCACHED_REP is a valid error for all operations except 2838 illegal operations and operations that MUST NOT be supported in the 2839 current minor version of NFSv4. 2841 2.10.6.2. Retry and Replay of Reply 2843 A requester MUST NOT retry a request, unless the connection it used 2844 to send the request disconnects. The requester can then reconnect 2845 and re-send the request, or it can re-send the request over a 2846 different connection that is associated with the same session. 2848 If the requester is a server wanting to re-send a callback operation 2849 over the backchannel of a session, the requester of course cannot 2850 reconnect because only the client can associate connections with the 2851 backchannel. The server can re-send the request over another 2852 connection that is bound to the same session's backchannel. If there 2853 is no such connection, the server MUST indicate that the session has 2854 no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag 2855 bit in the response to the next SEQUENCE operation from the client. 2856 The client MUST then associate a connection with the session (or 2857 destroy the session). 2859 Note that it is not fatal for a requester to retry without a 2860 disconnect between the request and retry. However, the retry does 2861 consume resources, especially with RDMA, where each request, retry or 2862 not, consumes a credit. Retries for no reason, especially retries 2863 sent shortly after the previous attempt, are a poor use of network 2864 bandwidth and defeat the purpose of a transport's inherent congestion 2865 control system. 2867 A requester MUST wait for a reply to a request before using the slot 2868 for another request. If it does not wait for a reply, then the 2869 requester does not know what sequence ID to use for the slot on its 2870 next request. For example, suppose a requester sends a request with 2871 sequence ID 1, and does not wait for the response. The next time it 2872 uses the slot, it sends the new request with sequence ID 2. If the 2873 replier has not seen the request with sequence ID 1, then the replier 2874 is not expecting sequence ID 2, and rejects the requester's new 2875 request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or 2876 CB_SEQUENCE). 2878 RDMA fabrics do not guarantee that the memory handles (Steering Tags) 2879 within each RPC/RDMA "chunk" [32] are valid on a scope outside that 2880 of a single connection. Therefore, handles used by the direct 2881 operations become invalid after connection loss. The server must 2882 ensure that any RDMA operations that must be replayed from the reply 2883 cache use the newly provided handle(s) from the most recent request. 2885 A retry might be sent while the original request is still in progress 2886 on the replier.
The replier SHOULD deal with the issue by returning 2887 NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE operation, but 2888 implementations MAY return NFS4ERR_SEQ_MISORDERED. Since errors from 2889 SEQUENCE and CB_SEQUENCE are never recorded in the reply cache, this 2890 approach allows the results of the execution of the original request 2891 to be properly recorded in the reply cache (assuming that the 2892 requester specified the reply to be cached). 2894 2.10.6.3. Resolving Server Callback Races 2896 It is possible for server callbacks to arrive at the client before 2897 the reply from related fore channel operations. For example, a 2898 client may have been granted a delegation to a file it has opened, 2899 but the reply to the OPEN (informing the client of the granting of 2900 the delegation) may be delayed in the network. If a conflicting 2901 operation arrives at the server, it will recall the delegation using 2902 the backchannel, which may be on a different transport connection, 2903 perhaps even a different network, or even a different session 2904 associated with the same client ID. 2906 The presence of a session between the client and server alleviates 2907 this issue. When a session is in place, each client request is 2908 uniquely identified by its { session ID, slot ID, sequence ID } 2909 triple. By the rules under which slot entries (reply cache entries) 2910 are retired, the server knows whether the client has "seen" 2911 each of the server's replies. The server can therefore provide 2912 sufficient information to the client to allow it to determine 2913 whether a callback has raced the reply to a related fore channel request. 2915 For each client operation that might result in some sort of server 2916 callback, the server SHOULD "remember" the { session ID, slot ID, 2917 sequence ID } triple of the client request until the slot ID 2918 retirement rules allow the server to determine that the client has, 2919 in fact, seen the server's reply. Until the time the { session ID, 2920 slot ID, sequence ID } request triple can be retired, any recalls of 2921 the associated object MUST carry an array of these referring 2922 identifiers (in the CB_SEQUENCE operation's arguments), for the 2923 benefit of the client. After this time, it is not necessary for the 2924 server to provide this information in related callbacks, since it is 2925 certain that a race condition can no longer occur. 2927 The CB_SEQUENCE operation that begins each server callback carries a 2928 list of "referring" { session ID, slot ID, sequence ID } triples. If 2929 the client finds the request corresponding to the referring session 2930 ID, slot ID, and sequence ID to be currently outstanding (i.e., the 2931 server's reply has not been seen by the client), it can determine 2932 that the callback has raced the reply, and act accordingly. If the 2933 client does not find the request corresponding to the referring 2934 triple to be outstanding (including the case of a session ID 2935 referring to a destroyed session), then there is no race with respect 2936 to this triple. The server SHOULD limit the referring triples to 2937 just those requests that apply to the objects referred 2938 to in the CB_COMPOUND procedure. 2940 The client must not simply wait forever for the expected server reply 2941 to arrive before responding to the CB_COMPOUND that won the race, 2942 because it is possible that it will be delayed indefinitely.
The 2943 client should assume the likely case that the reply will arrive 2944 within the average round-trip time for COMPOUND requests to the 2945 server, and wait that period of time. If that period of time 2946 expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY. There 2947 are other scenarios under which callbacks may race replies. Among 2948 them are pNFS layout recalls as described in Section 12.5.5.2. 2950 2.10.6.4. COMPOUND and CB_COMPOUND Construction Issues 2952 Very large requests and replies may pose both buffer management 2953 issues (especially with RDMA) and reply cache issues. When the 2954 session is created (Section 18.36), for each channel (fore and back), 2955 the client and server negotiate the maximum-sized request they will 2956 send or process (ca_maxrequestsize), the maximum-sized reply they 2957 will return or process (ca_maxresponsesize), and the maximum-sized 2958 reply they will store in the reply cache (ca_maxresponsesize_cached). 2960 If a request exceeds ca_maxrequestsize, the reply will have the 2961 status NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG 2962 as the status for the first operation (SEQUENCE or CB_SEQUENCE) in 2963 the request (which means that no operations in the request executed 2964 and that the state of the slot in the reply cache is unchanged), or 2965 it MAY opt to return it on a subsequent operation in the same 2966 COMPOUND or CB_COMPOUND request (which means that at least one 2967 operation did execute and that the state of the slot in the reply 2968 cache does change). The replier SHOULD set NFS4ERR_REQ_TOO_BIG on 2969 the operation that exceeds ca_maxrequestsize. 2971 If a reply exceeds ca_maxresponsesize, the reply will have the status 2972 NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the 2973 status for the first operation (SEQUENCE or CB_SEQUENCE) in the 2974 request, or it MAY opt to return it on a subsequent operation (in the 2975 same COMPOUND or CB_COMPOUND reply). A replier MAY return 2976 NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if 2977 the response would still exceed ca_maxresponsesize. 2979 If sa_cachethis or csa_cachethis is TRUE, then the replier MUST cache 2980 a reply except if an error is returned by the SEQUENCE or CB_SEQUENCE 2981 operation (see Section 2.10.6.1.2). If the reply exceeds 2982 ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis is 2983 TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. 2984 Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that 2985 matter) is returned on an operation other than the first operation 2986 (SEQUENCE or CB_SEQUENCE), then the reply MUST be cached if 2987 sa_cachethis or csa_cachethis is TRUE. For example, if a COMPOUND 2988 has eleven operations, including SEQUENCE, the fifth operation is a 2989 RENAME, and the tenth operation is a READ for one million bytes, the 2990 server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth 2991 operation. Since the server executed several operations, especially 2992 the non-idempotent RENAME, the client's request to cache the reply 2993 needs to be honored in order for the correct operation of exactly 2994 once semantics. If the client retries the request, the server will 2995 have cached a reply that contains results for ten of the eleven 2996 requested operations, with the tenth operation having a status of 2997 NFS4ERR_REP_TOO_BIG_TO_CACHE. 
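The interaction of the negotiated limits with these errors can be summarized in a non-normative C sketch. The structure and enumeration below are illustrative stand-ins for the channel attributes negotiated at CREATE_SESSION and for the corresponding NFSv4.1 status codes, whose actual values are defined by the protocol's XDR:

   #include <stdbool.h>
   #include <stddef.h>

   /* Illustrative subset of the per-channel attributes negotiated at
    * session creation (Section 18.36). */
   struct channel_limits {
       size_t ca_maxrequestsize;
       size_t ca_maxresponsesize;
       size_t ca_maxresponsesize_cached;
   };

   /* Stand-ins for the relevant NFSv4.1 status codes; real values are
    * defined by the protocol's XDR. */
   enum reply_size_status {
       REPLY_SIZE_OK,              /* NFS4_OK                      */
       REPLY_TOO_BIG,              /* NFS4ERR_REP_TOO_BIG          */
       REPLY_TOO_BIG_TO_CACHE      /* NFS4ERR_REP_TOO_BIG_TO_CACHE */
   };

   static enum reply_size_status
   check_reply_size(const struct channel_limits *lim,
                    size_t reply_size, bool cachethis)
   {
       if (reply_size > lim->ca_maxresponsesize)
           return REPLY_TOO_BIG;
       if (cachethis && reply_size > lim->ca_maxresponsesize_cached)
           return REPLY_TOO_BIG_TO_CACHE;
       return REPLY_SIZE_OK;
   }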
2999 A client needs to take care that, when sending operations that change 3000 the current filehandle (except for PUTFH, PUTPUBFH, PUTROOTFH, and 3001 RESTOREFH), it does not exceed the maximum reply buffer before the 3002 GETFH operation. Otherwise, the client will have to retry the 3003 operation that changed the current filehandle, in order to obtain the 3004 desired filehandle. For the OPEN operation (see Section 18.16), 3005 retry is not always available as an option. The following guidelines 3006 for the handling of filehandle-changing operations are advised: 3008 * Within the same COMPOUND procedure, a client SHOULD send GETFH 3009 immediately after a current filehandle-changing operation. A 3010 client MUST send GETFH after a current filehandle-changing 3011 operation that is also non-idempotent (e.g., the OPEN operation), 3012 unless the operation is RESTOREFH. RESTOREFH is an exception, 3013 because even though it is non-idempotent, the filehandle RESTOREFH 3014 produced originated from an operation that is either idempotent 3015 (e.g., PUTFH, LOOKUP), or non-idempotent (e.g., OPEN, CREATE). If 3016 the origin is non-idempotent, then because the client MUST send 3017 GETFH after the origin operation, the client can recover if 3018 RESTOREFH returns an error. 3020 * A server MAY return NFS4ERR_REP_TOO_BIG or 3021 NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a 3022 filehandle-changing operation if the reply would be too large on 3023 the next operation. 3025 * A server SHOULD return NFS4ERR_REP_TOO_BIG or 3026 NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a 3027 filehandle-changing, non-idempotent operation if the reply would 3028 be too large on the next operation, especially if the operation is 3029 OPEN. 3031 * A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent 3032 current filehandle-changing operation, if it looks at the next 3033 operation (in the same COMPOUND procedure) and finds it is not 3034 GETFH. The server SHOULD do this if it is unable to determine in 3035 advance whether the total response size would exceed 3036 ca_maxresponsesize_cached or ca_maxresponsesize. 3038 2.10.6.5. Persistence 3040 Since the reply cache is bounded, it is practical for the reply cache 3041 to persist across server restarts. The replier MUST persist the 3042 following information if it agreed to persist the session (when the 3043 session was created; see Section 18.36): 3045 * The session ID. 3047 * The slot table including the sequence ID and cached reply for each 3048 slot. 3050 The above are sufficient for a replier to provide EOS semantics for 3051 any requests that were sent and executed before the server restarted. 3052 If the replier is a client, then there is no need for it to persist 3053 any more information, unless the client will be persisting all other 3054 state across client restart, in which case, the server will never see 3055 any NFSv4.1-level protocol manifestation of a client restart. If the 3056 replier is a server, with just the slot table and session ID 3057 persisting, any requests the client retries after the server restart 3058 will return the results that are cached in the reply cache, and any 3059 new requests (i.e., the sequence ID is one greater than the slot's 3060 sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by 3061 SEQUENCE). Such a session is considered dead. A server MAY re- 3062 animate a session after a server restart so that the session will 3063 accept new requests as well as retries. 
To re-animate a session, the 3064 server needs to persist additional information through server 3065 restart: 3067 * The client ID. This is a prerequisite to let the client create 3068 more sessions associated with the same client ID as the re- 3069 animated session. 3071 * The client ID's sequence ID that is used for creating sessions 3072 (see Sections 18.35 and 18.36). This is a prerequisite to let the 3073 client create more sessions. 3075 * The principal that created the client ID. This allows the server 3076 to authenticate the client when it sends EXCHANGE_ID. 3078 * The SSV, if SP4_SSV state protection was specified when the client 3079 ID was created (see Section 18.35). This lets the client create 3080 new sessions, and associate connections with the new and existing 3081 sessions. 3083 * The properties of the client ID as defined in Section 18.35. 3085 A persistent reply cache places certain demands on the server. The 3086 execution of the sequence of operations (starting with SEQUENCE) and 3087 placement of its results in the persistent cache MUST be atomic. If 3088 a client retries a sequence of operations that was previously 3089 executed on the server, the only acceptable outcomes are either the 3090 original cached reply or an indication that the client ID or session 3091 has been lost (indicating a catastrophic loss of the reply cache or a 3092 session that has been deleted because the client failed to use the 3093 session for an extended period of time). 3095 A server could fail and restart in the middle of a COMPOUND procedure 3096 that contains one or more non-idempotent or idempotent-but-modifying 3097 operations. This creates an even higher challenge for atomic 3098 execution and placement of results in the reply cache. One way to 3099 view the problem is as a single transaction consisting of each 3100 operation in the COMPOUND followed by storing the result in 3101 persistent storage, then finally a transaction commit. If there is a 3102 failure before the transaction is committed, then the server rolls 3103 back the transaction. If the server itself fails, then when it 3104 restarts, its recovery logic could roll back the transaction before 3105 starting the NFSv4.1 server. 3107 While the description of the implementation for atomic execution of 3108 the request and caching of the reply is beyond the scope of this 3109 document, an example implementation for NFSv2 [45] is described in 3110 [46]. 3112 2.10.7. RDMA Considerations 3114 A complete discussion of the operation of RPC-based protocols over 3115 RDMA transports is in [32]. A discussion of the operation of NFSv4, 3116 including NFSv4.1, over RDMA is in [33]. Where RDMA is considered, 3117 this specification assumes the use of such a layering; it addresses 3118 only the upper-layer issues relevant to making best use of RPC/RDMA. 3120 2.10.7.1. RDMA Connection Resources 3122 RDMA requires its consumers to register memory and post buffers of a 3123 specific size and number for receive operations. 3125 Registration of memory can be a relatively high-overhead operation, 3126 since it requires pinning of buffers, assignment of attributes (e.g., 3127 readable/writable), and initialization of hardware translation. 3128 Preregistration is desirable to reduce overhead. These registrations 3129 are specific to hardware interfaces and even to RDMA connection 3130 endpoints; therefore, negotiation of their limits is desirable to 3131 manage resources effectively. 
3133 Following basic registration, these buffers must be posted by the RPC 3134 layer to handle receives. These buffers remain in use by the RPC/ 3135 NFSv4.1 implementation; the size and number of them must be known to 3136 the remote peer in order to avoid RDMA errors that would be 3137 fatal to the RDMA connection. 3139 NFSv4.1 manages slots as resources on a per-session basis (see 3140 Section 2.10), while RDMA connections manage credits on a per- 3141 connection basis. This means that in order for a peer to send data 3142 over RDMA to a remote buffer, it has to have both an NFSv4.1 slot and 3143 an RDMA credit. If multiple RDMA connections are associated with a 3144 session, then if the total number of credits across all RDMA 3145 connections associated with the session is X, and the number of slots 3146 in the session is Y, then the maximum number of outstanding requests 3147 is the lesser of X and Y. 3149 2.10.7.2. Flow Control 3151 Previous versions of NFS do not provide flow control; instead, they 3152 rely on the windowing provided by transports like TCP to throttle 3153 requests. This does not work with RDMA, which provides no operation 3154 flow control and will terminate a connection in error when limits are 3155 exceeded. Limits such as maximum number of requests outstanding are 3156 therefore negotiated when a session is created (see the 3157 ca_maxrequests field in Section 18.36). These limits then provide 3158 the maxima within which each connection associated with the session's 3159 channel(s) must remain. RDMA connections are managed within these 3160 limits as described in Section 3.3 of [32]; if there are multiple 3161 RDMA connections, then the maximum number of requests for a channel 3162 will be divided among the RDMA connections. Put a different way, the 3163 onus is on the replier to ensure that the total number of RDMA 3164 credits across all connections associated with the replier's channel 3165 does not exceed the channel's maximum number of outstanding requests. 3167 The limits may also be modified dynamically at the replier's choosing 3168 by manipulating certain parameters present in each NFSv4.1 reply. In 3169 addition, the CB_RECALL_SLOT callback operation (see Section 20.8) 3170 can be sent by a server to a client to return RDMA credits to the 3171 server, thereby lowering the maximum number of requests a client can 3172 have outstanding to the server. 3174 2.10.7.3. Padding 3176 Header padding is requested by each peer at session initiation (see 3177 the ca_headerpadsize argument to CREATE_SESSION in Section 18.36), 3178 and subsequently used by the RPC RDMA layer, as described in [32]. 3179 Zero padding is permitted. 3181 Padding leverages the useful property that RDMA preserves alignment 3182 of data, even when the data are placed into anonymous (untagged) 3183 buffers. If requested, client inline writes will insert appropriate 3184 pad bytes within the request header to align the data payload on the 3185 specified boundary. The client is encouraged to add sufficient 3186 padding (up to the negotiated size) so that the "data" field of the 3187 WRITE operation is aligned. Most servers can make good use of such 3188 padding, which allows them to chain receive buffers in such a way 3189 that any data carried by client requests will be placed into 3190 appropriate buffers at the server, ready for file system processing.
The receiver's RPC 3191 layer encounters no overhead from skipping over pad bytes, and the 3192 RDMA layer's high performance makes the insertion and transmission of 3193 padding on the sender a significant optimization. In this way, the 3194 need for servers to perform RDMA Read to satisfy all but the largest 3195 client writes is obviated. An added benefit is the reduction of 3196 message round trips on the network -- a potentially good trade, where 3197 latency is present. 3199 The value to choose for padding is subject to a number of criteria. 3200 A primary source of variable-length data in the RPC header is the 3201 authentication information, the form of which is client-determined, 3202 possibly in response to server specification. The contents of 3203 COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all 3204 go into the determination of a maximal NFSv4.1 request size and 3205 therefore minimal buffer size. The client must select its offered 3206 value carefully, so as to avoid overburdening the server, and vice 3207 versa. The benefit of an appropriate padding value is higher 3208 performance. 3210 Sender gather: 3211 |RPC Request|Pad bytes|Length| -> |User data...| 3212 \------+----------------------/ \ 3213 \ \ 3214 \ Receiver scatter: \-----------+- ... 3215 /-----+----------------\ \ \ 3216 |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->... 3218 In the above case, the server may recycle unused buffers to the next 3219 posted receive if unused by the actual received request, or may pass 3220 the now-complete buffers by reference for normal write processing. 3221 For a server that can make use of it, this removes any need for data 3222 copies of incoming data, without resorting to complicated end-to-end 3223 buffer advertisement and management. This includes most kernel-based 3224 and integrated server designs, among many others. The client may 3225 perform similar optimizations, if desired. 3227 2.10.7.4. Dual RDMA and Non-RDMA Transports 3229 Some RDMA transports (e.g., RFC 5040 [8]) permit a "streaming" (non- 3230 RDMA) phase, where ordinary traffic might flow before "stepping up" 3231 to RDMA mode, commencing RDMA traffic. Some RDMA transports start 3232 connections always in RDMA mode. NFSv4.1 allows, but does not 3233 assume, a streaming phase before RDMA mode. When a connection is 3234 associated with a session, the client and server negotiate whether 3235 the connection is used in RDMA or non-RDMA mode (see Sections 18.36 3236 and 18.34). 3238 2.10.8. Session Security 3239 2.10.8.1. Session Callback Security 3241 Via session/connection association, NFSv4.1 improves security over 3242 that provided by NFSv4.0 for the backchannel. The connection is 3243 client-initiated (see Section 18.34) and subject to the same firewall 3244 and routing checks as the fore channel. At the client's option (see 3245 Section 18.35), connection association is fully authenticated before 3246 being activated (see Section 18.34). Traffic from the server over 3247 the backchannel is authenticated exactly as the client specifies (see 3248 Section 2.10.8.2). 3250 2.10.8.2. Backchannel RPC Security 3252 When the NFSv4.1 client establishes the backchannel, it informs the 3253 server of the security flavors and principals to use when sending 3254 requests. If the security flavor is RPCSEC_GSS, the client expresses 3255 the principal in the form of an established RPCSEC_GSS context. 
The 3256 server is free to use any of the flavor/principal combinations the 3257 client offers, but it MUST NOT use unoffered combinations. This way, 3258 the client need not provide a target GSS principal for the 3259 backchannel as it did with NFSv4.0, nor does the server have to 3260 implement an RPCSEC_GSS initiator as it did with NFSv4.0 [37]. 3262 The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL 3263 (Section 18.33) operations allow the client to specify flavor/ 3264 principal combinations. 3266 Also note that the SP4_SSV state protection mode (see Sections 18.35 3267 and 2.10.8.3) has the side benefit of providing SSV-derived 3268 RPCSEC_GSS contexts (Section 2.10.9). 3270 2.10.8.3. Protection from Unauthorized State Changes 3272 As described to this point in the specification, the state model of 3273 NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation 3274 with a forged session ID and with a slot ID that it expects the 3275 legitimate client to use next. When the legitimate client uses the 3276 slot ID with the same sequence number, the server returns the 3277 attacker's result from the reply cache, which disrupts the legitimate 3278 client and thus denies service to it. Similarly, an attacker could 3279 send a CREATE_SESSION with a forged client ID to create a new session 3280 associated with the client ID. The attacker could send requests 3281 using the new session that change locking state, such as LOCKU 3282 operations to release locks the legitimate client has acquired. 3283 Setting a security policy on the file that requires RPCSEC_GSS 3284 credentials when manipulating the file's state is one potential work 3285 around, but has the disadvantage of preventing a legitimate client 3286 from releasing state when RPCSEC_GSS is required to do so, but a GSS 3287 context cannot be obtained (possibly because the user has logged off 3288 the client). 3290 NFSv4.1 provides three options to a client for state protection, 3291 which are specified when a client creates a client ID via EXCHANGE_ID 3292 (Section 18.35). 3294 The first (SP4_NONE) is to simply waive state protection. 3296 The other two options (SP4_MACH_CRED and SP4_SSV) share several 3297 traits: 3299 * An RPCSEC_GSS-based credential is used to authenticate client ID 3300 and session maintenance operations, including creating and 3301 destroying a session, associating a connection with the session, 3302 and destroying the client ID. 3304 * Because RPCSEC_GSS is used to authenticate client ID and session 3305 maintenance, the attacker cannot associate a rogue connection with 3306 a legitimate session, or associate a rogue session with a 3307 legitimate client ID in order to maliciously alter the client ID's 3308 lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc. 3310 * In cases where the server's security policies on a portion of its 3311 namespace require RPCSEC_GSS authentication, a client may have to 3312 use an RPCSEC_GSS credential to remove per-file state (e.g., 3313 LOCKU, CLOSE, etc.). The server may require that the principal 3314 that removes the state match certain criteria (e.g., the principal 3315 might have to be the same as the one that acquired the state). 3316 However, the client might not have an RPCSEC_GSS context for such 3317 a principal, and might not be able to create such a context 3318 (perhaps because the user has logged off). 
When the client 3319 establishes SP4_MACH_CRED or SP4_SSV protection, it can specify a 3320 list of operations that the server MUST allow using the machine 3321 credential (if SP4_MACH_CRED is used) or the SSV credential (if 3322 SP4_SSV is used). 3324 The SP4_MACH_CRED state protection option uses a machine credential 3325 where the principal that creates the client ID MUST also be the 3326 principal that performs client ID and session maintenance operations. 3327 The security of the machine credential state protection approach 3328 depends entirely on safeguarding the per-machine credential. 3329 Assuming a proper safeguard using the per-machine credential for 3330 operations like CREATE_SESSION, BIND_CONN_TO_SESSION, 3331 DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from 3332 associating a rogue connection with a session, or associating a rogue 3333 session with a client ID. 3335 There are at least three scenarios for the SP4_MACH_CRED option: 3337 1. The system administrator configures a unique, permanent per- 3338 machine credential for one of the mandated GSS mechanisms (e.g., 3339 if Kerberos V5 is used, a "keytab" containing a principal derived 3340 from a client host name could be used). 3342 2. The client is used by a single user, and so the client ID and its 3343 sessions are used by just that user. If the user's credential 3344 expires, then session and client ID maintenance cannot occur, but 3345 since the client has a single user, only that user is 3346 inconvenienced. 3348 3. The physical client has multiple users, but the client 3349 implementation has a unique client ID for each user. This is 3350 effectively the same as the second scenario, but a disadvantage 3351 is that each user needs to be allocated at least one session 3352 each, so the approach suffers from lack of economy. 3354 The SP4_SSV protection option uses the SSV (Section 1.7), via 3355 RPCSEC_GSS and the SSV GSS mechanism (Section 2.10.9), to protect 3356 state from attack. The SP4_SSV protection option is intended for the 3357 situation comprised of a client that has multiple active users and a 3358 system administrator who wants to avoid the burden of installing a 3359 permanent machine credential on each client. The SSV is established 3360 and updated on the server via SET_SSV (see Section 18.47). To 3361 prevent eavesdropping, a client SHOULD send SET_SSV via RPCSEC_GSS 3362 with the privacy service. Several aspects of the SSV make it 3363 intractable for an attacker to guess the SSV, and thus associate 3364 rogue connections with a session, and rogue sessions with a client 3365 ID: 3367 * The arguments to and results of SET_SSV include digests of the old 3368 and new SSV, respectively. 3370 * Because the initial value of the SSV is zero, therefore known, the 3371 client that opts for SP4_SSV protection and opts to apply SP4_SSV 3372 protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST send at 3373 least one SET_SSV operation before the first BIND_CONN_TO_SESSION 3374 operation or before the second CREATE_SESSION operation on a 3375 client ID. If it does not, the SSV mechanism will not generate 3376 tokens (Section 2.10.9). A client SHOULD send SET_SSV as soon as 3377 a session is created. 3379 * A SET_SSV request does not replace the SSV with the argument to 3380 SET_SSV. Instead, the current SSV on the server is logically 3381 exclusive ORed (XORed) with the argument to SET_SSV. 
Each time a 3382 new principal uses a client ID for the first time, the client 3383 SHOULD send a SET_SSV with that principal's RPCSEC_GSS 3384 credentials, with RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY. 3386 Here are the types of attacks that can be attempted by an attacker 3387 named Eve on a victim named Bob, and how SP4_SSV protection foils 3388 each attack: 3390 * Suppose Eve is the first user to log into a legitimate client. 3391 Eve's use of an NFSv4.1 file system will cause the legitimate 3392 client to create a client ID with SP4_SSV protection, specifying 3393 that the BIND_CONN_TO_SESSION operation MUST use the SSV 3394 credential. Eve's use of the file system also causes an SSV to be 3395 created. The SET_SSV operation that creates the SSV will be 3396 protected by the RPCSEC_GSS context created by the legitimate 3397 client, which uses Eve's GSS principal and credentials. Eve can 3398 eavesdrop on the network while her RPCSEC_GSS context is created 3399 and the SET_SSV using her context is sent. Even if the legitimate 3400 client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve 3401 knows her own credentials, she can decrypt the SSV. Eve can 3402 compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will 3403 accept, and so associate a new connection with the legitimate 3404 session. Eve can change the slot ID and sequence state of a 3405 legitimate session, and/or the SSV state, in such a way that when 3406 Bob accesses the server via the same legitimate client, the 3407 legitimate client will be unable to use the session. 3409 The client's only recourse is to create a new client ID for Bob to 3410 use, and establish a new SSV for the client ID. The client will 3411 be unable to delete the old client ID, and will let the lease on 3412 the old client ID expire. 3414 Once the legitimate client establishes an SSV over the new session 3415 using Bob's RPCSEC_GSS context, Eve can use the new session via 3416 the legitimate client, but she cannot disrupt Bob. Moreover, 3417 because the client SHOULD have modified the SSV due to Eve using 3418 the new session, Bob cannot get revenge on Eve by associating a 3419 rogue connection with the session. 3421 The question is how did the legitimate client detect that Eve has 3422 hijacked the old session? When the client detects that a new 3423 principal, Bob, wants to use the session, it SHOULD have sent a 3424 SET_SSV, which leads to the following sub-scenarios: 3426 - Let us suppose that from the rogue connection, Eve sent a 3427 SET_SSV with the same slot ID and sequence ID that the 3428 legitimate client later uses. The server will assume the 3429 SET_SSV sent with Bob's credentials is a retry, and return to 3430 the legitimate client the reply it sent Eve. However, unless 3431 Eve can correctly guess the SSV the legitimate client will use, 3432 the digest verification checks in the SET_SSV response will 3433 fail. That is an indication to the client that the session has 3434 apparently been hijacked. 3436 - Alternatively, Eve sent a SET_SSV with a different slot ID than 3437 the legitimate client uses for its SET_SSV. Then the digest 3438 verification of the SET_SSV sent with Bob's credentials fails 3439 on the server, and the error returned to the client makes it 3440 apparent that the session has been hijacked. 3442 - Alternatively, Eve sent an operation other than SET_SSV, but 3443 with the same slot ID and sequence that the legitimate client 3444 uses for its SET_SSV. 
The server returns to the legitimate 3445 client the response it sent Eve. The client sees that the 3446 response is not at all what it expects. The client assumes 3447 either session hijacking or a server bug, and either way 3448 destroys the old session. 3450 * Eve associates a rogue connection with the session as above, and 3451 then destroys the session. Again, Bob goes to use the server from 3452 the legitimate client, which sends a SET_SSV using Bob's 3453 credentials. The client receives an error that indicates that the 3454 session does not exist. When the client tries to create a new 3455 session, this will fail because the SSV it has does not match that 3456 which the server has, and now the client knows the session was 3457 hijacked. The legitimate client establishes a new client ID. 3459 * If Eve creates a connection before the legitimate client 3460 establishes an SSV, because the initial value of the SSV is zero 3461 and therefore known, Eve can send a SET_SSV that will pass the 3462 digest verification check. However, because the new connection 3463 has not been associated with the session, the SET_SSV is rejected 3464 for that reason. 3466 In summary, an attacker's disruption of state when SP4_SSV protection 3467 is in use is limited to the formative period of a client ID, its 3468 first session, and the establishment of the SSV. Once a non- 3469 malicious user uses the client ID, the client quickly detects any 3470 hijack and rectifies the situation. Once a non-malicious user 3471 successfully modifies the SSV, the attacker cannot use NFSv4.1 3472 operations to disrupt the non-malicious user. 3474 Note that neither the SP4_MACH_CRED nor SP4_SSV protection approaches 3475 prevent hijacking of a transport connection that has previously been 3476 associated with a session. If the goal of a counter-threat strategy 3477 is to prevent connection hijacking, the use of IPsec is RECOMMENDED. 3479 If a connection hijack occurs, the hijacker could in theory change 3480 locking state and negatively impact the service to legitimate 3481 clients. However, if the server is configured to require the use of 3482 RPCSEC_GSS with integrity or privacy on the affected file objects, 3483 and if EXCHGID4_FLAG_BIND_PRINC_STATEID capability (Section 18.35) is 3484 in force, this will thwart unauthorized attempts to change locking 3485 state. 3487 2.10.9. The Secret State Verifier (SSV) GSS Mechanism 3489 The SSV provides the secret key for a GSS mechanism internal to 3490 NFSv4.1 that NFSv4.1 uses for state protection. Contexts for this 3491 mechanism are not established via the RPCSEC_GSS protocol. Instead, 3492 the contexts are automatically created when EXCHANGE_ID specifies 3493 SP4_SSV protection. The only tokens defined are the PerMsgToken 3494 (emitted by GSS_GetMIC) and the SealedMessage token (emitted by 3495 GSS_Wrap). 3497 The mechanism OID for the SSV mechanism is 3498 iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech 3499 (1.3.6.1.4.1.28882.1.1). While the SSV mechanism does not define any 3500 initial context tokens, the OID can be used to let servers indicate 3501 that the SSV mechanism is acceptable whenever the client sends a 3502 SECINFO or SECINFO_NO_NAME operation (see Section 2.6). 3504 The SSV mechanism defines four subkeys derived from the SSV value. 3505 Each time SET_SSV is invoked, the subkeys are recalculated by the 3506 client and server. 
The calculation of each of the four subkeys 3507 depends on each of the four respective ssv_subkey4 enumerated values. 3508 The calculation uses the HMAC [52] algorithm, using the current SSV 3509 as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID, 3510 and the input text as represented by the XDR encoded enumeration 3511 value for that subkey of data type ssv_subkey4. If the length of the 3512 output of the HMAC algorithm exceeds the length of key of the 3513 encryption algorithm (which is also negotiated by EXCHANGE_ID), then 3514 the subkey MUST be truncated from the HMAC output, i.e., if the 3515 subkey is of N bytes long, then the first N bytes of the HMAC output 3516 MUST be used for the subkey. The specification of EXCHANGE_ID states 3517 that the length of the output of the HMAC algorithm MUST NOT be less 3518 than the length of subkey needed for the encryption algorithm (see 3519 Section 18.35). 3521 /* Input for computing subkeys */ 3522 enum ssv_subkey4 { 3523 SSV4_SUBKEY_MIC_I2T = 1, 3524 SSV4_SUBKEY_MIC_T2I = 2, 3525 SSV4_SUBKEY_SEAL_I2T = 3, 3526 SSV4_SUBKEY_SEAL_T2I = 4 3527 }; 3529 The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating 3530 message integrity codes (MICs) that originate from the NFSv4.1 3531 client, whether as part of a request over the fore channel or a 3532 response over the backchannel. The subkey derived from 3533 SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1 3534 server. The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for 3535 encryption text originating from the NFSv4.1 client, and the subkey 3536 derived from SSV4_SUBKEY_SEAL_T2I is used for encryption text 3537 originating from the NFSv4.1 server. 3539 The PerMsgToken description is based on an XDR definition: 3541 /* Input for computing smt_hmac */ 3542 struct ssv_mic_plain_tkn4 { 3543 uint32_t smpt_ssv_seq; 3544 opaque smpt_orig_plain<>; 3545 }; 3547 /* SSV GSS PerMsgToken token */ 3548 struct ssv_mic_tkn4 { 3549 uint32_t smt_ssv_seq; 3550 opaque smt_hmac<>; 3551 }; 3553 The field smt_hmac is an HMAC calculated by using the subkey derived 3554 from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the one- 3555 way hash algorithm as negotiated by EXCHANGE_ID, and the input text 3556 as represented by data of type ssv_mic_plain_tkn4. The field 3557 smpt_ssv_seq is the same as smt_ssv_seq. The field smpt_orig_plain 3558 is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of 3559 [7]). The caller of GSS_GetMIC() provides a pointer to a buffer 3560 containing the plain text. The SSV mechanism's entry point for 3561 GSS_GetMIC() encodes this into an opaque array, and the encoding will 3562 include an initial four-byte length, plus any necessary padding. 3563 Prepended to this will be the XDR encoded value of smpt_ssv_seq, thus 3564 making up an XDR encoding of a value of data type ssv_mic_plain_tkn4, 3565 which in turn is the input into the HMAC. 3567 The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type 3568 ssv_mic_tkn4. The field smt_ssv_seq comes from the SSV sequence 3569 number, which is equal to one after SET_SSV (Section 18.47) is called 3570 the first time on a client ID. Thereafter, the SSV sequence number 3571 is incremented on each SET_SSV. Thus, smt_ssv_seq represents the 3572 version of the SSV at the time GSS_GetMIC() was called. As noted in 3573 Section 18.35, the client and server can maintain multiple concurrent 3574 versions of the SSV. 
This allows the SSV to be changed without
3575 serializing all RPC calls that use the SSV mechanism with SET_SSV
3576 operations. Once the HMAC is calculated, it is XDR encoded into
3577 smt_hmac, which will include an initial four-byte length, and any
3578 necessary padding. Prepended to this will be the XDR encoded value
3579 of smt_ssv_seq.

3581 The SealedMessage description is based on an XDR definition:

3583 /* Input for computing ssct_encr_data and ssct_hmac */
3584 struct ssv_seal_plain_tkn4 {
3585 opaque sspt_confounder<>;
3586 uint32_t sspt_ssv_seq;
3587 opaque sspt_orig_plain<>;
3588 opaque sspt_pad<>;
3589 };

3591 /* SSV GSS SealedMessage token */
3592 struct ssv_seal_cipher_tkn4 {
3593 uint32_t ssct_ssv_seq;
3594 opaque ssct_iv<>;
3595 opaque ssct_encr_data<>;
3596 opaque ssct_hmac<>;
3597 };

3599 The token emitted by GSS_Wrap() is XDR encoded and of XDR data type
3600 ssv_seal_cipher_tkn4.

3602 The ssct_ssv_seq field has the same meaning as smt_ssv_seq.

3604 The ssct_encr_data field is the result of encrypting a value of the
3605 XDR encoded data type ssv_seal_plain_tkn4. The encryption key is the
3606 subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and
3607 the encryption algorithm is that negotiated by EXCHANGE_ID.

3609 The ssct_iv field is the initialization vector (IV) for the
3610 encryption algorithm (if applicable) and is sent in clear text. The
3611 content and size of the IV MUST comply with the specification of the
3612 encryption algorithm. For example, the id-aes256-CBC algorithm MUST
3613 use a 16-byte initialization vector (IV), which MUST be unpredictable
3614 for each instance of a value of data type ssv_seal_plain_tkn4 that is
3615 encrypted with a particular SSV key.

3617 The ssct_hmac field is the result of computing an HMAC using the
3618 value of the XDR encoded data type ssv_seal_plain_tkn4 as the input
3619 text. The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or
3620 SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that
3621 negotiated by EXCHANGE_ID.

3623 The sspt_confounder field is a random value.

3625 The sspt_ssv_seq field is the same as ssct_ssv_seq.

3627 The sspt_orig_plain field is the original plaintext and is the
3628 "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of
3629 [7]). As with the handling of the plaintext by the SSV mechanism's
3630 GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a
3631 pointer to the plaintext, and will XDR encode an opaque array into
3632 sspt_orig_plain representing the plain text, along with the other
3633 fields of an instance of data type ssv_seal_plain_tkn4.

3635 The sspt_pad field is present to support encryption algorithms that
3636 require inputs to be in fixed-sized blocks. The content of sspt_pad
3637 is zero filled except for the length. Beware that the XDR encoding
3638 of ssv_seal_plain_tkn4 contains three variable-length arrays, and so
3639 each array consumes four bytes for an array length, and each array
3640 that follows the length is always padded to a multiple of four bytes
3641 per the XDR standard.

3643 For example, suppose the encryption algorithm uses 16-byte blocks,
3644 the sspt_confounder is three bytes long, and the sspt_orig_plain
3645 field is 15 bytes long.
The XDR encoding of sspt_confounder uses 3646 eight bytes (4 + 3 + 1-byte pad), the XDR encoding of sspt_ssv_seq 3647 uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes (4 3648 + 15 + 1-byte pad), and the smallest XDR encoding of the sspt_pad 3649 field is four bytes. This totals 36 bytes. The next multiple of 16 3650 is 48; thus, the length field of sspt_pad needs to be set to 12 3651 bytes, or a total encoding of 16 bytes. The total number of XDR 3652 encoded bytes is thus 8 + 4 + 20 + 16 = 48. 3654 GSS_Wrap() emits a token that is an XDR encoding of a value of data 3655 type ssv_seal_cipher_tkn4. Note that regardless of whether or not 3656 the caller of GSS_Wrap() requests confidentiality, the token always 3657 has confidentiality. This is because the SSV mechanism is for 3658 RPCSEC_GSS, and RPCSEC_GSS never produces GSS_wrap() tokens without 3659 confidentiality. 3661 There is one SSV per client ID. There is a single GSS context for a 3662 client ID / SSV pair. All SSV mechanism RPCSEC_GSS handles of a 3663 client ID / SSV pair share the same GSS context. SSV GSS contexts do 3664 not expire except when the SSV is destroyed (causes would include the 3665 client ID being destroyed or a server restart). Since one purpose of 3666 context expiration is to replace keys that have been in use for "too 3667 long", hence vulnerable to compromise by brute force or accident, the 3668 client can replace the SSV key by sending periodic SET_SSV 3669 operations, which is done by cycling through different users' 3670 RPCSEC_GSS credentials. This way, the SSV is replaced without 3671 destroying the SSV's GSS contexts. 3673 SSV RPCSEC_GSS handles can be expired or deleted by the server at any 3674 time, and the EXCHANGE_ID operation can be used to create more SSV 3675 RPCSEC_GSS handles. Expiration of SSV RPCSEC_GSS handles does not 3676 imply that the SSV or its GSS context has expired. 3678 The client MUST establish an SSV via SET_SSV before the SSV GSS 3679 context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC(). 3680 If SET_SSV has not been successfully called, attempts to emit tokens 3681 MUST fail. 3683 The SSV mechanism does not support replay detection and sequencing in 3684 its tokens because RPCSEC_GSS does not use those features (see 3685 "Context Creation Requests", Section 5.2.2 of [4]). However, 3686 Section 2.10.10 discusses special considerations for the SSV 3687 mechanism when used with RPCSEC_GSS. 3689 2.10.10. Security Considerations for RPCSEC_GSS When Using the SSV 3690 Mechanism 3692 When a client ID is created with SP4_SSV state protection (see 3693 Section 18.35), the client is permitted to associate multiple 3694 RPCSEC_GSS handles with the single SSV GSS context (see 3695 Section 2.10.9). Because of the way RPCSEC_GSS (both version 1 and 3696 version 2, see [4] and [9]) calculate the verifier of the reply, 3697 special care must be taken by the implementation of the NFSv4.1 3698 client to prevent attacks by a man-in-the-middle. The verifier of an 3699 RPCSEC_GSS reply is the output of GSS_GetMIC() applied to the input 3700 value of the seq_num field of the RPCSEC_GSS credential (data type 3701 rpc_gss_cred_ver_1_t) (see Section 5.3.3.2 of [4]). If multiple 3702 RPCSEC_GSS handles share the same GSS context, then if one handle is 3703 used to send a request with the same seq_num value as another handle, 3704 an attacker could block the reply, and replace it with the verifier 3705 used for the other handle. 
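As a non-normative illustration, the following sketch (in Python, assuming SHA-256 as the one-way hash negotiated by EXCHANGE_ID and a placeholder value standing in for the derived MIC subkey) shows why such verifiers can collide: the PerMsgToken HMAC described in Section 2.10.9 is computed only over the SSV sequence number and the message (for a reply verifier, the seq_num of the request), so nothing in the computation binds the result to a particular RPCSEC_GSS handle.

   import hashlib, hmac, struct

   def xdr_uint32(n):
       # XDR encodes an unsigned 32-bit integer as 4 bytes, big-endian.
       return struct.pack(">I", n)

   def xdr_opaque(data):
       # Variable-length opaque: 4-byte length, data, zero-padded to a
       # multiple of 4 bytes.
       return xdr_uint32(len(data)) + data + b"\x00" * ((-len(data)) % 4)

   def ssv_get_mic(mic_subkey, ssv_seq, message):
       # HMAC over the XDR-encoded ssv_mic_plain_tkn4
       # (smpt_ssv_seq followed by smpt_orig_plain).
       plain = xdr_uint32(ssv_seq) + xdr_opaque(message)
       return hmac.new(mic_subkey, plain, hashlib.sha256).digest()

   # Placeholder for the subkey derived from SSV4_SUBKEY_MIC_T2I; a real
   # implementation derives it from the current SSV as in Section 2.10.9.
   subkey_t2i = b"\x00" * 32

   # Replies to two requests sent on different SSV RPCSEC_GSS handles,
   # but carrying the same RPCSEC_GSS seq_num, have identical verifiers.
   verifier_a = ssv_get_mic(subkey_t2i, 1, xdr_uint32(42))
   verifier_b = ssv_get_mic(subkey_t2i, 1, xdr_uint32(42))
   assert verifier_a == verifier_b

An attacker who observes the reply on one handle can therefore splice its verifier onto a forged reply for the other handle, which is the attack described above.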
3707 There are multiple ways to prevent the attack on the SSV RPCSEC_GSS
3708 verifier in the reply. The simplest is believed to be as follows.

3710 * Each time one or more new SSV RPCSEC_GSS handles are created via
3711 EXCHANGE_ID, the client SHOULD send a SET_SSV operation to modify
3712 the SSV. By changing the SSV, the new handles will not result in
3713 the re-use of an SSV RPCSEC_GSS verifier in a reply.

3715 * When a requester decides to use N SSV RPCSEC_GSS handles, it
3716 SHOULD assign a unique and non-overlapping range of seq_nums to
3717 each SSV RPCSEC_GSS handle. The size of each range SHOULD be
3718 equal to MAXSEQ / N (see Section 5 of [4] for the definition of
3719 MAXSEQ). When an SSV RPCSEC_GSS handle reaches its maximum, it
3720 SHOULD force the replier to destroy the handle by sending a NULL
3721 RPC request with seq_num set to MAXSEQ + 1 (see Section 5.3.3.3 of
3722 [4]).

3724 * When the requester wants to increase or decrease N, it SHOULD
3725 force the replier to destroy all N handles by sending a NULL RPC
3726 request on each handle with seq_num set to MAXSEQ + 1. If the
3727 requester is the client, it SHOULD send a SET_SSV operation before
3728 using new handles. If the requester is the server, then the
3729 client SHOULD send a SET_SSV operation when it detects that the
3730 server has forced it to destroy a backchannel's SSV RPCSEC_GSS
3731 handle. By sending a SET_SSV operation, the SSV will change, and
3732 so the attacker will be unable to successfully replay a
3733 previous verifier in a reply to the requester.

3735 Note that if the replier carefully creates the SSV RPCSEC_GSS
3736 handles, the related risk of a man-in-the-middle splicing a forged
3737 SSV RPCSEC_GSS credential with a verifier for another handle does not
3738 exist. This is because the verifier in an RPCSEC_GSS request is
3739 computed from input that includes both the RPCSEC_GSS handle and
3740 seq_num (see Section 5.3.1 of [4]). Provided the replier takes care
3741 to avoid re-using the value of an RPCSEC_GSS handle that it creates,
3742 such as by including a generation number in the handle, the man-in-
3743 the-middle will not be able to successfully replay a previous
3744 verifier in the request to a replier.

3746 2.10.11. Session Mechanics - Steady State

3748 2.10.11.1. Obligations of the Server

3750 The server has the primary obligation to monitor the state of
3751 backchannel resources that the client has created for the server
3752 (RPCSEC_GSS contexts and backchannel connections). If these
3753 resources vanish, the server takes action as specified in
3754 Section 2.10.13.2.

3756 2.10.11.2. Obligations of the Client

3758 The client SHOULD honor the following obligations in order to utilize
3759 the session:

3761 * Keep a necessary session from going idle on the server. A client
3762 that requires a session but nonetheless is not sending operations
3763 risks having the session be destroyed by the server. This is
3764 because sessions consume resources, and resource limitations may
3765 force the server to cull an inactive session. A server MAY
3766 consider a session to be inactive if the client has not used the
3767 session before the session inactivity timer (Section 2.10.12) has
3768 expired.

3770 * Destroy the session when not needed. If a client has multiple
3771 sessions, one of which has no requests waiting for replies, and
3772 has been idle for some period of time, it SHOULD destroy the
3773 session.
3775 * Maintain GSS contexts and RPCSEC_GSS handles for the backchannel. 3776 If the client requires the server to use the RPCSEC_GSS security 3777 flavor for callbacks, then it needs to be sure the RPCSEC_GSS 3778 handles and/or their GSS contexts that are handed to the server 3779 via BACKCHANNEL_CTL or CREATE_SESSION are unexpired. 3781 * Preserve a connection for a backchannel. The server requires a 3782 backchannel in order to gracefully recall recallable state or 3783 notify the client of certain events. Note that if the connection 3784 is not being used for the fore channel, there is no way for the 3785 client to tell if the connection is still alive (e.g., the server 3786 restarted without sending a disconnect). The onus is on the 3787 server, not the client, to determine if the backchannel's 3788 connection is alive, and to indicate in the response to a SEQUENCE 3789 operation when the last connection associated with a session's 3790 backchannel has disconnected. 3792 2.10.11.3. Steps the Client Takes to Establish a Session 3794 If the client does not have a client ID, the client sends EXCHANGE_ID 3795 to establish a client ID. If it opts for SP4_MACH_CRED or SP4_SSV 3796 protection, in the spo_must_enforce list of operations, it SHOULD at 3797 minimum specify CREATE_SESSION, DESTROY_SESSION, 3798 BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID. If it 3799 opts for SP4_SSV protection, the client needs to ask for SSV-based 3800 RPCSEC_GSS handles. 3802 The client uses the client ID to send a CREATE_SESSION on a 3803 connection to the server. The results of CREATE_SESSION indicate 3804 whether or not the server will persist the session reply cache 3805 through a server that has restarted, and the client notes this for 3806 future reference. 3808 If the client specified SP4_SSV state protection when the client ID 3809 was created, then it SHOULD send SET_SSV in the first COMPOUND after 3810 the session is created. Each time a new principal goes to use the 3811 client ID, it SHOULD send a SET_SSV again. 3813 If the client wants to use delegations, layouts, directory 3814 notifications, or any other state that requires a backchannel, then 3815 it needs to add a connection to the backchannel if CREATE_SESSION did 3816 not already do so. The client creates a connection, and calls 3817 BIND_CONN_TO_SESSION to associate the connection with the session and 3818 the session's backchannel. If CREATE_SESSION did not already do so, 3819 the client MUST tell the server what security is required in order 3820 for the client to accept callbacks. The client does this via 3821 BACKCHANNEL_CTL. If the client selected SP4_MACH_CRED or SP4_SSV 3822 protection when it called EXCHANGE_ID, then the client SHOULD specify 3823 that the backchannel use RPCSEC_GSS contexts for security. 3825 If the client wants to use additional connections for the 3826 backchannel, then it needs to call BIND_CONN_TO_SESSION on each 3827 connection it wants to use with the session. If the client wants to 3828 use additional connections for the fore channel, then it needs to 3829 call BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED 3830 state protection when the client ID was created. 3832 At this point, the session has reached steady state. 3834 2.10.12. Session Inactivity Timer 3836 The server MAY maintain a session inactivity timer for each session. 3837 If the session inactivity timer expires, then the server MAY destroy 3838 the session. 
To avoid losing a session due to inactivity, the client 3839 MUST renew the session inactivity timer. The length of session 3840 inactivity timer MUST NOT be less than the lease_time attribute 3841 (Section 5.8.1.11). As with lease renewal (Section 8.3), when the 3842 server receives a SEQUENCE operation, it resets the session 3843 inactivity timer, and MUST NOT allow the timer to expire while the 3844 rest of the operations in the COMPOUND procedure's request are still 3845 executing. Once the last operation has finished, the server MUST set 3846 the session inactivity timer to expire no sooner than the sum of the 3847 current time and the value of the lease_time attribute. 3849 2.10.13. Session Mechanics - Recovery 3851 2.10.13.1. Events Requiring Client Action 3853 The following events require client action to recover. 3855 2.10.13.1.1. RPCSEC_GSS Context Loss by Callback Path 3857 If all RPCSEC_GSS handles granted by the client to the server for 3858 callback use have expired, the client MUST establish a new handle via 3859 BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE results 3860 indicates when callback handles are nearly expired, or fully expired 3861 (see Section 18.46.3). 3863 2.10.13.1.2. Connection Loss 3865 If the client loses the last connection of the session and wants to 3866 retain the session, then it needs to create a new connection, and if, 3867 when the client ID was created, BIND_CONN_TO_SESSION was specified in 3868 the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION 3869 to associate the connection with the session. 3871 If there was a request outstanding at the time of connection loss, 3872 then if the client wants to continue to use the session, it MUST 3873 retry the request, as described in Section 2.10.6.2. Note that it is 3874 not necessary to retry requests over a connection with the same 3875 source network address or the same destination network address as the 3876 lost connection. As long as the session ID, slot ID, and sequence ID 3877 in the retry match that of the original request, the server will 3878 recognize the request as a retry if it executed the request prior to 3879 disconnect. 3881 If the connection that was lost was the last one associated with the 3882 backchannel, and the client wants to retain the backchannel and/or 3883 prevent revocation of recallable state, the client needs to 3884 reconnect, and if it does, it MUST associate the connection to the 3885 session and backchannel via BIND_CONN_TO_SESSION. The server SHOULD 3886 indicate when it has no callback connection via the sr_status_flags 3887 result from SEQUENCE. 3889 2.10.13.1.3. Backchannel GSS Context Loss 3891 Via the sr_status_flags result of the SEQUENCE operation or other 3892 means, the client will learn if some or all of the RPCSEC_GSS 3893 contexts it assigned to the backchannel have been lost. If the 3894 client wants to retain the backchannel and/or not put recallable 3895 state subject to revocation, the client needs to use BACKCHANNEL_CTL 3896 to assign new contexts. 3898 2.10.13.1.4. Loss of Session 3900 The replier might lose a record of the session. Causes include: 3902 * Replier failure and restart. 3904 * A catastrophe that causes the reply cache to be corrupted or lost 3905 on the media on which it was stored. This applies even if the 3906 replier indicated in the CREATE_SESSION results that it would 3907 persist the cache. 3909 * The server purges the session of a client that has been inactive 3910 for a very extended period of time. 
3912 * As a result of configuration changes among a set of clustered 3913 servers, a network address previously connected to one server 3914 becomes connected to a different server that has no knowledge of 3915 the session in question. Such a configuration change will 3916 generally only happen when the original server ceases to function 3917 for a time. 3919 Loss of reply cache is equivalent to loss of session. The replier 3920 indicates loss of session to the requester by returning 3921 NFS4ERR_BADSESSION on the next operation that uses the session ID 3922 that refers to the lost session. 3924 After an event like a server restart, the client may have lost its 3925 connections. The client assumes for the moment that the session has 3926 not been lost. It reconnects, and if it specified connection 3927 association enforcement when the session was created, it invokes 3928 BIND_CONN_TO_SESSION using the session ID. Otherwise, it invokes 3929 SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns 3930 NFS4ERR_BADSESSION, the client knows the session is not available to 3931 it when communicating with that network address. If the connection 3932 survives session loss, then the next SEQUENCE operation the client 3933 sends over the connection will get back NFS4ERR_BADSESSION. The 3934 client again knows the session was lost. 3936 Here is one suggested algorithm for the client when it gets 3937 NFS4ERR_BADSESSION. It is not obligatory in that, if a client does 3938 not want to take advantage of such features as trunking, it may omit 3939 parts of it. However, it is a useful example that draws attention to 3940 various possible recovery issues: 3942 1. If the client has other connections to other server network 3943 addresses associated with the same session, attempt a COMPOUND 3944 with a single operation, SEQUENCE, on each of the other 3945 connections. 3947 2. If the attempts succeed, the session is still alive, and this is 3948 a strong indicator that the server's network address has moved. 3949 The client might send an EXCHANGE_ID on the connection that 3950 returned NFS4ERR_BADSESSION to see if there are opportunities for 3951 client ID trunking (i.e., the same client ID and so_major_id 3952 value are returned). The client might use DNS to see if the 3953 moved network address was replaced with another, so that the 3954 performance and availability benefits of session trunking can 3955 continue. 3957 3. If the SEQUENCE requests fail with NFS4ERR_BADSESSION, then the 3958 session no longer exists on any of the server network addresses 3959 for which the client has connections associated with that session 3960 ID. It is possible the session is still alive and available on 3961 other network addresses. The client sends an EXCHANGE_ID on all 3962 the connections to see if the server owner is still listening on 3963 those network addresses. If the same server owner is returned 3964 but a new client ID is returned, this is a strong indicator of a 3965 server restart. If both the same server owner and same client ID 3966 are returned, then this is a strong indication that the server 3967 did delete the session, and the client will need to send a 3968 CREATE_SESSION if it has no other sessions for that client ID. 3969 If a different server owner is returned, the client can use DNS 3970 to find other network addresses. 
If it does not, or if DNS does
3971 not find any other addresses for the server, then the client will
3972 be unable to provide NFSv4.1 service, and fatal errors should be
3973 returned to processes that were using the server. If the client
3974 is using a "mount" paradigm, unmounting the server is advised.

3976 4. If the client knows of no other connections associated with the
3977 session ID and server network addresses that are, or have been,
3978 associated with the session ID, then the client can use DNS to
3979 find other network addresses. If it does not, or if DNS does not
3980 find any other addresses for the server, then the client will be
3981 unable to provide NFSv4.1 service, and fatal errors should be
3982 returned to processes that were using the server. If the client
3983 is using a "mount" paradigm, unmounting the server is advised.

3985 If there is a reconfiguration event that results in the same network
3986 address being assigned to servers where the eir_server_scope value is
3987 different, it cannot be guaranteed that a session ID generated by the
3988 first will be recognized as invalid by the second. Therefore, in
3989 managing server reconfigurations among servers with different server
3990 scope values, it is necessary to make sure that all clients have
3991 disconnected from the first server before effecting the
3992 reconfiguration. Nonetheless, clients should not assume that servers
3993 will always adhere to this requirement; clients MUST be prepared to
3994 deal with unexpected effects of server reconfigurations. Even where
3995 a session ID is inappropriately recognized as valid, it is likely
3996 either that the connection will not be recognized as valid or that a
3997 sequence value for a slot will not be correct. Therefore, when a
3998 client receives results indicating such unexpected errors, the use of
3999 EXCHANGE_ID to determine the current server configuration is
4000 RECOMMENDED.

4002 A variation on the above is that after a server's network address
4003 moves, there is no NFSv4.1 server listening, e.g., no listener on
4004 port 2049. In this example, one of the following occurs: the NFSv4
4005 server returns NFS4ERR_MINOR_VERS_MISMATCH, the NFS server returns a
4006 PROG_MISMATCH error, the RPC listener on 2049 returns PROG_UNAVAIL, or
4007 attempts to reconnect to the network address time out. These SHOULD
4008 be treated as equivalent to SEQUENCE returning NFS4ERR_BADSESSION for
4009 these purposes.

4011 When the client detects session loss, it needs to call CREATE_SESSION
4012 to recover. Any non-idempotent operations that were in progress
4013 might have been performed on the server at the time of session loss.
4014 The client has no general way to recover from this.

4016 Note that loss of session does not imply loss of byte-range lock,
4017 open, delegation, or layout state because locks, opens, delegations,
4018 and layouts are tied to the client ID and depend on the client ID,
4019 not the session. Nor does loss of byte-range lock, open, delegation,
4020 or layout state imply loss of session state, because the session
4021 depends on the client ID; loss of client ID, however, does imply loss
4022 of session, byte-range lock, open, delegation, and layout state. See
4023 Section 8.4.2. A session can survive a server restart, but lock
4024 recovery may still be needed.

4026 It is possible that CREATE_SESSION will fail with
4027 NFS4ERR_STALE_CLIENTID (e.g., the server restarts and does not
4028 preserve client ID state).
If so, the client needs to call 4029 EXCHANGE_ID, followed by CREATE_SESSION. 4031 2.10.13.2. Events Requiring Server Action 4033 The following events require server action to recover. 4035 2.10.13.2.1. Client Crash and Restart 4037 As described in Section 18.35, a restarted client sends EXCHANGE_ID 4038 in such a way that it causes the server to delete any sessions it 4039 had. 4041 2.10.13.2.2. Client Crash with No Restart 4043 If a client crashes and never comes back, it will never send 4044 EXCHANGE_ID with its old client owner. Thus, the server has session 4045 state that will never be used again. After an extended period of 4046 time, and if the server has resource constraints, it MAY destroy the 4047 old session as well as locking state. 4049 2.10.13.2.3. Extended Network Partition 4051 To the server, the extended network partition may be no different 4052 from a client crash with no restart (see Section 2.10.13.2.2). 4053 Unless the server can discern that there is a network partition, it 4054 is free to treat the situation as if the client has crashed 4055 permanently. 4057 2.10.13.2.4. Backchannel Connection Loss 4059 If there were callback requests outstanding at the time of a 4060 connection loss, then the server MUST retry the requests, as 4061 described in Section 2.10.6.2. Note that it is not necessary to 4062 retry requests over a connection with the same source network address 4063 or the same destination network address as the lost connection. As 4064 long as the session ID, slot ID, and sequence ID in the retry match 4065 that of the original request, the callback target will recognize the 4066 request as a retry even if it did see the request prior to 4067 disconnect. 4069 If the connection lost is the last one associated with the 4070 backchannel, then the server MUST indicate that in the 4071 sr_status_flags field of every SEQUENCE reply until the backchannel 4072 is re-established. There are two situations, each of which uses 4073 different status flags: no connectivity for the session's backchannel 4074 and no connectivity for any session backchannel of the client. See 4075 Section 18.46 for a description of the appropriate flags in 4076 sr_status_flags. 4078 2.10.13.2.5. GSS Context Loss 4080 The server SHOULD monitor when the number of RPCSEC_GSS handles 4081 assigned to the backchannel reaches one, and when that one handle is 4082 near expiry (i.e., between one and two periods of lease time), and 4083 indicate so in the sr_status_flags field of all SEQUENCE replies. 4084 The server MUST indicate when all of the backchannel's assigned 4085 RPCSEC_GSS handles have expired via the sr_status_flags field of all 4086 SEQUENCE replies. 4088 2.10.14. Parallel NFS and Sessions 4090 A client and server can potentially be a non-pNFS implementation, a 4091 metadata server implementation, a data server implementation, or two 4092 or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS, 4093 EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not 4094 mutually exclusive) are passed in the EXCHANGE_ID arguments and 4095 results to allow the client to indicate how it wants to use sessions 4096 created under the client ID, and to allow the server to indicate how 4097 it will allow the sessions to be used. See Section 13.1 for pNFS 4098 sessions considerations. 4100 3. 
Protocol Constants and Data Types 4102 The syntax and semantics to describe the data types of the NFSv4.1 4103 protocol are defined in the XDR (RFC 4506 [2]) and RPC (RFC 5531 [3]) 4104 documents. The next sections build upon the XDR data types to define 4105 constants, types, and structures specific to this protocol. The full 4106 list of XDR data types is in [10]. 4108 3.1. Basic Constants 4110 const NFS4_FHSIZE = 128; 4111 const NFS4_VERIFIER_SIZE = 8; 4112 const NFS4_OPAQUE_LIMIT = 1024; 4113 const NFS4_SESSIONID_SIZE = 16; 4115 const NFS4_INT64_MAX = 0x7fffffffffffffff; 4116 const NFS4_UINT64_MAX = 0xffffffffffffffff; 4117 const NFS4_INT32_MAX = 0x7fffffff; 4118 const NFS4_UINT32_MAX = 0xffffffff; 4120 const NFS4_MAXFILELEN = 0xffffffffffffffff; 4121 const NFS4_MAXFILEOFF = 0xfffffffffffffffe; 4123 Except where noted, all these constants are defined in bytes. 4125 * NFS4_FHSIZE is the maximum size of a filehandle. 4127 * NFS4_VERIFIER_SIZE is the fixed size of a verifier. 4129 * NFS4_OPAQUE_LIMIT is the maximum size of certain opaque 4130 information. 4132 * NFS4_SESSIONID_SIZE is the fixed size of a session identifier. 4134 * NFS4_INT64_MAX is the maximum value of a signed 64-bit integer. 4136 * NFS4_UINT64_MAX is the maximum value of an unsigned 64-bit 4137 integer. 4139 * NFS4_INT32_MAX is the maximum value of a signed 32-bit integer. 4141 * NFS4_UINT32_MAX is the maximum value of an unsigned 32-bit 4142 integer. 4144 * NFS4_MAXFILELEN is the maximum length of a regular file. 4146 * NFS4_MAXFILEOFF is the maximum offset into a regular file. 4148 3.2. Basic Data Types 4150 These are the base NFSv4.1 data types. 4152 +===============+==============================================+ 4153 | Data Type | Definition | 4154 +===============+==============================================+ 4155 | int32_t | typedef int int32_t; | 4156 +---------------+----------------------------------------------+ 4157 | uint32_t | typedef unsigned int uint32_t; | 4158 +---------------+----------------------------------------------+ 4159 | int64_t | typedef hyper int64_t; | 4160 +---------------+----------------------------------------------+ 4161 | uint64_t | typedef unsigned hyper uint64_t; | 4162 +---------------+----------------------------------------------+ 4163 | attrlist4 | typedef opaque attrlist4<>; | 4164 | | | 4165 | | Used for file/directory attributes. | 4166 +---------------+----------------------------------------------+ 4167 | bitmap4 | typedef uint32_t bitmap4<>; | 4168 | | | 4169 | | Used in attribute array encoding. | 4170 +---------------+----------------------------------------------+ 4171 | changeid4 | typedef uint64_t changeid4; | 4172 | | | 4173 | | Used in the definition of change_info4. | 4174 +---------------+----------------------------------------------+ 4175 | clientid4 | typedef uint64_t clientid4; | 4176 | | | 4177 | | Shorthand reference to client | 4178 | | identification. | 4179 +---------------+----------------------------------------------+ 4180 | count4 | typedef uint32_t count4; | 4181 | | | 4182 | | Various count parameters (READ, WRITE, | 4183 | | COMMIT). | 4184 +---------------+----------------------------------------------+ 4185 | length4 | typedef uint64_t length4; | 4186 | | | 4187 | | The length of a byte-range within a file. | 4188 +---------------+----------------------------------------------+ 4189 | mode4 | typedef uint32_t mode4; | 4190 | | | 4191 | | Mode attribute data type. 
| 4192 +---------------+----------------------------------------------+ 4193 | nfs_cookie4 | typedef uint64_t nfs_cookie4; | 4194 | | | 4195 | | Opaque cookie value for READDIR. | 4196 +---------------+----------------------------------------------+ 4197 | nfs_fh4 | typedef opaque nfs_fh4; | 4198 | | | 4199 | | Filehandle definition. | 4200 +---------------+----------------------------------------------+ 4201 | nfs_ftype4 | enum nfs_ftype4; | 4202 | | | 4203 | | Various defined file types. | 4204 +---------------+----------------------------------------------+ 4205 | nfsstat4 | enum nfsstat4; | 4206 | | | 4207 | | Return value for operations. | 4208 +---------------+----------------------------------------------+ 4209 | offset4 | typedef uint64_t offset4; | 4210 | | | 4211 | | Various offset designations (READ, WRITE, | 4212 | | LOCK, COMMIT). | 4213 +---------------+----------------------------------------------+ 4214 | qop4 | typedef uint32_t qop4; | 4215 | | | 4216 | | Quality of protection designation in | 4217 | | SECINFO. | 4218 +---------------+----------------------------------------------+ 4219 | sec_oid4 | typedef opaque sec_oid4<>; | 4220 | | | 4221 | | Security Object Identifier. The sec_oid4 | 4222 | | data type is not really opaque. Instead, it | 4223 | | contains an ASN.1 OBJECT IDENTIFIER as used | 4224 | | by GSS-API in the mech_type argument to | 4225 | | GSS_Init_sec_context. See [7] for details. | 4226 +---------------+----------------------------------------------+ 4227 | sequenceid4 | typedef uint32_t sequenceid4; | 4228 | | | 4229 | | Sequence number used for various session | 4230 | | operations (EXCHANGE_ID, CREATE_SESSION, | 4231 | | SEQUENCE, CB_SEQUENCE). | 4232 +---------------+----------------------------------------------+ 4233 | seqid4 | typedef uint32_t seqid4; | 4234 | | | 4235 | | Sequence identifier used for locking. | 4236 +---------------+----------------------------------------------+ 4237 | sessionid4 | typedef opaque | 4238 | | sessionid4[NFS4_SESSIONID_SIZE]; | 4239 | | | 4240 | | Session identifier. | 4241 +---------------+----------------------------------------------+ 4242 | slotid4 | typedef uint32_t slotid4; | 4243 | | | 4244 | | Sequencing artifact for various session | 4245 | | operations (SEQUENCE, CB_SEQUENCE). | 4246 +---------------+----------------------------------------------+ 4247 | utf8string | typedef opaque utf8string<>; | 4248 | | | 4249 | | UTF-8 encoding for strings. | 4250 +---------------+----------------------------------------------+ 4251 | utf8str_cis | typedef utf8string utf8str_cis; | 4252 | | | 4253 | | Case-insensitive UTF-8 string. | 4254 +---------------+----------------------------------------------+ 4255 | utf8str_cs | typedef utf8string utf8str_cs; | 4256 | | | 4257 | | Case-sensitive UTF-8 string. | 4258 +---------------+----------------------------------------------+ 4259 | utf8str_mixed | typedef utf8string utf8str_mixed; | 4260 | | | 4261 | | UTF-8 strings with a case-sensitive prefix | 4262 | | and a case-insensitive suffix. | 4263 +---------------+----------------------------------------------+ 4264 | component4 | typedef utf8str_cs component4; | 4265 | | | 4266 | | Represents pathname components. | 4267 +---------------+----------------------------------------------+ 4268 | linktext4 | typedef utf8str_cs linktext4; | 4269 | | | 4270 | | Symbolic link contents ("symbolic link" is | 4271 | | defined in an Open Group [11] standard). 
| 4272 +---------------+----------------------------------------------+ 4273 | pathname4 | typedef component4 pathname4<>; | 4274 | | | 4275 | | Represents pathname for fs_locations. | 4276 +---------------+----------------------------------------------+ 4277 | verifier4 | typedef opaque | 4278 | | verifier4[NFS4_VERIFIER_SIZE]; | 4279 | | | 4280 | | Verifier used for various operations | 4281 | | (COMMIT, CREATE, EXCHANGE_ID, OPEN, READDIR, | 4282 | | WRITE) NFS4_VERIFIER_SIZE is defined as 8. | 4283 +---------------+----------------------------------------------+ 4285 Table 1 4287 End of Base Data Types 4289 3.3. Structured Data Types 4291 3.3.1. nfstime4 4293 struct nfstime4 { 4294 int64_t seconds; 4295 uint32_t nseconds; 4296 }; 4298 The nfstime4 data type gives the number of seconds and nanoseconds 4299 since midnight or zero hour January 1, 1970 Coordinated Universal 4300 Time (UTC). Values greater than zero for the seconds field denote 4301 dates after the zero hour January 1, 1970. Values less than zero for 4302 the seconds field denote dates before the zero hour January 1, 1970. 4303 In both cases, the nseconds field is to be added to the seconds field 4304 for the final time representation. For example, if the time to be 4305 represented is one-half second before zero hour January 1, 1970, the 4306 seconds field would have a value of negative one (-1) and the 4307 nseconds field would have a value of one-half second (500000000). 4308 Values greater than 999,999,999 for nseconds are invalid. 4310 This data type is used to pass time and date information. A server 4311 converts to and from its local representation of time when processing 4312 time values, preserving as much accuracy as possible. If the 4313 precision of timestamps stored for a file system object is less than 4314 defined, loss of precision can occur. An adjunct time maintenance 4315 protocol is RECOMMENDED to reduce client and server time skew. 4317 3.3.2. time_how4 4319 enum time_how4 { 4320 SET_TO_SERVER_TIME4 = 0, 4321 SET_TO_CLIENT_TIME4 = 1 4322 }; 4324 3.3.3. settime4 4326 union settime4 switch (time_how4 set_it) { 4327 case SET_TO_CLIENT_TIME4: 4328 nfstime4 time; 4329 default: 4330 void; 4331 }; 4333 The time_how4 and settime4 data types are used for setting timestamps 4334 in file object attributes. If set_it is SET_TO_SERVER_TIME4, then 4335 the server uses its local representation of time for the time value. 4337 3.3.4. specdata4 4339 struct specdata4 { 4340 uint32_t specdata1; /* major device number */ 4341 uint32_t specdata2; /* minor device number */ 4342 }; 4344 This data type represents the device numbers for the device file 4345 types NF4CHR and NF4BLK. 4347 3.3.5. fsid4 4349 struct fsid4 { 4350 uint64_t major; 4351 uint64_t minor; 4352 }; 4354 3.3.6. change_policy4 4356 struct change_policy4 { 4357 uint64_t cp_major; 4358 uint64_t cp_minor; 4359 }; 4361 The change_policy4 data type is used for the change_policy 4362 RECOMMENDED attribute. It provides change sequencing indication 4363 analogous to the change attribute. To enable the server to present a 4364 value valid across server re-initialization without requiring 4365 persistent storage, two 64-bit quantities are used, allowing one to 4366 be a server instance ID and the second to be incremented non- 4367 persistently, within a given server instance. 4369 3.3.7. fattr4 4371 struct fattr4 { 4372 bitmap4 attrmask; 4373 attrlist4 attr_vals; 4374 }; 4376 The fattr4 data type is used to represent file and directory 4377 attributes. 
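The attrmask field is a bitmap4, whose bit-numbering rule is described in the following paragraph. As a non-normative illustration (in Python, with hypothetical helper names), the sketch below shows how bit n maps into the counted array of 32-bit words:

   def bitmap4_set(words, n):
       # Bit n is found in word (n / 32), at bit position (n mod 32).
       word, bit = divmod(n, 32)
       while len(words) <= word:
           words.append(0)          # grow the counted array as needed
       words[word] |= 1 << bit
       return words

   def bitmap4_is_set(words, n):
       word, bit = divmod(n, 32)
       return word < len(words) and bool(words[word] & (1 << bit))

   # Example: attributes 1 and 33; attribute 33 falls into the second
   # 32-bit word of the attrmask.
   attrmask = bitmap4_set(bitmap4_set([], 1), 33)
   assert attrmask == [0x2, 0x2]
   assert bitmap4_is_set(attrmask, 33) and not bitmap4_is_set(attrmask, 0)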
4379 The bitmap is a counted array of 32-bit integers used to contain bit 4380 values. The position of the integer in the array that contains bit n 4381 can be computed from the expression (n / 32), and its bit within that 4382 integer is (n mod 32). 4384 0 1 4385 +-----------+-----------+-----------+-- 4386 | count | 31 .. 0 | 63 .. 32 | 4387 +-----------+-----------+-----------+-- 4389 3.3.8. change_info4 4391 struct change_info4 { 4392 bool atomic; 4393 changeid4 before; 4394 changeid4 after; 4395 }; 4397 This data type is used with the CREATE, LINK, OPEN, REMOVE, and 4398 RENAME operations to let the client know the value of the change 4399 attribute for the directory in which the target file system object 4400 resides. 4402 3.3.9. netaddr4 4404 struct netaddr4 { 4405 /* see struct rpcb in RFC 1833 */ 4406 string na_r_netid<>; /* network id */ 4407 string na_r_addr<>; /* universal address */ 4408 }; 4410 The netaddr4 data type is used to identify network transport 4411 endpoints. The na_r_netid and na_r_addr fields respectively contain 4412 a netid and uaddr. The netid and uaddr concepts are defined in [12]. 4413 The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are 4414 defined in [12], specifically Tables 2 and 3 and in Sections 5.2.3.3 4415 and 5.2.3.4. 4417 3.3.10. state_owner4 4419 struct state_owner4 { 4420 clientid4 clientid; 4421 opaque owner; 4422 }; 4424 typedef state_owner4 open_owner4; 4425 typedef state_owner4 lock_owner4; 4427 The state_owner4 data type is the base type for the open_owner4 4428 (Section 3.3.10.1) and lock_owner4 (Section 3.3.10.2). 4430 3.3.10.1. open_owner4 4432 This data type is used to identify the owner of OPEN state. 4434 3.3.10.2. lock_owner4 4436 This structure is used to identify the owner of byte-range locking 4437 state. 4439 3.3.11. open_to_lock_owner4 4441 struct open_to_lock_owner4 { 4442 seqid4 open_seqid; 4443 stateid4 open_stateid; 4444 seqid4 lock_seqid; 4445 lock_owner4 lock_owner; 4446 }; 4448 This data type is used for the first LOCK operation done for an 4449 open_owner4. It provides both the open_stateid and lock_owner, such 4450 that the transition is made from a valid open_stateid sequence to 4451 that of the new lock_stateid sequence. Using this mechanism avoids 4452 the confirmation of the lock_owner/lock_seqid pair since it is tied 4453 to established state in the form of the open_stateid/open_seqid. 4455 3.3.12. stateid4 4457 struct stateid4 { 4458 uint32_t seqid; 4459 opaque other[12]; 4460 }; 4462 This data type is used for the various state sharing mechanisms 4463 between the client and server. The client never modifies a value of 4464 data type stateid. The starting value of the "seqid" field is 4465 undefined. The server is required to increment the "seqid" field by 4466 one at each transition of the stateid. This is important since the 4467 client will inspect the seqid in OPEN stateids to determine the order 4468 of OPEN processing done by the server. 4470 3.3.13. layouttype4 4472 enum layouttype4 { 4473 LAYOUT4_NFSV4_1_FILES = 0x1, 4474 LAYOUT4_OSD2_OBJECTS = 0x2, 4475 LAYOUT4_BLOCK_VOLUME = 0x3 4476 }; 4477 This data type indicates what type of layout is being used. The file 4478 server advertises the layout types it supports through the 4479 fs_layout_type file system attribute (Section 5.12.1). A client asks 4480 for layouts of a particular type in LAYOUTGET, and processes those 4481 layouts in its layout-type-specific logic. 4483 The layouttype4 data type is 32 bits in length. 
The range 4484 represented by the layout type is split into three parts. Type 0x0 4485 is reserved. Types within the range 0x00000001-0x7FFFFFFF are 4486 globally unique and are assigned according to the description in 4487 Section 22.5; they are maintained by IANA. Types within the range 4488 0x80000000-0xFFFFFFFF are site specific and for private use only. 4490 The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file 4491 layout type, as defined in Section 13, is to be used. The 4492 LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as 4493 defined in [47], is to be used. Similarly, the LAYOUT4_BLOCK_VOLUME 4494 enumeration specifies that the block/volume layout, as defined in 4495 [48], is to be used. 4497 3.3.14. deviceid4 4499 const NFS4_DEVICEID4_SIZE = 16; 4501 typedef opaque deviceid4[NFS4_DEVICEID4_SIZE]; 4503 Layout information includes device IDs that specify a storage device 4504 through a compact handle. Addressing and type information is 4505 obtained with the GETDEVICEINFO operation. Device IDs are not 4506 guaranteed to be valid across metadata server restarts. A device ID 4507 is unique per client ID and layout type. See Section 12.2.10 for 4508 more details. 4510 3.3.15. device_addr4 4512 struct device_addr4 { 4513 layouttype4 da_layout_type; 4514 opaque da_addr_body<>; 4515 }; 4517 The device address is used to set up a communication channel with the 4518 storage device. Different layout types will require different data 4519 types to define how they communicate with storage devices. The 4520 opaque da_addr_body field is interpreted based on the specified 4521 da_layout_type field. 4523 This document defines the device address for the NFSv4.1 file layout 4524 (see Section 13.3), which identifies a storage device by network IP 4525 address and port number. This is sufficient for the clients to 4526 communicate with the NFSv4.1 storage devices, and may be sufficient 4527 for other layout types as well. Device types for object-based 4528 storage devices and block storage devices (e.g., Small Computer 4529 System Interface (SCSI) volume labels) are defined by their 4530 respective layout specifications. 4532 3.3.16. layout_content4 4534 struct layout_content4 { 4535 layouttype4 loc_type; 4536 opaque loc_body<>; 4537 }; 4539 The loc_body field is interpreted based on the layout type 4540 (loc_type). This document defines the loc_body for the NFSv4.1 file 4541 layout type; see Section 13.3 for its definition. 4543 3.3.17. layout4 4545 struct layout4 { 4546 offset4 lo_offset; 4547 length4 lo_length; 4548 layoutiomode4 lo_iomode; 4549 layout_content4 lo_content; 4550 }; 4552 The layout4 data type defines a layout for a file. The layout type 4553 specific data is opaque within lo_content. Since layouts are sub- 4554 dividable, the offset and length together with the file's filehandle, 4555 the client ID, iomode, and layout type identify the layout. 4557 3.3.18. layoutupdate4 4559 struct layoutupdate4 { 4560 layouttype4 lou_type; 4561 opaque lou_body<>; 4562 }; 4564 The layoutupdate4 data type is used by the client to return updated 4565 layout information to the metadata server via the LAYOUTCOMMIT 4566 (Section 18.42) operation. This data type provides a channel to pass 4567 layout type specific information (in field lou_body) back to the 4568 metadata server. For example, for the block/volume layout type, this 4569 could include the list of reserved blocks that were written. 
The 4570 contents of the opaque lou_body argument are determined by the layout 4571 type. The NFSv4.1 file-based layout does not use this data type; if 4572 lou_type is LAYOUT4_NFSV4_1_FILES, the lou_body field MUST have a 4573 zero length. 4575 3.3.19. layouthint4 4577 struct layouthint4 { 4578 layouttype4 loh_type; 4579 opaque loh_body<>; 4580 }; 4582 The layouthint4 data type is used by the client to pass in a hint 4583 about the type of layout it would like created for a particular file. 4584 It is the data type specified by the layout_hint attribute described 4585 in Section 5.12.4. The metadata server may ignore the hint or may 4586 selectively ignore fields within the hint. This hint should be 4587 provided at create time as part of the initial attributes within 4588 OPEN. The loh_body field is specific to the type of layout 4589 (loh_type). The NFSv4.1 file-based layout uses the 4590 nfsv4_1_file_layouthint4 data type as defined in Section 13.3. 4592 3.3.20. layoutiomode4 4594 enum layoutiomode4 { 4595 LAYOUTIOMODE4_READ = 1, 4596 LAYOUTIOMODE4_RW = 2, 4597 LAYOUTIOMODE4_ANY = 3 4598 }; 4600 The iomode specifies whether the client intends to just read or both 4601 read and write the data represented by the layout. While the 4602 LAYOUTIOMODE4_ANY iomode MUST NOT be used in the arguments to the 4603 LAYOUTGET operation, it MAY be used in the arguments to the 4604 LAYOUTRETURN and CB_LAYOUTRECALL operations. The LAYOUTIOMODE4_ANY 4605 iomode specifies that layouts pertaining to both LAYOUTIOMODE4_READ 4606 and LAYOUTIOMODE4_RW iomodes are being returned or recalled, 4607 respectively. The metadata server's use of the iomode may depend on 4608 the layout type being used. The storage devices MAY validate I/O 4609 accesses against the iomode and reject invalid accesses. 4611 3.3.21. nfs_impl_id4 4613 struct nfs_impl_id4 { 4614 utf8str_cis nii_domain; 4615 utf8str_cs nii_name; 4616 nfstime4 nii_date; 4617 }; 4618 This data type is used to identify client and server implementation 4619 details. The nii_domain field is the DNS domain name with which the 4620 implementor is associated. The nii_name field is the product name of 4621 the implementation and is completely free form. It is RECOMMENDED 4622 that the nii_name be used to distinguish machine architecture, 4623 machine platforms, revisions, versions, and patch levels. The 4624 nii_date field is the timestamp of when the software instance was 4625 published or built. 4627 3.3.22. threshold_item4 4629 struct threshold_item4 { 4630 layouttype4 thi_layout_type; 4631 bitmap4 thi_hintset; 4632 opaque thi_hintlist<>; 4633 }; 4635 This data type contains a list of hints specific to a layout type for 4636 helping the client determine when it should send I/O directly through 4637 the metadata server versus the storage devices. The data type 4638 consists of the layout type (thi_layout_type), a bitmap (thi_hintset) 4639 describing the set of hints supported by the server (they may differ 4640 based on the layout type), and a list of hints (thi_hintlist) whose 4641 content is determined by the hintset bitmap. See the mdsthreshold 4642 attribute for more details. 
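To illustrate how a client might apply such hints, the following non-normative C sketch decides whether a READ should be sent through the metadata server, based on the threshold4_read_size hint defined in Table 2 below.  The read_hints structure and the helper name read_via_mds() are illustrative assumptions rather than protocol elements.

   #include <stdbool.h>
   #include <stdint.h>

   /* Illustrative client-side view of one decoded hint set. */
   struct read_hints {
       bool     have_read_size;    /* was threshold4_read_size given? */
       uint64_t read_size;         /* value of threshold4_read_size   */
   };

   /* Per the hint's definition, reading via the MDS is RECOMMENDED
    * when the file's length is below threshold4_read_size. */
   static bool
   read_via_mds(const struct read_hints *h, uint64_t file_length)
   {
       if (!h->have_read_size)
           return false;           /* no hint: use the layout as usual */
       return file_length < h->read_size;
   }

A corresponding check against threshold4_write_size would govern the write path in the same way.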
4644 The thi_hintset field is a bitmap of the following values: 4646 +=========================+===+=========+===========================+ 4647 | name | # | Data | Description | 4648 | | | Type | | 4649 +=========================+===+=========+===========================+ 4650 | threshold4_read_size | 0 | length4 | If a file's length is | 4651 | | | | less than the value of | 4652 | | | | threshold4_read_size, | 4653 | | | | then it is RECOMMENDED | 4654 | | | | that the client read | 4655 | | | | from the file via the | 4656 | | | | MDS and not a storage | 4657 | | | | device. | 4658 +-------------------------+---+---------+---------------------------+ 4659 | threshold4_write_size | 1 | length4 | If a file's length is | 4660 | | | | less than the value of | 4661 | | | | threshold4_write_size, | 4662 | | | | then it is RECOMMENDED | 4663 | | | | that the client write | 4664 | | | | to the file via the | 4665 | | | | MDS and not a storage | 4666 | | | | device. | 4667 +-------------------------+---+---------+---------------------------+ 4668 | threshold4_read_iosize | 2 | length4 | For read I/O sizes | 4669 | | | | below this threshold, | 4670 | | | | it is RECOMMENDED to | 4671 | | | | read data through the | 4672 | | | | MDS. | 4673 +-------------------------+---+---------+---------------------------+ 4674 | threshold4_write_iosize | 3 | length4 | For write I/O sizes | 4675 | | | | below this threshold, | 4676 | | | | it is RECOMMENDED to | 4677 | | | | write data through the | 4678 | | | | MDS. | 4679 +-------------------------+---+---------+---------------------------+ 4681 Table 2 4683 3.3.23. mdsthreshold4 4685 struct mdsthreshold4 { 4686 threshold_item4 mth_hints<>; 4687 }; 4689 This data type holds an array of elements of data type 4690 threshold_item4, each of which is valid for a particular layout type. 4691 An array is necessary because a server can support multiple layout 4692 types for a single file. 4694 4. Filehandles 4696 The filehandle in the NFS protocol is a per-server unique identifier 4697 for a file system object. The contents of the filehandle are opaque 4698 to the client. Therefore, the server is responsible for translating 4699 the filehandle to an internal representation of the file system 4700 object. 4702 4.1. Obtaining the First Filehandle 4704 The operations of the NFS protocol are defined in terms of one or 4705 more filehandles. Therefore, the client needs a filehandle to 4706 initiate communication with the server. With the NFSv3 protocol (RFC 4707 1813 [38]), there exists an ancillary protocol to obtain this first 4708 filehandle. The MOUNT protocol, RPC program number 100005, provides 4709 the mechanism of translating a string-based file system pathname to a 4710 filehandle, which can then be used by the NFS protocols. 4712 The MOUNT protocol has deficiencies in the area of security and use 4713 via firewalls. This is one reason that the use of the public 4714 filehandle was introduced in RFC 2054 [49] and RFC 2055 [50]. With 4715 the use of the public filehandle in combination with the LOOKUP 4716 operation in the NFSv3 protocol, it has been demonstrated that the 4717 MOUNT protocol is unnecessary for viable interaction between NFS 4718 client and server. 4720 Therefore, the NFSv4.1 protocol will not use an ancillary protocol 4721 for translation from string-based pathnames to a filehandle. Two 4722 special filehandles will be used as starting points for the NFS 4723 client. 4725 4.1.1. 
Root Filehandle 4727 The first of the special filehandles is the ROOT filehandle. The 4728 ROOT filehandle is the "conceptual" root of the file system namespace 4729 at the NFS server. The client uses or starts with the ROOT 4730 filehandle by employing the PUTROOTFH operation. The PUTROOTFH 4731 operation instructs the server to set the "current" filehandle to the 4732 ROOT of the server's file tree. Once this PUTROOTFH operation is 4733 used, the client can then traverse the entirety of the server's file 4734 tree with the LOOKUP operation. A complete discussion of the server 4735 namespace is in Section 7. 4737 4.1.2. Public Filehandle 4739 The second special filehandle is the PUBLIC filehandle. Unlike the 4740 ROOT filehandle, the PUBLIC filehandle may be bound or represent an 4741 arbitrary file system object at the server. The server is 4742 responsible for this binding. It may be that the PUBLIC filehandle 4743 and the ROOT filehandle refer to the same file system object. 4744 However, it is up to the administrative software at the server and 4745 the policies of the server administrator to define the binding of the 4746 PUBLIC filehandle and server file system object. The client may not 4747 make any assumptions about this binding. The client uses the PUBLIC 4748 filehandle via the PUTPUBFH operation. 4750 4.2. Filehandle Types 4752 In the NFSv3 protocol, there was one type of filehandle with a single 4753 set of semantics. This type of filehandle is termed "persistent" in 4754 NFSv4.1. The semantics of a persistent filehandle remain the same as 4755 before. A new type of filehandle introduced in NFSv4.1 is the 4756 "volatile" filehandle, which attempts to accommodate certain server 4757 environments. 4759 The volatile filehandle type was introduced to address server 4760 functionality or implementation issues that make correct 4761 implementation of a persistent filehandle infeasible. Some server 4762 environments do not provide a file-system-level invariant that can be 4763 used to construct a persistent filehandle. The underlying server 4764 file system may not provide the invariant or the server's file system 4765 programming interfaces may not provide access to the needed 4766 invariant. Volatile filehandles may ease the implementation of 4767 server functionality such as hierarchical storage management or file 4768 system reorganization or migration. However, the volatile filehandle 4769 increases the implementation burden for the client. 4771 Since the client will need to handle persistent and volatile 4772 filehandles differently, a file attribute is defined that may be used 4773 by the client to determine the filehandle types being returned by the 4774 server. 4776 4.2.1. General Properties of a Filehandle 4778 The filehandle contains all the information the server needs to 4779 distinguish an individual file. To the client, the filehandle is 4780 opaque. The client stores filehandles for use in a later request and 4781 can compare two filehandles from the same server for equality by 4782 doing a byte-by-byte comparison. However, the client MUST NOT 4783 otherwise interpret the contents of filehandles. If two filehandles 4784 from the same server are equal, they MUST refer to the same file. 4786 Servers SHOULD try to maintain a one-to-one correspondence between 4787 filehandles and files, but this is not required. Clients MUST use 4788 filehandle comparisons only to improve performance, not for correct 4789 behavior. 
All clients need to be prepared for situations in which it 4790 cannot be determined whether two filehandles denote the same object 4791 and in such cases, avoid making invalid assumptions that might cause 4792 incorrect behavior. Further discussion of filehandle and attribute 4793 comparison in the context of data caching is presented in 4794 Section 10.3.4. 4796 As an example, in the case that two different pathnames when 4797 traversed at the server terminate at the same file system object, the 4798 server SHOULD return the same filehandle for each path. This can 4799 occur if a hard link (see [6]) is used to create two file names that 4800 refer to the same underlying file object and associated data. For 4801 example, if paths /a/b/c and /a/d/c refer to the same file, the 4802 server SHOULD return the same filehandle for both pathnames' 4803 traversals. 4805 4.2.2. Persistent Filehandle 4807 A persistent filehandle is defined as having a fixed value for the 4808 lifetime of the file system object to which it refers. Once the 4809 server creates the filehandle for a file system object, the server 4810 MUST accept the same filehandle for the object for the lifetime of 4811 the object. If the server restarts, the NFS server MUST honor the 4812 same filehandle value as it did in the server's previous 4813 instantiation. Similarly, if the file system is migrated, the new 4814 NFS server MUST honor the same filehandle as the old NFS server. 4816 The persistent filehandle will become stale or invalid when the 4817 file system object is removed. When the server is presented with a 4818 persistent filehandle that refers to a deleted object, it MUST return 4819 an error of NFS4ERR_STALE. A filehandle may become stale when the 4820 file system containing the object is no longer available. The file 4821 system may become unavailable if it exists on removable media and the 4822 media is no longer available at the server or the file system in 4823 whole has been destroyed or the file system has simply been removed 4824 from the server's namespace (i.e., unmounted in a UNIX environment). 4826 4.2.3. Volatile Filehandle 4828 A volatile filehandle does not share the same longevity 4829 characteristics of a persistent filehandle. The server may determine 4830 that a volatile filehandle is no longer valid at many different 4831 points in time. If the server can definitively determine that a 4832 volatile filehandle refers to an object that has been removed, the 4833 server should return NFS4ERR_STALE to the client (as is the case for 4834 persistent filehandles). In all other cases where the server 4835 determines that a volatile filehandle can no longer be used, it 4836 should return an error of NFS4ERR_FHEXPIRED. 4838 The REQUIRED attribute "fh_expire_type" is used by the client to 4839 determine what type of filehandle the server is providing for a 4840 particular file system. This attribute is a bitmask with the 4841 following values: 4843 FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a 4844 persistent filehandle, which is valid until the object is removed 4845 from the file system. The server will not return 4846 NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined 4847 as a value in which none of the bits specified below are set. 4849 FH4_VOLATILE_ANY The filehandle may expire at any time, except as 4850 specifically excluded (i.e., FH4_NOEXPIRE_WITH_OPEN). 4852 FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set.
4853 If this bit is set, then the meaning of FH4_VOLATILE_ANY is 4854 qualified to exclude any expiration of the filehandle when it is 4855 open. 4857 FH4_VOL_MIGRATION The filehandle will expire as a result of a file 4858 system transition (migration or replication), in those cases in 4859 which the continuity of filehandle use is not specified by handle 4860 class information within the fs_locations_info attribute. When 4861 this bit is set, clients without access to fs_locations_info 4862 information should assume that filehandles will expire on file 4863 system transitions. 4865 FH4_VOL_RENAME The filehandle will expire during rename. This 4866 includes a rename by the requesting client or a rename by any 4867 other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant. 4869 Servers that provide volatile filehandles that can expire while open 4870 require special care as regards handling of RENAMEs and REMOVEs. 4871 This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is 4872 set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not 4873 set, or if a non-read-only file system has a transition target in a 4874 different handle class. In these cases, the server should deny a 4875 RENAME or REMOVE that would affect an OPEN file of any of the 4876 components leading to the OPEN file. In addition, the server should 4877 deny all RENAME or REMOVE requests during the grace period, in order 4878 to make sure that reclaims of files where filehandles may have 4879 expired do not do a reclaim for the wrong file. 4881 Volatile filehandles are especially suitable for implementation of 4882 the pseudo file systems used to bridge exports. See Section 7.5 for 4883 a discussion of this. 4885 4.3. One Method of Constructing a Volatile Filehandle 4887 A volatile filehandle, while opaque to the client, could contain: 4889 [volatile bit = 1 | server boot time | slot | generation number] 4891 * slot is an index in the server volatile filehandle table 4893 * generation number is the generation number for the table entry/ 4894 slot 4896 When the client presents a volatile filehandle, the server makes the 4897 following checks, which assume that the check for the volatile bit 4898 has passed. If the server boot time is less than the current server 4899 boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return 4900 NFS4ERR_BADHANDLE. If the generation number does not match, return 4901 NFS4ERR_FHEXPIRED. 4903 When the server restarts, the table is gone (it is volatile). 4905 If the volatile bit is 0, then it is a persistent filehandle with a 4906 different structure following it. 4908 4.4. Client Recovery from Filehandle Expiration 4910 If possible, the client SHOULD recover from the receipt of an 4911 NFS4ERR_FHEXPIRED error. The client must take on additional 4912 responsibility so that it may prepare itself to recover from the 4913 expiration of a volatile filehandle. If the server returns 4914 persistent filehandles, the client does not need these additional 4915 steps. 4917 For volatile filehandles, most commonly the client will need to store 4918 the component names leading up to and including the file system 4919 object in question. With these names, the client should be able to 4920 recover by finding a filehandle in the namespace that is still 4921 available or by starting at the root of the server's file system 4922 namespace.
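As a non-normative sketch of the recovery approach just described, the C fragment below re-derives a filehandle by walking the stored component names from the root of the server's namespace.  The wrappers putrootfh(), lookup(), and getfh(), the struct nfs_session type, and the simplified nfs_fh4 rendering are assumptions standing in for operations that would be issued within a COMPOUND; they are not defined by this protocol.

   /* Assumed client-side wrappers for the PUTROOTFH, LOOKUP, and GETFH
    * operations; each would be issued within a COMPOUND on an existing
    * session. */
   struct nfs_session;                           /* opaque client state */
   typedef struct { unsigned int len; unsigned char data[128]; } nfs_fh4;
   #define NFS4_OK 0

   extern int putrootfh(struct nfs_session *s);
   extern int lookup(struct nfs_session *s, const char *component);
   extern int getfh(struct nfs_session *s, nfs_fh4 *fh);

   /* Recover from NFS4ERR_FHEXPIRED by re-walking the component names
    * that were stored when the volatile filehandle was first obtained. */
   static int
   recover_filehandle(struct nfs_session *sess, const char *components[],
                      int ncomponents, nfs_fh4 *new_fh)
   {
       int i, status;

       status = putrootfh(sess);                 /* start at the root */
       if (status != NFS4_OK)
           return status;

       for (i = 0; i < ncomponents; i++) {
           status = lookup(sess, components[i]); /* walk each component */
           if (status != NFS4_OK)
               return status;    /* e.g., NFS4ERR_NOENT: object is gone */
       }
       return getfh(sess, new_fh);               /* fresh filehandle */
   }

If an intermediate filehandle along the stored path is still valid, the client could equally start the walk there rather than at the root.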
4924 If the expired filehandle refers to an object that has been removed 4925 from the file system, obviously the client will not be able to 4926 recover from the expired filehandle. 4928 It is also possible that the expired filehandle refers to a file that 4929 has been renamed. If the file was renamed by another client, again 4930 it is possible that the original client will not be able to recover. 4931 However, in the case that the client itself is renaming the file and 4932 the file is open, it is possible that the client may be able to 4933 recover. The client can determine the new pathname based on the 4934 processing of the rename request. The client can then regenerate the 4935 new filehandle based on the new pathname. The client could also use 4936 the COMPOUND procedure to construct a series of operations like: 4938 RENAME A B 4939 LOOKUP B 4940 GETFH 4942 Note that the COMPOUND procedure does not provide atomicity. This 4943 example only reduces the overhead of recovering from an expired 4944 filehandle. 4946 5. File Attributes 4948 To meet the requirements of extensibility and increased 4949 interoperability with non-UNIX platforms, attributes need to be 4950 handled in a flexible manner. The NFSv3 fattr3 structure contains a 4951 fixed list of attributes that not all clients and servers are able to 4952 support or care about. The fattr3 structure cannot be extended as 4953 new needs arise and it provides no way to indicate non-support. With 4954 the NFSv4.1 protocol, the client is able to query what attributes the 4955 server supports and construct requests with only those supported 4956 attributes (or a subset thereof). 4958 To this end, attributes are divided into three groups: REQUIRED, 4959 RECOMMENDED, and named. Both REQUIRED and RECOMMENDED attributes are 4960 supported in the NFSv4.1 protocol by a specific and well-defined 4961 encoding and are identified by number. They are requested by setting 4962 a bit in the bit vector sent in the GETATTR request; the server 4963 response includes a bit vector to list what attributes were returned 4964 in the response. New REQUIRED or RECOMMENDED attributes may be added 4965 to the NFSv4 protocol as part of a new minor version by publishing a 4966 Standards Track RFC that allocates a new attribute number value and 4967 defines the encoding for the attribute. See Section 2.7 for further 4968 discussion. 4970 Named attributes are accessed by the new OPENATTR operation, which 4971 accesses a hidden directory of attributes associated with a file 4972 system object. OPENATTR takes a filehandle for the object and 4973 returns the filehandle for the attribute hierarchy. The filehandle 4974 for the named attributes is a directory object accessible by LOOKUP 4975 or READDIR and contains files whose names represent the named 4976 attributes and whose data bytes are the value of the attribute. 
For 4977 example: 4979 +----------+-----------+---------------------------------+ 4980 | LOOKUP | "foo" | ; look up file | 4981 +----------+-----------+---------------------------------+ 4982 | GETATTR | attrbits | | 4983 +----------+-----------+---------------------------------+ 4984 | OPENATTR | | ; access foo's named attributes | 4985 +----------+-----------+---------------------------------+ 4986 | LOOKUP | "x11icon" | ; look up specific attribute | 4987 +----------+-----------+---------------------------------+ 4988 | READ | 0,4096 | ; read stream of bytes | 4989 +----------+-----------+---------------------------------+ 4991 Table 3 4993 Named attributes are intended for data needed by applications rather 4994 than by an NFS client implementation. NFS implementors are strongly 4995 encouraged to define their new attributes as RECOMMENDED attributes 4996 by bringing them to the IETF Standards Track process. 4998 The set of attributes that are classified as REQUIRED is deliberately 4999 small since servers need to do whatever it takes to support them. A 5000 server should support as many of the RECOMMENDED attributes as 5001 possible but, by their definition, the server is not required to 5002 support all of them. Attributes are deemed REQUIRED if the data is 5003 both needed by a large number of clients and is not otherwise 5004 reasonably computable by the client when support is not provided on 5005 the server. 5007 Note that the hidden directory returned by OPENATTR is a convenience 5008 for protocol processing. The client should not make any assumptions 5009 about the server's implementation of named attributes and whether or 5010 not the underlying file system at the server has a named attribute 5011 directory. Therefore, operations such as SETATTR and GETATTR on the 5012 named attribute directory are undefined. 5014 5.1. REQUIRED Attributes 5016 These MUST be supported by every NFSv4.1 client and server in order 5017 to ensure a minimum level of interoperability. The server MUST store 5018 and return these attributes, and the client MUST be able to function 5019 with an attribute set limited to these attributes. With just the 5020 REQUIRED attributes some client functionality may be impaired or 5021 limited in some ways. A client may ask for any of these attributes 5022 to be returned by setting a bit in the GETATTR request, and the 5023 server MUST return their value. 5025 5.2. RECOMMENDED Attributes 5027 These attributes are understood well enough to warrant support in the 5028 NFSv4.1 protocol. However, they may not be supported on all clients 5029 and servers. A client may ask for any of these attributes to be 5030 returned by setting a bit in the GETATTR request but must handle the 5031 case where the server does not return them. A client MAY ask for the 5032 set of attributes the server supports and SHOULD NOT request 5033 attributes the server does not support. A server should be tolerant 5034 of requests for unsupported attributes and simply not return them 5035 rather than considering the request an error. It is expected that 5036 servers will support all attributes they comfortably can and only 5037 fail to support attributes that are difficult to support in their 5038 operating environments. A server should provide attributes whenever 5039 they don't have to "tell lies" to the client. For example, a file 5040 modification time should be either an accurate time or should not be 5041 supported by the server. 
At times this will be difficult for 5042 clients, but a client is better positioned to decide whether and how 5043 to fabricate or construct an attribute or whether to do without the 5044 attribute. 5046 5.3. Named Attributes 5048 These attributes are not supported by direct encoding in the NFSv4 5049 protocol but are accessed by string names rather than numbers and 5050 correspond to an uninterpreted stream of bytes that are stored with 5051 the file system object. The namespace for these attributes may be 5052 accessed by using the OPENATTR operation. The OPENATTR operation 5053 returns a filehandle for a virtual "named attribute directory", and 5054 further perusal and modification of the namespace may be done using 5055 operations that work on more typical directories. In particular, 5056 READDIR may be used to get a list of such named attributes, and 5057 LOOKUP and OPEN may select a particular attribute. Creation of a new 5058 named attribute may be the result of an OPEN specifying file 5059 creation. 5061 Once an OPEN is done, named attributes may be examined and changed by 5062 normal READ and WRITE operations using the filehandles and stateids 5063 returned by OPEN. 5065 Named attributes and the named attribute directory may have their own 5066 (non-named) attributes. Each of these objects MUST have all of the 5067 REQUIRED attributes and may have additional RECOMMENDED attributes. 5068 However, the set of attributes for named attributes and the named 5069 attribute directory need not be, and typically will not be, as large 5070 as that for other objects in that file system. 5072 Named attributes and the named attribute directory might be the 5073 target of delegations (in the case of the named attribute directory, 5074 these will be directory delegations). However, since granting of 5075 delegations is at the server's discretion, a server need not support 5076 delegations on named attributes or the named attribute directory. 5078 It is RECOMMENDED that servers support arbitrary named attributes. A 5079 client should not depend on the ability to store any named attributes 5080 in the server's file system. If a server does support named 5081 attributes, a client that is also able to handle them should be able 5082 to copy a file's data and metadata with complete transparency from 5083 one location to another; this would imply that names allowed for 5084 regular directory entries are valid for named attribute names as 5085 well. 5087 In NFSv4.1, the structure of named attribute directories is 5088 restricted in a number of ways, in order to prevent the development 5089 of non-interoperable implementations in which some servers support a 5090 fully general hierarchical directory structure for named attributes 5091 while others support a limited but adequate structure for named 5092 attributes. In such an environment, clients or applications might 5093 come to depend on non-portable extensions. The restrictions are: 5095 * CREATE is not allowed in a named attribute directory. Thus, such 5096 objects as symbolic links and special files are not allowed to be 5097 named attributes. Further, directories may not be created in a 5098 named attribute directory, so no hierarchical structure of named 5099 attributes for a single object is allowed. 5101 * If OPENATTR is done on a named attribute directory or on a named 5102 attribute, the server MUST return NFS4ERR_WRONG_TYPE. 
5104 * Doing a RENAME of a named attribute to a different named attribute 5105 directory or to an ordinary (i.e., non-named-attribute) directory 5106 is not allowed. 5108 * Creating hard links between named attribute directories or between 5109 named attribute directories and ordinary directories is not 5110 allowed. 5112 Names of attributes will not be controlled by this document or other 5113 IETF Standards Track documents. See Section 22.2 for further 5114 discussion. 5116 5.4. Classification of Attributes 5118 Each of the REQUIRED and RECOMMENDED attributes can be classified in 5119 one of three categories: per server (i.e., the value of the attribute 5120 will be the same for all file objects that share the same server 5121 owner; see Section 2.5 for a definition of server owner), per file 5122 system (i.e., the value of the attribute will be the same for some or 5123 all file objects that share the same fsid attribute (Section 5.8.1.9) 5124 and server owner), or per file system object. Note that it is 5125 possible that some per file system attributes may vary within the 5126 file system, depending on the value of the "homogeneous" 5127 (Section 5.8.2.16) attribute. Note that the attributes 5128 time_access_set and time_modify_set are not listed in this section 5129 because they are write-only attributes corresponding to time_access 5130 and time_modify, and are used in a special instance of SETATTR. 5132 * The per-server attribute is: 5134 lease_time 5136 * The per-file system attributes are: 5138 supported_attrs, suppattr_exclcreat, fh_expire_type, 5139 link_support, symlink_support, unique_handles, aclsupport, 5140 cansettime, case_insensitive, case_preserving, 5141 chown_restricted, files_avail, files_free, files_total, 5142 fs_locations, homogeneous, maxfilesize, maxname, maxread, 5143 maxwrite, no_trunc, space_avail, space_free, space_total, 5144 time_delta, change_policy, fs_status, fs_layout_type, 5145 fs_locations_info, fs_charset_cap 5147 * The per-file system object attributes are: 5149 type, change, size, named_attr, fsid, rdattr_error, filehandle, 5150 acl, archive, fileid, hidden, maxlink, mimetype, mode, 5151 numlinks, owner, owner_group, rawdev, space_used, system, 5152 time_access, time_backup, time_create, time_metadata, 5153 time_modify, mounted_on_fileid, dir_notif_delay, 5154 dirent_notif_delay, dacl, sacl, layout_type, layout_hint, 5155 layout_blksize, layout_alignment, mdsthreshold, retention_get, 5156 retention_set, retentevt_get, retentevt_set, retention_hold, 5157 mode_set_masked 5159 For quota_avail_hard, quota_avail_soft, and quota_used, see their 5160 definitions below for the appropriate classification. 5162 5.5. Set-Only and Get-Only Attributes 5164 Some REQUIRED and RECOMMENDED attributes are set-only; i.e., they can 5165 be set via SETATTR but not retrieved via GETATTR. Similarly, some 5166 REQUIRED and RECOMMENDED attributes are get-only; i.e., they can be 5167 retrieved via GETATTR but not set via SETATTR. If a client attempts 5168 to set a get-only attribute or get a set-only attribute, the server 5169 MUST return NFS4ERR_INVAL. 5171 5.6. REQUIRED Attributes - List and Definition References 5173 The list of REQUIRED attributes appears in Table 4. The meanings of 5174 the columns of the table are: 5176 Name: The name of the attribute. 5178 Id: The number assigned to the attribute.
In the event of conflicts 5179 between the assigned number and [10], the latter is likely 5180 authoritative, but should be resolved with Errata to this document 5181 and/or [10]. See [51] for the Errata process. 5183 Data Type: The XDR data type of the attribute. 5185 Acc: Access allowed to the attribute. R means read-only (GETATTR 5186 may retrieve, SETATTR may not set). W means write-only (SETATTR 5187 may set, GETATTR may not retrieve). R W means read/write (GETATTR 5188 may retrieve, SETATTR may set). 5190 Defined in: The section of this specification that describes the 5191 attribute. 5193 +====================+====+============+=====+==================+ 5194 | Name | Id | Data Type | Acc | Defined in: | 5195 +====================+====+============+=====+==================+ 5196 | supported_attrs | 0 | bitmap4 | R | Section 5.8.1.1 | 5197 +--------------------+----+------------+-----+------------------+ 5198 | type | 1 | nfs_ftype4 | R | Section 5.8.1.2 | 5199 +--------------------+----+------------+-----+------------------+ 5200 | fh_expire_type | 2 | uint32_t | R | Section 5.8.1.3 | 5201 +--------------------+----+------------+-----+------------------+ 5202 | change | 3 | uint64_t | R | Section 5.8.1.4 | 5203 +--------------------+----+------------+-----+------------------+ 5204 | size | 4 | uint64_t | R W | Section 5.8.1.5 | 5205 +--------------------+----+------------+-----+------------------+ 5206 | link_support | 5 | bool | R | Section 5.8.1.6 | 5207 +--------------------+----+------------+-----+------------------+ 5208 | symlink_support | 6 | bool | R | Section 5.8.1.7 | 5209 +--------------------+----+------------+-----+------------------+ 5210 | named_attr | 7 | bool | R | Section 5.8.1.8 | 5211 +--------------------+----+------------+-----+------------------+ 5212 | fsid | 8 | fsid4 | R | Section 5.8.1.9 | 5213 +--------------------+----+------------+-----+------------------+ 5214 | unique_handles | 9 | bool | R | Section 5.8.1.10 | 5215 +--------------------+----+------------+-----+------------------+ 5216 | lease_time | 10 | nfs_lease4 | R | Section 5.8.1.11 | 5217 +--------------------+----+------------+-----+------------------+ 5218 | rdattr_error | 11 | enum | R | Section 5.8.1.12 | 5219 +--------------------+----+------------+-----+------------------+ 5220 | filehandle | 19 | nfs_fh4 | R | Section 5.8.1.13 | 5221 +--------------------+----+------------+-----+------------------+ 5222 | suppattr_exclcreat | 75 | bitmap4 | R | Section 5.8.1.14 | 5223 +--------------------+----+------------+-----+------------------+ 5225 Table 4 5227 5.7. RECOMMENDED Attributes - List and Definition References 5229 The RECOMMENDED attributes are defined in Table 5. The meanings of 5230 the column headers are the same as Table 4; see Section 5.6 for the 5231 meanings. 
5233 +====================+====+====================+=====+=============+ 5234 | Name | Id | Data Type | Acc | Defined in: | 5235 +====================+====+====================+=====+=============+ 5236 | acl | 12 | nfsace4<> | R W | Section | 5237 | | | | | 6.2.1 | 5238 +--------------------+----+--------------------+-----+-------------+ 5239 | aclsupport | 13 | uint32_t | R | Section | 5240 | | | | | 6.2.1.2 | 5241 +--------------------+----+--------------------+-----+-------------+ 5242 | archive | 14 | bool | R W | Section | 5243 | | | | | 5.8.2.1 | 5244 +--------------------+----+--------------------+-----+-------------+ 5245 | cansettime | 15 | bool | R | Section | 5246 | | | | | 5.8.2.2 | 5247 +--------------------+----+--------------------+-----+-------------+ 5248 | case_insensitive | 16 | bool | R | Section | 5249 | | | | | 5.8.2.3 | 5250 +--------------------+----+--------------------+-----+-------------+ 5251 | case_preserving | 17 | bool | R | Section | 5252 | | | | | 5.8.2.4 | 5253 +--------------------+----+--------------------+-----+-------------+ 5254 | change_policy | 60 | chg_policy4 | R | Section | 5255 | | | | | 5.8.2.5 | 5256 +--------------------+----+--------------------+-----+-------------+ 5257 | chown_restricted | 18 | bool | R | Section | 5258 | | | | | 5.8.2.6 | 5259 +--------------------+----+--------------------+-----+-------------+ 5260 | dacl | 58 | nfsacl41 | R W | Section | 5261 | | | | | 6.2.2 | 5262 +--------------------+----+--------------------+-----+-------------+ 5263 | dir_notif_delay | 56 | nfstime4 | R | Section | 5264 | | | | | 5.11.1 | 5265 +--------------------+----+--------------------+-----+-------------+ 5266 | dirent_notif_delay | 57 | nfstime4 | R | Section | 5267 | | | | | 5.11.2 | 5268 +--------------------+----+--------------------+-----+-------------+ 5269 | fileid | 20 | uint64_t | R | Section | 5270 | | | | | 5.8.2.7 | 5271 +--------------------+----+--------------------+-----+-------------+ 5272 | files_avail | 21 | uint64_t | R | Section | 5273 | | | | | 5.8.2.8 | 5274 +--------------------+----+--------------------+-----+-------------+ 5275 | files_free | 22 | uint64_t | R | Section | 5276 | | | | | 5.8.2.9 | 5277 +--------------------+----+--------------------+-----+-------------+ 5278 | files_total | 23 | uint64_t | R | Section | 5279 | | | | | 5.8.2.10 | 5280 +--------------------+----+--------------------+-----+-------------+ 5281 | fs_charset_cap | 76 | uint32_t | R | Section | 5282 | | | | | 5.8.2.11 | 5283 +--------------------+----+--------------------+-----+-------------+ 5284 | fs_layout_type | 62 | layouttype4<> | R | Section | 5285 | | | | | 5.12.1 | 5286 +--------------------+----+--------------------+-----+-------------+ 5287 | fs_locations | 24 | fs_locations | R | Section | 5288 | | | | | 5.8.2.12 | 5289 +--------------------+----+--------------------+-----+-------------+ 5290 | fs_locations_info | 67 | fs_locations_info4 | R | Section | 5291 | | | | | 5.8.2.13 | 5292 +--------------------+----+--------------------+-----+-------------+ 5293 | fs_status | 61 | fs4_status | R | Section | 5294 | | | | | 5.8.2.14 | 5295 +--------------------+----+--------------------+-----+-------------+ 5296 | hidden | 25 | bool | R W | Section | 5297 | | | | | 5.8.2.15 | 5298 +--------------------+----+--------------------+-----+-------------+ 5299 | homogeneous | 26 | bool | R | Section | 5300 | | | | | 5.8.2.16 | 5301 +--------------------+----+--------------------+-----+-------------+ 5302 | layout_alignment | 66 | uint32_t | R 
| Section | 5303 | | | | | 5.12.2 | 5304 +--------------------+----+--------------------+-----+-------------+ 5305 | layout_blksize | 65 | uint32_t | R | Section | 5306 | | | | | 5.12.3 | 5307 +--------------------+----+--------------------+-----+-------------+ 5308 | layout_hint | 63 | layouthint4 | W | Section | 5309 | | | | | 5.12.4 | 5310 +--------------------+----+--------------------+-----+-------------+ 5311 | layout_type | 64 | layouttype4<> | R | Section | 5312 | | | | | 5.12.5 | 5313 +--------------------+----+--------------------+-----+-------------+ 5314 | maxfilesize | 27 | uint64_t | R | Section | 5315 | | | | | 5.8.2.17 | 5316 +--------------------+----+--------------------+-----+-------------+ 5317 | maxlink | 28 | uint32_t | R | Section | 5318 | | | | | 5.8.2.18 | 5319 +--------------------+----+--------------------+-----+-------------+ 5320 | maxname | 29 | uint32_t | R | Section | 5321 | | | | | 5.8.2.19 | 5322 +--------------------+----+--------------------+-----+-------------+ 5323 | maxread | 30 | uint64_t | R | Section | 5324 | | | | | 5.8.2.20 | 5325 +--------------------+----+--------------------+-----+-------------+ 5326 | maxwrite | 31 | uint64_t | R | Section | 5327 | | | | | 5.8.2.21 | 5328 +--------------------+----+--------------------+-----+-------------+ 5329 | mdsthreshold | 68 | mdsthreshold4 | R | Section | 5330 | | | | | 5.12.6 | 5331 +--------------------+----+--------------------+-----+-------------+ 5332 | mimetype | 32 | utf8str_cs | R W | Section | 5333 | | | | | 5.8.2.22 | 5334 +--------------------+----+--------------------+-----+-------------+ 5335 | mode | 33 | mode4 | R W | Section | 5336 | | | | | 6.2.4 | 5337 +--------------------+----+--------------------+-----+-------------+ 5338 | mode_set_masked | 74 | mode_masked4 | W | Section | 5339 | | | | | 6.2.5 | 5340 +--------------------+----+--------------------+-----+-------------+ 5341 | mounted_on_fileid | 55 | uint64_t | R | Section | 5342 | | | | | 5.8.2.23 | 5343 +--------------------+----+--------------------+-----+-------------+ 5344 | no_trunc | 34 | bool | R | Section | 5345 | | | | | 5.8.2.24 | 5346 +--------------------+----+--------------------+-----+-------------+ 5347 | numlinks | 35 | uint32_t | R | Section | 5348 | | | | | 5.8.2.25 | 5349 +--------------------+----+--------------------+-----+-------------+ 5350 | owner | 36 | utf8str_mixed | R W | Section | 5351 | | | | | 5.8.2.26 | 5352 +--------------------+----+--------------------+-----+-------------+ 5353 | owner_group | 37 | utf8str_mixed | R W | Section | 5354 | | | | | 5.8.2.27 | 5355 +--------------------+----+--------------------+-----+-------------+ 5356 | quota_avail_hard | 38 | uint64_t | R | Section | 5357 | | | | | 5.8.2.28 | 5358 +--------------------+----+--------------------+-----+-------------+ 5359 | quota_avail_soft | 39 | uint64_t | R | Section | 5360 | | | | | 5.8.2.29 | 5361 +--------------------+----+--------------------+-----+-------------+ 5362 | quota_used | 40 | uint64_t | R | Section | 5363 | | | | | 5.8.2.30 | 5364 +--------------------+----+--------------------+-----+-------------+ 5365 | rawdev | 41 | specdata4 | R | Section | 5366 | | | | | 5.8.2.31 | 5367 +--------------------+----+--------------------+-----+-------------+ 5368 | retentevt_get | 71 | retention_get4 | R | Section | 5369 | | | | | 5.13.3 | 5370 +--------------------+----+--------------------+-----+-------------+ 5371 | retentevt_set | 72 | retention_set4 | W | Section | 5372 | | | | | 5.13.4 | 5373 
+--------------------+----+--------------------+-----+-------------+ 5374 | retention_get | 69 | retention_get4 | R | Section | 5375 | | | | | 5.13.1 | 5376 +--------------------+----+--------------------+-----+-------------+ 5377 | retention_hold | 73 | uint64_t | R W | Section | 5378 | | | | | 5.13.5 | 5379 +--------------------+----+--------------------+-----+-------------+ 5380 | retention_set | 70 | retention_set4 | W | Section | 5381 | | | | | 5.13.2 | 5382 +--------------------+----+--------------------+-----+-------------+ 5383 | sacl | 59 | nfsacl41 | R W | Section | 5384 | | | | | 6.2.3 | 5385 +--------------------+----+--------------------+-----+-------------+ 5386 | space_avail | 42 | uint64_t | R | Section | 5387 | | | | | 5.8.2.32 | 5388 +--------------------+----+--------------------+-----+-------------+ 5389 | space_free | 43 | uint64_t | R | Section | 5390 | | | | | 5.8.2.33 | 5391 +--------------------+----+--------------------+-----+-------------+ 5392 | space_total | 44 | uint64_t | R | Section | 5393 | | | | | 5.8.2.34 | 5394 +--------------------+----+--------------------+-----+-------------+ 5395 | space_used | 45 | uint64_t | R | Section | 5396 | | | | | 5.8.2.35 | 5397 +--------------------+----+--------------------+-----+-------------+ 5398 | system | 46 | bool | R W | Section | 5399 | | | | | 5.8.2.36 | 5400 +--------------------+----+--------------------+-----+-------------+ 5401 | time_access | 47 | nfstime4 | R | Section | 5402 | | | | | 5.8.2.37 | 5403 +--------------------+----+--------------------+-----+-------------+ 5404 | time_access_set | 48 | settime4 | W | Section | 5405 | | | | | 5.8.2.38 | 5406 +--------------------+----+--------------------+-----+-------------+ 5407 | time_backup | 49 | nfstime4 | R W | Section | 5408 | | | | | 5.8.2.39 | 5409 +--------------------+----+--------------------+-----+-------------+ 5410 | time_create | 50 | nfstime4 | R W | Section | 5411 | | | | | 5.8.2.40 | 5412 +--------------------+----+--------------------+-----+-------------+ 5413 | time_delta | 51 | nfstime4 | R | Section | 5414 | | | | | 5.8.2.41 | 5415 +--------------------+----+--------------------+-----+-------------+ 5416 | time_metadata | 52 | nfstime4 | R | Section | 5417 | | | | | 5.8.2.42 | 5418 +--------------------+----+--------------------+-----+-------------+ 5419 | time_modify | 53 | nfstime4 | R | Section | 5420 | | | | | 5.8.2.43 | 5421 +--------------------+----+--------------------+-----+-------------+ 5422 | time_modify_set | 54 | settime4 | W | Section | 5423 | | | | | 5.8.2.44 | 5424 +--------------------+----+--------------------+-----+-------------+ 5426 Table 5 5428 5.8. Attribute Definitions 5430 5.8.1. Definitions of REQUIRED Attributes 5432 5.8.1.1. Attribute 0: supported_attrs 5434 The bit vector that would retrieve all REQUIRED and RECOMMENDED 5435 attributes that are supported for this object. The scope of this 5436 attribute applies to all objects with a matching fsid. 5438 5.8.1.2. Attribute 1: type 5440 Designates the type of an object in terms of one of a number of 5441 special constants: 5443 * NF4REG designates a regular file. 5445 * NF4DIR designates a directory. 5447 * NF4BLK designates a block device special file. 5449 * NF4CHR designates a character device special file. 5451 * NF4LNK designates a symbolic link. 5453 * NF4SOCK designates a named socket special file. 5455 * NF4FIFO designates a fifo special file. 5457 * NF4ATTRDIR designates a named attribute directory. 5459 * NF4NAMEDATTR designates a named attribute. 
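For illustration only, the following C sketch shows how a UNIX-style client might map these type constants onto POSIX st_mode file type bits; the handling of NF4ATTRDIR and NF4NAMEDATTR matches the phrase definitions given in the paragraphs that follow.  The function name is an assumption, and the enum values are repeated from the protocol's XDR description only to keep the sketch self-contained.

   #include <sys/stat.h>

   /* nfs_ftype4 constants, with the values given in the protocol's XDR
    * description. */
   enum nfs_ftype4 {
       NF4REG = 1, NF4DIR = 2, NF4BLK = 3, NF4CHR = 4, NF4LNK = 5,
       NF4SOCK = 6, NF4FIFO = 7, NF4ATTRDIR = 8, NF4NAMEDATTR = 9
   };

   /* Present named attribute directories as directories and named
    * attributes as regular files. */
   static mode_t
   nfs4_type_to_mode(enum nfs_ftype4 type)
   {
       switch (type) {
       case NF4REG:  case NF4NAMEDATTR:  return S_IFREG;
       case NF4DIR:  case NF4ATTRDIR:    return S_IFDIR;
       case NF4BLK:                      return S_IFBLK;
       case NF4CHR:                      return S_IFCHR;
       case NF4LNK:                      return S_IFLNK;
       case NF4SOCK:                     return S_IFSOCK;
       case NF4FIFO:                     return S_IFIFO;
       }
       return 0;                         /* unrecognized type */
   }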
5461 Within the explanatory text and operation descriptions, the following 5462 phrases will be used with the meanings given below: 5464 * The phrase "is a directory" means that the object's type attribute 5465 is NF4DIR or NF4ATTRDIR. 5467 * The phrase "is a special file" means that the object's type 5468 attribute is NF4BLK, NF4CHR, NF4SOCK, or NF4FIFO. 5470 * The phrases "is an ordinary file" and "is a regular file" mean 5471 that the object's type attribute is NF4REG or NF4NAMEDATTR. 5473 5.8.1.3. Attribute 2: fh_expire_type 5475 Server uses this to specify filehandle expiration behavior to the 5476 client. See Section 4 for additional description. 5478 5.8.1.4. Attribute 3: change 5480 A value created by the server that the client can use to determine if 5481 file data, directory contents, or attributes of the object have been 5482 modified. The server may return the object's time_metadata attribute 5483 for this attribute's value, but only if the file system object cannot 5484 be updated more frequently than the resolution of time_metadata. 5486 5.8.1.5. Attribute 4: size 5488 The size of the object in bytes. 5490 5.8.1.6. Attribute 5: link_support 5492 TRUE, if the object's file system supports hard links. 5494 5.8.1.7. Attribute 6: symlink_support 5496 TRUE, if the object's file system supports symbolic links. 5498 5.8.1.8. Attribute 7: named_attr 5500 TRUE, if this object has named attributes. In other words, object 5501 has a non-empty named attribute directory. 5503 5.8.1.9. Attribute 8: fsid 5505 Unique file system identifier for the file system holding this 5506 object. The fsid attribute has major and minor components, each of 5507 which are of data type uint64_t. 5509 5.8.1.10. Attribute 9: unique_handles 5511 TRUE, if two distinct filehandles are guaranteed to refer to two 5512 different file system objects. 5514 5.8.1.11. Attribute 10: lease_time 5516 Duration of the lease at server in seconds. 5518 5.8.1.12. Attribute 11: rdattr_error 5520 Error returned from an attempt to retrieve attributes during a 5521 READDIR operation. 5523 5.8.1.13. Attribute 19: filehandle 5525 The filehandle of this object (primarily for READDIR requests). 5527 5.8.1.14. Attribute 75: suppattr_exclcreat 5529 The bit vector that would set all REQUIRED and RECOMMENDED attributes 5530 that are supported by the EXCLUSIVE4_1 method of file creation via 5531 the OPEN operation. The scope of this attribute applies to all 5532 objects with a matching fsid. 5534 5.8.2. Definitions of Uncategorized RECOMMENDED Attributes 5536 The definitions of most of the RECOMMENDED attributes follow. 5537 Collections that share a common category are defined in other 5538 sections. 5540 5.8.2.1. Attribute 14: archive 5542 TRUE, if this file has been archived since the time of last 5543 modification (deprecated in favor of time_backup). 5545 5.8.2.2. Attribute 15: cansettime 5547 TRUE, if the server is able to change the times for a file system 5548 object as specified in a SETATTR operation. 5550 5.8.2.3. Attribute 16: case_insensitive 5552 TRUE, if file name comparisons on this file system are case 5553 insensitive. 5555 5.8.2.4. Attribute 17: case_preserving 5557 TRUE, if file name case on this file system is preserved. 5559 5.8.2.5. Attribute 60: change_policy 5561 A value created by the server that the client can use to determine if 5562 some server policy related to the current file system has been 5563 subject to change. 
If the value remains the same, then the client 5564 can be sure that the values of the attributes related to fs location 5565 and the fss_type field of the fs_status attribute have not changed. 5566 On the other hand, a change in this value does not necessarily imply a 5567 change in policy. It is up to the client to interrogate the server 5568 to determine if some policy relevant to it has changed. See 5569 Section 3.3.6 for details. 5571 This attribute MUST change when the value returned by the 5572 fs_locations or fs_locations_info attribute changes, when a file 5573 system goes from read-only to writable or vice versa, or when the 5574 allowable set of security flavors for the file system or any part 5575 thereof is changed. 5577 5.8.2.6. Attribute 18: chown_restricted 5579 If TRUE, the server will reject any request to change either the 5580 owner or the group associated with a file if the caller is not a 5581 privileged user (for example, "root" in UNIX operating environments 5582 or, in Windows 2000, the "Take Ownership" privilege). 5584 5.8.2.7. Attribute 20: fileid 5586 A number uniquely identifying the file within the file system. 5588 5.8.2.8. Attribute 21: files_avail 5590 File slots available to this user on the file system containing this 5591 object -- this should be the smallest relevant limit. 5593 5.8.2.9. Attribute 22: files_free 5595 Free file slots on the file system containing this object -- this 5596 should be the smallest relevant limit. 5598 5.8.2.10. Attribute 23: files_total 5600 Total file slots on the file system containing this object. 5602 5.8.2.11. Attribute 76: fs_charset_cap 5604 Character set capabilities for this file system. See Section 14.4. 5606 5.8.2.12. Attribute 24: fs_locations 5608 Locations where this file system may be found. If the server returns 5609 NFS4ERR_MOVED as an error, this attribute MUST be supported. See 5610 Section 11.16 for more details. 5612 5.8.2.13. Attribute 67: fs_locations_info 5614 Full function file system location. See Section 11.17.2 for more 5615 details. 5617 5.8.2.14. Attribute 61: fs_status 5619 Generic file system type information. See Section 11.18 for more 5620 details. 5622 5.8.2.15. Attribute 25: hidden 5624 TRUE, if the file is considered hidden with respect to the Windows 5625 API. 5627 5.8.2.16. Attribute 26: homogeneous 5629 TRUE, if this object's file system is homogeneous; i.e., all objects 5630 in the file system (all objects on the server with the same fsid) 5631 have common values for all per-file-system attributes. 5633 5.8.2.17. Attribute 27: maxfilesize 5635 Maximum supported file size for the file system of this object. 5637 5.8.2.18. Attribute 28: maxlink 5639 Maximum number of links for this object. 5641 5.8.2.19. Attribute 29: maxname 5643 Maximum file name size supported for this object. 5645 5.8.2.20. Attribute 30: maxread 5647 Maximum amount of data the READ operation will return for this 5648 object. 5650 5.8.2.21. Attribute 31: maxwrite 5652 Maximum amount of data the WRITE operation will accept for this 5653 object. This attribute SHOULD be supported if the file is writable. 5654 Lack of this attribute can lead to the client either wasting 5655 bandwidth or not receiving the best performance. 5657 5.8.2.22. Attribute 32: mimetype 5659 MIME body type/subtype of this object. 5661 5.8.2.23. Attribute 55: mounted_on_fileid 5663 Like fileid, but if the target filehandle is the root of a file 5664 system, this attribute represents the fileid of the underlying 5665 directory.
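As a non-normative illustration of the rule just stated (and elaborated in the discussion that follows), the C sketch below returns the covered directory's fileid when the object is the root of a mounted file system.  The fs_object structure and its fields are assumptions about server-internal state, not protocol elements.

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Illustrative server-side object state. */
   struct fs_object {
       uint64_t          fileid;       /* value of the fileid attribute */
       bool              is_fs_root;   /* is this the root of its fs?   */
       struct fs_object *covered_dir;  /* directory this fs is mounted
                                        * on, or NULL if none           */
   };

   /* Value of the mounted_on_fileid attribute: the fileid of the
    * underlying (covered) directory when the object is the root of a
    * mounted file system, and otherwise the object's own fileid.  When
    * several file systems are stacked on one mount point, a full
    * implementation would walk down to the base mount point. */
   static uint64_t
   mounted_on_fileid(const struct fs_object *obj)
   {
       if (obj->is_fs_root && obj->covered_dir != NULL)
           return obj->covered_dir->fileid;
       return obj->fileid;
   }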
5667 UNIX-based operating environments connect a file system into the 5668 namespace by connecting (mounting) the file system onto the existing 5669 file object (the mount point, usually a directory) of an existing 5670 file system. When the mount point's parent directory is read via an 5671 API like readdir(), the return results are directory entries, each 5672 with a component name and a fileid. The fileid of the mount point's 5673 directory entry will be different from the fileid that the stat() 5674 system call returns. The stat() system call is returning the fileid 5675 of the root of the mounted file system, whereas readdir() is 5676 returning the fileid that stat() would have returned before any file 5677 systems were mounted on the mount point. 5679 Unlike NFSv3, NFSv4.1 allows a client's LOOKUP request to cross other 5680 file systems. The client detects the file system crossing whenever 5681 the filehandle argument of LOOKUP has an fsid attribute different 5682 from that of the filehandle returned by LOOKUP. A UNIX-based client 5683 will consider this a "mount point crossing". UNIX has a legacy 5684 scheme for allowing a process to determine its current working 5685 directory. This relies on readdir() of a mount point's parent and 5686 stat() of the mount point returning fileids as previously described. 5687 The mounted_on_fileid attribute corresponds to the fileid that 5688 readdir() would have returned as described previously. 5690 While the NFSv4.1 client could simply fabricate a fileid 5691 corresponding to what mounted_on_fileid provides (and if the server 5692 does not support mounted_on_fileid, the client has no choice), there 5693 is a risk that the client will generate a fileid that conflicts with 5694 one that is already assigned to another object in the file system. 5695 Instead, if the server can provide the mounted_on_fileid, the 5696 potential for client operational problems in this area is eliminated. 5698 If the server detects that there is no mounted point at the target 5699 file object, then the value for mounted_on_fileid that it returns is 5700 the same as that of the fileid attribute. 5702 The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD 5703 provide it if possible, and for a UNIX-based server, this is 5704 straightforward. Usually, mounted_on_fileid will be requested during 5705 a READDIR operation, in which case it is trivial (at least for UNIX- 5706 based servers) to return mounted_on_fileid since it is equal to the 5707 fileid of a directory entry returned by readdir(). If 5708 mounted_on_fileid is requested in a GETATTR operation, the server 5709 should obey an invariant that has it returning a value that is equal 5710 to the file object's entry in the object's parent directory, i.e., 5711 what readdir() would have returned. Some operating environments 5712 allow a series of two or more file systems to be mounted onto a 5713 single mount point. In this case, for the server to obey the 5714 aforementioned invariant, it will need to find the base mount point, 5715 and not the intermediate mount points. 5717 5.8.2.24. Attribute 34: no_trunc 5719 If this attribute is TRUE, then if the client uses a file name longer 5720 than name_max, an error will be returned instead of the name being 5721 truncated. 5723 5.8.2.25. Attribute 35: numlinks 5725 Number of hard links to this object. 5727 5.8.2.26. Attribute 36: owner 5729 The string name of the owner of this object. 5731 5.8.2.27. 
Attribute 37: owner_group 5733 The string name of the group ownership of this object. 5735 5.8.2.28. Attribute 38: quota_avail_hard 5737 The value in bytes that represents the amount of additional disk 5738 space beyond the current allocation that can be allocated to this 5739 file or directory before further allocations will be refused. It is 5740 understood that this space may be consumed by allocations to other 5741 files or directories. 5743 5.8.2.29. Attribute 39: quota_avail_soft 5745 The value in bytes that represents the amount of additional disk 5746 space that can be allocated to this file or directory before the user 5747 may reasonably be warned. It is understood that this space may be 5748 consumed by allocations to other files or directories though there is 5749 a rule as to which other files or directories. 5751 5.8.2.30. Attribute 40: quota_used 5753 The value in bytes that represents the amount of disk space used by 5754 this file or directory and possibly a number of other similar files 5755 or directories, where the set of "similar" meets at least the 5756 criterion that allocating space to any file or directory in the set 5757 will reduce the "quota_avail_hard" of every other file or directory 5758 in the set. 5760 Note that there may be a number of distinct but overlapping sets of 5761 files or directories for which a quota_used value is maintained, 5762 e.g., "all files with a given owner", "all files with a given group 5763 owner", etc. The server is at liberty to choose any of those sets 5764 when providing the content of the quota_used attribute, but should do 5765 so in a repeatable way. The rule may be configured per file system 5766 or may be "choose the set with the smallest quota". 5768 5.8.2.31. Attribute 41: rawdev 5770 Raw device number of file of type NF4BLK or NF4CHR. The device 5771 number is split into major and minor numbers. If the file's type 5772 attribute is not NF4BLK or NF4CHR, the value returned SHOULD NOT be 5773 considered useful. 5775 5.8.2.32. Attribute 42: space_avail 5777 Disk space in bytes available to this user on the file system 5778 containing this object -- this should be the smallest relevant limit. 5780 5.8.2.33. Attribute 43: space_free 5782 Free disk space in bytes on the file system containing this object -- 5783 this should be the smallest relevant limit. 5785 5.8.2.34. Attribute 44: space_total 5787 Total disk space in bytes on the file system containing this object. 5789 5.8.2.35. Attribute 45: space_used 5791 Number of file system bytes allocated to this object. 5793 5.8.2.36. Attribute 46: system 5795 This attribute is TRUE if this file is a "system" file with respect 5796 to the Windows operating environment. 5798 5.8.2.37. Attribute 47: time_access 5800 The time_access attribute represents the time of last access to the 5801 object by a READ operation sent to the server. The notion of what is 5802 an "access" depends on the server's operating environment and/or the 5803 server's file system semantics. For example, for servers obeying 5804 Portable Operating System Interface (POSIX) semantics, time_access 5805 would be updated only by the READ and READDIR operations and not any 5806 of the operations that modify the content of the object [13], [14], 5807 [15]. Of course, setting the corresponding time_access_set attribute 5808 is another way to modify the time_access attribute. 
5810 Whenever the file object resides on a writable file system, the 5811 server should make its best efforts to record time_access into stable 5812 storage. However, to mitigate the performance effects of doing so, 5813 and most especially whenever the server is satisfying the read of the 5814 object's content from its cache, the server MAY cache access time 5815 updates and lazily write them to stable storage. It is also 5816 acceptable to give administrators of the server the option to disable 5817 time_access updates. 5819 5.8.2.38. Attribute 48: time_access_set 5821 Sets the time of last access to the object. SETATTR use only. 5823 5.8.2.39. Attribute 49: time_backup 5825 The time of last backup of the object. 5827 5.8.2.40. Attribute 50: time_create 5829 The time of creation of the object. This attribute does not have any 5830 relation to the traditional UNIX file attribute "ctime" or "change 5831 time". 5833 5.8.2.41. Attribute 51: time_delta 5835 Smallest useful server time granularity. 5837 5.8.2.42. Attribute 52: time_metadata 5839 The time of last metadata modification of the object. 5841 5.8.2.43. Attribute 53: time_modify 5843 The time of last modification to the object. 5845 5.8.2.44. Attribute 54: time_modify_set 5847 Sets the time of last modification to the object. SETATTR use only. 5849 5.9. Interpreting owner and owner_group 5851 The RECOMMENDED attributes "owner" and "owner_group" (and also users 5852 and groups within the "acl" attribute) are represented in terms of a 5853 UTF-8 string. To avoid a representation that is tied to a particular 5854 underlying implementation at the client or server, the use of the 5855 UTF-8 string has been chosen. Note that Section 6.1 of RFC 2624 [53] 5856 provides additional rationale. It is expected that the client and 5857 server will have their own local representation of owner and 5858 owner_group that is used for local storage or presentation to the end 5859 user. Therefore, it is expected that when these attributes are 5860 transferred between the client and server, the local representation 5861 is translated to a syntax of the form "user@dns_domain". This will 5862 allow for a client and server that do not use the same local 5863 representation the ability to translate to a common syntax that can 5864 be interpreted by both. 5866 Similarly, security principals may be represented in different ways 5867 by different security mechanisms. Servers normally translate these 5868 representations into a common format, generally that used by local 5869 storage, to serve as a means of identifying the users corresponding 5870 to these security principals. When these local identifiers are 5871 translated to the form of the owner attribute, associated with files 5872 created by such principals, they identify, in a common format, the 5873 users associated with each corresponding set of security principals. 5875 The translation used to interpret owner and group strings is not 5876 specified as part of the protocol. This allows various solutions to 5877 be employed. For example, a local translation table may be consulted 5878 that maps a numeric identifier to the user@dns_domain syntax. A name 5879 service may also be used to accomplish the translation. 
A server may 5880 provide a more general service, not limited by any particular 5881 translation (which would only translate a limited set of possible 5882 strings) by storing the owner and owner_group attributes in local 5883 storage without any translation or it may augment a translation 5884 method by storing the entire string for attributes for which no 5885 translation is available while using the local representation for 5886 those cases in which a translation is available. 5888 Servers that do not provide support for all possible values of the 5889 owner and owner_group attributes SHOULD return an error 5890 (NFS4ERR_BADOWNER) when a string is presented that has no 5891 translation, as the value to be set for a SETATTR of the owner, 5892 owner_group, or acl attributes. When a server does accept an owner 5893 or owner_group value as valid on a SETATTR (and similarly for the 5894 owner and group strings in an acl), it is promising to return that 5895 same string when a corresponding GETATTR is done. Configuration 5896 changes (including changes from the mapping of the string to the 5897 local representation) and ill-constructed name translations (those 5898 that contain aliasing) may make that promise impossible to honor. 5899 Servers should make appropriate efforts to avoid a situation in which 5900 these attributes have their values changed when no real change to 5901 ownership has occurred. 5903 The "dns_domain" portion of the owner string is meant to be a DNS 5904 domain name, for example, user@example.org. Servers should accept as 5905 valid a set of users for at least one domain. A server may treat 5906 other domains as having no valid translations. A more general 5907 service is provided when a server is capable of accepting users for 5908 multiple domains, or for all domains, subject to security 5909 constraints. 5911 In the case where there is no translation available to the client or 5912 server, the attribute value will be constructed without the "@". 5913 Therefore, the absence of the @ from the owner or owner_group 5914 attribute signifies that no translation was available at the sender 5915 and that the receiver of the attribute should not use that string as 5916 a basis for translation into its own internal format. Even though 5917 the attribute value cannot be translated, it may still be useful. In 5918 the case of a client, the attribute string may be used for local 5919 display of ownership. 5921 To provide a greater degree of compatibility with NFSv3, which 5922 identified users and groups by 32-bit unsigned user identifiers and 5923 group identifiers, owner and group strings that consist of decimal 5924 numeric values with no leading zeros can be given a special 5925 interpretation by clients and servers that choose to provide such 5926 support. The receiver may treat such a user or group string as 5927 representing the same user as would be represented by an NFSv3 uid or 5928 gid having the corresponding numeric value. A server is not 5929 obligated to accept such a string, but may return an NFS4ERR_BADOWNER 5930 instead. To avoid this mechanism being used to subvert user and 5931 group translation, so that a client might pass all of the owners and 5932 groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER 5933 error when there is a valid translation for the user or owner 5934 designated in this way. In that case, the client must use the 5935 appropriate name@domain string and not the special form for 5936 compatibility. 
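The string forms discussed above can be summarized by the following C sketch, which shows how a receiver might classify an owner or owner_group value before attempting any translation. The sketch is illustrative only; the enumeration and function names are invented for this example and have no protocol significance.

      #include <ctype.h>
      #include <string.h>

      enum owner_form {
          OWNER_NAME_AT_DOMAIN,   /* "user@dns_domain"; translate normally */
          OWNER_NUMERIC,          /* decimal string; optional NFSv3-style
                                     uid/gid interpretation */
          OWNER_UNTRANSLATED      /* no "@"; usable for display only */
      };

      enum owner_form
      classify_owner_string(const char *who)
      {
          size_t i, len = strlen(who);

          if (strchr(who, '@') != NULL)
              return OWNER_NAME_AT_DOMAIN;

          /* Decimal digits with no leading zeros may be given the
           * NFSv3-compatible numeric interpretation by
           * implementations that choose to support it. */
          if (len > 0 && !(who[0] == '0' && len > 1)) {
              for (i = 0; i < len; i++)
                  if (!isdigit((unsigned char)who[i]))
                      return OWNER_UNTRANSLATED;
              return OWNER_NUMERIC;
          }

          return OWNER_UNTRANSLATED;
      }

Whether an OWNER_NUMERIC string is accepted, and whether an OWNER_NAME_AT_DOMAIN string has a valid translation, remains subject to the server behavior described above.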
5938 The owner string "nobody" may be used to designate an anonymous user, 5939 which will be associated with a file created by a security principal 5940 that cannot be mapped through normal means to the owner attribute. 5941 Users and implementations of NFSv4.1 SHOULD NOT use "nobody" to 5942 designate a real user whose access is not anonymous. 5944 5.10. Character Case Attributes 5946 With respect to the case_insensitive and case_preserving attributes, 5947 each UCS-4 character (which UTF-8 encodes) can be mapped according to 5948 Appendix B.2 of RFC 3454 [16]. For general character handling and 5949 internationalization issues, see Section 14. 5951 5.11. Directory Notification Attributes 5953 As described in Section 18.39, the client can request a minimum delay 5954 for notifications of changes to attributes, but the server is free to 5955 ignore what the client requests. The client can determine in advance 5956 what notification delays the server will accept by sending a GETATTR 5957 operation for either or both of two directory notification 5958 attributes. When the client calls the GET_DIR_DELEGATION operation 5959 and asks for attribute change notifications, it should request 5960 notification delays that are no less than the values in the server- 5961 provided attributes. 5963 5.11.1. Attribute 56: dir_notif_delay 5965 The dir_notif_delay attribute is the minimum number of seconds the 5966 server will delay before notifying the client of a change to the 5967 directory's attributes. 5969 5.11.2. Attribute 57: dirent_notif_delay 5971 The dirent_notif_delay attribute is the minimum number of seconds the 5972 server will delay before notifying the client of a change to a file 5973 object that has an entry in the directory. 5975 5.12. pNFS Attribute Definitions 5977 5.12.1. Attribute 62: fs_layout_type 5979 The fs_layout_type attribute (see Section 3.3.13) applies to a file 5980 system and indicates what layout types are supported by the file 5981 system. When the client encounters a new fsid, the client SHOULD 5982 obtain the value for the fs_layout_type attribute associated with the 5983 new file system. This attribute is used by the client to determine 5984 if the layout types supported by the server match any of the client's 5985 supported layout types. 5987 5.12.2. Attribute 66: layout_alignment 5989 When a client holds layouts on files of a file system, the 5990 layout_alignment attribute indicates the preferred alignment for I/O 5991 to files on that file system. Where possible, the client should send 5992 READ and WRITE operations with offsets that are whole multiples of 5993 the layout_alignment attribute. 5995 5.12.3. Attribute 65: layout_blksize 5997 When a client holds layouts on files of a file system, the 5998 layout_blksize attribute indicates the preferred block size for I/O 5999 to files on that file system. Where possible, the client should send 6000 READ operations with a count argument that is a whole multiple of 6001 layout_blksize, and WRITE operations with a data argument of size 6002 that is a whole multiple of layout_blksize. 6004 5.12.4. Attribute 63: layout_hint 6006 The layout_hint attribute (see Section 3.3.19) may be set on newly 6007 created files to influence the metadata server's choice for the 6008 file's layout. If possible, this attribute is one of those set in 6009 the initial attributes within the OPEN operation. The metadata 6010 server may choose to ignore this attribute. 
The layout_hint 6011 attribute is a subset of the layout structure returned by LAYOUTGET. 6012 For example, instead of specifying particular devices, this would be 6013 used to suggest the stripe width of a file. The server 6014 implementation determines which fields within the layout will be 6015 used. 6017 5.12.5. Attribute 64: layout_type 6019 This attribute lists the layout type(s) available for a file. The 6020 value returned by the server is for informational purposes only. The 6021 client will use the LAYOUTGET operation to obtain the information 6022 needed in order to perform I/O, for example, the specific device 6023 information for the file and its layout. 6025 5.12.6. Attribute 68: mdsthreshold 6027 This attribute is a server-provided hint used to communicate to the 6028 client when it is more efficient to send READ and WRITE operations to 6029 the metadata server or the data server. The two types of thresholds 6030 described are file size thresholds and I/O size thresholds. If a 6031 file's size is smaller than the file size threshold, data accesses 6032 SHOULD be sent to the metadata server. If an I/O request has a 6033 length that is below the I/O size threshold, the I/O SHOULD be sent 6034 to the metadata server. Each threshold type is specified separately 6035 for read and write. 6037 The server MAY provide both types of thresholds for a file. If both 6038 file size and I/O size are provided, the client SHOULD reach or 6039 exceed both thresholds before sending its read or write requests to 6040 the data server. Alternatively, if only one of the specified 6041 thresholds is reached or exceeded, the I/O requests are sent to the 6042 metadata server. 6044 For each threshold type, a value of zero indicates no READ or WRITE 6045 should be sent to the metadata server, while a value of all ones 6046 indicates that all READs or WRITEs should be sent to the metadata 6047 server. 6049 The attribute is available on a per-filehandle basis. If the current 6050 filehandle refers to a non-pNFS file or directory, the metadata 6051 server should return an attribute that is representative of the 6052 filehandle's file system. It is suggested that this attribute is 6053 queried as part of the OPEN operation. Due to dynamic system 6054 changes, the client should not assume that the attribute will remain 6055 constant for any specific time period; thus, it should be 6056 periodically refreshed. 6058 5.13. Retention Attributes 6060 Retention is a concept whereby a file object can be placed in an 6061 immutable, undeletable, unrenamable state for a fixed or infinite 6062 duration of time. Once in this "retained" state, the file cannot be 6063 moved out of the state until the duration of retention has been 6064 reached. 6066 When retention is enabled, retention MUST extend to the data of the 6067 file, and the name of file. The server MAY extend retention to any 6068 other property of the file, including any subset of REQUIRED, 6069 RECOMMENDED, and named attributes, with the exceptions noted in this 6070 section. 6072 Servers MAY support or not support retention on any file object type. 6074 The five retention attributes are explained in the next subsections. 6076 5.13.1. Attribute 69: retention_get 6078 If retention is enabled for the associated file, this attribute's 6079 value represents the retention begin time of the file object. This 6080 attribute's value is only readable with the GETATTR operation and 6081 MUST NOT be modified by the SETATTR operation (Section 5.5). 
The 6082 value of the attribute consists of: 6084 const RET4_DURATION_INFINITE = 0xffffffffffffffff; 6085 struct retention_get4 { 6086 uint64_t rg_duration; 6087 nfstime4 rg_begin_time<1>; 6088 }; 6090 The field rg_duration is the duration in seconds indicating how long 6091 the file will be retained once retention is enabled. The field 6092 rg_begin_time is an array of up to one absolute time value. If the 6093 array is zero length, no beginning retention time has been 6094 established, and retention is not enabled. If rg_duration is equal 6095 to RET4_DURATION_INFINITE, the file, once retention is enabled, will 6096 be retained for an infinite duration. 6098 If (as soon as) rg_duration is zero, then rg_begin_time will be of 6099 zero length, and again, retention is not (no longer) enabled. 6101 5.13.2. Attribute 70: retention_set 6103 This attribute is used to set the retention duration and optionally 6104 enable retention for the associated file object. This attribute is 6105 only modifiable via the SETATTR operation and MUST NOT be retrieved 6106 by the GETATTR operation (Section 5.5). This attribute corresponds 6107 to retention_get. The value of the attribute consists of: 6109 struct retention_set4 { 6110 bool rs_enable; 6111 uint64_t rs_duration<1>; 6112 }; 6113 If the client sets rs_enable to TRUE, then it is enabling retention 6114 on the file object with the begin time of retention starting from the 6115 server's current time and date. The duration of the retention can 6116 also be provided if the rs_duration array is of length one. The 6117 duration is the time in seconds from the begin time of retention, and 6118 if set to RET4_DURATION_INFINITE, the file is to be retained forever. 6119 If retention is enabled, with no duration specified in either this 6120 SETATTR or a previous SETATTR, the duration defaults to zero seconds. 6121 The server MAY restrict the enabling of retention or the duration of 6122 retention on the basis of the ACE4_WRITE_RETENTION ACL permission. 6123 The enabling of retention MUST NOT prevent the enabling of event- 6124 based retention or the modification of the retention_hold attribute. 6126 The following rules apply to both the retention_set and retentevt_set 6127 attributes. 6129 * As long as retention is not enabled, the client is permitted to 6130 decrease the duration. 6132 * The duration can always be set to an equal or higher value, even 6133 if retention is enabled. Note that once retention is enabled, the 6134 actual duration (as returned by the retention_get or retentevt_get 6135 attributes; see Section 5.13.1 or Section 5.13.3) is constantly 6136 counting down to zero (one unit per second), unless the duration 6137 was set to RET4_DURATION_INFINITE. Thus, it will not be possible 6138 for the client to precisely extend the duration on a file that has 6139 retention enabled. 6141 * While retention is enabled, attempts to disable retention or 6142 decrease the retention's duration MUST fail with the error 6143 NFS4ERR_INVAL. 6145 * If the principal attempting to change retention_set or 6146 retentevt_set does not have ACE4_WRITE_RETENTION permissions, the 6147 attempt MUST fail with NFS4ERR_ACCESS. 6149 5.13.3. Attribute 71: retentevt_get 6151 Gets the event-based retention duration, and if enabled, the event- 6152 based retention begin time of the file object. This attribute is 6153 like retention_get, but refers to event-based retention. The event 6154 that triggers event-based retention is not defined by the NFSv4.1 6155 specification. 
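Since the rules listed above apply to both the retention_set and retentevt_set attributes, the following C sketch shows one way a server might check an incoming SETATTR of either attribute against them. It is illustrative only; the argument names and the helper flag have_write_retention_ace are assumptions made for this example, and the error values shown are those defined by the protocol's nfsstat4 enumeration.

      #include <stdbool.h>
      #include <stdint.h>

      #define NFS4_OK         0
      #define NFS4ERR_ACCESS  13
      #define NFS4ERR_INVAL   22

      int
      check_retention_set(bool retention_enabled,    /* from *_get     */
                          uint64_t current_duration, /* rg_duration    */
                          bool rs_duration_present,  /* rs_duration<1> */
                          uint64_t rs_duration,
                          bool have_write_retention_ace)
      {
          if (!have_write_retention_ace)
              return NFS4ERR_ACCESS;

          /* While retention is enabled, the duration may be set to an
           * equal or higher value but may not be decreased. */
          if (retention_enabled && rs_duration_present &&
              rs_duration < current_duration)
              return NFS4ERR_INVAL;

          /* As long as retention is not enabled, decreasing the
           * duration is permitted. */
          return NFS4_OK;
      }

A complete implementation would also reject, with NFS4ERR_INVAL, any attempt to disable retention that is already enabled, as required above.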
6157 5.13.4. Attribute 72: retentevt_set 6159 Sets the event-based retention duration, and optionally enables 6160 event-based retention on the file object. This attribute corresponds 6161 to retentevt_get and is like retention_set, but refers to event-based 6162 retention. When event-based retention is set, the file MUST be 6163 retained even if non-event-based retention has been set, and the 6164 duration of non-event-based retention has been reached. Conversely, 6165 when non-event-based retention has been set, the file MUST be 6166 retained even if event-based retention has been set, and the duration 6167 of event-based retention has been reached. The server MAY restrict 6168 the enabling of event-based retention or the duration of event-based 6169 retention on the basis of the ACE4_WRITE_RETENTION ACL permission. 6170 The enabling of event-based retention MUST NOT prevent the enabling 6171 of non-event-based retention or the modification of the 6172 retention_hold attribute. 6174 5.13.5. Attribute 73: retention_hold 6176 Gets or sets administrative retention holds, one hold per bit 6177 position. 6179 This attribute allows one to 64 administrative holds, one hold per 6180 bit on the attribute. If retention_hold is not zero, then the file 6181 MUST NOT be deleted, renamed, or modified, even if the duration on 6182 enabled event or non-event-based retention has been reached. The 6183 server MAY restrict the modification of retention_hold on the basis 6184 of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of 6185 administration retention holds does not prevent the enabling of 6186 event-based or non-event-based retention. 6188 If the principal attempting to change retention_hold does not have 6189 ACE4_WRITE_RETENTION_HOLD permissions, the attempt MUST fail with 6190 NFS4ERR_ACCESS. 6192 6. Access Control Attributes 6194 Access Control Lists (ACLs) are file attributes that specify fine- 6195 grained access control. This section covers the "acl", "dacl", 6196 "sacl", "aclsupport", "mode", and "mode_set_masked" file attributes 6197 and their interactions. Note that file attributes may apply to any 6198 file system object. 6200 6.1. Goals 6202 ACLs and modes represent two well-established models for specifying 6203 permissions. This section specifies requirements that attempt to 6204 meet the following goals: 6206 * If a server supports the mode attribute, it should provide 6207 reasonable semantics to clients that only set and retrieve the 6208 mode attribute. 6210 * If a server supports ACL attributes, it should provide reasonable 6211 semantics to clients that only set and retrieve those attributes. 6213 * On servers that support the mode attribute, if ACL attributes have 6214 never been set on an object, via inheritance or explicitly, the 6215 behavior should be traditional UNIX-like behavior. 6217 * On servers that support the mode attribute, if the ACL attributes 6218 have been previously set on an object, either explicitly or via 6219 inheritance: 6221 - Setting only the mode attribute should effectively control the 6222 traditional UNIX-like permissions of read, write, and execute 6223 on owner, owner_group, and other. 6225 - Setting only the mode attribute should provide reasonable 6226 security. For example, setting a mode of 000 should be enough 6227 to ensure that future OPEN operations for 6228 OPEN4_SHARE_ACCESS_READ or OPEN4_SHARE_ACCESS_WRITE by any 6229 principal fail, regardless of a previously existing or 6230 inherited ACL. 
6232 * NFSv4.1 may introduce different semantics relating to the mode and 6233 ACL attributes, but it does not render invalid any previously 6234 existing implementations. Additionally, this section provides 6235 clarifications based on previous implementations and discussions 6236 around them. 6238 * On servers that support both the mode and the acl or dacl 6239 attributes, the server must keep the two consistent with each 6240 other. The value of the mode attribute (with the exception of the 6241 three high-order bits described in Section 6.2.4) must be 6242 determined entirely by the value of the ACL, so that use of the 6243 mode is never required for anything other than setting the three 6244 high-order bits. See Section 6.4.1 for exact requirements. 6246 * When a mode attribute is set on an object, the ACL attributes may 6247 need to be modified in order to not conflict with the new mode. 6248 In such cases, it is desirable that the ACL keep as much 6249 information as possible. This includes information about 6250 inheritance, AUDIT and ALARM ACEs, and permissions granted and 6251 denied that do not conflict with the new mode. 6253 6.2. File Attributes Discussion 6255 6.2.1. Attribute 12: acl 6257 The NFSv4.1 ACL attribute contains an array of Access Control Entries 6258 (ACEs) that are associated with the file system object. Although the 6259 client can set and get the acl attribute, the server is responsible 6260 for using the ACL to perform access control. The client can use the 6261 OPEN or ACCESS operations to check access without modifying or 6262 reading data or metadata. 6264 The NFS ACE structure is defined as follows: 6266 typedef uint32_t acetype4; 6268 typedef uint32_t aceflag4; 6270 typedef uint32_t acemask4; 6272 struct nfsace4 { 6273 acetype4 type; 6274 aceflag4 flag; 6275 acemask4 access_mask; 6276 utf8str_mixed who; 6277 }; 6279 To determine if a request succeeds, the server processes each nfsace4 6280 entry in order. Only ACEs that have a "who" that matches the 6281 requester are considered. Each ACE is processed until all of the 6282 bits of the requester's access have been ALLOWED. Once a bit (see 6283 below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer 6284 considered in the processing of later ACEs. If an ACCESS_DENIED_ACE 6285 is encountered where the requester's access still has unALLOWED bits 6286 in common with the "access_mask" of the ACE, the request is denied. 6287 When the ACL is fully processed, if there are bits in the requester's 6288 mask that have not been ALLOWED or DENIED, access is denied. 6290 Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do 6291 not affect a requester's access, and instead are for triggering 6292 events as a result of a requester's access attempt. Therefore, AUDIT 6293 and ALARM ACEs are processed only after processing ALLOW and DENY 6294 ACEs. 6296 The NFSv4.1 ACL model is quite rich. Some server platforms may 6297 provide access-control functionality that goes beyond the UNIX-style 6298 mode attribute, but that is not as rich as the NFS ACL model. So 6299 that users can take advantage of this more limited functionality, the 6300 server may support the acl attributes by mapping between its ACL 6301 model and the NFSv4.1 ACL model. Servers must ensure that the ACL 6302 they actually store or enforce is at least as strict as the NFSv4 ACL 6303 that was set. It is tempting to accomplish this by rejecting any ACL 6304 that falls outside the small set that can be represented accurately. 
6305 However, such an approach can render ACLs unusable without special 6306 client-side knowledge of the server's mapping, which defeats the 6307 purpose of having a common NFSv4 ACL protocol. Therefore, servers 6308 should accept every ACL that they can without compromising security. 6309 To help accomplish this, servers may make a special exception, in the 6310 case of unsupported permission bits, to the rule that bits not 6311 ALLOWED or DENIED by an ACL must be denied. For example, a UNIX- 6312 style server might choose to silently allow read attribute 6313 permissions even though an ACL does not explicitly allow those 6314 permissions. (An ACL that explicitly denies permission to read 6315 attributes should still be rejected.) 6317 The situation is complicated by the fact that a server may have 6318 multiple modules that enforce ACLs. For example, the enforcement for 6319 NFSv4.1 access may be different from, but not weaker than, the 6320 enforcement for local access, and both may be different from the 6321 enforcement for access through other protocols such as SMB (Server 6322 Message Block). So it may be useful for a server to accept an ACL 6323 even if not all of its modules are able to support it. 6325 The guiding principle with regard to NFSv4 access is that the server 6326 must not accept ACLs that appear to make access to the file more 6327 restrictive than it really is. 6329 6.2.1.1. ACE Type 6331 The constants used for the type field (acetype4) are as follows: 6333 const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000; 6334 const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001; 6335 const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002; 6336 const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003; 6338 Only the ALLOWED and DENIED bits may be used in the dacl attribute, 6339 and only the AUDIT and ALARM bits may be used in the sacl attribute. 6340 All four are permitted in the acl attribute. 6342 +==============================+==============+=====================+ 6343 | Value | Abbreviation | Description | 6344 +==============================+==============+=====================+ 6345 | ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW | Explicitly grants | 6346 | | | the access | 6347 | | | defined in | 6348 | | | acemask4 to the | 6349 | | | file or | 6350 | | | directory. | 6351 +------------------------------+--------------+---------------------+ 6352 | ACE4_ACCESS_DENIED_ACE_TYPE | DENY | Explicitly denies | 6353 | | | the access | 6354 | | | defined in | 6355 | | | acemask4 to the | 6356 | | | file or | 6357 | | | directory. | 6358 +------------------------------+--------------+---------------------+ 6359 | ACE4_SYSTEM_AUDIT_ACE_TYPE | AUDIT | Log (in a system- | 6360 | | | dependent way) | 6361 | | | any access | 6362 | | | attempt to a file | 6363 | | | or directory that | 6364 | | | uses any of the | 6365 | | | access methods | 6366 | | | specified in | 6367 | | | acemask4. | 6368 +------------------------------+--------------+---------------------+ 6369 | ACE4_SYSTEM_ALARM_ACE_TYPE | ALARM | Generate an alarm | 6370 | | | (in a system- | 6371 | | | dependent way) | 6372 | | | when any access | 6373 | | | attempt is made | 6374 | | | to a file or | 6375 | | | directory for the | 6376 | | | access methods | 6377 | | | specified in | 6378 | | | acemask4. | 6379 +------------------------------+--------------+---------------------+ 6381 Table 6 6383 The "Abbreviation" column denotes how the types will be referred to 6384 throughout the rest of this section. 6386 6.2.1.2. 
Attribute 13: aclsupport 6388 A server need not support all of the above ACE types. This attribute 6389 indicates which ACE types are supported for the current file system. 6390 The bitmask constants used to represent the above definitions within 6391 the aclsupport attribute are as follows: 6393 const ACL4_SUPPORT_ALLOW_ACL = 0x00000001; 6394 const ACL4_SUPPORT_DENY_ACL = 0x00000002; 6395 const ACL4_SUPPORT_AUDIT_ACL = 0x00000004; 6396 const ACL4_SUPPORT_ALARM_ACL = 0x00000008; 6398 Servers that support either the ALLOW or DENY ACE type SHOULD support 6399 both ALLOW and DENY ACE types. 6401 Clients should not attempt to set an ACE unless the server claims 6402 support for that ACE type. If the server receives a request to set 6403 an ACE that it cannot store, it MUST reject the request with 6404 NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE 6405 that it can store but cannot enforce, the server SHOULD reject the 6406 request with NFS4ERR_ATTRNOTSUPP. 6408 Support for any of the ACL attributes is optional (albeit 6409 RECOMMENDED). However, a server that supports either of the new ACL 6410 attributes (dacl or sacl) MUST allow use of the new ACL attributes to 6411 access all of the ACE types that it supports. In other words, if 6412 such a server supports ALLOW or DENY ACEs, then it MUST support the 6413 dacl attribute, and if it supports AUDIT or ALARM ACEs, then it MUST 6414 support the sacl attribute. 6416 6.2.1.3. ACE Access Mask 6418 The bitmask constants used for the access mask field are as follows: 6420 const ACE4_READ_DATA = 0x00000001; 6421 const ACE4_LIST_DIRECTORY = 0x00000001; 6422 const ACE4_WRITE_DATA = 0x00000002; 6423 const ACE4_ADD_FILE = 0x00000002; 6424 const ACE4_APPEND_DATA = 0x00000004; 6425 const ACE4_ADD_SUBDIRECTORY = 0x00000004; 6426 const ACE4_READ_NAMED_ATTRS = 0x00000008; 6427 const ACE4_WRITE_NAMED_ATTRS = 0x00000010; 6428 const ACE4_EXECUTE = 0x00000020; 6429 const ACE4_DELETE_CHILD = 0x00000040; 6430 const ACE4_READ_ATTRIBUTES = 0x00000080; 6431 const ACE4_WRITE_ATTRIBUTES = 0x00000100; 6432 const ACE4_WRITE_RETENTION = 0x00000200; 6433 const ACE4_WRITE_RETENTION_HOLD = 0x00000400; 6435 const ACE4_DELETE = 0x00010000; 6436 const ACE4_READ_ACL = 0x00020000; 6437 const ACE4_WRITE_ACL = 0x00040000; 6438 const ACE4_WRITE_OWNER = 0x00080000; 6439 const ACE4_SYNCHRONIZE = 0x00100000; 6441 Note that some masks have coincident values, for example, 6442 ACE4_READ_DATA and ACE4_LIST_DIRECTORY. The mask entries 6443 ACE4_LIST_DIRECTORY, ACE4_ADD_FILE, and ACE4_ADD_SUBDIRECTORY are 6444 intended to be used with directory objects, while ACE4_READ_DATA, 6445 ACE4_WRITE_DATA, and ACE4_APPEND_DATA are intended to be used with 6446 non-directory objects. 6448 6.2.1.3.1. Discussion of Mask Attributes 6450 ACE4_READ_DATA 6452 Operation(s) affected: 6453 READ 6455 OPEN 6457 Discussion: 6458 Permission to read the data of the file. 6460 Servers SHOULD allow a user the ability to read the data of the 6461 file when only the ACE4_EXECUTE access mask bit is allowed. 6463 ACE4_LIST_DIRECTORY 6465 Operation(s) affected: 6466 READDIR 6468 Discussion: 6469 Permission to list the contents of a directory. 6471 ACE4_WRITE_DATA 6473 Operation(s) affected: 6474 WRITE 6476 OPEN 6478 SETATTR of size 6480 Discussion: 6481 Permission to modify a file's data. 6483 ACE4_ADD_FILE 6485 Operation(s) affected: 6486 CREATE 6488 LINK 6490 OPEN 6492 RENAME 6494 Discussion: 6495 Permission to add a new file in a directory. 
The CREATE 6496 operation is affected when nfs_ftype4 is NF4LNK, NF4BLK, 6497 NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because it 6498 is covered by ACE4_ADD_SUBDIRECTORY.) OPEN is affected when 6499 used to create a regular file. LINK and RENAME are always 6500 affected. 6502 ACE4_APPEND_DATA 6504 Operation(s) affected: 6505 WRITE 6507 OPEN 6509 SETATTR of size 6511 Discussion: 6512 The ability to modify a file's data, but only starting at EOF. 6513 This allows for the notion of append-only files, by allowing 6514 ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to the same user 6515 or group. If a file has an ACL such as the one described above 6516 and a WRITE request is made for somewhere other than EOF, the 6517 server SHOULD return NFS4ERR_ACCESS. 6519 ACE4_ADD_SUBDIRECTORY 6521 Operation(s) affected: 6522 CREATE 6524 RENAME 6526 Discussion: 6527 Permission to create a subdirectory in a directory. The CREATE 6528 operation is affected when nfs_ftype4 is NF4DIR. The RENAME 6529 operation is always affected. 6531 ACE4_READ_NAMED_ATTRS 6533 Operation(s) affected: 6534 OPENATTR 6536 Discussion: 6537 Permission to read the named attributes of a file or to look up 6538 the named attribute directory. OPENATTR is affected when it is 6539 not used to create a named attribute directory. This is when 6540 1) createdir is TRUE, but a named attribute directory already 6541 exists, or 2) createdir is FALSE. 6543 ACE4_WRITE_NAMED_ATTRS 6545 Operation(s) affected: 6546 OPENATTR 6548 Discussion: 6549 Permission to write the named attributes of a file or to create 6550 a named attribute directory. OPENATTR is affected when it is 6551 used to create a named attribute directory. This is when 6552 createdir is TRUE and no named attribute directory exists. The 6553 ability to check whether or not a named attribute directory 6554 exists depends on the ability to look it up; therefore, users 6555 also need the ACE4_READ_NAMED_ATTRS permission in order to 6556 create a named attribute directory. 6558 ACE4_EXECUTE 6559 Operation(s) affected: 6560 READ 6562 OPEN 6564 REMOVE 6566 RENAME 6568 LINK 6570 CREATE 6572 Discussion: 6573 Permission to execute a file. 6575 Servers SHOULD allow a user the ability to read the data of the 6576 file when only the ACE4_EXECUTE access mask bit is allowed. 6577 This is because there is no way to execute a file without 6578 reading the contents. Though a server may treat ACE4_EXECUTE 6579 and ACE4_READ_DATA bits identically when deciding to permit a 6580 READ operation, it SHOULD still allow the two bits to be set 6581 independently in ACLs, and MUST distinguish between them when 6582 replying to ACCESS operations. In particular, servers SHOULD 6583 NOT silently turn on one of the two bits when the other is set, 6584 as that would make it impossible for the client to correctly 6585 enforce the distinction between read and execute permissions. 6587 As an example, following a SETATTR of the following ACL: 6589 nfsuser:ACE4_EXECUTE:ALLOW 6591 A subsequent GETATTR of ACL for that file SHOULD return: 6593 nfsuser:ACE4_EXECUTE:ALLOW 6595 Rather than: 6597 nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW 6599 ACE4_EXECUTE 6601 Operation(s) affected: 6602 LOOKUP 6604 Discussion: 6605 Permission to traverse/search a directory. 6607 ACE4_DELETE_CHILD 6609 Operation(s) affected: 6610 REMOVE 6612 RENAME 6614 Discussion: 6615 Permission to delete a file or directory within a directory. 6616 See Section 6.2.1.3.2 for information on ACE4_DELETE and 6617 ACE4_DELETE_CHILD interact. 
6619 ACE4_READ_ATTRIBUTES 6621 Operation(s) affected: 6622 GETATTR of file system object attributes 6624 VERIFY 6626 NVERIFY 6628 READDIR 6630 Discussion: 6631 The ability to read basic attributes (non-ACLs) of a file. On 6632 a UNIX system, basic attributes can be thought of as the stat- 6633 level attributes. Allowing this access mask bit would mean 6634 that the entity can execute "ls -l" and stat. If a READDIR 6635 operation requests attributes, this mask must be allowed for 6636 the READDIR to succeed. 6638 ACE4_WRITE_ATTRIBUTES 6640 Operation(s) affected: 6641 SETATTR of time_access_set, time_backup, 6643 time_create, time_modify_set, mimetype, hidden, system 6645 Discussion: 6646 Permission to change the times associated with a file or 6647 directory to an arbitrary value. Also permission to change the 6648 mimetype, hidden, and system attributes. A user having 6649 ACE4_WRITE_DATA or ACE4_WRITE_ATTRIBUTES will be allowed to set 6650 the times associated with a file to the current server time. 6652 ACE4_WRITE_RETENTION 6653 Operation(s) affected: 6654 SETATTR of retention_set, retentevt_set. 6656 Discussion: 6657 Permission to modify the durations of event and non-event-based 6658 retention. Also permission to enable event and non-event-based 6659 retention. A server MAY behave such that setting 6660 ACE4_WRITE_ATTRIBUTES allows ACE4_WRITE_RETENTION. 6662 ACE4_WRITE_RETENTION_HOLD 6664 Operation(s) affected: 6665 SETATTR of retention_hold. 6667 Discussion: 6668 Permission to modify the administration retention holds. A 6669 server MAY map ACE4_WRITE_ATTRIBUTES to 6670 ACE_WRITE_RETENTION_HOLD. 6672 ACE4_DELETE 6674 Operation(s) affected: 6675 REMOVE 6677 Discussion: 6678 Permission to delete the file or directory. See 6679 Section 6.2.1.3.2 for information on ACE4_DELETE and 6680 ACE4_DELETE_CHILD interact. 6682 ACE4_READ_ACL 6684 Operation(s) affected: 6685 GETATTR of acl, dacl, or sacl 6687 NVERIFY 6689 VERIFY 6691 Discussion: 6692 Permission to read the ACL. 6694 ACE4_WRITE_ACL 6696 Operation(s) affected: 6697 SETATTR of acl and mode 6699 Discussion: 6700 Permission to write the acl and mode attributes. 6702 ACE4_WRITE_OWNER 6704 Operation(s) affected: 6705 SETATTR of owner and owner_group 6707 Discussion: 6708 Permission to write the owner and owner_group attributes. On 6709 UNIX systems, this is the ability to execute chown() and 6710 chgrp(). 6712 ACE4_SYNCHRONIZE 6714 Operation(s) affected: 6715 NONE 6717 Discussion: 6718 Permission to use the file object as a synchronization 6719 primitive for interprocess communication. This permission is 6720 not enforced or interpreted by the NFSv4.1 server on behalf of 6721 the client. 6723 Typically, the ACE4_SYNCHRONIZE permission is only meaningful 6724 on local file systems, i.e., file systems not accessed via 6725 NFSv4.1. The reason that the permission bit exists is that 6726 some operating environments, such as Windows, use 6727 ACE4_SYNCHRONIZE. 6729 For example, if a client copies a file that has 6730 ACE4_SYNCHRONIZE set from a local file system to an NFSv4.1 6731 server, and then later copies the file from the NFSv4.1 server 6732 to a local file system, it is likely that if ACE4_SYNCHRONIZE 6733 was set in the original file, the client will want it set in 6734 the second copy. The first copy will not have the permission 6735 set unless the NFSv4.1 server has the means to set the 6736 ACE4_SYNCHRONIZE bit. 
The second copy will not have the 6737 permission set unless the NFSv4.1 server has the means to 6738 retrieve the ACE4_SYNCHRONIZE bit. 6740 Server implementations need not provide the granularity of control 6741 that is implied by this list of masks. For example, POSIX-based 6742 systems might not distinguish ACE4_APPEND_DATA (the ability to append 6743 to a file) from ACE4_WRITE_DATA (the ability to modify existing 6744 contents); both masks would be tied to a single "write" permission 6745 [17]. When such a server returns attributes to the client, it would 6746 show both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the 6747 write permission is enabled. 6749 If a server receives a SETATTR request that it cannot accurately 6750 implement, it should err in the direction of more restricted access, 6751 except in the previously discussed cases of execute and read. For 6752 example, suppose a server cannot distinguish overwriting data from 6753 appending new data, as described in the previous paragraph. If a 6754 client submits an ALLOW ACE where ACE4_APPEND_DATA is set but 6755 ACE4_WRITE_DATA is not (or vice versa), the server should either turn 6756 off ACE4_APPEND_DATA or reject the request with NFS4ERR_ATTRNOTSUPP. 6758 6.2.1.3.2. ACE4_DELETE vs. ACE4_DELETE_CHILD 6760 Two access mask bits govern the ability to delete a directory entry: 6761 ACE4_DELETE on the object itself (the "target") and ACE4_DELETE_CHILD 6762 on the containing directory (the "parent"). 6764 Many systems also take the "sticky bit" (MODE4_SVTX) on a directory 6765 to allow unlink only to a user that owns either the target or the 6766 parent; on some such systems the decision also depends on whether the 6767 target is writable. 6769 Servers SHOULD allow unlink if either ACE4_DELETE is permitted on the 6770 target, or ACE4_DELETE_CHILD is permitted on the parent. (Note that 6771 this is true even if the parent or target explicitly denies one of 6772 these permissions.) 6774 If the ACLs in question neither explicitly ALLOW nor DENY either of 6775 the above, and if MODE4_SVTX is not set on the parent, then the 6776 server SHOULD allow the removal if and only if ACE4_ADD_FILE is 6777 permitted. In the case where MODE4_SVTX is set, the server may also 6778 require the remover to own either the parent or the target, or may 6779 require the target to be writable. 6781 This allows servers to support something close to traditional UNIX- 6782 like semantics, with ACE4_ADD_FILE taking the place of the write bit. 6784 6.2.1.4. ACE flag 6786 The bitmask constants used for the flag field are as follows: 6788 const ACE4_FILE_INHERIT_ACE = 0x00000001; 6789 const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; 6790 const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; 6791 const ACE4_INHERIT_ONLY_ACE = 0x00000008; 6792 const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; 6793 const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; 6794 const ACE4_IDENTIFIER_GROUP = 0x00000040; 6795 const ACE4_INHERITED_ACE = 0x00000080; 6796 A server need not support any of these flags. If the server supports 6797 flags that are similar to, but not exactly the same as, these flags, 6798 the implementation may define a mapping between the protocol-defined 6799 flags and the implementation-defined flags. 6801 For example, suppose a client tries to set an ACE with 6802 ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the 6803 server does not support any form of ACL inheritance, the server 6804 should reject the request with NFS4ERR_ATTRNOTSUPP. 
If the server 6805 supports a single "inherit ACE" flag that applies to both files and 6806 directories, the server may reject the request (i.e., requiring the 6807 client to set both the file and directory inheritance flags). The 6808 server may also accept the request and silently turn on the 6809 ACE4_DIRECTORY_INHERIT_ACE flag. 6811 6.2.1.4.1. Discussion of Flag Bits 6813 ACE4_FILE_INHERIT_ACE 6814 Any non-directory file in any sub-directory will get this ACE 6815 inherited. 6817 ACE4_DIRECTORY_INHERIT_ACE 6818 Can be placed on a directory and indicates that this ACE should be 6819 added to each new directory created. 6821 If this flag is set in an ACE in an ACL attribute to be set on a 6822 non-directory file system object, the operation attempting to set 6823 the ACL SHOULD fail with NFS4ERR_ATTRNOTSUPP. 6825 ACE4_NO_PROPAGATE_INHERIT_ACE 6826 Can be placed on a directory. This flag tells the server that 6827 inheritance of this ACE should stop at newly created child 6828 directories. 6830 ACE4_INHERIT_ONLY_ACE 6831 Can be placed on a directory but does not apply to the directory; 6832 ALLOW and DENY ACEs with this bit set do not affect access to the 6833 directory, and AUDIT and ALARM ACEs with this bit set do not 6834 trigger log or alarm events. Such ACEs only take effect once they 6835 are applied (with this bit cleared) to newly created files and 6836 directories as specified by the ACE4_FILE_INHERIT_ACE and 6837 ACE4_DIRECTORY_INHERIT_ACE flags. 6839 If this flag is present on an ACE, but neither 6840 ACE4_DIRECTORY_INHERIT_ACE nor ACE4_FILE_INHERIT_ACE is present, 6841 then an operation attempting to set such an attribute SHOULD fail 6842 with NFS4ERR_ATTRNOTSUPP. 6844 ACE4_SUCCESSFUL_ACCESS_ACE_FLAG and ACE4_FAILED_ACCESS_ACE_FLAG 6845 The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and 6846 ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits may be set only on 6847 ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE 6848 (ALARM) ACE types. If during the processing of the file's ACL, 6849 the server encounters an AUDIT or ALARM ACE that matches the 6850 principal attempting the OPEN, the server notes that fact, and the 6851 presence, if any, of the SUCCESS and FAILED flags encountered in 6852 the AUDIT or ALARM ACE. Once the server completes the ACL 6853 processing, it then notes if the operation succeeded or failed. 6854 If the operation succeeded, and if the SUCCESS flag was set for a 6855 matching AUDIT or ALARM ACE, then the appropriate AUDIT or ALARM 6856 event occurs. If the operation failed, and if the FAILED flag was 6857 set for the matching AUDIT or ALARM ACE, then the appropriate 6858 AUDIT or ALARM event occurs. Either or both of the SUCCESS or 6859 FAILED can be set, but if neither is set, the AUDIT or ALARM ACE 6860 is not useful. 6862 The previously described processing applies to ACCESS operations 6863 even when they return NFS4_OK. For the purposes of AUDIT and 6864 ALARM, we consider an ACCESS operation to be a "failure" if it 6865 fails to return a bit that was requested and supported. 6867 ACE4_IDENTIFIER_GROUP 6868 Indicates that the "who" refers to a GROUP as defined under UNIX 6869 or a GROUP ACCOUNT as defined under Windows. Clients and servers 6870 MUST ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who 6871 value equal to one of the special identifiers outlined in 6872 Section 6.2.1.5. 6874 ACE4_INHERITED_ACE 6875 Indicates that this ACE is inherited from a parent directory. 
A 6876 server that supports automatic inheritance will place this flag on 6877 any ACEs inherited from the parent directory when creating a new 6878 object. Client applications will use this to perform automatic 6879 inheritance. Clients and servers MUST clear this bit in the acl 6880 attribute; it may only be used in the dacl and sacl attributes. 6882 6.2.1.5. ACE Who 6884 The "who" field of an ACE is an identifier that specifies the 6885 principal or principals to whom the ACE applies. It may refer to a 6886 user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying 6887 which. 6889 There are several special identifiers that need to be understood 6890 universally, rather than in the context of a particular DNS domain. 6891 Some of these identifiers cannot be understood when an NFS client 6892 accesses the server, but have meaning when a local process accesses 6893 the file. The ability to display and modify these permissions is 6894 permitted over NFS, even if none of the access methods on the server 6895 understands the identifiers. 6897 +===============+==================================================+ 6898 | Who | Description | 6899 +===============+==================================================+ 6900 | OWNER | The owner of the file. | 6901 +---------------+--------------------------------------------------+ 6902 | GROUP | The group associated with the file. | 6903 +---------------+--------------------------------------------------+ 6904 | EVERYONE | The world, including the owner and owning group. | 6905 +---------------+--------------------------------------------------+ 6906 | INTERACTIVE | Accessed from an interactive terminal. | 6907 +---------------+--------------------------------------------------+ 6908 | NETWORK | Accessed via the network. | 6909 +---------------+--------------------------------------------------+ 6910 | DIALUP | Accessed as a dialup user to the server. | 6911 +---------------+--------------------------------------------------+ 6912 | BATCH | Accessed from a batch job. | 6913 +---------------+--------------------------------------------------+ 6914 | ANONYMOUS | Accessed without any authentication. | 6915 +---------------+--------------------------------------------------+ 6916 | AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS). | 6917 +---------------+--------------------------------------------------+ 6918 | SERVICE | Access from a system service. | 6919 +---------------+--------------------------------------------------+ 6921 Table 7 6923 To avoid conflict, these special identifiers are distinguished by an 6924 appended "@" and should appear in the form "xxxx@" (with no domain 6925 name after the "@"), for example, ANONYMOUS@. 6927 The ACE4_IDENTIFIER_GROUP flag MUST be ignored on entries with these 6928 special identifiers. When encoding entries with these special 6929 identifiers, the ACE4_IDENTIFIER_GROUP flag SHOULD be set to zero. 6931 6.2.1.5.1. Discussion of EVERYONE@ 6933 It is important to note that "EVERYONE@" is not equivalent to the 6934 UNIX "other" entity. This is because, by definition, UNIX "other" 6935 does not include the owner or owning group of a file. "EVERYONE@" 6936 means literally everyone, including the owner or owning group. 6938 6.2.2. Attribute 58: dacl 6940 The dacl attribute is like the acl attribute, but dacl allows just 6941 ALLOW and DENY ACEs. The dacl attribute supports automatic 6942 inheritance (see Section 6.4.3.2). 6944 6.2.3. 
Attribute 59: sacl 6946 The sacl attribute is like the acl attribute, but sacl allows just 6947 AUDIT and ALARM ACEs. The sacl attribute supports automatic 6948 inheritance (see Section 6.4.3.2). 6950 6.2.4. Attribute 33: mode 6952 The NFSv4.1 mode attribute is based on the UNIX mode bits. The 6953 following bits are defined: 6955 const MODE4_SUID = 0x800; /* set user id on execution */ 6956 const MODE4_SGID = 0x400; /* set group id on execution */ 6957 const MODE4_SVTX = 0x200; /* save text even after use */ 6958 const MODE4_RUSR = 0x100; /* read permission: owner */ 6959 const MODE4_WUSR = 0x080; /* write permission: owner */ 6960 const MODE4_XUSR = 0x040; /* execute permission: owner */ 6961 const MODE4_RGRP = 0x020; /* read permission: group */ 6962 const MODE4_WGRP = 0x010; /* write permission: group */ 6963 const MODE4_XGRP = 0x008; /* execute permission: group */ 6964 const MODE4_ROTH = 0x004; /* read permission: other */ 6965 const MODE4_WOTH = 0x002; /* write permission: other */ 6966 const MODE4_XOTH = 0x001; /* execute permission: other */ 6968 Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal 6969 identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and 6970 MODE4_XGRP apply to principals identified in the owner_group 6971 attribute but who are not identified in the owner attribute. Bits 6972 MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH apply to any principal that 6973 does not match that in the owner attribute and does not have a group 6974 matching that of the owner_group attribute. 6976 Bits within a mode other than those specified above are not defined 6977 by this protocol. A server MUST NOT return bits other than those 6978 defined above in a GETATTR or READDIR operation, and it MUST return 6979 NFS4ERR_INVAL if bits other than those defined above are set in a 6980 SETATTR, CREATE, OPEN, VERIFY, or NVERIFY operation. 6982 6.2.5. Attribute 74: mode_set_masked 6984 The mode_set_masked attribute is a write-only attribute that allows 6985 individual bits in the mode attribute to be set or reset, without 6986 changing others. It allows, for example, the bits MODE4_SUID, 6987 MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified 6988 any of the nine low-order mode bits devoted to permissions. 6990 In such instances that the nine low-order bits are left unmodified, 6991 then neither the acl nor the dacl attribute should be automatically 6992 modified as discussed in Section 6.4.1. 6994 The mode_set_masked attribute consists of two words, each in the form 6995 of a mode4. The first consists of the value to be applied to the 6996 current mode value and the second is a mask. Only bits set to one in 6997 the mask word are changed (set or reset) in the file's mode. All 6998 other bits in the mode remain unchanged. Bits in the first word that 6999 correspond to bits that are zero in the mask are ignored, except that 7000 undefined bits are checked for validity and can result in 7001 NFS4ERR_INVAL as described below. 7003 The mode_set_masked attribute is only valid in a SETATTR operation. 7004 If it is used in a CREATE or OPEN operation, the server MUST return 7005 NFS4ERR_INVAL. 7007 Bits not defined as valid in the mode attribute are not valid in 7008 either word of the mode_set_masked attribute. The server MUST return 7009 NFS4ERR_INVAL if any such bits are set to one in a SETATTR. If the 7010 mode and mode_set_masked attributes are both specified in the same 7011 SETATTR, the server MUST also return NFS4ERR_INVAL. 7013 6.3. 
Common Methods 7015 The requirements in this section will be referred to in future 7016 sections, especially Section 6.4. 7018 6.3.1. Interpreting an ACL 7020 6.3.1.1. Server Considerations 7022 The server uses the algorithm described in Section 6.2.1 to determine 7023 whether an ACL allows access to an object. However, the ACL might 7024 not be the sole determiner of access. For example: 7026 * In the case of a file system exported as read-only, the server may 7027 deny write access even though an object's ACL grants it. 7029 * Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL 7030 permissions to prevent a situation from arising in which there is 7031 no valid way to ever modify the ACL. 7033 * All servers will allow a user the ability to read the data of the 7034 file when only the execute permission is granted (i.e., if the ACL 7035 denies the user the ACE4_READ_DATA access and allows the user 7036 ACE4_EXECUTE, the server will allow the user to read the data of 7037 the file). 7039 * Many servers have the notion of owner-override in which the owner 7040 of the object is allowed to override accesses that are denied by 7041 the ACL. This may be helpful, for example, to allow users 7042 continued access to open files on which the permissions have 7043 changed. 7045 * Many servers have the notion of a "superuser" that has privileges 7046 beyond an ordinary user. The superuser may be able to read or 7047 write data or metadata in ways that would not be permitted by the 7048 ACL. 7050 * A retention attribute might also block access otherwise allowed by 7051 ACLs (see Section 5.13). 7053 6.3.1.2. Client Considerations 7055 Clients SHOULD NOT do their own access checks based on their 7056 interpretation of the ACL, but rather use the OPEN and ACCESS 7057 operations to do access checks. This allows the client to act on the 7058 results of having the server determine whether or not access should 7059 be granted based on its interpretation of the ACL. 7061 Clients must be aware of situations in which an object's ACL will 7062 define a certain access even though the server will not enforce it. 7063 In general, but especially in these situations, the client needs to 7064 do its part in the enforcement of access as defined by the ACL. To 7065 do this, the client MAY send the appropriate ACCESS operation prior 7066 to servicing the request of the user or application in order to 7067 determine whether the user or application should be granted the 7068 access requested. For examples in which the ACL may define accesses 7069 that the server doesn't enforce, see Section 6.3.1.1. 7071 6.3.2. Computing a Mode Attribute from an ACL 7073 The following method can be used to calculate the MODE4_R*, MODE4_W*, 7074 and MODE4_X* bits of a mode attribute, based upon an ACL. 7076 First, for each of the special identifiers OWNER@, GROUP@, and 7077 EVERYONE@, evaluate the ACL in order, considering only ALLOW and DENY 7078 ACEs for the identifier EVERYONE@ and for the identifier under 7079 consideration. The result of the evaluation will be an NFSv4 ACL 7080 mask showing exactly which bits are permitted to that identifier. 7082 Then translate the calculated mask for OWNER@, GROUP@, and EVERYONE@ 7083 into mode bits for, respectively, the user, group, and other, as 7084 follows: 7086 1. Set the read bit (MODE4_RUSR, MODE4_RGRP, or MODE4_ROTH) if and 7087 only if ACE4_READ_DATA is set in the corresponding mask. 7089 2. 
Set the write bit (MODE4_WUSR, MODE4_WGRP, or MODE4_WOTH) if and 7090 only if ACE4_WRITE_DATA and ACE4_APPEND_DATA are both set in the 7091 corresponding mask. 7093 3. Set the execute bit (MODE4_XUSR, MODE4_XGRP, or MODE4_XOTH), if 7094 and only if ACE4_EXECUTE is set in the corresponding mask. 7096 6.3.2.1. Discussion 7098 Some server implementations also add bits permitted to named users 7099 and groups to the group bits (MODE4_RGRP, MODE4_WGRP, and 7100 MODE4_XGRP). 7102 Implementations are discouraged from doing this, because it has been 7103 found to cause confusion for users who see members of a file's group 7104 denied access that the mode bits appear to allow. (The presence of 7105 DENY ACEs may also lead to such behavior, but DENY ACEs are expected 7106 to be more rarely used.) 7108 The same user confusion seen when fetching the mode also results if 7109 setting the mode does not effectively control permissions for the 7110 owner, group, and other users; this motivates some of the 7111 requirements that follow. 7113 6.4. Requirements 7115 The server that supports both mode and ACL must take care to 7116 synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the 7117 ACEs that have respective who fields of "OWNER@", "GROUP@", and 7118 "EVERYONE@". This way, the client can see if semantically equivalent 7119 access permissions exist whether the client asks for the owner, 7120 owner_group, and mode attributes or for just the ACL. 7122 In this section, much is made of the methods in Section 6.3.2. Many 7123 requirements refer to this section. But note that the methods have 7124 behaviors specified with "SHOULD". This is intentional, to avoid 7125 invalidating existing implementations that compute the mode according 7126 to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by 7127 actual permissions on owner, group, and other. 7129 6.4.1. Setting the Mode and/or ACL Attributes 7131 In the case where a server supports the sacl or dacl attribute, in 7132 addition to the acl attribute, the server MUST fail a request to set 7133 the acl attribute simultaneously with a dacl or sacl attribute. The 7134 error to be given is NFS4ERR_ATTRNOTSUPP. 7136 6.4.1.1. Setting Mode and not ACL 7138 When any of the nine low-order mode bits are subject to change, 7139 either because the mode attribute was set or because the 7140 mode_set_masked attribute was set and the mask included one or more 7141 bits from the nine low-order mode bits, and no ACL attribute is 7142 explicitly set, the acl and dacl attributes must be modified in 7143 accordance with the updated value of those bits. This must happen 7144 even if the value of the low-order bits is the same after the mode is 7145 set as before. 7147 Note that any AUDIT or ALARM ACEs (hence any ACEs in the sacl 7148 attribute) are unaffected by changes to the mode. 7150 In cases in which the permissions bits are subject to change, the acl 7151 and dacl attributes MUST be modified such that the mode computed via 7152 the method in Section 6.3.2 yields the low-order nine bits (MODE4_R*, 7153 MODE4_W*, MODE4_X*) of the mode attribute as modified by the 7154 attribute change. The ACL attributes SHOULD also be modified such 7155 that: 7157 1. If MODE4_RGRP is not set, entities explicitly listed in the ACL 7158 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 7159 ACE4_READ_DATA. 7161 2. 
If MODE4_WGRP is not set, entities explicitly listed in the ACL 7162 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 7163 ACE4_WRITE_DATA or ACE4_APPEND_DATA. 7165 3. If MODE4_XGRP is not set, entities explicitly listed in the ACL 7166 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 7167 ACE4_EXECUTE. 7169 Access mask bits other than those listed above, appearing in ALLOW 7170 ACEs, MAY also be disabled. 7172 Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect 7173 the permissions of the ACL itself, nor do ACEs of the type AUDIT and 7174 ALARM. As such, it is desirable to leave these ACEs unmodified when 7175 modifying the ACL attributes. 7177 Also note that the requirement may be met by discarding the acl and 7178 dacl, in favor of an ACL that represents the mode and only the mode. 7179 This is permitted, but it is preferable for a server to preserve as 7180 much of the ACL as possible without violating the above requirements. 7181 Discarding the ACL makes it effectively impossible for a file created 7182 with a mode attribute to inherit an ACL (see Section 6.4.3). 7184 6.4.1.2. Setting ACL and Not Mode 7186 When setting the acl or dacl and not setting the mode or 7187 mode_set_masked attributes, the permission bits of the mode need to 7188 be derived from the ACL. In this case, the ACL attribute SHOULD be 7189 set as given. The nine low-order bits of the mode attribute 7190 (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result 7191 of the method in Section 6.3.2. The three high-order bits of the 7192 mode (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged. 7194 6.4.1.3. Setting Both ACL and Mode 7196 When setting both the mode (includes use of either the mode attribute 7197 or the mode_set_masked attribute) and the acl or dacl attributes in 7198 the same operation, the attributes MUST be applied in this order: 7199 mode (or mode_set_masked), then ACL. The mode-related attribute is 7200 set as given, then the ACL attribute is set as given, possibly 7201 changing the final mode, as described above in Section 6.4.1.2. 7203 6.4.2. Retrieving the Mode and/or ACL Attributes 7205 This section applies only to servers that support both the mode and 7206 ACL attributes. 7208 Some server implementations may have a concept of "objects without 7209 ACLs", meaning that all permissions are granted and denied according 7210 to the mode attribute and that no ACL attribute is stored for that 7211 object. If an ACL attribute is requested of such a server, the 7212 server SHOULD return an ACL that does not conflict with the mode; 7213 that is to say, the ACL returned SHOULD represent the nine low-order 7214 bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as 7215 described in Section 6.3.2. 7217 For other server implementations, the ACL attribute is always present 7218 for every object. Such servers SHOULD store at least the three high- 7219 order bits of the mode attribute (MODE4_SUID, MODE4_SGID, 7220 MODE4_SVTX). The server SHOULD return a mode attribute if one is 7221 requested, and the low-order nine bits of the mode (MODE4_R*, 7222 MODE4_W*, MODE4_X*) MUST match the result of applying the method in 7223 Section 6.3.2 to the ACL attribute. 7225 6.4.3. Creating New Objects 7227 If a server supports any ACL attributes, it may use the ACL 7228 attributes on the parent directory to compute an initial ACL 7229 attribute for a newly created object. This will be referred to as 7230 the inherited ACL within this section. 
The act of adding one or more 7231 ACEs to the inherited ACL that are based upon ACEs in the parent 7232 directory's ACL will be referred to as inheriting an ACE within this 7233 section. 7235 Implementors should standardize what the behavior of CREATE and OPEN 7236 must be depending on the presence or absence of the mode and ACL 7237 attributes. 7239 1. If just the mode is given in the call: 7241 In this case, inheritance SHOULD take place, but the mode MUST be 7242 applied to the inherited ACL as described in Section 6.4.1.1, 7243 thereby modifying the ACL. 7245 2. If just the ACL is given in the call: 7247 In this case, inheritance SHOULD NOT take place, and the ACL as 7248 defined in the CREATE or OPEN will be set without modification, 7249 and the mode modified as in Section 6.4.1.2. 7251 3. If both mode and ACL are given in the call: 7253 In this case, inheritance SHOULD NOT take place, and both 7254 attributes will be set as described in Section 6.4.1.3. 7256 4. If neither mode nor ACL is given in the call: 7258 In the case where an object is being created without any initial 7259 attributes at all, e.g., an OPEN operation with an opentype4 of 7260 OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD 7261 NOT take place (note that EXCLUSIVE4_1 is a better choice of 7262 createmode4, since it does permit initial attributes). Instead, 7263 the server SHOULD set permissions to deny all access to the newly 7264 created object. It is expected that the appropriate client will 7265 set the desired attributes in a subsequent SETATTR operation, and 7266 the server SHOULD allow that operation to succeed, regardless of 7267 what permissions the object is created with. For example, an 7268 empty ACL denies all permissions, but the server should allow the 7269 owner's SETATTR to succeed even though WRITE_ACL is implicitly 7270 denied. 7272 In other cases, inheritance SHOULD take place, and no 7273 modifications to the ACL will happen. The mode attribute, if 7274 supported, MUST be as computed in Section 6.3.2, with the 7275 MODE4_SUID, MODE4_SGID, and MODE4_SVTX bits clear. If no 7276 inheritable ACEs exist on the parent directory, the rules for 7277 creating acl, dacl, or sacl attributes are implementation 7278 defined. If either the dacl or sacl attribute is supported, then 7279 the ACL4_DEFAULTED flag SHOULD be set on the newly created 7280 attributes. 7282 6.4.3.1. The Inherited ACL 7284 If the object being created is not a directory, the inherited ACL 7285 SHOULD NOT inherit ACEs from the parent directory ACL unless the 7286 ACE4_FILE_INHERIT_ACE flag is set. 7288 If the object being created is a directory, the inherited ACL should 7289 inherit all inheritable ACEs from the parent directory, that is, 7290 those that have the ACE4_FILE_INHERIT_ACE or 7291 ACE4_DIRECTORY_INHERIT_ACE flag set. If the inheritable ACE has 7292 ACE4_FILE_INHERIT_ACE set but ACE4_DIRECTORY_INHERIT_ACE is clear, 7293 the inherited ACE on the newly created directory MUST have the 7294 ACE4_INHERIT_ONLY_ACE flag set to prevent the directory from being 7295 affected by ACEs meant for non-directories. 7297 When a new directory is created, the server MAY split any inherited 7298 ACE that is both inheritable and effective (in other words, that has 7299 neither ACE4_INHERIT_ONLY_ACE nor ACE4_NO_PROPAGATE_INHERIT_ACE set), 7300 into two ACEs, one with no inheritance flags and one with 7301 ACE4_INHERIT_ONLY_ACE set.
(In the case of a dacl or sacl attribute, 7302 both of those ACEs SHOULD also have the ACE4_INHERITED_ACE flag set.) 7303 This makes it simpler to modify the effective permissions on the 7304 directory without modifying the ACE that is to be inherited to the 7305 new directory's children. 7307 6.4.3.2. Automatic Inheritance 7309 The acl attribute consists only of an array of ACEs, but the sacl 7310 (Section 6.2.3) and dacl (Section 6.2.2) attributes also include an 7311 additional flag field. 7313 struct nfsacl41 { 7314 aclflag4 na41_flag; 7315 nfsace4 na41_aces<>; 7316 }; 7318 The flag field applies to the entire sacl or dacl; three flag values 7319 are defined: 7321 const ACL4_AUTO_INHERIT = 0x00000001; 7322 const ACL4_PROTECTED = 0x00000002; 7323 const ACL4_DEFAULTED = 0x00000004; 7325 and all other bits must be cleared. The ACE4_INHERITED_ACE flag may 7326 be set in the ACEs of the sacl or dacl (whereas it must always be 7327 cleared in the acl). 7329 Together these features allow a server to support automatic 7330 inheritance, which we now explain in more detail. 7332 Inheritable ACEs are normally inherited by child objects only at the 7333 time that the child objects are created; later modifications to 7334 inheritable ACEs do not result in modifications to inherited ACEs on 7335 descendants. 7337 However, the dacl and sacl provide an OPTIONAL mechanism that allows 7338 a client application to propagate changes to inheritable ACEs to an 7339 entire directory hierarchy. 7341 A server that supports this performs inheritance at object creation 7342 time in the normal way, and SHOULD set the ACE4_INHERITED_ACE flag on 7343 any inherited ACEs as they are added to the new object. 7345 A client application such as an ACL editor may then propagate changes 7346 to inheritable ACEs on a directory by recursively traversing that 7347 directory's descendants and modifying each ACL encountered to remove 7348 any ACEs with the ACE4_INHERITED_ACE flag and to replace them by the 7349 new inheritable ACEs (also with the ACE4_INHERITED_ACE flag set). It 7350 uses the existing ACE inheritance flags in the obvious way to decide 7351 which ACEs to propagate. (Note that it may encounter further 7352 inheritable ACEs when descending the directory hierarchy and that 7353 those will also need to be taken into account when propagating 7354 inheritable ACEs to further descendants.) 7355 The reach of this propagation may be limited in two ways: first, 7356 automatic inheritance is not performed from any directory ACL that 7357 has the ACL4_AUTO_INHERIT flag cleared; and second, automatic 7358 inheritance stops wherever an ACL with the ACL4_PROTECTED flag is 7359 set, preventing modification of that ACL and also (if the ACL is set 7360 on a directory) of the ACL on any of the object's descendants. 7362 This propagation is performed independently for the sacl and the dacl 7363 attributes; thus, the ACL4_AUTO_INHERIT and ACL4_PROTECTED flags may 7364 be independently set for the sacl and the dacl, and propagation of 7365 one type of acl may continue down a hierarchy even where propagation 7366 of the other acl has stopped. 7368 New objects should be created with a dacl and a sacl that both have 7369 the ACL4_PROTECTED flag cleared and the ACL4_AUTO_INHERIT flag set to 7370 the same value as that on, respectively, the sacl or dacl of the 7371 parent object. 7373 Both the dacl and sacl attributes are RECOMMENDED, and a server may 7374 support one without supporting the other. 
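As an illustration of the propagation scheme just described, the sketch below shows the pruning tests an ACL-editing client might apply while walking a directory tree. The traversal and rewrite helpers are hypothetical stand-ins for the client's own bookkeeping; only the ACL4_AUTO_INHERIT and ACL4_PROTECTED flag values and the ACE4_INHERITED_ACE flag come from the protocol, and the C rendering of nfsacl41 is an adaptation of the XDR above.

   /* Non-normative sketch: client-side automatic inheritance propagation. */
   #include <stdint.h>

   #define ACL4_AUTO_INHERIT 0x00000001   /* flag values as defined above */
   #define ACL4_PROTECTED    0x00000002

   typedef uint32_t aclflag4;
   struct nfsace4;                        /* as defined by the protocol XDR */

   struct nfsacl41 {                      /* C adaptation of the XDR above  */
       aclflag4        na41_flag;
       struct nfsace4 *na41_aces;
       unsigned int    na41_count;
   };

   struct object;                         /* client's view of a file or dir */
   extern struct object  *first_child(struct object *dir);      /* hypothetical */
   extern struct object  *next_sibling(struct object *obj);     /* hypothetical */
   extern int              is_directory(const struct object *obj);
   extern struct nfsacl41 *get_dacl(struct object *obj);
   extern void              rewrite_inherited_aces(struct object *obj,
                                struct nfsace4 *inheritable, unsigned int n);

   void propagate_dacl(struct object *dir,
                       struct nfsace4 *inheritable, unsigned int n)
   {
       struct object *child;

       for (child = first_child(dir); child; child = next_sibling(child)) {
           struct nfsacl41 *dacl = get_dacl(child);

           /* An ACL with ACL4_PROTECTED set stops propagation to this
            * object and, if it is a directory, to everything below it. */
           if (dacl->na41_flag & ACL4_PROTECTED)
               continue;

           /* Remove ACEs carrying ACE4_INHERITED_ACE and append the new
            * inheritable ACEs, each with ACE4_INHERITED_ACE set; the ACE
            * inheritance flags determine which of them apply here.      */
           rewrite_inherited_aces(child, inheritable, n);

           /* Descend only where automatic inheritance is enabled on the
            * child directory's own dacl.  A full implementation would
            * also merge in any further inheritable ACEs found here.     */
           if (is_directory(child) && (dacl->na41_flag & ACL4_AUTO_INHERIT))
               propagate_dacl(child, inheritable, n);
       }
   }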
7376 A server that supports both the old acl attribute and one or both of 7377 the new dacl or sacl attributes must do so in such a way as to keep 7378 all three attributes consistent with each other. Thus, the ACEs 7379 reported in the acl attribute should be the union of the ACEs 7380 reported in the dacl and sacl attributes, except that the 7381 ACE4_INHERITED_ACE flag must be cleared from the ACEs in the acl. 7382 And of course a client that queries only the acl will be unable to 7383 determine the values of the sacl or dacl flag fields. 7385 When a client performs a SETATTR for the acl attribute, the server 7386 SHOULD set the ACL4_PROTECTED flag to true on both the sacl and the 7387 dacl. By using the acl attribute, as opposed to the dacl or sacl 7388 attributes, the client signals that it may not understand automatic 7389 inheritance, and thus cannot be trusted to set an ACL for which 7390 automatic inheritance would make sense. 7392 When a client application queries an ACL, modifies it, and sets it 7393 again, it should leave any ACEs marked with ACE4_INHERITED_ACE 7394 unchanged, in their original order, at the end of the ACL. If the 7395 application is unable to do this, it should set the ACL4_PROTECTED 7396 flag. This behavior is not enforced by servers, but violations of 7397 this rule may lead to unexpected results when applications perform 7398 automatic inheritance. 7400 If a server also supports the mode attribute, it SHOULD set the mode 7401 in such a way that leaves inherited ACEs unchanged, in their original 7402 order, at the end of the ACL. If it is unable to do so, it SHOULD 7403 set the ACL4_PROTECTED flag on the file's dacl. 7405 Finally, in the case where the request that creates a new file or 7406 directory does not also set permissions for that file or directory, 7407 and there are also no ACEs to inherit from the parent's directory, 7408 then the server's choice of ACL for the new object is implementation- 7409 dependent. In this case, the server SHOULD set the ACL4_DEFAULTED 7410 flag on the ACL it chooses for the new object. An application 7411 performing automatic inheritance takes the ACL4_DEFAULTED flag as a 7412 sign that the ACL should be completely replaced by one generated 7413 using the automatic inheritance rules. 7415 7. Single-Server Namespace 7417 This section describes the NFSv4 single-server namespace. Single- 7418 server namespaces may be presented directly to clients, or they may 7419 be used as a basis to form larger multi-server namespaces (e.g., 7420 site-wide or organization-wide) to be presented to clients, as 7421 described in Section 11. 7423 7.1. Server Exports 7425 On a UNIX server, the namespace describes all the files reachable by 7426 pathnames under the root directory or "/". On a Windows server, the 7427 namespace constitutes all the files on disks named by mapped disk 7428 letters. NFS server administrators rarely make the entire server's 7429 file system namespace available to NFS clients. More often, portions 7430 of the namespace are made available via an "export" feature. In 7431 previous versions of the NFS protocol, the root filehandle for each 7432 export is obtained through the MOUNT protocol; the client sent a 7433 string that identified the export name within the namespace and the 7434 server returned the root filehandle for that export. The MOUNT 7435 protocol also provided an EXPORTS procedure that enumerated the 7436 server's exports. 7438 7.2. 
Browsing Exports 7440 The NFSv4.1 protocol provides a root filehandle that clients can use 7441 to obtain filehandles for the exports of a particular server, via a 7442 series of LOOKUP operations within a COMPOUND, to traverse a path. A 7443 common user experience is to use a graphical user interface (perhaps 7444 a file "Open" dialog window) to find a file via progressive browsing 7445 through a directory tree. The client must be able to move from one 7446 export to another export via single-component, progressive LOOKUP 7447 operations. 7449 This style of browsing is not well supported by the NFSv3 protocol. 7450 In NFSv3, the client expects all LOOKUP operations to remain within a 7451 single server file system. For example, the device attribute will 7452 not change. This prevents a client from taking namespace paths that 7453 span exports. 7455 In the case of NFSv3, an automounter on the client can obtain a 7456 snapshot of the server's namespace using the EXPORTS procedure of the 7457 MOUNT protocol. If it understands the server's pathname syntax, it 7458 can create an image of the server's namespace on the client. The 7459 parts of the namespace that are not exported by the server are filled 7460 in with directories that might be constructed similarly to an NFSv4.1 7461 "pseudo file system" (see Section 7.3) that allows the user to browse 7462 from one mounted file system to another. There is a drawback to this 7463 representation of the server's namespace on the client: it is static. 7464 If the server administrator adds a new export, the client will be 7465 unaware of it. 7467 7.3. Server Pseudo File System 7469 NFSv4.1 servers avoid this namespace inconsistency by presenting all 7470 the exports for a given server within the framework of a single 7471 namespace for that server. An NFSv4.1 client uses LOOKUP and READDIR 7472 operations to browse seamlessly from one export to another. 7474 Where there are portions of the server namespace that are not 7475 exported, clients require some way of traversing those portions to 7476 reach actual exported file systems. A technique that servers may use 7477 to provide for this is to bridge the unexported portion of the 7478 namespace via a "pseudo file system" that provides a view of exported 7479 directories only. A pseudo file system has a unique fsid and behaves 7480 like a normal, read-only file system. 7482 Based on the construction of the server's namespace, it is possible 7483 that multiple pseudo file systems may exist. For example, 7485 /a pseudo file system 7486 /a/b real file system 7487 /a/b/c pseudo file system 7488 /a/b/c/d real file system 7490 Each of the pseudo file systems is considered a separate entity and 7491 therefore MUST have its own fsid, unique among all the fsids for that 7492 server. 7494 7.4. Multiple Roots 7496 Certain operating environments are sometimes described as having 7497 "multiple roots". In such environments, individual file systems are 7498 commonly represented by disk or volume names. NFSv4 servers for 7499 these platforms can construct a pseudo file system above these root 7500 names so that disk letters or volume names are simply directory names 7501 in the pseudo root. 7503 7.5. Filehandle Volatility 7505 The nature of the server's pseudo file system is that it is a logical 7506 representation of file system(s) available from the server. 7507 Therefore, the pseudo file system is most likely constructed 7508 dynamically when the server is first instantiated. 
It is expected 7509 that the pseudo file system may not have an on-disk counterpart from 7510 which persistent filehandles could be constructed. Even though it is 7511 preferable that the server provide persistent filehandles for the 7512 pseudo file system, the NFS client should expect that pseudo file 7513 system filehandles are volatile. This can be confirmed by checking 7514 the associated "fh_expire_type" attribute for those filehandles in 7515 question. If the filehandles are volatile, the NFS client must be 7516 prepared to recover a filehandle value (e.g., with a series of LOOKUP 7517 operations) when receiving an error of NFS4ERR_FHEXPIRED. 7519 Because it is quite likely that servers will implement pseudo file 7520 systems using volatile filehandles, clients need to be prepared for 7521 them, rather than assuming that all filehandles will be persistent. 7523 7.6. Exported Root 7525 If the server's root file system is exported, one might conclude that 7526 a pseudo file system is unneeded. This is not necessarily so. 7527 Assume the following file systems on a server: 7529 / fs1 (exported) 7530 /a fs2 (not exported) 7531 /a/b fs3 (exported) 7533 Because fs2 is not exported, fs3 cannot be reached with simple 7534 LOOKUPs. The server must bridge the gap with a pseudo file system. 7536 7.7. Mount Point Crossing 7538 The server file system environment may be constructed in such a way 7539 that one file system contains a directory that is 'covered' or 7540 mounted upon by a second file system. For example: 7542 /a/b (file system 1) 7543 /a/b/c/d (file system 2) 7545 The pseudo file system for this server may be constructed to look 7546 like: 7548 / (place holder/not exported) 7549 /a/b (file system 1) 7550 /a/b/c/d (file system 2) 7552 It is the server's responsibility to present the pseudo file system 7553 that is complete to the client. If the client sends a LOOKUP request 7554 for the path /a/b/c/d, the server's response is the filehandle of the 7555 root of the file system /a/b/c/d. In previous versions of the NFS 7556 protocol, the server would respond with the filehandle of directory 7557 /a/b/c/d within the file system /a/b. 7559 The NFS client will be able to determine if it crosses a server mount 7560 point by a change in the value of the "fsid" attribute. 7562 7.8. Security Policy and Namespace Presentation 7564 Because NFSv4 clients possess the ability to change the security 7565 mechanisms used, after determining what is allowed, by using SECINFO 7566 and SECINFO_NONAME, the server SHOULD NOT present a different view of 7567 the namespace based on the security mechanism being used by a client. 7568 Instead, it should present a consistent view and return 7569 NFS4ERR_WRONGSEC if an attempt is made to access data with an 7570 inappropriate security mechanism. 7572 If security considerations make it necessary to hide the existence of 7573 a particular file system, as opposed to all of the data within it, 7574 the server can apply the security policy of a shared resource in the 7575 server's namespace to components of the resource's ancestors. For 7576 example: 7578 / (place holder/not exported) 7579 /a/b (file system 1) 7580 /a/b/MySecretProject (file system 2) 7582 The /a/b/MySecretProject directory is a real file system and is the 7583 shared resource. Suppose the security policy for /a/b/ 7584 MySecretProject is Kerberos with integrity and it is desired to limit 7585 knowledge of the existence of this file system. 
In this case, the 7586 server should apply the same security policy to /a/b. This allows 7587 for knowledge of the existence of a file system to be secured when 7588 desirable. 7590 Where multiple, disjoint security mechanisms are in use in 7591 the server's resources, applying that sort of policy would result in 7592 the higher-level file system not being accessible using any security 7593 flavor. Therefore, that sort of configuration is not compatible with 7594 hiding the existence (as opposed to the contents) of a file system from clients using 7595 multiple disjoint sets of security flavors. 7597 In other circumstances, a desirable policy is for the security of a 7598 particular object in the server's namespace to include the union of 7599 all security mechanisms of all direct descendants. A common and 7600 convenient practice, unless strong security requirements dictate 7601 otherwise, is to make the entire pseudo file system accessible by 7602 all of the valid security mechanisms. 7604 Where there is concern about the security of data on the network, 7605 clients should use strong security mechanisms to access the pseudo 7606 file system in order to prevent man-in-the-middle attacks. 7608 8. State Management 7610 Integrating locking into the NFS protocol necessarily causes it to be 7611 stateful. With the inclusion of such features as share reservations, 7612 file and directory delegations, recallable layouts, and support for 7613 mandatory byte-range locking, the protocol becomes substantially more 7614 dependent on proper management of state than the traditional 7615 combination of NFS and NLM (Network Lock Manager) [54]. These 7616 features include expanded locking facilities, which provide some 7617 measure of inter-client exclusion, but the state also offers features 7618 not readily providable using a stateless model. There are three 7619 components to making this state manageable: 7621 * clear division between client and server 7623 * ability to reliably detect inconsistency in state between client 7624 and server 7626 * simple and robust recovery mechanisms 7628 In this model, the server owns the state information. The client 7629 requests changes in locks and the server responds with the changes 7630 made. Non-client-initiated changes in locking state are infrequent. 7631 The client receives prompt notification of such changes and can 7632 adjust its view of the locking state to reflect the server's changes. 7634 Individual pieces of state created by the server and passed to the 7635 client at its request are represented by 128-bit stateids. These 7636 stateids may represent a particular open file, a set of byte-range 7637 locks held by a particular owner, or a recallable delegation of 7638 privileges to access a file in particular ways or at a particular 7639 location. 7641 In all cases, there is a transition from the most general information 7642 that represents a client as a whole to the eventual lightweight 7643 stateid used for most client and server locking interactions. The 7644 details of this transition will vary with the type of object but it 7645 always starts with a client ID. 7647 8.1. Client and Session ID 7649 A client must establish a client ID (see Section 2.4) and then one or 7650 more session IDs (see Section 2.10) before performing any operations 7651 to open, byte-range lock, delegate, or obtain a layout for a file 7652 object. Each session ID is associated with a specific client ID, and 7653 thus serves as a shorthand reference to an NFSv4.1 client.
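The required ordering can be summarized by the following non-normative sketch. The client-library helpers (do_exchange_id, do_create_session) and the structures are hypothetical placeholders for the EXCHANGE_ID and CREATE_SESSION exchanges; they are not part of the protocol.

   /* Non-normative sketch: what must happen before state-creating operations. */
   #include <stdint.h>

   struct nfs_client  { uint64_t clientid; };            /* from EXCHANGE_ID   */
   struct nfs_session { unsigned char sessionid[16]; };  /* from CREATE_SESSION */

   extern int do_exchange_id(struct nfs_client *clnt);              /* hypothetical */
   extern int do_create_session(struct nfs_client *clnt,
                                struct nfs_session *sess);          /* hypothetical */

   int nfs41_establish_state_path(struct nfs_client *clnt,
                                  struct nfs_session *sess)
   {
       int status;

       /* 1. EXCHANGE_ID yields the client ID (Section 2.4). */
       if ((status = do_exchange_id(clnt)) != 0)
           return status;

       /* 2. CREATE_SESSION binds a session to that client ID (Section 2.10). */
       if ((status = do_create_session(clnt, sess)) != 0)
           return status;

       /* 3. Only now may OPEN, LOCK, or LAYOUTGET (and other state-creating
        *    operations) be sent, each within a COMPOUND that begins with
        *    SEQUENCE on this session.                                       */
       return 0;
   }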
7655 For some types of locking interactions, the client will represent 7656 some number of internal locking entities called "owners", which 7657 normally correspond to processes internal to the client. For other 7658 types of locking-related objects, such as delegations and layouts, no 7659 such intermediate entities are provided for, and the locking-related 7660 objects are considered to be transferred directly between the server 7661 and a unitary client. 7663 8.2. Stateid Definition 7665 When the server grants a lock of any type (including opens, byte- 7666 range locks, delegations, and layouts), it responds with a unique 7667 stateid that represents a set of locks (often a single lock) for the 7668 same file, of the same type, and sharing the same ownership 7669 characteristics. Thus, opens of the same file by different open- 7670 owners each have an identifying stateid. Similarly, each set of 7671 byte-range locks on a file owned by a specific lock-owner has its own 7672 identifying stateid. Delegations and layouts also have associated 7673 stateids by which they may be referenced. The stateid is used as a 7674 shorthand reference to a lock or set of locks, and given a stateid, 7675 the server can determine the associated state-owner or state-owners 7676 (in the case of an open-owner/lock-owner pair) and the associated 7677 filehandle. When stateids are used, the current filehandle must be 7678 the one associated with that stateid. 7680 All stateids associated with a given client ID are associated with a 7681 common lease that represents the claim of those stateids and the 7682 objects they represent to be maintained by the server. See 7683 Section 8.3 for a discussion of the lease. 7685 The server may assign stateids independently for different clients. 7686 A stateid with the same bit pattern for one client may designate an 7687 entirely different set of locks for a different client. The stateid 7688 is always interpreted with respect to the client ID associated with 7689 the current session. Stateids apply to all sessions associated with 7690 the given client ID, and the client may use a stateid obtained from 7691 one session on another session associated with the same client ID. 7693 8.2.1. Stateid Types 7695 With the exception of special stateids (see Section 8.2.3), each 7696 stateid represents locking objects of one of a set of types defined 7697 by the NFSv4.1 protocol. Note that in all these cases, where we 7698 speak of guarantee, it is understood there are situations such as a 7699 client restart, or lock revocation, that allow the guarantee to be 7700 voided. 7702 * Stateids may represent opens of files. 7704 Each stateid in this case represents the OPEN state for a given 7705 client ID/open-owner/filehandle triple. Such stateids are subject 7706 to change (with consequent incrementing of the stateid's seqid) in 7707 response to OPENs that result in upgrade and OPEN_DOWNGRADE 7708 operations. 7710 * Stateids may represent sets of byte-range locks. 7712 All locks held on a particular file by a particular owner and 7713 gotten under the aegis of a particular open file are associated 7714 with a single stateid with the seqid being incremented whenever 7715 LOCK and LOCKU operations affect that set of locks. 7717 * Stateids may represent file delegations, which are recallable 7718 guarantees by the server to the client that other clients will not 7719 reference or modify a particular file, until the delegation is 7720 returned. 
In NFSv4.1, file delegations may be obtained on both 7721 regular and non-regular files. 7723 A stateid represents a single delegation held by a client for a 7724 particular filehandle. 7726 * Stateids may represent directory delegations, which are recallable 7727 guarantees by the server to the client that other clients will not 7728 modify the directory, until the delegation is returned. 7730 A stateid represents a single delegation held by a client for a 7731 particular directory filehandle. 7733 * Stateids may represent layouts, which are recallable guarantees by 7734 the server to the client that particular files may be accessed via 7735 an alternate data access protocol at specific locations. Such 7736 access is limited to particular sets of byte-ranges and may 7737 proceed until those byte-ranges are reduced or the layout is 7738 returned. 7740 A stateid represents the set of all layouts held by a particular 7741 client for a particular filehandle with a given layout type. The 7742 seqid is updated as the layouts of that set of byte-ranges change, 7743 via layout stateid changing operations such as LAYOUTGET and 7744 LAYOUTRETURN. 7746 8.2.2. Stateid Structure 7748 Stateids are divided into two fields, a 96-bit "other" field 7749 identifying the specific set of locks and a 32-bit "seqid" sequence 7750 value. Except in the case of special stateids (see Section 8.2.3), a 7751 particular value of the "other" field denotes a set of locks of the 7752 same type (for example, byte-range locks, opens, delegations, or 7753 layouts), for a specific file or directory, and sharing the same 7754 ownership characteristics. The seqid designates a specific instance 7755 of such a set of locks, and is incremented to indicate changes in 7756 such a set of locks, either by the addition or deletion of locks from 7757 the set, a change in the byte-range they apply to, or an upgrade or 7758 downgrade in the type of one or more locks. 7760 When such a set of locks is first created, the server returns a 7761 stateid with seqid value of one. On subsequent operations that 7762 modify the set of locks, the server is required to increment the 7763 "seqid" field by one whenever it returns a stateid for the same 7764 state-owner/file/type combination and there is some change in the set 7765 of locks actually designated. In this case, the server will return a 7766 stateid with an "other" field the same as previously used for that 7767 state-owner/file/type combination, with an incremented "seqid" field. 7768 This pattern continues until the seqid is incremented past 7769 NFS4_UINT32_MAX, and one (not zero) is the next seqid value. 7771 The purpose of the incrementing of the seqid is to allow the server 7772 to communicate to the client the order in which operations that 7773 modified locking state associated with a stateid have been processed 7774 and to make it possible for the client to send requests that are 7775 conditional on the set of locks not having changed since the stateid 7776 in question was returned. 7778 Except for layout stateids (Section 12.5.3), when a client sends a 7779 stateid to the server, it has two choices with regard to the seqid 7780 sent. It may set the seqid to zero to indicate to the server that it 7781 wishes the most up-to-date seqid for that stateid's "other" field to 7782 be used. This would be the common choice in the case of a stateid 7783 sent with a READ or WRITE operation. 
It may also set a non-zero 7784 value, in which case the server checks whether that seqid is the correct 7785 one. In that case, the server is required to return 7786 NFS4ERR_OLD_STATEID if the seqid is lower than the most current value 7787 and NFS4ERR_BAD_STATEID if the seqid is greater than the most current 7788 value. This would be the common choice in the case of stateids sent 7789 with a CLOSE or OPEN_DOWNGRADE. Because OPENs may be sent in 7790 parallel for the same owner, a client might close a file without 7791 knowing that an OPEN upgrade had been done by the server, changing 7792 the lock in question. If CLOSE were sent with a zero seqid, the OPEN 7793 upgrade would be cancelled before the client even received an 7794 indication that an upgrade had happened. 7796 When a stateid is sent by the server to the client as part of a 7797 callback operation, it is not subject to checking for a current seqid 7798 and returning NFS4ERR_OLD_STATEID. This is because the client is not 7799 in a position to know the most up-to-date seqid and thus cannot 7800 verify it. Unless specially noted, the seqid value for a stateid 7801 sent by the server to the client as part of a callback is required to 7802 be zero with NFS4ERR_BAD_STATEID returned if it is not. 7804 In making comparisons between seqids, both by the client in 7805 determining the order of operations and by the server in determining 7806 whether the NFS4ERR_OLD_STATEID is to be returned, the possibility of 7807 the seqid having wrapped around past the NFS4_UINT32_MAX value needs 7808 to be taken into account. When two seqid values are being compared, 7809 the total count of slots for all sessions associated with the current 7810 client is used to do this. When one seqid value is less than this 7811 total slot count and another seqid value is greater than 7812 NFS4_UINT32_MAX minus the total slot count, the former is to be 7813 treated as lower than the latter, despite the fact that it is 7814 numerically greater. 7816 8.2.3. Special Stateids 7818 Stateid values whose "other" field is either all zeros or all ones 7819 are reserved. They may not be assigned by the server but have 7820 special meanings defined by the protocol. The particular meaning 7821 depends on whether the "other" field is all zeros or all ones and the 7822 specific value of the "seqid" field. 7824 The following combinations of "other" and "seqid" are defined in 7825 NFSv4.1: 7827 * When "other" and "seqid" are both zero, the stateid is treated as 7828 a special anonymous stateid, which can be used in READ, WRITE, and 7829 SETATTR requests to indicate the absence of any OPEN state 7830 associated with the request. When an anonymous stateid value is 7831 used and an existing open denies the form of access requested, 7832 then access will be denied to the request. This stateid MUST NOT 7833 be used on operations to data servers (Section 13.6). 7835 * When "other" and "seqid" are both all ones, the stateid is a 7836 special READ bypass stateid. When this value is used in WRITE or 7837 SETATTR, it is treated like the anonymous value. When used in 7838 READ, the server MAY grant access, even if access would normally 7839 be denied to READ operations. This stateid MUST NOT be used on 7840 operations to data servers. 7842 * When "other" is zero and "seqid" is one, the stateid represents 7843 the current stateid, which is whatever value is the last stateid 7844 returned by an operation within the COMPOUND.
In the case of an 7845 OPEN, the stateid returned for the open file and not the 7846 delegation is used. The stateid passed to the operation in place 7847 of the special value has its "seqid" value set to zero, except 7848 when the current stateid is used by the operation CLOSE or 7849 OPEN_DOWNGRADE. If there is no operation in the COMPOUND that has 7850 returned a stateid value, the server MUST return the error 7851 NFS4ERR_BAD_STATEID. As illustrated in Figure 6, if the value of 7852 a current stateid is a special stateid and the stateid of an 7853 operation's arguments has "other" set to zero and "seqid" set to 7854 one, then the server MUST return the error NFS4ERR_BAD_STATEID. 7856 * When "other" is zero and "seqid" is NFS4_UINT32_MAX, the stateid 7857 represents a reserved stateid value defined to be invalid. When 7858 this stateid is used, the server MUST return the error 7859 NFS4ERR_BAD_STATEID. 7861 If a stateid value is used that has all zeros or all ones in the 7862 "other" field but does not match one of the cases above, the server 7863 MUST return the error NFS4ERR_BAD_STATEID. 7865 Special stateids, unlike other stateids, are not associated with 7866 individual client IDs or filehandles and can be used with all valid 7867 client IDs and filehandles. In the case of a special stateid 7868 designating the current stateid, the current stateid value 7869 substituted for the special stateid is associated with a particular 7870 client ID and filehandle, and so, if it is used where the current 7871 filehandle does not match that associated with the current stateid, 7872 the operation to which the stateid is passed will return 7873 NFS4ERR_BAD_STATEID. 7875 8.2.4. Stateid Lifetime and Validation 7877 Stateids must remain valid until either a client restart or a server 7878 restart or until the client returns all of the locks associated with 7879 the stateid by means of an operation such as CLOSE or DELEGRETURN. 7880 If the locks are lost due to revocation, as long as the client ID is 7881 valid, the stateid remains a valid designation of that revoked state 7882 until the client frees it by using FREE_STATEID. Stateids associated 7883 with byte-range locks are an exception. They remain valid even if a 7884 LOCKU frees all remaining locks, so long as the open file with which 7885 they are associated remains open, unless the client frees the 7886 stateids via the FREE_STATEID operation. 7888 It should be noted that there are situations in which the client's 7889 locks become invalid, without the client requesting they be returned. 7890 These include lease expiration and a number of forms of lock 7891 revocation within the lease period. It is important to note that in 7892 these situations, the stateid remains valid and the client can use it 7893 to determine the disposition of the associated lost locks. 7895 An "other" value must never be reused for a different purpose (i.e., 7896 different filehandle, owner, or type of locks) within the context of 7897 a single client ID. A server may retain the "other" value for the 7898 same purpose beyond the point where it may otherwise be freed, but if 7899 it does so, it must maintain "seqid" continuity with previous values. 7901 One mechanism that may be used to satisfy the requirement that the 7902 server recognize invalid and out-of-date stateids is for the server 7903 to divide the "other" field of the stateid into two fields. 7905 * an index into a table of locking-state structures. 
7907 * a generation number that is incremented on each allocation of a 7908 table entry for a particular use. 7910 And then store in each table entry, 7912 * the client ID with which the stateid is associated. 7914 * the current generation number for the (at most one) valid stateid 7915 sharing this index value. 7917 * the filehandle of the file on which the locks are taken. 7919 * an indication of the type of stateid (open, byte-range lock, file 7920 delegation, directory delegation, layout). 7922 * the last "seqid" value returned corresponding to the current 7923 "other" value. 7925 * an indication of the current status of the locks associated with 7926 this stateid, in particular, whether these have been revoked and 7927 if so, for what reason. 7929 With this information, an incoming stateid can be validated and the 7930 appropriate error returned when necessary. Special and non-special 7931 stateids are handled separately. (See Section 8.2.3 for a discussion 7932 of special stateids.) 7934 Note that stateids are implicitly qualified by the current client ID, 7935 as derived from the client ID associated with the current session. 7936 Note, however, that the semantics of the session will prevent 7937 stateids associated with a previous client or server instance from 7938 being analyzed by this procedure. 7940 If server restart has resulted in an invalid client ID or a session 7941 ID that is invalid, SEQUENCE will return an error and the operation 7942 that takes a stateid as an argument will never be processed. 7944 If there has been a server restart where there is a persistent 7945 session and all leased state has been lost, then the session in 7946 question will, although valid, be marked as dead, and any operation 7947 not satisfied by means of the reply cache will receive the error 7948 NFS4ERR_DEADSESSION, and thus not be processed as indicated below. 7950 When a stateid is being tested and the "other" field is all zeros or 7951 all ones, a check that the "other" and "seqid" fields match a defined 7952 combination for a special stateid is done and the results determined 7953 as follows: 7955 * If the "other" and "seqid" fields do not match a defined 7956 combination associated with a special stateid, the error 7957 NFS4ERR_BAD_STATEID is returned. 7959 * If the special stateid is one designating the current stateid and 7960 there is a current stateid, then the current stateid is 7961 substituted for the special stateid and the checks appropriate to 7962 non-special stateids are performed. 7964 * If the combination is valid in general but is not appropriate to 7965 the context in which the stateid is used (e.g., an all-zero 7966 stateid is used when an OPEN stateid is required in a LOCK 7967 operation), the error NFS4ERR_BAD_STATEID is also returned. 7969 * Otherwise, the check is completed and the special stateid is 7970 accepted as valid. 7972 When a stateid is being tested, and the "other" field is neither all 7973 zeros nor all ones, the following procedure could be used to validate 7974 an incoming stateid and return an appropriate error, when necessary, 7975 assuming that the "other" field would be divided into a table index 7976 and an entry generation. 7978 * If the table index field is outside the range of the associated 7979 table, return NFS4ERR_BAD_STATEID. 7981 * If the selected table entry is of a different generation than that 7982 specified in the incoming stateid, return NFS4ERR_BAD_STATEID. 
7984 * If the selected table entry does not match the current filehandle, 7985 return NFS4ERR_BAD_STATEID. 7987 * If the client ID in the table entry does not match the client ID 7988 associated with the current session, return NFS4ERR_BAD_STATEID. 7990 * If the stateid represents revoked state, then return 7991 NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED, 7992 as appropriate. 7994 * If the stateid type is not valid for the context in which the 7995 stateid appears, return NFS4ERR_BAD_STATEID. Note that a stateid 7996 may be valid in general, as would be reported by the TEST_STATEID 7997 operation, but be invalid for a particular operation, as, for 7998 example, when a stateid that doesn't represent byte-range locks is 7999 passed to the non-from_open case of LOCK or to LOCKU, or when a 8000 stateid that does not represent an open is passed to CLOSE or 8001 OPEN_DOWNGRADE. In such cases, the server MUST return 8002 NFS4ERR_BAD_STATEID. 8004 * If the "seqid" field is not zero and it is greater than the 8005 current sequence value corresponding to the current "other" field, 8006 return NFS4ERR_BAD_STATEID. 8008 * If the "seqid" field is not zero and it is less than the current 8009 sequence value corresponding to the current "other" field, return 8010 NFS4ERR_OLD_STATEID. 8012 * Otherwise, the stateid is valid and the table entry should contain 8013 any additional information about the type of stateid and 8014 information associated with that particular type of stateid, such 8015 as the associated set of locks, e.g., open-owner and lock-owner 8016 information, as well as information on the specific locks, e.g., 8017 open modes and byte-ranges. 8019 8.2.5. Stateid Use for I/O Operations 8021 Clients performing I/O operations need to select an appropriate 8022 stateid based on the locks (including opens and delegations) held by 8023 the client and the various types of state-owners sending the I/O 8024 requests. SETATTR operations that change the file size are treated 8025 like I/O operations in this regard. 8027 The following rules, applied in order of decreasing priority, govern 8028 the selection of the appropriate stateid. In following these rules, 8029 the client will only consider locks of which it has actually received 8030 notification by an appropriate operation response or callback. Note 8031 that the rules are slightly different in the case of I/O to data 8032 servers when file layouts are being used (see Section 13.9.1). 8034 * If the client holds a delegation for the file in question, the 8035 delegation stateid SHOULD be used. 8037 * Otherwise, if the entity corresponding to the lock-owner (e.g., a 8038 process) sending the I/O has a byte-range lock stateid for the 8039 associated open file, then the byte-range lock stateid for that 8040 lock-owner and open file SHOULD be used. 8042 * If there is no byte-range lock stateid, then the OPEN stateid for 8043 the open file in question SHOULD be used. 8045 * Finally, if none of the above apply, then a special stateid SHOULD 8046 be used. 8048 Ignoring these rules may result in situations in which the server 8049 does not have information necessary to properly process the request. 8050 For example, when mandatory byte-range locks are in effect, if the 8051 stateid does not indicate the proper lock-owner, via a lock stateid, 8052 a request might be avoidably rejected. 
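The rules above amount to a simple priority order, illustrated by the non-normative sketch below with hypothetical client-side lookup helpers; none of these names are part of the protocol, and the stateid4 rendering merely mirrors the "other"/"seqid" split described in Section 8.2.2.

   /* Non-normative sketch: choosing the stateid for a READ or WRITE. */
   #include <stdint.h>
   #include <stddef.h>

   typedef struct {
       uint32_t      seqid;
       unsigned char other[12];       /* 96-bit "other" field */
   } stateid4;

   struct nfs_file;      /* client's per-open-file state (hypothetical)    */
   struct lock_owner;    /* client's lock-owner bookkeeping (hypothetical) */

   extern const stateid4 *held_delegation_stateid(struct nfs_file *fp);
   extern const stateid4 *byte_range_lock_stateid(struct nfs_file *fp,
                                                  struct lock_owner *lo);
   extern const stateid4 *open_stateid(struct nfs_file *fp);
   extern stateid4         anonymous_stateid(void);  /* all-zero special stateid */

   stateid4 choose_io_stateid(struct nfs_file *fp, struct lock_owner *lo)
   {
       const stateid4 *sid;

       if ((sid = held_delegation_stateid(fp)) != NULL)     /* rule 1 */
           return *sid;
       if ((sid = byte_range_lock_stateid(fp, lo)) != NULL) /* rule 2 */
           return *sid;
       if ((sid = open_stateid(fp)) != NULL)                /* rule 3 */
           return *sid;
       return anonymous_stateid();                          /* rule 4 */
   }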
8054 The server however should not try to enforce these ordering rules and 8055 should use whatever information is available to properly process I/O 8056 requests. In particular, when a client has a delegation for a given 8057 file, it SHOULD take note of this fact in processing a request, even 8058 if it is sent with a special stateid. 8060 8.2.6. Stateid Use for SETATTR Operations 8062 Because each operation is associated with a session ID and from that 8063 the clientid can be determined, operations do not need to include a 8064 stateid for the server to be able to determine whether they should 8065 cause a delegation to be recalled or are to be treated as done within 8066 the scope of the delegation. 8068 In the case of SETATTR operations, a stateid is present. In cases 8069 other than those that set the file size, the client may send either a 8070 special stateid or, when a delegation is held for the file in 8071 question, a delegation stateid. While the server SHOULD validate the 8072 stateid and may use the stateid to optimize the determination as to 8073 whether a delegation is held, it SHOULD note the presence of a 8074 delegation even when a special stateid is sent, and MUST accept a 8075 valid delegation stateid when sent. 8077 8.3. Lease Renewal 8079 Each client/server pair, as represented by a client ID, has a single 8080 lease. The purpose of the lease is to allow the client to indicate 8081 to the server, in a low-overhead way, that it is active, and thus 8082 that the server is to retain the client's locks. This arrangement 8083 allows the server to remove stale locking-related objects that are 8084 held by a client that has crashed or is otherwise unreachable, once 8085 the relevant lease expires. This in turn allows other clients to 8086 obtain conflicting locks without being delayed indefinitely by 8087 inactive or unreachable clients. It is not a mechanism for cache 8088 consistency and lease renewals may not be denied if the lease 8089 interval has not expired. 8091 Since each session is associated with a specific client (identified 8092 by the client's client ID), any operation sent on that session is an 8093 indication that the associated client is reachable. When a request 8094 is sent for a given session, successful execution of a SEQUENCE 8095 operation (or successful retrieval of the result of SEQUENCE from the 8096 reply cache) on an unexpired lease will result in the lease being 8097 implicitly renewed, for the standard renewal period (equal to the 8098 lease_time attribute). 8100 If the client ID's lease has not expired when the server receives a 8101 SEQUENCE operation, then the server MUST renew the lease. If the 8102 client ID's lease has expired when the server receives a SEQUENCE 8103 operation, the server MAY renew the lease; this depends on whether 8104 any state was revoked as a result of the client's failure to renew 8105 the lease before expiration. 8107 Absent other activity that would renew the lease, a COMPOUND 8108 consisting of a single SEQUENCE operation will suffice. The client 8109 should also take communication-related delays into account and take 8110 steps to ensure that the renewal messages actually reach the server 8111 in good time. For example: 8113 * When trunking is in effect, the client should consider sending 8114 multiple requests on different connections, in order to ensure 8115 that renewal occurs, even in the event of blockage in the path 8116 used for one of those connections. 
8118 * Transport retransmission delays might become so large as to 8119 approach or exceed the length of the lease period. This may be 8120 particularly likely when the server is unresponsive due to a 8121 restart; see Section 8.4.2.1. If the client implementation is not 8122 careful, transport retransmission delays can result in the client 8123 failing to detect a server restart before the grace period ends. 8124 The scenario is that the client is using a transport with 8125 exponential backoff, such that the maximum retransmission timeout 8126 exceeds both the grace period and the lease_time attribute. A 8127 network partition causes the client's connection's retransmission 8128 interval to back off, and even after the partition heals, the next 8129 transport-level retransmission is sent after the server has 8130 restarted and its grace period ends. 8132 The client MUST either recover from the ensuing NFS4ERR_NO_GRACE 8133 errors or it MUST ensure that, despite transport-level 8134 retransmission intervals that exceed the lease_time, a SEQUENCE 8135 operation is sent that renews the lease before expiration. The 8136 client can achieve this by associating a new connection with the 8137 session, and sending a SEQUENCE operation on it. However, if the 8138 attempt to establish a new connection is delayed for some reason 8139 (e.g., exponential backoff of the connection establishment 8140 packets), the client will have to abort the connection 8141 establishment attempt before the lease expires, and attempt to 8142 reconnect. 8144 If the server renews the lease upon receiving a SEQUENCE operation, 8145 the server MUST NOT allow the lease to expire while the rest of the 8146 operations in the COMPOUND procedure's request are still executing. 8147 Once the last operation has finished, and the response to COMPOUND 8148 has been sent, the server MUST set the lease to expire no sooner than 8149 the sum of current time and the value of the lease_time attribute. 8151 A client ID's lease can expire when it has been at least the lease 8152 interval (lease_time) since the last lease-renewing SEQUENCE 8153 operation was sent on any of the client ID's sessions and there are 8154 no active COMPOUND operations on any such sessions. 8156 Because the SEQUENCE operation is the basic mechanism to renew a 8157 lease, and because it must be done at least once for each lease 8158 period, it is the natural mechanism whereby the server will inform 8159 the client of changes in the lease status that the client needs to be 8160 informed of. The client should inspect the status flags 8161 (sr_status_flags) returned by sequence and take the appropriate 8162 action (see Section 18.46.3 for details). 8164 * The status bits SEQ4_STATUS_CB_PATH_DOWN and 8165 SEQ4_STATUS_CB_PATH_DOWN_SESSION indicate problems with the 8166 backchannel that the client may need to address in order to 8167 receive callback requests. 8169 * The status bits SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING and 8170 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED indicate problems with GSS 8171 contexts or RPCSEC_GSS handles for the backchannel that the client 8172 might have to address in order to allow callback requests to be 8173 sent. 8175 * The status bits SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, 8176 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, 8177 SEQ4_STATUS_ADMIN_STATE_REVOKED, and 8178 SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock 8179 revocation events. 
When these bits are set, the client should use 8180 TEST_STATEID to find what stateids have been revoked and use 8181 FREE_STATEID to acknowledge loss of the associated state. 8183 * The status bit SEQ4_STATUS_LEASE_MOVE indicates that 8184 responsibility for lease renewal has been transferred to one or 8185 more new servers. 8187 * The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that 8188 due to server restart the client must reclaim locking state. 8190 * The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates that the 8191 server has encountered an unrecoverable fault with the backchannel 8192 (e.g., it has lost track of a sequence ID for a slot in the 8193 backchannel). 8195 8.4. Crash Recovery 8197 A critical requirement in crash recovery is that both the client and 8198 the server know when the other has failed. Additionally, it is 8199 required that a client sees a consistent view of data across server 8200 restarts. All READ and WRITE operations that may have been queued 8201 within the client or network buffers must wait until the client has 8202 successfully recovered the locks protecting the READ and WRITE 8203 operations. Any that reach the server before the server can safely 8204 determine that the client has recovered enough locking state to be 8205 sure that such operations can be safely processed must be rejected. 8206 This will happen because either: 8208 * The state presented is no longer valid since it is associated with 8209 a now invalid client ID. In this case, the client will receive 8210 either an NFS4ERR_BADSESSION or NFS4ERR_DEADSESSION error, and any 8211 attempt to attach a new session to that invalid client ID will 8212 result in an NFS4ERR_STALE_CLIENTID error. 8214 * Subsequent recovery of locks may make execution of the operation 8215 inappropriate (NFS4ERR_GRACE). 8217 8.4.1. Client Failure and Recovery 8219 In the event that a client fails, the server may release the client's 8220 locks when the associated lease has expired. Conflicting locks from 8221 another client may only be granted after this lease expiration. As 8222 discussed in Section 8.3, when a client has not failed and re- 8223 establishes its lease before expiration occurs, requests for 8224 conflicting locks will not be granted. 8226 To minimize client delay upon restart, lock requests are associated 8227 with an instance of the client by a client-supplied verifier. This 8228 verifier is part of the client_owner4 sent in the initial EXCHANGE_ID 8229 call made by the client. The server returns a client ID as a result 8230 of the EXCHANGE_ID operation. The client then confirms the use of 8231 the client ID by establishing a session associated with that client 8232 ID (see Section 18.36.3 for a description of how this is done). All 8233 locks, including opens, byte-range locks, delegations, and layouts 8234 obtained by sessions using that client ID, are associated with that 8235 client ID. 8237 Since the verifier will be changed by the client upon each 8238 initialization, the server can compare a new verifier to the verifier 8239 associated with currently held locks and determine that they do not 8240 match. This signifies the client's new instantiation and subsequent 8241 loss (upon confirmation of the new client ID) of locking state. As a 8242 result, the server is free to release all locks held that are 8243 associated with the old client ID that was derived from the old 8244 verifier. 
At this point, conflicting locks from other clients, kept 8245 waiting while the lease had not yet expired, can be granted. In 8246 addition, all stateids associated with the old client ID can also be 8247 freed, as they are no longer reference-able.
8249 Note that the verifier must have the same uniqueness properties as 8250 the verifier for the COMMIT operation.
8252 8.4.2. Server Failure and Recovery
8254 If the server loses locking state (usually as a result of a restart), 8255 it must allow clients time to discover this fact and re-establish the 8256 lost locking state. The client must be able to re-establish the 8257 locking state without having the server deny valid requests because 8258 the server has granted conflicting access to another client. 8259 Likewise, if there is a possibility that clients have not yet re- 8260 established their locking state for a file and that such locking 8261 state might make it invalid to perform READ or WRITE operations, the 8262 server must restrict those operations as well. For example, if 8263 mandatory locks are a possibility, the server must disallow READ and WRITE operations for that file.
8265 A client can determine that loss of locking state has occurred via 8266 several methods.
8268 1. When a SEQUENCE (most common) or other operation returns 8269 NFS4ERR_BADSESSION, this may mean that the session has been 8270 destroyed but the client ID is still valid. The client sends a 8271 CREATE_SESSION request with the client ID to re-establish the 8272 session. If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID, 8273 the client must establish a new client ID (see Section 8.1) and 8274 re-establish its lock state with the new client ID, after the 8275 CREATE_SESSION operation succeeds (see Section 8.4.2.1).
8277 2. When a SEQUENCE (most common) or other operation on a persistent 8278 session returns NFS4ERR_DEADSESSION, this indicates that a 8279 session is no longer usable for new, i.e., not satisfied from the 8280 reply cache, operations. Once all pending operations are 8281 determined to be either performed before the retry or not 8282 performed, the client sends a CREATE_SESSION request with the 8283 client ID to re-establish the session. If CREATE_SESSION fails 8284 with NFS4ERR_STALE_CLIENTID, the client must establish a new 8285 client ID (see Section 8.1) and re-establish its lock state after 8286 the CREATE_SESSION, with the new client ID, succeeds 8287 (Section 8.4.2.1).
8289 3. When an operation, neither SEQUENCE nor preceded by SEQUENCE (for 8290 example, CREATE_SESSION, DESTROY_SESSION), returns 8291 NFS4ERR_STALE_CLIENTID, the client MUST establish a new client ID 8292 (Section 8.1) and re-establish its lock state (Section 8.4.2.1).
8294 8.4.2.1. State Reclaim
8296 When state information and the associated locks are lost as a result 8297 of a server restart, the protocol must provide a way to cause that 8298 state to be re-established. The approach used is to define, for most 8299 types of locking state (layouts are an exception), a request whose 8300 function is to allow the client to re-establish on the server a lock 8301 first obtained from a previous instance. Generally, these requests 8302 are variants of the requests normally used to create locks of that 8303 type and are referred to as "reclaim-type" requests, and the process 8304 of re-establishing such locks is referred to as "reclaiming" them.
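The following fragment is not part of the protocol definition; it is a minimal, non-normative sketch of how a client implementation might dispatch on the errors described in Section 8.4.2 when deciding how to recover. The helpers create_session(), exchange_id(), reclaim_locks(), and reclaim_complete() are hypothetical placeholders for the client's request logic, and the symbolic status values carry no protocol-defined numeric assignments here.

   /* Hypothetical client-side recovery dispatch (illustrative only).
    * The status values below are symbolic; the protocol's numeric
    * assignments appear in the error definitions and XDR. */

   #include <stdio.h>

   enum nfs_status {
       NFS4_OK,
       NFS4ERR_BADSESSION,
       NFS4ERR_DEADSESSION,
       NFS4ERR_STALE_CLIENTID
   };

   /* Placeholder helpers standing in for the client's RPC machinery. */
   static enum nfs_status create_session(void) { puts("CREATE_SESSION"); return NFS4_OK; }
   static void exchange_id(void)      { puts("EXCHANGE_ID"); }
   static void reclaim_locks(void)    { puts("OPEN/LOCK reclaim-type requests"); }
   static void reclaim_complete(void) { puts("global RECLAIM_COMPLETE"); }

   static void recover(enum nfs_status status)
   {
       switch (status) {
       case NFS4ERR_BADSESSION:
       case NFS4ERR_DEADSESSION:
           /* The session is gone (for a persistent session, only after
            * pending operations have been accounted for); rebuild it on
            * the existing client ID. */
           if (create_session() != NFS4ERR_STALE_CLIENTID)
               break;
           /* Otherwise the client ID itself is stale; fall through. */
       case NFS4ERR_STALE_CLIENTID:
           /* New client ID and session, then reclaim and signal completion. */
           exchange_id();
           create_session();
           reclaim_locks();
           reclaim_complete();
           break;
       default:
           break;
       }
   }

   int main(void) { recover(NFS4ERR_STALE_CLIENTID); return 0; }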
8306 Because each client must have an opportunity to reclaim all of the 8307 locks that it has without the possibility that some other client will 8308 be granted a conflicting lock, a "grace period" is devoted to the 8309 reclaim process. During this period, requests creating client IDs 8310 and sessions are handled normally, but locking requests are subject 8311 to special restrictions. Only reclaim-type locking requests are 8312 allowed, unless the server can reliably determine (through state 8313 persistently maintained across restart instances) that granting any 8314 such lock cannot possibly conflict with a subsequent reclaim. When a 8315 request is made to obtain a new lock (i.e., not a reclaim-type 8316 request) during the grace period and such a determination cannot be 8317 made, the server must return the error NFS4ERR_GRACE.
8319 Once a session is established using the new client ID, the client 8320 will use reclaim-type locking requests (e.g., LOCK operations with 8321 reclaim set to TRUE and OPEN operations with a claim type of 8322 CLAIM_PREVIOUS; see Section 9.11) to re-establish its locking state. 8323 Once this is done, or if there is no such locking state to reclaim, 8324 the client sends a global RECLAIM_COMPLETE operation, i.e., one with 8325 the rca_one_fs argument set to FALSE, to indicate that it has 8326 reclaimed all of the locking state that it will reclaim. Once a 8327 client sends such a RECLAIM_COMPLETE operation, it may attempt non- 8328 reclaim locking operations, although it might get an NFS4ERR_GRACE 8329 status result from each such operation until the period of special 8330 handling is over. See Section 11.11.9 for a discussion of the 8331 analogous handling of lock reclamation in the case of file systems 8332 transitioning from server to server.
8334 During the grace period, the server must reject READ and WRITE 8335 operations and non-reclaim locking requests (i.e., other LOCK and 8336 OPEN operations) with an error of NFS4ERR_GRACE, unless it can 8337 guarantee that these may be done safely, as described below.
8339 The grace period may last until all clients that are known to 8340 possibly have had locks have done a global RECLAIM_COMPLETE 8341 operation, indicating that they have finished reclaiming the locks 8342 they held before the server restart. This means that a client that 8343 has done a RECLAIM_COMPLETE must be prepared to receive an 8344 NFS4ERR_GRACE when attempting to acquire new locks. In order for the 8345 server to know that all clients with possible prior lock state have 8346 done a RECLAIM_COMPLETE, the server must maintain in stable storage a 8347 list of clients that may have such locks. The server may also terminate 8348 the grace period before all clients have done a global 8349 RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace period 8350 before a time equal to the lease period in order to give clients an 8351 opportunity to find out about the server restart, as a result of 8352 sending requests on associated sessions with a frequency governed by 8353 the lease time. Note that when a client does not send such requests 8354 (or they are sent by the client but not received by the server), it 8355 is possible for the grace period to expire before the client finds 8356 out that the server restart has occurred.
8358 Some additional time in order to allow a client to establish a new 8359 client ID and session and to effect lock reclaims may be added to the 8360 lease time.
Note that analogous rules apply to file system-specific 8361 grace periods discussed in Section 11.11.9.
8363 If the server can reliably determine that granting a non-reclaim 8364 request will not conflict with reclamation of locks by other clients, 8365 the NFS4ERR_GRACE error does not have to be returned even within the 8366 grace period, although NFS4ERR_GRACE must always be returned to 8367 clients attempting a non-reclaim lock request before doing their own 8368 global RECLAIM_COMPLETE. For the server to be able to service READ 8369 and WRITE operations during the grace period, it must again be able 8370 to guarantee that no possible conflict could arise between a 8371 potential reclaim locking request and the READ or WRITE operation. 8372 If the server is unable to offer that guarantee, the NFS4ERR_GRACE 8373 error must be returned to the client.
8375 For a server to provide simple, valid handling during the grace 8376 period, the easiest method is to simply reject all non-reclaim 8377 locking requests and READ and WRITE operations by returning the 8378 NFS4ERR_GRACE error. However, a server may keep information about 8379 granted locks in stable storage. With this information, the server 8380 could determine whether a locking, READ, or WRITE operation can be safely 8381 processed.
8383 For example, if the server maintained on stable storage summary 8384 information on whether mandatory locks exist, either mandatory byte- 8385 range locks or share reservations specifying deny modes, many 8386 requests could be allowed during the grace period. If it is known 8387 that no such share reservations exist, OPEN requests that do not 8388 specify deny modes may be safely granted. If, in addition, it is 8389 known that no mandatory byte-range locks exist, either through 8390 information stored on stable storage or simply because the server 8391 does not support such locks, READ and WRITE operations may be safely 8392 processed during the grace period. Another important case is where 8393 it is known that no mandatory byte-range locks exist, either because 8394 the server does not provide support for them or because their absence 8395 is known from persistently recorded data. In this case, READ and 8396 WRITE operations specifying stateids derived from reclaim-type 8397 operations may be validly processed during the grace period because 8398 the valid reclaim ensures that no lock subsequently 8399 granted can prevent the I/O.
8401 To reiterate, a server that allows non-reclaim lock and I/O 8402 requests to be processed during the grace period MUST determine 8403 that no lock subsequently reclaimed will be rejected and that no lock 8404 subsequently reclaimed would have prevented any I/O operation 8405 processed during the grace period.
8407 Clients should be prepared for the return of NFS4ERR_GRACE errors for 8408 non-reclaim lock and I/O requests. In this case, the client should 8409 employ a retry mechanism for the request. A delay (on the order of 8410 several seconds) between retries should be used to avoid overwhelming 8411 the server. Further discussion of the general issue is included in 8412 [55]. The client must account both for servers that can perform I/O 8413 and non-reclaim locking requests within the grace period and for 8414 those that cannot do so.
8416 A reclaim-type locking request outside the server's grace period can 8417 only succeed if the server can guarantee that no conflicting lock or 8418 I/O request has been granted since restart.
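The decision called for in the preceding paragraphs can be summarized in a small predicate. The sketch below is illustrative only and assumes hypothetical summary flags (no_deny_shares, no_mandatory_locks) of the kind a server might persist in stable storage before a restart; it is not a normative algorithm, and it applies only to non-reclaim requests.

   /* Illustrative grace-period check for a non-reclaim request:
    * return NFS4ERR_GRACE unless hypothetical summary information
    * recorded before the restart makes the request provably safe. */

   #include <stdbool.h>

   enum nfs_status { NFS4_OK, NFS4ERR_GRACE };
   enum req_kind   { REQ_OPEN_NO_DENY, REQ_OPEN_WITH_DENY, REQ_LOCK, REQ_IO };

   struct grace_info {
       bool in_grace;           /* grace period still in effect           */
       bool no_deny_shares;     /* no deny-mode share reservations existed
                                   before the restart                     */
       bool no_mandatory_locks; /* no mandatory byte-range locks existed
                                   (or they are unsupported)              */
   };

   static enum nfs_status check_grace(const struct grace_info *g,
                                      enum req_kind k)
   {
       if (!g->in_grace)
           return NFS4_OK;                /* normal processing applies */

       switch (k) {
       case REQ_OPEN_NO_DENY:
           /* Safe only if no deny-mode reservation could be reclaimed
            * later and conflict with this OPEN. */
           return g->no_deny_shares ? NFS4_OK : NFS4ERR_GRACE;
       case REQ_IO:
           /* READ/WRITE is safe only if neither a deny-mode share
            * reservation nor a mandatory byte-range lock could be
            * reclaimed and conflict with it. */
           return (g->no_deny_shares && g->no_mandatory_locks)
                      ? NFS4_OK : NFS4ERR_GRACE;
       default:
           /* Other non-reclaim locking requests are rejected. */
           return NFS4ERR_GRACE;
       }
   }

   int main(void)
   {
       struct grace_info g = { true, true, false };
       return check_grace(&g, REQ_IO) == NFS4_OK ? 0 : 1;
   }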
8420 A server may, upon restart, establish a new value for the lease 8421 period. Therefore, clients should, once a new client ID is 8422 established, refetch the lease_time attribute and use it as the basis 8423 for lease renewal for the lease associated with that server. 8424 However, the server must establish, for this restart event, a grace 8425 period at least as long as the lease period for the previous server 8426 instantiation. This allows the client state obtained during the 8427 previous server instance to be reliably re-established. 8429 The possibility exists that, because of server configuration events, 8430 the client will be communicating with a server different than the one 8431 on which the locks were obtained, as shown by the combination of 8432 eir_server_scope and eir_server_owner. This leads to the issue of if 8433 and when the client should attempt to reclaim locks previously 8434 obtained on what is being reported as a different server. The rules 8435 to resolve this question are as follows: 8437 * If the server scope is different, the client should not attempt to 8438 reclaim locks. In this situation, no lock reclaim is possible. 8439 Any attempt to re-obtain the locks with non-reclaim operations is 8440 problematic since there is no guarantee that the existing 8441 filehandles will be recognized by the new server, or that if 8442 recognized, they denote the same objects. It is best to treat the 8443 locks as having been revoked by the reconfiguration event. 8445 * If the server scope is the same, the client should attempt to 8446 reclaim locks, even if the eir_server_owner value is different. 8447 In this situation, it is the responsibility of the server to 8448 return NFS4ERR_NO_GRACE if it cannot provide correct support for 8449 lock reclaim operations, including the prevention of edge 8450 conditions. 8452 The eir_server_owner field is not used in making this determination. 8453 Its function is to specify trunking possibilities for the client (see 8454 Section 2.10.5) and not to control lock reclaim. 8456 8.4.2.1.1. Security Considerations for State Reclaim 8458 During the grace period, a client can reclaim state that it believes 8459 or asserts it had before the server restarted. Unless the server 8460 maintained a complete record of all the state the client had, the 8461 server has little choice but to trust the client. (Of course, if the 8462 server maintained a complete record, then it would not have to force 8463 the client to reclaim state after server restart.) While the server 8464 has to trust the client to tell the truth, the negative consequences 8465 for security are limited to enabling denial-of-service attacks in 8466 situations in which AUTH_SYS is supported. The fundamental rule for 8467 the server when processing reclaim requests is that it MUST NOT grant 8468 the reclaim if an equivalent non-reclaim request would not be granted 8469 during steady state due to access control or access conflict issues. 8470 For example, an OPEN request during a reclaim will be refused with 8471 NFS4ERR_ACCESS if the principal making the request does not have 8472 access to open the file according to the discretionary ACL 8473 (Section 6.2.2) on the file. 8475 Nonetheless, it is possible that a client operating in error or 8476 maliciously could, during reclaim, prevent another client from 8477 reclaiming access to state. 
For example, an attacker could send an 8478 OPEN reclaim operation with a deny mode that prevents another client 8479 from reclaiming the OPEN state it had before the server restarted. 8480 The attacker could perform the same denial of service during steady 8481 state prior to server restart, as long as the attacker had 8482 permissions. Given that the attack vectors are equivalent, the grace 8483 period does not offer any additional opportunity for denial of 8484 service, and any concerns about this attack vector, whether during 8485 grace or steady state, are addressed the same way: use RPCSEC_GSS for 8486 authentication and limit access to the file only to principals that 8487 the owner of the file trusts.
8489 Note that if prior to restart the server had client IDs with the 8490 EXCHGID4_FLAG_BIND_PRINC_STATEID (Section 18.35) capability set, then 8491 the server SHOULD record in stable storage the client owner and the 8492 principal that established the client ID via EXCHANGE_ID. If the 8493 server does not, then there is a risk that a client will be unable to 8494 reclaim state if it does not have a credential for a principal that 8495 was originally authorized to establish the state.
8497 8.4.3. Network Partitions and Recovery
8499 If the duration of a network partition is greater than the lease 8500 period provided by the server, the server will not have received a 8501 lease renewal from the client. If this occurs, the server may free 8502 all locks held for the client or it may allow the lock state to 8503 remain for a considerable period, subject to the constraint that if a 8504 request for a conflicting lock is made, locks associated with an 8505 expired lease do not prevent such a conflicting lock from being 8506 granted but MUST be revoked as necessary so as to avoid interfering 8507 with such conflicting requests.
8509 If the server chooses to delay freeing of lock state until there is a 8510 conflict, it may either free all of the client's locks once there is 8511 a conflict or it may only revoke the minimum set of locks necessary 8512 to allow conflicting requests. When it adopts the finer-grained 8513 approach, it must revoke all locks associated with a given stateid, 8514 even if the conflict is with only a subset of locks.
8516 When the server chooses to free all of a client's lock state, either 8517 immediately upon lease expiration or as a result of the first attempt 8518 to obtain a conflicting lock, the server may report the loss of 8519 lock state in a number of ways.
8521 The server may choose to invalidate the session and the associated 8522 client ID. In this case, once the client can communicate with the 8523 server, it will receive an NFS4ERR_BADSESSION error. Upon attempting 8524 to create a new session, it would get an NFS4ERR_STALE_CLIENTID. 8525 Upon creating the new client ID and new session, the client will 8526 attempt to reclaim locks. Normally, the server will not allow the 8527 client to reclaim locks, because the server will not be in its 8528 recovery grace period.
8530 Another possibility is for the server to maintain the session and 8531 client ID but for all stateids held by the client to become invalid 8532 or stale. Once the client can reach the server after such a network 8533 partition, the status returned by the SEQUENCE operation will 8534 indicate a loss of locking state; i.e., the flag 8535 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED will be set in sr_status_flags.
8536 In addition, all I/O submitted by the client with the now invalid 8537 stateids will fail with the server returning the error 8538 NFS4ERR_EXPIRED. Once the client learns of the loss of locking 8539 state, it will suitably notify the applications that held the 8540 invalidated locks. The client should then take action to free 8541 invalidated stateids, either by establishing a new client ID using a 8542 new verifier or by doing a FREE_STATEID operation to release each of 8543 the invalidated stateids. 8545 When the server adopts a finer-grained approach to revocation of 8546 locks when a client's lease has expired, only a subset of stateids 8547 will normally become invalid during a network partition. When the 8548 client can communicate with the server after such a network partition 8549 heals, the status returned by the SEQUENCE operation will indicate a 8550 partial loss of locking state 8551 (SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED). In addition, operations, 8552 including I/O submitted by the client, with the now invalid stateids 8553 will fail with the server returning the error NFS4ERR_EXPIRED. Once 8554 the client learns of the loss of locking state, it will use the 8555 TEST_STATEID operation on all of its stateids to determine which 8556 locks have been lost and then suitably notify the applications that 8557 held the invalidated locks. The client can then release the 8558 invalidated locking state and acknowledge the revocation of the 8559 associated locks by doing a FREE_STATEID operation on each of the 8560 invalidated stateids. 8562 When a network partition is combined with a server restart, there are 8563 edge conditions that place requirements on the server in order to 8564 avoid silent data corruption following the server restart. Two of 8565 these edge conditions are known, and are discussed below. 8567 The first edge condition arises as a result of the scenarios such as 8568 the following: 8570 1. Client A acquires a lock. 8572 2. Client A and server experience mutual network partition, such 8573 that client A is unable to renew its lease. 8575 3. Client A's lease expires, and the server releases the lock. 8577 4. Client B acquires a lock that would have conflicted with that of 8578 client A. 8580 5. Client B releases its lock. 8582 6. Server restarts. 8584 7. Network partition between client A and server heals. 8586 8. Client A connects to a new server instance and finds out about 8587 server restart. 8589 9. Client A reclaims its lock within the server's grace period. 8591 Thus, at the final step, the server has erroneously granted client 8592 A's lock reclaim. If client B modified the object the lock was 8593 protecting, client A will experience object corruption. 8595 The second known edge condition arises in situations such as the 8596 following: 8598 1. Client A acquires one or more locks. 8600 2. Server restarts. 8602 3. Client A and server experience mutual network partition, such 8603 that client A is unable to reclaim all of its locks within the 8604 grace period. 8606 4. Server's reclaim grace period ends. Client A has either no 8607 locks or an incomplete set of locks known to the server. 8609 5. Client B acquires a lock that would have conflicted with a lock 8610 of client A that was not reclaimed. 8612 6. Client B releases the lock. 8614 7. Server restarts a second time. 8616 8. Network partition between client A and server heals. 8618 9. Client A connects to new server instance and finds out about 8619 server restart. 8621 10. 
Client A reclaims its lock within the server's grace period. 8623 As with the first edge condition, the final step of the scenario of 8624 the second edge condition has the server erroneously granting client 8625 A's lock reclaim. 8627 Solving the first and second edge conditions requires either that the 8628 server always assumes after it restarts that some edge condition 8629 occurs, and thus returns NFS4ERR_NO_GRACE for all reclaim attempts, 8630 or that the server record some information in stable storage. The 8631 amount of information the server records in stable storage is in 8632 inverse proportion to how harsh the server intends to be whenever 8633 edge conditions arise. The server that is completely tolerant of all 8634 edge conditions will record in stable storage every lock that is 8635 acquired, removing the lock record from stable storage only when the 8636 lock is released. For the two edge conditions discussed above, the 8637 harshest a server can be, and still support a grace period for 8638 reclaims, requires that the server record in stable storage some 8639 minimal information. For example, a server implementation could, for 8640 each client, save in stable storage a record containing: 8642 * the co_ownerid field from the client_owner4 presented in the 8643 EXCHANGE_ID operation. 8645 * a boolean that indicates if the client's lease expired or if there 8646 was administrative intervention (see Section 8.5) to revoke a 8647 byte-range lock, share reservation, or delegation and there has 8648 been no acknowledgment, via FREE_STATEID, of such revocation. 8650 * a boolean that indicates whether the client may have locks that it 8651 believes to be reclaimable in situations in which the grace period 8652 was terminated, making the server's view of lock reclaimability 8653 suspect. The server will set this for any client record in stable 8654 storage where the client has not done a suitable RECLAIM_COMPLETE 8655 (global or file system-specific depending on the target of the 8656 lock request) before it grants any new (i.e., not reclaimed) lock 8657 to any client. 8659 Assuming the above record keeping, for the first edge condition, 8660 after the server restarts, the record that client A's lease expired 8661 means that another client could have acquired a conflicting byte- 8662 range lock, share reservation, or delegation. Hence, the server must 8663 reject a reclaim from client A with the error NFS4ERR_NO_GRACE. 8665 For the second edge condition, after the server restarts for a second 8666 time, the indication that the client had not completed its reclaims 8667 at the time at which the grace period ended means that the server 8668 must reject a reclaim from client A with the error NFS4ERR_NO_GRACE. 8670 When either edge condition occurs, the client's attempt to reclaim 8671 locks will result in the error NFS4ERR_NO_GRACE. When this is 8672 received, or after the client restarts with no lock state, the client 8673 will send a global RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is 8674 received, the server and client are again in agreement regarding 8675 reclaimable locks and both booleans in persistent storage can be 8676 reset, to be set again only when there is a subsequent event that 8677 causes lock reclaim operations to be questionable. 8679 Regardless of the level and approach to record keeping, the server 8680 MUST implement one of the following strategies (which apply to 8681 reclaims of share reservations, byte-range locks, and delegations): 8683 1. 
Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely 8684 unforgiving, but necessary if the server does not record lock 8685 state in stable storage.
8687 2. Record sufficient state in stable storage such that all known 8688 edge conditions involving server restart, including the two noted 8689 in this section, are detected. It is acceptable to erroneously 8690 recognize an edge condition and not allow a reclaim, when, with 8691 sufficient knowledge, it would be allowed. The error the server 8692 would return in this case is NFS4ERR_NO_GRACE. Note that it is 8693 not known if there are other edge conditions.
8695 In the event that, after a server restart, the server determines 8696 there is unrecoverable damage or corruption to the information in 8697 stable storage, then for all clients and/or locks that may be 8698 affected, the server MUST return NFS4ERR_NO_GRACE.
8700 A mandate for the client's handling of the NFS4ERR_NO_GRACE error is 8701 outside the scope of this specification, since the strategies for 8702 such handling are very dependent on the client's operating 8703 environment. However, one potential approach is described below.
8705 When the client receives NFS4ERR_NO_GRACE, it could examine the 8706 change attribute of the objects for which the client is trying to 8707 reclaim state, and use that to determine whether to re-establish the 8708 state via normal OPEN or LOCK operations. This is acceptable 8709 provided that the client's operating environment allows it. In other 8710 words, the client implementor is advised to document this behavior 8711 for users. The client could also inform the application that its 8712 byte-range lock or share reservations (whether or not they were 8713 delegated) have been lost, such as via a UNIX signal, a Graphical 8714 User Interface (GUI) pop-up window, etc. See Section 10.5 for a 8715 discussion of what the client should do to deal with unreclaimed 8716 delegations on client state.
8718 For further discussion of revocation of locks, see Section 8.5.
8720 8.5. Server Revocation of Locks
8722 At any point, the server can revoke locks held by a client, and the 8723 client must be prepared for this event. When the client detects that 8724 its locks have been or may have been revoked, the client is 8725 responsible for validating the state information between itself and 8726 the server. Validating locking state for the client means that it 8727 must verify or reclaim state for each lock currently held.
8729 The first occasion of lock revocation is upon server restart. Note 8730 that this includes situations in which sessions are persistent and 8731 locking state is lost. In this class of instances, the client will 8732 receive an error (NFS4ERR_STALE_CLIENTID) on an operation that takes 8733 a client ID (usually as part of recovery in response to a problem with 8734 the current session), and the client will proceed with normal crash 8735 recovery as described in Section 8.4.2.1.
8737 The second occasion of lock revocation is the inability to renew the 8738 lease before expiration, as discussed in Section 8.4.3. While this 8739 is considered a rare or unusual event, the client must be prepared to 8740 recover. The server is responsible for determining the precise 8741 consequences of the lease expiration, informing the client of the 8742 scope of the lock revocation decided upon.
The client then uses the 8743 status information provided by the server in the SEQUENCE results 8744 (field sr_status_flags, see Section 18.46.3) to synchronize its 8745 locking state with that of the server, in order to recover. 8747 The third occasion of lock revocation can occur as a result of 8748 revocation of locks within the lease period, either because of 8749 administrative intervention or because a recallable lock (a 8750 delegation or layout) was not returned within the lease period after 8751 having been recalled. While these are considered rare events, they 8752 are possible, and the client must be prepared to deal with them. 8753 When either of these events occurs, the client finds out about the 8754 situation through the status returned by the SEQUENCE operation. Any 8755 use of stateids associated with locks revoked during the lease period 8756 will receive the error NFS4ERR_ADMIN_REVOKED or 8757 NFS4ERR_DELEG_REVOKED, as appropriate. 8759 In all situations in which a subset of locking state may have been 8760 revoked, which include all cases in which locking state is revoked 8761 within the lease period, it is up to the client to determine which 8762 locks have been revoked and which have not. It does this by using 8763 the TEST_STATEID operation on the appropriate set of stateids. Once 8764 the set of revoked locks has been determined, the applications can be 8765 notified, and the invalidated stateids can be freed and lock 8766 revocation acknowledged by using FREE_STATEID. 8768 8.6. Short and Long Leases 8770 When determining the time period for the server lease, the usual 8771 lease trade-offs apply. A short lease is good for fast server 8772 recovery at a cost of increased operations to effect lease renewal 8773 (when there are no other operations during the period to effect lease 8774 renewal as a side effect). A long lease is certainly kinder and 8775 gentler to servers trying to handle very large numbers of clients. 8776 The number of extra requests to effect lock renewal drops in inverse 8777 proportion to the lease time. The disadvantages of a long lease 8778 include the possibility of slower recovery after certain failures. 8779 After server failure, a longer grace period may be required when some 8780 clients do not promptly reclaim their locks and do a global 8781 RECLAIM_COMPLETE. In the event of client failure, the longer period 8782 for a lease to expire will force conflicting requests to wait longer. 8784 A long lease is practical if the server can store lease state in 8785 stable storage. Upon recovery, the server can reconstruct the lease 8786 state from its stable storage and continue operation with its 8787 clients. 8789 8.7. Clocks, Propagation Delay, and Calculating Lease Expiration 8791 To avoid the need for synchronized clocks, lease times are granted by 8792 the server as a time delta. However, there is a requirement that the 8793 client and server clocks do not drift excessively over the duration 8794 of the lease. There is also the issue of propagation delay across 8795 the network, which could easily be several hundred milliseconds, as 8796 well as the possibility that requests will be lost and need to be 8797 retransmitted. 8799 To take propagation delay into account, the client should subtract it 8800 from lease times (e.g., if the client estimates the one-way 8801 propagation delay as 200 milliseconds, then it can assume that the 8802 lease is already 200 milliseconds old when it gets it). 
In addition, 8803 it will take another 200 milliseconds to get a response back to the 8804 server. So the client must send a lease renewal or write data back 8805 to the server at least 400 milliseconds before the lease would 8806 expire. If the propagation delay varies over the life of the lease 8807 (e.g., the client is on a mobile host), the client will need to 8808 continuously subtract the increase in propagation delay from the 8809 lease times. 8811 The server's lease period configuration should take into account the 8812 network distance of the clients that will be accessing the server's 8813 resources. It is expected that the lease period will take into 8814 account the network propagation delays and other network delay 8815 factors for the client population. Since the protocol does not allow 8816 for an automatic method to determine an appropriate lease period, the 8817 server's administrator may have to tune the lease period. 8819 8.8. Obsolete Locking Infrastructure from NFSv4.0 8821 There are a number of operations and fields within existing 8822 operations that no longer have a function in NFSv4.1. In one way or 8823 another, these changes are all due to the implementation of sessions 8824 that provide client context and exactly once semantics as a base 8825 feature of the protocol, separate from locking itself. 8827 The following NFSv4.0 operations MUST NOT be implemented in NFSv4.1. 8828 The server MUST return NFS4ERR_NOTSUPP if these operations are found 8829 in an NFSv4.1 COMPOUND. 8831 * SETCLIENTID since its function has been replaced by EXCHANGE_ID. 8833 * SETCLIENTID_CONFIRM since client ID confirmation now happens by 8834 means of CREATE_SESSION. 8836 * OPEN_CONFIRM because state-owner-based seqids have been replaced 8837 by the sequence ID in the SEQUENCE operation. 8839 * RELEASE_LOCKOWNER because lock-owners with no associated locks do 8840 not have any sequence-related state and so can be deleted by the 8841 server at will. 8843 * RENEW because every SEQUENCE operation for a session causes lease 8844 renewal, making a separate operation superfluous. 8846 Also, there are a number of fields, present in existing operations, 8847 related to locking that have no use in minor version 1. They were 8848 used in minor version 0 to perform functions now provided in a 8849 different fashion. 8851 * Sequence ids used to sequence requests for a given state-owner and 8852 to provide retry protection, now provided via sessions. 8854 * Client IDs used to identify the client associated with a given 8855 request. Client identification is now available using the client 8856 ID associated with the current session, without needing an 8857 explicit client ID field. 8859 Such vestigial fields in existing operations have no function in 8860 NFSv4.1 and are ignored by the server. Note that client IDs in 8861 operations new to NFSv4.1 (such as CREATE_SESSION and 8862 DESTROY_CLIENTID) are not ignored. 8864 9. File Locking and Share Reservations 8866 To support Win32 share reservations, it is necessary to provide 8867 operations that atomically open or create files. Having a separate 8868 share/unshare operation would not allow correct implementation of the 8869 Win32 OpenFile API. In order to correctly implement share semantics, 8870 the previous NFS protocol mechanisms used when a file is opened or 8871 created (LOOKUP, CREATE, ACCESS) need to be replaced. 
The NFSv4.1 8872 protocol defines an OPEN operation that is capable of atomically 8873 looking up, creating, and locking a file on the server. 8875 9.1. Opens and Byte-Range Locks 8877 It is assumed that manipulating a byte-range lock is rare when 8878 compared to READ and WRITE operations. It is also assumed that 8879 server restarts and network partitions are relatively rare. 8880 Therefore, it is important that the READ and WRITE operations have a 8881 lightweight mechanism to indicate if they possess a held lock. A 8882 LOCK operation contains the heavyweight information required to 8883 establish a byte-range lock and uniquely define the owner of the 8884 lock. 8886 9.1.1. State-Owner Definition 8888 When opening a file or requesting a byte-range lock, the client must 8889 specify an identifier that represents the owner of the requested 8890 lock. This identifier is in the form of a state-owner, represented 8891 in the protocol by a state_owner4, a variable-length opaque array 8892 that, when concatenated with the current client ID, uniquely defines 8893 the owner of a lock managed by the client. This may be a thread ID, 8894 process ID, or other unique value. 8896 Owners of opens and owners of byte-range locks are separate entities 8897 and remain separate even if the same opaque arrays are used to 8898 designate owners of each. The protocol distinguishes between open- 8899 owners (represented by open_owner4 structures) and lock-owners 8900 (represented by lock_owner4 structures). 8902 Each open is associated with a specific open-owner while each byte- 8903 range lock is associated with a lock-owner and an open-owner, the 8904 latter being the open-owner associated with the open file under which 8905 the LOCK operation was done. Delegations and layouts, on the other 8906 hand, are not associated with a specific owner but are associated 8907 with the client as a whole (identified by a client ID). 8909 9.1.2. Use of the Stateid and Locking 8911 All READ, WRITE, and SETATTR operations contain a stateid. For the 8912 purposes of this section, SETATTR operations that change the size 8913 attribute of a file are treated as if they are writing the area 8914 between the old and new sizes (i.e., the byte-range truncated or 8915 added to the file by means of the SETATTR), even where SETATTR is not 8916 explicitly mentioned in the text. The stateid passed to one of these 8917 operations must be one that represents an open, a set of byte-range 8918 locks, or a delegation, or it may be a special stateid representing 8919 anonymous access or the special bypass stateid. 8921 If the state-owner performs a READ or WRITE operation in a situation 8922 in which it has established a byte-range lock or share reservation on 8923 the server (any OPEN constitutes a share reservation), the stateid 8924 (previously returned by the server) must be used to indicate what 8925 locks, including both byte-range locks and share reservations, are 8926 held by the state-owner. If no state is established by the client, 8927 either a byte-range lock or a share reservation, a special stateid 8928 for anonymous state (zero as the value for "other" and "seqid") is 8929 used. (See Section 8.2.3 for a description of 'special' stateids in 8930 general.) Regardless of whether a stateid for anonymous state or a 8931 stateid returned by the server is used, if there is a conflicting 8932 share reservation or mandatory byte-range lock held on the file, the 8933 server MUST refuse to service the READ or WRITE operation. 
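As an illustration of the special anonymous stateid mentioned above (and of the READ bypass stateid discussed later in this section), the following non-normative sketch shows how the two special stateid values can be represented and recognized, using the stateid4 layout from the protocol's XDR (a 32-bit "seqid" and a 12-byte "other" field). The helper name stateid_is_special() is purely illustrative.

   /* Non-normative sketch of two special stateids (see Section 8.2.3):
    * the anonymous stateid (all fields zero) and the READ bypass
    * stateid (all bits of "seqid" and "other" set to one). */

   #include <stdint.h>
   #include <string.h>
   #include <stdbool.h>

   #define NFS4_OTHER_SIZE 12

   struct stateid4 {
       uint32_t seqid;
       uint8_t  other[NFS4_OTHER_SIZE];
   };

   static const struct stateid4 anonymous_stateid = { 0, { 0 } };

   static const struct stateid4 read_bypass_stateid = {
       UINT32_MAX,
       { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
         0xff, 0xff, 0xff, 0xff, 0xff, 0xff }
   };

   static bool stateid_is_special(const struct stateid4 *sid)
   {
       return memcmp(sid, &anonymous_stateid, sizeof(*sid)) == 0 ||
              memcmp(sid, &read_bypass_stateid, sizeof(*sid)) == 0;
   }

   int main(void)
   {
       return stateid_is_special(&anonymous_stateid) ? 0 : 1;
   }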
8935 Share reservations are established by OPEN operations and by their 8936 nature are mandatory in that when the OPEN denies READ or WRITE 8937 operations, that denial results in such operations being rejected 8938 with error NFS4ERR_LOCKED. Byte-range locks may be implemented by 8939 the server as either mandatory or advisory, or the choice of 8940 mandatory or advisory behavior may be determined by the server on the 8941 basis of the file being accessed (for example, some UNIX-based 8942 servers support a "mandatory lock bit" on the mode attribute such 8943 that if set, byte-range locks are required on the file before I/O is 8944 possible). When byte-range locks are advisory, they only prevent the 8945 granting of conflicting lock requests and have no effect on READs or 8946 WRITEs. Mandatory byte-range locks, however, prevent conflicting I/O 8947 operations. When they are attempted, they are rejected with 8948 NFS4ERR_LOCKED. When the client gets NFS4ERR_LOCKED on a file for 8949 which it knows it has the proper share reservation, it will need to 8950 send a LOCK operation on the byte-range of the file that includes the 8951 byte-range the I/O was to be performed on, with an appropriate 8952 locktype field of the LOCK operation's arguments (i.e., READ*_LT for 8953 a READ operation, WRITE*_LT for a WRITE operation). 8955 Note that for UNIX environments that support mandatory byte-range 8956 locking, the distinction between advisory and mandatory locking is 8957 subtle. In fact, advisory and mandatory byte-range locks are exactly 8958 the same as far as the APIs and requirements on implementation. If 8959 the mandatory lock attribute is set on the file, the server checks to 8960 see if the lock-owner has an appropriate shared (READ_LT) or 8961 exclusive (WRITE_LT) byte-range lock on the byte-range it wishes to 8962 READ from or WRITE to. If there is no appropriate lock, the server 8963 checks if there is a conflicting lock (which can be done by 8964 attempting to acquire the conflicting lock on behalf of the lock- 8965 owner, and if successful, release the lock after the READ or WRITE 8966 operation is done), and if there is, the server returns 8967 NFS4ERR_LOCKED. 8969 For Windows environments, byte-range locks are always mandatory, so 8970 the server always checks for byte-range locks during I/O requests. 8972 Thus, the LOCK operation does not need to distinguish between 8973 advisory and mandatory byte-range locks. It is the server's 8974 processing of the READ and WRITE operations that introduces the 8975 distinction. 8977 Every stateid that is validly passed to READ, WRITE, or SETATTR, with 8978 the exception of special stateid values, defines an access mode for 8979 the file (i.e., OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or 8980 OPEN4_SHARE_ACCESS_BOTH). 8982 * For stateids associated with opens, this is the mode defined by 8983 the original OPEN that caused the allocation of the OPEN stateid 8984 and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the 8985 same open-owner/file pair. 8987 * For stateids returned by byte-range LOCK operations, the 8988 appropriate mode is the access mode for the OPEN stateid 8989 associated with the lock set represented by the stateid. 8991 * For delegation stateids, the access mode is based on the type of 8992 delegation. 
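The check described above for files with mandatory byte-range locking in effect can be sketched as follows. This is illustrative only: lock_covers() and lock_conflicts() are hypothetical stand-ins for searches of whatever byte-range lock state the server maintains, and are stubbed so the sketch is self-contained.

   /* Illustrative server-side check for a READ or WRITE on a file
    * whose mode attribute enables mandatory byte-range locking. */

   #include <stdbool.h>
   #include <stdint.h>

   enum nfs_status { NFS4_OK, NFS4ERR_LOCKED };
   enum lock_type  { READ_LT, WRITE_LT };

   struct lock_owner { int id; };      /* placeholder */

   static bool lock_covers(const struct lock_owner *lo, enum lock_type lt,
                           uint64_t off, uint64_t len)
   {   /* stub: does this owner already hold a sufficient lock? */
       (void)lo; (void)lt; (void)off; (void)len; return false;
   }

   static bool lock_conflicts(const struct lock_owner *lo, enum lock_type lt,
                              uint64_t off, uint64_t len)
   {   /* stub: would such a lock conflict with another owner's lock? */
       (void)lo; (void)lt; (void)off; (void)len; return false;
   }

   static enum nfs_status check_mandatory_io(const struct lock_owner *lo,
                                             bool is_write,
                                             uint64_t off, uint64_t len)
   {
       enum lock_type needed = is_write ? WRITE_LT : READ_LT;

       if (lock_covers(lo, needed, off, len))
           return NFS4_OK;           /* owner already holds the needed lock */
       if (lock_conflicts(lo, needed, off, len))
           return NFS4ERR_LOCKED;    /* a conflicting lock exists elsewhere */
       return NFS4_OK;               /* no lock held, but none conflicts    */
   }

   int main(void)
   {
       struct lock_owner lo = { 1 };
       return check_mandatory_io(&lo, true, 0, 4096) == NFS4_OK ? 0 : 1;
   }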
8994 When a READ, WRITE, or SETATTR (that specifies the size attribute) 8995 operation is done, the operation is subject to checking against the 8996 access mode to verify that the operation is appropriate given the 8997 stateid with which the operation is associated. 8999 In the case of WRITE-type operations (i.e., WRITEs and SETATTRs that 9000 set size), the server MUST verify that the access mode allows writing 9001 and MUST return an NFS4ERR_OPENMODE error if it does not. In the 9002 case of READ, the server may perform the corresponding check on the 9003 access mode, or it may choose to allow READ on OPENs for 9004 OPEN4_SHARE_ACCESS_WRITE, to accommodate clients whose WRITE 9005 implementation may unavoidably do reads (e.g., due to buffer cache 9006 constraints). However, even if READs are allowed in these 9007 circumstances, the server MUST still check for locks that conflict 9008 with the READ (e.g., another OPEN specified OPEN4_SHARE_DENY_READ or 9009 OPEN4_SHARE_DENY_BOTH). Note that a server that does enforce the 9010 access mode check on READs need not explicitly check for conflicting 9011 share reservations since the existence of OPEN for 9012 OPEN4_SHARE_ACCESS_READ guarantees that no conflicting share 9013 reservation can exist. 9015 The READ bypass special stateid (all bits of "other" and "seqid" set 9016 to one) indicates a desire to bypass locking checks. The server MAY 9017 allow READ operations to bypass locking checks at the server, when 9018 this special stateid is used. However, WRITE operations with this 9019 special stateid value MUST NOT bypass locking checks and are treated 9020 exactly the same as if a special stateid for anonymous state were 9021 used. 9023 A lock may not be granted while a READ or WRITE operation using one 9024 of the special stateids is being performed and the scope of the lock 9025 to be granted would conflict with the READ or WRITE operation. This 9026 can occur when: 9028 * A mandatory byte-range lock is requested with a byte-range that 9029 conflicts with the byte-range of the READ or WRITE operation. For 9030 the purposes of this paragraph, a conflict occurs when a shared 9031 lock is requested and a WRITE operation is being performed, or an 9032 exclusive lock is requested and either a READ or a WRITE operation 9033 is being performed. 9035 * A share reservation is requested that denies reading and/or 9036 writing and the corresponding operation is being performed. 9038 * A delegation is to be granted and the delegation type would 9039 prevent the I/O operation, i.e., READ and WRITE conflict with an 9040 OPEN_DELEGATE_WRITE delegation and WRITE conflicts with an 9041 OPEN_DELEGATE_READ delegation. 9043 When a client holds a delegation, it needs to ensure that the stateid 9044 sent conveys the association of operation with the delegation, to 9045 avoid the delegation from being avoidably recalled. When the 9046 delegation stateid, a stateid open associated with that delegation, 9047 or a stateid representing byte-range locks derived from such an open 9048 is used, the server knows that the READ, WRITE, or SETATTR does not 9049 conflict with the delegation but is sent under the aegis of the 9050 delegation. Even though it is possible for the server to determine 9051 from the client ID (via the session ID) that the client does in fact 9052 have a delegation, the server is not obliged to check this, so using 9053 a special stateid can result in avoidable recall of the delegation. 9055 9.2. 
Lock Ranges 9057 The protocol allows a lock-owner to request a lock with a byte-range 9058 and then either upgrade, downgrade, or unlock a sub-range of the 9059 initial lock, or a byte-range that overlaps -- fully or partially -- 9060 either with that initial lock or a combination of a set of existing 9061 locks for the same lock-owner. It is expected that this will be an 9062 uncommon type of request. In any case, servers or server file 9063 systems may not be able to support sub-range lock semantics. In the 9064 event that a server receives a locking request that represents a sub- 9065 range of current locking state for the lock-owner, the server is 9066 allowed to return the error NFS4ERR_LOCK_RANGE to signify that it 9067 does not support sub-range lock operations. Therefore, the client 9068 should be prepared to receive this error and, if appropriate, report 9069 the error to the requesting application. 9071 The client is discouraged from combining multiple independent locking 9072 ranges that happen to be adjacent into a single request since the 9073 server may not support sub-range requests for reasons related to the 9074 recovery of byte-range locking state in the event of server failure. 9075 As discussed in Section 8.4.2, the server may employ certain 9076 optimizations during recovery that work effectively only when the 9077 client's behavior during lock recovery is similar to the client's 9078 locking behavior prior to server failure. 9080 9.3. Upgrading and Downgrading Locks 9082 If a client has a WRITE_LT lock on a byte-range, it can request an 9083 atomic downgrade of the lock to a READ_LT lock via the LOCK 9084 operation, by setting the type to READ_LT. If the server supports 9085 atomic downgrade, the request will succeed. If not, it will return 9086 NFS4ERR_LOCK_NOTSUPP. The client should be prepared to receive this 9087 error and, if appropriate, report the error to the requesting 9088 application. 9090 If a client has a READ_LT lock on a byte-range, it can request an 9091 atomic upgrade of the lock to a WRITE_LT lock via the LOCK operation 9092 by setting the type to WRITE_LT or WRITEW_LT. If the server does not 9093 support atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the 9094 upgrade can be achieved without an existing conflict, the request 9095 will succeed. Otherwise, the server will return either 9096 NFS4ERR_DENIED or NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is 9097 returned if the client sent the LOCK operation with the type set to 9098 WRITEW_LT and the server has detected a deadlock. The client should 9099 be prepared to receive such errors and, if appropriate, report the 9100 error to the requesting application. 9102 9.4. Stateid Seqid Values and Byte-Range Locks 9104 When a LOCK or LOCKU operation is performed, the stateid returned has 9105 the same "other" value as the argument's stateid, and a "seqid" value 9106 that is incremented (relative to the argument's stateid) to reflect 9107 the occurrence of the LOCK or LOCKU operation. The server MUST 9108 increment the value of the "seqid" field whenever there is any change 9109 to the locking status of any byte offset as described by any of the 9110 locks covered by the stateid. A change in locking status includes a 9111 change from locked to unlocked or the reverse or a change from being 9112 locked for READ_LT to being locked for WRITE_LT or the reverse. 
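From the client's side, the rule above means that the stateid returned by a LOCK or LOCKU simply supersedes the one cached for that lock set. The following non-normative sketch restates the stateid4 layout (a 32-bit "seqid" plus a 12-byte "other" field); the helper name apply_lock_result() is illustrative only.

   /* Non-normative sketch: applying the stateid returned by LOCK or
    * LOCKU to the client's cached byte-range lock stateid.  The "other"
    * field identifies the lock set and must not change; the "seqid"
    * reflects the server's count of changes in locking status. */

   #include <stdint.h>
   #include <string.h>
   #include <stdbool.h>

   struct stateid4 {
       uint32_t seqid;
       uint8_t  other[12];
   };

   /* Returns true and updates *cached when the result belongs to the
    * same lock set; false indicates a client-side accounting error. */
   static bool apply_lock_result(struct stateid4 *cached,
                                 const struct stateid4 *result)
   {
       if (memcmp(cached->other, result->other, sizeof(cached->other)) != 0)
           return false;          /* not the stateid for this lock set */
       *cached = *result;         /* adopt the new (incremented) seqid  */
       return true;
   }

   int main(void)
   {
       struct stateid4 cached = { 3, { 1, 2, 3 } };
       struct stateid4 result = { 4, { 1, 2, 3 } };
       return apply_lock_result(&cached, &result) ? 0 : 1;
   }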
9114 When there is no such change, as, for example, when a range already 9115 locked for WRITE_LT is locked again for WRITE_LT, the server MAY 9116 increment the "seqid" value. 9118 9.5. Issues with Multiple Open-Owners 9120 When the same file is opened by multiple open-owners, a client will 9121 have multiple OPEN stateids for that file, each associated with a 9122 different open-owner. In that case, there can be multiple LOCK and 9123 LOCKU requests for the same lock-owner sent using the different OPEN 9124 stateids, and so a situation may arise in which there are multiple 9125 stateids, each representing byte-range locks on the same file and 9126 held by the same lock-owner but each associated with a different 9127 open-owner. 9129 In such a situation, the locking status of each byte (i.e., whether 9130 it is locked, the READ_LT or WRITE_LT type of the lock, and the lock- 9131 owner holding the lock) MUST reflect the last LOCK or LOCKU operation 9132 done for the lock-owner in question, independent of the stateid 9133 through which the request was sent. 9135 When a byte is locked by the lock-owner in question, the open-owner 9136 to which that byte-range lock is assigned SHOULD be that of the open- 9137 owner associated with the stateid through which the last LOCK of that 9138 byte was done. When there is a change in the open-owner associated 9139 with locks for the stateid through which a LOCK or LOCKU was done, 9140 the "seqid" field of the stateid MUST be incremented, even if the 9141 locking, in terms of lock-owners has not changed. When there is a 9142 change to the set of locked bytes associated with a different stateid 9143 for the same lock-owner, i.e., associated with a different open- 9144 owner, the "seqid" value for that stateid MUST NOT be incremented. 9146 9.6. Blocking Locks 9148 Some clients require the support of blocking locks. While NFSv4.1 9149 provides a callback when a previously unavailable lock becomes 9150 available, this is an OPTIONAL feature and clients cannot depend on 9151 its presence. Clients need to be prepared to continually poll for 9152 the lock. This presents a fairness problem. Two of the lock types, 9153 READW_LT and WRITEW_LT, are used to indicate to the server that the 9154 client is requesting a blocking lock. When the callback is not used, 9155 the server should maintain an ordered list of pending blocking locks. 9156 When the conflicting lock is released, the server may wait for the 9157 period of time equal to lease_time for the first waiting client to 9158 re-request the lock. After the lease period expires, the next 9159 waiting client request is allowed the lock. Clients are required to 9160 poll at an interval sufficiently small that it is likely to acquire 9161 the lock in a timely manner. The server is not required to maintain 9162 a list of pending blocked locks as it is used to increase fairness 9163 and not correct operation. Because of the unordered nature of crash 9164 recovery, storing of lock state to stable storage would be required 9165 to guarantee ordered granting of blocking locks. 9167 Servers may also note the lock types and delay returning denial of 9168 the request to allow extra time for a conflicting lock to be 9169 released, allowing a successful return. In this way, clients can 9170 avoid the burden of needless frequent polling for blocking locks. 9171 The server should take care in the length of delay in the event the 9172 client retransmits the request. 
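A client-side polling loop of the kind described above might look like the following non-normative sketch, in which nfs_lock() is a placeholder for sending a LOCK operation with a READW_LT or WRITEW_LT locktype, and the poll interval is kept well below the server's lease_time so that a queued waiter does not lose its turn.

   /* Illustrative polling loop for a blocking byte-range lock. */

   #include <stdbool.h>
   #include <unistd.h>

   enum nfs_status { NFS4_OK, NFS4ERR_DENIED, NFS4ERR_DEADLOCK };

   /* Placeholder: send LOCK (blocking variant) and return its status. */
   static enum nfs_status nfs_lock(bool write_lock)
   {
       (void)write_lock;
       return NFS4ERR_DENIED;     /* stubbed result for the sketch */
   }

   static enum nfs_status poll_for_lock(bool write_lock,
                                        unsigned lease_time_s,
                                        unsigned max_attempts)
   {
       /* Poll at a fraction of the lease time so that, if the server
        * queued this client as a waiter, the lock is re-requested
        * before the server gives up on it. */
       unsigned interval = lease_time_s / 4 ? lease_time_s / 4 : 1;
       enum nfs_status st = NFS4ERR_DENIED;

       for (unsigned i = 0; i < max_attempts; i++) {
           st = nfs_lock(write_lock);
           if (st != NFS4ERR_DENIED)
               break;             /* granted, or a non-retryable error */
           if (i + 1 < max_attempts)
               sleep(interval);
       }
       return st;
   }

   int main(void) { return poll_for_lock(true, 4, 2) == NFS4_OK ? 0 : 1; }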
9174 If a server receives a blocking LOCK operation, denies it, and then 9175 later receives a nonblocking request for the same lock, which is also 9176 denied, then it should remove the lock in question from its list of 9177 pending blocking locks. Clients should use such a nonblocking 9178 request to indicate to the server that this is the last time they 9179 intend to poll for the lock, as may happen when the process 9180 requesting the lock is interrupted. This is a courtesy to the 9181 server, to prevent it from unnecessarily waiting a lease period 9182 before granting other LOCK operations. However, clients are not 9183 required to perform this courtesy, and servers must not depend on 9184 them doing so. Also, clients must be prepared for the possibility 9185 that this final locking request will be accepted.
9187 When a server indicates, via the flag OPEN4_RESULT_MAY_NOTIFY_LOCK, 9188 that CB_NOTIFY_LOCK callbacks might be done for the current open 9189 file, the client should take notice of this, but, since this is a 9190 hint, cannot rely on a CB_NOTIFY_LOCK always being done. A client 9191 may reasonably reduce the frequency with which it polls for a denied 9192 lock, since the greater latency that might occur is likely to be 9193 eliminated given a prompt callback, but it still needs to poll. When 9194 it receives a CB_NOTIFY_LOCK, it should promptly try to obtain the 9195 lock, but it should be aware that other clients may be polling and 9196 that the server is under no obligation to reserve the lock for that 9197 particular client.
9199 9.7. Share Reservations
9201 A share reservation is a mechanism to control access to a file. It 9202 is a separate and independent mechanism from byte-range locking. 9203 When a client opens a file, it sends an OPEN operation to the server 9204 specifying the type of access required (READ, WRITE, or BOTH) and the 9205 type of access to deny others (OPEN4_SHARE_DENY_NONE, 9206 OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or 9207 OPEN4_SHARE_DENY_BOTH). If the OPEN fails, the client will fail the 9208 application's open request.
9210 Pseudo-code definition of the semantics:
9212        if (request.access == 0) {
9213          return (NFS4ERR_INVAL);
9214        } else {
9215          if ((request.access & file_state.deny) ||
9216              (request.deny & file_state.access)) {
9217            return (NFS4ERR_SHARE_DENIED);
9218          }
9219          return (NFS4ERR_OK);
             }
9221 When doing this checking of share reservations on OPEN, the current 9222 file_state used in the algorithm includes bits that reflect all 9223 current opens, including those for the open-owner making the new OPEN 9224 request.
9226 The constants used for the OPEN and OPEN_DOWNGRADE operations for the 9227 access and deny fields are as follows:
9229 const OPEN4_SHARE_ACCESS_READ = 0x00000001;
9230 const OPEN4_SHARE_ACCESS_WRITE = 0x00000002;
9231 const OPEN4_SHARE_ACCESS_BOTH = 0x00000003;
9233 const OPEN4_SHARE_DENY_NONE = 0x00000000;
9234 const OPEN4_SHARE_DENY_READ = 0x00000001;
9235 const OPEN4_SHARE_DENY_WRITE = 0x00000002;
9236 const OPEN4_SHARE_DENY_BOTH = 0x00000003;
9238 9.8. OPEN/CLOSE Operations
9240 To provide correct share semantics, a client MUST use the OPEN 9241 operation to obtain the initial filehandle and indicate the desired 9242 access and what access, if any, to deny. Even if the client intends 9243 to use a special stateid for anonymous state or READ bypass, it must 9244 still obtain the filehandle for the regular file with the OPEN 9245 operation so the appropriate share semantics can be applied.
Clients 9246 that do not have a deny mode built into their programming interfaces 9247 for opening a file should request a deny mode of 9248 OPEN4_SHARE_DENY_NONE. 9250 The OPEN operation with the CREATE flag also subsumes the CREATE 9251 operation for regular files as used in previous versions of the NFS 9252 protocol. This allows a create with a share to be done atomically. 9254 The CLOSE operation removes all share reservations held by the open- 9255 owner on that file. If byte-range locks are held, the client SHOULD 9256 release all locks before sending a CLOSE operation. The server MAY 9257 free all outstanding locks on CLOSE, but some servers may not support 9258 the CLOSE of a file that still has byte-range locks held. The server 9259 MUST return failure, NFS4ERR_LOCKS_HELD, if any locks would exist 9260 after the CLOSE. 9262 The LOOKUP operation will return a filehandle without establishing 9263 any lock state on the server. Without a valid stateid, the server 9264 will assume that the client has the least access. For example, if 9265 one client opened a file with OPEN4_SHARE_DENY_BOTH and another 9266 client accesses the file via a filehandle obtained through LOOKUP, 9267 the second client could only read the file using the special read 9268 bypass stateid. The second client could not WRITE the file at all 9269 because it would not have a valid stateid from OPEN and the special 9270 anonymous stateid would not be allowed access. 9272 9.9. Open Upgrade and Downgrade 9274 When an OPEN is done for a file and the open-owner for which the OPEN 9275 is being done already has the file open, the result is to upgrade the 9276 open file status maintained on the server to include the access and 9277 deny bits specified by the new OPEN as well as those for the existing 9278 OPEN. The result is that there is one open file, as far as the 9279 protocol is concerned, and it includes the union of the access and 9280 deny bits for all of the OPEN requests completed. The OPEN is 9281 represented by a single stateid whose "other" value matches that of 9282 the original open, and whose "seqid" value is incremented to reflect 9283 the occurrence of the upgrade. The increment is required in cases in 9284 which the "upgrade" results in no change to the open mode (e.g., an 9285 OPEN is done for read when the existing open file is opened for 9286 OPEN4_SHARE_ACCESS_BOTH). Only a single CLOSE will be done to reset 9287 the effects of both OPENs. The client may use the stateid returned 9288 by the OPEN effecting the upgrade or with a stateid sharing the same 9289 "other" field and a seqid of zero, although care needs to be taken as 9290 far as upgrades that happen while the CLOSE is pending. Note that 9291 the client, when sending the OPEN, may not know that the same file is 9292 in fact being opened. The above only applies if both OPENs result in 9293 the OPENed object being designated by the same filehandle. 9295 When the server chooses to export multiple filehandles corresponding 9296 to the same file object and returns different filehandles on two 9297 different OPENs of the same file object, the server MUST NOT "OR" 9298 together the access and deny bits and coalesce the two open files. 9299 Instead, the server must maintain separate OPENs with separate 9300 stateids and will require separate CLOSEs to free them. 
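A non-normative sketch of the server-side bookkeeping for the open upgrade described at the beginning of this section follows, reusing the OPEN4_SHARE_* constants given in Section 9.7. The structure and field names are illustrative and are not taken from the protocol's XDR.

   /* Illustrative open-upgrade bookkeeping: the access and deny bits
    * of a new OPEN by the same open-owner for the same file are OR'ed
    * into the existing open, and the stateid's "seqid" is incremented
    * even when the resulting mode is unchanged. */

   #include <stdint.h>

   #define OPEN4_SHARE_ACCESS_READ  0x00000001
   #define OPEN4_SHARE_ACCESS_WRITE 0x00000002
   #define OPEN4_SHARE_DENY_NONE    0x00000000
   #define OPEN4_SHARE_DENY_READ    0x00000001

   struct open_state {
       uint32_t share_access;   /* union of access bits granted so far */
       uint32_t share_deny;     /* union of deny bits granted so far   */
       uint32_t seqid;          /* "seqid" of the OPEN stateid         */
   };

   static void open_upgrade(struct open_state *os,
                            uint32_t new_access, uint32_t new_deny)
   {
       os->share_access |= new_access;
       os->share_deny   |= new_deny;
       os->seqid        += 1;     /* always incremented on upgrade */
   }

   int main(void)
   {
       struct open_state os = { OPEN4_SHARE_ACCESS_READ,
                                OPEN4_SHARE_DENY_NONE, 1 };
       open_upgrade(&os, OPEN4_SHARE_ACCESS_WRITE, OPEN4_SHARE_DENY_READ);
       return os.seqid == 2 ? 0 : 1;
   }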
9302 When multiple open files on the client are merged into a single OPEN
9303 file object on the server, the close of one of the open files (on the
9304 client) may necessitate a change of the access and deny status of the
9305 open file on the server. This is because the union of the access and
9306 deny bits for the remaining opens may be smaller (i.e., a proper
9307 subset) than previously. The OPEN_DOWNGRADE operation is used to
9308 make the necessary change, and the client should use it to update the
9309 server so that share reservation requests by other clients are
9310 handled properly. The stateid returned has the same "other" field as
9311 that passed to the server. The "seqid" value in the returned stateid
9312 MUST be incremented, even in situations in which there is no change
9313 to the access and deny bits for the file.

9315 9.10. Parallel OPENs

9317 Unlike the case of NFSv4.0, in which OPEN operations for the same
9318 open-owner are inherently serialized because of the owner-based
9319 seqid, multiple OPENs for the same open-owner may be done in
9320 parallel. When clients do this, they may encounter situations in
9321 which, because of the existence of hard links, two OPEN operations
9322 may turn out to open the same file, with a later OPEN performed being
9323 an upgrade of the first, with this fact only visible to the client
9324 once the operations complete.

9326 In this situation, clients may determine the order in which the OPENs
9327 were performed by examining the stateids returned by the OPENs.
9328 Stateids that share a common value of the "other" field can be
9329 recognized as belonging to the same open of the same file, with the
9330 order of the operations determinable from the order of the "seqid"
9331 fields, modulo any possible wraparound of the 32-bit field.

9333 When the possibility exists that the client will send multiple OPENs
9334 for the same open-owner in parallel, an open upgrade may happen
9335 without the client knowing beforehand that this could occur.
9336 Because of this possibility, CLOSEs and OPEN_DOWNGRADEs should
9337 generally be sent with a non-zero seqid in the stateid, to avoid the
9338 possibility that the status change associated with an open upgrade
9339 is inadvertently lost.

9341 9.11. Reclaim of Open and Byte-Range Locks

9343 Special forms of the LOCK and OPEN operations are provided when it is
9344 necessary to re-establish byte-range locks or opens after a server
9345 failure.

9347 * To reclaim existing opens, an OPEN operation is performed using a
9348 claim type of CLAIM_PREVIOUS. Because the client, in this type of
9349 situation, will have already opened the file and have the filehandle
9350 of the target file, this operation requires that the current
9351 filehandle be the target file, rather than a directory, and no file
9352 name is specified.

9354 * To reclaim byte-range locks, a LOCK operation with the reclaim
9355 parameter set to true is used.

9357 Reclaims of opens associated with delegations are discussed in
9358 Section 10.2.1.

9360 10. Client-Side Caching

9362 Client-side caching of data, of file attributes, and of file names is
9363 essential to providing good performance with the NFS protocol.
9364 Providing distributed cache coherence is a difficult problem, and
9365 previous versions of the NFS protocol have not attempted it.
9366 Instead, several NFS client implementation techniques have been used
9367 to reduce the problems that a lack of coherence poses for users.
9368 These techniques have not been clearly defined by earlier protocol 9369 specifications, and it is often unclear what is valid or invalid 9370 client behavior. 9372 The NFSv4.1 protocol uses many techniques similar to those that have 9373 been used in previous protocol versions. The NFSv4.1 protocol does 9374 not provide distributed cache coherence. However, it defines a more 9375 limited set of caching guarantees to allow locks and share 9376 reservations to be used without destructive interference from client- 9377 side caching. 9379 In addition, the NFSv4.1 protocol introduces a delegation mechanism, 9380 which allows many decisions normally made by the server to be made 9381 locally by clients. This mechanism provides efficient support of the 9382 common cases where sharing is infrequent or where sharing is read- 9383 only. 9385 10.1. Performance Challenges for Client-Side Caching 9387 Caching techniques used in previous versions of the NFS protocol have 9388 been successful in providing good performance. However, several 9389 scalability challenges can arise when those techniques are used with 9390 very large numbers of clients. This is particularly true when 9391 clients are geographically distributed, which classically increases 9392 the latency for cache revalidation requests. 9394 The previous versions of the NFS protocol repeat their file data 9395 cache validation requests at the time the file is opened. This 9396 behavior can have serious performance drawbacks. A common case is 9397 one in which a file is only accessed by a single client. Therefore, 9398 sharing is infrequent. 9400 In this case, repeated references to the server to find that no 9401 conflicts exist are expensive. A better option with regards to 9402 performance is to allow a client that repeatedly opens a file to do 9403 so without reference to the server. This is done until potentially 9404 conflicting operations from another client actually occur. 9406 A similar situation arises in connection with byte-range locking. 9407 Sending LOCK and LOCKU operations as well as the READ and WRITE 9408 operations necessary to make data caching consistent with the locking 9409 semantics (see Section 10.3.2) can severely limit performance. When 9410 locking is used to provide protection against infrequent conflicts, a 9411 large penalty is incurred. This penalty may discourage the use of 9412 byte-range locking by applications. 9414 The NFSv4.1 protocol provides more aggressive caching strategies with 9415 the following design goals: 9417 * Compatibility with a large range of server semantics. 9419 * Providing the same caching benefits as previous versions of the 9420 NFS protocol when unable to support the more aggressive model. 9422 * Requirements for aggressive caching are organized so that a large 9423 portion of the benefit can be obtained even when not all of the 9424 requirements can be met. 9426 The appropriate requirements for the server are discussed in later 9427 sections in which specific forms of caching are covered (see 9428 Section 10.4). 9430 10.2. Delegation and Callbacks 9432 Recallable delegation of server responsibilities for a file to a 9433 client improves performance by avoiding repeated requests to the 9434 server in the absence of inter-client conflict. With the use of a 9435 "callback" RPC from server to client, a server recalls delegated 9436 responsibilities when another client engages in sharing of a 9437 delegated file. 
9439 A delegation is passed from the server to the client, specifying the 9440 object of the delegation and the type of delegation. There are 9441 different types of delegations, but each type contains a stateid to 9442 be used to represent the delegation when performing operations that 9443 depend on the delegation. This stateid is similar to those 9444 associated with locks and share reservations but differs in that the 9445 stateid for a delegation is associated with a client ID and may be 9446 used on behalf of all the open-owners for the given client. A 9447 delegation is made to the client as a whole and not to any specific 9448 process or thread of control within it. 9450 The backchannel is established by CREATE_SESSION and 9451 BIND_CONN_TO_SESSION, and the client is required to maintain it. 9452 Because the backchannel may be down, even temporarily, correct 9453 protocol operation does not depend on them. Preliminary testing of 9454 backchannel functionality by means of a CB_COMPOUND procedure with a 9455 single operation, CB_SEQUENCE, can be used to check the continuity of 9456 the backchannel. A server avoids delegating responsibilities until 9457 it has determined that the backchannel exists. Because the granting 9458 of a delegation is always conditional upon the absence of conflicting 9459 access, clients MUST NOT assume that a delegation will be granted and 9460 they MUST always be prepared for OPENs, WANT_DELEGATIONs, and 9461 GET_DIR_DELEGATIONs to be processed without any delegations being 9462 granted. 9464 Unlike locks, an operation by a second client to a delegated file 9465 will cause the server to recall a delegation through a callback. For 9466 individual operations, we will describe, under IMPLEMENTATION, when 9467 such operations are required to effect a recall. A number of points 9468 should be noted, however. 9470 * The server is free to recall a delegation whenever it feels it is 9471 desirable and may do so even if no operations requiring recall are 9472 being done. 9474 * Operations done outside the NFSv4.1 protocol, due to, for example, 9475 access by other protocols, or by local access, also need to result 9476 in delegation recall when they make analogous changes to file 9477 system data. What is crucial is if the change would invalidate 9478 the guarantees provided by the delegation. When this is possible, 9479 the delegation needs to be recalled and MUST be returned or 9480 revoked before allowing the operation to proceed. 9482 * The semantics of the file system are crucial in defining when 9483 delegation recall is required. If a particular change within a 9484 specific implementation causes change to a file attribute, then 9485 delegation recall is required, whether that operation has been 9486 specifically listed as requiring delegation recall. Again, what 9487 is critical is whether the guarantees provided by the delegation 9488 are being invalidated. 9490 Despite those caveats, the implementation sections for a number of 9491 operations describe situations in which delegation recall would be 9492 required under some common circumstances: 9494 * For GETATTR, see Section 18.7.4. 9496 * For OPEN, see Section 18.16.4. 9498 * For READ, see Section 18.22.4. 9500 * For REMOVE, see Section 18.25.4. 9502 * For RENAME, see Section 18.26.4. 9504 * For SETATTR, see Section 18.30.4. 9506 * For WRITE, see Section 18.32.4. 
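While the protocol does not prescribe a particular server algorithm for
making these determinations, the following non-normative sketch (in C)
shows the kind of check a server might apply when a request arrives from a
client other than the holder of a delegation.  The enum and function names
are purely illustrative.

   #include <stdbool.h>
   #include <stdio.h>

   /* Illustrative classification of a request made by a client other
    * than the one holding the delegation. */
   enum op_kind {
       OP_READ,             /* READ of file data                       */
       OP_GETATTR,          /* GETATTR of attributes the delegate may
                               have changed locally (size, change)     */
       OP_OPEN_FOR_WRITE,   /* OPEN with write access or a deny mode   */
       OP_WRITE,
       OP_SETATTR,
       OP_REMOVE,
       OP_RENAME
   };

   enum deleg_kind { DELEG_NONE, DELEG_READ, DELEG_WRITE };

   /* Return true if the request could invalidate the guarantees
    * provided by the delegation and therefore requires the delegation
    * to be recalled (or, for GETATTR, handled via CB_GETATTR, see
    * Section 10.4.3) before the request proceeds. */
   static bool recall_needed(enum deleg_kind deleg, enum op_kind op)
   {
       if (deleg == DELEG_NONE)
           return false;

       switch (op) {
       case OP_READ:
       case OP_GETATTR:
           /* Reads and attribute queries conflict only with a write
            * delegation, under which the delegate may hold modified
            * data not yet visible to the server.                      */
           return deleg == DELEG_WRITE;
       default:
           /* Opens for write, WRITEs, SETATTRs, REMOVEs, and RENAMEs
            * change data, attributes, or names and conflict with
            * either type of delegation.                               */
           return true;
       }
   }

   int main(void)
   {
       printf("READ  vs read deleg: %d\n",
              recall_needed(DELEG_READ, OP_READ));    /* 0 */
       printf("WRITE vs read deleg: %d\n",
              recall_needed(DELEG_READ, OP_WRITE));   /* 1 */
       return 0;
   }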
9508 On recall, the client holding the delegation needs to flush modified 9509 state (such as modified data) to the server and return the 9510 delegation. The conflicting request will not be acted on until the 9511 recall is complete. The recall is considered complete when the 9512 client returns the delegation or the server times its wait for the 9513 delegation to be returned and revokes the delegation as a result of 9514 the timeout. In the interim, the server will either delay responding 9515 to conflicting requests or respond to them with NFS4ERR_DELAY. 9516 Following the resolution of the recall, the server has the 9517 information necessary to grant or deny the second client's request. 9519 At the time the client receives a delegation recall, it may have 9520 substantial state that needs to be flushed to the server. Therefore, 9521 the server should allow sufficient time for the delegation to be 9522 returned since it may involve numerous RPCs to the server. If the 9523 server is able to determine that the client is diligently flushing 9524 state to the server as a result of the recall, the server may extend 9525 the usual time allowed for a recall. However, the time allowed for 9526 recall completion should not be unbounded. 9528 An example of this is when responsibility to mediate opens on a given 9529 file is delegated to a client (see Section 10.4). The server will 9530 not know what opens are in effect on the client. Without this 9531 knowledge, the server will be unable to determine if the access and 9532 deny states for the file allow any particular open until the 9533 delegation for the file has been returned. 9535 A client failure or a network partition can result in failure to 9536 respond to a recall callback. In this case, the server will revoke 9537 the delegation, which in turn will render useless any modified state 9538 still on the client. 9540 10.2.1. Delegation Recovery 9542 There are three situations that delegation recovery needs to deal 9543 with: 9545 * client restart 9547 * server restart 9549 * network partition (full or backchannel-only) 9550 In the event the client restarts, the failure to renew the lease will 9551 result in the revocation of byte-range locks and share reservations. 9552 Delegations, however, may be treated a bit differently. 9554 There will be situations in which delegations will need to be re- 9555 established after a client restarts. The reason for this is that the 9556 client may have file data stored locally and this data was associated 9557 with the previously held delegations. The client will need to re- 9558 establish the appropriate file state on the server. 9560 To allow for this type of client recovery, the server MAY extend the 9561 period for delegation recovery beyond the typical lease expiration 9562 period. This implies that requests from other clients that conflict 9563 with these delegations will need to wait. Because the normal recall 9564 process may require significant time for the client to flush changed 9565 state to the server, other clients need be prepared for delays that 9566 occur because of a conflicting delegation. This longer interval 9567 would increase the window for clients to restart and consult stable 9568 storage so that the delegations can be reclaimed. For OPEN 9569 delegations, such delegations are reclaimed using OPEN with a claim 9570 type of CLAIM_DELEGATE_PREV or CLAIM_DELEG_PREV_FH (see Sections 10.5 9571 and 18.16 for discussion of OPEN delegation and the details of OPEN, 9572 respectively). 
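As a non-normative illustration, a client that records its delegations in
stable storage might choose between these two claim types as sketched
below.  The structure and function names are hypothetical, and a server is
not required to support either claim type.

   #include <stdbool.h>
   #include <stdio.h>

   /* OPEN claim types usable for re-establishing a delegation after a
    * client restart (illustrative subset). */
   enum reclaim_claim {
       CLAIM_DELEGATE_PREV,  /* names the file within the current
                                directory filehandle                   */
       CLAIM_DELEG_PREV_FH   /* uses the file's own filehandle as the
                                current filehandle; no name needed     */
   };

   /* Illustrative record a client might keep in stable storage for
    * each delegation it holds across a restart. */
   struct deleg_record {
       bool have_name;        /* parent directory and component known  */
       bool have_filehandle;  /* filehandle of the delegated file known */
   };

   /* Choose how to reclaim one recorded delegation.  The client must
    * also be prepared for the reclaim to fail if the server does not
    * support the chosen claim type. */
   static enum reclaim_claim choose_claim(const struct deleg_record *r)
   {
       return r->have_name ? CLAIM_DELEGATE_PREV : CLAIM_DELEG_PREV_FH;
   }

   int main(void)
   {
       struct deleg_record r = { .have_name = false,
                                 .have_filehandle = true };
       printf("%s\n", choose_claim(&r) == CLAIM_DELEGATE_PREV ?
              "CLAIM_DELEGATE_PREV" : "CLAIM_DELEG_PREV_FH");
       return 0;
   }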
9574 A server MAY support claim types of CLAIM_DELEGATE_PREV and 9575 CLAIM_DELEG_PREV_FH, and if it does, it MUST NOT remove delegations 9576 upon a CREATE_SESSION that confirm a client ID created by 9577 EXCHANGE_ID. Instead, the server MUST, for a period of time no less 9578 than that of the value of the lease_time attribute, maintain the 9579 client's delegations to allow time for the client to send 9580 CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH requests. The server 9581 that supports CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH MUST 9582 support the DELEGPURGE operation. 9584 When the server restarts, delegations are reclaimed (using the OPEN 9585 operation with CLAIM_PREVIOUS) in a similar fashion to byte-range 9586 locks and share reservations. However, there is a slight semantic 9587 difference. In the normal case, if the server decides that a 9588 delegation should not be granted, it performs the requested action 9589 (e.g., OPEN) without granting any delegation. For reclaim, the 9590 server grants the delegation but a special designation is applied so 9591 that the client treats the delegation as having been granted but 9592 recalled by the server. Because of this, the client has the duty to 9593 write all modified state to the server and then return the 9594 delegation. This process of handling delegation reclaim reconciles 9595 three principles of the NFSv4.1 protocol: 9597 * Upon reclaim, a client reporting resources assigned to it by an 9598 earlier server instance must be granted those resources. 9600 * The server has unquestionable authority to determine whether 9601 delegations are to be granted and, once granted, whether they are 9602 to be continued. 9604 * The use of callbacks should not be depended upon until the client 9605 has proven its ability to receive them. 9607 When a client needs to reclaim a delegation and there is no 9608 associated open, the client may use the CLAIM_PREVIOUS variant of the 9609 WANT_DELEGATION operation. However, since the server is not required 9610 to support this operation, an alternative is to reclaim via a dummy 9611 OPEN together with the delegation using an OPEN of type 9612 CLAIM_PREVIOUS. The dummy open file can be released using a CLOSE to 9613 re-establish the original state to be reclaimed, a delegation without 9614 an associated open. 9616 When a client has more than a single open associated with a 9617 delegation, state for those additional opens can be established using 9618 OPEN operations of type CLAIM_DELEGATE_CUR. When these are used to 9619 establish opens associated with reclaimed delegations, the server 9620 MUST allow them when made within the grace period. 9622 When a network partition occurs, delegations are subject to freeing 9623 by the server when the lease renewal period expires. This is similar 9624 to the behavior for locks and share reservations. For delegations, 9625 however, the server may extend the period in which conflicting 9626 requests are held off. Eventually, the occurrence of a conflicting 9627 request from another client will cause revocation of the delegation. 9628 A loss of the backchannel (e.g., by later network configuration 9629 change) will have the same effect. A recall request will fail and 9630 revocation of the delegation will result. 9632 A client normally finds out about revocation of a delegation when it 9633 uses a stateid associated with a delegation and receives one of the 9634 errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or 9635 NFS4ERR_DELEG_REVOKED. 
It also may find out about delegation 9636 revocation after a client restart when it attempts to reclaim a 9637 delegation and receives that same error. Note that in the case of a 9638 revoked OPEN_DELEGATE_WRITE delegation, there are issues because data 9639 may have been modified by the client whose delegation is revoked and 9640 separately by other clients. See Section 10.5.1 for a discussion of 9641 such issues. Note also that when delegations are revoked, 9642 information about the revoked delegation will be written by the 9643 server to stable storage (as described in Section 8.4.3). This is 9644 done to deal with the case in which a server restarts after revoking 9645 a delegation but before the client holding the revoked delegation is 9646 notified about the revocation. 9648 10.3. Data Caching 9650 When applications share access to a set of files, they need to be 9651 implemented so as to take account of the possibility of conflicting 9652 access by another application. This is true whether the applications 9653 in question execute on different clients or reside on the same 9654 client. 9656 Share reservations and byte-range locks are the facilities the 9657 NFSv4.1 protocol provides to allow applications to coordinate access 9658 by using mutual exclusion facilities. The NFSv4.1 protocol's data 9659 caching must be implemented such that it does not invalidate the 9660 assumptions on which those using these facilities depend. 9662 10.3.1. Data Caching and OPENs 9664 In order to avoid invalidating the sharing assumptions on which 9665 applications rely, NFSv4.1 clients should not provide cached data to 9666 applications or modify it on behalf of an application when it would 9667 not be valid to obtain or modify that same data via a READ or WRITE 9668 operation. 9670 Furthermore, in the absence of an OPEN delegation (see Section 10.4), 9671 two additional rules apply. Note that these rules are obeyed in 9672 practice by many NFSv3 clients. 9674 * First, cached data present on a client must be revalidated after 9675 doing an OPEN. Revalidating means that the client fetches the 9676 change attribute from the server, compares it with the cached 9677 change attribute, and if different, declares the cached data (as 9678 well as the cached attributes) as invalid. This is to ensure that 9679 the data for the OPENed file is still correctly reflected in the 9680 client's cache. This validation must be done at least when the 9681 client's OPEN operation includes a deny of OPEN4_SHARE_DENY_WRITE 9682 or OPEN4_SHARE_DENY_BOTH, thus terminating a period in which other 9683 clients may have had the opportunity to open the file with 9684 OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH access. Clients 9685 may choose to do the revalidation more often (i.e., at OPENs 9686 specifying a deny mode of OPEN4_SHARE_DENY_NONE) to parallel the 9687 NFSv3 protocol's practice for the benefit of users assuming this 9688 degree of cache revalidation. 9690 Since the change attribute is updated for data and metadata 9691 modifications, some client implementors may be tempted to use the 9692 time_modify attribute and not the change attribute to validate 9693 cached data, so that metadata changes do not spuriously invalidate 9694 clean data. The implementor is cautioned in this approach. The 9695 change attribute is guaranteed to change for each update to the 9696 file, whereas time_modify is guaranteed to change only at the 9697 granularity of the time_delta attribute. 
Use by the client's data 9698 cache validation logic of time_modify and not change runs the risk 9699 of the client incorrectly marking stale data as valid. Thus, any 9700 cache validation approach by the client MUST include the use of 9701 the change attribute. 9703 * Second, modified data must be flushed to the server before closing 9704 a file OPENed for OPEN4_SHARE_ACCESS_WRITE. This is complementary 9705 to the first rule. If the data is not flushed at CLOSE, the 9706 revalidation done after the client OPENs a file is unable to 9707 achieve its purpose. The other aspect to flushing the data before 9708 close is that the data must be committed to stable storage, at the 9709 server, before the CLOSE operation is requested by the client. In 9710 the case of a server restart and a CLOSEd file, it may not be 9711 possible to retransmit the data to be written to the file, hence, 9712 this requirement. 9714 10.3.2. Data Caching and File Locking 9716 For those applications that choose to use byte-range locking instead 9717 of share reservations to exclude inconsistent file access, there is 9718 an analogous set of constraints that apply to client-side data 9719 caching. These rules are effective only if the byte-range locking is 9720 used in a way that matches in an equivalent way the actual READ and 9721 WRITE operations executed. This is as opposed to byte-range locking 9722 that is based on pure convention. For example, it is possible to 9723 manipulate a two-megabyte file by dividing the file into two one- 9724 megabyte ranges and protecting access to the two byte-ranges by byte- 9725 range locks on bytes zero and one. A WRITE_LT lock on byte zero of 9726 the file would represent the right to perform READ and WRITE 9727 operations on the first byte-range. A WRITE_LT lock on byte one of 9728 the file would represent the right to perform READ and WRITE 9729 operations on the second byte-range. As long as all applications 9730 manipulating the file obey this convention, they will work on a local 9731 file system. However, they may not work with the NFSv4.1 protocol 9732 unless clients refrain from data caching. 9734 The rules for data caching in the byte-range locking environment are: 9736 * First, when a client obtains a byte-range lock for a particular 9737 byte-range, the data cache corresponding to that byte-range (if 9738 any cache data exists) must be revalidated. If the change 9739 attribute indicates that the file may have been updated since the 9740 cached data was obtained, the client must flush or invalidate the 9741 cached data for the newly locked byte-range. A client might 9742 choose to invalidate all of the non-modified cached data that it 9743 has for the file, but the only requirement for correct operation 9744 is to invalidate all of the data in the newly locked byte-range. 9746 * Second, before releasing a WRITE_LT lock for a byte-range, all 9747 modified data for that byte-range must be flushed to the server. 9748 The modified data must also be written to stable storage. 9750 Note that flushing data to the server and the invalidation of cached 9751 data must reflect the actual byte-ranges locked or unlocked. 9752 Rounding these up or down to reflect client cache block boundaries 9753 will cause problems if not carefully done. For example, writing a 9754 modified block when only half of that block is within an area being 9755 unlocked may cause invalid modification to the byte-range outside the 9756 unlocked area. 
This, in turn, may be part of a byte-range locked by 9757 another client. Clients can avoid this situation by synchronously 9758 performing portions of WRITE operations that overlap that portion 9759 (initial or final) that is not a full block. Similarly, invalidating 9760 a locked area that is not an integral number of full buffer blocks 9761 would require the client to read one or two partial blocks from the 9762 server if the revalidation procedure shows that the data that the 9763 client possesses may not be valid. 9765 The data that is written to the server as a prerequisite to the 9766 unlocking of a byte-range must be written, at the server, to stable 9767 storage. The client may accomplish this either with synchronous 9768 writes or by following asynchronous writes with a COMMIT operation. 9769 This is required because retransmission of the modified data after a 9770 server restart might conflict with a lock held by another client. 9772 A client implementation may choose to accommodate applications that 9773 use byte-range locking in non-standard ways (e.g., using a byte-range 9774 lock as a global semaphore) by flushing to the server more data upon 9775 a LOCKU than is covered by the locked range. This may include 9776 modified data within files other than the one for which the unlocks 9777 are being done. In such cases, the client must not interfere with 9778 applications whose READs and WRITEs are being done only within the 9779 bounds of byte-range locks that the application holds. For example, 9780 an application locks a single byte of a file and proceeds to write 9781 that single byte. A client that chose to handle a LOCKU by flushing 9782 all modified data to the server could validly write that single byte 9783 in response to an unrelated LOCKU operation. However, it would not 9784 be valid to write the entire block in which that single written byte 9785 was located since it includes an area that is not locked and might be 9786 locked by another client. Client implementations can avoid this 9787 problem by dividing files with modified data into those for which all 9788 modifications are done to areas covered by an appropriate byte-range 9789 lock and those for which there are modifications not covered by a 9790 byte-range lock. Any writes done for the former class of files must 9791 not include areas not locked and thus not modified on the client. 9793 10.3.3. Data Caching and Mandatory File Locking 9795 Client-side data caching needs to respect mandatory byte-range 9796 locking when it is in effect. The presence of mandatory byte-range 9797 locking for a given file is indicated when the client gets back 9798 NFS4ERR_LOCKED from a READ or WRITE operation on a file for which it 9799 has an appropriate share reservation. When mandatory locking is in 9800 effect for a file, the client must check for an appropriate byte- 9801 range lock for data being read or written. If a byte-range lock 9802 exists for the range being read or written, the client may satisfy 9803 the request using the client's validated cache. If an appropriate 9804 byte-range lock is not held for the range of the read or write, the 9805 read or write request must not be satisfied by the client's cache and 9806 the request must be sent to the server for processing. When a read 9807 or write request partially overlaps a locked byte-range, the request 9808 should be subdivided into multiple pieces with each byte-range 9809 (locked or not) treated appropriately. 9811 10.3.4. 
Data Caching and File Identity 9813 When clients cache data, the file data needs to be organized 9814 according to the file system object to which the data belongs. For 9815 NFSv3 clients, the typical practice has been to assume for the 9816 purpose of caching that distinct filehandles represent distinct file 9817 system objects. The client then has the choice to organize and 9818 maintain the data cache on this basis. 9820 In the NFSv4.1 protocol, there is now the possibility to have 9821 significant deviations from a "one filehandle per object" model 9822 because a filehandle may be constructed on the basis of the object's 9823 pathname. Therefore, clients need a reliable method to determine if 9824 two filehandles designate the same file system object. If clients 9825 were simply to assume that all distinct filehandles denote distinct 9826 objects and proceed to do data caching on this basis, caching 9827 inconsistencies would arise between the distinct client-side objects 9828 that mapped to the same server-side object. 9830 By providing a method to differentiate filehandles, the NFSv4.1 9831 protocol alleviates a potential functional regression in comparison 9832 with the NFSv3 protocol. Without this method, caching 9833 inconsistencies within the same client could occur, and this has not 9834 been present in previous versions of the NFS protocol. Note that it 9835 is possible to have such inconsistencies with applications executing 9836 on multiple clients, but that is not the issue being addressed here. 9838 For the purposes of data caching, the following steps allow an 9839 NFSv4.1 client to determine whether two distinct filehandles denote 9840 the same server-side object: 9842 * If GETATTR directed to two filehandles returns different values of 9843 the fsid attribute, then the filehandles represent distinct 9844 objects. 9846 * If GETATTR for any file with an fsid that matches the fsid of the 9847 two filehandles in question returns a unique_handles attribute 9848 with a value of TRUE, then the two objects are distinct. 9850 * If GETATTR directed to the two filehandles does not return the 9851 fileid attribute for both of the handles, then it cannot be 9852 determined whether the two objects are the same. Therefore, 9853 operations that depend on that knowledge (e.g., client-side data 9854 caching) cannot be done reliably. Note that if GETATTR does not 9855 return the fileid attribute for both filehandles, it will return 9856 it for neither of the filehandles, since the fsid for both 9857 filehandles is the same. 9859 * If GETATTR directed to the two filehandles returns different 9860 values for the fileid attribute, then they are distinct objects. 9862 * Otherwise, they are the same object. 9864 10.4. Open Delegation 9866 When a file is being OPENed, the server may delegate further handling 9867 of opens and closes for that file to the opening client. Any such 9868 delegation is recallable since the circumstances that allowed for the 9869 delegation are subject to change. In particular, if the server 9870 receives a conflicting OPEN from another client, the server must 9871 recall the delegation before deciding whether the OPEN from the other 9872 client may be granted. Making a delegation is up to the server, and 9873 clients should not assume that any particular OPEN either will or 9874 will not result in an OPEN delegation. 
The following is a typical 9875 set of conditions that servers might use in deciding whether an OPEN 9876 should be delegated: 9878 * The client must be able to respond to the server's callback 9879 requests. If a backchannel has been established, the server will 9880 send a CB_COMPOUND request, containing a single operation, 9881 CB_SEQUENCE, for a test of backchannel availability. 9883 * The client must have responded properly to previous recalls. 9885 * There must be no current OPEN conflicting with the requested 9886 delegation. 9888 * There should be no current delegation that conflicts with the 9889 delegation being requested. 9891 * The probability of future conflicting open requests should be low 9892 based on the recent history of the file. 9894 * The existence of any server-specific semantics of OPEN/CLOSE that 9895 would make the required handling incompatible with the prescribed 9896 handling that the delegated client would apply (see below). 9898 There are two types of OPEN delegations: OPEN_DELEGATE_READ and 9899 OPEN_DELEGATE_WRITE. An OPEN_DELEGATE_READ delegation allows a 9900 client to handle, on its own, requests to open a file for reading 9901 that do not deny OPEN4_SHARE_ACCESS_READ access to others. Multiple 9902 OPEN_DELEGATE_READ delegations may be outstanding simultaneously and 9903 do not conflict. An OPEN_DELEGATE_WRITE delegation allows the client 9904 to handle, on its own, all opens. Only one OPEN_DELEGATE_WRITE 9905 delegation may exist for a given file at a given time, and it is 9906 inconsistent with any OPEN_DELEGATE_READ delegations. 9908 When a client has an OPEN_DELEGATE_READ delegation, it is assured 9909 that neither the contents, the attributes (with the exception of 9910 time_access), nor the names of any links to the file will change 9911 without its knowledge, so long as the delegation is held. When a 9912 client has an OPEN_DELEGATE_WRITE delegation, it may modify the file 9913 data locally since no other client will be accessing the file's data. 9914 The client holding an OPEN_DELEGATE_WRITE delegation may only locally 9915 affect file attributes that are intimately connected with the file 9916 data: size, change, time_access, time_metadata, and time_modify. All 9917 other attributes must be reflected on the server. 9919 When a client has an OPEN delegation, it does not need to send OPENs 9920 or CLOSEs to the server. Instead, the client may update the 9921 appropriate status internally. For an OPEN_DELEGATE_READ delegation, 9922 opens that cannot be handled locally (opens that are for 9923 OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH or that deny 9924 OPEN4_SHARE_ACCESS_READ access) must be sent to the server. 9926 When an OPEN delegation is made, the reply to the OPEN contains an 9927 OPEN delegation structure that specifies the following: 9929 * the type of delegation (OPEN_DELEGATE_READ or 9930 OPEN_DELEGATE_WRITE). 9932 * space limitation information to control flushing of data on close 9933 (OPEN_DELEGATE_WRITE delegation only; see Section 10.4.1) 9935 * an nfsace4 specifying read and write permissions 9937 * a stateid to represent the delegation 9939 The delegation stateid is separate and distinct from the stateid for 9940 the OPEN proper. The standard stateid, unlike the delegation 9941 stateid, is associated with a particular lock-owner and will continue 9942 to be valid after the delegation is recalled and the file remains 9943 open. 
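The following non-normative sketch (in C) shows the check a client holding
a delegation might apply to an application open in order to decide whether
an OPEN needs to be sent to the server; all names are illustrative.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   #define OPEN4_SHARE_ACCESS_WRITE 0x00000002
   #define OPEN4_SHARE_DENY_READ    0x00000001

   enum deleg_kind { DELEG_NONE, DELEG_READ, DELEG_WRITE };

   /* Decide whether an application open can be handled by the client
    * itself, without an OPEN being sent to the server.  Even when it
    * can, the client still applies the share-reservation check of
    * Section 9.7 and the permission check against the nfsace4 supplied
    * with the delegation (described below).                           */
   static bool open_handled_locally(enum deleg_kind deleg,
                                    uint32_t access, uint32_t deny)
   {
       switch (deleg) {
       case DELEG_WRITE:
           /* A write delegation lets the client handle all opens.     */
           return true;
       case DELEG_READ:
           /* Only opens for reading that do not deny read access.     */
           return (access & OPEN4_SHARE_ACCESS_WRITE) == 0 &&
                  (deny   & OPEN4_SHARE_DENY_READ)    == 0;
       default:
           return false;
       }
   }

   int main(void)
   {
       printf("%d\n", open_handled_locally(DELEG_READ, 0x1, 0x0)); /* 1 */
       printf("%d\n", open_handled_locally(DELEG_READ, 0x2, 0x0)); /* 0 */
       return 0;
   }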
9945 When a request internal to the client is made to open a file and an 9946 OPEN delegation is in effect, it will be accepted or rejected solely 9947 on the basis of the following conditions. Any requirement for other 9948 checks to be made by the delegate should result in the OPEN 9949 delegation being denied so that the checks can be made by the server 9950 itself. 9952 * The access and deny bits for the request and the file as described 9953 in Section 9.7. 9955 * The read and write permissions as determined below. 9957 The nfsace4 passed with delegation can be used to avoid frequent 9958 ACCESS calls. The permission check should be as follows: 9960 * If the nfsace4 indicates that the open may be done, then it should 9961 be granted without reference to the server. 9963 * If the nfsace4 indicates that the open may not be done, then an 9964 ACCESS request must be sent to the server to obtain the definitive 9965 answer. 9967 The server may return an nfsace4 that is more restrictive than the 9968 actual ACL of the file. This includes an nfsace4 that specifies 9969 denial of all access. Note that some common practices such as 9970 mapping the traditional user "root" to the user "nobody" (see 9971 Section 5.9) may make it incorrect to return the actual ACL of the 9972 file in the delegation response. 9974 The use of a delegation together with various other forms of caching 9975 creates the possibility that no server authentication and 9976 authorization will ever be performed for a given user since all of 9977 the user's requests might be satisfied locally. Where the client is 9978 depending on the server for authentication and authorization, the 9979 client should be sure authentication and authorization occurs for 9980 each user by use of the ACCESS operation. This should be the case 9981 even if an ACCESS operation would not be required otherwise. As 9982 mentioned before, the server may enforce frequent authentication by 9983 returning an nfsace4 denying all access with every OPEN delegation. 9985 10.4.1. Open Delegation and Data Caching 9987 An OPEN delegation allows much of the message overhead associated 9988 with the opening and closing files to be eliminated. An open when an 9989 OPEN delegation is in effect does not require that a validation 9990 message be sent to the server. The continued endurance of the 9991 "OPEN_DELEGATE_READ delegation" provides a guarantee that no OPEN for 9992 OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH, and thus no write, 9993 has occurred. Similarly, when closing a file opened for 9994 OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH and if an 9995 OPEN_DELEGATE_WRITE delegation is in effect, the data written does 9996 not have to be written to the server until the OPEN delegation is 9997 recalled. The continued endurance of the OPEN delegation provides a 9998 guarantee that no open, and thus no READ or WRITE, has been done by 9999 another client. 10001 For the purposes of OPEN delegation, READs and WRITEs done without an 10002 OPEN are treated as the functional equivalents of a corresponding 10003 type of OPEN. 
Although a client SHOULD NOT use special stateids when 10004 an open exists, delegation handling on the server can use the client 10005 ID associated with the current session to determine if the operation 10006 has been done by the holder of the delegation (in which case, no 10007 recall is necessary) or by another client (in which case, the 10008 delegation must be recalled and I/O not proceed until the delegation 10009 is returned or revoked). 10011 With delegations, a client is able to avoid writing data to the 10012 server when the CLOSE of a file is serviced. The file close system 10013 call is the usual point at which the client is notified of a lack of 10014 stable storage for the modified file data generated by the 10015 application. At the close, file data is written to the server and, 10016 through normal accounting, the server is able to determine if the 10017 available file system space for the data has been exceeded (i.e., the 10018 server returns NFS4ERR_NOSPC or NFS4ERR_DQUOT). This accounting 10019 includes quotas. The introduction of delegations requires that an 10020 alternative method be in place for the same type of communication to 10021 occur between client and server. 10023 In the delegation response, the server provides either the limit of 10024 the size of the file or the number of modified blocks and associated 10025 block size. The server must ensure that the client will be able to 10026 write modified data to the server of a size equal to that provided in 10027 the original delegation. The server must make this assurance for all 10028 outstanding delegations. Therefore, the server must be careful in 10029 its management of available space for new or modified data, taking 10030 into account available file system space and any applicable quotas. 10031 The server can recall delegations as a result of managing the 10032 available file system space. The client should abide by the server's 10033 state space limits for delegations. If the client exceeds the stated 10034 limits for the delegation, the server's behavior is undefined. 10036 Based on server conditions, quotas, or available file system space, 10037 the server may grant OPEN_DELEGATE_WRITE delegations with very 10038 restrictive space limitations. The limitations may be defined in a 10039 way that will always force modified data to be flushed to the server 10040 on close. 10042 With respect to authentication, flushing modified data to the server 10043 after a CLOSE has occurred may be problematic. For example, the user 10044 of the application may have logged off the client, and unexpired 10045 authentication credentials may not be present. In this case, the 10046 client may need to take special care to ensure that local unexpired 10047 credentials will in fact be available. This may be accomplished by 10048 tracking the expiration time of credentials and flushing data well in 10049 advance of their expiration or by making private copies of 10050 credentials to assure their availability when needed. 10052 10.4.2. Open Delegation and File Locks 10054 When a client holds an OPEN_DELEGATE_WRITE delegation, lock 10055 operations are performed locally. This includes those required for 10056 mandatory byte-range locking. This can be done since the delegation 10057 implies that there can be no conflicting locks. Similarly, all of 10058 the revalidations that would normally be associated with obtaining 10059 locks and the flushing of data associated with the releasing of locks 10060 need not be done. 
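As a non-normative illustration, the client's obligations around a
byte-range lock might be summarized as in the following sketch, which
combines the data-caching rules of Section 10.3.2 with the relaxations
allowed when an OPEN_DELEGATE_WRITE delegation is held.  The names used
are purely illustrative.

   #include <stdbool.h>
   #include <stdio.h>

   enum deleg_kind { DELEG_NONE, DELEG_READ, DELEG_WRITE };

   /* Illustrative summary of what the client must do around a
    * byte-range lock on a given file. */
   struct lock_policy {
       bool send_lock_to_server;  /* LOCK/LOCKU must go to the server   */
       bool revalidate_on_lock;   /* check change attribute when locking */
       bool flush_on_unlock;      /* write modified data before LOCKU   */
   };

   static struct lock_policy lock_policy_for(enum deleg_kind deleg)
   {
       struct lock_policy p;

       if (deleg == DELEG_WRITE) {
           /* No other client can hold conflicting locks or see the
            * data, so locking, revalidation, and flushing can all be
            * handled locally while the delegation is held.            */
           p.send_lock_to_server = false;
           p.revalidate_on_lock  = false;
           p.flush_on_unlock     = false;
       } else {
           p.send_lock_to_server = true;
           p.revalidate_on_lock  = true;
           p.flush_on_unlock     = true;
       }
       return p;
   }

   int main(void)
   {
       struct lock_policy p = lock_policy_for(DELEG_WRITE);
       printf("send=%d revalidate=%d flush=%d\n",
              p.send_lock_to_server, p.revalidate_on_lock,
              p.flush_on_unlock);
       return 0;
   }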
10062 When a client holds an OPEN_DELEGATE_READ delegation, lock operations 10063 are not performed locally. All lock operations, including those 10064 requesting non-exclusive locks, are sent to the server for 10065 resolution. 10067 10.4.3. Handling of CB_GETATTR 10069 The server needs to employ special handling for a GETATTR where the 10070 target is a file that has an OPEN_DELEGATE_WRITE delegation in 10071 effect. The reason for this is that the client holding the 10072 OPEN_DELEGATE_WRITE delegation may have modified the data, and the 10073 server needs to reflect this change to the second client that 10074 submitted the GETATTR. Therefore, the client holding the 10075 OPEN_DELEGATE_WRITE delegation needs to be interrogated. The server 10076 will use the CB_GETATTR operation. The only attributes that the 10077 server can reliably query via CB_GETATTR are size and change. 10079 Since CB_GETATTR is being used to satisfy another client's GETATTR 10080 request, the server only needs to know if the client holding the 10081 delegation has a modified version of the file. If the client's copy 10082 of the delegated file is not modified (data or size), the server can 10083 satisfy the second client's GETATTR request from the attributes 10084 stored locally at the server. If the file is modified, the server 10085 only needs to know about this modified state. If the server 10086 determines that the file is currently modified, it will respond to 10087 the second client's GETATTR as if the file had been modified locally 10088 at the server. 10090 Since the form of the change attribute is determined by the server 10091 and is opaque to the client, the client and server need to agree on a 10092 method of communicating the modified state of the file. For the size 10093 attribute, the client will report its current view of the file size. 10094 For the change attribute, the handling is more involved. 10096 For the client, the following steps will be taken when receiving an 10097 OPEN_DELEGATE_WRITE delegation: 10099 * The value of the change attribute will be obtained from the server 10100 and cached. Let this value be represented by c. 10102 * The client will create a value greater than c that will be used 10103 for communicating that modified data is held at the client. Let 10104 this value be represented by d. 10106 * When the client is queried via CB_GETATTR for the change 10107 attribute, it checks to see if it holds modified data. If the 10108 file is modified, the value d is returned for the change attribute 10109 value. If this file is not currently modified, the client returns 10110 the value c for the change attribute. 10112 For simplicity of implementation, the client MAY for each CB_GETATTR 10113 return the same value d. This is true even if, between successive 10114 CB_GETATTR operations, the client again modifies the file's data or 10115 metadata in its cache. The client can return the same value because 10116 the only requirement is that the client be able to indicate to the 10117 server that the client holds modified data. Therefore, the value of 10118 d may always be c + 1. 10120 While the change attribute is opaque to the client in the sense that 10121 it has no idea what units of time, if any, the server is counting 10122 change with, it is not opaque in that the client has to treat it as 10123 an unsigned integer, and the server has to be able to see the results 10124 of the client's changes to that integer. 
Therefore, the server MUST 10125 encode the change attribute in network order when sending it to the 10126 client. The client MUST decode it from network order to its native 10127 order when receiving it, and the client MUST encode it in network 10128 order when sending it to the server. For this reason, change is 10129 defined as an unsigned integer rather than an opaque array of bytes. 10131 For the server, the following steps will be taken when providing an 10132 OPEN_DELEGATE_WRITE delegation: 10134 * Upon providing an OPEN_DELEGATE_WRITE delegation, the server will 10135 cache a copy of the change attribute in the data structure it uses 10136 to record the delegation. Let this value be represented by sc. 10138 * When a second client sends a GETATTR operation on the same file to 10139 the server, the server obtains the change attribute from the first 10140 client. Let this value be cc. 10142 * If the value cc is equal to sc, the file is not modified and the 10143 server returns the current values for change, time_metadata, and 10144 time_modify (for example) to the second client. 10146 * If the value cc is NOT equal to sc, the file is currently modified 10147 at the first client and most likely will be modified at the server 10148 at a future time. The server then uses its current time to 10149 construct attribute values for time_metadata and time_modify. A 10150 new value of sc, which we will call nsc, is computed by the 10151 server, such that nsc >= sc + 1. The server then returns the 10152 constructed time_metadata, time_modify, and nsc values to the 10153 requester. The server replaces sc in the delegation record with 10154 nsc. To prevent the possibility of time_modify, time_metadata, 10155 and change from appearing to go backward (which would happen if 10156 the client holding the delegation fails to write its modified data 10157 to the server before the delegation is revoked or returned), the 10158 server SHOULD update the file's metadata record with the 10159 constructed attribute values. For reasons of reasonable 10160 performance, committing the constructed attribute values to stable 10161 storage is OPTIONAL. 10163 As discussed earlier in this section, the client MAY return the same 10164 cc value on subsequent CB_GETATTR calls, even if the file was 10165 modified in the client's cache yet again between successive 10166 CB_GETATTR calls. Therefore, the server must assume that the file 10167 has been modified yet again, and MUST take care to ensure that the 10168 new nsc it constructs and returns is greater than the previous nsc it 10169 returned. An example implementation's delegation record would 10170 satisfy this mandate by including a boolean field (let us call it 10171 "modified") that is set to FALSE when the delegation is granted, and 10172 an sc value set at the time of grant to the change attribute value. 10173 The modified field would be set to TRUE the first time cc != sc, and 10174 would stay TRUE until the delegation is returned or revoked. 
The 10175 processing for constructing nsc, time_modify, and time_metadata would 10176 use this pseudo code: 10178 if (!modified) { 10179 do CB_GETATTR for change and size; 10181 if (cc != sc) 10182 modified = TRUE; 10183 } else { 10184 do CB_GETATTR for size; 10185 } 10187 if (modified) { 10188 sc = sc + 1; 10189 time_modify = time_metadata = current_time; 10190 update sc, time_modify, time_metadata into file's metadata; 10191 } 10193 This would return to the client (that sent GETATTR) the attributes it 10194 requested, but make sure size comes from what CB_GETATTR returned. 10195 The server would not update the file's metadata with the client's 10196 modified size. 10198 In the case that the file attribute size is different than the 10199 server's current value, the server treats this as a modification 10200 regardless of the value of the change attribute retrieved via 10201 CB_GETATTR and responds to the second client as in the last step. 10203 This methodology resolves issues of clock differences between client 10204 and server and other scenarios where the use of CB_GETATTR break 10205 down. 10207 It should be noted that the server is under no obligation to use 10208 CB_GETATTR, and therefore the server MAY simply recall the delegation 10209 to avoid its use. 10211 10.4.4. Recall of Open Delegation 10213 The following events necessitate recall of an OPEN delegation: 10215 * potentially conflicting OPEN request (or a READ or WRITE operation 10216 done with a special stateid) 10218 * SETATTR sent by another client 10220 * REMOVE request for the file 10222 * RENAME request for the file as either the source or target of the 10223 RENAME 10225 Whether a RENAME of a directory in the path leading to the file 10226 results in recall of an OPEN delegation depends on the semantics of 10227 the server's file system. If that file system denies such RENAMEs 10228 when a file is open, the recall must be performed to determine 10229 whether the file in question is, in fact, open. 10231 In addition to the situations above, the server may choose to recall 10232 OPEN delegations at any time if resource constraints make it 10233 advisable to do so. Clients should always be prepared for the 10234 possibility of recall. 10236 When a client receives a recall for an OPEN delegation, it needs to 10237 update state on the server before returning the delegation. These 10238 same updates must be done whenever a client chooses to return a 10239 delegation voluntarily. The following items of state need to be 10240 dealt with: 10242 * If the file associated with the delegation is no longer open and 10243 no previous CLOSE operation has been sent to the server, a CLOSE 10244 operation must be sent to the server. 10246 * If a file has other open references at the client, then OPEN 10247 operations must be sent to the server. The appropriate stateids 10248 will be provided by the server for subsequent use by the client 10249 since the delegation stateid will no longer be valid. These OPEN 10250 requests are done with the claim type of CLAIM_DELEGATE_CUR. This 10251 will allow the presentation of the delegation stateid so that the 10252 client can establish the appropriate rights to perform the OPEN. 10253 (See Section 18.16, which describes the OPEN operation, for 10254 details.) 10256 * If there are granted byte-range locks, the corresponding LOCK 10257 operations need to be performed. This applies to the 10258 OPEN_DELEGATE_WRITE delegation case only. 
10260 * For an OPEN_DELEGATE_WRITE delegation, if at the time of recall 10261 the file is not open for OPEN4_SHARE_ACCESS_WRITE/ 10262 OPEN4_SHARE_ACCESS_BOTH, all modified data for the file must be 10263 flushed to the server. If the delegation had not existed, the 10264 client would have done this data flush before the CLOSE operation. 10266 * For an OPEN_DELEGATE_WRITE delegation when a file is still open at 10267 the time of recall, any modified data for the file needs to be 10268 flushed to the server. 10270 * With the OPEN_DELEGATE_WRITE delegation in place, it is possible 10271 that the file was truncated during the duration of the delegation. 10272 For example, the truncation could have occurred as a result of an 10273 OPEN UNCHECKED with a size attribute value of zero. Therefore, if 10274 a truncation of the file has occurred and this operation has not 10275 been propagated to the server, the truncation must occur before 10276 any modified data is written to the server. 10278 In the case of OPEN_DELEGATE_WRITE delegation, byte-range locking 10279 imposes some additional requirements. To precisely maintain the 10280 associated invariant, it is required to flush any modified data in 10281 any byte-range for which a WRITE_LT lock was released while the 10282 OPEN_DELEGATE_WRITE delegation was in effect. However, because the 10283 OPEN_DELEGATE_WRITE delegation implies no other locking by other 10284 clients, a simpler implementation is to flush all modified data for 10285 the file (as described just above) if any WRITE_LT lock has been 10286 released while the OPEN_DELEGATE_WRITE delegation was in effect. 10288 An implementation need not wait until delegation recall (or the 10289 decision to voluntarily return a delegation) to perform any of the 10290 above actions, if implementation considerations (e.g., resource 10291 availability constraints) make that desirable. Generally, however, 10292 the fact that the actual OPEN state of the file may continue to 10293 change makes it not worthwhile to send information about opens and 10294 closes to the server, except as part of delegation return. An 10295 exception is when the client has no more internal opens of the file. 10296 In this case, sending a CLOSE is useful because it reduces resource 10297 utilization on the client and server. Regardless of the client's 10298 choices on scheduling these actions, all must be performed before the 10299 delegation is returned, including (when applicable) the close that 10300 corresponds to the OPEN that resulted in the delegation. These 10301 actions can be performed either in previous requests or in previous 10302 operations in the same COMPOUND request. 10304 10.4.5. Clients That Fail to Honor Delegation Recalls 10306 A client may fail to respond to a recall for various reasons, such as 10307 a failure of the backchannel from server to the client. The client 10308 may be unaware of a failure in the backchannel. This lack of 10309 awareness could result in the client finding out long after the 10310 failure that its delegation has been revoked, and another client has 10311 modified the data for which the client had a delegation. This is 10312 especially a problem for the client that held an OPEN_DELEGATE_WRITE 10313 delegation. 10315 Status bits returned by SEQUENCE operations help to provide an 10316 alternate way of informing the client of issues regarding the status 10317 of the backchannel and of recalled delegations. 
When the backchannel 10318 is not available, the server returns the status bit 10319 SEQ4_STATUS_CB_PATH_DOWN on SEQUENCE operations. The client can 10320 react by attempting to re-establish the backchannel and by returning 10321 recallable objects if a backchannel cannot be successfully re- 10322 established. 10324 Whether the backchannel is functioning or not, it may be that the 10325 recalled delegation is not returned. Note that the client's lease 10326 might still be renewed, even though the recalled delegation is not 10327 returned. In this situation, servers SHOULD revoke delegations that 10328 are not returned in a period of time equal to the lease period. This 10329 period of time should allow the client time to note the backchannel- 10330 down status and re-establish the backchannel. 10332 When delegations are revoked, the server will return with the 10333 SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit set on subsequent 10334 SEQUENCE operations. The client should note this and then use 10335 TEST_STATEID to find which delegations have been revoked. 10337 10.4.6. Delegation Revocation 10339 At the point a delegation is revoked, if there are associated opens 10340 on the client, these opens may or may not be revoked. If no byte- 10341 range lock or open is granted that is inconsistent with the existing 10342 open, the stateid for the open may remain valid and be disconnected 10343 from the revoked delegation, just as would be the case if the 10344 delegation were returned. 10346 For example, if an OPEN for OPEN4_SHARE_ACCESS_BOTH with a deny of 10347 OPEN4_SHARE_DENY_NONE is associated with the delegation, granting of 10348 another such OPEN to a different client will revoke the delegation 10349 but need not revoke the OPEN, since the two OPENs are consistent with 10350 each other. On the other hand, if an OPEN denying write access is 10351 granted, then the existing OPEN must be revoked. 10353 When opens and/or locks are revoked, the applications holding these 10354 opens or locks need to be notified. This notification usually occurs 10355 by returning errors for READ/WRITE operations or when a close is 10356 attempted for the open file. 10358 If no opens exist for the file at the point the delegation is 10359 revoked, then notification of the revocation is unnecessary. 10360 However, if there is modified data present at the client for the 10361 file, the user of the application should be notified. Unfortunately, 10362 it may not be possible to notify the user since active applications 10363 may not be present at the client. See Section 10.5.1 for additional 10364 details. 10366 10.4.7. Delegations via WANT_DELEGATION 10368 In addition to providing delegations as part of the reply to OPEN 10369 operations, servers MAY provide delegations separate from open, via 10370 the OPTIONAL WANT_DELEGATION operation. This allows delegations to 10371 be obtained in advance of an OPEN that might benefit from them, for 10372 objects that are not a valid target of OPEN, or to deal with cases in 10373 which a delegation has been recalled and the client wants to make an 10374 attempt to re-establish it if the absence of use by other clients 10375 allows that. 10377 The WANT_DELEGATION operation may be performed on any type of file 10378 object other than a directory. 
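Loss and re-acquisition of delegations can thus involve several of the
mechanisms described in Sections 10.4.5 through 10.4.7.  The following
non-normative sketch (in C) shows one way a client might map the SEQUENCE
status flags mentioned above to its next step; the enum and function names
are illustrative, and the flag values are those given in the protocol's
XDR description.

   #include <stdint.h>
   #include <stdio.h>

   /* SEQUENCE status flags involved here. */
   #define SEQ4_STATUS_CB_PATH_DOWN             0x00000001
   #define SEQ4_STATUS_RECALLABLE_STATE_REVOKED 0x00000040

   enum client_action {
       NOTHING_TO_DO,
       REBIND_BACKCHANNEL,      /* BIND_CONN_TO_SESSION; if that fails,
                                   return recallable objects           */
       TEST_AND_FREE_STATEIDS   /* TEST_STATEID, then FREE_STATEID and
                                   possibly WANT_DELEGATION to attempt
                                   re-establishment                    */
   };

   /* Illustrative mapping from SEQUENCE status flags to the client's
    * next step with respect to the delegations it holds. */
   static enum client_action next_action(uint32_t sr_status_flags)
   {
       if (sr_status_flags & SEQ4_STATUS_RECALLABLE_STATE_REVOKED)
           return TEST_AND_FREE_STATEIDS;
       if (sr_status_flags & SEQ4_STATUS_CB_PATH_DOWN)
           return REBIND_BACKCHANNEL;
       return NOTHING_TO_DO;
   }

   int main(void)
   {
       printf("%d\n", next_action(SEQ4_STATUS_CB_PATH_DOWN));
       return 0;
   }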
10380 When a delegation is obtained using WANT_DELEGATION, any open files 10381 for the same filehandle held by that client are to be treated as 10382 subordinate to the delegation, just as if they had been created using 10383 an OPEN of type CLAIM_DELEGATE_CUR. They are otherwise unchanged as 10384 to seqid, access and deny modes, and the relationship with byte-range 10385 locks. Similarly, because existing byte-range locks are subordinate 10386 to an open, those byte-range locks also become indirectly subordinate 10387 to that new delegation. 10389 The WANT_DELEGATION operation provides for delivery of delegations 10390 via callbacks, when the delegations are not immediately available. 10391 When a requested delegation is available, it is delivered to the 10392 client via a CB_PUSH_DELEG operation. When this happens, open files 10393 for the same filehandle become subordinate to the new delegation at 10394 the point at which the delegation is delivered, just as if they had 10395 been created using an OPEN of type CLAIM_DELEGATE_CUR. Similarly, 10396 this occurs for existing byte-range locks subordinate to an open. 10398 10.5. Data Caching and Revocation 10400 When locks and delegations are revoked, the assumptions upon which 10401 successful caching depends are no longer guaranteed. For any locks 10402 or share reservations that have been revoked, the corresponding 10403 state-owner needs to be notified. This notification includes 10404 applications with a file open that has a corresponding delegation 10405 that has been revoked. Cached data associated with the revocation 10406 must be removed from the client. In the case of modified data 10407 existing in the client's cache, that data must be removed from the 10408 client without being written to the server. As mentioned, the 10409 assumptions made by the client are no longer valid at the point when 10410 a lock or delegation has been revoked. For example, another client 10411 may have been granted a conflicting byte-range lock after the 10412 revocation of the byte-range lock at the first client. Therefore, 10413 the data within the lock range may have been modified by the other 10414 client. Obviously, the first client is unable to guarantee to the 10415 application what has occurred to the file in the case of revocation. 10417 Notification to a state-owner will in many cases consist of simply 10418 returning an error on the next and all subsequent READs/WRITEs to the 10419 open file or on the close. Where the methods available to a client 10420 make such notification impossible because errors for certain 10421 operations may not be returned, more drastic action such as signals 10422 or process termination may be appropriate. The justification here is 10423 that an invariant on which an application depends may be violated. 10424 Depending on how errors are typically treated for the client- 10425 operating environment, further levels of notification including 10426 logging, console messages, and GUI pop-ups may be appropriate. 10428 10.5.1. Revocation Recovery for Write Open Delegation 10430 Revocation recovery for an OPEN_DELEGATE_WRITE delegation poses the 10431 special issue of modified data in the client cache while the file is 10432 not open. In this situation, any client that does not flush modified 10433 data to the server on each close must ensure that the user receives 10434 appropriate notification of the failure as a result of the 10435 revocation. 
Since such situations may require human action to
10436 correct problems, notification schemes in which the appropriate user
10437 or administrator is notified may be necessary. Logging and console
10438 messages are typical examples.

10440 If there is modified data on the client, it must not be flushed
10441 normally to the server. A client may attempt to provide a copy of
10442 the file data as modified during the delegation under a different
10443 name in the file system namespace to ease recovery. Note that when
10444 the client can determine that the file has not been modified by any
10445 other client, or when the client has a complete cached copy of the
10446 file in question, such a saved copy of the client's view of the file
10447 may be of particular value for recovery. In another case, recovery
10448 using a copy of the file based partially on the client's cached data
10449 and partially on the server's copy as modified by other clients will
10450 be anything but straightforward, so clients may avoid saving file
10451 contents in these situations or specially mark the results to warn
10452 users of possible problems.

10454 Saving of such modified data in delegation revocation situations may
10455 be limited to files of a certain size or might be used only when
10456 sufficient disk space is available within the target file system.
10457 Such saving may also be restricted to situations when the client has
10458 sufficient buffering resources to keep the cached copy available
10459 until it is properly stored to the target file system.

10461 10.6. Attribute Caching

10463 This section pertains to the caching of a file's attributes on a
10464 client when that client does not hold a delegation on the file.

10466 The attributes discussed in this section do not include named
10467 attributes. Individual named attributes are analogous to files, and
10468 caching of the data for these needs to be handled just as data
10469 caching is for ordinary files. Similarly, LOOKUP results from an
10470 OPENATTR directory (as well as the directory's contents) are to be
10471 cached on the same basis as any other pathnames.

10473 Clients may cache file attributes obtained from the server and use
10474 them to avoid subsequent GETATTR requests. Such caching is write
10475 through in that modification to file attributes is always done by
10476 means of requests to the server and should not be done locally and
10477 should not be cached. The exceptions to this are modifications to
10478 attributes that are intimately connected with data caching.
10479 Therefore, extending a file by writing data to the local data cache
10480 is reflected immediately in the size as seen on the client without
10481 this change being immediately reflected on the server. Normally,
10482 such changes are not propagated directly to the server, but when the
10483 modified data is flushed to the server, analogous attribute changes
10484 are made on the server. When OPEN delegation is in effect, the
10485 modified attributes may be returned to the server in reaction to a
10486 CB_RECALL call.

10488 The result of local caching of attributes is that the attribute
10489 caches maintained on individual clients will not be coherent.
10490 Changes made in one order on the server may be seen in a different
10491 order on one client and in a third order on another client.
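The following non-normative Python sketch illustrates the write-through attribute caching rule just described. Only the size attribute is modeled as being intimately connected with data caching; the Server class is an illustrative stand-in for the actual GETATTR, SETATTR, and WRITE requests.

   # Non-normative sketch of write-through attribute caching: attribute
   # changes are made by requests to the server rather than cached,
   # except for attributes tied to cached dirty data (only 'size' here).

   class Server:
       def __init__(self):
           self.attrs = {"size": 0, "mode": 0o644}

       def setattr(self, changes):          # models SETATTR
           self.attrs.update(changes)
           return dict(self.attrs)

       def getattr(self):                   # models GETATTR
           return dict(self.attrs)

       def write(self, offset, data):       # models WRITE; size follows data
           self.attrs["size"] = max(self.attrs["size"], offset + len(data))
           return dict(self.attrs)

   class AttrCache:
       def __init__(self, server):
           self.server = server
           self.cached = server.getattr()

       def change_mode(self, mode):
           # Write-through: the change is made by a request to the server
           # and the cache is refreshed from the server's reply.
           self.cached = self.server.setattr({"mode": mode})

       def extend_via_cached_write(self, offset, data):
           # Exception: size is updated locally when a write lands in the
           # local data cache, without the server seeing it immediately.
           self.cached["size"] = max(self.cached["size"], offset + len(data))
           return (offset, data)            # dirty data kept for later flush

       def flush(self, dirty):
           # Flushing the modified data makes the analogous attribute
           # change (size) visible on the server as well.
           offset, data = dirty
           self.cached = self.server.write(offset, data)

   if __name__ == "__main__":
       cache = AttrCache(Server())
       dirty = cache.extend_via_cached_write(0, b"x" * 8192)
       print(cache.cached["size"], cache.server.getattr()["size"])  # 8192 0
       cache.flush(dirty)
       print(cache.server.getattr()["size"])                        # 8192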
10493 The typical file system application programming interfaces do not 10494 provide means to atomically modify or interrogate attributes for 10495 multiple files at the same time. The following rules provide an 10496 environment where the potential incoherencies mentioned above can be 10497 reasonably managed. These rules are derived from the practice of 10498 previous NFS protocols. 10500 * All attributes for a given file (per-fsid attributes excepted) are 10501 cached as a unit at the client so that no non-serializability can 10502 arise within the context of a single file. 10504 * An upper time boundary is maintained on how long a client cache 10505 entry can be kept without being refreshed from the server. 10507 * When operations are performed that change attributes at the 10508 server, the updated attribute set is requested as part of the 10509 containing RPC. This includes directory operations that update 10510 attributes indirectly. This is accomplished by following the 10511 modifying operation with a GETATTR operation and then using the 10512 results of the GETATTR to update the client's cached attributes. 10514 Note that if the full set of attributes to be cached is requested by 10515 READDIR, the results can be cached by the client on the same basis as 10516 attributes obtained via GETATTR. 10518 A client may validate its cached version of attributes for a file by 10519 fetching both the change and time_access attributes and assuming that 10520 if the change attribute has the same value as it did when the 10521 attributes were cached, then no attributes other than time_access 10522 have changed. The reason why time_access is also fetched is because 10523 many servers operate in environments where the operation that updates 10524 change does not update time_access. For example, POSIX file 10525 semantics do not update access time when a file is modified by the 10526 write system call [15]. Therefore, the client that wants a current 10527 time_access value should fetch it with change during the attribute 10528 cache validation processing and update its cached time_access. 10530 The client may maintain a cache of modified attributes for those 10531 attributes intimately connected with data of modified regular files 10532 (size, time_modify, and change). Other than those three attributes, 10533 the client MUST NOT maintain a cache of modified attributes. 10534 Instead, attribute changes are immediately sent to the server. 10536 In some operating environments, the equivalent to time_access is 10537 expected to be implicitly updated by each read of the content of the 10538 file object. If an NFS client is caching the content of a file 10539 object, whether it is a regular file, directory, or symbolic link, 10540 the client SHOULD NOT update the time_access attribute (via SETATTR 10541 or a small READ or READDIR request) on the server with each read that 10542 is satisfied from cache. The reason is that this can defeat the 10543 performance benefits of caching content, especially since an explicit 10544 SETATTR of time_access may alter the change attribute on the server. 10545 If the change attribute changes, clients that are caching the content 10546 will think the content has changed, and will re-read unmodified data 10547 from the server. 
Nor is the client encouraged to maintain a modified 10548 version of time_access in its cache, since the client either would 10549 eventually have to write the access time to the server with bad 10550 performance effects or never update the server's time_access, thereby 10551 resulting in a situation where an application that caches access time 10552 between a close and open of the same file observes the access time 10553 oscillating between the past and present. The time_access attribute 10554 always means the time of last access to a file by a read that was 10555 satisfied by the server. This way clients will tend to see only 10556 time_access changes that go forward in time. 10558 10.7. Data and Metadata Caching and Memory Mapped Files 10560 Some operating environments include the capability for an application 10561 to map a file's content into the application's address space. Each 10562 time the application accesses a memory location that corresponds to a 10563 block that has not been loaded into the address space, a page fault 10564 occurs and the file is read (or if the block does not exist in the 10565 file, the block is allocated and then instantiated in the 10566 application's address space). 10568 As long as each memory-mapped access to the file requires a page 10569 fault, the relevant attributes of the file that are used to detect 10570 access and modification (time_access, time_metadata, time_modify, and 10571 change) will be updated. However, in many operating environments, 10572 when page faults are not required, these attributes will not be 10573 updated on reads or updates to the file via memory access (regardless 10574 of whether the file is local or is accessed remotely). A client or 10575 server MAY fail to update attributes of a file that is being accessed 10576 via memory-mapped I/O. This has several implications: 10578 * If there is an application on the server that has memory mapped a 10579 file that a client is also accessing, the client may not be able 10580 to get a consistent value of the change attribute to determine 10581 whether or not its cache is stale. A server that knows that the 10582 file is memory-mapped could always pessimistically return updated 10583 values for change so as to force the application to always get the 10584 most up-to-date data and metadata for the file. However, due to 10585 the negative performance implications of this, such behavior is 10586 OPTIONAL. 10588 * If the memory-mapped file is not being modified on the server, and 10589 instead is just being read by an application via the memory-mapped 10590 interface, the client will not see an updated time_access 10591 attribute. However, in many operating environments, neither will 10592 any process running on the server. Thus, NFS clients are at no 10593 disadvantage with respect to local processes. 10595 * If there is another client that is memory mapping the file, and if 10596 that client is holding an OPEN_DELEGATE_WRITE delegation, the same 10597 set of issues as discussed in the previous two bullet points 10598 apply. So, when a server does a CB_GETATTR to a file that the 10599 client has modified in its cache, the reply from CB_GETATTR will 10600 not necessarily be accurate. 
As discussed earlier, the client's 10601 obligation is to report that the file has been modified since the 10602 delegation was granted, not whether it has been modified again 10603 between successive CB_GETATTR calls, and the server MUST assume 10604 that any file the client has modified in cache has been modified 10605 again between successive CB_GETATTR calls. Depending on the 10606 nature of the client's memory management system, this weak 10607 obligation may not be possible. A client MAY return stale 10608 information in CB_GETATTR whenever the file is memory-mapped. 10610 * The mixture of memory mapping and byte-range locking on the same 10611 file is problematic. Consider the following scenario, where a 10612 page size on each client is 8192 bytes. 10614 - Client A memory maps the first page (8192 bytes) of file X. 10616 - Client B memory maps the first page (8192 bytes) of file X. 10618 - Client A WRITE_LT locks the first 4096 bytes. 10620 - Client B WRITE_LT locks the second 4096 bytes. 10622 - Client A, via a STORE instruction, modifies part of its locked 10623 byte-range. 10625 - Simultaneous to client A, client B executes a STORE on part of 10626 its locked byte-range. 10628 Here the challenge is for each client to resynchronize to get a 10629 correct view of the first page. In many operating environments, the 10630 virtual memory management systems on each client only know a page is 10631 modified, not that a subset of the page corresponding to the 10632 respective lock byte-ranges has been modified. So it is not possible 10633 for each client to do the right thing, which is to write to the 10634 server only that portion of the page that is locked. For example, if 10635 client A simply writes out the page, and then client B writes out the 10636 page, client A's data is lost. 10638 Moreover, if mandatory locking is enabled on the file, then we have a 10639 different problem. When clients A and B execute the STORE 10640 instructions, the resulting page faults require a byte-range lock on 10641 the entire page. Each client then tries to extend their locked range 10642 to the entire page, which results in a deadlock. Communicating the 10643 NFS4ERR_DEADLOCK error to a STORE instruction is difficult at best. 10645 If a client is locking the entire memory-mapped file, there is no 10646 problem with advisory or mandatory byte-range locking, at least until 10647 the client unlocks a byte-range in the middle of the file. 10649 Given the above issues, the following are permitted: 10651 * Clients and servers MAY deny memory mapping a file for which they 10652 know there are byte-range locks. 10654 * Clients and servers MAY deny a byte-range lock on a file they know 10655 is memory-mapped. 10657 * A client MAY deny memory mapping a file that it knows requires 10658 mandatory locking for I/O. If mandatory locking is enabled after 10659 the file is opened and mapped, the client MAY deny the application 10660 further access to its mapped file. 10662 10.8. Name and Directory Caching without Directory Delegations 10664 The NFSv4.1 directory delegation facility (described in Section 10.9 10665 below) is OPTIONAL for servers to implement. Even where it is 10666 implemented, it may not always be functional because of resource 10667 availability issues or other constraints. Thus, it is important to 10668 understand how name and directory caching are done in the absence of 10669 directory delegations. These topics are discussed in the next two 10670 subsections. 10672 10.8.1. 
Name Caching 10674 The results of LOOKUP and READDIR operations may be cached to avoid 10675 the cost of subsequent LOOKUP operations. Just as in the case of 10676 attribute caching, inconsistencies may arise among the various client 10677 caches. To mitigate the effects of these inconsistencies and given 10678 the context of typical file system APIs, an upper time boundary is 10679 maintained for how long a client name cache entry can be kept without 10680 verifying that the entry has not been made invalid by a directory 10681 change operation performed by another client. 10683 When a client is not making changes to a directory for which there 10684 exist name cache entries, the client needs to periodically fetch 10685 attributes for that directory to ensure that it is not being 10686 modified. After determining that no modification has occurred, the 10687 expiration time for the associated name cache entries may be updated 10688 to be the current time plus the name cache staleness bound. 10690 When a client is making changes to a given directory, it needs to 10691 determine whether there have been changes made to the directory by 10692 other clients. It does this by using the change attribute as 10693 reported before and after the directory operation in the associated 10694 change_info4 value returned for the operation. The server is able to 10695 communicate to the client whether the change_info4 data is provided 10696 atomically with respect to the directory operation. If the change 10697 values are provided atomically, the client has a basis for 10698 determining, given proper care, whether other clients are modifying 10699 the directory in question. 10701 The simplest way to enable the client to make this determination is 10702 for the client to serialize all changes made to a specific directory. 10703 When this is done, and the server provides before and after values of 10704 the change attribute atomically, the client can simply compare the 10705 after value of the change attribute from one operation on a directory 10706 with the before value on the subsequent operation modifying that 10707 directory. When these are equal, the client is assured that no other 10708 client is modifying the directory in question. 10710 When such serialization is not used, and there may be multiple 10711 simultaneous outstanding operations modifying a single directory sent 10712 from a single client, making this sort of determination can be more 10713 complicated. If two such operations complete in a different order 10714 than they were actually performed, that might give an appearance 10715 consistent with modification being made by another client. Where 10716 this appears to happen, the client needs to await the completion of 10717 all such modifications that were started previously, to see if the 10718 outstanding before and after change numbers can be sorted into a 10719 chain such that the before value of one change number matches the 10720 after value of a previous one, in a chain consistent with this client 10721 being the only one modifying the directory. 10723 In either of these cases, the client is able to determine whether the 10724 directory is being modified by another client. If the comparison 10725 indicates that the directory was updated by another client, the name 10726 cache associated with the modified directory is purged from the 10727 client. 
If the comparison indicates no modification, the name cache
10728 can be updated on the client to reflect the directory operation and
10729 the associated timeout can be extended. The post-operation change
10730 value needs to be saved as the basis for future change_info4
10731 comparisons.

10733 As demonstrated by the scenario above, name caching requires that the
10734 client revalidate name cache data by inspecting the change attribute
10735 of a directory at the point when the name cache item was cached.
10736 This requires that the server update the change attribute for
10737 directories when the contents of the corresponding directory are
10738 modified. For a client to use the change_info4 information
10739 appropriately and correctly, the server must report the pre- and
10740 post-operation change attribute values atomically. When the server
10741 is unable to report the before and after values atomically with
10742 respect to the directory operation, the server must indicate that
10743 fact in the change_info4 return value. When the information is not
10744 atomically reported, the client should not assume that other clients
10745 have not changed the directory.

10747 10.8.2. Directory Caching

10749 The results of READDIR operations may be used to avoid subsequent
10750 READDIR operations. Just as in the cases of attribute and name
10751 caching, inconsistencies may arise among the various client caches.
10752 To mitigate the effects of these inconsistencies, and given the
10753 context of typical file system APIs, the following rules should be
10754 followed:

10756 * Cached READDIR information for a directory that is not obtained in
10757 a single READDIR operation must always be a consistent snapshot of
10758 directory contents. This is determined by using a GETATTR before
10759 the first READDIR and after the last READDIR that contributes to
10760 the cache.

10762 * An upper time boundary is maintained to indicate the length of
10763 time a directory cache entry is considered valid before the client
10764 must revalidate the cached information.

10766 The revalidation technique parallels that discussed in the case of
10767 name caching. When the client is not changing the directory in
10768 question, checking the change attribute of the directory with GETATTR
10769 is adequate. The lifetime of the cache entry can be extended at
10770 these checkpoints. When a client is modifying the directory, the
10771 client needs to use the change_info4 data to determine whether there
10772 are other clients modifying the directory. If it is determined that
10773 no other client modifications are occurring, the client may update
10774 its directory cache to reflect its own changes.

10776 As demonstrated previously, directory caching requires that the
10777 client revalidate directory cache data by inspecting the change
10778 attribute of a directory at the point when the directory was cached.
10779 This requires that the server update the change attribute for
10780 directories when the contents of the corresponding directory are
10781 modified. For a client to use the change_info4 information
10782 appropriately and correctly, the server must report the pre- and
10783 post-operation change attribute values atomically. When the server
10784 is unable to report the before and after values atomically with
10785 respect to the directory operation, the server must indicate that
10786 fact in the change_info4 return value.
When the information is not
10787 atomically reported, the client should not assume that other clients
10788 have not changed the directory.

10790 10.9. Directory Delegations

10792 10.9.1. Introduction to Directory Delegations

10794 Directory caching for the NFSv4.1 protocol, as previously described,
10795 is similar to file caching in previous versions. Clients typically
10796 cache directory information for a duration determined by the client.
10797 At the end of a predefined timeout, the client will query the server
10798 to see if the directory has been updated. By caching attributes,
10799 clients reduce the number of GETATTR calls made to the server to
10800 validate attributes. Furthermore, frequently accessed files and
10801 directories, such as the current working directory, have their
10802 attributes cached on the client so that some NFS operations can be
10803 performed without having to make an RPC call. By caching name and
10804 inode information about most recently looked up entries in a
10805 Directory Name Lookup Cache (DNLC), clients do not need to send
10806 LOOKUP calls to the server every time these files are accessed.

10808 This caching approach works reasonably well at reducing network
10809 traffic in many environments. However, it does not address
10810 environments where there are numerous queries for files that do not
10811 exist. In these cases of "misses", the client sends requests to the
10812 server in order to provide reasonable application semantics and
10813 promptly detect the creation of new directory entries. An example of
10814 high-miss activity is compilation in software development
10815 environments. The current behavior of NFS limits its potential
10816 scalability and wide-area sharing effectiveness in these types of
10817 environments. Other distributed stateful file system architectures
10818 such as AFS and DFS have proven that adding state around directory
10819 contents can greatly reduce network traffic in high-miss
10820 environments.

10822 Delegation of directory contents is an OPTIONAL feature of NFSv4.1.
10823 Directory delegations provide traffic reduction benefits similar to
10824 those of file delegations. By allowing clients to cache directory
10825 contents (in a read-only fashion) while being notified of changes,
10826 the client can avoid making frequent requests to interrogate the
10827 contents of slowly-changing directories, reducing network traffic and
10828 improving client performance. It can also simplify the task of
10829 determining whether other clients are making changes to the directory
10830 when the client itself is making many changes to the directory and
10831 changes are not serialized.

10833 Directory delegations allow improved namespace cache consistency to
10834 be achieved through delegations and synchronous recalls, in the
10835 absence of notifications. In addition, if time-based consistency is
10836 sufficient, asynchronous notifications can provide performance
10837 benefits for the client, and possibly the server, under some common
10838 operating conditions such as slowly-changing and/or very large
10839 directories.

10841 10.9.2. Directory Delegation Design

10843 NFSv4.1 introduces the GET_DIR_DELEGATION (Section 18.39) operation
10844 to allow the client to ask for a directory delegation. The
10845 delegation covers directory attributes and all entries in the
10846 directory. If either of these change, the delegation will be
10847 recalled synchronously.
The operation causing the recall will have
10848 to wait until the recall is complete. Any changes to directory
10849 entry attributes will not cause the delegation to be recalled.

10851 In addition to asking for delegations, a client can also ask for
10852 notifications for certain events. These events include changes to
10853 the directory's attributes and/or its contents. If a client asks for
10854 notification for a certain event, the server will notify the client
10855 when that event occurs. This will not result in the delegation being
10856 recalled for that client. The notifications are asynchronous and
10857 provide a way of avoiding recalls in situations where a directory is
10858 changing enough that the pure recall model may not be effective while
10859 trying to allow the client to get substantial benefit. In the
10860 absence of notifications, once the delegation is recalled the client
10861 has to refresh its directory cache; this might not be very efficient
10862 for very large directories.

10864 The delegation is read-only and the client may not make changes to
10865 the directory other than by performing NFSv4.1 operations that modify
10866 the directory or the associated file attributes so that the server
10867 has knowledge of these changes. In order to keep the client's
10868 namespace synchronized with that of the server, the server will
10869 notify the delegation-holding client (assuming it has requested
10870 notifications) of the changes made as a result of that client's
10871 directory-modifying operations. This is to avoid any need for that
10872 client to send subsequent GETATTR or READDIR operations to the
10873 server. If a single client is holding the delegation and that client
10874 makes any changes to the directory (i.e., the changes are made via
10875 operations sent on a session associated with the client ID holding
10876 the delegation), the delegation will not be recalled. Multiple
10877 clients may hold a delegation on the same directory, but if any such
10878 client modifies the directory, the server MUST recall the delegation
10879 from the other clients, unless those clients have made provisions to
10880 be notified of that sort of modification.

10882 Delegations can be recalled by the server at any time. Normally, the
10883 server will recall the delegation when the directory changes in a way
10884 that is not covered by the notification, or when the directory
10885 changes and notifications have not been requested. If another client
10886 removes the directory for which a delegation has been granted, the
10887 server will recall the delegation.

10889 10.9.3. Attributes in Support of Directory Notifications

10891 See Section 5.11 for a description of the attributes associated with
10892 directory notifications.

10894 10.9.4. Directory Delegation Recall

10896 The server will recall the directory delegation by sending a callback
10897 to the client. It will use the same callback procedure as used for
10898 recalling file delegations. The server will recall the delegation
10899 when the directory changes in a way that is not covered by the
10900 notification. However, the server need not recall the delegation if
10901 attributes of an entry within the directory change.

10903 If the server notices that handing out a delegation for a directory
10904 is causing too many notifications to be sent out, it may decide to
10905 not hand out delegations for that directory and/or recall those
10906 already granted.
If a client tries to remove the directory for which
10907 a delegation has been granted, the server will recall all associated
10908 delegations.

10910 The implementation sections for a number of operations describe
10911 situations in which notification or delegation recall would be
10912 required under some common circumstances. In this regard, a similar
10913 set of caveats to those listed in Section 10.2 applies.

10915 * For CREATE, see Section 18.4.4.

10917 * For LINK, see Section 18.9.4.

10919 * For OPEN, see Section 18.16.4.

10921 * For REMOVE, see Section 18.25.4.

10923 * For RENAME, see Section 18.26.4.

10925 * For SETATTR, see Section 18.30.4.

10927 10.9.5. Directory Delegation Recovery

10929 Recovery from client or server restart for state on regular files has
10930 two main goals: avoiding the necessity of breaking application
10931 guarantees with respect to locked files and delivering updates
10932 cached at the client. Neither of these goals applies to directories
10933 protected by OPEN_DELEGATE_READ delegations and notifications. Thus,
10934 no provision is made for reclaiming directory delegations in the
10935 event of client or server restart. The client can simply establish a
10936 directory delegation in the same fashion as was done initially.

10938 11. Multi-Server Namespace

10940 NFSv4.1 supports attributes that allow a namespace to extend beyond
10941 the boundaries of a single server. It is desirable that clients and
10942 servers support construction of such multi-server namespaces. Use of
10943 such multi-server namespaces is OPTIONAL, however, and for many
10944 purposes, single-server namespaces are perfectly acceptable. The use
10945 of multi-server namespaces can provide many advantages by separating
10946 a file system's logical position in a namespace from the (possibly
10947 changing) logistical and administrative considerations that cause a
10948 particular file system to be located on a particular server via a
10949 single network access path that has to be known in advance or
10950 determined using DNS.

10952 11.1. Terminology

10954 In this section as a whole (i.e., within all of Section 11), the
10955 phrase "client ID" always refers to the 64-bit shorthand identifier
10956 assigned by the server (a clientid4) and never to the structure that
10957 the client uses to identify itself to the server (called an
10958 nfs_client_id4 or client_owner in NFSv4.0 and NFSv4.1, respectively).
10959 The opaque identifier within those structures is referred to as a
10960 "client id string".

10962 11.1.1. Terminology Related to Trunking

10964 It is particularly important to clarify the distinction between
10965 trunking detection and trunking discovery. The definitions we
10966 present are applicable to all minor versions of NFSv4, but we will
10967 focus on how these terms apply to NFS version 4.1.

10969 * Trunking detection refers to ways of deciding whether two specific
10970 network addresses are connected to the same NFSv4 server. The
10971 means available to make this determination depends on the protocol
10972 version, and, in some cases, on the client implementation.

10974 In the case of NFS version 4.1 and later minor versions, the means
10975 of trunking detection are as described in this document and are
10976 available to every client. Two network addresses connected to the
10977 same server can always be used together to access a particular
10978 server but cannot necessarily be used together to access a single
10979 session.
See below for definitions of the terms "server- 10980 trunkable" and "session-trunkable". 10982 * Trunking discovery is a process by which a client using one 10983 network address can obtain other addresses that are connected to 10984 the same server. Typically, it builds on a trunking detection 10985 facility by providing one or more methods by which candidate 10986 addresses are made available to the client, who can then use 10987 trunking detection to appropriately filter them. 10989 Despite the support for trunking detection, there was no 10990 description of trunking discovery provided in RFC 5661 [66], 10991 making it necessary to provide those means in this document. 10993 The combination of a server network address and a particular 10994 connection type to be used by a connection is referred to as a 10995 "server endpoint". Although using different connection types may 10996 result in different ports being used, the use of different ports by 10997 multiple connections to the same network address in such cases is not 10998 the essence of the distinction between the two endpoints used. This 10999 is in contrast to the case of port-specific endpoints, in which the 11000 explicit specification of port numbers within network addresses is 11001 used to allow a single server node to support multiple NFS servers. 11003 Two network addresses connected to the same server are said to be 11004 server-trunkable. Two such addresses support the use of client ID 11005 trunking, as described in Section 2.10.5. 11007 Two network addresses connected to the same server such that those 11008 addresses can be used to support a single common session are referred 11009 to as session-trunkable. Note that two addresses may be server- 11010 trunkable without being session-trunkable, and that, when two 11011 connections of different connection types are made to the same 11012 network address and are based on a single file system location entry, 11013 they are always session-trunkable, independent of the connection 11014 type, as specified by Section 2.10.5, since their derivation from the 11015 same file system location entry, together with the identity of their 11016 network addresses, assures that both connections are to the same 11017 server and will return server-owner information, allowing session 11018 trunking to be used. 11020 11.1.2. Terminology Related to File System Location 11022 Regarding the terminology that relates to the construction of multi- 11023 server namespaces out of a set of local per-server namespaces: 11025 * Each server has a set of exported file systems that may be 11026 accessed by NFSv4 clients. Typically, this is done by assigning 11027 each file system a name within the pseudo-fs associated with the 11028 server, although the pseudo-fs may be dispensed with if there is 11029 only a single exported file system. Each such file system is part 11030 of the server's local namespace, and can be considered as a file 11031 system instance within a larger multi-server namespace. 11033 * The set of all exported file systems for a given server 11034 constitutes that server's local namespace. 11036 * In some cases, a server will have a namespace more extensive than 11037 its local namespace by using features associated with attributes 11038 that provide file system location information. 
These features, 11039 which allow construction of a multi-server namespace, are all 11040 described in individual sections below and include referrals 11041 (Section 11.5.6), migration (Section 11.5.5), and replication 11042 (Section 11.5.4). 11044 * A file system present in a server's pseudo-fs may have multiple 11045 file system instances on different servers associated with it. 11046 All such instances are considered replicas of one another. 11047 Whether such replicas can be used simultaneously is discussed in 11048 Section 11.11.1, while the level of coordination between them 11049 (important when switching between them) is discussed in Sections 11050 11.11.2 through 11.11.8 below. 11052 * When a file system is present in a server's pseudo-fs, but there 11053 is no corresponding local file system, it is said to be "absent". 11054 In such cases, all associated instances will be accessed on other 11055 servers. 11057 Regarding the terminology that relates to attributes used in trunking 11058 discovery and other multi-server namespace features: 11060 * File system location attributes include the fs_locations and 11061 fs_locations_info attributes. 11063 * File system location entries provide the individual file system 11064 locations within the file system location attributes. Each such 11065 entry specifies a server, in the form of a hostname or an address, 11066 and an fs name, which designates the location of the file system 11067 within the server's local namespace. A file system location entry 11068 designates a set of server endpoints to which the client may 11069 establish connections. There may be multiple endpoints because a 11070 hostname may map to multiple network addresses and because 11071 multiple connection types may be used to communicate with a single 11072 network address. However, except where explicit port numbers are 11073 used to designate a set of servers within a single server node, 11074 all such endpoints MUST designate a way of connecting to a single 11075 server. The exact form of the location entry varies with the 11076 particular file system location attribute used, as described in 11077 Section 11.2. 11079 The network addresses used in file system location entries 11080 typically appear without port number indications and are used to 11081 designate a server at one of the standard ports for NFS access, 11082 e.g., 2049 for TCP or 20049 for use with RPC-over-RDMA. Port 11083 numbers may be used in file system location entries to designate 11084 servers (typically user-level ones) accessed using other port 11085 numbers. In the case where network addresses indicate trunking 11086 relationships, the use of an explicit port number is inappropriate 11087 since trunking is a relationship between network addresses. See 11088 Section 11.5.2 for details. 11090 * File system location elements are derived from location entries, 11091 and each describes a particular network access path consisting of 11092 a network address and a location within the server's local 11093 namespace. Such location elements need not appear within a file 11094 system location attribute, but the existence of each location 11095 element derives from a corresponding location entry. When a 11096 location entry specifies an IP address, there is only a single 11097 corresponding location element. File system location entries that 11098 contain a hostname are resolved using DNS, and may result in one 11099 or more location elements. 
All location elements consist of a
11100 location address that includes the IP address of an interface to a
11101 server and an fs name, which is the location of the file system
11102 within the server's local namespace. The fs name can be empty if
11103 the server has no pseudo-fs and only a single exported file system
11104 at the root filehandle.

11106 * Two file system location elements are said to be server-trunkable
11107 if they specify the same fs name and the location addresses are
11108 such that the location addresses are server-trunkable. When the
11109 corresponding network paths are used, the client will always be
11110 able to use client ID trunking, but will only be able to use
11111 session trunking if the paths are also session-trunkable.

11113 * Two file system location elements are said to be session-trunkable
11114 if they specify the same fs name and the location addresses are
11115 such that the location addresses are session-trunkable. When the
11116 corresponding network paths are used, the client will be able to
11117 use either client ID trunking or session trunking.

11119 Discussion of the term "replica" is complicated by the fact that the
11120 term was used in RFC 5661 [66] with a meaning different from that
11121 used in this document. In short, in [66] each replica is identified
11122 by a single network access path, while in the current document, a set
11123 of network access paths that have server-trunkable network addresses
11124 and the same root-relative file system pathname is considered to be a
11125 single replica with multiple network access paths.

11127 Each set of server-trunkable location elements defines a set of
11128 available network access paths to a particular file system. When
11129 there are multiple such file systems, each of which contains the
11130 same data, these file systems are considered replicas of one another.
11131 Logically, such replication is symmetric, since the fs currently in
11132 use and an alternate fs are replicas of each other. Often, in other
11133 documents, the term "replica" is not applied to the fs currently in
11134 use, despite the fact that the replication relation is inherently
11135 symmetric.

11137 11.2. File System Location Attributes

11139 NFSv4.1 contains attributes that provide information about how a
11140 given file system may be accessed (i.e., at what network address and
11141 namespace position). As a result, file systems in the namespace of
11142 one server can be associated with one or more instances of that file
11143 system on other servers. These attributes contain file system
11144 location entries specifying a server address target (either as a DNS
11145 name representing one or more IP addresses or as a specific IP
11146 address) together with the pathname of that file system within the
11147 associated single-server namespace.

11149 The fs_locations_info RECOMMENDED attribute allows specification of
11150 one or more file system instance locations where the data
11151 corresponding to a given file system may be found. In addition to
11152 the specification of file system instance locations, this attribute
11153 provides helpful information to do the following:

11155 * Guide choices among the various file system instances provided
11156 (e.g., priority for use, writability, currency, etc.).

11158 * Help the client efficiently effect as seamless a transition as
11159 possible among multiple file system instances, when and if that
11160 should be necessary.
11162 * Guide the selection of the appropriate connection type to be used 11163 when establishing a connection. 11165 Within the fs_locations_info attribute, each fs_locations_server4 11166 entry corresponds to a file system location entry: the fls_server 11167 field designates the server, and the fl_rootpath field of the 11168 encompassing fs_locations_item4 gives the location pathname within 11169 the server's pseudo-fs. 11171 The fs_locations attribute defined in NFSv4.0 is also a part of 11172 NFSv4.1. This attribute only allows specification of the file system 11173 locations where the data corresponding to a given file system may be 11174 found. Servers SHOULD make this attribute available whenever 11175 fs_locations_info is supported, but client use of fs_locations_info 11176 is preferable because it provides more information. 11178 Within the fs_locations attribute, each fs_location4 contains a file 11179 system location entry with the server field designating the server 11180 and the rootpath field giving the location pathname within the 11181 server's pseudo-fs. 11183 11.3. File System Presence or Absence 11185 A given location in an NFSv4.1 namespace (typically but not 11186 necessarily a multi-server namespace) can have a number of file 11187 system instance locations associated with it (via the fs_locations or 11188 fs_locations_info attribute). There may also be an actual current 11189 file system at that location, accessible via normal namespace 11190 operations (e.g., LOOKUP). In this case, the file system is said to 11191 be "present" at that position in the namespace, and clients will 11192 typically use it, reserving use of additional locations specified via 11193 the location-related attributes to situations in which the principal 11194 location is no longer available. 11196 When there is no actual file system at the namespace location in 11197 question, the file system is said to be "absent". An absent file 11198 system contains no files or directories other than the root. Any 11199 reference to it, except to access a small set of attributes useful in 11200 determining alternate locations, will result in an error, 11201 NFS4ERR_MOVED. Note that if the server ever returns the error 11202 NFS4ERR_MOVED, it MUST support the fs_locations attribute and SHOULD 11203 support the fs_locations_info and fs_status attributes. 11205 While the error name suggests that we have a case of a file system 11206 that once was present, and has only become absent later, this is only 11207 one possibility. A position in the namespace may be permanently 11208 absent with the set of file system(s) designated by the location 11209 attributes being the only realization. The name NFS4ERR_MOVED 11210 reflects an earlier, more limited conception of its function, but 11211 this error will be returned whenever the referenced file system is 11212 absent, whether it has moved or not. 11214 Except in the case of GETATTR-type operations (to be discussed 11215 later), when the current filehandle at the start of an operation is 11216 within an absent file system, that operation is not performed and the 11217 error NFS4ERR_MOVED is returned, to indicate that the file system is 11218 absent on the current server. 11220 Because a GETFH cannot succeed if the current filehandle is within an 11221 absent file system, filehandles within an absent file system cannot 11222 be transferred to the client. 
When a client does have filehandles 11223 within an absent file system, it is the result of obtaining them when 11224 the file system was present, and having the file system become absent 11225 subsequently. 11227 It should be noted that because the check for the current filehandle 11228 being within an absent file system happens at the start of every 11229 operation, operations that change the current filehandle so that it 11230 is within an absent file system will not result in an error. This 11231 allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be 11232 used to get attribute information, particularly location attribute 11233 information, as discussed below. 11235 The RECOMMENDED file system attribute fs_status can be used to 11236 interrogate the present/absent status of a given file system. 11238 11.4. Getting Attributes for an Absent File System 11240 When a file system is absent, most attributes are not available, but 11241 it is necessary to allow the client access to the small set of 11242 attributes that are available, and most particularly those that give 11243 information about the correct current locations for this file system: 11244 fs_locations and fs_locations_info. 11246 11.4.1. GETATTR within an Absent File System 11248 As mentioned above, an exception is made for GETATTR in that 11249 attributes may be obtained for a filehandle within an absent file 11250 system. This exception only applies if the attribute mask contains 11251 at least one attribute bit that indicates the client is interested in 11252 a result regarding an absent file system: fs_locations, 11253 fs_locations_info, or fs_status. If none of these attributes is 11254 requested, GETATTR will result in an NFS4ERR_MOVED error. 11256 When a GETATTR is done on an absent file system, the set of supported 11257 attributes is very limited. Many attributes, including those that 11258 are normally REQUIRED, will not be available on an absent file 11259 system. In addition to the attributes mentioned above (fs_locations, 11260 fs_locations_info, fs_status), the following attributes SHOULD be 11261 available on absent file systems. In the case of RECOMMENDED 11262 attributes, they should be available at least to the same degree that 11263 they are available on present file systems. 11265 change_policy: This attribute is useful for absent file systems and 11266 can be helpful in summarizing to the client when any of the 11267 location-related attributes change. 11269 fsid: This attribute should be provided so that the client can 11270 determine file system boundaries, including, in particular, the 11271 boundary between present and absent file systems. This value must 11272 be different from any other fsid on the current server and need 11273 have no particular relationship to fsids on any particular 11274 destination to which the client might be directed. 11276 mounted_on_fileid: For objects at the top of an absent file system, 11277 this attribute needs to be available. Since the fileid is within 11278 the present parent file system, there should be no need to 11279 reference the absent file system to provide this information. 11281 Other attributes SHOULD NOT be made available for absent file 11282 systems, even when it is possible to provide them. The server should 11283 not assume that more information is always better and should avoid 11284 gratuitously providing additional information. 
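The following non-normative Python sketch summarizes the GETATTR handling for absent file systems described above, together with the result-mask behavior described in the next paragraph. Attribute names stand in for the corresponding bits in the GETATTR attribute mask.

   # Non-normative sketch of GETATTR handling when the current
   # filehandle is within an absent file system.

   LOCATION_ATTRS = {"fs_locations", "fs_locations_info", "fs_status"}

   # Attributes that remain available on an absent file system,
   # per the list above.
   ABSENT_FS_ATTRS = LOCATION_ATTRS | {"change_policy", "fsid",
                                       "mounted_on_fileid"}

   def getattr_on_absent_fs(requested: set):
       """Return (status, attributes actually returned)."""
       if not (requested & LOCATION_ATTRS):
           # No location-related attribute was requested.
           return "NFS4ERR_MOVED", set()
       # Otherwise the operation succeeds, returning only the supported
       # attributes; unsupported bits are omitted from the result mask
       # rather than causing an error.
       return "NFS4_OK", requested & ABSENT_FS_ATTRS

   if __name__ == "__main__":
       print(getattr_on_absent_fs({"size", "change"}))
       print(getattr_on_absent_fs({"fs_locations", "size", "fsid"}))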
11286 When a GETATTR operation includes a bit mask for one of the 11287 attributes fs_locations, fs_locations_info, or fs_status, but where 11288 the bit mask includes attributes that are not supported, GETATTR will 11289 not return an error, but will return the mask of the actual 11290 attributes supported with the results. 11292 Handling of VERIFY/NVERIFY is similar to GETATTR in that if the 11293 attribute mask does not include fs_locations, fs_locations_info, or 11294 fs_status, the error NFS4ERR_MOVED will result. It differs in that 11295 any appearance in the attribute mask of an attribute not supported 11296 for an absent file system (and note that this will include some 11297 normally REQUIRED attributes) will also cause an NFS4ERR_MOVED 11298 result. 11300 11.4.2. READDIR and Absent File Systems 11302 A READDIR performed when the current filehandle is within an absent 11303 file system will result in an NFS4ERR_MOVED error, since, unlike the 11304 case of GETATTR, no such exception is made for READDIR. 11306 Attributes for an absent file system may be fetched via a READDIR for 11307 a directory in a present file system, when that directory contains 11308 the root directories of one or more absent file systems. In this 11309 case, the handling is as follows: 11311 * If the attribute set requested includes one of the attributes 11312 fs_locations, fs_locations_info, or fs_status, then fetching of 11313 attributes proceeds normally and no NFS4ERR_MOVED indication is 11314 returned, even when the rdattr_error attribute is requested. 11316 * If the attribute set requested does not include one of the 11317 attributes fs_locations, fs_locations_info, or fs_status, then if 11318 the rdattr_error attribute is requested, each directory entry for 11319 the root of an absent file system will report NFS4ERR_MOVED as the 11320 value of the rdattr_error attribute. 11322 * If the attribute set requested does not include any of the 11323 attributes fs_locations, fs_locations_info, fs_status, or 11324 rdattr_error, then the occurrence of the root of an absent file 11325 system within the directory will result in the READDIR failing 11326 with an NFS4ERR_MOVED error. 11328 * The unavailability of an attribute because of a file system's 11329 absence, even one that is ordinarily REQUIRED, does not result in 11330 any error indication. The set of attributes returned for the root 11331 directory of the absent file system in that case is simply 11332 restricted to those actually available. 11334 11.5. Uses of File System Location Information 11336 The file system location attributes (i.e., fs_locations and 11337 fs_locations_info), together with the possibility of absent file 11338 systems, provide a number of important facilities for reliable, 11339 manageable, and scalable data access. 11341 When a file system is present, these attributes can provide the 11342 following: 11344 * The locations of alternative replicas to be used to access the 11345 same data in the event of server failures, communications 11346 problems, or other difficulties that make continued access to the 11347 current replica impossible or otherwise impractical. Provisioning 11348 and use of such alternate replicas is referred to as "replication" 11349 and is discussed in Section 11.5.4 below. 11351 * The network address(es) to be used to access the current file 11352 system instance or replicas of it. Client use of this information 11353 is discussed in Section 11.5.2 below. 
11355 Under some circumstances, multiple replicas may be used 11356 simultaneously to provide higher-performance access to the file 11357 system in question, although the lack of state sharing between 11358 servers may be an impediment to such use. 11360 When a file system is present but becomes absent, clients can be 11361 given the opportunity to have continued access to their data using a 11362 different replica. In this case, a continued attempt to use the data 11363 in the now-absent file system will result in an NFS4ERR_MOVED error, 11364 and then the successor replica or set of possible replica choices can 11365 be fetched and used to continue access. Transfer of access to the 11366 new replica location is referred to as "migration" and is discussed 11367 in Section 11.5.4 below. 11369 When a file system is currently absent, specification of file system 11370 location provides a means by which file systems located on one server 11371 can be associated with a namespace defined by another server, thus 11372 allowing a general multi-server namespace facility. A designation of 11373 such a remote instance, in place of a file system not previously 11374 present, is called a "pure referral" and is discussed in 11375 Section 11.5.6 below. 11377 Because client support for attributes related to file system location 11378 is OPTIONAL, a server may choose to take action to hide migration and 11379 referral events from such clients, by acting as a proxy, for example. 11380 The server can determine the presence of client support from the 11381 arguments of the EXCHANGE_ID operation (see Section 18.35.3). 11383 11.5.1. Combining Multiple Uses in a Single Attribute 11385 A file system location attribute will sometimes contain information 11386 relating to the location of multiple replicas, which may be used in 11387 different ways: 11389 * File system location entries that relate to the file system 11390 instance currently in use provide trunking information, allowing 11391 the client to find additional network addresses by which the 11392 instance may be accessed. 11394 * File system location entries that provide information about 11395 replicas to which access is to be transferred. 11397 * Other file system location entries that relate to replicas that 11398 are available to use in the event that access to the current 11399 replica becomes unsatisfactory. 11401 In order to simplify client handling and to allow the best choice of 11402 replicas to access, the server should adhere to the following 11403 guidelines: 11405 * All file system location entries that relate to a single file 11406 system instance should be adjacent. 11408 * File system location entries that relate to the instance currently 11409 in use should appear first. 11411 * File system location entries that relate to replica(s) to which 11412 migration is occurring should appear before replicas that are 11413 available for later use if the current replica should become 11414 inaccessible. 11416 11.5.2. File System Location Attributes and Trunking 11418 Trunking is the use of multiple connections between a client and 11419 server in order to increase the speed of data transfer. A client may 11420 determine the set of network addresses to use to access a given file 11421 system in a number of ways: 11423 * When the name of the server is known to the client, it may use DNS 11424 to obtain a set of network addresses to use in accessing the 11425 server. 
11427 * The client may fetch the file system location attribute for the 11428 file system. This will provide either the name of the server 11429 (which can be turned into a set of network addresses using DNS) or 11430 a set of server-trunkable location entries. Using the latter 11431 alternative, the server can provide addresses it regards as 11432 desirable to use to access the file system in question. Although 11433 these entries can contain port numbers, these port numbers are not 11434 used in determining trunking relationships. Once the candidate 11435 addresses have been determined and EXCHANGE_ID done to the proper 11436 server, only the value of the so_major_id field returned by the 11437 servers in question determines whether a trunking relationship 11438 actually exists. 11440 When the client fetches a location attribute for a file system, it 11441 should be noted that the client may encounter multiple entries for a 11442 number of reasons, such that when it determines trunking information, 11443 it may need to bypass addresses not trunkable with one already known. 11445 The server can provide location entries that include either names or 11446 network addresses. It might use the latter form because of DNS- 11447 related security concerns or because the set of addresses to be used 11448 might require active management by the server. 11450 Location entries used to discover candidate addresses for use in 11451 trunking are subject to change, as discussed in Section 11.5.7 below. 11452 The client may respond to such changes by using additional addresses 11453 once they are verified or by ceasing to use existing ones. The 11454 server can force the client to cease using an address by returning 11455 NFS4ERR_MOVED when that address is used to access a file system. 11456 This allows a transfer of client access that is similar to migration, 11457 although the same file system instance is accessed throughout. 11459 11.5.3. File System Location Attributes and Connection Type Selection 11461 Because of the need to support multiple types of connections, clients 11462 face the issue of determining the proper connection type to use when 11463 establishing a connection to a given server network address. In some 11464 cases, this issue can be addressed through the use of the connection 11465 "step-up" facility described in Section 18.36. However, because 11466 there are cases in which that facility is not available, the client 11467 may have to choose a connection type with no possibility of changing 11468 it within the scope of a single connection. 11470 The two file system location attributes differ as to the information 11471 made available in this regard. The fs_locations attribute provides 11472 no information to support connection type selection. As a result, 11473 clients supporting multiple connection types would need to attempt to 11474 establish connections using multiple connection types until the one 11475 preferred by the client is successfully established. 11477 The fs_locations_info attribute includes the FSLI4TF_RDMA flag, which 11478 is convenient for a client wishing to use RDMA. When this flag is 11479 set, it indicates that RPC-over-RDMA support is available using the 11480 specified location entry. A client can establish a TCP connection 11481 and then convert that connection to use RDMA by using the step-up 11482 facility. 
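The difference between the two attributes can be illustrated by the
following non-normative sketch, written in Python.  The names
entry.address and entry.rdma_capable, together with the helpers
connect_tcp() and step_up_to_rdma(), are assumptions made solely for
illustration; entry.rdma_capable stands for the FSLI4TF_RDMA flag
being set in the corresponding fs_locations_info entry.

   # Non-normative sketch: connection-type selection for a location
   # entry.  The helper functions are illustrative assumptions, not
   # protocol elements.
   def choose_connection(entry, want_rdma, connect_tcp, step_up_to_rdma):
       conn = connect_tcp(entry.address)     # TCP is always a safe start
       if want_rdma and entry.rdma_capable:
           # RPC-over-RDMA is advertised for this entry, so the TCP
           # connection can be converted using the step-up facility
           # (Section 18.36).
           conn = step_up_to_rdma(conn)
       return conn

When only fs_locations is available, no such hint exists, and a client
preferring RDMA would instead have to attempt each connection type in
turn, as described above.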
11484 Irrespective of the particular attribute used, when there is no 11485 indication that a step-up operation can be performed, a client 11486 supporting RDMA operation can establish a new RDMA connection, and it 11487 can be bound to the session already established by the TCP 11488 connection, allowing the TCP connection to be dropped and the session 11489 converted to further use in RDMA mode, if the server supports that. 11491 11.5.4. File System Replication 11493 The fs_locations and fs_locations_info attributes provide alternative 11494 file system locations, to be used to access data in place of or in 11495 addition to the current file system instance. On first access to a 11496 file system, the client should obtain the set of alternate locations 11497 by interrogating the fs_locations or fs_locations_info attribute, 11498 with the latter being preferred. 11500 In the event that the occurrence of server failures, communications 11501 problems, or other difficulties make continued access to the current 11502 file system impossible or otherwise impractical, the client can use 11503 the alternate locations as a way to get continued access to its data. 11505 The alternate locations may be physical replicas of the (typically 11506 read-only) file system data supplemented by possible asynchronous 11507 propagation of updates. Alternatively, they may provide for the use 11508 of various forms of server clustering in which multiple servers 11509 provide alternate ways of accessing the same physical file system. 11510 How the difference between replicas affects file system transitions 11511 can be represented within the fs_locations and fs_locations_info 11512 attributes, and how the client deals with file system transition 11513 issues will be discussed in detail in later sections. 11515 Although the location attributes provide some information about the 11516 nature of the inter-replica transition, many aspects of the semantics 11517 of possible asynchronous updates are not currently described by the 11518 protocol, which makes it necessary for clients using replication to 11519 switch among replicas undergoing change to familiarize themselves 11520 with the semantics of the update approach used. Due to this lack of 11521 specificity, many applications may find the use of migration more 11522 appropriate because a server can propagate all updates made before an 11523 established point in time to the new replica as part of the migration 11524 event. 11526 11.5.4.1. File System Trunking Presented as Replication 11528 In some situations, a file system location entry may indicate a file 11529 system access path to be used as an alternate location, where 11530 trunking, rather than replication, is to be used. The situations in 11531 which this is appropriate are limited to those in which both of the 11532 following are true: 11534 * The two file system locations (i.e., the one on which the location 11535 attribute is obtained and the one specified in the file system 11536 location entry) designate the same locations within their 11537 respective single-server namespaces. 11539 * The two server network addresses (i.e., the one being used to 11540 obtain the location attribute and the one specified in the file 11541 system location entry) designate the same server (as indicated by 11542 the same value of the so_major_id field of the eir_server_owner 11543 field returned in response to EXCHANGE_ID). 
11545 When these conditions hold, operations using both access paths are 11546 generally trunked, although trunking may be disallowed when the 11547 attribute fs_locations_info is used: 11549 * When the fs_locations_info attribute shows the two entries as not 11550 having the same simultaneous-use class, trunking is inhibited, and 11551 the two access paths cannot be used together. 11553 In this case, the two paths can be used serially with no 11554 transition activity required on the part of the client, and any 11555 transition between access paths is transparent. In transferring 11556 access from one to the other, the client acts as if communication 11557 were interrupted, establishing a new connection and possibly a new 11558 session to continue access to the same file system. 11560 * Note that for two such location entries, any information within 11561 the fs_locations_info attribute that indicates the need for 11562 special transition activity, i.e., the appearance of the two file 11563 system location entries with different handle, fileid, write- 11564 verifier, change, and readdir classes, indicates a serious 11565 problem. The client, if it allows transition to the file system 11566 instance at all, must not treat any transition as a transparent 11567 one. The server SHOULD NOT indicate that these two entries (for 11568 the same file system on the same server) belong to different 11569 handle, fileid, write-verifier, change, and readdir classes, 11570 whether or not the two entries are shown belonging to the same 11571 simultaneous-use class. 11573 These situations were recognized by [66], even though that document 11574 made no explicit mention of trunking: 11576 * It treated the situation that we describe as trunking as one of 11577 simultaneous use of two distinct file system instances, even 11578 though, in the explanatory framework now used to describe the 11579 situation, the case is one in which a single file system is 11580 accessed by two different trunked addresses. 11582 * It treated the situation in which two paths are to be used 11583 serially as a special sort of "transparent transition". However, 11584 in the descriptive framework now used to categorize transition 11585 situations, this is considered a case of a "network endpoint 11586 transition" (see Section 11.9). 11588 11.5.5. File System Migration 11590 When a file system is present and becomes inaccessible using the 11591 current access path, the NFSv4.1 protocol provides a means by which 11592 clients can be given the opportunity to have continued access to 11593 their data. This may involve using a different access path to the 11594 existing replica or providing a path to a different replica. The new 11595 access path or the location of the new replica is specified by a file 11596 system location attribute. The ensuing migration of access includes 11597 the ability to retain locks across the transition. Depending on 11598 circumstances, this can involve: 11600 * The continued use of the existing clientid when accessing the 11601 current replica using a new access path. 11603 * Use of lock reclaim, taking advantage of a per-fs grace period. 11605 * Use of Transparent State Migration. 11607 Typically, a client will be accessing the file system in question, 11608 get an NFS4ERR_MOVED error, and then use a file system location 11609 attribute to determine the new access path for the data. 
When
11610 fs_locations_info is used, additional information will be available
11611 that will define the nature of the client's handling of the
11612 transition to a new server.
11614 In most instances, servers will choose to migrate all clients using a
11615 particular file system to a successor replica at the same time to
11616 avoid cases in which different clients are updating different
11617 replicas. However, migration of an individual client can be helpful
11618 in providing load balancing, as long as the replicas in question are
11619 such that they represent the same data as described in
11620 Section 11.11.8.
11622 * In the case in which there is no transition between replicas
11623 (i.e., only a change in access path), there are no special
11624 difficulties in using this mechanism to effect load balancing.
11626 * In the case in which the two replicas are sufficiently coordinated
11627 as to allow a single client coherent, simultaneous access to both,
11628 there is, in general, no obstacle to the use of migration of
11629 particular clients to effect load balancing. Generally, such
11630 simultaneous use involves cooperation between servers to ensure
11631 that locks granted on two coordinated replicas cannot conflict and
11632 can remain effective when transferred to a common replica.
11634 * In the case in which a large set of clients is accessing a file
11635 system in a read-only fashion, it can be helpful to migrate all
11636 clients with writable access simultaneously, while using load
11637 balancing on the set of read-only copies, as long as the rules in
11638 Section 11.11.8, which are designed to prevent data reversion, are
11639 followed.
11641 In other cases, the client might not have sufficient guarantees of
11642 data similarity or coherence to function properly (e.g., the data in
11643 the two replicas is similar but not identical), and the possibility
11644 that different clients are updating different replicas can exacerbate
11645 the difficulties, making the use of load balancing in such situations
11646 a perilous enterprise.
11648 The protocol does not specify how the file system will be moved
11649 between servers or how updates to multiple replicas will be
11650 coordinated. It is anticipated that a number of different server-to-
11651 server coordination mechanisms might be used, with the choice left to
11652 the server implementer. The NFSv4.1 protocol specifies the method
11653 used to communicate the migration event between client and server.
11655 In the case of various forms of server clustering, the new location
11656 may be another server providing access to the same physical file
11657 system. The client's responsibilities in dealing with this
11658 transition will depend on whether a switch between replicas has
11659 occurred and the means the server has chosen to provide continuity of
11660 locking state. These issues will be discussed in detail below.
11662 Although a single successor location is typical, multiple locations
11663 may be provided. When multiple locations are provided, the client
11664 will typically use the first one provided. If that is inaccessible
11665 for some reason, later ones can be used. In such cases, the client
11666 might consider the transition to the new replica to be a migration
11667 event, even though some of the servers involved might not be aware of
11668 the use of the server that was inaccessible. In such a case, a
11669 client might lose access to locking state as a result of the access
11670 transfer.
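The client behavior just described can be summarized by the following
non-normative sketch, written in Python.  The helpers
fetch_locations() and try_access() are assumptions made only for
illustration; the actual steps they stand for (interrogating
fs_locations or fs_locations_info, establishing a connection and
session, and recovering state) are specified elsewhere in this
document.

   # Non-normative sketch: recovery after NFS4ERR_MOVED is received.
   # fetch_locations() returns the file system location entries in the
   # order provided by the server; try_access() attempts to establish
   # access (connection, session, state recovery) using one entry.
   def recover_from_moved(fs, fetch_locations, try_access):
       for entry in fetch_locations(fs):   # server ordering is advisory
           if try_access(fs, entry):       # first usable entry wins
               return entry                # continue access via this entry
       raise RuntimeError("no listed location for %r is reachable" % fs)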
11672 When an alternate location is designated as the target for migration,
11673 it must designate the same data (with metadata being the same to the
11674 degree indicated by the fs_locations_info attribute). Where file
11675 systems are writable, a change made on the original file system must
11676 be visible on all migration targets. Where a file system is not
11677 writable but represents a read-only copy (possibly periodically
11678 updated) of a writable file system, similar requirements apply to the
11679 propagation of updates. Any change visible in the original file
11680 system must already be effected on all migration targets, to avoid
11681 any possibility that a client, in effecting a transition to the
11682 migration target, will see any reversion in file system state.
11684 11.5.6. Referrals
11686 Referrals allow the server to associate a file system namespace entry
11687 located on one server with a file system located on another server.
11688 When this includes the use of pure referrals, servers are provided a
11689 way of placing a file system in a location within the namespace
11690 essentially without respect to its physical location on a particular
11691 server. This allows a single server or a set of servers to present a
11692 multi-server namespace that encompasses file systems located on a
11693 wider range of servers. Some likely uses of this facility include
11694 establishment of site-wide or organization-wide namespaces, with the
11695 eventual possibility of combining these into a truly global
11696 namespace, such as the one provided by AFS (the Andrew File System)
11697 [65].
11699 Referrals occur when a client determines, upon first referencing a
11700 position in the current namespace, that it is part of a new file
11701 system and that the file system is absent. When this occurs,
11702 typically upon receiving the error NFS4ERR_MOVED, the actual location
11703 or locations of the file system can be determined by fetching a
11704 locations attribute.
11706 The file system location attribute may designate a single file system
11707 location or multiple file system locations, to be selected based on
11708 the needs of the client. The server, in the fs_locations_info
11709 attribute, may specify priorities to be associated with various file
11710 system location choices. The server may assign different priorities
11711 to different locations as reported to individual clients, in order to
11712 adapt to client physical location or to effect load balancing. When
11713 both read-only and read-write file systems are present, some of the
11714 read-only locations might not be absolutely up-to-date (as they would
11715 have to be in the case of replication and migration). Servers may
11716 also specify file system locations that include client-substituted
11717 variables so that different clients are referred to different file
11718 systems (with different data contents) based on client attributes
11719 such as CPU architecture.
11721 If the fs_locations_info attribute lists multiple possible targets,
11722 the relationships among them may be important to the client in
11723 selecting which one to use. The same rules specified in
11724 Section 11.5.5 above regarding multiple migration targets apply to
11725 these multiple replicas as well. For example, the client might
11726 prefer a writable target on a server that has additional writable
11727 replicas to which it subsequently might switch.
Note that, as
11728 distinguished from the case of replication, there is no need to deal
11729 with the case of propagation of updates made by the current client,
11730 since the current client has not accessed the file system in
11731 question.
11733 Use of multi-server namespaces is enabled by NFSv4.1 but is not
11734 required. The use of multi-server namespaces and their scope will
11735 depend on the applications used and system administration
11736 preferences.
11738 Multi-server namespaces can be established by a single server
11739 providing a large set of pure referrals to all of the included file
11740 systems. Alternatively, a single multi-server namespace may be
11741 administratively segmented with separate referral file systems (on
11742 separate servers) for each separately administered portion of the
11743 namespace. The top-level referral file system or any segment may use
11744 replicated referral file systems for higher availability.
11746 Multi-server namespaces are generally uniform, in
11747 that the same data made available to one client at a given location
11748 in the namespace is made available to all clients at that namespace
11749 location. However, there are facilities provided that allow
11750 different clients to be directed to different sets of data, for
11751 reasons such as enabling adaptation to such client characteristics as
11752 CPU architecture. These facilities are described in Section 11.17.3.
11754 Note that it is possible, when providing a uniform namespace, to
11755 provide different location entries to different clients in order to
11756 provide each client with a copy of the data physically closest to it
11757 or otherwise optimize access (e.g., provide load balancing).
11759 11.5.7. Changes in a File System Location Attribute
11761 Although clients will typically fetch a file system location
11762 attribute when first accessing a file system and when NFS4ERR_MOVED
11763 is returned, a client can choose to fetch the attribute periodically,
11764 in which case, the value fetched may change over time.
11766 For clients not prepared to access multiple replicas simultaneously
11767 (see Section 11.11.1), the handling of the various cases of location
11768 change is as follows:
11770 * Changes in the list of replicas or in the network addresses
11771 associated with replicas do not require immediate action. The
11772 client will typically update its list of replicas to reflect the
11773 new information.
11775 * Additions to the list of network addresses for the current file
11776 system instance need not be acted on promptly. However, to
11777 prepare for a subsequent migration event, the client can choose to
11778 take note of the new address and then use it whenever it needs to
11779 switch access to a new replica.
11781 * Deletions from the list of network addresses for the current file
11782 system instance do not require the client to immediately cease use
11783 of existing access paths, although new connections are not to be
11784 established on addresses that have been deleted. However, clients
11785 can choose to act on such deletions by preparing for an eventual
11786 shift in access, which becomes unavoidable as soon as the server
11787 returns NFS4ERR_MOVED to indicate that a particular network access
11788 path is not usable to access the current file system.
11790 For clients that are prepared to access several replicas
11791 simultaneously, the following additional cases need to be addressed.
11792 As in the cases discussed above, changes in the set of replicas need 11793 not be acted upon promptly, although the client has the option of 11794 adjusting its access even in the absence of difficulties that would 11795 lead to the selection of a new replica. 11797 * When a new replica is added, which may be accessed simultaneously 11798 with one currently in use, the client is free to use the new 11799 replica immediately. 11801 * When a replica currently in use is deleted from the list, the 11802 client need not cease using it immediately. However, since the 11803 server may subsequently force such use to cease (by returning 11804 NFS4ERR_MOVED), clients might decide to limit the need for later 11805 state transfer. For example, new opens might be done on other 11806 replicas, rather than on one not present in the list. 11808 11.6. Trunking without File System Location Information 11810 In situations in which a file system is accessed using two server- 11811 trunkable addresses (as indicated by the same value of the 11812 so_major_id field of the eir_server_owner field returned in response 11813 to EXCHANGE_ID), trunked access is allowed even though there might 11814 not be any location entries specifically indicating the use of 11815 trunking for that file system. 11817 This situation was recognized by [66], although that document made no 11818 explicit mention of trunking and treated the situation as one of 11819 simultaneous use of two distinct file system instances. In the 11820 explanatory framework now used to describe the situation, the case is 11821 one in which a single file system is accessed by two different 11822 trunked addresses. 11824 11.7. Users and Groups in a Multi-Server Namespace 11826 As in the case of a single-server environment (see Section 5.9), when 11827 an owner or group name of the form "id@domain" is assigned to a file, 11828 there is an implicit promise to return that same string when the 11829 corresponding attribute is interrogated subsequently. In the case of 11830 a multi-server namespace, that same promise applies even if server 11831 boundaries have been crossed. Similarly, when the owner attribute of 11832 a file is derived from the security principal that created the file, 11833 that attribute should have the same value even if the interrogation 11834 occurs on a different server from the file creation. 11836 Similarly, the set of security principals recognized by all the 11837 participating servers needs to be the same, with each such principal 11838 having the same credentials, regardless of the particular server 11839 being accessed. 11841 In order to meet these requirements, those setting up multi-server 11842 namespaces will need to limit the servers included so that: 11844 * In all cases in which more than a single domain is supported, the 11845 requirements stated in RFC 8000 [31] are to be respected. 11847 * All servers support a common set of domains that includes all of 11848 the domains clients use and expect to see returned as the domain 11849 portion of an owner or group in the form "id@domain". Note that, 11850 although this set most often consists of a single domain, it is 11851 possible for multiple domains to be supported. 11853 * All servers, for each domain that they support, accept the same 11854 set of user and group ids as valid. 11856 * All servers recognize the same set of security principals. For 11857 each principal, the same credential is required, independent of 11858 the server being accessed. 
In addition, the group membership for
11859 each such principal is to be the same, independent of the server
11860 accessed.
11862 Note that there is no requirement in general that the users
11863 corresponding to particular security principals have the same local
11864 representation on each server, even though it is most often the case
11865 that this is so.
11867 When AUTH_SYS is used, the following additional requirements must be
11868 met:
11870 * Only a single NFSv4 domain can be supported through the use of
11871 AUTH_SYS.
11873 * The "local" representation of all owners and groups must be the
11874 same on all servers. The word "local" is used here since that is
11875 the way that numeric user and group ids are described in
11876 Section 5.9. However, when AUTH_SYS or stringified numeric owners
11877 or groups are used, these identifiers are not truly local, since
11878 they are known to the clients as well as to the server.
11880 Similarly, when stringified numeric user and group ids are used, the
11881 "local" representation of all owners and groups must be the same on
11882 all servers, even when AUTH_SYS is not used.
11884 11.8. Additional Client-Side Considerations
11886 When clients make use of servers that implement referrals,
11887 replication, and migration, care should be taken that a user who
11888 mounts a given file system that includes a referral or a relocated
11889 file system continues to see a coherent picture of that user-side
11890 file system despite the fact that it contains a number of server-side
11891 file systems that may be on different servers.
11893 One important issue is upward navigation from the root of a server-
11894 side file system to its parent (specified as ".." in UNIX), in the
11895 case in which it transitions to that file system as a result of
11896 referral, migration, or a transition as a result of replication.
11897 When the client is at such a point, and it needs to ascend to the
11898 parent, it must go back to the parent as seen within the multi-server
11899 namespace rather than sending a LOOKUPP operation to the server,
11900 which would result in the parent within that server's single-server
11901 namespace. In order to do this, the client needs to remember the
11902 filehandles that represent such file system roots and use these
11903 instead of sending a LOOKUPP operation to the current server. This
11904 will allow the client to present to applications a consistent
11905 namespace, where upward navigation and downward navigation are
11906 consistent.
11908 Another issue concerns refresh of referral locations. When referrals
11909 are used extensively, they may change as server configurations
11910 change. It is expected that clients will cache information related
11911 to traversing referrals so that future client-side requests are
11912 resolved locally without server communication. This is usually
11913 rooted in client-side name lookup caching. Clients should
11914 periodically purge this data for referral points in order to detect
11915 changes in location information. When the change_policy attribute
11916 changes for directories that hold referral entries or for the
11917 referral entries themselves, clients should consider any associated
11918 cached referral information to be out of date.
11920 11.9.
Overview of File Access Transitions 11922 File access transitions are of two types: 11924 * Those that involve a transition from accessing the current replica 11925 to another one in connection with either replication or migration. 11926 How these are dealt with is discussed in Section 11.11. 11928 * Those in which access to the current file system instance is 11929 retained, while the network path used to access that instance is 11930 changed. This case is discussed in Section 11.10. 11932 11.10. Effecting Network Endpoint Transitions 11934 The endpoints used to access a particular file system instance may 11935 change in a number of ways, as listed below. In each of these cases, 11936 the same fsid, client IDs, filehandles, and stateids are used to 11937 continue access, with a continuity of lock state. In many cases, the 11938 same sessions can also be used. 11940 The appropriate action depends on the set of replacement addresses 11941 that are available for use (i.e., server endpoints that are server- 11942 trunkable with one previously being used). 11944 * When use of a particular address is to cease, and there is also 11945 another address currently in use that is server-trunkable with it, 11946 requests that would have been issued on the address whose use is 11947 to be discontinued can be issued on the remaining address(es). 11948 When an address is server-trunkable but not session-trunkable with 11949 the address whose use is to be discontinued, the request might 11950 need to be modified to reflect the fact that a different session 11951 will be used. 11953 * When use of a particular connection is to cease, as indicated by 11954 receiving NFS4ERR_MOVED when using that connection, but that 11955 address is still indicated as accessible according to the 11956 appropriate file system location entries, it is likely that 11957 requests can be issued on a new connection of a different 11958 connection type once that connection is established. Since any 11959 two non-port-specific server endpoints that share a network 11960 address are inherently session-trunkable, the client can use 11961 BIND_CONN_TO_SESSION to access the existing session with the new 11962 connection. 11964 * When there are no potential replacement addresses in use, but 11965 there are valid addresses session-trunkable with the one whose use 11966 is to be discontinued, the client can use BIND_CONN_TO_SESSION to 11967 access the existing session using the new address. Although the 11968 target session will generally be accessible, there may be rare 11969 situations in which that session is no longer accessible when an 11970 attempt is made to bind the new connection to it. In this case, 11971 the client can create a new session to enable continued access to 11972 the existing instance using the new connection, providing for the 11973 use of existing filehandles, stateids, and client ids while 11974 supplying continuity of locking state. 11976 * When there is no potential replacement address in use, and there 11977 are no valid addresses session-trunkable with the one whose use is 11978 to be discontinued, other server-trunkable addresses may be used 11979 to provide continued access. Although the use of CREATE_SESSION 11980 is available to provide continued access to the existing instance, 11981 servers have the option of providing continued access to the 11982 existing session through the new network access path in a fashion 11983 similar to that provided by session migration (see Section 11.12). 
11984 To take advantage of this possibility, clients can perform an 11985 initial BIND_CONN_TO_SESSION, as in the previous case, and use 11986 CREATE_SESSION only if that fails. 11988 11.11. Effecting File System Transitions 11990 There are a range of situations in which there is a change to be 11991 effected in the set of replicas used to access a particular file 11992 system. Some of these may involve an expansion or contraction of the 11993 set of replicas used as discussed in Section 11.11.1 below. 11995 For reasons explained in that section, most transitions will involve 11996 a transition from a single replica to a corresponding replacement 11997 replica. When effecting replica transition, some types of sharing 11998 between the replicas may affect handling of the transition as 11999 described in Sections 11.11.2 through 11.11.8 below. The attribute 12000 fs_locations_info provides helpful information to allow the client to 12001 determine the degree of inter-replica sharing. 12003 With regard to some types of state, the degree of continuity across 12004 the transition depends on the occasion prompting the transition, with 12005 transitions initiated by the servers (i.e., migration) offering much 12006 more scope for a nondisruptive transition than cases in which the 12007 client on its own shifts its access to another replica (i.e., 12008 replication). This issue potentially applies to locking state and to 12009 session state, which are dealt with below as follows: 12011 * An introduction to the possible means of providing continuity in 12012 these areas appears in Section 11.11.9 below. 12014 * Transparent State Migration is introduced in Section 11.12. The 12015 possible transfer of session state is addressed there as well. 12017 * The client handling of transitions, including determining how to 12018 deal with the various means that the server might take to supply 12019 effective continuity of locking state, is discussed in 12020 Section 11.13. 12022 * The source and destination servers' responsibilities in effecting 12023 Transparent State Migration of locking and session state are 12024 discussed in Section 11.14. 12026 11.11.1. File System Transitions and Simultaneous Access 12028 The fs_locations_info attribute (described in Section 11.17) may 12029 indicate that two replicas may be used simultaneously, although some 12030 situations in which such simultaneous access is permitted are more 12031 appropriately described as instances of trunking (see 12032 Section 11.5.4.1). Although situations in which multiple replicas 12033 may be accessed simultaneously are somewhat similar to those in which 12034 a single replica is accessed by multiple network addresses, there are 12035 important differences since locking state is not shared among 12036 multiple replicas. 12038 Because of this difference in state handling, many clients will not 12039 have the ability to take advantage of the fact that such replicas 12040 represent the same data. Such clients will not be prepared to use 12041 multiple replicas simultaneously but will access each file system 12042 using only a single replica, although the replica selected might make 12043 multiple server-trunkable addresses available. 12045 Clients who are prepared to use multiple replicas simultaneously can 12046 divide opens among replicas however they choose. Once that choice is 12047 made, any subsequent transitions will treat the set of locking state 12048 associated with each replica as a single entity. 
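The following non-normative sketch, written in Python, shows one way
such a client might organize its record keeping; the class and field
names are purely illustrative assumptions.  Keeping the state for each
replica as a single unit allows a later transition, including the
merging case discussed next, to be handled on a per-replica basis.

   # Non-normative sketch: locking state kept per replica so that a
   # transition moves one replica's state as a single entity.
   class ReplicaState:
       def __init__(self, client_id):
           self.client_id = client_id   # client ID used with this replica
           self.opens = {}              # filehandle -> open/lock stateids

   class MultiReplicaAccess:
       def __init__(self):
           self.replicas = {}           # replica network address -> state

       def add_replica(self, replica, client_id):
           self.replicas[replica] = ReplicaState(client_id)

       def record_open(self, replica, fh, stateid):
           self.replicas[replica].opens[fh] = stateid

       def transition(self, old_replica, new_replica):
           state = self.replicas.pop(old_replica)
           if new_replica in self.replicas:
               # Merging with state already held on the new replica
               # (after any needed reclaim or state transfer) is a
               # client choice; see the discussion that follows.
               self.replicas[new_replica].opens.update(state.opens)
           else:
               self.replicas[new_replica] = state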
12050 For example, if one of the replicas becomes unavailable, access will
12051 be transferred to a different replica, which is also capable of
12052 simultaneous access with the one still in use.
12054 When there is no such replica, the transition may be to the replica
12055 already in use. At this point, the client has a choice between
12056 merging the locking state for the two replicas under the aegis of the
12057 sole replica in use or treating these separately until another
12058 replica capable of simultaneous access presents itself.
12060 11.11.2. Filehandles and File System Transitions
12062 There are a number of ways in which filehandles can be handled across
12063 a file system transition. These can be divided into two broad
12064 classes depending upon whether the two file systems across which the
12065 transition happens share sufficient state to effect some sort of
12066 continuity of file system handling.
12068 When there is no such cooperation in filehandle assignment, the two
12069 file systems are reported as being in different handle classes. In
12070 this case, all filehandles are assumed to expire as part of the file
12071 system transition. Note that this behavior does not depend on the
12072 fh_expire_type attribute and supersedes the specification of the
12073 FH4_VOL_MIGRATION bit, which only affects behavior when
12074 fs_locations_info is not available.
12076 When there is cooperation in filehandle assignment, the two file
12077 systems are reported as being in the same handle class. In this
12078 case, persistent filehandles remain valid after the file system
12079 transition, while volatile filehandles (excluding those that are only
12080 volatile due to the FH4_VOL_MIGRATION bit) are subject to expiration
12081 on the target server.
12083 11.11.3. Fileids and File System Transitions
12085 In NFSv4.0, the issue of continuity of fileids in the event of a file
12086 system transition was not addressed. The general expectation had
12087 been that in situations in which the two file system instances are
12088 created by a single vendor using some sort of file system image copy,
12089 fileids would be consistent across the transition, while in the
12090 analogous multi-vendor transitions they would not. This poses
12091 difficulties, especially for the client without special knowledge of
12092 the transition mechanisms adopted by the server. Note that although
12093 fileid is not a REQUIRED attribute, many servers support fileids and
12094 many clients provide APIs that depend on fileids.
12096 It is important to note that while clients themselves may have no
12097 trouble with a fileid changing as a result of a file system
12098 transition event, applications do typically have access to the fileid
12099 (e.g., via stat). The result is that an application may work
12100 perfectly well if there is no file system instance transition or if
12101 any such transition is among instances created by a single vendor,
12102 yet be unable to deal with the situation in which a multi-vendor
12103 transition occurs at the wrong time.
12105 Providing the same fileids in a multi-vendor (multiple server
12106 vendors) environment has generally been held to be quite difficult.
12107 While there is work to be done, it needs to be pointed out that this
12108 difficulty is partly self-imposed. Servers have typically identified
12109 fileid with inode number, i.e., with a quantity used to find the file
12110 in question.
This identification poses special difficulties for
12111 migration of a file system between vendors where assigning the same
12112 index to a given file may not be possible. Note here that a fileid
12113 is not required to be useful to find the file in question, only that
12114 it is unique within the given file system. Servers prepared to
12115 accept a fileid as a single piece of metadata and store it apart from
12116 the value used to index the file information can relatively easily
12117 maintain a fileid value across a migration event, allowing a truly
12118 transparent migration event.
12120 In any case, where servers can provide continuity of fileids, they
12121 should, and the client should be able to find out that such
12122 continuity is available and take appropriate action. Information
12123 about the continuity (or lack thereof) of fileids across a file
12124 system transition is represented by specifying whether the file
12125 systems in question are of the same fileid class.
12127 Note that when consistent fileids do not exist across a transition
12128 (either because there is no continuity of fileids or because fileid
12129 is not a supported attribute on one of the instances involved), and there
12130 are no reliable filehandles across a transition event (either because
12131 there is no filehandle continuity or because the filehandles are
12132 volatile), the client is in a position where it cannot verify that
12133 files it was accessing before the transition are the same objects.
12134 It is forced to assume that no object has been renamed, and, unless
12135 there are guarantees that provide this (e.g., the file system is
12136 read-only), problems for applications may occur. Therefore, use of
12137 such configurations should be limited to situations where the
12138 problems that this may cause can be tolerated.
12140 11.11.4. Fsids and File System Transitions
12142 Since fsids are generally only unique on a per-server basis, it is
12143 likely that they will change during a file system transition.
12144 Clients should not make the fsids received from the server visible to
12145 applications since they may not be globally unique, and because they
12146 may change during a file system transition event. Applications are
12147 best served if they are isolated from such transitions to the extent
12148 possible.
12150 Although normally a single source file system will transition to a
12151 single target file system, there is a provision for splitting a
12152 single source file system into multiple target file systems, by
12153 specifying the FSLI4F_MULTI_FS flag.
12155 11.11.4.1. File System Splitting
12157 When a file system transition is made and the fs_locations_info
12158 indicates that the file system in question might be split into
12159 multiple file systems (via the FSLI4F_MULTI_FS flag), the client
12160 SHOULD do GETATTRs of the fsid attribute on all known
12161 objects within the file system undergoing transition in order to determine the
12162 new file system boundaries.
12164 Clients might choose to maintain the fsids passed to existing
12165 applications by mapping all of the fsids for the descendant file
12166 systems to the common fsid used for the original file system.
12168 Splitting a file system can be done on a transition between file
12169 systems of the same fileid class, since the fact that fileids are
12170 unique within the source file system ensures that they will be unique in
12171 each of the target file systems.
12173 11.11.5.
The Change Attribute and File System Transitions
12175 Since the change attribute is defined as a server-specific one,
12176 change attributes fetched from one server are normally presumed to be
12177 invalid on another server. Such a presumption is troublesome since
12178 it would invalidate all cached change attributes, requiring
12179 refetching. Even more disruptive, the absence of any assured
12180 continuity for the change attribute means that even if the same value
12181 is retrieved on refetch, no conclusions can be drawn as to whether
12182 the object in question has changed. The identical change attribute
12183 could be merely an artifact of a modified file with a different
12184 change attribute construction algorithm, with that new algorithm just
12185 happening to result in an identical change value.
12187 When the two file systems have consistent change attribute formats,
12188 and this fact is communicated to the client by reporting in the same
12189 change class, the client may assume a continuity of change attribute
12190 construction and handle this situation just as it would be handled
12191 without any file system transition.
12193 11.11.6. Write Verifiers and File System Transitions
12195 In a file system transition, the two file systems might be
12196 cooperating in the handling of unstably written data. Clients can
12197 determine if this is the case by seeing if the two file systems
12198 belong to the same write-verifier class. When this is the case,
12199 write verifiers returned from one system may be compared to those
12200 returned by the other and superfluous writes can be avoided.
12202 When two file systems belong to different write-verifier classes, any
12203 verifier generated by one must not be compared to one provided by the
12204 other. Instead, the two verifiers should be treated as not equal
12205 even when the values are identical.
12207 11.11.7. READDIR Cookies and Verifiers and File System Transitions
12209 In a file system transition, the two file systems might be consistent
12210 in their handling of READDIR cookies and verifiers. Clients can
12211 determine if this is the case by seeing if the two file systems
12212 belong to the same readdir class. When this is the case,
12213 READDIR cookies and verifiers from one system will be
12214 recognized by the other, and READDIR operations started on one server
12215 can be validly continued on the other simply by presenting the cookie
12216 and verifier returned by a READDIR operation done on the first file
12217 system to the second.
12219 When two file systems belong to different readdir classes, any
12220 READDIR cookie and verifier generated by one is not valid on the
12221 second and must not be presented to that server by the client. The
12222 client should act as if the verifier were rejected.
12224 11.11.8. File System Data and File System Transitions
12226 When multiple replicas exist and are used simultaneously or in
12227 succession by a client, applications using them will normally expect
12228 that they contain either the same data or data that is consistent
12229 with the normal sorts of changes that are made by other clients
12230 updating the data of the file system (with metadata being the same to
12231 the degree indicated by the fs_locations_info attribute).
However, 12232 when multiple file systems are presented as replicas of one another, 12233 the precise relationship between the data of one and the data of 12234 another is not, as a general matter, specified by the NFSv4.1 12235 protocol. It is quite possible to present as replicas file systems 12236 where the data of those file systems is sufficiently different that 12237 some applications have problems dealing with the transition between 12238 replicas. The namespace will typically be constructed so that 12239 applications can choose an appropriate level of support, so that in 12240 one position in the namespace, a varied set of replicas might be 12241 listed, while in another, only those that are up-to-date would be 12242 considered replicas. The protocol does define three special cases of 12243 the relationship among replicas to be specified by the server and 12244 relied upon by clients: 12246 * When multiple replicas exist and are used simultaneously by a 12247 client (see the FSLIB4_CLSIMUL definition within 12248 fs_locations_info), they must designate the same data. Where file 12249 systems are writable, a change made on one instance must be 12250 visible on all instances at the same time, regardless of whether 12251 the interrogated instance is the one on which the modification was 12252 done. This allows a client to use these replicas simultaneously 12253 without any special adaptation to the fact that there are multiple 12254 replicas, beyond adapting to the fact that locks obtained on one 12255 replica are maintained separately (i.e., under a different client 12256 ID). In this case, locks (whether share reservations or byte- 12257 range locks) and delegations obtained on one replica are 12258 immediately reflected on all replicas, in the sense that access 12259 from all other servers is prevented regardless of the replica 12260 used. However, because the servers are not required to treat two 12261 associated client IDs as representing the same client, it is best 12262 to access each file using only a single client ID. 12264 * When one replica is designated as the successor instance to 12265 another existing instance after the return of NFS4ERR_MOVED (i.e., 12266 the case of migration), the client may depend on the fact that all 12267 changes written to stable storage on the original instance are 12268 written to stable storage of the successor (uncommitted writes are 12269 dealt with in Section 11.11.6 above). 12271 * Where a file system is not writable but represents a read-only 12272 copy (possibly periodically updated) of a writable file system, 12273 clients have similar requirements with regard to the propagation 12274 of updates. They may need a guarantee that any change visible on 12275 the original file system instance must be immediately visible on 12276 any replica before the client transitions access to that replica, 12277 in order to avoid any possibility that a client, in effecting a 12278 transition to a replica, will see any reversion in file system 12279 state. The specific means of this guarantee varies based on the 12280 value of the fss_type field that is reported as part of the 12281 fs_status attribute (see Section 11.18). Since these file systems 12282 are presumed to be unsuitable for simultaneous use, there is no 12283 specification of how locking is handled; in general, locks 12284 obtained on one file system will be separate from those on others. 
12285 Since these are expected to be read-only file systems, this is not
12286 likely to pose an issue for clients or applications.
12288 When none of these special situations applies, there is no basis
12289 within the protocol for the client to make assumptions about the
12290 contents of a replica file system or its relationship to previous
12291 file system instances. Thus, switching between nominally identical
12292 read-write file systems would not be possible because either the
12293 client does not use the fs_locations_info attribute, or the server
12294 does not support it.
12296 11.11.9. Lock State and File System Transitions
12298 While accessing a file system, clients obtain locks enforced by the
12299 server, which may prevent actions by other clients that are
12300 inconsistent with those locks.
12302 When access is transferred between replicas, clients need to be
12303 assured that the actions disallowed by holding these locks cannot
12304 have occurred during the transition. This can be ensured by the
12305 methods below. Unless at least one of these is implemented, clients
12306 will not be assured of continuity of lock possession across a
12307 migration event:
12309 * Providing the client an opportunity to re-obtain its locks via a
12310 per-fs grace period on the destination server, denying all clients
12311 using the destination file system the opportunity to obtain new
12312 locks that conflict with those held by the transferred client as
12313 long as that client has not completed its per-fs grace period.
12314 Because the lock reclaim mechanism was originally defined to
12315 support server reboot, it implicitly assumes that filehandles
12316 will, upon reclaim, be the same as those at open. In the case of
12317 migration, this requires that source and destination servers use
12318 the same filehandles, as evidenced by using the same server scope
12319 (see Section 2.10.4) or by showing this agreement using
12320 fs_locations_info (see Section 11.11.2 above).
12322 Note that such a grace period can be implemented without
12323 interfering with the ability of non-transferred clients to obtain
12324 new locks while it is going on. As long as the destination server
12325 is aware of the transferred locks, it can distinguish requests to
12326 obtain new locks that conflict with existing locks from those that
12327 do not, allowing it to treat such client requests without
12328 reference to the ongoing grace period.
12330 * Locking state can be transferred as part of the transition by
12331 providing Transparent State Migration as described in
12332 Section 11.12.
12334 Of these, Transparent State Migration provides the smoother
12335 experience for clients in that there is no need to go through a
12336 reclaim process before new locks can be obtained; however, it
12337 requires a greater degree of inter-server coordination. In general,
12338 the servers taking part in migration are free to provide either
12339 facility. However, when the filehandles can differ across the
12340 migration event, Transparent State Migration is the only available
12341 means of providing the needed functionality.
12343 It should be noted that these two methods are not mutually exclusive
12344 and that a server might well provide both. In particular, if there
12345 is some circumstance preventing a specific lock from being
12346 transferred transparently, the destination server can allow it to be
12347 reclaimed by implementing a per-fs grace period for the migrated file
12348 system.
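As a non-normative illustration of the relationship just described,
the following Python sketch shows which continuity mechanisms a pair
of cooperating servers can offer for a migrated file system, based on
whether filehandles are preserved across the transition and whether
Transparent State Migration is supported.  The function and parameter
names are assumptions made only for illustration.

   # Non-normative sketch: lock-continuity mechanisms available to a
   # migration implementation.
   def continuity_mechanisms(filehandles_preserved, tsm_supported):
       available = []
       if tsm_supported:
           available.append("transparent-state-migration")
       if filehandles_preserved:
           # Lock reclaim assumes the filehandle presented at reclaim
           # time matches the one in effect when the lock was obtained,
           # so a per-fs grace period helps only in this case.
           available.append("per-fs-grace-period")
       return available        # a server might well provide both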
12350 11.11.9.1. Security Consideration Related to Reclaiming Lock State
12351 after File System Transitions
12353 Although it is possible for a client reclaiming state to misrepresent
12354 its state in the same fashion as described in Section 8.4.2.1.1, most
12355 implementations providing for such reclamation in the case of file
12356 system transitions will have the ability to detect such
12357 misrepresentations. This limits the ability of unauthenticated
12358 clients to execute denial-of-service attacks in these circumstances.
12359 Nevertheless, the rules stated in Section 8.4.2.1.1 regarding
12360 principal verification for reclaim requests apply in this situation
12361 as well.
12363 Typically, implementations that support file system transitions will
12364 have extensive information about the locks to be transferred. This
12365 is because of the following:
12367 * Since failure is not involved, there is no need to store locking
12368 information in persistent storage.
12370 * There is no need, as there is in the failure case, to update
12371 multiple repositories containing locking state to keep them in
12372 sync. Instead, there is a one-time communication of locking state
12373 from the source to the destination server.
12375 * Providing this information avoids potential interference with
12376 existing clients using the destination file system by denying them
12377 the ability to obtain new locks during the grace period.
12379 When such detailed locking information, not necessarily including the
12380 associated stateids, is available:
12382 * It is possible to detect reclaim requests that attempt to reclaim
12383 locks that did not exist before the transfer, rejecting them with
12384 NFS4ERR_RECLAIM_BAD (Section 15.1.9.4).
12386 * It is possible, when dealing with non-reclaim requests, to
12387 determine whether they conflict with existing locks, eliminating
12388 the need to return NFS4ERR_GRACE (Section 15.1.9.2) on non-reclaim
12389 requests.
12391 It is possible for implementations of grace periods in connection
12392 with file system transitions not to have detailed locking information
12393 available at the destination server, in which case, the security
12394 situation is exactly as described in Section 8.4.2.1.1.
12396 11.11.9.2. Leases and File System Transitions
12398 In the case of lease renewal, the client may not be submitting
12399 requests for a file system that has been transferred to another
12400 server. This can occur because of the lease renewal mechanism. The
12401 client renews the lease associated with all file systems when
12402 submitting a request on an associated session, regardless of the
12403 specific file system being referenced.
12405 In order for the client to schedule renewal of its lease where there
12406 is locking state that may have been relocated to the new server, the
12407 client must find out about lease relocation before that lease expires.
12408 To accomplish this, the SEQUENCE operation will return the status bit
12409 SEQ4_STATUS_LEASE_MOVED if responsibility for any of the renewed
12410 locking state has been transferred to a new server. This will
12411 continue until the client receives an NFS4ERR_MOVED error for each of
12412 the file systems for which there has been locking state relocation.
12414 When a client receives a SEQ4_STATUS_LEASE_MOVED indication from a
12415 server, for each file system of the server for which the client has
12416 locking state, the client should perform an operation.
For
12417 simplicity, the client may choose to reference all file systems, but
12418 what is important is that it must reference all file systems for
12419 which there was locking state that has moved. Once the
12420 client receives an NFS4ERR_MOVED error for each such file system, the
12421 server will clear the SEQ4_STATUS_LEASE_MOVED indication. The client
12422 can terminate the process of checking file systems once this
12423 indication is cleared (but only if the client has received a reply
12424 for all outstanding SEQUENCE requests on all sessions it has with the
12425 server), since there are no others for which locking state has moved.
12427 A client may use GETATTR of the fs_status (or fs_locations_info)
12428 attribute on all of the file systems to get absence indications in a
12429 single (or a few) request(s), since absent file systems will not
12430 cause an error in this context. However, it still must do an
12431 operation that receives NFS4ERR_MOVED on each file system, in order
12432 to clear the SEQ4_STATUS_LEASE_MOVED indication.
12434 Once the set of file systems with transferred locking state has been
12435 determined, the client can follow the normal process to obtain the
12436 new server information (through the fs_locations and
12437 fs_locations_info attributes) and perform renewal of that lease on
12438 the new server, unless information in the fs_locations_info attribute
12439 shows that no state could have been transferred. If the server has
12440 not had state transferred to it transparently, the client will
12441 receive NFS4ERR_STALE_CLIENTID from the new server, as described
12442 above, and the client can then reclaim locks as is done in the event
12443 of server failure.
12445 11.11.9.3. Transitions and the Lease_time Attribute
12447 In order that the client may appropriately manage its lease in the
12448 case of a file system transition, the destination server must
12449 establish proper values for the lease_time attribute.
12451 When state is transferred transparently, that state should include
12452 the correct value of the lease_time attribute. The lease_time
12453 attribute on the destination server must never be less than that on
12454 the source, since this would result in premature expiration of a
12455 lease granted by the source server. Upon transitions in which state
12456 is transferred transparently, the client is under no obligation to
12457 refetch the lease_time attribute and may continue to use the value
12458 previously fetched (on the source server).
12460 If state has not been transferred transparently, either because the
12461 associated servers are shown as having different eir_server_scope
12462 strings or because the client ID is rejected when presented to the
12463 new server, the client should fetch the value of lease_time on the
12464 new (i.e., destination) server, and use it for subsequent locking
12465 requests. However, the server must respect a grace period at
12466 least as long as the lease_time on the source server, in order to
12467 ensure that clients have ample time to reclaim their locks before
12468 potentially conflicting non-reclaimed locks are granted.
12470 11.12.
Transferring State upon Migration

12472 When the transition is a result of a server-initiated decision to
12473 transition access, and the source and destination servers have
12474 implemented appropriate cooperation, it is possible to do the
12475 following:

12477 * Transfer locking state from the source to the destination server
12478 in a fashion similar to that provided by Transparent State
12479 Migration in NFSv4.0, as described in [69]. Server
12480 responsibilities are described in Section 11.14.2.

12482 * Transfer session state from the source to the destination server.
12483 Server responsibilities in effecting such a transfer are described
12484 in Section 11.14.3.

12486 The means by which the client determines which of these transfer
12487 events has occurred are described in Section 11.13.

12489 11.12.1. Transparent State Migration and pNFS

12491 When pNFS is involved, the protocol is capable of supporting:

12493 * Migration of the Metadata Server (MDS), leaving the Data Servers
12494 (DSs) in place.

12496 * Migration of the file system as a whole, including the MDS and
12497 associated DSs.

12499 * Replacement of one DS by another.

12501 * Migration of a pNFS file system to one in which pNFS is not used.

12503 * Migration of a file system not using pNFS to one in which layouts
12504 are available.

12506 Note that migration, per se, is only involved in the transfer of the
12507 MDS function. Although the servicing of a layout may be transferred
12508 from one data server to another, this is not done using the file system
12509 location attributes. The MDS can effect such transfers by recalling
12510 or revoking existing layouts and granting new ones on a different
12511 data server.

12513 Migration of the MDS function is directly supported by Transparent
12514 State Migration. Layout state will normally be transparently
12515 transferred, just as other state is. As a result, Transparent State
12516 Migration provides a framework in which, given appropriate inter-MDS
12517 data transfer, one MDS can be substituted for another.

12519 Migration of the file system function as a whole can be accomplished
12520 by recalling all layouts as part of the initial phase of the
12521 migration process. As a result, I/O will be done through the MDS
12522 during the migration process, and new layouts can be granted once the
12523 client is interacting with the new MDS. An MDS can also effect this
12524 sort of transition by revoking all layouts as part of Transparent
12525 State Migration, as long as the client is notified about the loss of
12526 locking state.

12528 In order to allow migration to a file system on which pNFS is not
12529 supported, clients need to be prepared for a situation in which
12530 layouts are not available or supported on the destination file system
12531 and so direct I/O requests to the destination server, rather than
12532 depending on layouts being available.

12534 Replacement of one DS by another is not addressed by migration as
12535 such but can be effected by an MDS recalling layouts for the DS to be
12536 replaced and issuing new ones to be served by the successor DS.

12538 Migration may transfer a file system from a server that does not
12539 support pNFS to one that does. In order to properly adapt to this
12540 situation, clients that support pNFS, but function adequately in its
12541 absence, should check for pNFS support when a file system is migrated
12542 and be prepared to use pNFS when support is available on the
12543 destination.
Client Responsibilities When Access Is Transitioned 12547 For a client to respond to an access transition, it must become aware 12548 of it. The ways in which this can happen are discussed in 12549 Section 11.13.1, which discusses indications that a specific file 12550 system access path has transitioned as well as situations in which 12551 additional activity is necessary to determine the set of file systems 12552 that have been migrated. Section 11.13.2 goes on to complete the 12553 discussion of how the set of migrated file systems might be 12554 determined. Sections 11.13.3 through 11.13.5 discuss how the client 12555 should deal with each transition it becomes aware of, either directly 12556 or as a result of migration discovery. 12558 The following terms are used to describe client activities: 12560 * "Transition recovery" refers to the process of restoring access to 12561 a file system on which NFS4ERR_MOVED was received. 12563 * "Migration recovery" refers to that subset of transition recovery 12564 that applies when the file system has migrated to a different 12565 replica. 12567 * "Migration discovery" refers to the process of determining which 12568 file system(s) have been migrated. It is necessary to avoid a 12569 situation in which leases could expire when a file system is not 12570 accessed for a long period of time, since a client unaware of the 12571 migration might be referencing an unmigrated file system and not 12572 renewing the lease associated with the migrated file system. 12574 11.13.1. Client Transition Notifications 12576 When there is a change in the network access path that a client is to 12577 use to access a file system, there are a number of related status 12578 indications with which clients need to deal: 12580 * If an attempt is made to use or return a filehandle within a file 12581 system that is no longer accessible at the address previously used 12582 to access it, the error NFS4ERR_MOVED is returned. 12584 Exceptions are made to allow such filehandles to be used when 12585 interrogating a file system location attribute. This enables a 12586 client to determine a new replica's location or a new network 12587 access path. 12589 This condition continues on subsequent attempts to access the file 12590 system in question. The only way the client can avoid the error 12591 is to cease accessing the file system in question at its old 12592 server location and access it instead using a different address at 12593 which it is now available. 12595 * Whenever a client sends a SEQUENCE operation to a server that 12596 generated state held on that client and associated with a file 12597 system no longer accessible on that server, the response will 12598 contain the status bit SEQ4_STATUS_LEASE_MOVED, indicating that 12599 there has been a lease migration. 12601 This condition continues until the client acknowledges the 12602 notification by fetching a file system location attribute for the 12603 file system whose network access path is being changed. When 12604 there are multiple such file systems, a location attribute for 12605 each such file system needs to be fetched. The location attribute 12606 for all migrated file systems needs to be fetched in order to 12607 clear the condition. Even after the condition is cleared, the 12608 client needs to respond by using the location information to 12609 access the file system at its new location to ensure that leases 12610 are not needlessly expired. 
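The acknowledgment sequence described in the second bullet above can be
illustrated by the following non-normative sketch, written in Python
purely for exposition. The callables filesystems_with_locks,
fetch_location_attr, and sequence_probe are hypothetical stand-ins for
an implementation's state tracking and COMPOUND machinery; only the
flag name SEQ4_STATUS_LEASE_MOVED comes from the protocol, and the
sketch elides the completion/verification subtleties discussed in
Section 11.13.2.

   # Non-normative sketch: clearing a SEQ4_STATUS_LEASE_MOVED
   # indication.  All helper callables are hypothetical placeholders.

   def acknowledge_lease_moved(filesystems_with_locks,
                               fetch_location_attr,
                               sequence_probe):
       """Reference each file system that may have had locking state
       moved until the server stops asserting SEQ4_STATUS_LEASE_MOVED."""
       for fs in filesystems_with_locks():
           # GETATTR of a file system location attribute is permitted
           # even on an absent file system and acts as the per-fs
           # acknowledgment of the notification.
           fetch_location_attr(fs)

           # A SEQUENCE-only probe reports the current status flags;
           # once every migrated file system has been referenced, the
           # server clears the indication and the scan can stop.
           if "SEQ4_STATUS_LEASE_MOVED" not in sequence_probe():
               break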
12612 Unlike NFSv4.0, in which the corresponding conditions are both errors
12613 and thus mutually exclusive, in NFSv4.1 the client can, and often
12614 will, receive both indications on the same request. As a result,
12615 implementations need to address the question of how to coordinate the
12616 necessary recovery actions when both indications arrive in the
12617 response to the same request. It should be noted that when
12618 processing an NFSv4 COMPOUND, the server will normally decide whether
12619 SEQ4_STATUS_LEASE_MOVED is to be set before it determines which file
12620 system will be referenced or whether NFS4ERR_MOVED is to be returned.

12622 Since these indications are not mutually exclusive in NFSv4.1, the
12623 following combinations are possible results when a COMPOUND is
12624 issued:

12626 * The COMPOUND status is NFS4ERR_MOVED, and SEQ4_STATUS_LEASE_MOVED
12627 is asserted.

12629 In this case, transition recovery is required. While it is
12630 possible that migration discovery is needed in addition, it is
12631 likely that only the accessed file system has transitioned. In
12632 any case, because addressing NFS4ERR_MOVED is necessary to allow
12633 the rejected requests to be processed on the target, dealing with
12634 it will typically have priority over migration discovery.

12636 * The COMPOUND status is NFS4ERR_MOVED, and SEQ4_STATUS_LEASE_MOVED
12637 is clear.

12639 In this case, transition recovery is also required. It is clear
12640 that migration discovery is not needed to find file systems that
12641 have been migrated other than the one returning NFS4ERR_MOVED.
12642 Cases in which this result can arise include a referral or a
12643 migration for which there is no associated locking state. This
12644 can also arise in cases in which an access path transition other
12645 than migration occurs within the same server. In such a case,
12646 there is no need to set SEQ4_STATUS_LEASE_MOVED, since the lease
12647 remains associated with the current server even though the access
12648 path has changed.

12650 * The COMPOUND status is not NFS4ERR_MOVED, and
12651 SEQ4_STATUS_LEASE_MOVED is asserted.

12653 In this case, no transition recovery activity is required on the
12654 file system(s) accessed by the request. However, to prevent
12655 avoidable lease expiration, migration discovery needs to be done.

12657 * The COMPOUND status is not NFS4ERR_MOVED, and
12658 SEQ4_STATUS_LEASE_MOVED is clear.

12660 In this case, neither transition-related activity nor migration
12661 discovery is required.

12663 Note that the specified actions only need to be taken if they are not
12664 already going on. For example, when NFS4ERR_MOVED is received while
12665 accessing a file system for which transition recovery is already
12666 occurring, the client merely waits for that recovery to be completed,
12667 while the receipt of the SEQ4_STATUS_LEASE_MOVED indication only
12668 needs to initiate migration discovery for a server if such discovery
12669 is not already underway for that server.

12671 The fact that a lease-migrated condition does not result in an error
12672 in NFSv4.1 has a number of important consequences. In addition to
12673 the fact that the two indications are not mutually exclusive, as
12674 discussed above, there are a number of issues that are important in
12675 considering implementation of migration discovery, as discussed in
12676 Section 11.13.2.
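How a client might coordinate the two indications is shown in the
following non-normative Python sketch. The reply object and the two
recovery-manager objects are hypothetical; only the status code and
flag name come from the protocol, and the duplicate-suppression checks
correspond to the note above about actions already in progress.

   # Non-normative sketch: coordinating NFS4ERR_MOVED and
   # SEQ4_STATUS_LEASE_MOVED indications from one COMPOUND reply.

   def handle_transition_indications(reply, fs, server,
                                     transition_recovery,
                                     migration_discovery):
       moved = (reply.status == "NFS4ERR_MOVED")
       lease_moved = ("SEQ4_STATUS_LEASE_MOVED" in reply.seq_status_flags)

       if moved and not transition_recovery.in_progress(fs):
           # Restore access to this file system at its new location so
           # that the rejected requests can be processed on the target.
           transition_recovery.start(fs)

       if lease_moved and not migration_discovery.in_progress(server):
           # Determine which other file systems on this server have had
           # locking state moved, to avoid needless lease expiration.
           migration_discovery.start(server)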
12678 Because SEQ4_STATUS_LEASE_MOVED is not an error condition, it is 12679 possible for file systems whose access paths have not changed to be 12680 successfully accessed on a given server even though recovery is 12681 necessary for other file systems on the same server. As a result, 12682 access can take place while: 12684 * The migration discovery process is happening for that server. 12686 * The transition recovery process is happening for other file 12687 systems connected to that server. 12689 11.13.2. Performing Migration Discovery 12691 Migration discovery can be performed in the same context as 12692 transition recovery, allowing recovery for each migrated file system 12693 to be invoked as it is discovered. Alternatively, it may be done in 12694 a separate migration discovery thread, allowing migration discovery 12695 to be done in parallel with one or more instances of transition 12696 recovery. 12698 In either case, because the lease-migrated indication does not result 12699 in an error, other access to file systems on the server can proceed 12700 normally, with the possibility that further such indications will be 12701 received, raising the issue of how such indications are to be dealt 12702 with. In general: 12704 * No action needs to be taken for such indications received by any 12705 threads performing migration discovery, since continuation of that 12706 work will address the issue. 12708 * In other cases in which migration discovery is currently being 12709 performed, nothing further needs to be done to respond to such 12710 lease migration indications, as long as one can be certain that 12711 the migration discovery process would deal with those indications. 12712 See below for details. 12714 * For such indications received in all other contexts, the 12715 appropriate response is to initiate or otherwise provide for the 12716 execution of migration discovery for file systems associated with 12717 the server IP address returning the indication. 12719 This leaves a potential difficulty in situations in which the 12720 migration discovery process is near to completion but is still 12721 operating. One should not ignore a SEQ4_STATUS_LEASE_MOVED 12722 indication if the migration discovery process is not able to respond 12723 to the discovery of additional migrating file systems without 12724 additional aid. A further complexity relevant in addressing such 12725 situations is that a lease-migrated indication may reflect the 12726 server's state at the time the SEQUENCE operation was processed, 12727 which may be different from that in effect at the time the response 12728 is received. Because new migration events may occur at any time, and 12729 because a SEQ4_STATUS_LEASE_MOVED indication may reflect the 12730 situation in effect a considerable time before the indication is 12731 received, special care needs to be taken to ensure that 12732 SEQ4_STATUS_LEASE_MOVED indications are not inappropriately ignored. 12734 A useful approach to this issue involves the use of separate 12735 externally-visible migration discovery states for each server. 12736 Separate values could represent the various possible states for the 12737 migration discovery process for a server: 12739 * Non-operation, in which migration discovery is not being 12740 performed. 12742 * Normal operation, in which there is an ongoing scan for migrated 12743 file systems. 
12745 * Completion/verification of migration discovery processing, in 12746 which the possible completion of migration discovery processing 12747 needs to be verified. 12749 Given that framework, migration discovery processing would proceed as 12750 follows: 12752 * While in the normal-operation state, the thread performing 12753 discovery would fetch, for successive file systems known to the 12754 client on the server being worked on, a file system location 12755 attribute plus the fs_status attribute. 12757 * If the fs_status attribute indicates that the file system is a 12758 migrated one (i.e., fss_absent is true, and fss_type != 12759 STATUS4_REFERRAL), then a migrated file system has been found. In 12760 this situation, it is likely that the fetch of the file system 12761 location attribute has cleared one of the file systems 12762 contributing to the lease-migrated indication. 12764 * In cases in which that happened, the thread cannot know whether 12765 the lease-migrated indication has been cleared, and so it enters 12766 the completion/verification state and proceeds to issue a COMPOUND 12767 to see if the SEQ4_STATUS_LEASE_MOVED indication has been cleared. 12769 * When the discovery process is in the completion/verification 12770 state, if other requests get a lease-migrated indication, they 12771 note that it was received. Later, the existence of such 12772 indications is used when the request completes, as described 12773 below. 12775 When the request used in the completion/verification state completes: 12777 * If a lease-migrated indication is returned, the discovery 12778 continues normally. Note that this is so even if all file systems 12779 have been traversed, since new migrations could have occurred 12780 while the process was going on. 12782 * Otherwise, if there is any record that other requests saw a lease- 12783 migrated indication while the request was occurring, that record 12784 is cleared, and the verification request is retried. The 12785 discovery process remains in the completion/verification state. 12787 * If there have been no lease-migrated indications, the work of 12788 migration discovery is considered completed, and it enters the 12789 non-operating state. Once it enters this state, subsequent lease- 12790 migrated indications will trigger a new migration discovery 12791 process. 12793 It should be noted that the process described above is not guaranteed 12794 to terminate, as a long series of new migration events might 12795 continually delay the clearing of the SEQ4_STATUS_LEASE_MOVED 12796 indication. To prevent unnecessary lease expiration, it is 12797 appropriate for clients to use the discovery of migrations to effect 12798 lease renewal immediately, rather than waiting for the clearing of 12799 the SEQ4_STATUS_LEASE_MOVED indication when the complete set of 12800 migrations is available. 12802 Lease discovery needs to be provided as described above. This 12803 ensures that the client discovers file system migrations soon enough 12804 to renew its leases on each destination server before they expire. 12805 Non-renewal of leases can lead to loss of locking state. While the 12806 consequences of such loss can be ameliorated through implementations 12807 of courtesy locks, servers are under no obligation to do so, and a 12808 conflicting lock request may mean that a lock is revoked 12809 unexpectedly. Clients should be aware of this possibility. 12811 11.13.3. 
Overview of Client Response to NFS4ERR_MOVED 12813 This section outlines a way in which a client that receives 12814 NFS4ERR_MOVED can effect transition recovery by using a new server or 12815 server endpoint if one is available. As part of that process, it 12816 will determine: 12818 * Whether the NFS4ERR_MOVED indicates migration has occurred, or 12819 whether it indicates another sort of file system access transition 12820 as discussed in Section 11.10 above. 12822 * In the case of migration, whether Transparent State Migration has 12823 occurred. 12825 * Whether any state has been lost during the process of Transparent 12826 State Migration. 12828 * Whether sessions have been transferred as part of Transparent 12829 State Migration. 12831 During the first phase of this process, the client proceeds to 12832 examine file system location entries to find the initial network 12833 address it will use to continue access to the file system or its 12834 replacement. For each location entry that the client examines, the 12835 process consists of five steps: 12837 1. Performing an EXCHANGE_ID directed at the location address. This 12838 operation is used to register the client owner (in the form of a 12839 client_owner4) with the server, to obtain a client ID to be used 12840 subsequently to communicate with it, to obtain that client ID's 12841 confirmation status, and to determine server_owner4 and scope for 12842 the purpose of determining if the entry is trunkable with the 12843 address previously being used to access the file system (i.e., 12844 that it represents another network access path to the same file 12845 system and can share locking state with it). 12847 2. Making an initial determination of whether migration has 12848 occurred. The initial determination will be based on whether the 12849 EXCHANGE_ID results indicate that the current location element is 12850 server-trunkable with that used to access the file system when 12851 access was terminated by receiving NFS4ERR_MOVED. If it is, then 12852 migration has not occurred. In that case, the transition is 12853 dealt with, at least initially, as one involving continued access 12854 to the same file system on the same server through a new network 12855 address. 12857 3. Obtaining access to existing session state or creating new 12858 sessions. How this is done depends on the initial determination 12859 of whether migration has occurred and can be done as described in 12860 Section 11.13.4 below in the case of migration or as described in 12861 Section 11.13.5 below in the case of a network address transfer 12862 without migration. 12864 4. Verifying the trunking relationship assumed in step 2 as 12865 discussed in Section 2.10.5.1. Although this step will generally 12866 confirm the initial determination, it is possible for 12867 verification to invalidate the initial determination of network 12868 address shift (without migration) and instead determine that 12869 migration had occurred. There is no need to redo step 3 above, 12870 since it will be possible to continue use of the session 12871 established already. 12873 5. Obtaining access to existing locking state and/or re-obtaining 12874 it. How this is done depends on the final determination of 12875 whether migration has occurred and can be done as described below 12876 in Section 11.13.4 in the case of migration or as described in 12877 Section 11.13.5 in the case of a network address transfer without 12878 migration. 
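The five steps above might be organized as in the following
non-normative Python sketch. The client object and its methods are
hypothetical wrappers around the EXCHANGE_ID, session-recovery,
trunking-verification, and lock-recovery machinery; the
server-trunkability test compares the eir_server_scope and so_major_id
values returned by EXCHANGE_ID with those previously recorded.

   # Non-normative sketch: examining one file system location entry.
   # 'client' and 'old_identity' are hypothetical; 'old_identity' holds
   # the server scope and server owner in effect when NFS4ERR_MOVED was
   # received.

   def examine_location_entry(addr, old_identity, client):
       # Step 1: EXCHANGE_ID directed at the candidate address.
       eir = client.exchange_id(addr)

       # Step 2: initial determination of whether migration occurred,
       # based on whether the new address is server-trunkable with the
       # old one (same eir_server_scope and same so_major_id).
       migrated = not (eir.server_scope == old_identity.server_scope and
                       eir.server_owner.so_major_id ==
                       old_identity.so_major_id)

       # Step 3: obtain access to sessions (Section 11.13.4 or 11.13.5).
       if migrated:
           session = client.sessions_after_migration(addr, eir)
       else:
           session = client.sessions_after_address_transfer(addr, eir)

       # Step 4: verify the trunking relationship (Section 2.10.5.1).
       # A failed verification changes the determination to migration,
       # but the session established in step 3 remains usable.
       if not migrated and not client.verify_trunking(addr, eir):
           migrated = True

       # Step 5: obtain or re-obtain locking state.
       if migrated:
           client.locks_after_migration(addr, session)
       else:
           client.locks_after_address_transfer(addr, session)
       return session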
12880 Once the initial address has been determined, clients are free to 12881 apply an abbreviated process to find additional addresses trunkable 12882 with it (clients may seek session-trunkable or server-trunkable 12883 addresses depending on whether they support client ID trunking). 12884 During this later phase of the process, further location entries are 12885 examined using the abbreviated procedure specified below: 12887 A: Before the EXCHANGE_ID, the fs name of the location entry is 12888 examined, and if it does not match that currently being used, the 12889 entry is ignored. Otherwise, one proceeds as specified by step 1 12890 above. 12892 B: In the case that the network address is session-trunkable with 12893 one used previously, a BIND_CONN_TO_SESSION is used to access 12894 that session using the new network address. Otherwise, or if the 12895 bind operation fails, a CREATE_SESSION is done. 12897 C: The verification procedure referred to in step 4 above is used. 12898 However, if it fails, the entry is ignored and the next available 12899 entry is used. 12901 11.13.4. Obtaining Access to Sessions and State after Migration 12903 In the event that migration has occurred, migration recovery will 12904 involve determining whether Transparent State Migration has occurred. 12905 This decision is made based on the client ID returned by the 12906 EXCHANGE_ID and the reported confirmation status. 12908 * If the client ID is an unconfirmed client ID not previously known 12909 to the client, then Transparent State Migration has not occurred. 12911 * If the client ID is a confirmed client ID previously known to the 12912 client, then any transferred state would have been merged with an 12913 existing client ID representing the client to the destination 12914 server. In this state merger case, Transparent State Migration 12915 might or might not have occurred, and a determination as to 12916 whether it has occurred is deferred until sessions are established 12917 and the client is ready to begin state recovery. 12919 * If the client ID is a confirmed client ID not previously known to 12920 the client, then the client can conclude that the client ID was 12921 transferred as part of Transparent State Migration. In this 12922 transferred client ID case, Transparent State Migration has 12923 occurred, although some state might have been lost. 12925 Once the client ID has been obtained, it is necessary to obtain 12926 access to sessions to continue communication with the new server. In 12927 any of the cases in which Transparent State Migration has occurred, 12928 it is possible that a session was transferred as well. To deal with 12929 that possibility, clients can, after doing the EXCHANGE_ID, issue a 12930 BIND_CONN_TO_SESSION to connect the transferred session to a 12931 connection to the new server. If that fails, it is an indication 12932 that the session was not transferred and that a new session needs to 12933 be created to take its place. 12935 In some situations, it is possible for a BIND_CONN_TO_SESSION to 12936 succeed without session migration having occurred. If state merger 12937 has taken place, then the associated client ID may have already had a 12938 set of existing sessions, with it being possible that the session ID 12939 of a given session is the same as one that might have been migrated. 12940 In that event, a BIND_CONN_TO_SESSION might succeed, even though 12941 there could have been no migration of the session with that session 12942 ID. 
In such cases, the client will receive sequence errors when the
12943 slot sequence values used are not appropriate on the new session.
12944 When this occurs, the client can create a new session and cease
12945 using the existing one.

12947 Once the client has determined the initial migration status, and
12948 determined that there was a shift to a new server, it needs to re-
12949 establish its locking state, if possible. To enable this to happen
12950 without loss of the guarantees normally provided by locking, the
12951 destination server needs to implement a per-fs grace period in all
12952 cases in which lock state was lost, including those in which
12953 Transparent State Migration was not implemented. Each client for
12954 which there was a transfer of locking state to the new server will
12955 have the duration of the grace period to reclaim its locks, from the
12956 time its locks were transferred.

12958 Clients need to deal with the following cases:

12960 * In the state merger case, it is possible that the server has not
12961 attempted Transparent State Migration, in which case state may
12962 have been lost without it being reflected in the SEQ4_STATUS bits.
12963 To determine whether this has happened, the client can use
12964 TEST_STATEID to check whether the stateids created on the source
12965 server are still accessible on the destination server. Once a
12966 single stateid is found to have been successfully transferred, the
12967 client can conclude that Transparent State Migration was begun,
12968 and any failure to transport all of the stateids will be reflected
12969 in the SEQ4_STATUS bits. Otherwise, Transparent State Migration
12970 has not occurred.

12972 * In a case in which Transparent State Migration has not occurred,
12973 the client can use the per-fs grace period provided by the
12974 destination server to reclaim locks that were held on the source
12975 server.

12977 * In a case in which Transparent State Migration has occurred, and
12978 no lock state was lost (as shown by SEQ4_STATUS flags), no lock
12979 reclaim is necessary.

12981 * In a case in which Transparent State Migration has occurred, and
12982 some lock state was lost (as shown by SEQ4_STATUS flags), existing
12983 stateids need to be checked for validity using TEST_STATEID, and
12984 reclaim used to re-establish any that were not transferred.

12986 For all of the cases above, RECLAIM_COMPLETE with an rca_one_fs value
12987 of TRUE needs to be done before normal use of the file system,
12988 including obtaining new locks for the file system. This applies even
12989 if no locks were lost and there was no need for any to be reclaimed.

12991 11.13.5. Obtaining Access to Sessions and State after Network Address
12992 Transfer

12994 The case in which there is a transfer to a new network address
12995 without migration is similar to that described in Section 11.13.4
12996 above in that there is a need to obtain access to needed sessions and
12997 locking state. However, the details are simpler and will vary
12998 depending on the type of trunking between the address receiving
12999 NFS4ERR_MOVED and that to which the transfer is to be made.

13001 To make a session available for use, a BIND_CONN_TO_SESSION should be
13002 used to obtain access to the session previously in use. Only if this
13003 fails should a CREATE_SESSION be done.
While this procedure mirrors 13004 that in Section 11.13.4 above, there is an important difference in 13005 that preservation of the session is not purely optional but depends 13006 on the type of trunking. 13008 Access to appropriate locking state will generally need no actions 13009 beyond access to the session. However, the SEQ4_STATUS bits need to 13010 be checked for lost locking state, including the need to reclaim 13011 locks after a server reboot, since there is always a possibility of 13012 locking state being lost. 13014 11.14. Server Responsibilities Upon Migration 13016 In the event of file system migration, when the client connects to 13017 the destination server, that server needs to be able to provide the 13018 client continued access to the files it had open on the source 13019 server. There are two ways to provide this: 13021 * By provision of an fs-specific grace period, allowing the client 13022 the ability to reclaim its locks, in a fashion similar to what 13023 would have been done in the case of recovery from a server 13024 restart. See Section 11.14.1 for a more complete discussion. 13026 * By implementing Transparent State Migration possibly in connection 13027 with session migration, the server can provide the client 13028 immediate access to the state built up on the source server on the 13029 destination server. 13031 These features are discussed separately in Sections 11.14.2 and 13032 11.14.3, which discuss Transparent State Migration and session 13033 migration, respectively. 13035 All the features described above can involve transfer of lock-related 13036 information between source and destination servers. In some cases, 13037 this transfer is a necessary part of the implementation, while in 13038 other cases, it is a helpful implementation aid, which servers might 13039 or might not use. The subsections below discuss the information that 13040 would be transferred but do not define the specifics of the transfer 13041 protocol. This is left as an implementation choice, although 13042 standards in this area could be developed at a later time. 13044 11.14.1. Server Responsibilities in Effecting State Reclaim after 13045 Migration 13047 In this case, the destination server needs no knowledge of the locks 13048 held on the source server. It relies on the clients to accurately 13049 report (via reclaim operations) the locks previously held, and does 13050 not allow new locks to be granted on migrated file systems until the 13051 grace period expires. Disallowing of new locks applies to all 13052 clients accessing these file systems, while grace period expiration 13053 occurs for each migrated client independently. 13055 During this grace period, clients have the opportunity to use reclaim 13056 operations to obtain locks for file system objects within the 13057 migrated file system, in the same way that they do when recovering 13058 from server restart, and the servers typically rely on clients to 13059 accurately report their locks, although they have the option of 13060 subjecting these requests to verification. If the clients only 13061 reclaim locks held on the source server, no conflict can arise. Once 13062 the client has reclaimed its locks, it indicates the completion of 13063 lock reclamation by performing a RECLAIM_COMPLETE specifying 13064 rca_one_fs as TRUE. 
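The grace-period behavior described above is summarized by the
following non-normative Python sketch. It takes a per-file-system
view, eliding the per-client tracking of grace expiration and any
verification of reclaim requests; the class and its names are
illustrative only.

   # Non-normative sketch: per-fs grace handling on the destination
   # server for a migrated file system.  Names are illustrative.

   import time

   class MigratedFsGrace:
       def __init__(self, source_lease_time):
           # The grace period should be at least as long as the
           # lease_time on the source server (see Section 11.11.9.3).
           self.end = time.monotonic() + source_lease_time

       def in_grace(self):
           return time.monotonic() < self.end

       def status_for_lock_request(self, is_reclaim):
           """Status a lock request on this file system should see."""
           if not self.in_grace():
               return "NFS4_OK"        # grace over: normal processing
           if is_reclaim:
               return "NFS4_OK"        # reclaims accepted, optionally
                                       # subject to verification
           return "NFS4ERR_GRACE"      # new locks denied during grace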
13066 While it is not necessary for source and destination servers to 13067 cooperate to transfer information about locks, implementations are 13068 well advised to consider transferring the following useful 13069 information: 13071 * If information about the set of clients that have locking state 13072 for the transferred file system is made available, the destination 13073 server will be able to terminate the grace period once all such 13074 clients have reclaimed their locks, allowing normal locking 13075 activity to resume earlier than it would have otherwise. 13077 * Locking summary information for individual clients (at various 13078 possible levels of detail) can detect some instances in which 13079 clients do not accurately represent the locks held on the source 13080 server. 13082 11.14.2. Server Responsibilities in Effecting Transparent State 13083 Migration 13085 The basic responsibility of the source server in effecting 13086 Transparent State Migration is to make available to the destination 13087 server a description of each piece of locking state associated with 13088 the file system being migrated. In addition to client id string and 13089 verifier, the source server needs to provide for each stateid: 13091 * The stateid including the current sequence value. 13093 * The associated client ID. 13095 * The handle of the associated file. 13097 * The type of the lock, such as open, byte-range lock, delegation, 13098 or layout. 13100 * For locks such as opens and byte-range locks, there will be 13101 information about the owner(s) of the lock. 13103 * For recallable/revocable lock types, the current recall status 13104 needs to be included. 13106 * For each lock type, there will be associated type-specific 13107 information. For opens, this will include share and deny mode 13108 while for byte-range locks and layouts, there will be a type and a 13109 byte-range. 13111 Such information will most probably be organized by client id string 13112 on the destination server so that it can be used to provide 13113 appropriate context to each client when it makes itself known to the 13114 client. Issues connected with a client impersonating another by 13115 presenting another client's client id string can be addressed using 13116 NFSv4.1 state protection features, as described in Section 21. 13118 A further server responsibility concerns locks that are revoked or 13119 otherwise lost during the process of file system migration. Because 13120 locks that appear to be lost during the process of migration will be 13121 reclaimed by the client, the servers have to take steps to ensure 13122 that locks revoked soon before or soon after migration are not 13123 inadvertently allowed to be reclaimed in situations in which the 13124 continuity of lock possession cannot be assured. 13126 * For locks lost on the source but whose loss has not yet been 13127 acknowledged by the client (by using FREE_STATEID), the 13128 destination must be aware of this loss so that it can deny a 13129 request to reclaim them. 13131 * For locks lost on the destination after the state transfer but 13132 before the client's RECLAIM_COMPLETE is done, the destination 13133 server should note these and not allow them to be reclaimed. 13135 An additional responsibility of the cooperating servers concerns 13136 situations in which a stateid cannot be transferred transparently 13137 because it conflicts with an existing stateid held by the client and 13138 associated with a different file system. 
In this case, there are two
13139 valid choices:

13141 * Treat the transfer, as in NFSv4.0, as one without Transparent
13142 State Migration. In this case, conflicting locks cannot be
13143 granted until the client does a RECLAIM_COMPLETE, after reclaiming
13144 the locks it had, with the exception of reclaims denied because
13145 they were attempts to reclaim locks that had been lost.

13147 * Implement Transparent State Migration, except for the lock with
13148 the conflicting stateid. In this case, the client will be aware
13149 of a lost lock (through the SEQ4_STATUS flags) and be allowed to
13150 reclaim it.

13152 When transferring state between the source and destination, the
13153 issues discussed in Section 7.2 of [69] must still be attended to.
13154 In this case, the use of NFS4ERR_DELAY may still be necessary in
13155 NFSv4.1, as it was in NFSv4.0, to prevent locking state from changing
13156 while it is being transferred. See Section 15.1.1.3 for information
13157 about appropriate client retry approaches in the event that
13158 NFS4ERR_DELAY is returned.

13160 There are a number of important differences in the NFSv4.1 context:

13162 * The absence of RELEASE_LOCKOWNER means that the one case in which
13163 an operation could not be deferred by use of NFS4ERR_DELAY no
13164 longer exists.

13166 * Sequencing of operations is no longer done using owner-based
13167 operation sequence numbers. Instead, sequencing is session-
13168 based.

13170 As a result, when sessions are not transferred, the techniques
13171 discussed in Section 7.2 of [69] are adequate and will not be further
13172 discussed.

13174 11.14.3. Server Responsibilities in Effecting Session Transfer

13176 The basic responsibility of the source server in effecting session
13177 transfer is to make available to the destination server a description
13178 of the current state of each slot within the session, including the
13179 following:

13181 * The last sequence value received for that slot.

13183 * Whether there is cached reply data for the last request executed
13184 and, if so, the cached reply.

13186 When sessions are transferred, there are a number of issues that pose
13187 challenges in terms of making the transferred state unmodifiable
13188 during the period it is gathered up and transferred to the
13189 destination server:

13191 * A single session may be used to access multiple file systems, not
13192 all of which are being transferred.

13194 * Requests made on a session may, even if rejected, affect the state
13195 of the session by advancing the sequence number associated with
13196 the slot used.

13198 As a result, when the file system state might otherwise be considered
13199 unmodifiable, the client might have any number of in-flight requests,
13200 each of which is capable of changing session state, which may be of a
13201 number of types:

13203 1. Those requests that were processed on the migrating file system
13204 before migration began.

13206 2. Those requests that received the error NFS4ERR_DELAY because the
13207 file system being accessed was in the process of being migrated.

13209 3. Those requests that received the error NFS4ERR_MOVED because the
13210 file system being accessed had been migrated.

13212 4. Those requests that accessed the migrating file system in order
13213 to obtain location or status information.

13215 5. Those requests that did not reference the migrating file system.
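As the preceding description notes, what the source server hands over
for each slot is limited to the last sequence value received and any
cached reply. A non-normative representation of that snapshot, with
purely illustrative names, might be:

   # Non-normative sketch: per-slot session state made available to
   # the destination server.  Field names are illustrative only.

   from dataclasses import dataclass, field
   from typing import List, Optional

   @dataclass
   class TransferredSlot:
       slot_id: int
       last_seqid: int                # last sequence value received
       cached_reply: Optional[bytes]  # cached reply, if one was retained

   @dataclass
   class TransferredSession:
       session_id: bytes              # the sessionid4 being transferred
       slots: List[TransferredSlot] = field(default_factory=list)

Because requests of the classes listed above may still be in flight
while such a snapshot is taken, the destination's copy can lag the slot
state known to the client; the remainder of this section discusses how
the destination server compensates for that.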
13217 It should be noted that the history of any particular slot is likely 13218 to include a number of these request classes. In the case in which a 13219 session that is migrated is used by file systems other than the one 13220 migrated, requests of class 5 may be common and may be the last 13221 request processed for many slots. 13223 Since session state can change even after the locking state has been 13224 fixed as part of the migration process, the session state known to 13225 the client could be different from that on the destination server, 13226 which necessarily reflects the session state on the source server at 13227 an earlier time. In deciding how to deal with this situation, it is 13228 helpful to distinguish between two sorts of behavioral consequences 13229 of the choice of initial sequence ID values: 13231 * The error NFS4ERR_SEQ_MISORDERED is returned when the sequence ID 13232 in a request is neither equal to the last one seen for the current 13233 slot nor the next greater one. 13235 In view of the difficulty of arriving at a mutually acceptable 13236 value for the correct last sequence value at the point of 13237 migration, it may be necessary for the server to show some degree 13238 of forbearance when the sequence ID is one that would be 13239 considered unacceptable if session migration were not involved. 13241 * Returning the cached reply for a previously executed request when 13242 the sequence ID in the request matches the last value recorded for 13243 the slot. 13245 In the cases in which an error is returned and there is no 13246 possibility of any non-idempotent operation having been executed, 13247 it may not be necessary to adhere to this as strictly as might be 13248 proper if session migration were not involved. For example, the 13249 fact that the error NFS4ERR_DELAY was returned may not assist the 13250 client in any material way, while the fact that NFS4ERR_MOVED was 13251 returned by the source server may not be relevant when the request 13252 was reissued and directed to the destination server. 13254 An important issue is that the specification needs to take note of 13255 all potential COMPOUNDs, even if they might be unlikely in practice. 13256 For example, a COMPOUND is allowed to access multiple file systems 13257 and might perform non-idempotent operations in some of them before 13258 accessing a file system being migrated. Also, a COMPOUND may return 13259 considerable data in the response before being rejected with 13260 NFS4ERR_DELAY or NFS4ERR_MOVED, and may in addition be marked as 13261 sa_cachethis. However, note that if the client and server adhere to 13262 rules in Section 15.1.1.3, there is no possibility of non-idempotent 13263 operations being spuriously reissued after receiving NFS4ERR_DELAY 13264 response. 13266 To address these issues, a destination server MAY do any of the 13267 following when implementing session transfer: 13269 * Avoid enforcing any sequencing semantics for a particular slot 13270 until the client has established the starting sequence for that 13271 slot on the destination server. 13273 * For each slot, avoid returning a cached reply returning 13274 NFS4ERR_DELAY or NFS4ERR_MOVED until the client has established 13275 the starting sequence for that slot on the destination server. 
13277 * Until the client has established the starting sequence for a 13278 particular slot on the destination server, avoid reporting 13279 NFS4ERR_SEQ_MISORDERED or returning a cached reply that contains 13280 either NFS4ERR_DELAY or NFS4ERR_MOVED and consists solely of a 13281 series of operations where the response is NFS4_OK until the final 13282 error. 13284 Because of the considerations mentioned above, including the rules 13285 for the handling of NFS4ERR_DELAY included in Section 15.1.1.3, the 13286 destination server can respond appropriately to SEQUENCE operations 13287 received from the client by adopting the three policies listed below: 13289 * Not responding with NFS4ERR_SEQ_MISORDERED for the initial request 13290 on a slot within a transferred session because the destination 13291 server cannot be aware of requests made by the client after the 13292 server handoff but before the client became aware of the shift. 13293 In cases in which NFS4ERR_SEQ_MISORDERED would normally have been 13294 reported, the request is to be processed normally as a new 13295 request. 13297 * Replying as it would for a retry whenever the sequence matches 13298 that transferred by the source server, even though this would not 13299 provide retry handling for requests issued after the server 13300 handoff, under the assumption that, when such requests are issued, 13301 they will never be responded to in a state-changing fashion, 13302 making retry support for them unnecessary. 13304 * Once a non-retry SEQUENCE is received for a given slot, using that 13305 as the basis for further sequence checking, with no further 13306 reference to the sequence value transferred by the source server. 13308 11.15. Effecting File System Referrals 13310 Referrals are effected when an absent file system is encountered and 13311 one or more alternate locations are made available by the 13312 fs_locations or fs_locations_info attributes. The client will 13313 typically get an NFS4ERR_MOVED error, fetch the appropriate location 13314 information, and proceed to access the file system on a different 13315 server, even though it retains its logical position within the 13316 original namespace. Referrals differ from migration events in that 13317 they happen only when the client has not previously referenced the 13318 file system in question (so there is nothing to transition). 13319 Referrals can only come into effect when an absent file system is 13320 encountered at its root. 13322 The examples given in the sections below are somewhat artificial in 13323 that an actual client will not typically do a multi-component look 13324 up, but will have cached information regarding the upper levels of 13325 the name hierarchy. However, these examples are chosen to make the 13326 required behavior clear and easy to put within the scope of a small 13327 number of requests, without getting into a discussion of the details 13328 of how specific clients might choose to cache things. 13330 11.15.1. Referral Example (LOOKUP) 13332 Let us suppose that the following COMPOUND is sent in an environment 13333 in which /this/is/the/path is absent from the target server. This 13334 may be for a number of reasons. It may be that the file system has 13335 moved, or it may be that the target server is functioning mainly, or 13336 solely, to refer clients to the servers on which various file systems 13337 are located. 
13339 * PUTROOTFH 13341 * LOOKUP "this" 13343 * LOOKUP "is" 13345 * LOOKUP "the" 13347 * LOOKUP "path" 13349 * GETFH 13351 * GETATTR (fsid, fileid, size, time_modify) 13353 Under the given circumstances, the following will be the result. 13355 * PUTROOTFH --> NFS_OK. The current fh is now the root of the 13356 pseudo-fs. 13358 * LOOKUP "this" --> NFS_OK. The current fh is for /this and is 13359 within the pseudo-fs. 13361 * LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 13362 within the pseudo-fs. 13364 * LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 13365 is within the pseudo-fs. 13367 * LOOKUP "path" --> NFS_OK. The current fh is for /this/is/the/path 13368 and is within a new, absent file system, but ... the client will 13369 never see the value of that fh. 13371 * GETFH --> NFS4ERR_MOVED. Fails because current fh is in an absent 13372 file system at the start of the operation, and the specification 13373 makes no exception for GETFH. 13375 * GETATTR (fsid, fileid, size, time_modify). Not executed because 13376 the failure of the GETFH stops processing of the COMPOUND. 13378 Given the failure of the GETFH, the client has the job of determining 13379 the root of the absent file system and where to find that file 13380 system, i.e., the server and path relative to that server's root fh. 13381 Note that in this example, the client did not obtain filehandles and 13382 attribute information (e.g., fsid) for the intermediate directories, 13383 so that it would not be sure where the absent file system starts. It 13384 could be the case, for example, that /this/is/the is the root of the 13385 moved file system and that the reason that the look up of "path" 13386 succeeded is that the file system was not absent on that operation 13387 but was moved between the last LOOKUP and the GETFH (since COMPOUND 13388 is not atomic). Even if we had the fsids for all of the intermediate 13389 directories, we could have no way of knowing that /this/is/the/path 13390 was the root of a new file system, since we don't yet have its fsid. 13392 In order to get the necessary information, let us re-send the chain 13393 of LOOKUPs with GETFHs and GETATTRs to at least get the fsids so we 13394 can be sure where the appropriate file system boundaries are. The 13395 client could choose to get fs_locations_info at the same time but in 13396 most cases the client will have a good guess as to where file system 13397 boundaries are (because of where NFS4ERR_MOVED was, and was not, 13398 received) making fetching of fs_locations_info unnecessary. 13400 OP01: PUTROOTFH --> NFS_OK 13402 * Current fh is root of pseudo-fs. 13404 OP02: GETATTR(fsid) --> NFS_OK 13406 * Just for completeness. Normally, clients will know the fsid of 13407 the pseudo-fs as soon as they establish communication with a 13408 server. 13410 OP03: LOOKUP "this" --> NFS_OK 13411 OP04: GETATTR(fsid) --> NFS_OK 13413 * Get current fsid to see where file system boundaries are. The 13414 fsid will be that for the pseudo-fs in this example, so no 13415 boundary. 13417 OP05: GETFH --> NFS_OK 13419 * Current fh is for /this and is within pseudo-fs. 13421 OP06: LOOKUP "is" --> NFS_OK 13423 * Current fh is for /this/is and is within pseudo-fs. 13425 OP07: GETATTR(fsid) --> NFS_OK 13427 * Get current fsid to see where file system boundaries are. The 13428 fsid will be that for the pseudo-fs in this example, so no 13429 boundary. 13431 OP08: GETFH --> NFS_OK 13433 * Current fh is for /this/is and is within pseudo-fs. 
13435 OP09: LOOKUP "the" --> NFS_OK

13437 * Current fh is for /this/is/the and is within pseudo-fs.

13439 OP10: GETATTR(fsid) --> NFS_OK

13441 * Get current fsid to see where file system boundaries are. The
13442 fsid will be that for the pseudo-fs in this example, so no
13443 boundary.

13445 OP11: GETFH --> NFS_OK

13447 * Current fh is for /this/is/the and is within pseudo-fs.

13449 OP12: LOOKUP "path" --> NFS_OK

13451 * Current fh is for /this/is/the/path and is within a new, absent
13452 file system, but ...

13454 * The client will never see the value of that fh.

13456 OP13: GETATTR(fsid, fs_locations_info) --> NFS_OK
13457 * We are getting the fsid to know where the file system
13458 boundaries are. In this operation, the fsid will be different
13459 than that of the parent directory (which in turn was retrieved
13460 in OP10). Note that the fsid we are given will not necessarily
13461 be preserved at the new location. That fsid might be
13462 different, and in fact the fsid we have for this file system
13463 might be a valid fsid of a different file system on that new
13464 server.

13466 * In this particular case, we are pretty sure anyway that what
13467 has moved is /this/is/the/path rather than /this/is/the since
13468 we have the fsid of the latter and it is that of the pseudo-fs,
13469 which presumably cannot move. However, in other examples, we
13470 might not have this kind of information to rely on (e.g.,
13471 /this/is/the might be a non-pseudo file system separate from
13472 /this/is/the/path), so we need to have other reliable source
13473 information on the boundary of the file system that is moved.
13474 If, for example, the file system /this/is had moved, we would
13475 have a case of migration rather than referral, and once the
13476 boundaries of the migrated file system were clear we could fetch
13477 fs_locations_info.

13479 * We are fetching fs_locations_info because the fact that we got
13480 an NFS4ERR_MOVED at this point means that it is most likely
13481 that this is a referral and we need the destination. Even if
13482 it is the case that /this/is/the is a file system that has
13483 migrated, we will still need the location information for that
13484 file system.

13486 OP14: GETFH --> NFS4ERR_MOVED

13488 * Fails because current fh is in an absent file system at the
13489 start of the operation, and the specification makes no
13490 exception for GETFH. Note that this means the server will
13491 never send the client a filehandle from within an absent file
13492 system.

13494 Given the above, the client knows where the root of the absent file
13495 system is (/this/is/the/path) by noting where the change of fsid
13496 occurred (between "the" and "path"). The fs_locations_info attribute
13497 also gives the client the actual location of the absent file system,
13498 so that the referral can proceed. The server gives the client the
13499 bare minimum of information about the absent file system so that
13500 there will be very little scope for problems of conflict between
13501 information sent by the referring server and information of the file
13502 system's home. No filehandles and very few attributes are present on
13503 the referring server, and the client can treat those it receives as
13504 transient information with the function of enabling the referral.

13506 11.15.2.
Referral Example (READDIR) 13508 Another context in which a client may encounter referrals is when it 13509 does a READDIR on a directory in which some of the sub-directories 13510 are the roots of absent file systems. 13512 Suppose such a directory is read as follows: 13514 * PUTROOTFH 13516 * LOOKUP "this" 13518 * LOOKUP "is" 13520 * LOOKUP "the" 13522 * READDIR (fsid, size, time_modify, mounted_on_fileid) 13524 In this case, because rdattr_error is not requested, 13525 fs_locations_info is not requested, and some of the attributes cannot 13526 be provided, the result will be an NFS4ERR_MOVED error on the 13527 READDIR, with the detailed results as follows: 13529 * PUTROOTFH --> NFS_OK. The current fh is at the root of the 13530 pseudo-fs. 13532 * LOOKUP "this" --> NFS_OK. The current fh is for /this and is 13533 within the pseudo-fs. 13535 * LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 13536 within the pseudo-fs. 13538 * LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 13539 is within the pseudo-fs. 13541 * READDIR (fsid, size, time_modify, mounted_on_fileid) --> 13542 NFS4ERR_MOVED. Note that the same error would have been returned 13543 if /this/is/the had migrated, but it is returned because the 13544 directory contains the root of an absent file system. 13546 So now suppose that we re-send with rdattr_error: 13548 * PUTROOTFH 13550 * LOOKUP "this" 13552 * LOOKUP "is" 13553 * LOOKUP "the" 13555 * READDIR (rdattr_error, fsid, size, time_modify, mounted_on_fileid) 13557 The results will be: 13559 * PUTROOTFH --> NFS_OK. The current fh is at the root of the 13560 pseudo-fs. 13562 * LOOKUP "this" --> NFS_OK. The current fh is for /this and is 13563 within the pseudo-fs. 13565 * LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 13566 within the pseudo-fs. 13568 * LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 13569 is within the pseudo-fs. 13571 * READDIR (rdattr_error, fsid, size, time_modify, mounted_on_fileid) 13572 --> NFS_OK. The attributes for directory entry with the component 13573 named "path" will only contain rdattr_error with the value 13574 NFS4ERR_MOVED, together with an fsid value and a value for 13575 mounted_on_fileid. 13577 Suppose we do another READDIR to get fs_locations_info (although we 13578 could have used a GETATTR directly, as in Section 11.15.1). 13580 * PUTROOTFH 13582 * LOOKUP "this" 13584 * LOOKUP "is" 13586 * LOOKUP "the" 13588 * READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, 13589 size, time_modify) 13591 The results would be: 13593 * PUTROOTFH --> NFS_OK. The current fh is at the root of the 13594 pseudo-fs. 13596 * LOOKUP "this" --> NFS_OK. The current fh is for /this and is 13597 within the pseudo-fs. 13599 * LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 13600 within the pseudo-fs. 13602 * LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 13603 is within the pseudo-fs. 13605 * READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, 13606 size, time_modify) --> NFS_OK. The attributes will be as shown 13607 below. 
13609 The attributes for the directory entry with the component named
13610 "path" will only contain:

13612 * rdattr_error (value: NFS_OK)

13614 * fs_locations_info

13616 * mounted_on_fileid (value: unique fileid within referring file
13617 system)

13619 * fsid (value: unique value within referring server)

13621 The attributes for entry "path" will not contain size or time_modify
13622 because these attributes are not available within an absent file
13623 system.

13625 11.16. The Attribute fs_locations

13627 The fs_locations attribute is structured in the following way:

13629 struct fs_location4 {
13630         utf8str_cis             server<>;
13631         pathname4               rootpath;
13632 };

13634 struct fs_locations4 {
13635         pathname4               fs_root;
13636         fs_location4            locations<>;
13637 };

13639 The fs_location4 data type is used to represent the location of a
13640 file system by providing a server name and the path to the root of
13641 the file system within that server's namespace. When a set of
13642 servers have corresponding file systems at the same path within their
13643 namespaces, an array of server names may be provided. An entry in
13644 the server array is a UTF-8 string and represents one of a
13645 traditional DNS host name, IPv4 address, IPv6 address, or a zero-
13646 length string. An IPv4 or IPv6 address is represented as a universal
13647 address (see Section 3.3.9 and [12]), minus the netid, and either
13648 with or without the trailing ".p1.p2" suffix that represents the port
13649 number. If the suffix is omitted, then the default port, 2049,
13650 SHOULD be assumed. A zero-length string SHOULD be used to indicate
13651 the current address being used for the RPC call. It is not a
13652 requirement that all servers that share the same rootpath be listed
13653 in one fs_location4 instance. The array of server names is provided
13654 for convenience. Servers that share the same rootpath may also be
13655 listed in separate fs_location4 entries in the fs_locations
13656 attribute.

13658 The fs_locations4 data type and the fs_locations attribute each
13659 contain an array of such locations. Since the namespace of each
13660 server may be constructed differently, the "fs_root" field is
13661 provided. The path represented by fs_root represents the location of
13662 the file system in the current server's namespace, i.e., that of the
13663 server from which the fs_locations attribute was obtained. The
13664 fs_root path is meant to aid the client by clearly referencing the
13665 root of the file system whose locations are being reported, no matter
13666 what object within the current file system the current filehandle
13667 designates. The fs_root is simply the pathname the client used to
13668 reach the object on the current server (i.e., the object to which the
13669 fs_locations attribute applies).

13671 When the fs_locations attribute is interrogated and there are no
13672 alternate file system locations, the server SHOULD return a zero-
13673 length array of fs_location4 structures, together with a valid
13674 fs_root.

13676 As an example, suppose there is a replicated file system located at
13677 two servers (servA and servB). At servA, the file system is located
13678 at path /a/b/c. At servB, the file system is located at path /x/y/z.
13679 If the client were to obtain the fs_locations value for the directory
13680 at /a/b/c/d, it might not necessarily know that the file system's
13681 root is located in servA's namespace at /a/b/c.
When the client 13682 switches to servB, it will need to determine that the directory it 13683 first referenced at servA is now represented by the path /x/y/z/d on 13684 servB. To facilitate this, the fs_locations attribute provided by 13685 servA would have an fs_root value of /a/b/c and two entries in 13686 fs_locations. One entry in fs_locations will be for itself (servA) 13687 and the other will be for servB with a path of /x/y/z. With this 13688 information, the client is able to substitute /x/y/z for the /a/b/c 13689 at the beginning of its access path and construct /x/y/z/d to use for 13690 the new server. 13692 Note that there is no requirement that the number of components in 13693 each rootpath be the same; there is no relation between the number of 13694 components in rootpath or fs_root, and none of the components in a 13695 rootpath and fs_root have to be the same. In the above example, we 13696 could have had a third element in the locations array, with server 13697 equal to "servC" and rootpath equal to "/I/II", and a fourth element 13698 in locations with server equal to "servD" and rootpath equal to 13699 "/aleph/beth/gimel/daleth/he". 13701 The relationship of fs_root to a rootpath is that the client 13702 replaces the pathname indicated in fs_root for the current server with 13703 the substitute indicated in rootpath for the new server. 13705 For an example of a referred or migrated file system, suppose there 13706 is a file system located at serv1. At serv1, the file system is 13707 located at /az/buky/vedi/glagoli. The client finds that the object at 13708 glagoli has migrated (or is a referral). The client gets the 13709 fs_locations attribute, which contains an fs_root of /az/buky/vedi/ 13710 glagoli, and one element in the locations array, with server equal to 13711 serv2, and rootpath equal to /izhitsa/fita. The client replaces 13712 /az/buky/vedi/glagoli with /izhitsa/fita, and uses the latter 13713 pathname on serv2. 13715 Thus, the server MUST return an fs_root that is equal to the path the 13716 client used to reach the object to which the fs_locations attribute 13717 applies. Otherwise, the client cannot determine the new path to use 13718 on the new server. 13720 Since the fs_locations attribute lacks information defining various 13721 attributes of the various file system choices presented, it SHOULD 13722 only be interrogated and used when fs_locations_info is not 13723 available. When fs_locations is used, information about the specific 13724 locations should be assumed based on the following rules. 13726 The following rules are general and apply irrespective of the 13727 context. 13729 * All listed file system instances should be considered as of the 13730 same handle class if and only if the current fh_expire_type 13731 attribute does not include the FH4_VOL_MIGRATION bit. Note that 13732 in the case of referral, filehandle issues do not apply since 13733 there can be no filehandles known within the current file system, 13734 nor is there any access to the fh_expire_type attribute on the 13735 referring (absent) file system. 13737 * All listed file system instances should be considered as of the 13738 same fileid class if and only if the fh_expire_type attribute 13739 indicates persistent filehandles and does not include the 13740 FH4_VOL_MIGRATION bit.
Note that in the case of referral, fileid 13741 issues do not apply since there can be no fileids known within the 13742 referring (absent) file system, nor is there any access to the 13743 fh_expire_type attribute. 13745 * All listed file system instances should be considered as of 13746 different change classes. 13748 For other class assignments, handling of file system transitions 13749 depends on the reasons for the transition: 13751 * When the transition is due to migration, that is, the client was 13752 directed to a new file system after receiving an NFS4ERR_MOVED 13753 error, the target should be treated as being of the same write- 13754 verifier class as the source. 13756 * When the transition is due to failover to another replica, that 13757 is, the client selected another replica without receiving an 13758 NFS4ERR_MOVED error, the target should be treated as being of a 13759 different write-verifier class from the source. 13761 The specific choices reflect typical implementation patterns for 13762 failover and controlled migration, respectively. Since other choices 13763 are possible and useful, this information is better obtained by using 13764 fs_locations_info. When a server implementation needs to communicate 13765 other choices, it MUST support the fs_locations_info attribute. 13767 See Section 21 for a discussion on the recommendations for the 13768 security flavor to be used by any GETATTR operation that requests the 13769 fs_locations attribute. 13771 11.17. The Attribute fs_locations_info 13773 The fs_locations_info attribute is intended as a more functional 13774 replacement for the fs_locations attribute, which will continue to 13775 exist and be supported. Clients can use it to get a more complete 13776 set of data about alternative file system locations, including 13777 additional network paths to access replicas in use and additional 13778 replicas. When the server does not support fs_locations_info, 13779 fs_locations can be used to get a subset of the data. A server that 13780 supports fs_locations_info MUST support fs_locations as well. 13782 There is additional data present in fs_locations_info that is not 13783 available in fs_locations: 13785 * Attribute continuity information. This information will allow a 13786 client to select a replica that meets the transparency 13787 requirements of the applications accessing the data and to 13788 leverage optimizations due to the server guarantees of attribute 13789 continuity (e.g., if the change attribute of a file of the file 13790 system is continuous between multiple replicas, the client does 13791 not have to invalidate the file's cache when switching to a 13792 different replica). 13794 * File system identity information that indicates when multiple 13795 replicas, from the client's point of view, correspond to the same 13796 target file system, allowing them to be used interchangeably, 13797 without disruption, as distinct synchronized replicas of the same 13798 file data. 13800 Note that having two replicas with common identity information is 13801 distinct from the case of two (trunked) paths to the same replica. 13803 * Information that will bear on the suitability of various replicas, 13804 depending on the use that the client intends. For example, many 13805 applications need an absolutely up-to-date copy (e.g., those that 13806 write), while others may only need access to the most up-to-date 13807 copy reasonably available.
13809 * Server-derived preference information for replicas, which can be 13810 used to implement load-balancing while giving the client the 13811 entire file system list to be used in case the primary fails. 13813 The fs_locations_info attribute is structured similarly to the 13814 fs_locations attribute. A top-level structure (fs_locations_info4) 13815 contains the entire attribute including the root pathname of the file 13816 system and an array of lower-level structures that define replicas 13817 that share a common rootpath on their respective servers. The lower- 13818 level structure in turn (fs_locations_item4) contains a specific 13819 pathname and information on one or more individual network access 13820 paths. For that last, lowest level, fs_locations_info has an 13821 fs_locations_server4 structure that contains per-server-replica 13822 information in addition to the file system location entry. This per- 13823 server-replica information includes a nominally opaque array, 13824 fls_info, within which specific pieces of information are located at 13825 the specific indices listed below. 13827 Two fs_locations_server4 entries that are within different 13828 fs_locations_item4 structures are never trunkable, while two entries 13829 within the same fs_locations_item4 structure might or might not be 13830 trunkable. Two entries that are trunkable will have identical 13831 identity information, although, as noted above, the converse is not 13832 the case. 13834 The attribute will always contain at least a single 13835 fs_locations_server4 entry. Typically, there will be an entry with 13836 the FSLI4GF_CUR_REQ flag set, although in the case of a referral 13837 there will be no entry with that flag set. 13839 It should be noted that fs_locations_info attributes returned by 13840 servers for various replicas may differ for various reasons. One 13841 server may know about a set of replicas that are not known to other 13842 servers. Further, compatibility attributes may differ. Filehandles 13843 might be of the same class going from replica A to replica B but not 13844 going in the reverse direction. This might happen because the 13845 filehandles are the same, but replica B's server implementation might 13846 not have provision to note and report that equivalence. 13848 The fs_locations_info attribute consists of a root pathname 13849 (fli_fs_root, just like fs_root in the fs_locations attribute), 13850 together with an array of fs_locations_item4 structures. The 13851 fs_locations_item4 structures in turn consist of a root pathname 13852 (fli_rootpath) together with an array (fli_entries) of elements of 13853 data type fs_locations_server4, all defined as follows. 13855 /* 13856 * Defines an individual server access path 13857 */ 13858 struct fs_locations_server4 { 13859 int32_t fls_currency; 13860 opaque fls_info<>; 13861 utf8str_cis fls_server; 13862 }; 13864 /* 13865 * Byte indices of items within 13866 * fls_info: flag fields, class numbers, 13867 * bytes indicating ranks and orders. 13868 */ 13869 const FSLI4BX_GFLAGS = 0; 13870 const FSLI4BX_TFLAGS = 1; 13872 const FSLI4BX_CLSIMUL = 2; 13873 const FSLI4BX_CLHANDLE = 3; 13874 const FSLI4BX_CLFILEID = 4; 13875 const FSLI4BX_CLWRITEVER = 5; 13876 const FSLI4BX_CLCHANGE = 6; 13877 const FSLI4BX_CLREADDIR = 7; 13879 const FSLI4BX_READRANK = 8; 13880 const FSLI4BX_WRITERANK = 9; 13881 const FSLI4BX_READORDER = 10; 13882 const FSLI4BX_WRITEORDER = 11; 13884 /* 13885 * Bits defined within the general flag byte.
13886 */ 13887 const FSLI4GF_WRITABLE = 0x01; 13888 const FSLI4GF_CUR_REQ = 0x02; 13889 const FSLI4GF_ABSENT = 0x04; 13890 const FSLI4GF_GOING = 0x08; 13891 const FSLI4GF_SPLIT = 0x10; 13893 /* 13894 * Bits defined within the transport flag byte. 13895 */ 13896 const FSLI4TF_RDMA = 0x01; 13898 /* 13899 * Defines a set of replicas sharing 13900 * a common value of the rootpath 13901 * within the corresponding 13902 * single-server namespaces. 13903 */ 13904 struct fs_locations_item4 { 13905 fs_locations_server4 fli_entries<>; 13906 pathname4 fli_rootpath; 13907 }; 13909 /* 13910 * Defines the overall structure of 13911 * the fs_locations_info attribute. 13912 */ 13913 struct fs_locations_info4 { 13914 uint32_t fli_flags; 13915 int32_t fli_valid_for; 13916 pathname4 fli_fs_root; 13917 fs_locations_item4 fli_items<>; 13918 }; 13920 /* 13921 * Flag bits in fli_flags. 13922 */ 13923 const FSLI4IF_VAR_SUB = 0x00000001; 13925 typedef fs_locations_info4 fattr4_fs_locations_info; 13927 As noted above, the fs_locations_info attribute, when supported, may 13928 be requested of absent file systems without causing NFS4ERR_MOVED to 13929 be returned. It is generally expected that it will be available for 13930 both present and absent file systems even if only a single 13931 fs_locations_server4 entry is present, designating the current 13932 (present) file system, or two fs_locations_server4 entries 13933 designating the previous location of an absent file system (the one 13934 just referenced) and its successor location. Servers are strongly 13935 urged to support this attribute on all file systems if they support 13936 it on any file system. 13938 The data presented in the fs_locations_info attribute may be obtained 13939 by the server in any number of ways, including specification by the 13940 administrator or by current protocols for transferring data among 13941 replicas and protocols not yet developed. NFSv4.1 only defines how 13942 this information is presented by the server to the client. 13944 11.17.1. The fs_locations_server4 Structure 13946 The fs_locations_server4 structure consists of the following items in 13947 addition to the fls_server field, which specifies a network address 13948 or set of addresses to be used to access the specified file system. 13949 Note that both of these items (i.e., fls_currency and fls_info) 13950 specify attributes of the file system replica and should not be 13951 different when there are multiple fs_locations_server4 structures, 13952 each specifying a network path to the chosen replica, for the same 13953 replica. 13955 When these values are different in two fs_locations_server4 13956 structures, a client has no basis for choosing one over the other and 13957 is best off simply ignoring both entries, whether these entries apply 13958 to migration replication or referral. When there are more than two 13959 such entries, majority voting can be used to exclude a single 13960 erroneous entry from consideration. In the case in which trunking 13961 information is provided for a replica currently being accessed, the 13962 additional trunked addresses can be ignored while access continues on 13963 the address currently being used, even if the entry corresponding to 13964 that path might be considered invalid. 13966 * An indication of how up-to-date the file system is (fls_currency) 13967 in seconds. This value is relative to the master copy. A 13968 negative value indicates that the server is unable to give any 13969 reasonably useful value here. 
A value of zero indicates that the 13970 file system is the actual writable data or a reliably coherent and 13971 fully up-to-date copy. Positive values indicate how out-of-date 13972 this copy can normally be before it is considered for update. 13973 Such a value is not a guarantee that such updates will always be 13974 performed on the required schedule but instead serves as a hint 13975 about how far the copy of the data would be expected to be behind 13976 the most up-to-date copy. 13978 * A counted array of one-byte values (fls_info) containing 13979 information about the particular file system instance. This data 13980 includes general flags, transport capability flags, file system 13981 equivalence class information, and selection priority information. 13982 The encoding will be discussed below. 13984 * The server string (fls_server). For the case of the replica 13985 currently being accessed (via GETATTR), a zero-length string MAY 13986 be used to indicate the current address being used for the RPC 13987 call. The fls_server field can also be an IPv4 or IPv6 address, 13988 formatted the same way as an IPv4 or IPv6 address in the "server" 13989 field of the fs_location4 data type (see Section 11.16). 13991 With the exception of the transport-flag field (at offset 13992 FSLI4BX_TFLAGS within the fls_info array), all of the data defined in 13993 this specification applies to the replica specified by the entry, 13994 rather than the specific network path used to access it. The 13995 classification of data in extensions to this data is discussed below. 13997 Data within the fls_info array is in the form of 8-bit data items 13998 with constants giving the offsets within the array of various values 13999 describing this particular file system instance. This style of 14000 definition was chosen, in preference to explicit XDR structure 14001 definitions for these values, for a number of reasons. 14003 * The kinds of data in the fls_info array, representing flags, file 14004 system classes, and priorities among sets of file systems 14005 representing the same data, are such that 8 bits provide a quite 14006 acceptable range of values. Even where there might be more than 14007 256 such file system instances, having more than 256 distinct 14008 classes or priorities is unlikely. 14010 * Explicit definition of the various specific data items within XDR 14011 would limit expandability in that any extension would 14012 require yet another attribute, leading to specification and 14013 implementation clumsiness. In the context of the NFSv4 extension 14014 model in effect at the time fs_locations_info was designed (i.e., 14015 that which is described in RFC 5661 [66]), this would necessitate 14016 a new minor version to effect any Standards Track extension to the 14017 data in fls_info. 14019 The set of fls_info data is subject to expansion in a future minor 14020 version or in a Standards Track RFC within the context of a single 14021 minor version. The server SHOULD NOT send and the client MUST NOT 14022 use indices within the fls_info array or flag bits that are not 14023 defined in Standards Track RFCs.
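   The following fragment is a non-normative sketch, in Python rather
   than the protocol XDR, of how a client might decode an fls_info
   array using the byte indices and flag bits defined above.  The
   helper names are illustrative only; indices beyond the length of
   the array actually received are treated as not provided, in line
   with the extension rules just described, and the class-matching
   helper anticipates the rule, given later in this section, that two
   replicas share a class only when both class numbers are non-zero
   and equal.

      # Non-normative sketch: decode an fls_info byte array using the
      # byte indices and flag bits defined for fs_locations_server4.

      FSLI4BX_GFLAGS, FSLI4BX_TFLAGS = 0, 1
      FSLI4BX_CLSIMUL, FSLI4BX_CLHANDLE, FSLI4BX_CLFILEID = 2, 3, 4
      FSLI4BX_CLWRITEVER, FSLI4BX_CLCHANGE, FSLI4BX_CLREADDIR = 5, 6, 7
      FSLI4BX_READRANK, FSLI4BX_WRITERANK = 8, 9
      FSLI4BX_READORDER, FSLI4BX_WRITEORDER = 10, 11

      FSLI4GF_WRITABLE, FSLI4GF_CUR_REQ, FSLI4GF_ABSENT = 0x01, 0x02, 0x04
      FSLI4GF_GOING, FSLI4GF_SPLIT = 0x08, 0x10

      def byte_at(fls_info, index):
          # Indices the server did not send are treated as undefined.
          return fls_info[index] if index < len(fls_info) else None

      def decode_fls_info(fls_info):
          gflags = byte_at(fls_info, FSLI4BX_GFLAGS)
          gflags = 0 if gflags is None else gflags
          return {
              "writable":     bool(gflags & FSLI4GF_WRITABLE),
              "current":      bool(gflags & FSLI4GF_CUR_REQ),
              "absent":       bool(gflags & FSLI4GF_ABSENT),
              "going":        bool(gflags & FSLI4GF_GOING),
              "split":        bool(gflags & FSLI4GF_SPLIT),
              "handle_class": byte_at(fls_info, FSLI4BX_CLHANDLE),
              "fileid_class": byte_at(fls_info, FSLI4BX_CLFILEID),
              "change_class": byte_at(fls_info, FSLI4BX_CLCHANGE),
              "read_rank":    byte_at(fls_info, FSLI4BX_READRANK),
              "read_order":   byte_at(fls_info, FSLI4BX_READORDER),
          }

      def same_class(class_a, class_b):
          # Two replicas match in a given respect only if both class
          # numbers are present, non-zero, and equal.
          return (class_a is not None and class_b is not None
                  and class_a != 0 and class_a == class_b)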
14025 In light of the new extension model defined in RFC 8178 [67] and the 14026 fact that the individual items within fls_info are not explicitly 14027 referenced in the XDR, the following practices should be followed 14028 when extending or otherwise changing the structure of the data 14029 returned in fls_info within the scope of a single minor version: 14031 * All extensions need to be described by Standards Track documents. 14032 There is no need for such documents to be marked as updating RFC 14033 5661 [66] or this document. 14035 * It needs to be made clear whether the information in any added 14036 data items applies to the replica specified by the entry or to the 14037 specific network paths specified in the entry. 14039 * There needs to be a reliable way defined to determine whether the 14040 server is aware of the extension. This may be based on the length 14041 field of the fls_info array, but it is more flexible to provide 14042 fs-scope or server-scope attributes to indicate what extensions 14043 are provided. 14045 This encoding scheme can be adapted to the specification of multi- 14046 byte numeric values, even though none are currently defined. If 14047 extensions are made via Standards Track RFCs, multi-byte quantities 14048 will be encoded as a range of bytes with a range of indices, with the 14049 byte interpreted in big-endian byte order. Further, any such index 14050 assignments will be constrained by the need for the relevant 14051 quantities not to cross XDR word boundaries. 14053 The fls_info array currently contains: 14055 * Two 8-bit flag fields, one devoted to general file-system 14056 characteristics and a second reserved for transport-related 14057 capabilities. 14059 * Six 8-bit class values that define various file system equivalence 14060 classes as explained below. 14062 * Four 8-bit priority values that govern file system selection as 14063 explained below. 14065 The general file system characteristics flag (at byte index 14066 FSLI4BX_GFLAGS) has the following bits defined within it: 14068 * FSLI4GF_WRITABLE indicates that this file system target is 14069 writable, allowing it to be selected by clients that may need to 14070 write on this file system. When the current file system instance 14071 is writable and is defined as of the same simultaneous use class 14072 (as specified by the value at index FSLI4BX_CLSIMUL) to which the 14073 client was previously writing, then it must incorporate within its 14074 data any committed write made on the source file system instance. 14075 See Section 11.11.6, which discusses the write-verifier class. 14076 While there is no harm in not setting this flag for a file system 14077 that turns out to be writable, turning the flag on for a read-only 14078 file system can cause problems for clients that select a migration 14079 or replication target based on the flag and then find themselves 14080 unable to write. 14082 * FSLI4GF_CUR_REQ indicates that this replica is the one on which 14083 the request is being made. Only a single server entry may have 14084 this flag set and, in the case of a referral, no entry will have 14085 it set. Note that this flag might be set even if the request was 14086 made on a network access path different from any of those 14087 specified in the current entry. 14089 * FSLI4GF_ABSENT indicates that this entry corresponds to an absent 14090 file system replica. It can only be set if FSLI4GF_CUR_REQ is 14091 set. 
When both such bits are set, it indicates that a file system 14092 instance is not usable but that the information in the entry can 14093 be used to determine the sorts of continuity available when 14094 switching from this replica to other possible replicas. Since 14095 this bit can only be true if FSLI4GF_CUR_REQ is true, the value 14096 could be determined using the fs_status attribute, but the 14097 information is also made available here for the convenience of the 14098 client. An entry with this bit, since it represents a true file 14099 system (albeit absent), does not appear in the event of a 14100 referral, but only when a file system has been accessed at this 14101 location and has subsequently been migrated. 14103 * FSLI4GF_GOING indicates that a replica, while still available, 14104 should not be used further. The client, if using it, should make 14105 an orderly transfer to another file system instance as 14106 expeditiously as possible. It is expected that file systems going 14107 out of service will be announced as FSLI4GF_GOING some time before 14108 the actual loss of service. It is also expected that the 14109 fli_valid_for value will be sufficiently small to allow clients to 14110 detect and act on scheduled events, while large enough that the 14111 cost of the requests to fetch the fs_locations_info values will 14112 not be excessive. Values on the order of ten minutes seem 14113 reasonable. 14115 When this flag is seen as part of a transition into a new file 14116 system, a client might choose to transfer immediately to another 14117 replica, or it may reference the current file system and only 14118 transition when a migration event occurs. Similarly, when this 14119 flag appears as a replica in the referral, clients would likely 14120 avoid being referred to this instance whenever there is another 14121 choice. 14123 This flag, like the other items within fls_info, applies to the 14124 replica rather than to a particular path to that replica. When it 14125 appears, a transition to a new replica, rather than to a different 14126 path to the same replica, is indicated. 14128 * FSLI4GF_SPLIT indicates that when a transition occurs from the 14129 current file system instance to this one, the replacement may 14130 consist of multiple file systems. In this case, the client has to 14131 be prepared for the possibility that objects on the same file 14132 system before migration will be on different ones after. Note 14133 that FSLI4GF_SPLIT is not incompatible with the file systems 14134 belonging to the same fileid class since, if one has a set of 14135 fileids that are unique within a file system, each subset assigned 14136 to a smaller file system after migration would not have any 14137 conflicts internal to that file system. 14139 A client, in the case of a split file system, will interrogate 14140 existing files with which it has continuing connection (it is free 14141 to simply forget cached filehandles). If the client remembers the 14142 directory filehandle associated with each open file, it may 14143 proceed upward using LOOKUPP to find the new file system 14144 boundaries. Note that in the event of a referral, there will not 14145 be any such files and so these actions will not be performed. 14146 Instead, a reference to a portion of the original file system now 14147 split off into other file systems will encounter an fsid change 14148 and possibly a further referral. 
14150 Once the client recognizes that one file system has been split 14151 into two, it can prevent the disruption of running applications by 14152 presenting the two file systems as a single one until a convenient 14153 point to recognize the transition, such as a restart. This would 14154 require a mapping from the server's fsids to fsids as seen by the 14155 client, but this is already necessary for other reasons. As noted 14156 above, existing fileids within the two descendant file systems 14157 will not conflict. Providing non-conflicting fileids for newly 14158 created files on the split file systems is the responsibility of 14159 the server (or servers working in concert). The server can encode 14160 filehandles such that filehandles generated before the split event 14161 can be discerned from those generated after the split, allowing 14162 the server to determine when the need for emulating two file 14163 systems as one is over. 14165 Although it is possible for this flag to be present in the event 14166 of referral, it would generally be of little interest to the 14167 client, since the client is not expected to have information 14168 regarding the current contents of the absent file system. 14170 The transport-flag field (at byte index FSLI4BX_TFLAGS) contains the 14171 following bits related to the transport capabilities of the specific 14172 network path(s) specified by the entry: 14174 * FSLI4TF_RDMA indicates that any specified network paths provide 14175 NFSv4.1 clients access using an RDMA-capable transport. 14177 Attribute continuity and file system identity information are 14178 expressed by defining equivalence relations on the sets of file 14179 systems presented to the client. Each such relation is expressed as 14180 a set of file system equivalence classes. For each relation, a file 14181 system has an 8-bit class number. Two file systems belong to the 14182 same class if both have identical non-zero class numbers. Zero is 14183 treated as non-matching. Most often, the relevant question for the 14184 client will be whether a given replica is identical to / continuous 14185 with the current one in a given respect, but the information should 14186 be available also as to whether two other replicas match in that 14187 respect as well. 14189 The following fields specify the file system's class numbers for the 14190 equivalence relations used in determining the nature of file system 14191 transitions. See Sections 11.9 through 11.14 and their various 14192 subsections for details about how this information is to be used. 14193 Servers may assign these values as they wish, so long as file system 14194 instances that share the same value have the specified relationship 14195 to one another; conversely, file systems that have the specified 14196 relationship to one another share a common class value. As each 14197 instance entry is added, the relationships of this instance to 14198 previously entered instances can be consulted, and if one is found 14199 that bears the specified relationship, that entry's class value can 14200 be copied to the new entry. When no such previous entry exists, a 14201 new value for that byte index (not previously used) can be selected, 14202 most likely by incrementing the value of the last class value 14203 assigned for that index. 14205 * The field with byte index FSLI4BX_CLSIMUL defines the 14206 simultaneous-use class for the file system. 14208 * The field with byte index FSLI4BX_CLHANDLE defines the handle 14209 class for the file system. 
14211 * The field with byte index FSLI4BX_CLFILEID defines the fileid 14212 class for the file system. 14214 * The field with byte index FSLI4BX_CLWRITEVER defines the write- 14215 verifier class for the file system. 14217 * The field with byte index FSLI4BX_CLCHANGE defines the change 14218 class for the file system. 14220 * The field with byte index FSLI4BX_CLREADDIR defines the readdir 14221 class for the file system. 14223 Server-specified preference information is also provided via 8-bit 14224 values within the fls_info array. The values provide a rank and an 14225 order (see below) to be used with separate values specifiable for the 14226 cases of read-only and writable file systems. These values are 14227 compared for different file systems to establish the server-specified 14228 preference, with lower values indicating "more preferred". 14230 Rank is used to express a strict server-imposed ordering on clients, 14231 with lower values indicating "more preferred". Clients should 14232 attempt to use all replicas with a given rank before they use one 14233 with a higher rank. Only if all of those file systems are 14234 unavailable should the client proceed to those of a higher rank. 14235 Because specifying a rank will override client preferences, servers 14236 should be conservative about using this mechanism, particularly when 14237 the environment is one in which client communication characteristics 14238 are neither tightly controlled nor visible to the server. 14240 Within a rank, the order value is used to specify the server's 14241 preference to guide the client's selection when the client's own 14242 preferences are not controlling, with lower values of order 14243 indicating "more preferred". If replicas are approximately equal in 14244 all respects, clients should defer to the order specified by the 14245 server. When clients look at server latency as part of their 14246 selection, they are free to use this criterion, but it is suggested 14247 that when latency differences are not significant, the server- 14248 specified order should guide selection. 14250 * The field at byte index FSLI4BX_READRANK gives the rank value to 14251 be used for read-only access. 14253 * The field at byte index FSLI4BX_READORDER gives the order value to 14254 be used for read-only access. 14256 * The field at byte index FSLI4BX_WRITERANK gives the rank value to 14257 be used for writable access. 14259 * The field at byte index FSLI4BX_WRITEORDER gives the order value 14260 to be used for writable access. 14262 Depending on the potential need for write access by a given client, 14263 one of the pairs of rank and order values is used. The read rank and 14264 order should only be used if the client knows that only reading will 14265 ever be done or if it is prepared to switch to a different replica in 14266 the event that any write access capability is required in the future. 14268 11.17.2. The fs_locations_info4 Structure 14270 The fs_locations_info4 structure, encoding the fs_locations_info 14271 attribute, contains the following: 14273 * The fli_flags field, which contains general flags that affect the 14274 interpretation of this fs_locations_info4 structure and all 14275 fs_locations_item4 structures within it. The only flag currently 14276 defined is FSLI4IF_VAR_SUB. All bits in the fli_flags field that 14277 are not defined should always be returned as zero. 
14279 * The fli_fs_root field, which contains the pathname of the root of 14280 the current file system on the current server, just as it does in 14281 the fs_locations4 structure. 14283 * An array called fli_items of fs_locations_item4 structures, which 14284 contain information about replicas of the current file system. 14285 Where the current file system is actually present, or has been 14286 present, i.e., this is not a referral situation, one of the 14287 fs_locations_item4 structures will contain an fs_locations_server4 14288 for the current server. This structure will have FSLI4GF_ABSENT 14289 set if the current file system is absent, i.e., normal access to 14290 it will return NFS4ERR_MOVED. 14292 * The fli_valid_for field specifies a time in seconds for which it 14293 is reasonable for a client to use the fs_locations_info attribute 14294 without refetch. The fli_valid_for value does not provide a 14295 guarantee of validity since servers can unexpectedly go out of 14296 service or become inaccessible for any number of reasons. Clients 14297 are well-advised to refetch this information for an actively 14298 accessed file system every fli_valid_for seconds. This is 14299 particularly important when file system replicas may go out of 14300 service in a controlled way using the FSLI4GF_GOING flag to 14301 communicate an ongoing change. The server should set 14302 fli_valid_for to a value that allows well-behaved clients to 14303 notice the FSLI4GF_GOING flag and make an orderly switch before 14304 the loss of service becomes effective. If this value is zero, 14305 then no refetch interval is appropriate and the client need not 14306 refetch this data on any particular schedule. In the event of a 14307 transition to a new file system instance, a new value of the 14308 fs_locations_info attribute will be fetched at the destination. 14309 It is to be expected that this may have a different fli_valid_for 14310 value, which the client should then use in the same fashion as the 14311 previous value. Because a refetch of the attribute causes 14312 information from all component entries to be refetched, the server 14313 will typically provide a low value for this field if any of the 14314 replicas are likely to go out of service in a short time frame. 14315 Note that, because of the ability of the server to return 14316 NFS4ERR_MOVED to trigger the use of different paths, when 14317 alternate trunked paths are available, there is generally no need 14318 to use low values of fli_valid_for in connection with the 14319 management of alternate paths to the same replica. 14321 The FSLI4IF_VAR_SUB flag within fli_flags controls whether variable 14322 substitution is to be enabled. See Section 11.17.3 for an 14323 explanation of variable substitution. 14325 11.17.3. The fs_locations_item4 Structure 14327 The fs_locations_item4 structure contains a pathname (in the field 14328 fli_rootpath) that encodes the path of the target file system 14329 replicas on the set of servers designated by the included 14330 fs_locations_server4 entries. The precise manner in which this 14331 target location is specified depends on the value of the 14332 FSLI4IF_VAR_SUB flag within the associated fs_locations_info4 14333 structure. 14335 If this flag is not set, then fli_rootpath simply designates the 14336 location of the target file system within each server's single-server 14337 namespace just as it does for the rootpath within the fs_location4 14338 structure.
When this bit is set, however, component entries of a 14339 certain form are subject to client-specific variable substitution so 14340 as to allow a degree of namespace non-uniformity in order to 14341 accommodate the selection of client-specific file system targets to 14342 adapt to different client architectures or other characteristics. 14344 When such substitution is in effect, a variable beginning with the 14345 string "${" and ending with the string "}" and containing a colon is 14346 to be replaced by the client-specific value associated with that 14347 variable. The string "unknown" should be used by the client when it 14348 has no value for such a variable. The pathname resulting from such 14349 substitutions is used to designate the target file system, so that 14350 different clients may have different file systems corresponding to 14351 that location in the multi-server namespace. 14353 As mentioned above, such substituted pathname variables contain a 14354 colon. The part before the colon is to be a DNS domain name, and the 14355 part after is to be a case-insensitive alphanumeric string. 14357 Where the domain is "ietf.org", only variable names defined in this 14358 document or subsequent Standards Track RFCs are subject to such 14359 substitution. Organizations are free to use their domain names to 14360 create their own sets of client-specific variables, to be subject to 14361 such substitution. In cases where such variables are intended to be 14362 used more broadly than a single organization, publication of an 14363 Informational RFC defining such variables is RECOMMENDED. 14365 The variable ${ietf.org:CPU_ARCH} is used to denote the CPU 14366 architecture for which object files are compiled. This specification does not 14367 limit the acceptable values (except that they must be valid UTF-8 14368 strings), but such values as "x86", "x86_64", and "sparc" would be 14369 expected to be used in line with industry practice. 14371 The variable ${ietf.org:OS_TYPE} is used to denote the operating 14372 system, and thus the kernel and library APIs, for which code might be 14373 compiled. This specification does not limit the acceptable values 14374 (except that they must be valid UTF-8 strings), but such values as 14375 "linux" and "freebsd" would be expected to be used in line with 14376 industry practice. 14378 The variable ${ietf.org:OS_VERSION} is used to denote the operating 14379 system version, and thus the specific details of versioned 14380 interfaces, for which code might be compiled. This specification 14381 does not limit the acceptable values (except that they must be valid 14382 UTF-8 strings). However, combinations of numbers and letters with 14383 interspersed dots would be expected to be used in line with industry 14384 practice, with the details of the version format depending on the 14385 specific value of the variable ${ietf.org:OS_TYPE} with which it is 14386 used. 14388 Use of these variables could result in the direction of different 14389 clients to different file systems on the same server, as appropriate 14390 to particular clients. In cases in which the target file systems are 14391 located on different servers, a single server could serve as a 14392 referral point so that each valid combination of variable values 14393 would designate a referral hosted on a single server, with the 14394 targets of those referrals on a number of different servers.
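   As an illustration only, the following non-normative Python sketch
   shows the client-side substitution described above.  The
   client_values table is hypothetical, and the sketch assumes that a
   variable makes up an entire pathname component; the protocol itself
   defines only the "${domain:name}" syntax and the use of "unknown"
   when the client has no value for a variable.

      import re

      # Hypothetical client-specific values for the variables defined
      # in this document.
      client_values = {
          "ietf.org:CPU_ARCH":   "x86_64",
          "ietf.org:OS_TYPE":    "linux",
          "ietf.org:OS_VERSION": "5.15",
      }

      _VAR = re.compile(r"^\$\{([^:}]+:[^}]+)\}$")

      def substitute_component(component):
          # Replace a "${domain:name}" component with the client's
          # value, or "unknown" when the client has no value for it.
          match = _VAR.match(component)
          if match is None:
              return component
          return client_values.get(match.group(1), "unknown")

      def substitute_rootpath(fli_rootpath):
          # fli_rootpath is a pathname4, i.e., a list of components.
          return [substitute_component(c) for c in fli_rootpath]

      # ["exports", "${ietf.org:OS_TYPE}", "${ietf.org:CPU_ARCH}"]
      # becomes ["exports", "linux", "x86_64"] for this client.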
14396 Because namespace administration is affected by the values selected 14397 to substitute for various variables, clients should provide 14398 convenient means of determining what variable substitutions they 14399 will implement, as well as, where appropriate, means to 14400 control the substitutions to be used. The exact means by which this 14401 will be done is outside the scope of this specification. 14403 Although variable substitution is most suitable for use in the 14404 context of referrals, it may be used in the context of replication 14405 and migration. If it is used in these contexts, the server must 14406 ensure that no matter what values the client presents for the 14407 substituted variables, the result is always a valid successor file 14408 system instance to that from which a transition is occurring, i.e., 14409 that the data is identical or represents a later image of a writable 14410 file system. 14412 Note that when fli_rootpath is a null pathname (that is, one with 14413 zero components), the file system designated is at the root of the 14414 specified server, whether or not the FSLI4IF_VAR_SUB flag within the 14415 associated fs_locations_info4 structure is set. 14417 11.18. The Attribute fs_status 14419 In an environment in which multiple copies of the same basic set of 14420 data are available, information regarding the particular source of 14421 such data and the relationships among different copies can be very 14422 helpful in providing consistent data to applications. 14424 enum fs4_status_type { 14425 STATUS4_FIXED = 1, 14426 STATUS4_UPDATED = 2, 14427 STATUS4_VERSIONED = 3, 14428 STATUS4_WRITABLE = 4, 14429 STATUS4_REFERRAL = 5 14430 }; 14432 struct fs4_status { 14433 bool fss_absent; 14434 fs4_status_type fss_type; 14435 utf8str_cs fss_source; 14436 utf8str_cs fss_current; 14437 int32_t fss_age; 14438 nfstime4 fss_version; 14439 }; 14441 The boolean fss_absent indicates whether the file system is currently 14442 absent. This value will be set if the file system was previously 14443 present and becomes absent, or if the file system has never been 14444 present and the type is STATUS4_REFERRAL. When this boolean is set 14445 and the type is not STATUS4_REFERRAL, the remaining information in 14446 the fs4_status reflects the state last valid when the file system was 14447 present. 14449 The fss_type field indicates the kind of file system image 14450 represented. This is of particular importance when using the version 14451 values to determine appropriate succession of file system images. 14452 When fss_absent is set and the file system was previously present, 14453 the value of fss_type reflected is that in effect when the file system was last 14454 present. Five values are distinguished: 14456 * STATUS4_FIXED, which indicates a read-only image in the sense that 14457 it will never change. The possibility is allowed that, as a 14458 result of migration or switch to a different image, changed data 14459 can be accessed, but within the confines of this instance, no 14460 change is allowed. The client can use this fact to cache 14461 aggressively.
14463 * STATUS4_VERSIONED, which indicates that the image, like the 14464 STATUS4_UPDATED case, is updated externally, but it provides a 14465 guarantee that the server will carefully update an associated 14466 version value so that the client can protect itself from a 14467 situation in which it reads data from one version of the file 14468 system and then later reads data from an earlier version of the 14469 same file system. See below for a discussion of how this can be 14470 done. 14472 * STATUS4_UPDATED, which indicates an image that cannot be updated 14473 by the user writing to it but that may be changed externally, 14474 typically because it is a periodically updated copy of another 14475 writable file system somewhere else. In this case, version 14476 information is not provided, and the client does not have the 14477 responsibility of making sure that this version only advances upon 14478 a file system instance transition. In this case, it is the 14479 responsibility of the server to make sure that the data presented 14480 after a file system instance transition is a proper successor 14481 image and includes all changes seen by the client and any change 14482 made before all such changes. 14484 * STATUS4_WRITABLE, which indicates that the file system is an 14485 actual writable one. The client need not, of course, actually 14486 write to the file system, but once it does, it should not accept a 14487 transition to anything other than a writable instance of that same 14488 file system. 14490 * STATUS4_REFERRAL, which indicates that the file system in question 14491 is absent and has never been present on this server. 14493 Note that in the STATUS4_UPDATED and STATUS4_VERSIONED cases, the 14494 server is responsible for the appropriate handling of locks that are 14495 inconsistent with external changes to delegations. If a server gives 14496 out delegations, they SHOULD be recalled before an inconsistent 14497 change is made to the data, and MUST be revoked if this is not 14498 possible. Similarly, if an OPEN is inconsistent with data that is 14499 changed (the OPEN has OPEN4_SHARE_DENY_WRITE/OPEN4_SHARE_DENY_BOTH 14500 and the data is changed), that OPEN SHOULD be considered 14501 administratively revoked. 14503 The opaque strings fss_source and fss_current provide a way of 14504 presenting information about the source of the file system image 14505 being present. It is not intended that the client do anything with 14506 this information other than make it available to administrative 14507 tools. It is intended that this information be helpful when 14508 researching possible problems with a file system image that might 14509 arise when it is unclear if the correct image is being accessed and, 14510 if not, how that image came to be made. This kind of diagnostic 14511 information will be helpful, if, as seems likely, copies of file 14512 systems are made in many different ways (e.g., simple user-level 14513 copies, file-system-level point-in-time copies, clones of the 14514 underlying storage), under a variety of administrative arrangements. 14515 In such environments, determining how a given set of data was 14516 constructed can be very helpful in resolving problems. 14518 The opaque string fss_source is used to indicate the source of a 14519 given file system with the expectation that tools capable of creating 14520 a file system image propagate this information, when possible. 
It is 14521 understood that this may not always be possible since a user-level 14522 copy may be thought of as creating a new data set and the tools used 14523 may have no mechanism to propagate this data. When a file system is 14524 initially created, it is desirable to associate with it data 14525 regarding how the file system was created, where it was created, who 14526 created it, etc. Making this information available in this attribute 14527 in a human-readable string will be helpful for applications and 14528 system administrators and will also serve to make it available when 14529 the original file system is used to make subsequent copies. 14531 The opaque string fss_current should provide whatever information is 14532 available about the source of the current copy. Such information 14533 includes the tool creating it, any relevant parameters to that tool, 14534 the time at which the copy was done, the user making the change, the 14535 server on which the change was made, etc. All information should be 14536 in a human-readable string. 14538 The field fss_age provides an indication of how out-of-date the file 14539 system currently is with respect to its ultimate data source (in case 14540 of cascading data updates). This complements the fls_currency field 14541 of fs_locations_server4 (see Section 11.17) in the following way: the 14542 information in fls_currency gives a bound for how out of date the 14543 data in a file system might typically get, while the value in fss_age 14544 gives a bound on how out-of-date that data actually is. Negative 14545 values imply that no information is available. A zero means that 14546 this data is known to be current. A positive value means that this 14547 data is known to be no older than that number of seconds with respect 14548 to the ultimate data source. Using this value, the client may be 14549 able to decide that a data copy is too old, so that it may search for 14550 a newer version to use. 14552 The fss_version field provides a version identification, in the form 14553 of a time value, such that successive versions always have later time 14554 values. When the fss_type is anything other than STATUS4_VERSIONED, 14555 the server may provide such a value, but there is no guarantee as to 14556 its validity and clients will not use it except to provide additional 14557 information to add to fss_source and fss_current. 14559 When fss_type is STATUS4_VERSIONED, servers SHOULD provide a value of 14560 fss_version that progresses monotonically whenever any new version of 14561 the data is established. This allows the client, if reliable image 14562 progression is important to it, to fetch this attribute as part of 14563 each COMPOUND where data or metadata from the file system is used. 14565 When it is important to the client to make sure that only valid 14566 successor images are accepted, it must make sure that it does not 14567 read data or metadata from the file system without updating its sense 14568 of the current state of the image. This is to avoid the possibility 14569 that the fs_status that the client holds will be one for an earlier 14570 image, which would cause the client to accept a new file system 14571 instance that is later than that but still earlier than the updated 14572 data read by the client. 14574 In order to accept valid images reliably, the client must do a 14575 GETATTR of the fs_status attribute that follows any interrogation of 14576 data or metadata within the file system in question.
Often this is 14577 most conveniently done by appending such a GETATTR after all other 14578 operations that reference a given file system. When errors occur 14579 between reading file system data and performing such a GETATTR, care 14580 must be exercised to make sure that the data in question is not used 14581 before obtaining the proper fs_status value. In this connection, 14582 when an OPEN is done within such a versioned file system and the 14583 associated GETATTR of fs_status is not successfully completed, the 14584 open file in question must not be accessed until that fs_status is 14585 fetched. 14587 The procedure above will ensure that before using any data from the 14588 file system the client has in hand a newly-fetched current version of 14589 the file system image. Multiple values for multiple requests in 14590 flight can be resolved by assembling them into the required partial 14591 order (and the elements should form a total order within the partial 14592 order) and using the last. The client may then, when switching among 14593 file system instances, decline to use an instance that does not have 14594 an fss_type of STATUS4_VERSIONED or whose fss_version field is 14595 earlier than the last one obtained from the predecessor file system 14596 instance. 14598 12. Parallel NFS (pNFS) 14600 12.1. Introduction 14602 pNFS is an OPTIONAL feature within NFSv4.1; the pNFS feature set 14603 allows direct client access to the storage devices containing file 14604 data. When file data for a single NFSv4 server is stored on multiple 14605 and/or higher-throughput storage devices (by comparison to the 14606 server's throughput capability), the result can be significantly 14607 better file access performance. The relationship among multiple 14608 clients, a single server, and multiple storage devices for pNFS 14609 (server and clients have access to all storage devices) is shown in 14610 Figure 1. 14612 +-----------+ 14613 |+-----------+ +-----------+ 14614 ||+-----------+ | | 14615 ||| | NFSv4.1 + pNFS | | 14616 +|| Clients |<------------------------------>| Server | 14617 +| | | | 14618 +-----------+ | | 14619 ||| +-----------+ 14620 ||| | 14621 ||| | 14622 ||| Storage +-----------+ | 14623 ||| Protocol |+-----------+ | 14624 ||+----------------||+-----------+ Control | 14625 |+-----------------||| | Protocol| 14626 +------------------+|| Storage |------------+ 14627 +| Devices | 14628 +-----------+ 14630 Figure 1 14632 In this model, the clients, server, and storage devices are 14633 responsible for managing file access. This is in contrast to NFSv4 14634 without pNFS, where it is primarily the server's responsibility; some 14635 of this responsibility may be delegated to the client under strictly 14636 specified conditions. See Section 12.2.5 for a discussion of the 14637 Storage Protocol. See Section 12.2.6 for a discussion of the Control 14638 Protocol. 14640 pNFS takes the form of OPTIONAL operations that manage protocol 14641 objects called 'layouts' (Section 12.2.7) that contain a byte-range 14642 and storage location information. The layout is managed in a similar 14643 fashion as NFSv4.1 data delegations. For example, the layout is 14644 leased, recallable, and revocable. However, layouts are distinct 14645 abstractions and are manipulated with new operations. When a client 14646 holds a layout, it is granted the ability to directly access the 14647 byte-range at the storage location specified in the layout. 
14649 There are interactions between layouts and other NFSv4.1 abstractions 14650 such as data delegations and byte-range locking. Delegation issues 14651 are discussed in Section 12.5.5. Byte-range locking issues are 14652 discussed in Sections 12.2.9 and 12.5.1. 14654 12.2. pNFS Definitions 14656 NFSv4.1's pNFS feature provides parallel data access to a file system 14657 that stripes its content across multiple storage servers. The first 14658 instantiation of pNFS, as part of NFSv4.1, separates the file system 14659 protocol processing into two parts: metadata processing and data 14660 processing. Data consist of the contents of regular files that are 14661 striped across storage servers. Data striping occurs in at least two 14662 ways: on a file-by-file basis and, within sufficiently large files, 14663 on a block-by-block basis. In contrast, striped access to metadata 14664 by pNFS clients is not provided in NFSv4.1, even though the file 14665 system back end of a pNFS server might stripe metadata. Metadata 14666 consist of everything else, including the contents of non-regular 14667 files (e.g., directories); see Section 12.2.1. The metadata 14668 functionality is implemented by an NFSv4.1 server that supports pNFS 14669 and the operations described in Section 18; such a server is called a 14670 metadata server (Section 12.2.2). 14672 The data functionality is implemented by one or more storage devices, 14673 each of which are accessed by the client via a storage protocol. A 14674 subset (defined in Section 13.6) of NFSv4.1 is one such storage 14675 protocol. New terms are introduced to the NFSv4.1 nomenclature and 14676 existing terms are clarified to allow for the description of the pNFS 14677 feature. 14679 12.2.1. Metadata 14681 Information about a file system object, such as its name, location 14682 within the namespace, owner, ACL, and other attributes. Metadata may 14683 also include storage location information, and this will vary based 14684 on the underlying storage mechanism that is used. 14686 12.2.2. Metadata Server 14688 An NFSv4.1 server that supports the pNFS feature. A variety of 14689 architectural choices exist for the metadata server and its use of 14690 file system information held at the server. Some servers may contain 14691 metadata only for file objects residing at the metadata server, while 14692 the file data resides on associated storage devices. Other metadata 14693 servers may hold both metadata and a varying degree of file data. 14695 12.2.3. pNFS Client 14697 An NFSv4.1 client that supports pNFS operations and supports at least 14698 one storage protocol for performing I/O to storage devices. 14700 12.2.4. Storage Device 14702 A storage device stores a regular file's data, but leaves metadata 14703 management to the metadata server. A storage device could be another 14704 NFSv4.1 server, an object-based storage device (OSD), a block device 14705 accessed over a System Area Network (SAN, e.g., either FiberChannel 14706 or iSCSI SAN), or some other entity. 14708 12.2.5. Storage Protocol 14710 As noted in Figure 1, the storage protocol is the method used by the 14711 client to store and retrieve data directly from the storage devices. 14713 The NFSv4.1 pNFS feature has been structured to allow for a variety 14714 of storage protocols to be defined and used. One example storage 14715 protocol is NFSv4.1 itself (as documented in Section 13). 
Other 14716 options for the storage protocol are described elsewhere and include: 14718 * Block/volume protocols such as Internet SCSI (iSCSI) [56] and FCP 14719 [57]. The block/volume protocol support can be independent of the 14720 addressing structure of the block/volume protocol used, allowing 14721 more than one protocol to access the same file data and enabling 14722 extensibility to other block/volume protocols. See [48] for a 14723 layout specification that allows pNFS to use block/volume storage 14724 protocols. 14726 * Object protocols such as OSD over iSCSI or Fibre Channel [58]. 14727 See [47] for a layout specification that allows pNFS to use object 14728 storage protocols. 14730 It is possible that various storage protocols are available to both 14731 client and server and it may be possible that a client and server do 14732 not have a matching storage protocol available to them. Because of 14733 this, the pNFS server MUST support normal NFSv4.1 access to any file 14734 accessible by the pNFS feature; this will allow for continued 14735 interoperability between an NFSv4.1 client and server. 14737 12.2.6. Control Protocol 14739 As noted in Figure 1, the control protocol is used by the exported 14740 file system between the metadata server and storage devices. 14741 Specification of such protocols is outside the scope of the NFSv4.1 14742 protocol. Such control protocols would be used to control activities 14743 such as the allocation and deallocation of storage, the management of 14744 state required by the storage devices to perform client access 14745 control, and, depending on the storage protocol, the enforcement of 14746 authentication and authorization so that restrictions that would be 14747 enforced by the metadata server are also enforced by the storage 14748 device. 14750 A particular control protocol is not REQUIRED by NFSv4.1 but 14751 requirements are placed on the control protocol for maintaining 14752 attributes like modify time, the change attribute, and the end-of- 14753 file (EOF) position. Note that if pNFS is layered over a clustered, 14754 parallel file system (e.g., PVFS [59]), the mechanisms that enable 14755 clustering and parallelism in that file system can be considered the 14756 control protocol. 14758 12.2.7. Layout Types 14760 A layout describes the mapping of a file's data to the storage 14761 devices that hold the data. A layout is said to belong to a specific 14762 layout type (data type layouttype4, see Section 3.3.13). The layout 14763 type allows for variants to handle different storage protocols, such 14764 as those associated with block/volume [48], object [47], and file 14765 (Section 13) layout types. A metadata server, along with its control 14766 protocol, MUST support at least one layout type. A private sub-range 14767 of the layout type namespace is also defined. Values from the 14768 private layout type range MAY be used for internal testing or 14769 experimentation (see Section 3.3.13). 14771 As an example, the organization of the file layout type could be an 14772 array of tuples (e.g., device ID, filehandle), along with a 14773 definition of how the data is stored across the devices (e.g., 14774 striping). A block/volume layout might be an array of tuples that 14775 store <deviceID, block_number, block count> along with information 14776 about block size and the associated file offset of the block number.
14777 An object layout might be an array of tuples (device ID, object ID) 14778 and an additional structure (i.e., the aggregation map) that defines 14779 how the logical byte sequence of the file data is serialized into the 14780 different objects. Note that the actual layouts are typically more 14781 complex than these simple expository examples. 14783 Requests for pNFS-related operations will often specify a layout 14784 type. Examples of such operations are GETDEVICEINFO and LAYOUTGET. 14785 The response for these operations will include structures such as a 14786 device_addr4 or a layout4, each of which includes a layout type 14787 within it. The layout type sent by the server MUST always be the 14788 same one requested by the client. When a server sends a response 14789 that includes a different layout type, the client SHOULD ignore the 14790 response and behave as if the server had returned an error response. 14792 12.2.8. Layout 14794 A layout defines how a file's data is organized on one or more 14795 storage devices. There are many potential layout types; each of the 14796 layout types are differentiated by the storage protocol used to 14797 access data and by the aggregation scheme that lays out the file data 14798 on the underlying storage devices. A layout is precisely identified 14799 by the tuple (client ID, filehandle, layout type, iomode, range), 14800 where filehandle refers to the filehandle of the file on the metadata 14801 server. 14803 It is important to define when layouts overlap and/or conflict with 14804 each other. For two layouts with overlapping byte-ranges to actually 14805 overlap each other, both layouts must be of the same layout type, 14806 correspond to the same filehandle, and have the same iomode. Layouts 14807 conflict when they overlap and differ in the content of the layout 14808 (i.e., the storage device/file mapping parameters differ). Note that 14809 differing iomodes do not lead to conflicting layouts. It is 14810 permissible for layouts with different iomodes, pertaining to the 14811 same byte-range, to be held by the same client. An example of this 14812 would be copy-on-write functionality for a block/volume layout type. 14814 12.2.9. Layout Iomode 14816 The layout iomode (data type layoutiomode4, see Section 3.3.20) 14817 indicates to the metadata server the client's intent to perform 14818 either just READ operations or a mixture containing READ and WRITE 14819 operations. For certain layout types, it is useful for a client to 14820 specify this intent at the time it sends LAYOUTGET (Section 18.43). 14821 For example, for block/volume-based protocols, block allocation could 14822 occur when a LAYOUTIOMODE4_RW iomode is specified. A special 14823 LAYOUTIOMODE4_ANY iomode is defined and can only be used for 14824 LAYOUTRETURN and CB_LAYOUTRECALL, not for LAYOUTGET. It specifies 14825 that layouts pertaining to both LAYOUTIOMODE4_READ and 14826 LAYOUTIOMODE4_RW iomodes are being returned or recalled, 14827 respectively. 14829 A storage device may validate I/O with regard to the iomode; this is 14830 dependent upon storage device implementation and layout type. Thus, 14831 if the client's layout iomode is inconsistent with the I/O being 14832 performed, the storage device may reject the client's I/O with an 14833 error indicating that a new layout with the correct iomode should be 14834 obtained via LAYOUTGET. For example, if a client gets a layout with 14835 a LAYOUTIOMODE4_READ iomode and performs a WRITE to a storage device, 14836 the storage device is allowed to reject that WRITE.
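The following sketch is illustrative only and is not part of the protocol definition. It shows one way a storage device implementation might check an incoming I/O request against the iomode of the layout under which the client is performing the I/O, using the layoutiomode4 values of Section 3.3.20. The io_request structure and the validate_iomode() function are hypothetical; whether such checking is done at all, and the error returned on rejection, remain matters of storage device implementation and layout type, as described above.

   enum layoutiomode4 {
           LAYOUTIOMODE4_READ = 1,
           LAYOUTIOMODE4_RW   = 2,
           LAYOUTIOMODE4_ANY  = 3  /* LAYOUTRETURN/CB_LAYOUTRECALL only */
   };

   /* Hypothetical view of an I/O request as seen by a storage
      device, paired with the iomode of the client's layout. */
   struct io_request {
           int                is_write;      /* non-zero for a WRITE */
           enum layoutiomode4 layout_iomode; /* iomode of the layout */
   };

   /* Return 0 if the I/O may proceed, or -1 if it should be
      rejected with an error directing the client to obtain a
      layout with the correct iomode via LAYOUTGET. */
   static int
   validate_iomode(const struct io_request *req)
   {
           if (req->is_write &&
               req->layout_iomode == LAYOUTIOMODE4_READ)
                   return -1;  /* WRITE under a READ-only layout */
           return 0;
   }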
14838 The use of the layout iomode does not conflict with OPEN share modes 14839 or byte-range LOCK operations; open share mode and byte-range lock 14840 conflicts are enforced as they are without the use of pNFS and are 14841 logically separate from the pNFS layout level. Open share modes and 14842 byte-range locks are the preferred method for restricting user access 14843 to data files. For example, an OPEN of OPEN4_SHARE_ACCESS_WRITE does 14844 not conflict with a LAYOUTGET containing an iomode of 14845 LAYOUTIOMODE4_RW performed by another client. Applications that 14846 depend on writing into the same file concurrently may use byte-range 14847 locking to serialize their accesses. 14849 12.2.10. Device IDs 14851 The device ID (data type deviceid4, see Section 3.3.14) identifies a 14852 group of storage devices. The scope of a device ID is the pair 14853 . In practice, a significant amount of 14854 information may be required to fully address a storage device. 14855 Rather than embedding all such information in a layout, layouts embed 14856 device IDs. The NFSv4.1 operation GETDEVICEINFO (Section 18.40) is 14857 used to retrieve the complete address information (including all 14858 device addresses for the device ID) regarding the storage device 14859 according to its layout type and device ID. For example, the address 14860 of an NFSv4.1 data server or of an object-based storage device could 14861 be an IP address and port. The address of a block storage device 14862 could be a volume label. 14864 Clients cannot expect the mapping between a device ID and its storage 14865 device address(es) to persist across metadata server restart. See 14866 Section 12.7.4 for a description of how recovery works in that 14867 situation. 14869 A device ID lives as long as there is a layout referring to the 14870 device ID. If there are no layouts referring to the device ID, the 14871 server is free to delete the device ID any time. Once a device ID is 14872 deleted by the server, the server MUST NOT reuse the device ID for 14873 the same layout type and client ID again. This requirement is 14874 feasible because the device ID is 16 bytes long, leaving sufficient 14875 room to store a generation number if the server's implementation 14876 requires most of the rest of the device ID's content to be reused. 14878 This requirement is necessary because otherwise the race conditions 14879 between asynchronous notification of device ID addition and deletion 14880 would be too difficult to sort out. 14882 Device ID to device address mappings are not leased, and can be 14883 changed at any time. (Note that while device ID to device address 14884 mappings are likely to change after the metadata server restarts, the 14885 server is not required to change the mappings.) A server has two 14886 choices for changing mappings. It can recall all layouts referring 14887 to the device ID or it can use a notification mechanism. 14889 The NFSv4.1 protocol has no optimal way to recall all layouts that 14890 referred to a particular device ID (unless the server associates a 14891 single device ID with a single fsid or a single client ID; in which 14892 case, CB_LAYOUTRECALL has options for recalling all layouts 14893 associated with the fsid, client ID pair, or just the client ID). 14895 Via a notification mechanism (see Section 20.12), device ID to device 14896 address mappings can change over the duration of server operation 14897 without recalling or revoking the layouts that refer to device ID. 
14898 The notification mechanism can also delete a device ID, but only if 14899 the client has no layouts referring to the device ID. A notification 14900 of a change to a device ID to device address mapping will immediately 14901 or eventually invalidate some or all of the device ID's mappings. 14902 The server MUST support notifications and the client must request 14903 them before they can be used. For further information about the 14904 notification types, see Section 20.12. 14906 12.3. pNFS Operations 14908 NFSv4.1 has several operations that are needed for pNFS servers, 14909 regardless of layout type or storage protocol. These operations are 14910 all sent to a metadata server and summarized here. While pNFS is an 14911 OPTIONAL feature, if pNFS is implemented, some operations are 14912 REQUIRED in order to comply with pNFS. See Section 17. 14914 These are the fore channel pNFS operations: 14916 GETDEVICEINFO (Section 18.40), as noted previously 14917 (Section 12.2.10), returns the mapping of device ID to storage 14918 device address. 14920 GETDEVICELIST (Section 18.41) allows clients to fetch all device IDs 14921 for a specific file system. 14923 LAYOUTGET (Section 18.43) is used by a client to get a layout for a 14924 file. 14926 LAYOUTCOMMIT (Section 18.42) is used to inform the metadata server 14927 of the client's intent to commit data that has been written to the 14928 storage device (the storage device as originally indicated in the 14929 return value of LAYOUTGET). 14931 LAYOUTRETURN (Section 18.44) is used to return layouts for a file, a 14932 file system ID (FSID), or a client ID. 14934 These are the backchannel pNFS operations: 14936 CB_LAYOUTRECALL (Section 20.3) recalls a layout, all layouts 14937 belonging to a file system, or all layouts belonging to a client 14938 ID. 14940 CB_RECALL_ANY (Section 20.6) tells a client that it needs to return 14941 some number of recallable objects, including layouts, to the 14942 metadata server. 14944 CB_RECALLABLE_OBJ_AVAIL (Section 20.7) tells a client that a 14945 recallable object that it was denied (in case of pNFS, a layout 14946 denied by LAYOUTGET) due to resource exhaustion is now available. 14948 CB_NOTIFY_DEVICEID (Section 20.12) notifies the client of changes to 14949 device IDs. 14951 12.4. pNFS Attributes 14953 A number of attributes specific to pNFS are listed and described in 14954 Section 5.12. 14956 12.5. Layout Semantics 14958 12.5.1. Guarantees Provided by Layouts 14960 Layouts grant to the client the ability to access data located at a 14961 storage device with the appropriate storage protocol. The client is 14962 guaranteed the layout will be recalled when one of two things occur: 14963 either a conflicting layout is requested or the state encapsulated by 14964 the layout becomes invalid (this can happen when an event directly or 14965 indirectly modifies the layout). When a layout is recalled and 14966 returned by the client, the client continues with the ability to 14967 access file data with normal NFSv4.1 operations through the metadata 14968 server. Only the ability to access the storage devices is affected. 14970 The requirement of NFSv4.1 that all user access rights MUST be 14971 obtained through the appropriate OPEN, LOCK, and ACCESS operations is 14972 not modified with the existence of layouts. Layouts are provided to 14973 NFSv4.1 clients, and user access still follows the rules of the 14974 protocol as if they did not exist. 
It is a requirement that for a 14975 client to access a storage device, a layout must be held by the 14976 client. If a storage device receives an I/O request for a byte-range 14977 for which the client does not hold a layout, the storage device 14978 SHOULD reject that I/O request. Note that the act of modifying a 14979 file for which a layout is held does not necessarily conflict with 14980 the holding of the layout that describes the file being modified. 14981 Therefore, it is the requirement of the storage protocol or layout 14982 type that determines the necessary behavior. For example, block/ 14983 volume layout types require that the layout's iomode agree with the 14984 type of I/O being performed. 14986 Depending upon the layout type and storage protocol in use, storage 14987 device access permissions may be granted by LAYOUTGET and may be 14988 encoded within the type-specific layout. For an example of storage 14989 device access permissions, see an object-based protocol such as [58]. 14990 If access permissions are encoded within the layout, the metadata 14991 server SHOULD recall the layout when those permissions become invalid 14992 for any reason -- for example, when a file becomes unwritable or 14993 inaccessible to a client. Note, clients are still required to 14994 perform the appropriate OPEN, LOCK, and ACCESS operations as 14995 described above. The degree to which it is possible for the client 14996 to circumvent these operations and the consequences of doing so must 14997 be clearly specified by the individual layout type specifications. 14998 In addition, these specifications must be clear about the 14999 requirements and non-requirements for the checking performed by the 15000 server. 15002 In the presence of pNFS functionality, mandatory byte-range locks 15003 MUST behave as they would without pNFS. Therefore, if mandatory file 15004 locks and layouts are provided simultaneously, the storage device 15005 MUST be able to enforce the mandatory byte-range locks. For example, 15006 if one client obtains a mandatory byte-range lock and a second client 15007 accesses the storage device, the storage device MUST appropriately 15008 restrict I/O for the range of the mandatory byte-range lock. If the 15009 storage device is incapable of providing this check in the presence 15010 of mandatory byte-range locks, then the metadata server MUST NOT 15011 grant layouts and mandatory byte-range locks simultaneously. 15013 12.5.2. Getting a Layout 15015 A client obtains a layout with the LAYOUTGET operation. The metadata 15016 server will grant layouts of a particular type (e.g., block/volume, 15017 object, or file). The client selects an appropriate layout type that 15018 the server supports and the client is prepared to use. The layout 15019 returned to the client might not exactly match the requested byte- 15020 range as described in Section 18.43.3. As needed a client may send 15021 multiple LAYOUTGET operations; these might result in multiple 15022 overlapping, non-conflicting layouts (see Section 12.2.8). 15024 In order to get a layout, the client must first have opened the file 15025 via the OPEN operation. When a client has no layout on a file, it 15026 MUST present an open stateid, a delegation stateid, or a byte-range 15027 lock stateid in the loga_stateid argument. A successful LAYOUTGET 15028 result includes a layout stateid. 
The first successful LAYOUTGET 15029 processed by the server using a non-layout stateid as an argument 15030 MUST have the "seqid" field of the layout stateid in the response set 15031 to one. Thereafter, the client MUST use a layout stateid (see 15032 Section 12.5.3) on future invocations of LAYOUTGET on the file, and 15033 the "seqid" MUST NOT be set to zero. Once the layout has been 15034 retrieved, it can be held across multiple OPEN and CLOSE sequences. 15035 Therefore, a client may hold a layout for a file that is not 15036 currently open by any user on the client. This allows for the 15037 caching of layouts beyond CLOSE. 15039 The storage protocol used by the client to access the data on the 15040 storage device is determined by the layout's type. The client is 15041 responsible for matching the layout type with an available method to 15042 interpret and use the layout. The method for this layout type 15043 selection is outside the scope of the pNFS functionality. 15045 Although the metadata server is in control of the layout for a file, 15046 the pNFS client can provide hints to the server when a file is opened 15047 or created about the preferred layout type and aggregation schemes. 15048 pNFS introduces a layout_hint attribute (Section 5.12.4) that the 15049 client can set at file creation time to provide a hint to the server 15050 for new files. Setting this attribute separately, after the file has 15051 been created might make it difficult, or impossible, for the server 15052 implementation to comply. 15054 Because the EXCLUSIVE4 createmode4 does not allow the setting of 15055 attributes at file creation time, NFSv4.1 introduces the EXCLUSIVE4_1 15056 createmode4, which does allow attributes to be set at file creation 15057 time. In addition, if the session is created with persistent reply 15058 caches, EXCLUSIVE4_1 is neither necessary nor allowed. Instead, 15059 GUARDED4 both works better and is prescribed. Table 18 in 15060 Section 18.16.3 summarizes how a client is allowed to send an 15061 exclusive create. 15063 12.5.3. Layout Stateid 15065 As with all other stateids, the layout stateid consists of a "seqid" 15066 and "other" field. Once a layout stateid is established, the "other" 15067 field will stay constant unless the stateid is revoked or the client 15068 returns all layouts on the file and the server disposes of the 15069 stateid. The "seqid" field is initially set to one, and is never 15070 zero on any NFSv4.1 operation that uses layout stateids, whether it 15071 is a fore channel or backchannel operation. After the layout stateid 15072 is established, the server increments by one the value of the "seqid" 15073 in each subsequent LAYOUTGET and LAYOUTRETURN response, and in each 15074 CB_LAYOUTRECALL request. 15076 Given the design goal of pNFS to provide parallelism, the layout 15077 stateid differs from other stateid types in that the client is 15078 expected to send LAYOUTGET and LAYOUTRETURN operations in parallel. 15079 The "seqid" value is used by the client to properly sort responses to 15080 LAYOUTGET and LAYOUTRETURN. The "seqid" is also used to prevent race 15081 conditions between LAYOUTGET and CB_LAYOUTRECALL. Given that the 15082 processing rules differ from layout stateids and other stateid types, 15083 only the pNFS sections of this document should be considered to 15084 determine proper layout stateid handling. 15086 Once the client receives a layout stateid, it MUST use the correct 15087 "seqid" for subsequent LAYOUTGET or LAYOUTRETURN operations. 
The 15088 correct "seqid" is defined as the highest "seqid" value from 15089 responses of fully processed LAYOUTGET or LAYOUTRETURN operations or 15090 arguments of a fully processed CB_LAYOUTRECALL operation. Since the 15091 server is incrementing the "seqid" value on each layout operation, 15092 the client may determine the order of operation processing by 15093 inspecting the "seqid" value. In the case of overlapping layout 15094 ranges, the ordering information will provide the client the 15095 knowledge of which layout ranges are held. Note that overlapping 15096 layout ranges may occur because of the client's specific requests or 15097 because the server is allowed to expand the range of a requested 15098 layout and notify the client in the LAYOUTRETURN results. Additional 15099 layout stateid sequencing requirements are provided in 15100 Section 12.5.5.2. 15102 The client's receipt of a "seqid" is not sufficient for subsequent 15103 use. The client must fully process the operations before the "seqid" 15104 can be used. For LAYOUTGET results, if the client is not using the 15105 forgetful model (Section 12.5.5.1), it MUST first update its record 15106 of what ranges of the file's layout it has before using the seqid. 15107 For LAYOUTRETURN results, the client MUST delete the range from its 15108 record of what ranges of the file's layout it had before using the 15109 seqid. For CB_LAYOUTRECALL arguments, the client MUST send a 15110 response to the recall before using the seqid. The fundamental 15111 requirement in client processing is that the "seqid" is used to 15112 provide the order of processing. LAYOUTGET results may be processed 15113 in parallel. LAYOUTRETURN results may be processed in parallel. 15114 LAYOUTGET and LAYOUTRETURN responses may be processed in parallel as 15115 long as the ranges do not overlap. CB_LAYOUTRECALL request 15116 processing MUST be processed in "seqid" order at all times. 15118 Once a client has no more layouts on a file, the layout stateid is no 15119 longer valid and MUST NOT be used. Any attempt to use such a layout 15120 stateid will result in NFS4ERR_BAD_STATEID. 15122 12.5.4. Committing a Layout 15124 Allowing for varying storage protocol capabilities, the pNFS protocol 15125 does not require the metadata server and storage devices to have a 15126 consistent view of file attributes and data location mappings. Data 15127 location mapping refers to aspects such as which offsets store data 15128 as opposed to storing holes (see Section 13.4.4 for a discussion). 15129 Related issues arise for storage protocols where a layout may hold 15130 provisionally allocated blocks where the allocation of those blocks 15131 does not survive a complete restart of both the client and server. 15132 Because of this inconsistency, it is necessary to resynchronize the 15133 client with the metadata server and its storage devices and make any 15134 potential changes available to other clients. This is accomplished 15135 by use of the LAYOUTCOMMIT operation. 15137 The LAYOUTCOMMIT operation is responsible for committing a modified 15138 layout to the metadata server. The data should be written and 15139 committed to the appropriate storage devices before the LAYOUTCOMMIT 15140 occurs. The scope of the LAYOUTCOMMIT operation depends on the 15141 storage protocol in use. It is important to note that the level of 15142 synchronization is from the point of view of the client that sent the 15143 LAYOUTCOMMIT. 
The updated state on the metadata server need only 15144 reflect the state as of the client's last operation previous to the 15145 LAYOUTCOMMIT. The metadata server is not REQUIRED to maintain a 15146 global view that accounts for other clients' I/O that may have 15147 occurred within the same time frame. 15149 For block/volume-based layouts, LAYOUTCOMMIT may require updating the 15150 block list that comprises the file and committing this layout to 15151 stable storage. For file-based layouts, synchronization of 15152 attributes between the metadata and storage devices, primarily the 15153 size attribute, is required. 15155 The control protocol is free to synchronize the attributes before it 15156 receives a LAYOUTCOMMIT; however, upon successful completion of a 15157 LAYOUTCOMMIT, state that exists on the metadata server that describes 15158 the file MUST be synchronized with the state that exists on the 15159 storage devices that comprise that file as of the client's last sent 15160 operation. Thus, a client that queries the size of a file between a 15161 WRITE to a storage device and the LAYOUTCOMMIT might observe a size 15162 that does not reflect the actual data written. 15164 The client MUST have a layout in order to send a LAYOUTCOMMIT 15165 operation. 15167 12.5.4.1. LAYOUTCOMMIT and change/time_modify 15169 The change and time_modify attributes may be updated by the server 15170 when the LAYOUTCOMMIT operation is processed. The reason for this is 15171 that some layout types do not support the update of these attributes 15172 when the storage devices process I/O operations. If a client has a 15173 layout with the LAYOUTIOMODE4_RW iomode on the file, the client MAY 15174 provide a suggested value to the server for time_modify within the 15175 arguments to LAYOUTCOMMIT. Based on the layout type, the provided 15176 value may or may not be used. The server should sanity-check the 15177 client-provided values before they are used. For example, the server 15178 should ensure that time does not flow backwards. The client always 15179 has the option to set time_modify through an explicit SETATTR 15180 operation. 15182 For some layout protocols, the storage device is able to notify the 15183 metadata server of the occurrence of an I/O; as a result, the change 15184 and time_modify attributes may be updated at the metadata server. 15185 For a metadata server that is capable of monitoring updates to the 15186 change and time_modify attributes, LAYOUTCOMMIT processing is not 15187 required to update the change attribute. In this case, the metadata 15188 server must ensure that no further update to the data has occurred 15189 since the last update of the attributes; file-based protocols may 15190 have enough information to make this determination or may update the 15191 change attribute upon each file modification. This also applies for 15192 the time_modify attribute. If the server implementation is able to 15193 determine that the file has not been modified since the last 15194 time_modify update, the server need not update time_modify at 15195 LAYOUTCOMMIT. At LAYOUTCOMMIT completion, the updated attributes 15196 should be visible if that file was modified since the latest previous 15197 LAYOUTCOMMIT or LAYOUTGET. 15199 12.5.4.2. LAYOUTCOMMIT and size 15201 The size of a file may be updated when the LAYOUTCOMMIT operation is 15202 used by the client. 
One of the fields in the argument to 15203 LAYOUTCOMMIT is loca_last_write_offset; this field indicates the 15204 highest byte offset written but not yet committed with the 15205 LAYOUTCOMMIT operation. The data type of loca_last_write_offset is 15206 newoffset4 and is switched on a boolean value, no_newoffset, that 15207 indicates if a previous write occurred or not. If no_newoffset is 15208 FALSE, an offset is not given. If the client has a layout with 15209 LAYOUTIOMODE4_RW iomode on the file, with a byte-range (denoted by 15210 the values of lo_offset and lo_length) that overlaps 15211 loca_last_write_offset, then the client MAY set no_newoffset to TRUE 15212 and provide an offset that will update the file size. Keep in mind 15213 that offset is not the same as length, though they are related. For 15214 example, a loca_last_write_offset value of zero means that one byte 15215 was written at offset zero, and so the length of the file is at least 15216 one byte. 15218 The metadata server may do one of the following: 15220 1. Update the file's size using the last write offset provided by 15221 the client as either the true file size or as a hint of the file 15222 size. If the metadata server has a method available, any new 15223 value for file size should be sanity-checked. For example, the 15224 file must not be truncated if the client presents a last write 15225 offset less than the file's current size. 15227 2. Ignore the client-provided last write offset; the metadata server 15228 must have sufficient knowledge from other sources to determine 15229 the file's size. For example, the metadata server queries the 15230 storage devices with the control protocol. 15232 The method chosen to update the file's size will depend on the 15233 storage device's and/or the control protocol's capabilities. For 15234 example, if the storage devices are block devices with no knowledge 15235 of file size, the metadata server must rely on the client to set the 15236 last write offset appropriately. 15238 The results of LAYOUTCOMMIT contain a new size value in the form of a 15239 newsize4 union data type. If the file's size is set as a result of 15240 LAYOUTCOMMIT, the metadata server must reply with the new size; 15241 otherwise, the new size is not provided. If the file size is 15242 updated, the metadata server SHOULD update the storage devices such 15243 that the new file size is reflected when LAYOUTCOMMIT processing is 15244 complete. For example, the client should be able to read up to the 15245 new file size. 15247 The client can extend the length of a file or truncate a file by 15248 sending a SETATTR operation to the metadata server with the size 15249 attribute specified. If the size specified is larger than the 15250 current size of the file, the file is "zero extended", i.e., zeros 15251 are implicitly added between the file's previous EOF and the new EOF. 15252 (In many implementations, the zero-extended byte-range of the file 15253 consists of unallocated holes in the file.) When the client writes 15254 past EOF via WRITE, the SETATTR operation does not need to be used. 15256 12.5.4.3. LAYOUTCOMMIT and layoutupdate 15258 The LAYOUTCOMMIT argument contains a loca_layoutupdate field 15259 (Section 18.42.1) of data type layoutupdate4 (Section 3.3.18). This 15260 argument is a layout-type-specific structure. The structure can be 15261 used to pass arbitrary layout-type-specific information from the 15262 client to the metadata server at LAYOUTCOMMIT time. 
For example, if 15263 using a block/volume layout, the client can indicate to the metadata 15264 server which reserved or allocated blocks the client used or did not 15265 use. The content of loca_layoutupdate (field lou_body) need not be 15266 the same layout-type-specific content returned by LAYOUTGET 15267 (Section 18.43.2) in the loc_body field of the lo_content field of 15268 the logr_layout field. The content of loca_layoutupdate is defined 15269 by the layout type specification and is opaque to LAYOUTCOMMIT. 15271 12.5.5. Recalling a Layout 15273 Since a layout protects a client's access to a file via a direct 15274 client-storage-device path, a layout need only be recalled when it is 15275 semantically unable to serve this function. Typically, this occurs 15276 when the layout no longer encapsulates the true location of the file 15277 over the byte-range it represents. Any operation or action, such as 15278 server-driven restriping or load balancing, that changes the layout 15279 will result in a recall of the layout. A layout is recalled by the 15280 CB_LAYOUTRECALL callback operation (see Section 20.3) and returned 15281 with LAYOUTRETURN (see Section 18.44). The CB_LAYOUTRECALL operation 15282 may recall a layout identified by a byte-range, all layouts 15283 associated with a file system ID (FSID), or all layouts associated 15284 with a client ID. Section 12.5.5.2 discusses sequencing issues 15285 surrounding the getting, returning, and recalling of layouts. 15287 An iomode is also specified when recalling a layout. Generally, the 15288 iomode in the recall request must match the layout being returned; 15289 for example, a recall with an iomode of LAYOUTIOMODE4_RW should cause 15290 the client to only return LAYOUTIOMODE4_RW layouts and not 15291 LAYOUTIOMODE4_READ layouts. However, a special LAYOUTIOMODE4_ANY 15292 enumeration is defined to enable recalling a layout of any iomode; in 15293 other words, the client must return both LAYOUTIOMODE4_READ and 15294 LAYOUTIOMODE4_RW layouts. 15296 A REMOVE operation SHOULD cause the metadata server to recall the 15297 layout to prevent the client from accessing a non-existent file and 15298 to reclaim state stored on the client. Since a REMOVE may be delayed 15299 until the last close of the file has occurred, the recall may also be 15300 delayed until this time. After the last reference on the file has 15301 been released and the file has been removed, the client should no 15302 longer be able to perform I/O using the layout. In the case of a 15303 file-based layout, the data server SHOULD return NFS4ERR_STALE in 15304 response to any operation on the removed file. 15306 Once a layout has been returned, the client MUST NOT send I/Os to the 15307 storage devices for the file, byte-range, and iomode represented by 15308 the returned layout. If a client does send an I/O to a storage 15309 device for which it does not hold a layout, the storage device SHOULD 15310 reject the I/O. 15312 Although pNFS does not alter the file data caching capabilities of 15313 clients, or their semantics, it recognizes that some clients may 15314 perform more aggressive write-behind caching to optimize the benefits 15315 provided by pNFS. However, write-behind caching may negatively 15316 affect the latency in returning a layout in response to a 15317 CB_LAYOUTRECALL; this is similar to file delegations and the impact 15318 that file data caching has on DELEGRETURN. 
Client implementations 15319 SHOULD limit the amount of unwritten data they have outstanding at 15320 any one time in order to prevent excessively long responses to 15321 CB_LAYOUTRECALL. Once a layout is recalled, a server MUST wait one 15322 lease period before taking further action. As soon as a lease period 15323 has passed, the server may choose to fence the client's access to the 15324 storage devices if the server perceives the client has taken too long 15325 to return a layout. However, just as in the case of data delegation 15326 and DELEGRETURN, the server may choose to wait, given that the client 15327 is showing forward progress on its way to returning the layout. This 15328 forward progress can take the form of successful interaction with the 15329 storage devices or of sub-portions of the layout being returned by 15330 the client. The server can also limit exposure to these problems by 15331 limiting the byte-ranges initially provided in the layouts and thus 15332 the amount of outstanding modified data. 15334 12.5.5.1. Layout Recall Callback Robustness 15336 It has been assumed thus far that pNFS client state (layout ranges 15337 and iomode) for a file exactly matches that of the pNFS server for 15338 that file. This assumption leads to the implication that any 15339 callback results in a LAYOUTRETURN or set of LAYOUTRETURNs that 15340 exactly match the range in the callback, since both client and server 15341 agree about the state being maintained. However, it can be useful if 15342 this assumption does not always hold. For example: 15344 * If conflicts that require callbacks are very rare, and a server 15345 can use a multi-file callback to recover per-client resources 15346 (e.g., via an FSID recall or a multi-file recall within a single 15347 CB_COMPOUND), the result may be significantly less client-server 15348 pNFS traffic. 15350 * It may be useful for servers to maintain information about what 15351 ranges are held by a client on a coarse-grained basis, leading to 15352 the server's layout ranges being beyond those actually held by the 15353 client. In the extreme, a server could manage conflicts on a per- 15354 file basis, only sending whole-file callbacks even though clients 15355 may request and be granted sub-file ranges. 15357 * It may be useful for clients to "forget" details about what 15358 layouts and ranges the client actually has, leading to the 15359 server's layout ranges being beyond those that the client "thinks" 15360 it has. As long as the client does not assume it has layouts that 15361 are beyond what the server has granted, this is a safe practice. 15362 When a client forgets what ranges and layouts it has, and it 15363 receives a CB_LAYOUTRECALL operation, the client MUST follow up 15364 with a LAYOUTRETURN for what the server recalled, or alternatively 15365 return the NFS4ERR_NOMATCHING_LAYOUT error if it has no layout to 15366 return in the recalled range. 15368 * In order to avoid errors, it is vital that a client not assign 15369 itself layout permissions beyond what the server has granted, and 15370 that the server not forget layout permissions that have been 15371 granted. On the other hand, if a server believes that a client 15372 holds a layout that the client does not know about, it is useful 15373 for the client to cleanly indicate completion of the requested 15374 recall either by sending a LAYOUTRETURN operation for the entire 15375 requested range or by returning an NFS4ERR_NOMATCHING_LAYOUT error 15376 to the CB_LAYOUTRECALL. 
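The following sketch is illustrative only and is not part of the protocol definition. It restates the client behavior described in the points above as pseudo-code for a CB_LAYOUTRECALL handler operating under the forgetful model. The recall_args structure, the helper functions declared here, and the cb_status values are hypothetical placeholders; the two status results stand in for the protocol's NFS4_OK and NFS4ERR_NOMATCHING_LAYOUT replies.

   struct recall_args;   /* recalled file, byte-range, iomode, and
                            the lor_stateid from CB_LAYOUTRECALL */

   extern int  client_holds_layout_in_range(const struct recall_args *);
   extern void flush_dirty_data(const struct recall_args *);
   extern void schedule_layoutreturns(const struct recall_args *);

   enum cb_status { CB_OK, CB_NOMATCHING_LAYOUT };

   static enum cb_status
   handle_cb_layoutrecall(const struct recall_args *rc)
   {
           /* No layout known to overlap the recalled range: reply
              NFS4ERR_NOMATCHING_LAYOUT so the server can update its
              view of the client's layout state. */
           if (!client_holds_layout_in_range(rc))
                   return CB_NOMATCHING_LAYOUT;

           /* Otherwise, flush dirty data covered by the recall and
              return the layout.  The return may be staged as several
              LAYOUTRETURN operations, but the final one must cover
              the entire recalled range. */
           flush_dirty_data(rc);
           schedule_layoutreturns(rc);
           return CB_OK;
   }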
15378 Thus, in light of the above, it is useful for a server to be able to 15379 send callbacks for layout ranges it has not granted to a client, and 15380 for a client to return ranges it does not hold. A pNFS client MUST 15381 always return layouts that comprise the full range specified by the 15382 recall. Note, the full recalled layout range need not be returned as 15383 part of a single operation, but may be returned in portions. This 15384 allows the client to stage the flushing of dirty data and commits and 15385 returns of layouts. Also, it indicates to the metadata server that 15386 the client is making progress. 15388 When a layout is returned, the client MUST NOT have any outstanding 15389 I/O requests to the storage devices involved in the layout. 15390 Rephrasing, the client MUST NOT return the layout while it has 15391 outstanding I/O requests to the storage device. 15393 Even with this requirement for the client, it is possible that I/O 15394 requests may be presented to a storage device no longer allowed to 15395 perform them. Since the server has no strict control as to when the 15396 client will return the layout, the server may later decide to 15397 unilaterally revoke the client's access to the storage devices as 15398 provided by the layout. In choosing to revoke access, the server 15399 must deal with the possibility of lingering I/O requests, i.e., I/O 15400 requests that are still in flight to storage devices identified by 15401 the revoked layout. All layout type specifications MUST define 15402 whether unilateral layout revocation by the metadata server is 15403 supported; if it is, the specification must also describe how 15404 lingering writes are processed. For example, storage devices 15405 identified by the revoked layout could be fenced off from the client 15406 that held the layout. 15408 In order to ensure client/server convergence with regard to layout 15409 state, the final LAYOUTRETURN operation in a sequence of LAYOUTRETURN 15410 operations for a particular recall MUST specify the entire range 15411 being recalled, echoing the recalled layout type, iomode, recall/ 15412 return type (FILE, FSID, or ALL), and byte-range, even if layouts 15413 pertaining to partial ranges were previously returned. In addition, 15414 if the client holds no layouts that overlap the range being recalled, 15415 the client should return the NFS4ERR_NOMATCHING_LAYOUT error code to 15416 CB_LAYOUTRECALL. This allows the server to update its view of the 15417 client's layout state. 15419 12.5.5.2. Sequencing of Layout Operations 15421 As with other stateful operations, pNFS requires the correct 15422 sequencing of layout operations. pNFS uses the "seqid" in the layout 15423 stateid to provide the correct sequencing between regular operations 15424 and callbacks. It is the server's responsibility to avoid 15425 inconsistencies regarding the layouts provided and the client's 15426 responsibility to properly serialize its layout requests and layout 15427 returns. 15429 12.5.5.2.1. Layout Recall and Return Sequencing 15431 One critical issue with regard to layout operations sequencing 15432 concerns callbacks. The protocol must defend against races between 15433 the reply to a LAYOUTGET or LAYOUTRETURN operation and a subsequent 15434 CB_LAYOUTRECALL. A client MUST NOT process a CB_LAYOUTRECALL that 15435 implies one or more outstanding LAYOUTGET or LAYOUTRETURN operations 15436 to which the client has not yet received a reply. 
The client detects 15437 such a CB_LAYOUTRECALL by examining the "seqid" field of the recall's 15438 layout stateid. If the "seqid" is not exactly one higher than what 15439 the client currently has recorded, and the client has at least one 15440 LAYOUTGET and/or LAYOUTRETURN operation outstanding, the client knows 15441 the server sent the CB_LAYOUTRECALL after sending a response to an 15442 outstanding LAYOUTGET or LAYOUTRETURN. The client MUST wait before 15443 processing such a CB_LAYOUTRECALL until it processes all replies for 15444 outstanding LAYOUTGET and LAYOUTRETURN operations for the 15445 corresponding file with seqid less than the seqid given by 15446 CB_LAYOUTRECALL (lor_stateid; see Section 20.3.) 15447 In addition to the seqid-based mechanism, Section 2.10.6.3 describes 15448 the sessions mechanism for allowing the client to detect callback 15449 race conditions and delay processing such a CB_LAYOUTRECALL. The 15450 server MAY reference conflicting operations in the CB_SEQUENCE that 15451 precedes the CB_LAYOUTRECALL. Because the server has already sent 15452 replies for these operations before sending the callback, the replies 15453 may race with the CB_LAYOUTRECALL. The client MUST wait for all the 15454 referenced calls to complete and update its view of the layout state 15455 before processing the CB_LAYOUTRECALL. 15457 12.5.5.2.1.1. Get/Return Sequencing 15459 The protocol allows the client to send concurrent LAYOUTGET and 15460 LAYOUTRETURN operations to the server. The protocol does not provide 15461 any means for the server to process the requests in the same order in 15462 which they were created. However, through the use of the "seqid" 15463 field in the layout stateid, the client can determine the order in 15464 which parallel outstanding operations were processed by the server. 15465 Thus, when a layout retrieved by an outstanding LAYOUTGET operation 15466 intersects with a layout returned by an outstanding LAYOUTRETURN on 15467 the same file, the order in which the two conflicting operations are 15468 processed determines the final state of the overlapping layout. The 15469 order is determined by the "seqid" returned in each operation: the 15470 operation with the higher seqid was executed later. 15472 It is permissible for the client to send multiple parallel LAYOUTGET 15473 operations for the same file or multiple parallel LAYOUTRETURN 15474 operations for the same file or a mix of both. 15476 It is permissible for the client to use the current stateid (see 15477 Section 16.2.3.1.2) for LAYOUTGET operations, for example, when 15478 compounding LAYOUTGETs or compounding OPEN and LAYOUTGETs. It is 15479 also permissible to use the current stateid when compounding 15480 LAYOUTRETURNs. 15482 It is permissible for the client to use the current stateid when 15483 combining LAYOUTRETURN and LAYOUTGET operations for the same file in 15484 the same COMPOUND request since the server MUST process these in 15485 order. However, if a client does send such COMPOUND requests, it 15486 MUST NOT have more than one outstanding for the same file at the same 15487 time, and it MUST NOT have other LAYOUTGET or LAYOUTRETURN operations 15488 outstanding at the same time for that same file. 15490 12.5.5.2.1.2. Client Considerations 15492 Consider a pNFS client that has sent a LAYOUTGET, and before it 15493 receives the reply to LAYOUTGET, it receives a CB_LAYOUTRECALL for 15494 the same file with an overlapping range. 
There are two 15495 possibilities, which the client can distinguish via the layout 15496 stateid in the recall. 15498 1. The server processed the LAYOUTGET before sending the recall, so 15499 the LAYOUTGET must be waited for because it may be carrying 15500 layout information that will need to be returned to deal with the 15501 CB_LAYOUTRECALL. 15503 2. The server sent the callback before receiving the LAYOUTGET. The 15504 server will not respond to the LAYOUTGET until the 15505 CB_LAYOUTRECALL is processed. 15507 If these possibilities cannot be distinguished, a deadlock could 15508 result, as the client must wait for the LAYOUTGET response before 15509 processing the recall in the first case, but that response will not 15510 arrive until after the recall is processed in the second case. Note 15511 that in the first case, the "seqid" in the layout stateid of the 15512 recall is two greater than what the client has recorded; in the 15513 second case, the "seqid" is one greater than what the client has 15514 recorded. This allows the client to disambiguate between the two 15515 cases. The client thus knows precisely which possibility applies. 15517 In case 1, the client knows it needs to wait for the LAYOUTGET 15518 response before processing the recall (or the client can return 15519 NFS4ERR_DELAY). 15521 In case 2, the client will not wait for the LAYOUTGET response before 15522 processing the recall because waiting would cause deadlock. 15523 Therefore, the action at the client will only require waiting in the 15524 case that the client has not yet seen the server's earlier responses 15525 to the LAYOUTGET operation(s). 15527 The recall process can be considered completed when the final 15528 LAYOUTRETURN operation for the recalled range is completed. The 15529 LAYOUTRETURN uses the layout stateid (with seqid) specified in 15530 CB_LAYOUTRECALL. If the client uses multiple LAYOUTRETURNs in 15531 processing the recall, the first LAYOUTRETURN will use the layout 15532 stateid as specified in CB_LAYOUTRECALL. Subsequent LAYOUTRETURNs 15533 will use the highest seqid as is the usual case. 15535 12.5.5.2.1.3. Server Considerations 15537 Consider a race from the metadata server's point of view. The 15538 metadata server has sent a CB_LAYOUTRECALL and receives an 15539 overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s) 15540 that respond to the CB_LAYOUTRECALL. There are three cases: 15542 1. The client sent the LAYOUTGET before processing the 15543 CB_LAYOUTRECALL. The "seqid" in the layout stateid of the 15544 arguments of LAYOUTGET is one less than the "seqid" in 15545 CB_LAYOUTRECALL. The server returns NFS4ERR_RECALLCONFLICT to 15546 the client, which indicates to the client that there is a pending 15547 recall. 15549 2. The client sent the LAYOUTGET after processing the 15550 CB_LAYOUTRECALL, but the LAYOUTGET arrived before the 15551 LAYOUTRETURN and the response to CB_LAYOUTRECALL that completed 15552 that processing. The "seqid" in the layout stateid of LAYOUTGET 15553 is equal to or greater than that of the "seqid" in 15554 CB_LAYOUTRECALL. The server has not received a response to the 15555 CB_LAYOUTRECALL, so it returns NFS4ERR_RECALLCONFLICT. 15557 3. The client sent the LAYOUTGET after processing the 15558 CB_LAYOUTRECALL; the server received the CB_LAYOUTRECALL 15559 response, but the LAYOUTGET arrived before the LAYOUTRETURN that 15560 completed that processing. 
The "seqid" in the layout stateid of 15561 LAYOUTGET is equal to that of the "seqid" in CB_LAYOUTRECALL. 15562 The server has received a response to the CB_LAYOUTRECALL, so it 15563 returns NFS4ERR_RETURNCONFLICT. 15565 12.5.5.2.1.4. Wraparound and Validation of Seqid 15567 The rules for layout stateid processing differ from other stateids in 15568 the protocol because the "seqid" value cannot be zero and the 15569 stateid's "seqid" value changes in a CB_LAYOUTRECALL operation. The 15570 non-zero requirement combined with the inherent parallelism of layout 15571 operations means that a set of LAYOUTGET and LAYOUTRETURN operations 15572 may contain the same value for "seqid". The server uses a slightly 15573 modified version of the modulo arithmetic as described in 15574 Section 2.10.6.1 when incrementing the layout stateid's "seqid". The 15575 difference is that zero is not a valid value for "seqid"; when the 15576 value of a "seqid" is 0xFFFFFFFF, the next valid value will be 15577 0x00000001. The modulo arithmetic is also used for the comparisons 15578 of "seqid" values in the processing of CB_LAYOUTRECALL events as 15579 described above in Section 12.5.5.2.1.3. 15581 Just as the server validates the "seqid" in the event of 15582 CB_LAYOUTRECALL usage, as described in Section 12.5.5.2.1.3, the 15583 server also validates the "seqid" value to ensure that it is within 15584 an appropriate range. This range represents the degree of 15585 parallelism the server supports for layout stateids. If the client 15586 is sending multiple layout operations to the server in parallel, by 15587 definition, the "seqid" value in the supplied stateid will not be the 15588 current "seqid" as held by the server. The range of parallelism 15589 spans from the highest or current "seqid" to a "seqid" value in the 15590 past. To assist in the discussion, the server's current "seqid" 15591 value for a layout stateid is defined as SERVER_CURRENT_SEQID. The 15592 lowest "seqid" value that is acceptable to the server is represented 15593 by PAST_SEQID. And the value for the range of valid "seqid"s or 15594 range of parallelism is VALID_SEQID_RANGE. Therefore, the following 15595 holds: VALID_SEQID_RANGE = SERVER_CURRENT_SEQID - PAST_SEQID. In the 15596 following, all arithmetic is the modulo arithmetic as described 15597 above. 15599 The server MUST support a minimum VALID_SEQID_RANGE. The minimum is 15600 defined as: VALID_SEQID_RANGE = summation over 1..N of 15601 (ca_maxoperations(i) - 1), where N is the number of session fore 15602 channels and ca_maxoperations(i) is the value of the ca_maxoperations 15603 returned from CREATE_SESSION of the i'th session. The reason for "- 15604 1" is to allow for the required SEQUENCE operation. The server MAY 15605 support a VALID_SEQID_RANGE value larger than the minimum. The 15606 maximum VALID_SEQID_RANGE is (2^32 - 2) (accounting for zero not 15607 being a valid "seqid" value). 15609 If the server finds the "seqid" is zero, the NFS4ERR_BAD_STATEID 15610 error is returned to the client. The server further validates the 15611 "seqid" to ensure it is within the range of parallelism, 15612 VALID_SEQID_RANGE. If the "seqid" value is outside of that range, 15613 the error NFS4ERR_OLD_STATEID is returned to the client. Upon 15614 receipt of NFS4ERR_OLD_STATEID, the client updates the stateid in the 15615 layout request based on processing of other layout requests and re- 15616 sends the operation to the server. 15618 12.5.5.2.1.5. 
Bulk Recall and Return 15620 pNFS supports recalling and returning all layouts that are for files 15621 belonging to a particular fsid (LAYOUTRECALL4_FSID, 15622 LAYOUTRETURN4_FSID) or client ID (LAYOUTRECALL4_ALL, 15623 LAYOUTRETURN4_ALL). There are no "bulk" stateids, so detection of 15624 races via the seqid is not possible. The server MUST NOT initiate 15625 bulk recall while another recall is in progress, or the corresponding 15626 LAYOUTRETURN is in progress or pending. In the event the server 15627 sends a bulk recall while the client has a pending or in-progress 15628 LAYOUTRETURN, CB_LAYOUTRECALL, or LAYOUTGET, the client returns 15629 NFS4ERR_DELAY. In the event the client sends a LAYOUTGET or 15630 LAYOUTRETURN while a bulk recall is in progress, the server returns 15631 NFS4ERR_RECALLCONFLICT. If the client sends a LAYOUTGET or 15632 LAYOUTRETURN after the server receives NFS4ERR_DELAY from a bulk 15633 recall, then to ensure forward progress, the server MAY return 15634 NFS4ERR_RECALLCONFLICT. 15636 Once a CB_LAYOUTRECALL of LAYOUTRECALL4_ALL is sent, the server MUST 15637 NOT allow the client to use any layout stateid except for 15638 LAYOUTCOMMIT operations. Once the client receives a CB_LAYOUTRECALL 15639 of LAYOUTRECALL4_ALL, it MUST NOT use any layout stateid except for 15640 LAYOUTCOMMIT operations. Once a LAYOUTRETURN of LAYOUTRETURN4_ALL is 15641 sent, all layout stateids granted to the client ID are freed. The 15642 client MUST NOT use the layout stateids again. It MUST use LAYOUTGET 15643 to obtain new layout stateids. 15645 Once a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID is sent, the server MUST 15646 NOT allow the client to use any layout stateid that refers to a file 15647 with the specified fsid except for LAYOUTCOMMIT operations. Once the 15648 client receives a CB_LAYOUTRECALL of LAYOUTRECALL4_ALL, it MUST NOT 15649 use any layout stateid that refers to a file with the specified fsid 15650 except for LAYOUTCOMMIT operations. Once a LAYOUTRETURN of 15651 LAYOUTRETURN4_FSID is sent, all layout stateids granted to the 15652 referenced fsid are freed. The client MUST NOT use those freed 15653 layout stateids for files with the referenced fsid again. 15654 Subsequently, for any file with the referenced fsid, to use a layout, 15655 the client MUST first send a LAYOUTGET operation in order to obtain a 15656 new layout stateid for that file. 15658 If the server has sent a bulk CB_LAYOUTRECALL and receives a 15659 LAYOUTGET, or a LAYOUTRETURN with a stateid, the server MUST return 15660 NFS4ERR_RECALLCONFLICT. If the server has sent a bulk 15661 CB_LAYOUTRECALL and receives a LAYOUTRETURN with an lr_returntype 15662 that is not equal to the lor_recalltype of the CB_LAYOUTRECALL, the 15663 server MUST return NFS4ERR_RECALLCONFLICT. 15665 12.5.6. Revoking Layouts 15667 Parallel NFS permits servers to revoke layouts from clients that fail 15668 to respond to recalls and/or fail to renew their lease in time. 15669 Depending on the layout type, the server might revoke the layout and 15670 might take certain actions with respect to the client's I/O to data 15671 servers. 15673 12.5.7. Metadata Server Write Propagation 15675 Asynchronous writes written through the metadata server may be 15676 propagated lazily to the storage devices. For data written 15677 asynchronously through the metadata server, a client performing a 15678 read at the appropriate storage device is not guaranteed to see the 15679 newly written data until a COMMIT occurs at the metadata server. 
15680 While the write is pending, reads to the storage device may give out 15681 either the old data, the new data, or a mixture of new and old. Upon 15682 completion of a synchronous WRITE or COMMIT (for asynchronously 15683 written data), the metadata server MUST ensure that storage devices 15684 give out the new data and that the data has been written to stable 15685 storage. If the server implements its storage in any way such that 15686 it cannot obey these constraints, then it MUST recall the layouts to 15687 prevent reads being done that cannot be handled correctly. Note that 15688 the layouts MUST be recalled prior to the server responding to the 15689 associated WRITE operations. 15691 12.6. pNFS Mechanics 15693 This section describes the operations flow taken by a pNFS client to 15694 a metadata server and storage device. 15696 When a pNFS client encounters a new FSID, it sends a GETATTR to the 15697 NFSv4.1 server for the fs_layout_type (Section 5.12.1) attribute. If 15698 the attribute returns at least one layout type, and the layout types 15699 returned are among the set supported by the client, the client knows 15700 that pNFS is a possibility for the file system. If, from the server 15701 that returned the new FSID, the client does not have a client ID that 15702 came from an EXCHANGE_ID result that returned 15703 EXCHGID4_FLAG_USE_PNFS_MDS, it MUST send an EXCHANGE_ID to the server 15704 with the EXCHGID4_FLAG_USE_PNFS_MDS bit set. If the server's 15705 response does not have EXCHGID4_FLAG_USE_PNFS_MDS, then contrary to 15706 what the fs_layout_type attribute said, the server does not support 15707 pNFS, and the client will not be able use pNFS to that server; in 15708 this case, the server MUST return NFS4ERR_NOTSUPP in response to any 15709 pNFS operation. 15711 The client then creates a session, requesting a persistent session, 15712 so that exclusive creates can be done with single round trip via the 15713 createmode4 of GUARDED4. If the session ends up not being 15714 persistent, the client will use EXCLUSIVE4_1 for exclusive creates. 15716 If a file is to be created on a pNFS-enabled file system, the client 15717 uses the OPEN operation. With the normal set of attributes that may 15718 be provided upon OPEN used for creation, there is an OPTIONAL 15719 layout_hint attribute. The client's use of layout_hint allows the 15720 client to express its preference for a layout type and its associated 15721 layout details. The use of a createmode4 of UNCHECKED4, GUARDED4, or 15722 EXCLUSIVE4_1 will allow the client to provide the layout_hint 15723 attribute at create time. The client MUST NOT use EXCLUSIVE4 (see 15724 Table 18). The client is RECOMMENDED to combine a GETATTR operation 15725 after the OPEN within the same COMPOUND. The GETATTR may then 15726 retrieve the layout_type attribute for the newly created file. The 15727 client will then know what layout type the server has chosen for the 15728 file and therefore what storage protocol the client must use. 15730 If the client wants to open an existing file, then it also includes a 15731 GETATTR to determine what layout type the file supports. 15733 The GETATTR in either the file creation or plain file open case can 15734 also include the layout_blksize and layout_alignment attributes so 15735 that the client can determine optimal offsets and lengths for I/O on 15736 the file. 
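The following sketch is illustrative only and is not part of the protocol definition. It summarizes, as client pseudo-code, the steps described so far in this section: probing fs_layout_type for a new FSID, confirming EXCHGID4_FLAG_USE_PNFS_MDS, creating a session, and performing an OPEN that carries the layout_hint attribute together with a GETATTR of the layout attributes. The pnfs_client structure and all helper functions declared here are hypothetical placeholders for the corresponding COMPOUND requests.

   struct pnfs_client;

   /* Hypothetical wrappers for the COMPOUND requests involved. */
   extern int fs_layout_type_usable(struct pnfs_client *);
   extern int clientid_has_pnfs_mds(struct pnfs_client *);
   extern int exchange_id_pnfs_mds(struct pnfs_client *);
   extern int create_persistent_session(struct pnfs_client *);
   extern int open_create_with_layout_hint(struct pnfs_client *,
                                           const char *name,
                                           int createmode);

   /* createmode4 values, in their defined order. */
   enum { UNCHECKED4, GUARDED4, EXCLUSIVE4, EXCLUSIVE4_1 };

   static int
   pnfs_probe_and_create(struct pnfs_client *clp, const char *name)
   {
           /* New FSID: GETATTR of fs_layout_type; pNFS is possible
              only if a returned layout type is one the client
              supports. */
           if (!fs_layout_type_usable(clp))
                   return -1;          /* use normal NFSv4.1 access */

           /* The client ID must come from an EXCHANGE_ID that
              returned EXCHGID4_FLAG_USE_PNFS_MDS; if not, send
              EXCHANGE_ID with that flag and check the reply. */
           if (!clientid_has_pnfs_mds(clp) &&
               !exchange_id_pnfs_mds(clp))
                   return -1;          /* server is not a pNFS MDS */

           /* Create a session, requesting persistence; persistence
              determines the createmode for exclusive creates. */
           int createmode = create_persistent_session(clp) ?
                                GUARDED4 : EXCLUSIVE4_1;

           /* OPEN (create) carrying layout_hint, combined in the
              same COMPOUND with a GETATTR of layout_type,
              layout_blksize, and layout_alignment.  EXCLUSIVE4 is
              not used, as it cannot carry attributes (Table 18). */
           return open_create_with_layout_hint(clp, name, createmode);
   }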
15738 Assuming the client supports the layout type returned by GETATTR and 15739 it chooses to use pNFS for data access, it then sends LAYOUTGET using 15740 the filehandle and stateid returned by OPEN, specifying the range it 15741 wants to do I/O on. The response is a layout, which may be a subset 15742 of the range for which the client asked. It also includes device IDs 15743 and a description of how data is organized (or in the case of 15744 writing, how data is to be organized) across the devices. The device 15745 IDs and data description are encoded in a format that is specific to 15746 the layout type, but the client is expected to understand. 15748 When the client wants to send an I/O, it determines to which device 15749 ID it needs to send the I/O command by examining the data description 15750 in the layout. It then sends a GETDEVICEINFO to find the device 15751 address(es) of the device ID. The client then sends the I/O request 15752 to one of device ID's device addresses, using the storage protocol 15753 defined for the layout type. Note that if a client has multiple I/Os 15754 to send, these I/O requests may be done in parallel. 15756 If the I/O was a WRITE, then at some point the client may want to use 15757 LAYOUTCOMMIT to commit the modification time and the new size of the 15758 file (if it believes it extended the file size) to the metadata 15759 server and the modified data to the file system. 15761 12.7. Recovery 15763 Recovery is complicated by the distributed nature of the pNFS 15764 protocol. In general, crash recovery for layouts is similar to crash 15765 recovery for delegations in the base NFSv4.1 protocol. However, the 15766 client's ability to perform I/O without contacting the metadata 15767 server introduces subtleties that must be handled correctly if the 15768 possibility of file system corruption is to be avoided. 15770 12.7.1. Recovery from Client Restart 15772 Client recovery for layouts is similar to client recovery for other 15773 lock and delegation state. When a pNFS client restarts, it will lose 15774 all information about the layouts that it previously owned. There 15775 are two methods by which the server can reclaim these resources and 15776 allow otherwise conflicting layouts to be provided to other clients. 15778 The first is through the expiry of the client's lease. If the client 15779 recovery time is longer than the lease period, the client's lease 15780 will expire and the server will know that state may be released. For 15781 layouts, the server may release the state immediately upon lease 15782 expiry or it may allow the layout to persist, awaiting possible lease 15783 revival, as long as no other layout conflicts. 15785 The second is through the client restarting in less time than it 15786 takes for the lease period to expire. In such a case, the client 15787 will contact the server through the standard EXCHANGE_ID protocol. 15788 The server will find that the client's co_ownerid matches the 15789 co_ownerid of the previous client invocation, but that the verifier 15790 is different. The server uses this as a signal to release all layout 15791 state associated with the client's previous invocation. In this 15792 scenario, the data written by the client but not covered by a 15793 successful LAYOUTCOMMIT is in an undefined state; it may have been 15794 written or it may now be lost. This is acceptable behavior and it is 15795 the client's responsibility to use LAYOUTCOMMIT to achieve the 15796 desired level of stability. 15798 12.7.2. 
Dealing with Lease Expiration on the Client 15800 If a client believes its lease has expired, it MUST NOT send I/O to 15801 the storage device until it has validated its lease. The client can 15802 send a SEQUENCE operation to the metadata server. If the SEQUENCE 15803 operation is successful, but sr_status_flags has 15804 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, 15805 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or 15806 SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client MUST NOT use 15807 currently held layouts. The client has two choices to recover from 15808 the lease expiration. First, for all modified but uncommitted data, 15809 the client writes it to the metadata server using the FILE_SYNC4 flag 15810 for the WRITEs, or WRITE and COMMIT. Second, the client re- 15811 establishes a client ID and session with the server and obtains new 15812 layouts and device-ID-to-device-address mappings for the modified 15813 data ranges and then writes the data to the storage devices with the 15814 newly obtained layouts. 15816 If sr_status_flags from the metadata server has 15817 SEQ4_STATUS_RESTART_RECLAIM_NEEDED set (or SEQUENCE returns 15818 NFS4ERR_BAD_SESSION and CREATE_SESSION returns 15819 NFS4ERR_STALE_CLIENTID), then the metadata server has restarted, and 15820 the client SHOULD recover using the methods described in 15821 Section 12.7.4. 15823 If sr_status_flags from the metadata server has 15824 SEQ4_STATUS_LEASE_MOVED set, then the client recovers by following 15825 the procedure described in Section 11.11.9.2. After that, the client 15826 may get an indication that the layout state was not moved with the 15827 file system. The client recovers as in the other applicable 15828 situations discussed in the first two paragraphs of this section. 15830 If sr_status_flags reports no loss of state, then the lease for the 15831 layouts that the client has is valid and renewed, and the client can 15832 once again send I/O requests to the storage devices. 15834 While clients SHOULD NOT send I/Os to storage devices that may extend 15835 past the lease expiration time period, this is not always possible; 15836 consider, for example, an extended network partition that starts after the I/O 15837 is sent and does not heal until the I/O request is received by the 15838 storage device. Thus, the metadata server and/or storage devices are 15839 responsible for protecting themselves from I/Os that are both sent 15840 before the lease expires and arrive after the lease expires. See 15841 Section 12.7.3. 15843 12.7.3. Dealing with Loss of Layout State on the Metadata Server 15845 This is a description of the case where all of the following are 15846 true: 15848 * the metadata server has not restarted 15850 * a pNFS client's layouts have been discarded (usually because the 15851 client's lease expired) and are invalid 15853 * an I/O from the pNFS client arrives at the storage device 15855 The metadata server and its storage devices MUST solve this by 15856 fencing the client. In other words, they MUST solve this by 15857 preventing the execution of I/O operations from the client to the 15858 storage devices after layout state loss. The details of how fencing 15859 is done are specific to the layout type. The solution for NFSv4.1 15860 file-based layouts is described in Section 13.11, and solutions for 15861 other layout types are in their respective external specification 15862 documents. 15864 12.7.4.
Recovery from Metadata Server Restart 15866 The pNFS client will discover that the metadata server has restarted 15867 via the methods described in Section 8.4.2 and discussed in a pNFS- 15868 specific context in Section 12.7.2, Paragraph 2. The client MUST 15869 stop using layouts and delete the device ID to device address 15870 mappings it previously received from the metadata server. Having 15871 done that, if the client wrote data to the storage device without 15872 committing the layouts via LAYOUTCOMMIT, then the client has 15873 additional work to do in order to have the client, metadata server, 15874 and storage device(s) all synchronized on the state of the data. 15876 * If the client has data still modified and unwritten in the 15877 client's memory, the client has only two choices. 15879 1. The client can obtain a layout via LAYOUTGET after the 15880 server's grace period and write the data to the storage 15881 devices. 15883 2. The client can WRITE that data through the metadata server 15884 using the WRITE (Section 18.32) operation, and then obtain 15885 layouts as desired. 15887 * If the client asynchronously wrote data to the storage device, but 15888 still has a copy of the data in its memory, then it has available 15889 to it the recovery options listed above in the previous bullet 15890 point. If the metadata server is also in its grace period, the 15891 client has available to it the options below in the next bullet 15892 point. 15894 * The client does not have a copy of the data in its memory and the 15895 metadata server is still in its grace period. The client cannot 15896 use LAYOUTGET (within or outside the grace period) to reclaim a 15897 layout because the contents of the response from LAYOUTGET may not 15898 match what it had previously. The range might be different or the 15899 client might get the same range but the content of the layout 15900 might be different. Even if the content of the layout appears to 15901 be the same, the device IDs may map to different device addresses, 15902 and even if the device addresses are the same, the device 15903 addresses could have been assigned to a different storage device. 15904 The option of retrieving the data from the storage device and 15905 writing it to the metadata server per the recovery scenario 15906 described above is not available because, again, the mappings of 15907 range to device ID, device ID to device address, and device 15908 address to physical device are stale, and new mappings via new 15909 LAYOUTGET do not solve the problem. 15911 The only recovery option for this scenario is to send a 15912 LAYOUTCOMMIT in reclaim mode, which the metadata server will 15913 accept as long as it is in its grace period. The use of 15914 LAYOUTCOMMIT in reclaim mode informs the metadata server that the 15915 layout has changed. It is critical that the metadata server 15916 receive this information before its grace period ends, and thus 15917 before it starts allowing updates to the file system. 15919 To send LAYOUTCOMMIT in reclaim mode, the client sets the 15920 loca_reclaim field of the operation's arguments (Section 18.42.1) 15921 to TRUE. During the metadata server's recovery grace period (and 15922 only during the recovery grace period) the metadata server is 15923 prepared to accept LAYOUTCOMMIT requests with the loca_reclaim 15924 field set to TRUE. 15926 When loca_reclaim is TRUE, the client is attempting to commit 15927 changes to the layout that occurred prior to the restart of the 15928 metadata server. 
The metadata server applies some consistency 15929 checks on the loca_layoutupdate field of the arguments to 15930 determine whether the client can commit the data written to the 15931 storage device to the file system. The loca_layoutupdate field is 15932 of data type layoutupdate4 and contains layout-type-specific 15933 content (in the lou_body field of loca_layoutupdate). The layout- 15934 type-specific information that loca_layoutupdate might have is 15935 discussed in Section 12.5.4.3. If the metadata server's 15936 consistency checks on loca_layoutupdate succeed, then the metadata 15937 server MUST commit the data (as described by the loca_offset, 15938 loca_length, and loca_layoutupdate fields of the arguments) that 15939 was written to the storage device. If the metadata server's 15940 consistency checks on loca_layoutupdate fail, the metadata server 15941 rejects the LAYOUTCOMMIT operation and makes no changes to the 15942 file system. However, any time LAYOUTCOMMIT with loca_reclaim 15943 TRUE fails, the pNFS client has lost all the data in the range 15944 defined by <loca_offset, loca_length>. A client can defend 15945 against this risk by caching all data, whether written 15946 synchronously or asynchronously in its memory, and by not 15947 releasing the cached data until a successful LAYOUTCOMMIT. This 15948 condition does not hold true for all layout types; for example, 15949 file-based storage devices need not suffer from this limitation. 15951 * The client does not have a copy of the data in its memory and the 15952 metadata server is no longer in its grace period; i.e., the 15953 metadata server returns NFS4ERR_NO_GRACE. As with the scenario in 15954 the above bullet point, the failure of LAYOUTCOMMIT means the data 15955 in the range <loca_offset, loca_length> is lost. The defense against 15956 the risk is the same -- cache all written data on the client until 15957 a successful LAYOUTCOMMIT. 15959 12.7.5. Operations during Metadata Server Grace Period 15961 Some of the recovery scenarios thus far noted that some operations 15962 (namely, WRITE and LAYOUTGET) might be permitted during the metadata 15963 server's grace period. The metadata server may allow these 15964 operations during its grace period. For LAYOUTGET, the metadata 15965 server must reliably determine that servicing such a request will not 15966 conflict with an impending LAYOUTCOMMIT reclaim request. For WRITE, 15967 the metadata server must reliably determine that servicing the 15968 request will not conflict with an impending OPEN or with a LOCK where 15969 the file has mandatory byte-range locking enabled. 15971 As mentioned previously, for expediency, the metadata server might 15972 reject some operations (namely, WRITE and LAYOUTGET) during its grace 15973 period, because the simplest correct approach is to reject all non- 15974 reclaim pNFS requests and WRITE operations by returning the 15975 NFS4ERR_GRACE error. However, depending on the storage protocol 15976 (which is specific to the layout type) and metadata server 15977 implementation, the metadata server may be able to determine that a 15978 particular request is safe. For example, a metadata server may save 15979 provisional allocation mappings for each file to stable storage, as 15980 well as information about potentially conflicting OPEN share modes 15981 and mandatory byte-range locks that might have been in effect at the 15982 time of restart, and the metadata server may use this information 15983 during the recovery grace period to determine that a WRITE request is 15984 safe. 15986 12.7.6.
Storage Device Recovery 15988 Recovery from storage device restart is mostly dependent upon the 15989 layout type in use. However, there are a few general techniques a 15990 client can use if it discovers a storage device has crashed while 15991 holding modified, uncommitted data that was asynchronously written. 15992 First and foremost, it is important to realize that the client is the 15993 only one that has the information necessary to recover non-committed 15994 data since it holds the modified data and probably nothing else does. 15995 Second, the best solution is for the client to err on the side of 15996 caution and attempt to rewrite the modified data through another 15997 path. 15999 The client SHOULD immediately WRITE the data to the metadata server, 16000 with the stable field in the WRITE4args set to FILE_SYNC4. Once it 16001 does this, there is no need to wait for the original storage device. 16003 12.8. Metadata and Storage Device Roles 16005 If the same physical hardware is used to implement both a metadata 16006 server and storage device, then the same hardware entity is to be 16007 understood to be implementing two distinct roles and it is important 16008 that it be clearly understood on behalf of which role the hardware is 16009 executing at any given time. 16011 Two sub-cases can be distinguished. 16013 1. The storage device uses NFSv4.1 as the storage protocol, i.e., 16014 the same physical hardware is used to implement both a metadata 16015 and data server. See Section 13.1 for a description of how 16016 multiple roles are handled. 16018 2. The storage device does not use NFSv4.1 as the storage protocol, 16019 and the same physical hardware is used to implement both a 16020 metadata and storage device. Whether distinct network addresses 16021 are used to access the metadata server and storage device is 16022 immaterial. This is because it is always clear to the pNFS 16023 client and server, from the upper-layer protocol being used 16024 (NFSv4.1 or non-NFSv4.1), to which role the request to the common 16025 server network address is directed. 16027 12.9. Security Considerations for pNFS 16029 pNFS separates file system metadata and data and provides access to 16030 both. There are pNFS-specific operations (listed in Section 12.3) 16031 that provide access to the metadata; all existing NFSv4.1 16032 conventional (non-pNFS) security mechanisms and features apply to 16033 accessing the metadata. The combination of components in a pNFS 16034 system (see Figure 1) is required to preserve the security properties 16035 of NFSv4.1 with respect to an entity that is accessing a storage 16036 device from a client, including security countermeasures to defend 16037 against threats for which NFSv4.1 provides defenses in environments 16038 where these threats are considered significant. 16040 In some cases, the security countermeasures for connections to 16041 storage devices may take the form of physical isolation or a 16042 recommendation to avoid the use of pNFS in an environment. For 16043 example, it may be impractical to provide confidentiality protection 16044 for some storage protocols to protect against eavesdropping. 
In 16045 environments where eavesdropping on such protocols is of sufficient 16046 concern to require countermeasures, physical isolation of the 16047 communication channel (e.g., via direct connection from client(s) to 16048 storage device(s)) and/or a decision to forgo use of pNFS (e.g., and 16049 fall back to conventional NFSv4.1) may be appropriate courses of 16050 action. 16052 Where communication with storage devices is subject to the same 16053 threats as client-to-metadata server communication, the protocols 16054 used for that communication need to provide security mechanisms as 16055 strong as or no weaker than those available via RPCSEC_GSS for 16056 NFSv4.1. Except for the storage protocol used for the 16057 LAYOUT4_NFSV4_1_FILES layout (see Section 13), i.e., except for 16058 NFSv4.1, it is beyond the scope of this document to specify the 16059 security mechanisms for storage access protocols. 16061 pNFS implementations MUST NOT remove NFSv4.1's access controls. The 16062 combination of clients, storage devices, and the metadata server are 16063 responsible for ensuring that all client-to-storage-device file data 16064 access respects NFSv4.1's ACLs and file open modes. This entails 16065 performing both of these checks on every access in the client, the 16066 storage device, or both (as applicable; when the storage device is an 16067 NFSv4.1 server, the storage device is ultimately responsible for 16068 controlling access as described in Section 13.9.2). If a pNFS 16069 configuration performs these checks only in the client, the risk of a 16070 misbehaving client obtaining unauthorized access is an important 16071 consideration in determining when it is appropriate to use such a 16072 pNFS configuration. Such layout types SHOULD NOT be used when 16073 client-only access checks do not provide sufficient assurance that 16074 NFSv4.1 access control is being applied correctly. (This is not a 16075 problem for the file layout type described in Section 13 because the 16076 storage access protocol for LAYOUT4_NFSV4_1_FILES is NFSv4.1, and 16077 thus the security model for storage device access via 16078 LAYOUT4_NFSv4_1_FILES is the same as that of the metadata server.) 16079 For handling of access control specific to a layout, the reader 16080 should examine the layout specification, such as the NFSv4.1/ 16081 file-based layout (Section 13) of this document, the blocks layout 16082 [48], and objects layout [47]. 16084 13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type 16086 This section describes the semantics and format of NFSv4.1 file-based 16087 layouts for pNFS. NFSv4.1 file-based layouts use the 16088 LAYOUT4_NFSV4_1_FILES layout type. The LAYOUT4_NFSV4_1_FILES type 16089 defines striping data across multiple NFSv4.1 data servers. 16091 13.1. Client ID and Session Considerations 16093 Sessions are a REQUIRED feature of NFSv4.1, and this extends to both 16094 the metadata server and file-based (NFSv4.1-based) data servers. 16096 The role a server plays in pNFS is determined by the result it 16097 returns from EXCHANGE_ID. The roles are: 16099 * Metadata server (EXCHGID4_FLAG_USE_PNFS_MDS is set in the result 16100 eir_flags). 16102 * Data server (EXCHGID4_FLAG_USE_PNFS_DS). 16104 * Non-metadata server (EXCHGID4_FLAG_USE_NON_PNFS). This is an 16105 NFSv4.1 server that does not support operations (e.g., LAYOUTGET) 16106 or attributes that pertain to pNFS. 
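As a non-normative illustration, a client can decode the role bits in the returned eir_flags as follows; the helper function is illustrative only, and the flag values are those assigned by the NFSv4.1 XDR.

   #include <stdint.h>
   #include <stdio.h>

   /* Role flags as defined in the NFSv4.1 XDR. */
   #define EXCHGID4_FLAG_USE_NON_PNFS 0x00010000
   #define EXCHGID4_FLAG_USE_PNFS_MDS 0x00020000
   #define EXCHGID4_FLAG_USE_PNFS_DS  0x00040000

   /* Illustrative only: report which roles an EXCHANGE_ID result grants. */
   static void print_roles(uint32_t eir_flags)
   {
       if (eir_flags & EXCHGID4_FLAG_USE_PNFS_MDS)
           printf("metadata server role\n");
       if (eir_flags & EXCHGID4_FLAG_USE_PNFS_DS)
           printf("data server role\n");
       if (eir_flags & EXCHGID4_FLAG_USE_NON_PNFS)
           printf("non-metadata server (no pNFS operations or attributes)\n");
   }

   int main(void)
   {
       /* A server acting as both metadata server and data server. */
       print_roles(EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS);
       return 0;
   }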
16108 The client MAY request zero or more of EXCHGID4_FLAG_USE_NON_PNFS, 16109 EXCHGID4_FLAG_USE_PNFS_DS, or EXCHGID4_FLAG_USE_PNFS_MDS, even though 16110 some combinations (e.g., EXCHGID4_FLAG_USE_NON_PNFS | 16111 EXCHGID4_FLAG_USE_PNFS_MDS) are contradictory. However, the server 16112 MUST only return the following acceptable combinations: 16114 +========================================================+ 16115 | Acceptable Results from EXCHANGE_ID | 16116 +========================================================+ 16117 | EXCHGID4_FLAG_USE_PNFS_MDS | 16118 +--------------------------------------------------------+ 16119 | EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS | 16120 +--------------------------------------------------------+ 16121 | EXCHGID4_FLAG_USE_PNFS_DS | 16122 +--------------------------------------------------------+ 16123 | EXCHGID4_FLAG_USE_NON_PNFS | 16124 +--------------------------------------------------------+ 16125 | EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS | 16126 +--------------------------------------------------------+ 16128 Table 8 16130 As the above table implies, a server can have one or two roles. A 16131 server can be both a metadata server and a data server, or it can be 16132 both a data server and non-metadata server. In addition to returning 16133 two roles in the EXCHANGE_ID's results, and thus serving both roles 16134 via a common client ID, a server can serve two roles by returning a 16135 unique client ID and server owner for each role in each of two 16136 EXCHANGE_ID results, with each result indicating each role. 16138 In the case of a server with concurrent pNFS roles that are served by 16139 a common client ID, if the EXCHANGE_ID request from the client has 16140 zero or a combination of the bits set in eia_flags, the server result 16141 should set bits that represent the higher of the acceptable 16142 combination of the server roles, with a preference to match the roles 16143 requested by the client. Thus, if a client request has 16144 (EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_USE_PNFS_MDS | 16145 EXCHGID4_FLAG_USE_PNFS_DS) flags set, and the server is both a 16146 metadata server and a data server, serving both the roles by a common 16147 client ID, the server SHOULD return with 16148 (EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS) set. 16150 In the case of a server that has multiple concurrent pNFS roles, each 16151 role served by a unique client ID, if the client specifies zero or a 16152 combination of roles in the request, the server results SHOULD return 16153 only one of the roles from the combination specified by the client 16154 request. If the role specified by the server result does not match 16155 the intended use by the client, the client should send the 16156 EXCHANGE_ID specifying just the interested pNFS role. 16158 If a pNFS metadata client gets a layout that refers it to an NFSv4.1 16159 data server, it needs a client ID on that data server. If it does 16160 not yet have a client ID from the server that had the 16161 EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID results, then 16162 the client needs to send an EXCHANGE_ID to the data server, using the 16163 same co_ownerid as it sent to the metadata server, with the 16164 EXCHGID4_FLAG_USE_PNFS_DS flag set in the arguments. If the server's 16165 EXCHANGE_ID results have EXCHGID4_FLAG_USE_PNFS_DS set, then the 16166 client may use the client ID to create sessions that will exchange 16167 pNFS data operations. 
The client ID returned by the data server has 16168 no relationship with the client ID returned by a metadata server 16169 unless the client IDs are equal, and the server owners and server 16170 scopes of the data server and metadata server are equal. 16172 In NFSv4.1, the session ID in the SEQUENCE operation implies the 16173 client ID, which in turn might be used by the server to map the 16174 stateid to the right client/server pair. However, when a data server 16175 is presented with a READ or WRITE operation with a stateid, because 16176 the stateid is associated with a client ID on a metadata server, and 16177 because the session ID in the preceding SEQUENCE operation is tied to 16178 the client ID of the data server, the data server has no obvious way 16179 to determine the metadata server from the COMPOUND procedure, and 16180 thus has no way to validate the stateid. One RECOMMENDED approach is 16181 for pNFS servers to encode metadata server routing and/or identity 16182 information in the data server filehandles as returned in the layout. 16184 If metadata server routing and/or identity information is encoded in 16185 data server filehandles, when the metadata server identity or 16186 location changes, the data server filehandles it gave out will become 16187 invalid (stale), and so the metadata server MUST first recall the 16188 layouts. Invalidating a data server filehandle does not render the 16189 NFS client's data cache invalid. The client's cache should map a 16190 data server filehandle to a metadata server filehandle, and a 16191 metadata server filehandle to cached data. 16193 If a server is both a metadata server and a data server, the server 16194 might need to distinguish operations on files that are directed to 16195 the metadata server from those that are directed to the data server. 16196 It is RECOMMENDED that the values of the filehandles returned by the 16197 LAYOUTGET operation be different than the value of the filehandle 16198 returned by the OPEN of the same file. 16200 Another scenario is for the metadata server and the storage device to 16201 be distinct from one client's point of view, and the roles reversed 16202 from another client's point of view. For example, in the cluster 16203 file system model, a metadata server to one client might be a data 16204 server to another client. If NFSv4.1 is being used as the storage 16205 protocol, then pNFS servers need to encode the values of filehandles 16206 according to their specific roles. 16208 13.1.1. Sessions Considerations for Data Servers 16210 Section 2.10.11.2 states that a client has to keep its lease renewed 16211 in order to prevent a session from being deleted by the server. If 16212 the reply to EXCHANGE_ID has just the EXCHGID4_FLAG_USE_PNFS_DS role 16213 set, then (as noted in Section 13.6) the client will not be able to 16214 determine the data server's lease_time attribute because GETATTR will 16215 not be permitted. Instead, the rule is that any time a client 16216 receives a layout referring it to a data server that returns just the 16217 EXCHGID4_FLAG_USE_PNFS_DS role, the client MAY assume that the 16218 lease_time attribute from the metadata server that returned the 16219 layout applies to the data server. Thus, the data server MUST be 16220 aware of the values of all lease_time attributes of all metadata 16221 servers for which it is providing I/O, and it MUST use the maximum of 16222 all such lease_time values as the lease interval for all client IDs 16223 and sessions established on it. 
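As a non-normative illustration of this rule, the data server's lease interval is simply the maximum of the lease_time attributes of the metadata servers for which it provides I/O; the helper name below is illustrative only.

   #include <stddef.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Illustrative only: lease interval applied by an
      EXCHGID4_FLAG_USE_PNFS_DS-only data server to all of its client
      IDs and sessions. */
   static uint32_t
   ds_lease_interval(const uint32_t *mds_lease_times, size_t n)
   {
       uint32_t max = 0;

       for (size_t i = 0; i < n; i++)
           if (mds_lease_times[i] > max)
               max = mds_lease_times[i];
       return max;
   }

   int main(void)
   {
       /* lease_time attributes of the metadata servers served. */
       uint32_t leases[] = { 20, 10 };

       printf("lease interval = %u seconds\n",
              (unsigned)ds_lease_interval(leases, 2));   /* prints 20 */
       return 0;
   }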
16225 For example, if one metadata server has a lease_time attribute of 20 16226 seconds, and a second metadata server has a lease_time attribute of 16227 10 seconds, then if both servers return layouts that refer to an 16228 EXCHGID4_FLAG_USE_PNFS_DS-only data server, the data server MUST 16229 renew a client's lease if the interval between two SEQUENCE 16230 operations on different COMPOUND requests is less than 20 seconds. 16232 13.2. File Layout Definitions 16234 The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout 16235 type and may be applicable to other layout types. 16237 Unit. A unit is a fixed-size quantity of data written to a data 16238 server. 16240 Pattern. A pattern is a method of distributing one or more equal 16241 sized units across a set of data servers. A pattern is iterated 16242 one or more times. 16244 Stripe. A stripe is a set of data distributed across a set of data 16245 servers in a pattern before that pattern repeats. 16247 Stripe Count. A stripe count is the number of units in a pattern. 16249 Stripe Width. A stripe width is the size of a stripe in bytes. The 16250 stripe width = the stripe count * the size of the stripe unit. 16252 Hereafter, this document will refer to a unit that is written in a 16253 pattern as a "stripe unit". 16255 A pattern may have more stripe units than data servers. If so, some 16256 data servers will have more than one stripe unit per stripe. A data 16257 server that has multiple stripe units per stripe MAY store each unit 16258 in a different data file (and depending on the implementation, will 16259 possibly assign a unique data filehandle to each data file). 16261 13.3. File Layout Data Types 16263 The high level NFSv4.1 layout types are nfsv4_1_file_layouthint4, 16264 nfsv4_1_file_layout_ds_addr4, and nfsv4_1_file_layout4. 16266 The SETATTR operation supports a layout hint attribute 16267 (Section 5.12.4). When the client sets a layout hint (data type 16268 layouthint4) with a layout type of LAYOUT4_NFSV4_1_FILES (the 16269 loh_type field), the loh_body field contains a value of data type 16270 nfsv4_1_file_layouthint4. 16272 const NFL4_UFLG_MASK = 0x0000003F; 16273 const NFL4_UFLG_DENSE = 0x00000001; 16274 const NFL4_UFLG_COMMIT_THRU_MDS = 0x00000002; 16275 const NFL4_UFLG_STRIPE_UNIT_SIZE_MASK 16276 = 0xFFFFFFC0; 16278 typedef uint32_t nfl_util4; 16280 enum filelayout_hint_care4 { 16281 NFLH4_CARE_DENSE = NFL4_UFLG_DENSE, 16283 NFLH4_CARE_COMMIT_THRU_MDS 16284 = NFL4_UFLG_COMMIT_THRU_MDS, 16286 NFLH4_CARE_STRIPE_UNIT_SIZE 16287 = 0x00000040, 16289 NFLH4_CARE_STRIPE_COUNT = 0x00000080 16290 }; 16292 /* Encoded in the loh_body field of data type layouthint4: */ 16294 struct nfsv4_1_file_layouthint4 { 16295 uint32_t nflh_care; 16296 nfl_util4 nflh_util; 16297 count4 nflh_stripe_count; 16298 }; 16299 The generic layout hint structure is described in Section 3.3.19. 16300 The client uses the layout hint in the layout_hint (Section 5.12.4) 16301 attribute to indicate the preferred type of layout to be used for a 16302 newly created file. The LAYOUT4_NFSV4_1_FILES layout-type-specific 16303 content for the layout hint is composed of three fields. The first 16304 field, nflh_care, is a set of flags indicating which values of the 16305 hint the client cares about.
If the NFLH4_CARE_DENSE flag is set, 16306 then the client indicates in the second field, nflh_util, a 16307 preference for how the data file is packed (Section 13.4.4), which is 16308 controlled by the value of the expression nflh_util & NFL4_UFLG_DENSE 16309 ("&" represents the bitwise AND operator). If the 16310 NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a 16311 preference for whether the client should send COMMIT operations to 16312 the metadata server or data server (Section 13.7), which is 16313 controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS. If 16314 the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client indicates its 16315 preferred stripe unit size, which is indicated in nflh_util & 16316 NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus, the stripe unit size MUST be a 16317 multiple of 64 bytes). The minimum stripe unit size is 64 bytes. If 16318 the NFLH4_CARE_STRIPE_COUNT flag is set, the client indicates in the 16319 third field, nflh_stripe_count, the stripe count. The stripe count 16320 multiplied by the stripe unit size is the stripe width. 16322 When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in 16323 the loc_type field of the lo_content field), the loc_body field of 16324 the lo_content field contains a value of data type 16325 nfsv4_1_file_layout4. Among other content, nfsv4_1_file_layout4 has 16326 a storage device ID (field nfl_deviceid) of data type deviceid4. The 16327 GETDEVICEINFO operation maps a device ID to a storage device address 16328 (type device_addr4). When GETDEVICEINFO returns a device address 16329 with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type 16330 field), the da_addr_body field contains a value of data type 16331 nfsv4_1_file_layout_ds_addr4. 16333 typedef netaddr4 multipath_list4<>; 16335 /* 16336 * Encoded in the da_addr_body field of 16337 * data type device_addr4: 16338 */ 16339 struct nfsv4_1_file_layout_ds_addr4 { 16340 uint32_t nflda_stripe_indices<>; 16341 multipath_list4 nflda_multipath_ds_list<>; 16342 }; 16344 The nfsv4_1_file_layout_ds_addr4 data type represents the device 16345 address. It is composed of two fields: 16347 1. nflda_multipath_ds_list: An array of lists of data servers, where 16348 each list can be one or more elements, and each element 16349 represents a data server address that may serve equally as the 16350 target of I/O operations (see Section 13.5). The length of this 16351 array might be different than the stripe count. 16353 2. nflda_stripe_indices: An array of indices used to index into 16354 nflda_multipath_ds_list. The value of each element of 16355 nflda_stripe_indices MUST be less than the number of elements in 16356 nflda_multipath_ds_list. Each element of nflda_multipath_ds_list 16357 SHOULD be referred to by one or more elements of 16358 nflda_stripe_indices. The number of elements in 16359 nflda_stripe_indices is always equal to the stripe count. 16361 /* 16362 * Encoded in the loc_body field of 16363 * data type layout_content4: 16364 */ 16365 struct nfsv4_1_file_layout4 { 16366 deviceid4 nfl_deviceid; 16367 nfl_util4 nfl_util; 16368 uint32_t nfl_first_stripe_index; 16369 offset4 nfl_pattern_offset; 16370 nfs_fh4 nfl_fh_list<>; 16371 }; 16373 The nfsv4_1_file_layout4 data type represents the layout. It is 16374 composed of the following fields: 16376 1. nfl_deviceid: The device ID that maps to a value of type 16377 nfsv4_1_file_layout_ds_addr4. 16379 2. 
nfl_util: Like the nflh_util field of data type 16380 nfsv4_1_file_layouthint4, a compact representation of how the 16381 data on a file on each data server is packed, whether the client 16382 should send COMMIT operations to the metadata server or data 16383 server, and the stripe unit size. If a server returns two or 16384 more overlapping layouts, each stripe unit size in each 16385 overlapping layout MUST be the same. 16387 3. nfl_first_stripe_index: The index into the first element of the 16388 nflda_stripe_indices array to use. 16390 4. nfl_pattern_offset: This field is the logical offset into the 16391 file where the striping pattern starts. It is required for 16392 converting the client's logical I/O offset (e.g., the current 16393 offset in a POSIX file descriptor before the read() or write() 16394 system call is sent) into the stripe unit number (see 16395 Section 13.4.1). 16397 If dense packing is used, then nfl_pattern_offset is also needed 16398 to convert the client's logical I/O offset to an offset on the 16399 file on the data server corresponding to the stripe unit number 16400 (see Section 13.4.4). 16402 Note that nfl_pattern_offset is not always the same as lo_offset. 16403 For example, via the LAYOUTGET operation, a client might request 16404 a layout starting at offset 1000 of a file that has its striping 16405 pattern start at offset zero. 16407 5. nfl_fh_list: An array of data server filehandles for each list of 16408 data servers in each element of the nflda_multipath_ds_list 16409 array. The number of elements in nfl_fh_list depends on whether 16410 sparse or dense packing is being used. 16412 * If sparse packing is being used, the number of elements in 16413 nfl_fh_list MUST be one of three values: 16415 - Zero. This means that filehandles used for each data 16416 server are the same as the filehandle returned by the OPEN 16417 operation from the metadata server. 16419 - One. This means that every data server uses the same 16420 filehandle: what is specified in nfl_fh_list[0]. 16422 - The same number of elements in nflda_multipath_ds_list. 16423 Thus, in this case, when sending an I/O operation to any 16424 data server in nflda_multipath_ds_list[X], the filehandle 16425 in nfl_fh_list[X] MUST be used. 16427 See the discussion on sparse packing in Section 13.4.4. 16429 * If dense packing is being used, the number of elements in 16430 nfl_fh_list MUST be the same as the number of elements in 16431 nflda_stripe_indices. Thus, when sending an I/O operation to 16432 any data server in 16433 nflda_multipath_ds_list[nflda_stripe_indices[Y]], the 16434 filehandle in nfl_fh_list[Y] MUST be used. In addition, any 16435 time there exists i and j, (i != j), such that the 16436 intersection of 16437 nflda_multipath_ds_list[nflda_stripe_indices[i]] and 16438 nflda_multipath_ds_list[nflda_stripe_indices[j]] is not empty, 16439 then nfl_fh_list[i] MUST NOT equal nfl_fh_list[j]. In other 16440 words, when dense packing is being used, if a data server 16441 appears in two or more units of a striping pattern, each 16442 reference to the data server MUST use a different filehandle. 16444 Indeed, if there are multiple striping patterns, as indicated 16445 by the presence of multiple objects of data type layout4 16446 (either returned in one or multiple LAYOUTGET operations), and 16447 a data server is the target of a unit of one pattern and 16448 another unit of another pattern, then each reference to each 16449 data server MUST use a different filehandle. 
16451 See the discussion on dense packing in Section 13.4.4. 16453 The details on the interpretation of the layout are in Section 13.4. 16455 13.4. Interpreting the File Layout 16457 13.4.1. Determining the Stripe Unit Number 16459 To find the stripe unit number that corresponds to the client's 16460 logical file offset, the pattern offset will also be used. The i'th 16461 stripe unit (SUi) is: 16463 relative_offset = file_offset - nfl_pattern_offset; 16464 SUi = floor(relative_offset / stripe_unit_size); 16466 13.4.2. Interpreting the File Layout Using Sparse Packing 16468 When sparse packing is used, the algorithm for determining the 16469 filehandle and set of data-server network addresses to write stripe 16470 unit i (SUi) to is: 16472 stripe_count = number of elements in nflda_stripe_indices; 16474 j = (SUi + nfl_first_stripe_index) % stripe_count; 16476 idx = nflda_stripe_indices[j]; 16478 fh_count = number of elements in nfl_fh_list; 16479 ds_count = number of elements in nflda_multipath_ds_list; 16481 switch (fh_count) { 16482 case ds_count: 16483 fh = nfl_fh_list[idx]; 16484 break; 16486 case 1: 16487 fh = nfl_fh_list[0]; 16488 break; 16490 case 0: 16491 fh = filehandle returned by OPEN; 16492 break; 16494 default: 16495 throw a fatal exception; 16496 break; 16497 } 16499 address_list = nflda_multipath_ds_list[idx]; 16501 The client would then select a data server from address_list, and 16502 send a READ or WRITE operation using the filehandle specified in fh. 16504 Consider the following example: 16506 Suppose we have a device address consisting of seven data servers, 16507 arranged in three equivalence (Section 13.5) classes: 16509 { A, B, C, D }, { E }, { F, G } 16511 where A through G are network addresses. 16513 Then 16515 nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G } 16517 i.e., 16519 nflda_multipath_ds_list[0] = { A, B, C, D } 16520 nflda_multipath_ds_list[1] = { E } 16522 nflda_multipath_ds_list[2] = { F, G } 16524 Suppose the striping index array is: 16526 nflda_stripe_indices<> = { 2, 0, 1, 0 } 16528 Now suppose the client gets a layout that has a device ID that maps 16529 to the above device address. The initial index contains 16531 nfl_first_stripe_index = 2, 16533 and the filehandle list is 16535 nfl_fh_list = { 0x36, 0x87, 0x67 }. 16537 If the client wants to write to SU0, the set of valid { network 16538 address, filehandle } combinations for SUi are determined by: 16540 nfl_first_stripe_index = 2 16542 So 16544 idx = nflda_stripe_indices[(0 + 2) % 4] 16546 = nflda_stripe_indices[2] 16548 = 1 16550 So 16552 nflda_multipath_ds_list[1] = { E } 16554 and 16556 nfl_fh_list[1] = { 0x87 } 16558 The client can thus write SU0 to { 0x87, { E } }. 
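The selection above can be reproduced mechanically. The following non-normative C sketch hard-codes the example device address and layout and prints the destination chosen for each of the first 13 stripe units; its output can be compared with the table that follows.

   #include <stdint.h>
   #include <stdio.h>

   /* The example device address and layout, hard-coded. */
   static const char     *multipath_ds_list[] = { "A,B,C,D", "E", "F,G" };
   static const uint32_t  stripe_indices[]    = { 2, 0, 1, 0 };
   static const uint32_t  fh_list[]           = { 0x36, 0x87, 0x67 };

   int main(void)
   {
       const uint32_t first_stripe_index = 2;
       const uint32_t stripe_count = 4;  /* elements in nflda_stripe_indices */

       for (uint32_t su = 0; su < 13; su++) {
           uint32_t j   = (su + first_stripe_index) % stripe_count;
           uint32_t idx = stripe_indices[j];

           /* fh_count equals ds_count (3), so nfl_fh_list is indexed
              by idx, as in the sparse-packing algorithm above. */
           printf("SU%-2u  fh=0x%02x  data servers { %s }\n",
                  (unsigned)su, (unsigned)fh_list[idx],
                  multipath_ds_list[idx]);
       }
       return 0;
   }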
16560 The destinations of the first 13 storage units are: 16562 +=====+============+==============+ 16563 | SUi | filehandle | data servers | 16564 +=====+============+==============+ 16565 | 0 | 87 | E | 16566 +-----+------------+--------------+ 16567 | 1 | 36 | A,B,C,D | 16568 +-----+------------+--------------+ 16569 | 2 | 67 | F,G | 16570 +-----+------------+--------------+ 16571 | 3 | 36 | A,B,C,D | 16572 +-----+------------+--------------+ 16573 +-----+------------+--------------+ 16574 | 4 | 87 | E | 16575 +-----+------------+--------------+ 16576 | 5 | 36 | A,B,C,D | 16577 +-----+------------+--------------+ 16578 | 6 | 67 | F,G | 16579 +-----+------------+--------------+ 16580 | 7 | 36 | A,B,C,D | 16581 +-----+------------+--------------+ 16582 +-----+------------+--------------+ 16583 | 8 | 87 | E | 16584 +-----+------------+--------------+ 16585 | 9 | 36 | A,B,C,D | 16586 +-----+------------+--------------+ 16587 | 10 | 67 | F,G | 16588 +-----+------------+--------------+ 16589 | 11 | 36 | A,B,C,D | 16590 +-----+------------+--------------+ 16591 +-----+------------+--------------+ 16592 | 12 | 87 | E | 16593 +-----+------------+--------------+ 16595 Table 9 16597 13.4.3. Interpreting the File Layout Using Dense Packing 16599 When dense packing is used, the algorithm for determining the 16600 filehandle and set of data server network addresses to write stripe 16601 unit i (SUi) to is: 16603 stripe_count = number of elements in nflda_stripe_indices; 16605 j = (SUi + nfl_first_stripe_index) % stripe_count; 16607 idx = nflda_stripe_indices[j]; 16609 fh_count = number of elements in nfl_fh_list; 16610 ds_count = number of elements in nflda_multipath_ds_list; 16612 switch (fh_count) { 16613 case stripe_count: 16614 fh = nfl_fh_list[j]; 16615 break; 16617 default: 16618 throw a fatal exception; 16619 break; 16620 } 16622 address_list = nflda_multipath_ds_list[idx]; 16624 The client would then select a data server from address_list, and 16625 send a READ or WRITE operation using the filehandle specified in fh. 16627 Consider the following example (which is the same as the sparse 16628 packing example, except for the filehandle list): 16630 Suppose we have a device address consisting of seven data servers, 16631 arranged in three equivalence (Section 13.5) classes: 16633 { A, B, C, D }, { E }, { F, G } 16635 where A through G are network addresses. 16637 Then 16639 nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G } 16641 i.e., 16643 nflda_multipath_ds_list[0] = { A, B, C, D } 16645 nflda_multipath_ds_list[1] = { E } 16647 nflda_multipath_ds_list[2] = { F, G } 16649 Suppose the striping index array is: 16651 nflda_stripe_indices<> = { 2, 0, 1, 0 } 16653 Now suppose the client gets a layout that has a device ID that maps 16654 to the above device address. The initial index contains 16656 nfl_first_stripe_index = 2, 16658 and 16660 nfl_fh_list = { 0x67, 0x37, 0x87, 0x36 }. 16662 The interesting examples for dense packing are SU1 and SU3 because 16663 each stripe unit refers to the same data server list, yet each stripe 16664 unit MUST use a different filehandle. 
If the client wants to write 16665 to SU1, the set of valid { network address, filehandle } combinations 16666 for SUi are determined by: 16668 nfl_first_stripe_index = 2 16670 So 16672 j = (1 + 2) % 4 = 3 16674 idx = nflda_stripe_indices[j] 16676 = nflda_stripe_indices[3] 16678 = 0 16680 So 16682 nflda_multipath_ds_list[0] = { A, B, C, D } 16684 and 16686 nfl_fh_list[3] = { 0x36 } 16688 The client can thus write SU1 to { 0x36, { A, B, C, D } }. 16690 For SU3, j = (3 + 2) % 4 = 1, and nflda_stripe_indices[1] = 0. Then 16691 nflda_multipath_ds_list[0] = { A, B, C, D }, and nfl_fh_list[1] = 16692 0x37. The client can thus write SU3 to { 0x37, { A, B, C, D } }. 16694 The destinations of the first 13 storage units are: 16696 +=====+============+==============+ 16697 | SUi | filehandle | data servers | 16698 +=====+============+==============+ 16699 | 0 | 87 | E | 16700 +-----+------------+--------------+ 16701 | 1 | 36 | A,B,C,D | 16702 +-----+------------+--------------+ 16703 | 2 | 67 | F,G | 16704 +-----+------------+--------------+ 16705 | 3 | 37 | A,B,C,D | 16706 +-----+------------+--------------+ 16707 +-----+------------+--------------+ 16708 | 4 | 87 | E | 16709 +-----+------------+--------------+ 16710 | 5 | 36 | A,B,C,D | 16711 +-----+------------+--------------+ 16712 | 6 | 67 | F,G | 16713 +-----+------------+--------------+ 16714 | 7 | 37 | A,B,C,D | 16715 +-----+------------+--------------+ 16716 +-----+------------+--------------+ 16717 | 8 | 87 | E | 16718 +-----+------------+--------------+ 16719 | 9 | 36 | A,B,C,D | 16720 +-----+------------+--------------+ 16721 | 10 | 67 | F,G | 16722 +-----+------------+--------------+ 16723 | 11 | 37 | A,B,C,D | 16724 +-----+------------+--------------+ 16725 +-----+------------+--------------+ 16726 | 12 | 87 | E | 16727 +-----+------------+--------------+ 16729 Table 10 16731 13.4.4. Sparse and Dense Stripe Unit Packing 16733 The flag NFL4_UFLG_DENSE of the nfl_util4 data type (field nflh_util 16734 of the data type nfsv4_1_file_layouthint4 and field nfl_util of data 16735 type nfsv4_1_file_layout_ds_addr4) specifies how the data is packed 16736 within the data file on a data server. It allows for two different 16737 data packings: sparse and dense. The packing type determines the 16738 calculation that will be made to map the client-visible file offset 16739 to the offset within the data file located on the data server. 16741 If nfl_util & NFL4_UFLG_DENSE is zero, this means that sparse packing 16742 is being used. Hence, the logical offsets of the file as viewed by a 16743 client sending READs and WRITEs directly to the metadata server are 16744 the same offsets each data server uses when storing a stripe unit. 16745 The effect then, for striping patterns consisting of at least two 16746 stripe units, is for each data server file to be sparse or "holey". 16747 So for example, suppose there is a pattern with three stripe units, 16748 the stripe unit size is 4096 bytes, and there are three data servers 16749 in the pattern. Then, the file in data server 1 will have stripe 16750 units 0, 3, 6, 9, ... filled; data server 2's file will have stripe 16751 units 1, 4, 7, 10, ... filled; and data server 3's file will have 16752 stripe units 2, 5, 8, 11, ... filled. The unfilled stripe units of 16753 each file will be holes; hence, the files in each data server are 16754 sparse. 16756 If sparse packing is being used and a client attempts I/O to one of 16757 the holes, then an error MUST be returned by the data server. 
Using 16758 the above example, if data server 3 received a READ or WRITE 16759 operation for block 4, the data server would return 16760 NFS4ERR_PNFS_IO_HOLE. Thus, data servers need to understand the 16761 striping pattern in order to support sparse packing. 16763 If nfl_util & NFL4_UFLG_DENSE is one, this means that dense packing 16764 is being used, and the data server files have no holes. Dense 16765 packing might be selected because the data server does not 16766 (efficiently) support holey files or because the data server cannot 16767 recognize read-ahead unless there are no holes. If dense packing is 16768 indicated in the layout, the data files will be packed. Using the 16769 same striping pattern and stripe unit size that were used for the 16770 sparse packing example, the corresponding dense packing example would 16771 have all stripe units of all data files filled as follows: 16773 * Logical stripe units 0, 3, 6, ... of the file would live on stripe 16774 units 0, 1, 2, ... of the file of data server 1. 16776 * Logical stripe units 1, 4, 7, ... of the file would live on stripe 16777 units 0, 1, 2, ... of the file of data server 2. 16779 * Logical stripe units 2, 5, 8, ... of the file would live on stripe 16780 units 0, 1, 2, ... of the file of data server 3. 16782 Because dense packing does not leave holes on the data servers, the 16783 pNFS client is allowed to write to any offset of any data file of any 16784 data server in the stripe. Thus, the data servers need not know the 16785 file's striping pattern. 16787 The calculation to determine the byte offset within the data file for 16788 dense data server layouts is: 16790 stripe_width = stripe_unit_size * N; 16791 where N = number of elements in nflda_stripe_indices. 16793 relative_offset = file_offset - nfl_pattern_offset; 16795 data_file_offset = floor(relative_offset / stripe_width) 16796 * stripe_unit_size 16797 + relative_offset % stripe_unit_size 16799 If dense packing is being used, and a data server appears more than 16800 once in a striping pattern, then to distinguish one stripe unit from 16801 another, the data server MUST use a different filehandle. Let's 16802 suppose there are two data servers. Logical stripe units 0, 3, 6 are 16803 served by data server 1; logical stripe units 1, 4, 7 are served by 16804 data server 2; and logical stripe units 2, 5, 8 are also served by 16805 data server 2. Unless data server 2 has two filehandles (each 16806 referring to a different data file), then, for example, a write to 16807 logical stripe unit 1 overwrites the write to logical stripe unit 2 16808 because both logical stripe units are located in the same stripe unit 16809 (0) of data server 2. 16811 13.5. Data Server Multipathing 16813 The NFSv4.1 file layout supports multipathing to multiple data server 16814 addresses. Data-server-level multipathing is used for bandwidth 16815 scaling via trunking (Section 2.10.5) and for higher availability of 16816 use in the case of a data-server failure. Multipathing allows the 16817 client to switch to another data server address which may be that of 16818 another data server that is exporting the same data stripe unit, 16819 without having to contact the metadata server for a new layout. 16821 To support data server multipathing, each element of the 16822 nflda_multipath_ds_list contains an array of one or more data server 16823 network addresses.
This array (data type multipath_list4) represents 16824 a list of data servers (each identified by a network address), with 16825 the possibility that some data servers will appear in the list 16826 multiple times. 16828 The client is free to use any of the network addresses as a 16829 destination to send data server requests. If some network addresses 16830 are less optimal paths to the data than others, then the MDS SHOULD 16831 NOT include those network addresses in an element of 16832 nflda_multipath_ds_list. If less optimal network addresses exist to 16833 provide failover, the RECOMMENDED method to offer the addresses is to 16834 provide them in a replacement device-ID-to-device-address mapping, or 16835 a replacement device ID. When a client finds that no data server in 16836 an element of nflda_multipath_ds_list responds, it SHOULD send a 16837 GETDEVICEINFO to attempt to replace the existing device-ID-to-device- 16838 address mappings. If the MDS detects that all data servers 16839 represented by an element of nflda_multipath_ds_list are unavailable, 16840 the MDS SHOULD send a CB_NOTIFY_DEVICEID (if the client has indicated 16841 it wants device ID notifications for changed device IDs) to change 16842 the device-ID-to-device-address mappings to the available data 16843 servers. If the device ID itself will be replaced, the MDS SHOULD 16844 recall all layouts with the device ID, and thus force the client to 16845 get new layouts and device ID mappings via LAYOUTGET and 16846 GETDEVICEINFO. 16848 Generally, if two network addresses appear in an element of 16849 nflda_multipath_ds_list, they will designate the same data server, 16850 and the two data server addresses will support the implementation of 16851 client ID or session trunking (the latter is RECOMMENDED) as defined 16852 in Section 2.10.5. The two data server addresses will share the same 16853 server owner or major ID of the server owner. It is not always 16854 necessary for the two data server addresses to designate the same 16855 server with trunking being used. For example, the data could be 16856 read-only, and the data consist of exact replicas. 16858 13.6. Operations Sent to NFSv4.1 Data Servers 16860 Clients accessing data on an NFSv4.1 data server MUST send only the 16861 NULL procedure and COMPOUND procedures whose operations are taken 16862 only from two restricted subsets of the operations defined as valid 16863 NFSv4.1 operations. Clients MUST use the filehandle specified by the 16864 layout when accessing data on NFSv4.1 data servers. 16866 The first of these operation subsets consists of management 16867 operations. This subset consists of the BACKCHANNEL_CTL, 16868 BIND_CONN_TO_SESSION, CREATE_SESSION, DESTROY_CLIENTID, 16869 DESTROY_SESSION, EXCHANGE_ID, SECINFO_NO_NAME, SET_SSV, and SEQUENCE 16870 operations. The client may use these operations in order to set up 16871 and maintain the appropriate client IDs, sessions, and security 16872 contexts involved in communication with the data server. Henceforth, 16873 these will be referred to as data-server housekeeping operations. 16875 The second subset consists of COMMIT, READ, WRITE, and PUTFH. These 16876 operations MUST be used with a current filehandle specified by the 16877 layout. In the case of PUTFH, the new current filehandle MUST be one 16878 taken from the layout. Henceforth, these will be referred to as 16879 data-server I/O operations. 
As described in Section 12.5.1, a client 16880 MUST NOT send an I/O to a data server for which it does not hold a 16881 valid layout; the data server MUST reject such an I/O. 16883 Unless the server has a concurrent non-data-server personality -- 16884 i.e., EXCHANGE_ID results returned (EXCHGID4_FLAG_USE_PNFS_DS | 16885 EXCHGID4_FLAG_USE_PNFS_MDS) or (EXCHGID4_FLAG_USE_PNFS_DS | 16886 EXCHGID4_FLAG_USE_NON_PNFS) see Section 13.1 -- any attempted use of 16887 operations against a data server other than those specified in the 16888 two subsets above MUST return NFS4ERR_NOTSUPP to the client. 16890 When the server has concurrent data-server and non-data-server 16891 personalities, each COMPOUND sent by the client MUST be constructed 16892 so that it is appropriate to one of the two personalities, and it 16893 MUST NOT contain operations directed to a mix of those personalities. 16894 The server MUST enforce this. To understand the constraints, 16895 operations within a COMPOUND are divided into the following three 16896 classes: 16898 1. An operation that is ambiguous regarding its personality 16899 assignment. This includes all of the data-server housekeeping 16900 operations. Additionally, if the server has assigned filehandles 16901 so that the ones defined by the layout are the same as those used 16902 by the metadata server, all operations using such filehandles are 16903 within this class, with the following exception. The exception 16904 is that if the operation uses a stateid that is incompatible with 16905 a data-server personality (e.g., a special stateid or the stateid 16906 has a non-zero "seqid" field, see Section 13.9.1), the operation 16907 is in class 3, as described below. A COMPOUND containing 16908 multiple class 1 operations (and operations of no other class) 16909 MAY be sent to a server with multiple concurrent data server and 16910 non-data-server personalities. 16912 2. An operation that is unambiguously referable to the data-server 16913 personality. This includes data-server I/O operations where the 16914 filehandle is one that can only be validly directed to the data- 16915 server personality. 16917 3. An operation that is unambiguously referable to the non-data- 16918 server personality. This includes all COMPOUND operations that 16919 are neither data-server housekeeping nor data-server I/O 16920 operations, plus data-server I/O operations where the current fh 16921 (or the one to be made the current fh in the case of PUTFH) is 16922 only valid on the metadata server or where a stateid is used that 16923 is incompatible with the data server, i.e., is a special stateid 16924 or has a non-zero seqid value. 16926 When a COMPOUND first executes an operation from class 3 above, it 16927 acts as a normal COMPOUND on any other server, and the data-server 16928 personality ceases to be relevant. There are no special restrictions 16929 on the operations in the COMPOUND to limit them to those for a data 16930 server. When a PUTFH is done, filehandles derived from the layout 16931 are not valid. If their format is not normally acceptable, then 16932 NFS4ERR_BADHANDLE MUST result. Similarly, current filehandles for 16933 other operations do not accept filehandles derived from layouts and 16934 are not normally usable on the metadata server. Using these will 16935 result in NFS4ERR_STALE. 
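A simplified, non-normative sketch of this classification follows. The boolean parameters stand for checks described above and are not protocol fields; for brevity, the sketch applies the shared-filehandle rule only to data-server I/O operations.

   #include <stdbool.h>
   #include <stdio.h>

   enum op_class {
       CLASS_AMBIGUOUS       = 1,   /* either personality          */
       CLASS_DATA_SERVER     = 2,   /* data-server personality     */
       CLASS_NON_DATA_SERVER = 3    /* non-data-server personality */
   };

   /* Illustrative only: classify one COMPOUND operation for a server
      with concurrent data-server and non-data-server personalities. */
   static enum op_class
   classify_op(bool is_housekeeping_op,   /* EXCHANGE_ID, SEQUENCE, etc.   */
               bool is_ds_io_op,          /* COMMIT, READ, WRITE, PUTFH    */
               bool fh_from_layout,       /* filehandle taken from layout  */
               bool fh_shared_with_mds,   /* same fh also valid on MDS     */
               bool stateid_incompatible) /* special stateid or seqid != 0 */
   {
       if (is_housekeeping_op)
           return CLASS_AMBIGUOUS;                        /* class 1 */

       if (stateid_incompatible)
           return CLASS_NON_DATA_SERVER;                  /* class 3 */

       if (is_ds_io_op && fh_from_layout)
           return fh_shared_with_mds ? CLASS_AMBIGUOUS    /* class 1 */
                                     : CLASS_DATA_SERVER; /* class 2 */

       return CLASS_NON_DATA_SERVER;                      /* class 3 */
   }

   int main(void)
   {
       /* A WRITE using a layout-only filehandle and a seqid-zero,
          non-special stateid falls in class 2. */
       printf("class %d\n", classify_op(false, true, true, false, false));
       return 0;
   }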
16937 When a COMPOUND first executes an operation from class 2, which would 16938 be PUTFH where the filehandle is one from a layout, the COMPOUND 16939 henceforth is interpreted with respect to the data-server 16940 personality. Operations outside the two classes discussed above MUST 16941 result in NFS4ERR_NOTSUPP. Filehandles are validated using the rules 16942 of the data server, resulting in NFS4ERR_BADHANDLE and/or 16943 NFS4ERR_STALE even when they would not normally do so when addressed 16944 to the non-data-server personality. Stateids must obey the rules of 16945 the data server in that any use of special stateids or stateids with 16946 non-zero seqid values must result in NFS4ERR_BAD_STATEID. 16948 Until the server first executes an operation from class 2 or class 3, 16949 the client MUST NOT depend on the operation being executed by either 16950 the data-server or the non-data-server personality. The server MUST 16951 pick one personality consistently for a given COMPOUND, with the only 16952 possible transition being a single one when the first operation from 16953 class 2 or class 3 is executed. 16955 Because of the complexity induced by assigning filehandles so they 16956 can be used on both a data server and a metadata server, it is 16957 RECOMMENDED that where the same server can have both personalities, 16958 the server assign separate unique filehandles to both personalities. 16959 This makes it unambiguous for which server a given request is 16960 intended. 16962 GETATTR and SETATTR MUST be directed to the metadata server. In the 16963 case of a SETATTR of the size attribute, the control protocol is 16964 responsible for propagating size updates/truncations to the data 16965 servers. In the case of extending WRITEs to the data servers, the 16966 new size must be visible on the metadata server once a LAYOUTCOMMIT 16967 has completed (see Section 12.5.4.2). Section 13.10 describes the 16968 mechanism by which the client is to handle data-server files that do 16969 not reflect the metadata server's size. 16971 13.7. COMMIT through Metadata Server 16973 The file layout provides two alternate means of providing for the 16974 commit of data written through data servers. The flag 16975 NFL4_UFLG_COMMIT_THRU_MDS in the field nfl_util of the file layout 16976 (data type nfsv4_1_file_layout4) is an indication from the metadata 16977 server to the client of the REQUIRED way of performing COMMIT, either 16978 by sending the COMMIT to the data server or the metadata server. 16979 These two methods of dealing with the issue correspond to broad 16980 styles of implementation for a pNFS server supporting the file layout 16981 type. 16983 * When the flag is FALSE, COMMIT operations MUST be sent to the 16984 data server to which the corresponding WRITE operations were sent. 16985 This approach is sometimes useful when file striping is 16986 implemented within the pNFS server (instead of the file system), 16987 with the individual data servers each implementing their own file 16988 systems. 16990 * When the flag is TRUE, COMMIT operations MUST be sent to the 16991 metadata server, rather than to the individual data servers. This 16992 approach is sometimes useful when file striping is implemented 16993 within the clustered file system that is the backend to the pNFS 16994 server. In such an implementation, each COMMIT to each data 16995 server might result in repeated writes of metadata blocks to the 16996 detriment of write performance.
Sending a single COMMIT to the 16997 metadata server can be more efficient when there exists a 16998 clustered file system capable of implementing such a coordinated 16999 COMMIT. 17001 If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is TRUE, then in order to 17002 maintain the current NFSv4.1 commit and recovery model, the data 17003 servers MUST return a common writeverf verifier in all WRITE 17004 responses for a given file layout, and the metadata server's 17005 COMMIT implementation must return the same writeverf. The value 17006 of the writeverf verifier MUST be changed at the metadata server 17007 or any data server that is referenced in the layout, whenever 17008 there is a server event that can possibly lead to loss of 17009 uncommitted data. The scope of the verifier can be for a file or 17010 for the entire pNFS server. It might be more difficult for the 17011 server to maintain the verifier at the file level, but the benefit 17012 is that only events that impact a given file will require recovery 17013 action. 17015 Note that if the layout specified dense packing, then the offset used 17016 to a COMMIT to the MDS may differ than that of an offset used to a 17017 COMMIT to the data server. 17019 The single COMMIT to the metadata server will return a verifier, and 17020 the client should compare it to all the verifiers from the WRITEs and 17021 fail the COMMIT if there are any mismatched verifiers. If COMMIT to 17022 the metadata server fails, the client should re-send WRITEs for all 17023 the modified data in the file. The client should treat modified data 17024 with a mismatched verifier as a WRITE failure and try to recover by 17025 resending the WRITEs to the original data server or using another 17026 path to that data if the layout has not been recalled. 17027 Alternatively, the client can obtain a new layout or it could rewrite 17028 the data directly to the metadata server. If nfl_util & 17029 NFL4_UFLG_COMMIT_THRU_MDS is FALSE, sending a COMMIT to the metadata 17030 server might have no effect. If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS 17031 is FALSE, a COMMIT sent to the metadata server should be used only to 17032 commit data that was written to the metadata server. See 17033 Section 12.7.6 for recovery options. 17035 13.8. The Layout Iomode 17037 The layout iomode need not be used by the metadata server when 17038 servicing NFSv4.1 file-based layouts, although in some circumstances 17039 it may be useful. For example, if the server implementation supports 17040 reading from read-only replicas or mirrors, it would be useful for 17041 the server to return a layout enabling the client to do so. As such, 17042 the client SHOULD set the iomode based on its intent to read or write 17043 the data. The client may default to an iomode of LAYOUTIOMODE4_RW. 17044 The iomode need not be checked by the data servers when clients 17045 perform I/O. However, the data servers SHOULD still validate that 17046 the client holds a valid layout and return an error if the client 17047 does not. 17049 13.9. Metadata and Data Server State Coordination 17051 13.9.1. Global Stateid Requirements 17053 When the client sends I/O to a data server, the stateid used MUST NOT 17054 be a layout stateid as returned by LAYOUTGET or sent by 17055 CB_LAYOUTRECALL. 
Permitted stateids are based on one of the 17056 following: an OPEN stateid (the stateid field of data type OPEN4resok 17057 as returned by OPEN), a delegation stateid (the stateid field of data 17058 types open_read_delegation4 and open_write_delegation4 as returned by 17059 OPEN or WANT_DELEGATION, or as sent by CB_PUSH_DELEG), or a stateid 17060 returned by the LOCK or LOCKU operations. The stateid sent to the 17061 data server MUST be sent with the seqid set to zero, indicating the 17062 most current version of that stateid, rather than indicating a 17063 specific non-zero seqid value. In no case is the use of special 17064 stateid values allowed. 17066 The stateid used for I/O MUST have the same effect and be subject to 17067 the same validation on a data server as it would if the I/O was being 17068 performed on the metadata server itself in the absence of pNFS. This 17069 has the implication that stateids are globally valid on both the 17070 metadata and data servers. This requires the metadata server to 17071 propagate changes in LOCK and OPEN state to the data servers, so that 17072 the data servers can validate I/O accesses. This is discussed 17073 further in Section 13.9.2. Depending on when stateids are 17074 propagated, the existence of a valid stateid on the data server may 17075 act as proof of a valid layout. 17077 Clients performing I/O operations need to select an appropriate 17078 stateid based on the locks (including opens and delegations) held by 17079 the client and the various types of state-owners sending the I/O 17080 requests. The rules for doing so when referencing data servers are 17081 somewhat different from those discussed in Section 8.2.5, which apply 17082 when accessing metadata servers. 17084 The following rules, applied in order of decreasing priority, govern 17085 the selection of the appropriate stateid: 17087 * If the client holds a delegation for the file in question, the 17088 delegation stateid should be used. 17090 * Otherwise, there must be an OPEN stateid for the current open- 17091 owner, and that OPEN stateid for the open file in question is 17092 used, unless mandatory locking prevents that. See below. 17094 * If the data server had previously responded with NFS4ERR_LOCKED to 17095 use of the OPEN stateid, then the client should use the byte-range 17096 lock stateid whenever one exists for that open file with the 17097 current lock-owner. 17099 * Special stateids should never be used. If they are used, the data 17100 server MUST reject the I/O with an NFS4ERR_BAD_STATEID error. 17102 13.9.2. Data Server State Propagation 17104 Since the metadata server, which handles byte-range lock and open- 17105 mode state changes as well as ACLs, might not be co-located with the 17106 data servers where I/O accesses are validated, the server 17107 implementation MUST take care of propagating changes of this state to 17108 the data servers. Once the propagation to the data servers is 17109 complete, the full effect of those changes MUST be in effect at the 17110 data servers. However, some state changes need not be propagated 17111 immediately, although all changes SHOULD be propagated promptly. 17112 These state propagations have an impact on the design of the control 17113 protocol, even though the control protocol is outside of the scope of 17114 this specification. 
Immediate propagation refers to the synchronous 17115 propagation of state from the metadata server to the data server(s); 17116 the propagation must be complete before returning to the client. 17118 13.9.2.1. Lock State Propagation 17120 If the pNFS server supports mandatory byte-range locking, any 17121 mandatory byte-range locks on a file MUST be made effective at the 17122 data servers before the request that establishes them returns to the 17123 caller. The effect MUST be the same as if the mandatory byte-range 17124 lock state were synchronously propagated to the data servers, even 17125 though the details of the control protocol may avoid actual transfer 17126 of the state under certain circumstances. 17128 On the other hand, since advisory byte-range lock state is not used 17129 for checking I/O accesses at the data servers, there is no semantic 17130 reason for propagating advisory byte-range lock state to the data 17131 servers. Since updates to advisory locks neither confer nor remove 17132 privileges, these changes need not be propagated immediately, and may 17133 not need to be propagated promptly. The updates to advisory locks 17134 need only be propagated when the data server needs to resolve a 17135 question about a stateid. In fact, if byte-range locking is not 17136 mandatory (i.e., is advisory) the clients are advised to avoid using 17137 the byte-range lock-based stateids for I/O. The stateids returned by 17138 OPEN are sufficient and eliminate overhead for this kind of state 17139 propagation. 17141 If a client gets back an NFS4ERR_LOCKED error from a data server, 17142 this is an indication that mandatory byte-range locking is in force. 17143 The client recovers from this by getting a byte-range lock that 17144 covers the affected range and re-sends the I/O with the stateid of 17145 the byte-range lock. 17147 13.9.2.2. Open and Deny Mode Validation 17149 Open and deny mode validation MUST be performed against the open and 17150 deny mode(s) held by the data servers. When access is reduced or a 17151 deny mode made more restrictive (because of CLOSE or OPEN_DOWNGRADE), 17152 the data server MUST prevent any I/Os that would be denied if 17153 performed on the metadata server. When access is expanded, the data 17154 server MUST make sure that no requests are subsequently rejected 17155 because of open or deny issues that no longer apply, given the 17156 previous relaxation. 17158 13.9.2.3. File Attributes 17160 Since the SETATTR operation has the ability to modify state that is 17161 visible on both the metadata and data servers (e.g., the size), care 17162 must be taken to ensure that the resultant state across the set of 17163 data servers is consistent, especially when truncating or growing the 17164 file. 17166 As described earlier, the LAYOUTCOMMIT operation is used to ensure 17167 that the metadata is synchronized with changes made to the data 17168 servers. For the NFSv4.1-based data storage protocol, it is 17169 necessary to re-synchronize state such as the size attribute, and the 17170 setting of mtime/change/atime. See Section 12.5.4 for a full 17171 description of the semantics regarding LAYOUTCOMMIT and attribute 17172 synchronization. It should be noted that by using an NFSv4.1-based 17173 layout type, it is possible to synchronize this state before 17174 LAYOUTCOMMIT occurs. For example, the control protocol can be used 17175 to query the attributes present on the data servers. 
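As an illustration of the preceding paragraph, the following non-normative client-side sketch shows an extending write issued through the data servers followed by a LAYOUTCOMMIT, so that the new size becomes visible at the metadata server (see Sections 12.5.4.2 and 13.10). The pnfs_file structure and the helper routines are hypothetical stand-ins for a client implementation and are not defined by this specification.

   /*
    * Non-normative sketch.  The pnfs_file structure and the helper
    * routines below are hypothetical.
    */
   #include <stdint.h>
   #include <stddef.h>

   struct pnfs_file;               /* client's per-file pNFS state */

   /* Hypothetical helpers: WRITE through the data servers named by
    * the layout, and send LAYOUTCOMMIT to the metadata server. */
   int pnfs_write_through_ds(struct pnfs_file *f, uint64_t offset,
                             const void *buf, size_t count);
   int nfs_layoutcommit(struct pnfs_file *f, uint64_t offset,
                        uint64_t length, uint64_t last_write_offset);

   /* Extend the file via the data servers, then make the new size
    * visible at the metadata server.  Assumes count > 0. */
   static int extend_and_commit(struct pnfs_file *f, uint64_t offset,
                                const void *buf, size_t count)
   {
       int status = pnfs_write_through_ds(f, offset, buf, count);
       if (status != 0)
           return status;

       /* Until LAYOUTCOMMIT completes, the metadata server need not
        * reflect the new size; other clients consulting the metadata
        * server could still observe the old size (Section 13.10). */
       return nfs_layoutcommit(f, offset, count,
                               offset + count - 1 /* last byte written */);
   }

A client that defers the LAYOUTCOMMIT, for example until CLOSE, must be prepared for READs routed to other data servers to encounter the old size in the interim, as described in Section 13.10.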
17177 Any changes to file attributes that control authorization or access 17178 as reflected by ACCESS calls or READs and WRITEs on the metadata 17179 server, MUST be propagated to the data servers for enforcement on 17180 READ and WRITE I/O calls. If the changes made on the metadata server 17181 result in more restrictive access permissions for any user, those 17182 changes MUST be propagated to the data servers synchronously. 17184 The OPEN operation (Section 18.16.4) does not impose any requirement 17185 that I/O operations on an open file have the same credentials as the 17186 OPEN itself (unless EXCHGID4_FLAG_BIND_PRINC_STATEID is set when 17187 EXCHANGE_ID creates the client ID), and so it requires the server's 17188 READ and WRITE operations to perform appropriate access checking. 17189 Changes to ACLs also require new access checking by READ and WRITE on 17190 the server. The propagation of access-right changes due to changes 17191 in ACLs may be asynchronous only if the server implementation is able 17192 to determine that the updated ACL is not more restrictive for any 17193 user specified in the old ACL. Due to the relative infrequency of 17194 ACL updates, it is suggested that all changes be propagated 17195 synchronously. 17197 13.10. Data Server Component File Size 17199 A potential problem exists when a component data file on a particular 17200 data server has grown past EOF; the problem exists for both dense and 17201 sparse layouts. Imagine the following scenario: a client creates a 17202 new file (size == 0) and writes to byte 131072; the client then seeks 17203 to the beginning of the file and reads byte 100. The client should 17204 receive zeroes back as a result of the READ. However, if the 17205 striping pattern directs the client to send the READ to a data server 17206 other than the one that received the client's original WRITE, the 17207 data server servicing the READ may believe that the file's size is 17208 still 0 bytes. In that event, the data server's READ response will 17209 contain zero bytes and an indication of EOF. The data server can 17210 only return zeroes if it knows that the file's size has been 17211 extended. This would require the immediate propagation of the file's 17212 size to all data servers, which is potentially very costly. 17213 Therefore, the client that has initiated the extension of the file's 17214 size MUST be prepared to deal with these EOF conditions. When the 17215 offset in the arguments to READ is less than the client's view of the 17216 file size, if the READ response indicates EOF and/or contains fewer 17217 bytes than requested, the client will interpret such a response as a 17218 hole in the file, and the NFS client will substitute zeroes for the 17219 data. 17221 The NFSv4.1 protocol only provides close-to-open file data cache 17222 semantics; meaning that when the file is closed, all modified data is 17223 written to the server. When a subsequent OPEN of the file is done, 17224 the change attribute is inspected for a difference from a cached 17225 value for the change attribute. For the case above, this means that 17226 a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and 17227 will update the file's size and change attribute. Access from 17228 another client after that point will result in the appropriate size 17229 being returned. 17231 13.11. 
Layout Revocation and Fencing
17233 As described in Section 12.7, the layout-type-specific storage
17234 protocol is responsible for handling the effects of I/Os that started
17235 before lease expiration and extend through lease expiration. The
17236 LAYOUT4_NFSV4_1_FILES layout type can prevent all I/Os to data
17237 servers from being executed after lease expiration (this prevention
17238 is called "fencing"), without relying on a precise client lease timer
17239 and without requiring data servers to maintain lease timers. The
17240 LAYOUT4_NFSV4_1_FILES pNFS server has the flexibility to revoke
17241 individual layouts, and thus fence I/O on a per-file basis.
17243 In addition to lease expiration, the reasons a layout can be revoked
17244 include: the client failing to respond to a CB_LAYOUTRECALL, the
17245 metadata server restarting, or administrative intervention. Regardless of the
17246 reason, once a client's layout has been revoked, the pNFS server MUST
17247 prevent the client from sending I/O for the affected file from and to
17248 all data servers; in other words, it MUST fence the client from the
17249 affected file on the data servers.
17251 Fencing works as follows. As described in Section 13.1, in COMPOUND
17252 procedure requests to the data server, the data filehandle provided
17253 by the PUTFH operation and the stateid in the READ or WRITE operation
17254 are used to ensure that the client has a valid layout for the I/O
17255 being performed; if it does not, the I/O is rejected with
17256 NFS4ERR_PNFS_NO_LAYOUT. The server can simply check the stateid and,
17257 additionally, make the data filehandle stale if the layout specified
17258 a data filehandle that is different from the metadata server's
17259 filehandle for the file (see the nfl_fh_list description in
17260 Section 13.3).
17262 Before the metadata server takes any action to revoke layout state
17263 given out by a previous instance, it must make sure that all layout
17264 state from that previous instance is invalidated at the data
17265 servers. This has the following implications.
17267 * The metadata server must not restripe a file until it has
17268 contacted all of the data servers to invalidate the layouts from
17269 the previous instance.
17271 * The metadata server must not give out mandatory locks that
17272 conflict with layouts from the previous instance without either
17273 doing a specific layout invalidation (as it would have to do
17274 anyway) or doing a global data server invalidation.
17276 13.12. Security Considerations for the File Layout Type
17278 The NFSv4.1 file layout type MUST adhere to the security
17279 considerations outlined in Section 12.9. NFSv4.1 data servers MUST
17280 make all of the required access checks on each READ or WRITE I/O as
17281 determined by the NFSv4.1 protocol. If the metadata server would
17282 deny a READ or WRITE operation on a file due to its ACL, mode
17283 attribute, open access mode, open deny mode, mandatory byte-range
17284 lock state, or any other attributes and state, the data server MUST
17285 also deny the READ or WRITE operation. This impacts the control
17286 protocol and the propagation of state from the metadata server to the
17287 data servers; see Section 13.9.2 for more details.
17289 The methods for authentication, integrity, and privacy for data
17290 servers based on the LAYOUT4_NFSV4_1_FILES layout type are the same
17291 as those used by metadata servers.
Metadata and data servers use ONC
17292 RPC security flavors to authenticate, and SECINFO and SECINFO_NO_NAME
17293 to negotiate the security mechanism and services to be used. Thus,
17294 when using the LAYOUT4_NFSV4_1_FILES layout type, the impact on the
17295 RPC-based security model due to pNFS (as alluded to in Sections 1.8.1
17296 and 1.8.2.2) is zero.
17298 For a given file object, a metadata server MAY require different
17299 security parameters (secinfo4 value) than the data server. For a
17300 given file object with multiple data servers, the secinfo4 value
17301 SHOULD be the same across all data servers. If the secinfo4 values
17302 across a metadata server and its data servers differ for a specific
17303 file, the mapping of the principal to the server's internal user
17304 identifier MUST be the same in order for the access-control checks
17305 based on ACL, mode, open and deny mode, and mandatory locking to be
17306 consistent across the pNFS server.
17308 If an NFSv4.1 implementation supports pNFS and supports NFSv4.1 file
17309 layouts, then the implementation MUST support the SECINFO_NO_NAME
17310 operation on both the metadata and data servers.
17312 14. Internationalization
17314 The primary area in which NFSv4.1 needs to deal with
17315 internationalization, or I18N, is with respect to file names and
17316 other strings as used within the protocol. The choice of string
17317 representation must allow reasonable name/string access to clients
17318 that use various languages. The UTF-8 encoding of the UCS (Universal
17319 Multiple-Octet Coded Character Set) as defined by ISO10646 [18]
17320 allows for this type of access and follows the policy described in
17321 "IETF Policy on Character Sets and Languages", RFC 2277 [19].
17323 RFC 3454 [16], otherwise known as "stringprep", documents a framework
17324 for using Unicode/UTF-8 in networking protocols so as "to increase
17325 the likelihood that string input and string comparison work in ways
17326 that make sense for typical users throughout the world". A protocol
17327 must define a profile of stringprep "in order to fully specify the
17328 processing options". The remainder of this section defines the
17329 NFSv4.1 stringprep profiles. Much of the terminology used for the
17330 remainder of this section comes from stringprep.
17332 There are three UTF-8 string types defined for NFSv4.1: utf8str_cs,
17333 utf8str_cis, and utf8str_mixed. Separate profiles are defined for
17334 each. Each profile defines the following, as required by stringprep:
17336 * The intended applicability of the profile.
17338 * The character repertoire that is the input and output to
17339 stringprep (which is Unicode 3.2 for the referenced version of
17340 stringprep). However, NFSv4.1 implementations are not limited to
17341 3.2.
17343 * The mapping tables from stringprep used (as described in Section 3
17344 of stringprep).
17346 * Any additional mapping tables specific to the profile.
17348 * The Unicode normalization used, if any (as described in Section 4
17349 of stringprep).
17351 * The tables from the stringprep listing of characters that are
17352 prohibited as output (as described in Section 5 of stringprep).
17354 * The bidirectional string testing used, if any (as described in
17355 Section 6 of stringprep).
17357 * Any additional characters that are prohibited as output specific
17358 to the profile.
17360 Stringprep discusses Unicode characters, whereas NFSv4.1 renders
17361 UTF-8 characters.
Since there is a one-to-one mapping from UTF-8 to 17362 Unicode, when the remainder of this document refers to Unicode, the 17363 reader should assume UTF-8. 17365 Much of the text for the profiles comes from RFC 3491 [20]. 17367 14.1. Stringprep Profile for the utf8str_cs Type 17369 Every use of the utf8str_cs type definition in the NFSv4 protocol 17370 specification follows the profile named nfs4_cs_prep. 17372 14.1.1. Intended Applicability of the nfs4_cs_prep Profile 17374 The utf8str_cs type is a case-sensitive string of UTF-8 characters. 17375 Its primary use in NFSv4.1 is for naming components and pathnames. 17376 Components and pathnames are stored on the server's file system. Two 17377 valid distinct UTF-8 strings might be the same after processing via 17378 the utf8str_cs profile. If the strings are two names inside a 17379 directory, the NFSv4.1 server will need to either: 17381 * disallow the creation of a second name if its post-processed form 17382 collides with that of an existing name, or 17384 * allow the creation of the second name, but arrange so that after 17385 post-processing, the second name is different than the post- 17386 processed form of the first name. 17388 14.1.2. Character Repertoire of nfs4_cs_prep 17390 The nfs4_cs_prep profile uses Unicode 3.2, as defined in stringprep's 17391 Appendix A.1. However, NFSv4.1 implementations are not limited to 17392 3.2. 17394 14.1.3. Mapping Used by nfs4_cs_prep 17396 The nfs4_cs_prep profile specifies mapping using the following tables 17397 from stringprep: 17399 Table B.1 17401 Table B.2 is normally not part of the nfs4_cs_prep profile as it is 17402 primarily for dealing with case-insensitive comparisons. However, if 17403 the NFSv4.1 file server supports the case_insensitive file system 17404 attribute, and if case_insensitive is TRUE, the NFSv4.1 server MUST 17405 use Table B.2 (in addition to Table B1) when processing utf8str_cs 17406 strings, and the NFSv4.1 client MUST assume Table B.2 (in addition to 17407 Table B.1) is being used. 17409 If the case_preserving attribute is present and set to FALSE, then 17410 the NFSv4.1 server MUST use Table B.2 to map case when processing 17411 utf8str_cs strings. Whether the server maps from lower to upper case 17412 or from upper to lower case is an implementation dependency. 17414 14.1.4. Normalization used by nfs4_cs_prep 17416 The nfs4_cs_prep profile does not specify a normalization form. A 17417 later revision of this specification may specify a particular 17418 normalization form. Therefore, the server and client can expect that 17419 they may receive unnormalized characters within protocol requests and 17420 responses. If the operating environment requires normalization, then 17421 the implementation must normalize utf8str_cs strings within the 17422 protocol before presenting the information to an application (at the 17423 client) or local file system (at the server). 17425 14.1.5. Prohibited Output for nfs4_cs_prep 17427 The nfs4_cs_prep profile RECOMMENDS prohibiting the use of the 17428 following tables from stringprep: 17430 Table C.5 17432 Table C.6 17434 14.1.6. Bidirectional Output for nfs4_cs_prep 17436 The nfs4_cs_prep profile does not specify any checking of 17437 bidirectional strings. 17439 14.2. Stringprep Profile for the utf8str_cis Type 17441 Every use of the utf8str_cis type definition in the NFSv4.1 protocol 17442 specification follows the profile named nfs4_cis_prep. 17444 14.2.1. 
Intended Applicability of the nfs4_cis_prep Profile
17446 The utf8str_cis type is a case-insensitive string of UTF-8
17447 characters. Its primary use in NFSv4.1 is for naming NFS servers.
17449 14.2.2. Character Repertoire of nfs4_cis_prep
17451 The nfs4_cis_prep profile uses Unicode 3.2, as defined in
17452 stringprep's Appendix A.1. However, NFSv4.1 implementations are not
17453 limited to 3.2.
17455 14.2.3. Mapping Used by nfs4_cis_prep
17457 The nfs4_cis_prep profile specifies mapping using the following
17458 tables from stringprep:
17460 Table B.1
17462 Table B.2
17464 14.2.4. Normalization Used by nfs4_cis_prep
17466 The nfs4_cis_prep profile specifies using Unicode normalization form
17467 KC, as described in stringprep.
17469 14.2.5. Prohibited Output for nfs4_cis_prep
17471 The nfs4_cis_prep profile specifies prohibiting using the following
17472 tables from stringprep:
17474 Table C.1.2
17476 Table C.2.2
17478 Table C.3
17480 Table C.4
17482 Table C.5
17484 Table C.6
17486 Table C.7
17488 Table C.8
17490 Table C.9
17492 14.2.6. Bidirectional Output for nfs4_cis_prep
17494 The nfs4_cis_prep profile specifies checking bidirectional strings as
17495 described in stringprep's Section 6.
17497 14.3. Stringprep Profile for the utf8str_mixed Type
17499 Every use of the utf8str_mixed type definition in the NFSv4.1
17500 protocol specification follows the profile named nfs4_mixed_prep.
17502 14.3.1. Intended Applicability of the nfs4_mixed_prep Profile
17504 The utf8str_mixed type is a string of UTF-8 characters, with a prefix
17505 that is case sensitive, a separator equal to '@', and a suffix that
17506 is a fully qualified domain name. Its primary use in NFSv4.1 is for
17507 naming principals identified in an Access Control Entry.
17509 14.3.2. Character Repertoire of nfs4_mixed_prep
17511 The nfs4_mixed_prep profile uses Unicode 3.2, as defined in
17512 stringprep's Appendix A.1. However, NFSv4.1 implementations are not
17513 limited to 3.2.
17515 14.3.3. Mapping Used by nfs4_mixed_prep
17517 For the prefix and the separator of a utf8str_mixed string, the
17518 nfs4_mixed_prep profile specifies mapping using the following table
17519 from stringprep:
17521 Table B.1
17523 For the suffix of a utf8str_mixed string, the nfs4_mixed_prep profile
17524 specifies mapping using the following tables from stringprep:
17526 Table B.1
17528 Table B.2
17530 14.3.4. Normalization Used by nfs4_mixed_prep
17532 The nfs4_mixed_prep profile specifies using Unicode normalization
17533 form KC, as described in stringprep.
17535 14.3.5. Prohibited Output for nfs4_mixed_prep
17537 The nfs4_mixed_prep profile specifies prohibiting using the following
17538 tables from stringprep:
17540 Table C.1.2
17542 Table C.2.2
17544 Table C.3
17546 Table C.4
17548 Table C.5
17550 Table C.6
17552 Table C.7
17554 Table C.8
17556 Table C.9
17558 14.3.6. Bidirectional Output for nfs4_mixed_prep
17560 The nfs4_mixed_prep profile specifies checking bidirectional strings
17561 as described in stringprep's Section 6.
17563 14.4. UTF-8 Capabilities
17565 const FSCHARSET_CAP4_CONTAINS_NON_UTF8 = 0x1;
17566 const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 = 0x2;
17568 typedef uint32_t fs_charset_cap4;
17570 Because some operating environments and file systems do not enforce
17571 character set encodings, NFSv4.1 supports the fs_charset_cap
17572 attribute (Section 5.8.2.11) that indicates to the client a file
17573 system's UTF-8 capabilities. The attribute is an integer containing
17574 a pair of flags.
The first flag is FSCHARSET_CAP4_CONTAINS_NON_UTF8, 17575 which, if set to one, tells the client that the file system contains 17576 non-UTF-8 characters, and the server will not convert non-UTF 17577 characters to UTF-8 if the client reads a symbolic link or directory, 17578 neither will operations with component names or pathnames in the 17579 arguments convert the strings to UTF-8. The second flag is 17580 FSCHARSET_CAP4_ALLOWS_ONLY_UTF8, which, if set to one, indicates that 17581 the server will accept (and generate) only UTF-8 characters on the 17582 file system. If FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to one, 17583 FSCHARSET_CAP4_CONTAINS_NON_UTF8 MUST be set to zero. 17584 FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 SHOULD always be set to one. 17586 14.5. UTF-8 Related Errors 17588 Where the client sends an invalid UTF-8 string, the server should 17589 return NFS4ERR_INVAL (see Table 11). This includes cases in which 17590 inappropriate prefixes are detected and where the count includes 17591 trailing bytes that do not constitute a full UCS character. 17593 Where the client-supplied string is valid UTF-8 but contains 17594 characters that are not supported by the server as a value for that 17595 string (e.g., names containing characters outside of Unicode plane 0 17596 on file systems that fail to support such characters despite their 17597 presence in the Unicode standard), the server should return 17598 NFS4ERR_BADCHAR. 17600 Where a UTF-8 string is used as a file name, and the file system 17601 (while supporting all of the characters within the name) does not 17602 allow that particular name to be used, the server should return the 17603 error NFS4ERR_BADNAME (Table 11). This includes situations in which 17604 the server file system imposes a normalization constraint on name 17605 strings, but will also include such situations as file system 17606 prohibitions of "." and ".." as file names for certain operations, 17607 and other such constraints. 17609 15. Error Values 17611 NFS error numbers are assigned to failed operations within a Compound 17612 (COMPOUND or CB_COMPOUND) request. A Compound request contains a 17613 number of NFS operations that have their results encoded in sequence 17614 in a Compound reply. The results of successful operations will 17615 consist of an NFS4_OK status followed by the encoded results of the 17616 operation. If an NFS operation fails, an error status will be 17617 entered in the reply and the Compound request will be terminated. 17619 15.1. 
Error Definitions 17621 +===================================+========+===================+ 17622 | Error | Number | Description | 17623 +===================================+========+===================+ 17624 | NFS4_OK | 0 | Section 15.1.3.1 | 17625 +-----------------------------------+--------+-------------------+ 17626 | NFS4ERR_ACCESS | 13 | Section 15.1.6.1 | 17627 +-----------------------------------+--------+-------------------+ 17628 | NFS4ERR_ATTRNOTSUPP | 10032 | Section 15.1.15.1 | 17629 +-----------------------------------+--------+-------------------+ 17630 | NFS4ERR_ADMIN_REVOKED | 10047 | Section 15.1.5.1 | 17631 +-----------------------------------+--------+-------------------+ 17632 | NFS4ERR_BACK_CHAN_BUSY | 10057 | Section 15.1.12.1 | 17633 +-----------------------------------+--------+-------------------+ 17634 | NFS4ERR_BADCHAR | 10040 | Section 15.1.7.1 | 17635 +-----------------------------------+--------+-------------------+ 17636 | NFS4ERR_BADHANDLE | 10001 | Section 15.1.2.1 | 17637 +-----------------------------------+--------+-------------------+ 17638 | NFS4ERR_BADIOMODE | 10049 | Section 15.1.10.1 | 17639 +-----------------------------------+--------+-------------------+ 17640 | NFS4ERR_BADLAYOUT | 10050 | Section 15.1.10.2 | 17641 +-----------------------------------+--------+-------------------+ 17642 | NFS4ERR_BADNAME | 10041 | Section 15.1.7.2 | 17643 +-----------------------------------+--------+-------------------+ 17644 | NFS4ERR_BADOWNER | 10039 | Section 15.1.15.2 | 17645 +-----------------------------------+--------+-------------------+ 17646 | NFS4ERR_BADSESSION | 10052 | Section 15.1.11.1 | 17647 +-----------------------------------+--------+-------------------+ 17648 | NFS4ERR_BADSLOT | 10053 | Section 15.1.11.2 | 17649 +-----------------------------------+--------+-------------------+ 17650 | NFS4ERR_BADTYPE | 10007 | Section 15.1.4.1 | 17651 +-----------------------------------+--------+-------------------+ 17652 | NFS4ERR_BADXDR | 10036 | Section 15.1.1.1 | 17653 +-----------------------------------+--------+-------------------+ 17654 | NFS4ERR_BAD_COOKIE | 10003 | Section 15.1.1.2 | 17655 +-----------------------------------+--------+-------------------+ 17656 | NFS4ERR_BAD_HIGH_SLOT | 10077 | Section 15.1.11.3 | 17657 +-----------------------------------+--------+-------------------+ 17658 | NFS4ERR_BAD_RANGE | 10042 | Section 15.1.8.1 | 17659 +-----------------------------------+--------+-------------------+ 17660 | NFS4ERR_BAD_SEQID | 10026 | Section 15.1.16.1 | 17661 +-----------------------------------+--------+-------------------+ 17662 | NFS4ERR_BAD_SESSION_DIGEST | 10051 | Section 15.1.12.2 | 17663 +-----------------------------------+--------+-------------------+ 17664 | NFS4ERR_BAD_STATEID | 10025 | Section 15.1.5.2 | 17665 +-----------------------------------+--------+-------------------+ 17666 | NFS4ERR_CB_PATH_DOWN | 10048 | Section 15.1.11.4 | 17667 +-----------------------------------+--------+-------------------+ 17668 | NFS4ERR_CLID_INUSE | 10017 | Section 15.1.13.2 | 17669 +-----------------------------------+--------+-------------------+ 17670 | NFS4ERR_CLIENTID_BUSY | 10074 | Section 15.1.13.1 | 17671 +-----------------------------------+--------+-------------------+ 17672 | NFS4ERR_COMPLETE_ALREADY | 10054 | Section 15.1.9.1 | 17673 +-----------------------------------+--------+-------------------+ 17674 | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | 10055 | Section 15.1.11.6 | 17675 
+-----------------------------------+--------+-------------------+ 17676 | NFS4ERR_DEADLOCK | 10045 | Section 15.1.8.2 | 17677 +-----------------------------------+--------+-------------------+ 17678 | NFS4ERR_DEADSESSION | 10078 | Section 15.1.11.5 | 17679 +-----------------------------------+--------+-------------------+ 17680 | NFS4ERR_DELAY | 10008 | Section 15.1.1.3 | 17681 +-----------------------------------+--------+-------------------+ 17682 | NFS4ERR_DELEG_ALREADY_WANTED | 10056 | Section 15.1.14.1 | 17683 +-----------------------------------+--------+-------------------+ 17684 | NFS4ERR_DELEG_REVOKED | 10087 | Section 15.1.5.3 | 17685 +-----------------------------------+--------+-------------------+ 17686 | NFS4ERR_DENIED | 10010 | Section 15.1.8.3 | 17687 +-----------------------------------+--------+-------------------+ 17688 | NFS4ERR_DIRDELEG_UNAVAIL | 10084 | Section 15.1.14.2 | 17689 +-----------------------------------+--------+-------------------+ 17690 | NFS4ERR_DQUOT | 69 | Section 15.1.4.2 | 17691 +-----------------------------------+--------+-------------------+ 17692 | NFS4ERR_ENCR_ALG_UNSUPP | 10079 | Section 15.1.13.3 | 17693 +-----------------------------------+--------+-------------------+ 17694 | NFS4ERR_EXIST | 17 | Section 15.1.4.3 | 17695 +-----------------------------------+--------+-------------------+ 17696 | NFS4ERR_EXPIRED | 10011 | Section 15.1.5.4 | 17697 +-----------------------------------+--------+-------------------+ 17698 | NFS4ERR_FBIG | 27 | Section 15.1.4.4 | 17699 +-----------------------------------+--------+-------------------+ 17700 | NFS4ERR_FHEXPIRED | 10014 | Section 15.1.2.2 | 17701 +-----------------------------------+--------+-------------------+ 17702 | NFS4ERR_FILE_OPEN | 10046 | Section 15.1.4.5 | 17703 +-----------------------------------+--------+-------------------+ 17704 | NFS4ERR_GRACE | 10013 | Section 15.1.9.2 | 17705 +-----------------------------------+--------+-------------------+ 17706 | NFS4ERR_HASH_ALG_UNSUPP | 10072 | Section 15.1.13.4 | 17707 +-----------------------------------+--------+-------------------+ 17708 | NFS4ERR_INVAL | 22 | Section 15.1.1.4 | 17709 +-----------------------------------+--------+-------------------+ 17710 | NFS4ERR_IO | 5 | Section 15.1.4.6 | 17711 +-----------------------------------+--------+-------------------+ 17712 | NFS4ERR_ISDIR | 21 | Section 15.1.2.3 | 17713 +-----------------------------------+--------+-------------------+ 17714 | NFS4ERR_LAYOUTTRYLATER | 10058 | Section 15.1.10.3 | 17715 +-----------------------------------+--------+-------------------+ 17716 | NFS4ERR_LAYOUTUNAVAILABLE | 10059 | Section 15.1.10.4 | 17717 +-----------------------------------+--------+-------------------+ 17718 | NFS4ERR_LEASE_MOVED | 10031 | Section 15.1.16.2 | 17719 +-----------------------------------+--------+-------------------+ 17720 | NFS4ERR_LOCKED | 10012 | Section 15.1.8.4 | 17721 +-----------------------------------+--------+-------------------+ 17722 | NFS4ERR_LOCKS_HELD | 10037 | Section 15.1.8.5 | 17723 +-----------------------------------+--------+-------------------+ 17724 | NFS4ERR_LOCK_NOTSUPP | 10043 | Section 15.1.8.6 | 17725 +-----------------------------------+--------+-------------------+ 17726 | NFS4ERR_LOCK_RANGE | 10028 | Section 15.1.8.7 | 17727 +-----------------------------------+--------+-------------------+ 17728 | NFS4ERR_MINOR_VERS_MISMATCH | 10021 | Section 15.1.3.2 | 17729 +-----------------------------------+--------+-------------------+ 17730 | NFS4ERR_MLINK 
| 31 | Section 15.1.4.7 | 17731 +-----------------------------------+--------+-------------------+ 17732 | NFS4ERR_MOVED | 10019 | Section 15.1.2.4 | 17733 +-----------------------------------+--------+-------------------+ 17734 | NFS4ERR_NAMETOOLONG | 63 | Section 15.1.7.3 | 17735 +-----------------------------------+--------+-------------------+ 17736 | NFS4ERR_NOENT | 2 | Section 15.1.4.8 | 17737 +-----------------------------------+--------+-------------------+ 17738 | NFS4ERR_NOFILEHANDLE | 10020 | Section 15.1.2.5 | 17739 +-----------------------------------+--------+-------------------+ 17740 | NFS4ERR_NOMATCHING_LAYOUT | 10060 | Section 15.1.10.5 | 17741 +-----------------------------------+--------+-------------------+ 17742 | NFS4ERR_NOSPC | 28 | Section 15.1.4.9 | 17743 +-----------------------------------+--------+-------------------+ 17744 | NFS4ERR_NOTDIR | 20 | Section 15.1.2.6 | 17745 +-----------------------------------+--------+-------------------+ 17746 | NFS4ERR_NOTEMPTY | 66 | Section 15.1.4.10 | 17747 +-----------------------------------+--------+-------------------+ 17748 | NFS4ERR_NOTSUPP | 10004 | Section 15.1.1.5 | 17749 +-----------------------------------+--------+-------------------+ 17750 | NFS4ERR_NOT_ONLY_OP | 10081 | Section 15.1.3.3 | 17751 +-----------------------------------+--------+-------------------+ 17752 | NFS4ERR_NOT_SAME | 10027 | Section 15.1.15.3 | 17753 +-----------------------------------+--------+-------------------+ 17754 | NFS4ERR_NO_GRACE | 10033 | Section 15.1.9.3 | 17755 +-----------------------------------+--------+-------------------+ 17756 | NFS4ERR_NXIO | 6 | Section 15.1.16.3 | 17757 +-----------------------------------+--------+-------------------+ 17758 | NFS4ERR_OLD_STATEID | 10024 | Section 15.1.5.5 | 17759 +-----------------------------------+--------+-------------------+ 17760 | NFS4ERR_OPENMODE | 10038 | Section 15.1.8.8 | 17761 +-----------------------------------+--------+-------------------+ 17762 | NFS4ERR_OP_ILLEGAL | 10044 | Section 15.1.3.4 | 17763 +-----------------------------------+--------+-------------------+ 17764 | NFS4ERR_OP_NOT_IN_SESSION | 10071 | Section 15.1.3.5 | 17765 +-----------------------------------+--------+-------------------+ 17766 | NFS4ERR_PERM | 1 | Section 15.1.6.2 | 17767 +-----------------------------------+--------+-------------------+ 17768 | NFS4ERR_PNFS_IO_HOLE | 10075 | Section 15.1.10.6 | 17769 +-----------------------------------+--------+-------------------+ 17770 | NFS4ERR_PNFS_NO_LAYOUT | 10080 | Section 15.1.10.7 | 17771 +-----------------------------------+--------+-------------------+ 17772 | NFS4ERR_RECALLCONFLICT | 10061 | Section 15.1.14.3 | 17773 +-----------------------------------+--------+-------------------+ 17774 | NFS4ERR_RECLAIM_BAD | 10034 | Section 15.1.9.4 | 17775 +-----------------------------------+--------+-------------------+ 17776 | NFS4ERR_RECLAIM_CONFLICT | 10035 | Section 15.1.9.5 | 17777 +-----------------------------------+--------+-------------------+ 17778 | NFS4ERR_REJECT_DELEG | 10085 | Section 15.1.14.4 | 17779 +-----------------------------------+--------+-------------------+ 17780 | NFS4ERR_REP_TOO_BIG | 10066 | Section 15.1.3.6 | 17781 +-----------------------------------+--------+-------------------+ 17782 | NFS4ERR_REP_TOO_BIG_TO_CACHE | 10067 | Section 15.1.3.7 | 17783 +-----------------------------------+--------+-------------------+ 17784 | NFS4ERR_REQ_TOO_BIG | 10065 | Section 15.1.3.8 | 17785 
+-----------------------------------+--------+-------------------+ 17786 | NFS4ERR_RESTOREFH | 10030 | Section 15.1.16.4 | 17787 +-----------------------------------+--------+-------------------+ 17788 | NFS4ERR_RETRY_UNCACHED_REP | 10068 | Section 15.1.3.9 | 17789 +-----------------------------------+--------+-------------------+ 17790 | NFS4ERR_RETURNCONFLICT | 10086 | Section 15.1.10.8 | 17791 +-----------------------------------+--------+-------------------+ 17792 | NFS4ERR_ROFS | 30 | Section 15.1.4.11 | 17793 +-----------------------------------+--------+-------------------+ 17794 | NFS4ERR_SAME | 10009 | Section 15.1.15.4 | 17795 +-----------------------------------+--------+-------------------+ 17796 | NFS4ERR_SHARE_DENIED | 10015 | Section 15.1.8.9 | 17797 +-----------------------------------+--------+-------------------+ 17798 | NFS4ERR_SEQUENCE_POS | 10064 | Section 15.1.3.10 | 17799 +-----------------------------------+--------+-------------------+ 17800 | NFS4ERR_SEQ_FALSE_RETRY | 10076 | Section 15.1.11.7 | 17801 +-----------------------------------+--------+-------------------+ 17802 | NFS4ERR_SEQ_MISORDERED | 10063 | Section 15.1.11.8 | 17803 +-----------------------------------+--------+-------------------+ 17804 | NFS4ERR_SERVERFAULT | 10006 | Section 15.1.1.6 | 17805 +-----------------------------------+--------+-------------------+ 17806 | NFS4ERR_STALE | 70 | Section 15.1.2.7 | 17807 +-----------------------------------+--------+-------------------+ 17808 | NFS4ERR_STALE_CLIENTID | 10022 | Section 15.1.13.5 | 17809 +-----------------------------------+--------+-------------------+ 17810 | NFS4ERR_STALE_STATEID | 10023 | Section 15.1.16.5 | 17811 +-----------------------------------+--------+-------------------+ 17812 | NFS4ERR_SYMLINK | 10029 | Section 15.1.2.8 | 17813 +-----------------------------------+--------+-------------------+ 17814 | NFS4ERR_TOOSMALL | 10005 | Section 15.1.1.7 | 17815 +-----------------------------------+--------+-------------------+ 17816 | NFS4ERR_TOO_MANY_OPS | 10070 | Section 15.1.3.11 | 17817 +-----------------------------------+--------+-------------------+ 17818 | NFS4ERR_UNKNOWN_LAYOUTTYPE | 10062 | Section 15.1.10.9 | 17819 +-----------------------------------+--------+-------------------+ 17820 | NFS4ERR_UNSAFE_COMPOUND | 10069 | Section 15.1.3.12 | 17821 +-----------------------------------+--------+-------------------+ 17822 | NFS4ERR_WRONGSEC | 10016 | Section 15.1.6.3 | 17823 +-----------------------------------+--------+-------------------+ 17824 | NFS4ERR_WRONG_CRED | 10082 | Section 15.1.6.4 | 17825 +-----------------------------------+--------+-------------------+ 17826 | NFS4ERR_WRONG_TYPE | 10083 | Section 15.1.2.9 | 17827 +-----------------------------------+--------+-------------------+ 17828 | NFS4ERR_XDEV | 18 | Section 15.1.4.12 | 17829 +-----------------------------------+--------+-------------------+ 17831 Table 11: Protocol Error Definitions 17833 15.1.1. General Errors 17835 This section deals with errors that are applicable to a broad set of 17836 different purposes. 17838 15.1.1.1. NFS4ERR_BADXDR (Error Code 10036) 17840 The arguments for this operation do not match those specified in the 17841 XDR definition. This includes situations in which the request ends 17842 before all the arguments have been seen. Note that this error 17843 applies when fixed enumerations (these include booleans) have a value 17844 within the input stream that is not valid for the enum. 
A replier 17845 may pre-parse all operations for a Compound procedure before doing 17846 any operation execution and return RPC-level XDR errors in that case. 17848 15.1.1.2. NFS4ERR_BAD_COOKIE (Error Code 10003) 17850 Used for operations that provide a set of information indexed by some 17851 quantity provided by the client or cookie sent by the server for an 17852 earlier invocation. Where the value cannot be used for its intended 17853 purpose, this error results. 17855 15.1.1.3. NFS4ERR_DELAY (Error Code 10008) 17857 For any of a number of reasons, the replier could not process this 17858 operation in what was deemed a reasonable time. The client should 17859 wait and then try the request with a new slot and sequence value. 17861 Some examples of scenarios that might lead to this situation: 17863 * A server that supports hierarchical storage receives a request to 17864 process a file that had been migrated. 17866 * An operation requires a delegation recall to proceed, but the need 17867 to wait for this delegation to be recalled and returned makes 17868 processing this request in a timely fashion impossible. 17870 * A request is being performed on a session being migrated from 17871 another server as described in Section 11.14.3, and the lack of 17872 full information about the state of the session on the source 17873 makes it impossible to process the request immediately. 17875 In such cases, returning the error NFS4ERR_DELAY allows necessary 17876 preparatory operations to proceed without holding up requester 17877 resources such as a session slot. After delaying for period of time, 17878 the client can then re-send the operation in question, often as part 17879 of a nearly identical request. Because of the need to avoid spurious 17880 reissues of non-idempotent operations and to avoid acting in response 17881 to NFS4ERR_DELAY errors returned on responses returned from the 17882 replier's reply cache, integration with the session-provided reply 17883 cache is necessary. There are a number of cases to deal with, each 17884 of which requires different sorts of handling by the requester and 17885 replier: 17887 * If NFS4ERR_DELAY is returned on a SEQUENCE operation, the request 17888 is retried in full with the SEQUENCE operation containing the same 17889 slot and sequence values. In this case, the replier MUST avoid 17890 returning a response containing NFS4ERR_DELAY as the response to 17891 SEQUENCE solely because an earlier instance of the same request 17892 returned that error and it was stored in the reply cache. If the 17893 replier did this, the retries would not be effective as there 17894 would be no opportunity for the replier to see whether the 17895 condition that generated the NFS4ERR_DELAY had been rectified 17896 during the interim between the original request and the retry. 17898 * If NFS4ERR_DELAY is returned on an operation other than SEQUENCE 17899 that validly appears as the first operation of a request, the 17900 handling is similar. The request can be retried in full without 17901 modification. In this case as well, the replier MUST avoid 17902 returning a response containing NFS4ERR_DELAY as the response to 17903 an initial operation of a request solely on the basis of its 17904 presence in the reply cache. 
If the replier did this, the retries 17905 would not be effective as there would be no opportunity for the 17906 replier to see whether the condition that generated the 17907 NFS4ERR_DELAY had been rectified during the interim between the 17908 original request and the retry. 17910 * If NFS4ERR_DELAY is returned on an operation other than the first 17911 in the request, the request when retried MUST contain a SEQUENCE 17912 operation that is different than the original one, with either the 17913 slot ID or the sequence value different from that in the original 17914 request. Because requesters do this, there is no need for the 17915 replier to take special care to avoid returning an NFS4ERR_DELAY 17916 error obtained from the reply cache. When no non-idempotent 17917 operations have been processed before the NFS4ERR_DELAY was 17918 returned, the requester should retry the request in full, with the 17919 only difference from the original request being the modification 17920 to the slot ID or sequence value in the reissued SEQUENCE 17921 operation. 17923 * When NFS4ERR_DELAY is returned on an operation other than the 17924 first within a request and there has been a non-idempotent 17925 operation processed before the NFS4ERR_DELAY was returned, 17926 reissuing the request as is normally done would incorrectly cause 17927 the re-execution of the non-idempotent operation. 17929 To avoid this situation, the client should reissue the request 17930 without the non-idempotent operation. The request still must use 17931 a SEQUENCE operation with either a different slot ID or sequence 17932 value from the SEQUENCE in the original request. Because this is 17933 done, there is no way the replier could avoid spuriously re- 17934 executing the non-idempotent operation since the different 17935 SEQUENCE parameters prevent the requester from recognizing that 17936 the non-idempotent operation is being retried. 17938 Note that without the ability to return NFS4ERR_DELAY and the 17939 requester's willingness to re-send when receiving it, deadlock might 17940 result. For example, if a recall is done, and if the delegation 17941 return or operations preparatory to delegation return are held up by 17942 other operations that need the delegation to be returned, session 17943 slots might not be available. The result could be deadlock. 17945 15.1.1.4. NFS4ERR_INVAL (Error Code 22) 17947 The arguments for this operation are not valid for some reason, even 17948 though they do match those specified in the XDR definition for the 17949 request. 17951 15.1.1.5. NFS4ERR_NOTSUPP (Error Code 10004) 17953 Operation not supported, either because the operation is an OPTIONAL 17954 one and is not supported by this server or because the operation MUST 17955 NOT be implemented in the current minor version. 17957 15.1.1.6. NFS4ERR_SERVERFAULT (Error Code 10006) 17959 An error occurred on the server that does not map to any of the 17960 specific legal NFSv4.1 protocol error values. The client should 17961 translate this into an appropriate error. UNIX clients may choose to 17962 translate this to EIO. 17964 15.1.1.7. NFS4ERR_TOOSMALL (Error Code 10005) 17966 Used where an operation returns a variable amount of data, with a 17967 limit specified by the client. Where the data returned cannot be fit 17968 within the limit specified by the client, this error results. 17970 15.1.2. 
Filehandle Errors 17972 These errors deal with the situation in which the current or saved 17973 filehandle, or the filehandle passed to PUTFH intended to become the 17974 current filehandle, is invalid in some way. This includes situations 17975 in which the filehandle is a valid filehandle in general but is not 17976 of the appropriate object type for the current operation. 17978 Where the error description indicates a problem with the current or 17979 saved filehandle, it is to be understood that filehandles are only 17980 checked for the condition if they are implicit arguments of the 17981 operation in question. 17983 15.1.2.1. NFS4ERR_BADHANDLE (Error Code 10001) 17985 Illegal NFS filehandle for the current server. The current 17986 filehandle failed internal consistency checks. Once accepted as 17987 valid (by PUTFH), no subsequent status change can cause the 17988 filehandle to generate this error. 17990 15.1.2.2. NFS4ERR_FHEXPIRED (Error Code 10014) 17992 A current or saved filehandle that is an argument to the current 17993 operation is volatile and has expired at the server. 17995 15.1.2.3. NFS4ERR_ISDIR (Error Code 21) 17997 The current or saved filehandle designates a directory when the 17998 current operation does not allow a directory to be accepted as the 17999 target of this operation. 18001 15.1.2.4. NFS4ERR_MOVED (Error Code 10019) 18003 The file system that contains the current filehandle object is not 18004 present at the server or is not accessible with the network address 18005 used. It may have been made accessible on a different set of network 18006 addresses, relocated or migrated to another server, or it may have 18007 never been present. The client may obtain the new file system 18008 location by obtaining the fs_locations or fs_locations_info attribute 18009 for the current filehandle. For further discussion, refer to 18010 Section 11.3. 18012 As with the case of NFS4ERR_DELAY, it is possible that one or more 18013 non-idempotent operations may have been successfully executed within 18014 a COMPOUND before NFS4ERR_MOVED is returned. Because of this, once 18015 the new location is determined, the original request that received 18016 the NFS4ERR_MOVED should not be re-executed in full. Instead, the 18017 client should send a new COMPOUND with any successfully executed non- 18018 idempotent operations removed. When the client uses the same session 18019 for the new COMPOUND, its SEQUENCE operation should use a different 18020 slot ID or sequence. 18022 15.1.2.5. NFS4ERR_NOFILEHANDLE (Error Code 10020) 18024 The logical current or saved filehandle value is required by the 18025 current operation and is not set. This may be a result of a 18026 malformed COMPOUND operation (i.e., no PUTFH or PUTROOTFH before an 18027 operation that requires the current filehandle be set). 18029 15.1.2.6. NFS4ERR_NOTDIR (Error Code 20) 18031 The current (or saved) filehandle designates an object that is not a 18032 directory for an operation in which a directory is required. 18034 15.1.2.7. NFS4ERR_STALE (Error Code 70) 18036 The current or saved filehandle value designating an argument to the 18037 current operation is invalid. The file referred to by that 18038 filehandle no longer exists or access to it has been revoked. 18040 15.1.2.8. NFS4ERR_SYMLINK (Error Code 10029) 18042 The current filehandle designates a symbolic link when the current 18043 operation does not allow a symbolic link as the target. 18045 15.1.2.9. 
NFS4ERR_WRONG_TYPE (Error Code 10083) 18047 The current (or saved) filehandle designates an object that is of an 18048 invalid type for the current operation, and there is no more specific 18049 error (such as NFS4ERR_ISDIR or NFS4ERR_SYMLINK) that applies. Note 18050 that in NFSv4.0, such situations generally resulted in the less- 18051 specific error NFS4ERR_INVAL. 18053 15.1.3. Compound Structure Errors 18055 This section deals with errors that relate to the overall structure 18056 of a Compound request (by which we mean to include both COMPOUND and 18057 CB_COMPOUND), rather than to particular operations. 18059 There are a number of basic constraints on the operations that may 18060 appear in a Compound request. Sessions add to these basic 18061 constraints by requiring a Sequence operation (either SEQUENCE or 18062 CB_SEQUENCE) at the start of the Compound. 18064 15.1.3.1. NFS_OK (Error code 0) 18066 Indicates the operation completed successfully, in that all of the 18067 constituent operations completed without error. 18069 15.1.3.2. NFS4ERR_MINOR_VERS_MISMATCH (Error code 10021) 18071 The minor version specified is not one that the current listener 18072 supports. This value is returned in the overall status for the 18073 Compound but is not associated with a specific operation since the 18074 results will specify a result count of zero. 18076 15.1.3.3. NFS4ERR_NOT_ONLY_OP (Error Code 10081) 18078 Certain operations, which are allowed to be executed outside of a 18079 session, MUST be the only operation within a Compound whenever the 18080 Compound does not start with a Sequence operation. This error 18081 results when that constraint is not met. 18083 15.1.3.4. NFS4ERR_OP_ILLEGAL (Error Code 10044) 18085 The operation code is not a valid one for the current Compound 18086 procedure. The opcode in the result stream matched with this error 18087 is the ILLEGAL value, although the value that appears in the request 18088 stream may be different. Where an illegal value appears and the 18089 replier pre-parses all operations for a Compound procedure before 18090 doing any operation execution, an RPC-level XDR error may be 18091 returned. 18093 15.1.3.5. NFS4ERR_OP_NOT_IN_SESSION (Error Code 10071) 18095 Most forward operations and all callback operations are only valid 18096 within the context of a session, so that the Compound request in 18097 question MUST begin with a Sequence operation. If an attempt is made 18098 to execute these operations outside the context of session, this 18099 error results. 18101 15.1.3.6. NFS4ERR_REP_TOO_BIG (Error Code 10066) 18103 The reply to a Compound would exceed the channel's negotiated maximum 18104 response size. 18106 15.1.3.7. NFS4ERR_REP_TOO_BIG_TO_CACHE (Error Code 10067) 18108 The reply to a Compound would exceed the channel's negotiated maximum 18109 size for replies cached in the reply cache when the Sequence for the 18110 current request specifies that this request is to be cached. 18112 15.1.3.8. NFS4ERR_REQ_TOO_BIG (Error Code 10065) 18114 The Compound request exceeds the channel's negotiated maximum size 18115 for requests. 18117 15.1.3.9. NFS4ERR_RETRY_UNCACHED_REP (Error Code 10068) 18119 The requester has attempted a retry of a Compound that it previously 18120 requested not be placed in the reply cache. 18122 15.1.3.10. NFS4ERR_SEQUENCE_POS (Error Code 10064) 18124 A Sequence operation appeared in a position other than the first 18125 operation of a Compound request. 18127 15.1.3.11. 
18127 15.1.3.11. NFS4ERR_TOO_MANY_OPS (Error Code 10070) 18129 The Compound request has too many operations, exceeding the count 18130 negotiated when the session was created. 18132 15.1.3.12. NFS4ERR_UNSAFE_COMPOUND (Error Code 10069) 18134 The client has sent a COMPOUND request with an unsafe mix of 18135 operations -- specifically, with a non-idempotent operation that 18136 changes the current filehandle and that is not followed by a GETFH. 18138 15.1.4. File System Errors 18140 These errors describe situations that occurred in the underlying file 18141 system implementation rather than in the protocol or any NFSv4.x 18142 feature. 18144 15.1.4.1. NFS4ERR_BADTYPE (Error Code 10007) 18146 An attempt was made to create an object with an inappropriate type 18147 specified to CREATE. This may be because the type is undefined, 18148 because the type is not supported by the server, or because the type 18149 is not intended to be created by CREATE (such as a regular file or 18150 named attribute, for which OPEN is used to do the file creation). 18152 15.1.4.2. NFS4ERR_DQUOT (Error Code 69) 18154 Resource (quota) hard limit exceeded. The user's resource limit on 18155 the server has been exceeded. 18157 15.1.4.3. NFS4ERR_EXIST (Error Code 17) 18159 A file of the specified target name (when creating, renaming, or 18160 linking) already exists. 18162 15.1.4.4. NFS4ERR_FBIG (Error Code 27) 18164 The file is too large. The operation would have caused the file to 18165 grow beyond the server's limit. 18167 15.1.4.5. NFS4ERR_FILE_OPEN (Error Code 10046) 18169 The operation is not allowed because a file involved in the operation 18170 is currently open. Servers may, but are not required to, disallow 18171 linking-to, removing, or renaming open files. 18173 15.1.4.6. NFS4ERR_IO (Error Code 5) 18175 Indicates that an I/O error occurred for which the file system was 18176 unable to provide recovery. 18178 15.1.4.7. NFS4ERR_MLINK (Error Code 31) 18180 The request would have caused the server's limit for the number of 18181 hard links a file may have to be exceeded. 18183 15.1.4.8. NFS4ERR_NOENT (Error Code 2) 18185 Indicates no such file or directory. The file or directory name 18186 specified does not exist. 18188 15.1.4.9. NFS4ERR_NOSPC (Error Code 28) 18190 Indicates there is no space left on the device. The operation would 18191 have caused the server's file system to exceed its limit. 18193 15.1.4.10. NFS4ERR_NOTEMPTY (Error Code 66) 18195 An attempt was made to remove a directory that was not empty. 18197 15.1.4.11. NFS4ERR_ROFS (Error Code 30) 18199 Indicates a read-only file system. A modifying operation was 18200 attempted on a read-only file system. 18202 15.1.4.12. NFS4ERR_XDEV (Error Code 18) 18204 Indicates an attempt to do an operation, such as linking, that 18205 inappropriately crosses a boundary. This may be due to such 18206 boundaries as: 18208 * that between file systems (where the fsids are different). 18210 * that between different named attribute directories or between a 18211 named attribute directory and an ordinary directory. 18213 * that between byte-ranges of a file system that the file system 18214 implementation treats as separate (for example, for space 18215 accounting purposes), and where cross-connection between the byte- 18216 ranges is not allowed.
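The file system errors above generally correspond to conditions reported by the server's local file system.  The following sketch is illustrative only: the particular errno-to-error mapping shown is an assumption of the example and not a requirement of this specification, the errno names used are the POSIX names provided by Python's errno module on typical server platforms, and errors such as NFS4ERR_BADTYPE or NFS4ERR_FILE_OPEN, which reflect protocol-level or policy decisions, are not produced this way.

   <CODE BEGINS>
   # Illustrative sketch: one way a server might translate a local
   # errno value into the file system errors of Section 15.1.4.  The
   # numeric values are the NFSv4.1 error codes listed above.
   import errno

   NFS4ERR_SERVERFAULT = 10006   # fallback when no protocol error applies

   ERRNO_TO_NFS4ERR = {
       errno.EDQUOT:    69,      # NFS4ERR_DQUOT
       errno.EEXIST:    17,      # NFS4ERR_EXIST
       errno.EFBIG:     27,      # NFS4ERR_FBIG
       errno.EIO:        5,      # NFS4ERR_IO
       errno.EMLINK:    31,      # NFS4ERR_MLINK
       errno.ENOENT:     2,      # NFS4ERR_NOENT
       errno.ENOSPC:    28,      # NFS4ERR_NOSPC
       errno.ENOTEMPTY: 66,      # NFS4ERR_NOTEMPTY
       errno.EROFS:     30,      # NFS4ERR_ROFS
       errno.EXDEV:     18,      # NFS4ERR_XDEV
   }

   def map_local_error(err):
       """Translate an OS error number into an NFSv4.1 error code."""
       return ERRNO_TO_NFS4ERR.get(err, NFS4ERR_SERVERFAULT)
   <CODE ENDS>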
18218 15.1.5. State Management Errors 18220 These errors indicate problems with the stateid (or one of the 18221 stateids) passed to a given operation. This includes situations in 18222 which the stateid is invalid as well as situations in which the 18223 stateid is valid but designates locking state that has been revoked. 18224 Depending on the operation, the stateid when valid may designate 18225 opens, byte-range locks, file or directory delegations, layouts, or 18226 device maps. 18228 15.1.5.1. NFS4ERR_ADMIN_REVOKED (Error Code 10047) 18230 A stateid designates locking state of any type that has been revoked 18231 due to administrative interaction, possibly while the lease is valid. 18233 15.1.5.2. NFS4ERR_BAD_STATEID (Error Code 10025) 18235 A stateid does not properly designate any valid state. See Sections 18236 8.2.4 and 8.2.3 for a discussion of how stateids are validated. 18238 15.1.5.3. NFS4ERR_DELEG_REVOKED (Error Code 10087) 18240 A stateid designates recallable locking state of any type (delegation 18241 or layout) that has been revoked due to the failure of the client to 18242 return the lock when it was recalled. 18244 15.1.5.4. NFS4ERR_EXPIRED (Error Code 10011) 18246 A stateid designates locking state of any type that has been revoked 18247 due to expiration of the client's lease, either immediately upon 18248 lease expiration, or following a later request for a conflicting 18249 lock. 18251 15.1.5.5. NFS4ERR_OLD_STATEID (Error Code 10024) 18253 A stateid with a non-zero seqid value does not match the current seqid 18254 for the state designated by the user. 18256 15.1.6. Security Errors 18258 These are the various permission-related errors in NFSv4.1. 18260 15.1.6.1. NFS4ERR_ACCESS (Error Code 13) 18262 Indicates permission denied. The caller does not have the correct 18263 permission to perform the requested operation. Contrast this with 18264 NFS4ERR_PERM (Section 15.1.6.2), which restricts itself to owner or 18265 privileged-user permission failures, and NFS4ERR_WRONG_CRED 18266 (Section 15.1.6.4), which deals with appropriate permission to delete 18267 or modify transient objects based on the credentials of the user that 18268 created them. 18270 15.1.6.2. NFS4ERR_PERM (Error Code 1) 18272 Indicates requester is not the owner. The operation was not allowed 18273 because the caller is neither a privileged user (root) nor the owner 18274 of the target of the operation. 18276 15.1.6.3. NFS4ERR_WRONGSEC (Error Code 10016) 18278 Indicates that the security mechanism being used by the client for 18279 the operation does not match the server's security policy. The 18280 client should change the security mechanism being used and re-send 18281 the operation (but not with the same slot ID and sequence ID; one or 18282 both MUST be different on the re-send). SECINFO and SECINFO_NO_NAME 18283 can be used to determine the appropriate mechanism. 18285 15.1.6.4. NFS4ERR_WRONG_CRED (Error Code 10082) 18287 An operation that manipulates state was attempted by a principal that 18288 was not allowed to modify that piece of state. 18290 15.1.7. Name Errors 18292 Names in NFSv4 are UTF-8 strings. When the strings are not valid 18293 UTF-8 or are of length zero, the error NFS4ERR_INVAL results. 18294 Besides this, there are a number of other errors to indicate specific 18295 problems with names. 18297 15.1.7.1. NFS4ERR_BADCHAR (Error Code 10040) 18299 A UTF-8 string contains a character that is not supported by the 18300 server in the context in which it is being used.
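To illustrate the distinction drawn above, the following sketch shows how a server might screen a single component name, returning NFS4ERR_INVAL for a name that is empty or not valid UTF-8 and NFS4ERR_BADCHAR for a name containing a character the server does not support.  The sketch is illustrative only; the rejection of NUL and "/" is a hypothetical server policy used for the example, not a protocol requirement, and checks such as those behind NFS4ERR_BADNAME and NFS4ERR_NAMETOOLONG (described below) would follow.

   <CODE BEGINS>
   # Illustrative sketch: screening a component name per Section 15.1.7.
   # The characters rejected here are a hypothetical server policy.

   NFS4_OK         = 0
   NFS4ERR_INVAL   = 22
   NFS4ERR_BADCHAR = 10040

   UNSUPPORTED_CHARS = {"\x00", "/"}    # hypothetical server policy

   def check_component_name(name_bytes):
       """Return NFS4_OK or a name-related error for the bytes given."""
       if len(name_bytes) == 0:
           return NFS4ERR_INVAL          # zero-length name
       try:
           name = name_bytes.decode("utf-8")
       except UnicodeDecodeError:
           return NFS4ERR_INVAL          # not valid UTF-8
       if any(c in UNSUPPORTED_CHARS for c in name):
           return NFS4ERR_BADCHAR        # character unsupported by the server
       return NFS4_OK
   <CODE ENDS>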
18302 15.1.7.2. NFS4ERR_BADNAME (Error Code 10041) 18304 A name string in a request consisted of valid UTF-8 characters 18305 supported by the server, but the name is not supported by the server 18306 as a valid name for the current operation. An example might be 18307 creating a file or directory named ".." on a server whose file system 18308 uses that name for links to parent directories. 18310 15.1.7.3. NFS4ERR_NAMETOOLONG (Error Code 63) 18312 Returned when the filename in an operation exceeds the server's 18313 implementation limit. 18315 15.1.8. Locking Errors 18317 This section deals with errors related to locking, both as to share 18318 reservations and byte-range locking. It does not deal with errors 18319 specific to the process of reclaiming locks. Those are dealt with in 18320 Section 15.1.9. 18322 15.1.8.1. NFS4ERR_BAD_RANGE (Error Code 10042) 18324 The byte-range of a LOCK, LOCKT, or LOCKU operation is not allowed by 18325 the server. For example, this error results when a server that only 18326 supports 32-bit ranges receives a range that cannot be handled by 18327 that server. (See Section 18.10.3.) 18329 15.1.8.2. NFS4ERR_DEADLOCK (Error Code 10045) 18331 The server has been able to determine a byte-range locking deadlock 18332 condition for a READW_LT or WRITEW_LT LOCK operation. 18334 15.1.8.3. NFS4ERR_DENIED (Error Code 10010) 18336 An attempt to lock a file is denied. Since this may be a temporary 18337 condition, the client is encouraged to re-send the lock request (but 18338 not with the same slot ID and sequence ID; one or both MUST be 18339 different on the re-send) until the lock is accepted. See 18340 Section 9.6 for a discussion of the re-send. 18342 15.1.8.4. NFS4ERR_LOCKED (Error Code 10012) 18344 A READ or WRITE operation was attempted on a file where there was a 18345 conflict between the I/O and an existing lock: 18347 * There is a share reservation inconsistent with the I/O being done. 18349 * The range to be read or written intersects an existing mandatory 18350 byte-range lock. 18352 15.1.8.5. NFS4ERR_LOCKS_HELD (Error Code 10037) 18354 An operation was prevented by the unexpected presence of locks. 18356 15.1.8.6. NFS4ERR_LOCK_NOTSUPP (Error Code 10043) 18358 A LOCK operation was attempted that would require the upgrade or 18359 downgrade of a byte-range lock range already held by the owner, and 18360 the server does not support atomic upgrade or downgrade of locks. 18362 15.1.8.7. NFS4ERR_LOCK_RANGE (Error Code 10028) 18364 A LOCK operation is operating on a range that overlaps in part a 18365 currently held byte-range lock for the current lock-owner and does 18366 not precisely match a single such byte-range lock, in a case in which 18367 the server does not support this type of request and thus does not 18368 implement POSIX locking semantics [21]. See Sections 18.10.4, 18.11.4, and 18369 18.12.4 for a discussion of how this applies to LOCK, LOCKT, and 18370 LOCKU respectively. 18372 15.1.8.8. NFS4ERR_OPENMODE (Error Code 10038) 18374 The client attempted a READ, WRITE, LOCK, or other operation not 18375 sanctioned by the stateid passed (e.g., writing to a file opened for 18376 read-only access). 18378 15.1.8.9. NFS4ERR_SHARE_DENIED (Error Code 10015) 18380 An attempt to OPEN a file with a share reservation has failed because 18381 of a share conflict. 18383 15.1.9. Reclaim Errors 18385 These errors relate to the process of reclaiming locks after a server 18386 restart. 18388 15.1.9.1.
NFS4ERR_COMPLETE_ALREADY (Error Code 10054) 18390 The client previously sent a successful RECLAIM_COMPLETE operation 18391 specifying the same scope, whether that scope is global or for the 18392 same file system in the case of a per-fs RECLAIM_COMPLETE. An 18393 additional RECLAIM_COMPLETE operation is not necessary and results in 18394 this error. 18396 15.1.9.2. NFS4ERR_GRACE (Error Code 10013) 18398 This error is returned when the server is in its grace period with 18399 regard to the file system object for which the lock was requested. 18400 In this situation, a non-reclaim locking request cannot be granted. 18401 This can occur because either: 18403 * The server does not have sufficient information about locks that 18404 might be potentially reclaimed to determine whether the lock could 18405 be granted. 18407 * The request is made by a client responsible for reclaiming its 18408 locks that has not yet done the appropriate RECLAIM_COMPLETE 18409 operation, allowing it to proceed to obtain new locks. 18411 In the case of a per-fs grace period, there may be clients (i.e., 18412 those currently using the destination file system) who might be 18413 unaware of the circumstances resulting in the initiation of the grace 18414 period. Such clients need to periodically retry the request until 18415 the grace period is over, just as other clients do. 18417 15.1.9.3. NFS4ERR_NO_GRACE (Error Code 10033) 18419 A reclaim of client state was attempted in circumstances in which the 18420 server cannot guarantee that conflicting state has not been provided 18421 to another client. This occurs in any of the following situations: 18423 * There is no active grace period applying to the file system object 18424 for which the request was made. 18426 * The client making the request has no current role in reclaiming 18427 locks. 18429 * Previous operations have created a situation in which the server 18430 is not able to determine that a reclaim-interfering edge condition 18431 does not exist. 18433 15.1.9.4. NFS4ERR_RECLAIM_BAD (Error Code 10034) 18435 The server has determined that a reclaim attempted by the client is 18436 not valid, i.e., the lock specified as being reclaimed could not 18437 possibly have existed before the server restart or file system 18438 migration event. A server is not obliged to make this determination 18439 and will typically rely on the client to only reclaim locks that the 18440 client was granted prior to restart. However, when a server does 18441 have reliable information to enable it to make this determination, 18442 this error indicates that the reclaim has been rejected as invalid. 18443 This is as opposed to the error NFS4ERR_RECLAIM_CONFLICT (see 18444 Section 15.1.9.5) where the server can only determine that there has 18445 been an invalid reclaim, but cannot determine which request is 18446 invalid. 18448 15.1.9.5. NFS4ERR_RECLAIM_CONFLICT (Error Code 10035) 18450 The reclaim attempted by the client has encountered a conflict and 18451 cannot be satisfied. This potentially indicates a misbehaving 18452 client, although not necessarily the one receiving the error. The 18453 misbehavior might be on the part of the client that established the 18454 lock with which this client conflicted. See also Section 15.1.9.4 18455 for the related error, NFS4ERR_RECLAIM_BAD. 18457 15.1.10. pNFS Errors 18459 This section deals with pNFS-related errors including those that are 18460 associated with using NFSv4.1 to communicate with a data server. 18462 15.1.10.1. 
NFS4ERR_BADIOMODE (Error Code 10049) 18464 An invalid or inappropriate layout iomode was specified. For 18465 example, suppose a client's LAYOUTGET 18466 operation specified an iomode of LAYOUTIOMODE4_RW, and the server is 18467 neither able nor willing to let the client send write requests to 18468 data servers; the server can reply with NFS4ERR_BADIOMODE. The 18469 client would then send another LAYOUTGET with an iomode of 18470 LAYOUTIOMODE4_READ. 18472 15.1.10.2. NFS4ERR_BADLAYOUT (Error Code 10050) 18474 The layout specified is invalid in some way. For LAYOUTCOMMIT, this 18475 indicates that the specified layout is not held by the client or is 18476 not of mode LAYOUTIOMODE4_RW. For LAYOUTGET, it indicates that a 18477 layout matching the client's specification as to minimum length 18478 cannot be granted. 18480 15.1.10.3. NFS4ERR_LAYOUTTRYLATER (Error Code 10058) 18482 Layouts are temporarily unavailable for the file. The client should 18483 re-send later (but not with the same slot ID and sequence ID; one or 18484 both MUST be different on the re-send). 18486 15.1.10.4. NFS4ERR_LAYOUTUNAVAILABLE (Error Code 10059) 18488 Returned when layouts are not available for the current file system 18489 or the particular specified file. 18491 15.1.10.5. NFS4ERR_NOMATCHING_LAYOUT (Error Code 10060) 18493 Returned when layouts are recalled and the client has no layouts 18494 matching the specification of the layouts being recalled. 18496 15.1.10.6. NFS4ERR_PNFS_IO_HOLE (Error Code 10075) 18498 The pNFS client has attempted to read from or write to an illegal 18499 hole of a file of a data server that is using sparse packing. See 18500 Section 13.4.4. 18502 15.1.10.7. NFS4ERR_PNFS_NO_LAYOUT (Error Code 10080) 18504 The pNFS client has attempted to read from or write to a file (using 18505 a request to a data server) without holding a valid layout. This 18506 includes the case where the client had a layout, but the iomode does 18507 not allow a WRITE. 18509 15.1.10.8. NFS4ERR_RETURNCONFLICT (Error Code 10086) 18511 A layout is unavailable due to an attempt to perform the LAYOUTGET 18512 before a pending LAYOUTRETURN on the file has been received. See 18513 Section 12.5.5.2.1.3. 18515 15.1.10.9. NFS4ERR_UNKNOWN_LAYOUTTYPE (Error Code 10062) 18517 The client has specified a layout type that is not supported by the 18518 server. 18520 15.1.11. Session Use Errors 18522 This section deals with errors encountered when using sessions, that 18523 is, errors encountered when a request uses a Sequence (i.e., either 18524 SEQUENCE or CB_SEQUENCE) operation. 18526 15.1.11.1. NFS4ERR_BADSESSION (Error Code 10052) 18528 The specified session ID is unknown to the server to which the 18529 operation is addressed. 18531 15.1.11.2. NFS4ERR_BADSLOT (Error Code 10053) 18533 The requester sent a Sequence operation that attempted to use a slot 18534 the replier does not have in its slot table. It is possible the slot 18535 may have been retired. 18537 15.1.11.3. NFS4ERR_BAD_HIGH_SLOT (Error Code 10077) 18539 The highest_slotid argument in a Sequence operation exceeds the 18540 replier's enforced highest_slotid. 18542 15.1.11.4. NFS4ERR_CB_PATH_DOWN (Error Code 10048) 18544 There is a problem contacting the client via the callback path. The 18545 function of this error has been mostly superseded by the use of 18546 status flags in the reply to the SEQUENCE operation (see 18547 Section 18.46).
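The session- and slot-related checks described in Sections 15.1.11.1 through 15.1.11.3 can be illustrated as follows.  The sketch is illustrative only; the session table and its fields are assumptions of the example rather than data structures defined by this specification, and a real replier would continue with the sequence-ID checks associated with NFS4ERR_SEQ_FALSE_RETRY and NFS4ERR_SEQ_MISORDERED described later in this section.

   <CODE BEGINS>
   # Illustrative sketch: checks a replier might apply to the session
   # and slot arguments of a SEQUENCE operation before examining its
   # sequence ID.  The 'sessions' table is an assumption of the example.

   NFS4ERR_BADSESSION    = 10052
   NFS4ERR_BADSLOT       = 10053
   NFS4ERR_BAD_HIGH_SLOT = 10077

   def check_sequence_args(sessions, sessionid, slotid, highest_slotid):
       """Return 0 (NFS_OK) or the first session-use error detected."""
       session = sessions.get(sessionid)
       if session is None:
           return NFS4ERR_BADSESSION       # session ID unknown to the replier
       if slotid >= len(session["slots"]):
           return NFS4ERR_BADSLOT          # slot not present in the slot table
       if highest_slotid > session["enforced_highest_slotid"]:
           return NFS4ERR_BAD_HIGH_SLOT    # exceeds enforced highest_slotid
       return 0
   <CODE ENDS>

18549 15.1.11.5.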
NFS4ERR_DEADSESSION (Error Code 10078) 18551 The specified session is a persistent session that is dead and does 18552 not accept new requests or perform new operations on existing 18553 requests (in the case in which a request was partially executed 18554 before server restart). 18556 15.1.11.6. NFS4ERR_CONN_NOT_BOUND_TO_SESSION (Error Code 10055) 18558 A Sequence operation was sent on a connection that has not been 18559 associated with the specified session, where the client specified 18560 that connection association was to be enforced with SP4_MACH_CRED or 18561 SP4_SSV state protection. 18563 15.1.11.7. NFS4ERR_SEQ_FALSE_RETRY (Error Code 10076) 18565 The requester sent a Sequence operation with a slot ID and sequence 18566 ID that are in the reply cache, but the replier has detected that the 18567 retried request is not the same as the original request. See 18568 Section 2.10.6.1.3.1. 18570 15.1.11.8. NFS4ERR_SEQ_MISORDERED (Error Code 10063) 18572 The requester sent a Sequence operation with an invalid sequence ID. 18574 15.1.12. Session Management Errors 18576 This section deals with errors associated with requests used in 18577 session management. 18579 15.1.12.1. NFS4ERR_BACK_CHAN_BUSY (Error Code 10057) 18581 An attempt was made to destroy a session when the session cannot be 18582 destroyed because the server has callback requests outstanding. 18584 15.1.12.2. NFS4ERR_BAD_SESSION_DIGEST (Error Code 10051) 18586 The digest used in a SET_SSV request is not valid. 18588 15.1.13. Client Management Errors 18590 This section deals with errors associated with requests used to 18591 create and manage client IDs. 18593 15.1.13.1. NFS4ERR_CLIENTID_BUSY (Error Code 10074) 18595 The DESTROY_CLIENTID operation has found there are sessions and/or 18596 unexpired state associated with the client ID to be destroyed. 18598 15.1.13.2. NFS4ERR_CLID_INUSE (Error Code 10017) 18600 While processing an EXCHANGE_ID operation, the server was presented 18601 with a co_ownerid field that matches an existing client with valid 18602 leased state, but the principal sending the EXCHANGE_ID operation 18603 differs from the principal that established the existing client. 18604 This indicates a collision (most likely due to chance) between 18605 clients. The client should recover by changing the co_ownerid and 18606 re-sending EXCHANGE_ID (but not with the same slot ID and sequence 18607 ID; one or both MUST be different on the re-send). 18609 15.1.13.3. NFS4ERR_ENCR_ALG_UNSUPP (Error Code 10079) 18611 An EXCHANGE_ID was sent that specified state protection via SSV, and 18612 where the set of encryption algorithms presented by the client did 18613 not include any supported by the server. 18615 15.1.13.4. NFS4ERR_HASH_ALG_UNSUPP (Error Code 10072) 18617 An EXCHANGE_ID was sent that specified state protection via SSV, and 18618 where the set of hashing algorithms presented by the client did not 18619 include any supported by the server. 18621 15.1.13.5. NFS4ERR_STALE_CLIENTID (Error Code 10022) 18623 A client ID not recognized by the server was passed to an operation. 18624 Note that unlike the case of NFSv4.0, client IDs are not passed 18625 explicitly to the server in ordinary locking operations and cannot 18626 result in this error. Instead, when there is a server restart, it is 18627 first manifested through an error on the associated session, and the 18628 staleness of the client ID is detected when trying to associate a 18629 client ID with a new session. 18631 15.1.14. 
Delegation Errors 18633 This section deals with errors associated with requesting and 18634 returning delegations. 18636 15.1.14.1. NFS4ERR_DELEG_ALREADY_WANTED (Error Code 10056) 18638 The client has requested a delegation when it had already registered 18639 that it wants that same delegation. 18641 15.1.14.2. NFS4ERR_DIRDELEG_UNAVAIL (Error Code 10084) 18643 This error is returned when the server is unable or unwilling to 18644 provide a requested directory delegation. 18646 15.1.14.3. NFS4ERR_RECALLCONFLICT (Error Code 10061) 18648 A recallable object (i.e., a layout or delegation) is unavailable due 18649 to a conflicting recall operation that is currently in progress for 18650 that object. 18652 15.1.14.4. NFS4ERR_REJECT_DELEG (Error Code 10085) 18654 The callback operation invoked to deal with a new delegation has 18655 rejected it. 18657 15.1.15. Attribute Handling Errors 18659 This section deals with errors specific to attribute handling within 18660 NFSv4. 18662 15.1.15.1. NFS4ERR_ATTRNOTSUPP (Error Code 10032) 18664 An attribute specified is not supported by the server. This error 18665 MUST NOT be returned by the GETATTR operation. 18667 15.1.15.2. NFS4ERR_BADOWNER (Error Code 10039) 18669 This error is returned when an owner or owner_group attribute value 18670 or the who field of an ACE within an ACL attribute value cannot be 18671 translated to a local representation. 18673 15.1.15.3. NFS4ERR_NOT_SAME (Error Code 10027) 18675 This error is returned by the VERIFY operation to signify that the 18676 attributes compared were not the same as those provided in the 18677 client's request. 18679 15.1.15.4. NFS4ERR_SAME (Error Code 10009) 18681 This error is returned by the NVERIFY operation to signify that the 18682 attributes compared were the same as those provided in the client's 18683 request. 18685 15.1.16. Obsoleted Errors 18687 These errors MUST NOT be generated by any NFSv4.1 operation. This 18688 can be for a number of reasons. 18690 * The function provided by the error has been superseded by one of 18691 the status bits returned by the SEQUENCE operation. 18693 * The new session structure and associated change in locking have 18694 made the error unnecessary. 18696 * There has been a restructuring of some errors for NFSv4.1 that 18697 resulted in the elimination of certain errors. 18699 15.1.16.1. NFS4ERR_BAD_SEQID (Error Code 10026) 18701 The sequence number (seqid) in a locking request is neither the next 18702 expected number nor the last number processed. These seqids are 18703 ignored in NFSv4.1. 18705 15.1.16.2. NFS4ERR_LEASE_MOVED (Error Code 10031) 18707 A lease being renewed is associated with a file system that has been 18708 migrated to a new server. The error has been superseded by the 18709 SEQ4_STATUS_LEASE_MOVED status bit (see Section 18.46). 18711 15.1.16.3. NFS4ERR_NXIO (Error Code 6) 18713 No such device or address. This error is for errors 18714 involving block and character device access, but because NFSv4.1 is 18715 not a device-access protocol, this error is not applicable. 18717 15.1.16.4. NFS4ERR_RESTOREFH (Error Code 10030) 18719 The RESTOREFH operation does not have a saved filehandle (identified 18720 by SAVEFH) to operate upon. In NFSv4.1, this error has been 18721 superseded by NFS4ERR_NOFILEHANDLE. 18723 15.1.16.5. NFS4ERR_STALE_STATEID (Error Code 10023) 18725 A stateid generated by an earlier server instance was used.
This 18726 error is moot in NFSv4.1 because all operations that take a stateid 18727 MUST be preceded by the SEQUENCE operation, and the earlier server 18728 instance is detected by the session infrastructure that supports 18729 SEQUENCE. 18731 15.2. Operations and Their Valid Errors 18733 This section contains a table that gives the valid error returns for 18734 each protocol operation. The error code NFS4_OK (indicating no 18735 error) is not listed but should be understood to be returnable by all 18736 operations with two important exceptions: 18738 * The operations that MUST NOT be implemented: OPEN_CONFIRM, 18739 RELEASE_LOCKOWNER, RENEW, SETCLIENTID, and SETCLIENTID_CONFIRM. 18741 * The invalid operation: ILLEGAL. 18743 +======================+========================================+ 18744 | Operation | Errors | 18745 +======================+========================================+ 18746 | ACCESS | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18747 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18748 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18749 | | NFS4ERR_IO, NFS4ERR_MOVED, | 18750 | | NFS4ERR_NOFILEHANDLE, | 18751 | | NFS4ERR_OP_NOT_IN_SESSION, | 18752 | | NFS4ERR_REP_TOO_BIG, | 18753 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18754 | | NFS4ERR_REQ_TOO_BIG, | 18755 | | NFS4ERR_RETRY_UNCACHED_REP, | 18756 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18757 | | NFS4ERR_TOO_MANY_OPS | 18758 +----------------------+----------------------------------------+ 18759 | BACKCHANNEL_CTL | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18760 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 18761 | | NFS4ERR_NOENT, | 18762 | | NFS4ERR_OP_NOT_IN_SESSION, | 18763 | | NFS4ERR_REP_TOO_BIG, | 18764 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18765 | | NFS4ERR_REQ_TOO_BIG, | 18766 | | NFS4ERR_RETRY_UNCACHED_REP, | 18767 | | NFS4ERR_TOO_MANY_OPS | 18768 +----------------------+----------------------------------------+ 18769 | BIND_CONN_TO_SESSION | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | 18770 | | NFS4ERR_BAD_SESSION_DIGEST, | 18771 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18772 | | NFS4ERR_INVAL, NFS4ERR_NOT_ONLY_OP, | 18773 | | NFS4ERR_REP_TOO_BIG, | 18774 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18775 | | NFS4ERR_REQ_TOO_BIG, | 18776 | | NFS4ERR_RETRY_UNCACHED_REP, | 18777 | | NFS4ERR_SERVERFAULT, | 18778 | | NFS4ERR_TOO_MANY_OPS | 18779 +----------------------+----------------------------------------+ 18780 | CLOSE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 18781 | | NFS4ERR_BAD_STATEID, | 18782 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18783 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 18784 | | NFS4ERR_LOCKS_HELD, NFS4ERR_MOVED, | 18785 | | NFS4ERR_NOFILEHANDLE, | 18786 | | NFS4ERR_OLD_STATEID, | 18787 | | NFS4ERR_OP_NOT_IN_SESSION, | 18788 | | NFS4ERR_REP_TOO_BIG, | 18789 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18790 | | NFS4ERR_REQ_TOO_BIG, | 18791 | | NFS4ERR_RETRY_UNCACHED_REP, | 18792 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18793 | | NFS4ERR_TOO_MANY_OPS, | 18794 | | NFS4ERR_WRONG_CRED | 18795 +----------------------+----------------------------------------+ 18796 | COMMIT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18797 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18798 | | NFS4ERR_FHEXPIRED, NFS4ERR_IO, | 18799 | | NFS4ERR_ISDIR, NFS4ERR_MOVED, | 18800 | | NFS4ERR_NOFILEHANDLE, | 18801 | | NFS4ERR_OP_NOT_IN_SESSION, | 18802 | | NFS4ERR_REP_TOO_BIG, | 18803 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18804 | | NFS4ERR_REQ_TOO_BIG, | 18805 | | NFS4ERR_RETRY_UNCACHED_REP, | 18806 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18807 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 
18808 | | NFS4ERR_WRONG_TYPE | 18809 +----------------------+----------------------------------------+ 18810 | CREATE | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 18811 | | NFS4ERR_BADCHAR, NFS4ERR_BADNAME, | 18812 | | NFS4ERR_BADOWNER, NFS4ERR_BADTYPE, | 18813 | | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18814 | | NFS4ERR_DELAY, NFS4ERR_DQUOT, | 18815 | | NFS4ERR_EXIST, NFS4ERR_FHEXPIRED, | 18816 | | NFS4ERR_INVAL, NFS4ERR_IO, | 18817 | | NFS4ERR_MLINK, NFS4ERR_MOVED, | 18818 | | NFS4ERR_NAMETOOLONG, | 18819 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 18820 | | NFS4ERR_NOTDIR, | 18821 | | NFS4ERR_OP_NOT_IN_SESSION, | 18822 | | NFS4ERR_PERM, NFS4ERR_REP_TOO_BIG, | 18823 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18824 | | NFS4ERR_REQ_TOO_BIG, | 18825 | | NFS4ERR_RETRY_UNCACHED_REP, | 18826 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 18827 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 18828 | | NFS4ERR_UNSAFE_COMPOUND | 18829 +----------------------+----------------------------------------+ 18830 | CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | 18831 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18832 | | NFS4ERR_INVAL, NFS4ERR_NOENT, | 18833 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_NOSPC, | 18834 | | NFS4ERR_REP_TOO_BIG, | 18835 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18836 | | NFS4ERR_REQ_TOO_BIG, | 18837 | | NFS4ERR_RETRY_UNCACHED_REP, | 18838 | | NFS4ERR_SEQ_MISORDERED, | 18839 | | NFS4ERR_SERVERFAULT, | 18840 | | NFS4ERR_STALE_CLIENTID, | 18841 | | NFS4ERR_TOOSMALL, | 18842 | | NFS4ERR_TOO_MANY_OPS, | 18843 | | NFS4ERR_WRONG_CRED | 18844 +----------------------+----------------------------------------+ 18845 | DELEGPURGE | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18846 | | NFS4ERR_DELAY, NFS4ERR_NOTSUPP, | 18847 | | NFS4ERR_OP_NOT_IN_SESSION, | 18848 | | NFS4ERR_REP_TOO_BIG, | 18849 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18850 | | NFS4ERR_REQ_TOO_BIG, | 18851 | | NFS4ERR_RETRY_UNCACHED_REP, | 18852 | | NFS4ERR_SERVERFAULT, | 18853 | | NFS4ERR_TOO_MANY_OPS, | 18854 | | NFS4ERR_WRONG_CRED | 18855 +----------------------+----------------------------------------+ 18856 | DELEGRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 18857 | | NFS4ERR_BAD_STATEID, | 18858 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18859 | | NFS4ERR_DELEG_REVOKED, | 18860 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 18861 | | NFS4ERR_INVAL, NFS4ERR_MOVED, | 18862 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 18863 | | NFS4ERR_OLD_STATEID, | 18864 | | NFS4ERR_OP_NOT_IN_SESSION, | 18865 | | NFS4ERR_REP_TOO_BIG, | 18866 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18867 | | NFS4ERR_REQ_TOO_BIG, | 18868 | | NFS4ERR_RETRY_UNCACHED_REP, | 18869 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18870 | | NFS4ERR_TOO_MANY_OPS, | 18871 | | NFS4ERR_WRONG_CRED | 18872 +----------------------+----------------------------------------+ 18873 | DESTROY_CLIENTID | NFS4ERR_BADXDR, NFS4ERR_CLIENTID_BUSY, | 18874 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18875 | | NFS4ERR_NOT_ONLY_OP, | 18876 | | NFS4ERR_REP_TOO_BIG, | 18877 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18878 | | NFS4ERR_REQ_TOO_BIG, | 18879 | | NFS4ERR_RETRY_UNCACHED_REP, | 18880 | | NFS4ERR_SERVERFAULT, | 18881 | | NFS4ERR_STALE_CLIENTID, | 18882 | | NFS4ERR_TOO_MANY_OPS, | 18883 | | NFS4ERR_WRONG_CRED | 18884 +----------------------+----------------------------------------+ 18885 | DESTROY_SESSION | NFS4ERR_BACK_CHAN_BUSY, | 18886 | | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | 18887 | | NFS4ERR_CB_PATH_DOWN, | 18888 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 18889 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18890 | | NFS4ERR_NOT_ONLY_OP, 
| 18891 | | NFS4ERR_REP_TOO_BIG, | 18892 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18893 | | NFS4ERR_REQ_TOO_BIG, | 18894 | | NFS4ERR_RETRY_UNCACHED_REP, | 18895 | | NFS4ERR_SERVERFAULT, | 18896 | | NFS4ERR_STALE_CLIENTID, | 18897 | | NFS4ERR_TOO_MANY_OPS, | 18898 | | NFS4ERR_WRONG_CRED | 18899 +----------------------+----------------------------------------+ 18900 | EXCHANGE_ID | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 18901 | | NFS4ERR_CLID_INUSE, | 18902 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18903 | | NFS4ERR_ENCR_ALG_UNSUPP, | 18904 | | NFS4ERR_HASH_ALG_UNSUPP, | 18905 | | NFS4ERR_INVAL, NFS4ERR_NOENT, | 18906 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_NOT_SAME, | 18907 | | NFS4ERR_REP_TOO_BIG, | 18908 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18909 | | NFS4ERR_REQ_TOO_BIG, | 18910 | | NFS4ERR_RETRY_UNCACHED_REP, | 18911 | | NFS4ERR_SERVERFAULT, | 18912 | | NFS4ERR_TOO_MANY_OPS | 18913 +----------------------+----------------------------------------+ 18914 | FREE_STATEID | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 18915 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18916 | | NFS4ERR_LOCKS_HELD, | 18917 | | NFS4ERR_OLD_STATEID, | 18918 | | NFS4ERR_OP_NOT_IN_SESSION, | 18919 | | NFS4ERR_REP_TOO_BIG, | 18920 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18921 | | NFS4ERR_REQ_TOO_BIG, | 18922 | | NFS4ERR_RETRY_UNCACHED_REP, | 18923 | | NFS4ERR_SERVERFAULT, | 18924 | | NFS4ERR_TOO_MANY_OPS, | 18925 | | NFS4ERR_WRONG_CRED | 18926 +----------------------+----------------------------------------+ 18927 | GET_DIR_DELEGATION | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18928 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18929 | | NFS4ERR_DIRDELEG_UNAVAIL, | 18930 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18931 | | NFS4ERR_INVAL, NFS4ERR_IO, | 18932 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18933 | | NFS4ERR_NOTDIR, NFS4ERR_NOTSUPP, | 18934 | | NFS4ERR_OP_NOT_IN_SESSION, | 18935 | | NFS4ERR_REP_TOO_BIG, | 18936 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18937 | | NFS4ERR_REQ_TOO_BIG, | 18938 | | NFS4ERR_RETRY_UNCACHED_REP, | 18939 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18940 | | NFS4ERR_TOO_MANY_OPS | 18941 +----------------------+----------------------------------------+ 18942 | GETATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18943 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18944 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18945 | | NFS4ERR_INVAL, NFS4ERR_IO, | 18946 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18947 | | NFS4ERR_OP_NOT_IN_SESSION, | 18948 | | NFS4ERR_REP_TOO_BIG, | 18949 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18950 | | NFS4ERR_REQ_TOO_BIG, | 18951 | | NFS4ERR_RETRY_UNCACHED_REP, | 18952 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18953 | | NFS4ERR_TOO_MANY_OPS, | 18954 | | NFS4ERR_WRONG_TYPE | 18955 +----------------------+----------------------------------------+ 18956 | GETDEVICEINFO | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18957 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 18958 | | NFS4ERR_NOENT, NFS4ERR_NOTSUPP, | 18959 | | NFS4ERR_OP_NOT_IN_SESSION, | 18960 | | NFS4ERR_REP_TOO_BIG, | 18961 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18962 | | NFS4ERR_REQ_TOO_BIG, | 18963 | | NFS4ERR_RETRY_UNCACHED_REP, | 18964 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOOSMALL, | 18965 | | NFS4ERR_TOO_MANY_OPS, | 18966 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 18967 +----------------------+----------------------------------------+ 18968 | GETDEVICELIST | NFS4ERR_BADXDR, NFS4ERR_BAD_COOKIE, | 18969 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18970 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18971 | | NFS4ERR_IO, NFS4ERR_NOFILEHANDLE, | 18972 | | NFS4ERR_NOTSUPP, NFS4ERR_NOT_SAME, | 18973 | | 
NFS4ERR_OP_NOT_IN_SESSION, | 18974 | | NFS4ERR_REP_TOO_BIG, | 18975 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18976 | | NFS4ERR_REQ_TOO_BIG, | 18977 | | NFS4ERR_RETRY_UNCACHED_REP, | 18978 | | NFS4ERR_SERVERFAULT, | 18979 | | NFS4ERR_TOO_MANY_OPS, | 18980 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 18981 +----------------------+----------------------------------------+ 18982 | GETFH | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | 18983 | | NFS4ERR_NOFILEHANDLE, | 18984 | | NFS4ERR_OP_NOT_IN_SESSION, | 18985 | | NFS4ERR_STALE | 18986 +----------------------+----------------------------------------+ 18987 | ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL | 18988 +----------------------+----------------------------------------+ 18989 | LAYOUTCOMMIT | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 18990 | | NFS4ERR_ATTRNOTSUPP, | 18991 | | NFS4ERR_BADIOMODE, NFS4ERR_BADLAYOUT, | 18992 | | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18993 | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, | 18994 | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | 18995 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18996 | | NFS4ERR_INVAL, NFS4ERR_IO, | 18997 | | NFS4ERR_ISDIR NFS4ERR_MOVED, | 18998 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 18999 | | NFS4ERR_NO_GRACE, | 19000 | | NFS4ERR_OP_NOT_IN_SESSION, | 19001 | | NFS4ERR_RECLAIM_BAD, | 19002 | | NFS4ERR_RECLAIM_CONFLICT, | 19003 | | NFS4ERR_REP_TOO_BIG, | 19004 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19005 | | NFS4ERR_REQ_TOO_BIG, | 19006 | | NFS4ERR_RETRY_UNCACHED_REP, | 19007 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19008 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 19009 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 19010 | | NFS4ERR_WRONG_CRED | 19011 +----------------------+----------------------------------------+ 19012 | LAYOUTGET | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 19013 | | NFS4ERR_BADIOMODE, NFS4ERR_BADLAYOUT, | 19014 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 19015 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19016 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 19017 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 19018 | | NFS4ERR_INVAL, NFS4ERR_IO, | 19019 | | NFS4ERR_LAYOUTTRYLATER, | 19020 | | NFS4ERR_LAYOUTUNAVAILABLE, | 19021 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 19022 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 19023 | | NFS4ERR_NOTSUPP, NFS4ERR_OLD_STATEID, | 19024 | | NFS4ERR_OPENMODE, | 19025 | | NFS4ERR_OP_NOT_IN_SESSION, | 19026 | | NFS4ERR_RECALLCONFLICT, | 19027 | | NFS4ERR_REP_TOO_BIG, | 19028 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19029 | | NFS4ERR_REQ_TOO_BIG, | 19030 | | NFS4ERR_RETRY_UNCACHED_REP, | 19031 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19032 | | NFS4ERR_TOOSMALL, | 19033 | | NFS4ERR_TOO_MANY_OPS, | 19034 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 19035 | | NFS4ERR_WRONG_TYPE | 19036 +----------------------+----------------------------------------+ 19037 | LAYOUTRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 19038 | | NFS4ERR_BAD_STATEID, | 19039 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19040 | | NFS4ERR_DELEG_REVOKED, | 19041 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 19042 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 19043 | | NFS4ERR_ISDIR, NFS4ERR_MOVED, | 19044 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 19045 | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | 19046 | | NFS4ERR_OP_NOT_IN_SESSION, | 19047 | | NFS4ERR_REP_TOO_BIG, | 19048 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19049 | | NFS4ERR_REQ_TOO_BIG, | 19050 | | NFS4ERR_RETRY_UNCACHED_REP, | 19051 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19052 | | NFS4ERR_TOO_MANY_OPS, | 19053 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 19054 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 19055 
+----------------------+----------------------------------------+ 19056 | LINK | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 19057 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 19058 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19059 | | NFS4ERR_DQUOT, NFS4ERR_EXIST, | 19060 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 19061 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 19062 | | NFS4ERR_ISDIR, NFS4ERR_IO, | 19063 | | NFS4ERR_MLINK, NFS4ERR_MOVED, | 19064 | | NFS4ERR_NAMETOOLONG, | 19065 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 19066 | | NFS4ERR_NOTDIR, NFS4ERR_NOTSUPP, | 19067 | | NFS4ERR_OP_NOT_IN_SESSION, | 19068 | | NFS4ERR_REP_TOO_BIG, | 19069 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19070 | | NFS4ERR_REQ_TOO_BIG, | 19071 | | NFS4ERR_RETRY_UNCACHED_REP, | 19072 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19073 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 19074 | | NFS4ERR_TOO_MANY_OPS, | 19075 | | NFS4ERR_WRONGSEC, NFS4ERR_WRONG_TYPE, | 19076 | | NFS4ERR_XDEV | 19077 +----------------------+----------------------------------------+ 19078 | LOCK | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 19079 | | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE, | 19080 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADLOCK, | 19081 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19082 | | NFS4ERR_DENIED, NFS4ERR_EXPIRED, | 19083 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 19084 | | NFS4ERR_INVAL, NFS4ERR_ISDIR, | 19085 | | NFS4ERR_LOCK_NOTSUPP, | 19086 | | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED, | 19087 | | NFS4ERR_NOFILEHANDLE, | 19088 | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | 19089 | | NFS4ERR_OPENMODE, | 19090 | | NFS4ERR_OP_NOT_IN_SESSION, | 19091 | | NFS4ERR_RECLAIM_BAD, | 19092 | | NFS4ERR_RECLAIM_CONFLICT, | 19093 | | NFS4ERR_REP_TOO_BIG, | 19094 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19095 | | NFS4ERR_REQ_TOO_BIG, | 19096 | | NFS4ERR_RETRY_UNCACHED_REP, | 19097 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19098 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 19099 | | NFS4ERR_TOO_MANY_OPS, | 19100 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 19101 +----------------------+----------------------------------------+ 19102 | LOCKT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 19103 | | NFS4ERR_BAD_RANGE, | 19104 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19105 | | NFS4ERR_DENIED, NFS4ERR_FHEXPIRED, | 19106 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 19107 | | NFS4ERR_ISDIR, NFS4ERR_LOCK_RANGE, | 19108 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 19109 | | NFS4ERR_OP_NOT_IN_SESSION, | 19110 | | NFS4ERR_REP_TOO_BIG, | 19111 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19112 | | NFS4ERR_REQ_TOO_BIG, | 19113 | | NFS4ERR_RETRY_UNCACHED_REP, | 19114 | | NFS4ERR_ROFS, NFS4ERR_STALE, | 19115 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 19116 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 19117 +----------------------+----------------------------------------+ 19118 | LOCKU | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 19119 | | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE, | 19120 | | NFS4ERR_BAD_STATEID, | 19121 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19122 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 19123 | | NFS4ERR_INVAL, NFS4ERR_LOCK_RANGE, | 19124 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 19125 | | NFS4ERR_OLD_STATEID, | 19126 | | NFS4ERR_OP_NOT_IN_SESSION, | 19127 | | NFS4ERR_REP_TOO_BIG, | 19128 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19129 | | NFS4ERR_REQ_TOO_BIG, | 19130 | | NFS4ERR_RETRY_UNCACHED_REP, | 19131 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19132 | | NFS4ERR_TOO_MANY_OPS, | 19133 | | NFS4ERR_WRONG_CRED | 19134 +----------------------+----------------------------------------+ 19135 | LOOKUP | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 19136 
| | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 19137 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19138 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 19139 | | NFS4ERR_IO, NFS4ERR_MOVED, | 19140 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 19141 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 19142 | | NFS4ERR_OP_NOT_IN_SESSION, | 19143 | | NFS4ERR_REP_TOO_BIG, | 19144 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19145 | | NFS4ERR_REQ_TOO_BIG, | 19146 | | NFS4ERR_RETRY_UNCACHED_REP, | 19147 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19148 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 19149 | | NFS4ERR_WRONGSEC | 19150 +----------------------+----------------------------------------+ 19151 | LOOKUPP | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION, | 19152 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 19153 | | NFS4ERR_IO, NFS4ERR_MOVED, | 19154 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 19155 | | NFS4ERR_NOTDIR, | 19156 | | NFS4ERR_OP_NOT_IN_SESSION, | 19157 | | NFS4ERR_REP_TOO_BIG, | 19158 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19159 | | NFS4ERR_REQ_TOO_BIG, | 19160 | | NFS4ERR_RETRY_UNCACHED_REP, | 19161 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19162 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 19163 | | NFS4ERR_WRONGSEC | 19164 +----------------------+----------------------------------------+ 19165 | NVERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 19166 | | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 19167 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19168 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 19169 | | NFS4ERR_INVAL, NFS4ERR_IO, | 19170 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 19171 | | NFS4ERR_OP_NOT_IN_SESSION, | 19172 | | NFS4ERR_REP_TOO_BIG, | 19173 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19174 | | NFS4ERR_REQ_TOO_BIG, | 19175 | | NFS4ERR_RETRY_UNCACHED_REP, | 19176 | | NFS4ERR_SAME, NFS4ERR_SERVERFAULT, | 19177 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 19178 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 19179 | | NFS4ERR_WRONG_TYPE | 19180 +----------------------+----------------------------------------+ 19181 | OPEN | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 19182 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | 19183 | | NFS4ERR_BADNAME, NFS4ERR_BADOWNER, | 19184 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 19185 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19186 | | NFS4ERR_DELEG_ALREADY_WANTED, | 19187 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 19188 | | NFS4ERR_EXIST, NFS4ERR_EXPIRED, | 19189 | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | 19190 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 19191 | | NFS4ERR_ISDIR, NFS4ERR_IO, | 19192 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 19193 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 19194 | | NFS4ERR_NOSPC, NFS4ERR_NOTDIR, | 19195 | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | 19196 | | NFS4ERR_OP_NOT_IN_SESSION, | 19197 | | NFS4ERR_PERM, NFS4ERR_RECLAIM_BAD, | 19198 | | NFS4ERR_RECLAIM_CONFLICT, | 19199 | | NFS4ERR_REP_TOO_BIG, | 19200 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19201 | | NFS4ERR_REQ_TOO_BIG, | 19202 | | NFS4ERR_RETRY_UNCACHED_REP, | 19203 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19204 | | NFS4ERR_SHARE_DENIED, NFS4ERR_STALE, | 19205 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 19206 | | NFS4ERR_UNSAFE_COMPOUND, | 19207 | | NFS4ERR_WRONGSEC, NFS4ERR_WRONG_TYPE | 19208 +----------------------+----------------------------------------+ 19209 | OPEN_CONFIRM | NFS4ERR_NOTSUPP | 19210 +----------------------+----------------------------------------+ 19211 | OPEN_DOWNGRADE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 19212 | | NFS4ERR_BAD_STATEID, | 19213 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19214 | | NFS4ERR_EXPIRED, 
NFS4ERR_FHEXPIRED, | 19215 | | NFS4ERR_INVAL, NFS4ERR_MOVED, | 19216 | | NFS4ERR_NOFILEHANDLE, | 19217 | | NFS4ERR_OLD_STATEID, | 19218 | | NFS4ERR_OP_NOT_IN_SESSION, | 19219 | | NFS4ERR_REP_TOO_BIG, | 19220 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19221 | | NFS4ERR_REQ_TOO_BIG, | 19222 | | NFS4ERR_RETRY_UNCACHED_REP, | 19223 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19224 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 19225 | | NFS4ERR_WRONG_CRED | 19226 +----------------------+----------------------------------------+ 19227 | OPENATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 19228 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19229 | | NFS4ERR_DQUOT, NFS4ERR_FHEXPIRED, | 19230 | | NFS4ERR_IO, NFS4ERR_MOVED, | 19231 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 19232 | | NFS4ERR_NOSPC, NFS4ERR_NOTSUPP, | 19233 | | NFS4ERR_OP_NOT_IN_SESSION, | 19234 | | NFS4ERR_REP_TOO_BIG, | 19235 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19236 | | NFS4ERR_REQ_TOO_BIG, | 19237 | | NFS4ERR_RETRY_UNCACHED_REP, | 19238 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19239 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 19240 | | NFS4ERR_UNSAFE_COMPOUND, | 19241 | | NFS4ERR_WRONG_TYPE | 19242 +----------------------+----------------------------------------+ 19243 | PUTFH | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 19244 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19245 | | NFS4ERR_MOVED, | 19246 | | NFS4ERR_OP_NOT_IN_SESSION, | 19247 | | NFS4ERR_REP_TOO_BIG, | 19248 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19249 | | NFS4ERR_REQ_TOO_BIG, | 19250 | | NFS4ERR_RETRY_UNCACHED_REP, | 19251 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19252 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 19253 +----------------------+----------------------------------------+ 19254 | PUTPUBFH | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19255 | | NFS4ERR_OP_NOT_IN_SESSION, | 19256 | | NFS4ERR_REP_TOO_BIG, | 19257 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19258 | | NFS4ERR_REQ_TOO_BIG, | 19259 | | NFS4ERR_RETRY_UNCACHED_REP, | 19260 | | NFS4ERR_SERVERFAULT, | 19261 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 19262 +----------------------+----------------------------------------+ 19263 | PUTROOTFH | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19264 | | NFS4ERR_OP_NOT_IN_SESSION, | 19265 | | NFS4ERR_REP_TOO_BIG, | 19266 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19267 | | NFS4ERR_REQ_TOO_BIG, | 19268 | | NFS4ERR_RETRY_UNCACHED_REP, | 19269 | | NFS4ERR_SERVERFAULT, | 19270 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 19271 +----------------------+----------------------------------------+ 19272 | READ | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 19273 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 19274 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19275 | | NFS4ERR_DELEG_REVOKED, | 19276 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 19277 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 19278 | | NFS4ERR_ISDIR, NFS4ERR_IO, | 19279 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 19280 | | NFS4ERR_NOFILEHANDLE, | 19281 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 19282 | | NFS4ERR_OP_NOT_IN_SESSION, | 19283 | | NFS4ERR_PNFS_IO_HOLE, | 19284 | | NFS4ERR_PNFS_NO_LAYOUT, | 19285 | | NFS4ERR_REP_TOO_BIG, | 19286 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19287 | | NFS4ERR_REQ_TOO_BIG, | 19288 | | NFS4ERR_RETRY_UNCACHED_REP, | 19289 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19290 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 19291 | | NFS4ERR_WRONG_TYPE | 19292 +----------------------+----------------------------------------+ 19293 | READDIR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 19294 | | NFS4ERR_BAD_COOKIE, | 19295 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19296 | | 
NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 19297 | | NFS4ERR_IO, NFS4ERR_MOVED, | 19298 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 19299 | | NFS4ERR_NOT_SAME, | 19300 | | NFS4ERR_OP_NOT_IN_SESSION, | 19301 | | NFS4ERR_REP_TOO_BIG, | 19302 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19303 | | NFS4ERR_REQ_TOO_BIG, | 19304 | | NFS4ERR_RETRY_UNCACHED_REP, | 19305 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19306 | | NFS4ERR_TOOSMALL, NFS4ERR_TOO_MANY_OPS | 19307 +----------------------+----------------------------------------+ 19308 | READLINK | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION, | 19309 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 19310 | | NFS4ERR_INVAL, NFS4ERR_IO, | 19311 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 19312 | | NFS4ERR_OP_NOT_IN_SESSION, | 19313 | | NFS4ERR_REP_TOO_BIG, | 19314 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19315 | | NFS4ERR_REQ_TOO_BIG, | 19316 | | NFS4ERR_RETRY_UNCACHED_REP, | 19317 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19318 | | NFS4ERR_TOO_MANY_OPS, | 19319 | | NFS4ERR_WRONG_TYPE | 19320 +----------------------+----------------------------------------+ 19321 | RECLAIM_COMPLETE | NFS4ERR_BADXDR, | 19322 | | NFS4ERR_COMPLETE_ALREADY, | 19323 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19324 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 19325 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 19326 | | NFS4ERR_OP_NOT_IN_SESSION, | 19327 | | NFS4ERR_REP_TOO_BIG, | 19328 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19329 | | NFS4ERR_REQ_TOO_BIG, | 19330 | | NFS4ERR_RETRY_UNCACHED_REP, | 19331 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19332 | | NFS4ERR_TOO_MANY_OPS, | 19333 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 19334 +----------------------+----------------------------------------+ 19335 | RELEASE_LOCKOWNER | NFS4ERR_NOTSUPP | 19336 +----------------------+----------------------------------------+ 19337 | REMOVE | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 19338 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 19339 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19340 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 19341 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 19342 | | NFS4ERR_IO, NFS4ERR_MOVED, | 19343 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 19344 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 19345 | | NFS4ERR_NOTEMPTY, | 19346 | | NFS4ERR_OP_NOT_IN_SESSION, | 19347 | | NFS4ERR_REP_TOO_BIG, | 19348 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19349 | | NFS4ERR_REQ_TOO_BIG, | 19350 | | NFS4ERR_RETRY_UNCACHED_REP, | 19351 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19352 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS | 19353 +----------------------+----------------------------------------+ 19354 | RENAME | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 19355 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 19356 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19357 | | NFS4ERR_DQUOT, NFS4ERR_EXIST, | 19358 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 19359 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 19360 | | NFS4ERR_IO, NFS4ERR_MLINK, | 19361 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 19362 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 19363 | | NFS4ERR_NOSPC, NFS4ERR_NOTDIR, | 19364 | | NFS4ERR_NOTEMPTY, | 19365 | | NFS4ERR_OP_NOT_IN_SESSION, | 19366 | | NFS4ERR_REP_TOO_BIG, | 19367 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19368 | | NFS4ERR_REQ_TOO_BIG, | 19369 | | NFS4ERR_RETRY_UNCACHED_REP, | 19370 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19371 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 19372 | | NFS4ERR_WRONGSEC, NFS4ERR_XDEV | 19373 +----------------------+----------------------------------------+ 19374 | RENEW | NFS4ERR_NOTSUPP | 19375 
+----------------------+----------------------------------------+ 19376 | RESTOREFH | NFS4ERR_DEADSESSION, | 19377 | | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | 19378 | | NFS4ERR_NOFILEHANDLE, | 19379 | | NFS4ERR_OP_NOT_IN_SESSION, | 19380 | | NFS4ERR_REP_TOO_BIG, | 19381 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19382 | | NFS4ERR_REQ_TOO_BIG, | 19383 | | NFS4ERR_RETRY_UNCACHED_REP, | 19384 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19385 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 19386 +----------------------+----------------------------------------+ 19387 | SAVEFH | NFS4ERR_DEADSESSION, | 19388 | | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | 19389 | | NFS4ERR_NOFILEHANDLE, | 19390 | | NFS4ERR_OP_NOT_IN_SESSION, | 19391 | | NFS4ERR_REP_TOO_BIG, | 19392 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19393 | | NFS4ERR_REQ_TOO_BIG, | 19394 | | NFS4ERR_RETRY_UNCACHED_REP, | 19395 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19396 | | NFS4ERR_TOO_MANY_OPS | 19397 +----------------------+----------------------------------------+ 19398 | SECINFO | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 19399 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 19400 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19401 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 19402 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 19403 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 19404 | | NFS4ERR_NOTDIR, | 19405 | | NFS4ERR_OP_NOT_IN_SESSION, | 19406 | | NFS4ERR_REP_TOO_BIG, | 19407 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19408 | | NFS4ERR_REQ_TOO_BIG, | 19409 | | NFS4ERR_RETRY_UNCACHED_REP, | 19410 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19411 | | NFS4ERR_TOO_MANY_OPS | 19412 +----------------------+----------------------------------------+ 19413 | SECINFO_NO_NAME | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 19414 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19415 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 19416 | | NFS4ERR_MOVED, NFS4ERR_NOENT, | 19417 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 19418 | | NFS4ERR_NOTSUPP, | 19419 | | NFS4ERR_OP_NOT_IN_SESSION, | 19420 | | NFS4ERR_REP_TOO_BIG, | 19421 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19422 | | NFS4ERR_REQ_TOO_BIG, | 19423 | | NFS4ERR_RETRY_UNCACHED_REP, | 19424 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19425 | | NFS4ERR_TOO_MANY_OPS | 19426 +----------------------+----------------------------------------+ 19427 | SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | 19428 | | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT, | 19429 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 19430 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19431 | | NFS4ERR_REP_TOO_BIG, | 19432 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19433 | | NFS4ERR_REQ_TOO_BIG, | 19434 | | NFS4ERR_RETRY_UNCACHED_REP, | 19435 | | NFS4ERR_SEQUENCE_POS, | 19436 | | NFS4ERR_SEQ_FALSE_RETRY, | 19437 | | NFS4ERR_SEQ_MISORDERED, | 19438 | | NFS4ERR_TOO_MANY_OPS | 19439 +----------------------+----------------------------------------+ 19440 | SET_SSV | NFS4ERR_BADXDR, | 19441 | | NFS4ERR_BAD_SESSION_DIGEST, | 19442 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19443 | | NFS4ERR_INVAL, | 19444 | | NFS4ERR_OP_NOT_IN_SESSION, | 19445 | | NFS4ERR_REP_TOO_BIG, | 19446 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19447 | | NFS4ERR_REQ_TOO_BIG, | 19448 | | NFS4ERR_RETRY_UNCACHED_REP, | 19449 | | NFS4ERR_TOO_MANY_OPS | 19450 +----------------------+----------------------------------------+ 19451 | SETATTR | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 19452 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | 19453 | | NFS4ERR_BADOWNER, NFS4ERR_BADXDR, | 19454 | | NFS4ERR_BAD_STATEID, | 19455 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19456 | | NFS4ERR_DELEG_REVOKED, 
NFS4ERR_DQUOT, | 19457 | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | 19458 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 19459 | | NFS4ERR_INVAL, NFS4ERR_IO, | 19460 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 19461 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 19462 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 19463 | | NFS4ERR_OP_NOT_IN_SESSION, | 19464 | | NFS4ERR_PERM, NFS4ERR_REP_TOO_BIG, | 19465 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19466 | | NFS4ERR_REQ_TOO_BIG, | 19467 | | NFS4ERR_RETRY_UNCACHED_REP, | 19468 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19469 | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | 19470 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 19471 | | NFS4ERR_WRONG_TYPE | 19472 +----------------------+----------------------------------------+ 19473 | SETCLIENTID | NFS4ERR_NOTSUPP | 19474 +----------------------+----------------------------------------+ 19475 | SETCLIENTID_CONFIRM | NFS4ERR_NOTSUPP | 19476 +----------------------+----------------------------------------+ 19477 | TEST_STATEID | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 19478 | | NFS4ERR_DELAY, | 19479 | | NFS4ERR_OP_NOT_IN_SESSION, | 19480 | | NFS4ERR_REP_TOO_BIG, | 19481 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19482 | | NFS4ERR_REQ_TOO_BIG, | 19483 | | NFS4ERR_RETRY_UNCACHED_REP, | 19484 | | NFS4ERR_SERVERFAULT, | 19485 | | NFS4ERR_TOO_MANY_OPS | 19486 +----------------------+----------------------------------------+ 19487 | VERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 19488 | | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 19489 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19490 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 19491 | | NFS4ERR_INVAL, NFS4ERR_IO, | 19492 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 19493 | | NFS4ERR_NOT_SAME, | 19494 | | NFS4ERR_OP_NOT_IN_SESSION, | 19495 | | NFS4ERR_REP_TOO_BIG, | 19496 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19497 | | NFS4ERR_REQ_TOO_BIG, | 19498 | | NFS4ERR_RETRY_UNCACHED_REP, | 19499 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19500 | | NFS4ERR_TOO_MANY_OPS, | 19501 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 19502 | | NFS4ERR_WRONG_TYPE | 19503 +----------------------+----------------------------------------+ 19504 | WANT_DELEGATION | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 19505 | | NFS4ERR_DELAY, | 19506 | | NFS4ERR_DELEG_ALREADY_WANTED, | 19507 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 19508 | | NFS4ERR_INVAL, NFS4ERR_IO, | 19509 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 19510 | | NFS4ERR_NOTSUPP, NFS4ERR_NO_GRACE, | 19511 | | NFS4ERR_OP_NOT_IN_SESSION, | 19512 | | NFS4ERR_RECALLCONFLICT, | 19513 | | NFS4ERR_RECLAIM_BAD, | 19514 | | NFS4ERR_RECLAIM_CONFLICT, | 19515 | | NFS4ERR_REP_TOO_BIG, | 19516 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19517 | | NFS4ERR_REQ_TOO_BIG, | 19518 | | NFS4ERR_RETRY_UNCACHED_REP, | 19519 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 19520 | | NFS4ERR_TOO_MANY_OPS, | 19521 | | NFS4ERR_WRONG_TYPE | 19522 +----------------------+----------------------------------------+ 19523 | WRITE | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 19524 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 19525 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 19526 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 19527 | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | 19528 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 19529 | | NFS4ERR_INVAL, NFS4ERR_IO, | 19530 | | NFS4ERR_ISDIR, NFS4ERR_LOCKED, | 19531 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 19532 | | NFS4ERR_NOSPC, NFS4ERR_OLD_STATEID, | 19533 | | NFS4ERR_OPENMODE, | 19534 | | NFS4ERR_OP_NOT_IN_SESSION, | 19535 | | NFS4ERR_PNFS_IO_HOLE, | 19536 | | NFS4ERR_PNFS_NO_LAYOUT, | 19537 | | NFS4ERR_REP_TOO_BIG, | 19538 | | 
NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19539 | | NFS4ERR_REQ_TOO_BIG, | 19540 | | NFS4ERR_RETRY_UNCACHED_REP, | 19541 | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | 19542 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 19543 | | NFS4ERR_TOO_MANY_OPS, | 19544 | | NFS4ERR_WRONG_TYPE | 19545 +----------------------+----------------------------------------+ 19547 Table 12: Valid Error Returns for Each Protocol Operation 19549 15.3. Callback Operations and Their Valid Errors 19551 This section contains a table that gives the valid error returns for 19552 each callback operation. The error code NFS4_OK (indicating no 19553 error) is not listed but should be understood to be returnable by all 19554 callback operations with the exception of CB_ILLEGAL. 19556 +=========================+=======================================+ 19557 | Callback Operation | Errors | 19558 +=========================+=======================================+ 19559 | CB_GETATTR | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 19560 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 19561 | | NFS4ERR_OP_NOT_IN_SESSION, | 19562 | | NFS4ERR_REP_TOO_BIG, | 19563 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19564 | | NFS4ERR_REQ_TOO_BIG, | 19565 | | NFS4ERR_RETRY_UNCACHED_REP, | 19566 | | NFS4ERR_SERVERFAULT, | 19567 | | NFS4ERR_TOO_MANY_OPS, | 19568 +-------------------------+---------------------------------------+ 19569 | CB_ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL | 19570 +-------------------------+---------------------------------------+ 19571 | CB_LAYOUTRECALL | NFS4ERR_BADHANDLE, NFS4ERR_BADIOMODE, | 19572 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 19573 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 19574 | | NFS4ERR_NOMATCHING_LAYOUT, | 19575 | | NFS4ERR_NOTSUPP, | 19576 | | NFS4ERR_OP_NOT_IN_SESSION, | 19577 | | NFS4ERR_REP_TOO_BIG, | 19578 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19579 | | NFS4ERR_REQ_TOO_BIG, | 19580 | | NFS4ERR_RETRY_UNCACHED_REP, | 19581 | | NFS4ERR_TOO_MANY_OPS, | 19582 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 19583 | | NFS4ERR_WRONG_TYPE | 19584 +-------------------------+---------------------------------------+ 19585 | CB_NOTIFY | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 19586 | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | 19587 | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, | 19588 | | NFS4ERR_OP_NOT_IN_SESSION, | 19589 | | NFS4ERR_REP_TOO_BIG, | 19590 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19591 | | NFS4ERR_REQ_TOO_BIG, | 19592 | | NFS4ERR_RETRY_UNCACHED_REP, | 19593 | | NFS4ERR_SERVERFAULT, | 19594 | | NFS4ERR_TOO_MANY_OPS | 19595 +-------------------------+---------------------------------------+ 19596 | CB_NOTIFY_DEVICEID | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 19597 | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, | 19598 | | NFS4ERR_OP_NOT_IN_SESSION, | 19599 | | NFS4ERR_REP_TOO_BIG, | 19600 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19601 | | NFS4ERR_REQ_TOO_BIG, | 19602 | | NFS4ERR_RETRY_UNCACHED_REP, | 19603 | | NFS4ERR_SERVERFAULT, | 19604 | | NFS4ERR_TOO_MANY_OPS | 19605 +-------------------------+---------------------------------------+ 19606 | CB_NOTIFY_LOCK | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 19607 | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | 19608 | | NFS4ERR_NOTSUPP, | 19609 | | NFS4ERR_OP_NOT_IN_SESSION, | 19610 | | NFS4ERR_REP_TOO_BIG, | 19611 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19612 | | NFS4ERR_REQ_TOO_BIG, | 19613 | | NFS4ERR_RETRY_UNCACHED_REP, | 19614 | | NFS4ERR_SERVERFAULT, | 19615 | | NFS4ERR_TOO_MANY_OPS | 19616 +-------------------------+---------------------------------------+ 19617 | CB_PUSH_DELEG | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 19618 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 19619 | 
| NFS4ERR_NOTSUPP, | 19620 | | NFS4ERR_OP_NOT_IN_SESSION, | 19621 | | NFS4ERR_REJECT_DELEG, | 19622 | | NFS4ERR_REP_TOO_BIG, | 19623 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19624 | | NFS4ERR_REQ_TOO_BIG, | 19625 | | NFS4ERR_RETRY_UNCACHED_REP, | 19626 | | NFS4ERR_SERVERFAULT, | 19627 | | NFS4ERR_TOO_MANY_OPS, | 19628 | | NFS4ERR_WRONG_TYPE | 19629 +-------------------------+---------------------------------------+ 19630 | CB_RECALL | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 19631 | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | 19632 | | NFS4ERR_OP_NOT_IN_SESSION, | 19633 | | NFS4ERR_REP_TOO_BIG, | 19634 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19635 | | NFS4ERR_REQ_TOO_BIG, | 19636 | | NFS4ERR_RETRY_UNCACHED_REP, | 19637 | | NFS4ERR_SERVERFAULT, | 19638 | | NFS4ERR_TOO_MANY_OPS | 19639 +-------------------------+---------------------------------------+ 19640 | CB_RECALL_ANY | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 19641 | | NFS4ERR_INVAL, | 19642 | | NFS4ERR_OP_NOT_IN_SESSION, | 19643 | | NFS4ERR_REP_TOO_BIG, | 19644 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19645 | | NFS4ERR_REQ_TOO_BIG, | 19646 | | NFS4ERR_RETRY_UNCACHED_REP, | 19647 | | NFS4ERR_TOO_MANY_OPS | 19648 +-------------------------+---------------------------------------+ 19649 | CB_RECALLABLE_OBJ_AVAIL | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 19650 | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, | 19651 | | NFS4ERR_OP_NOT_IN_SESSION, | 19652 | | NFS4ERR_REP_TOO_BIG, | 19653 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19654 | | NFS4ERR_REQ_TOO_BIG, | 19655 | | NFS4ERR_RETRY_UNCACHED_REP, | 19656 | | NFS4ERR_SERVERFAULT, | 19657 | | NFS4ERR_TOO_MANY_OPS | 19658 +-------------------------+---------------------------------------+ 19659 | CB_RECALL_SLOT | NFS4ERR_BADXDR, | 19660 | | NFS4ERR_BAD_HIGH_SLOT, NFS4ERR_DELAY, | 19661 | | NFS4ERR_OP_NOT_IN_SESSION, | 19662 | | NFS4ERR_REP_TOO_BIG, | 19663 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19664 | | NFS4ERR_REQ_TOO_BIG, | 19665 | | NFS4ERR_RETRY_UNCACHED_REP, | 19666 | | NFS4ERR_TOO_MANY_OPS | 19667 +-------------------------+---------------------------------------+ 19668 | CB_SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | 19669 | | NFS4ERR_BADXDR, | 19670 | | NFS4ERR_BAD_HIGH_SLOT, | 19671 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 19672 | | NFS4ERR_DELAY, NFS4ERR_REP_TOO_BIG, | 19673 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19674 | | NFS4ERR_REQ_TOO_BIG, | 19675 | | NFS4ERR_RETRY_UNCACHED_REP, | 19676 | | NFS4ERR_SEQUENCE_POS, | 19677 | | NFS4ERR_SEQ_FALSE_RETRY, | 19678 | | NFS4ERR_SEQ_MISORDERED, | 19679 | | NFS4ERR_TOO_MANY_OPS | 19680 +-------------------------+---------------------------------------+ 19681 | CB_WANTS_CANCELLED | NFS4ERR_BADXDR, NFS4ERR_DELAY, | 19682 | | NFS4ERR_NOTSUPP, | 19683 | | NFS4ERR_OP_NOT_IN_SESSION, | 19684 | | NFS4ERR_REP_TOO_BIG, | 19685 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 19686 | | NFS4ERR_REQ_TOO_BIG, | 19687 | | NFS4ERR_RETRY_UNCACHED_REP, | 19688 | | NFS4ERR_SERVERFAULT, | 19689 | | NFS4ERR_TOO_MANY_OPS | 19690 +-------------------------+---------------------------------------+ 19692 Table 13: Valid Error Returns for Each Protocol Callback Operation 19694 15.4. 
Errors and the Operations That Use Them 19696 +===================================+===============================+ 19697 | Error | Operations | 19698 +===================================+===============================+ 19699 | NFS4ERR_ACCESS | ACCESS, COMMIT, CREATE, | 19700 | | GETATTR, GET_DIR_DELEGATION, | 19701 | | LAYOUTCOMMIT, LAYOUTGET, | 19702 | | LINK, LOCK, LOCKT, LOCKU, | 19703 | | LOOKUP, LOOKUPP, NVERIFY, | 19704 | | OPEN, OPENATTR, READ, | 19705 | | READDIR, READLINK, REMOVE, | 19706 | | RENAME, SECINFO, | 19707 | | SECINFO_NO_NAME, SETATTR, | 19708 | | VERIFY, WRITE | 19709 +-----------------------------------+-------------------------------+ 19710 | NFS4ERR_ADMIN_REVOKED | CLOSE, DELEGRETURN, | 19711 | | LAYOUTCOMMIT, LAYOUTGET, | 19712 | | LAYOUTRETURN, LOCK, LOCKU, | 19713 | | OPEN, OPEN_DOWNGRADE, READ, | 19714 | | SETATTR, WRITE | 19715 +-----------------------------------+-------------------------------+ 19716 | NFS4ERR_ATTRNOTSUPP | CREATE, LAYOUTCOMMIT, | 19717 | | NVERIFY, OPEN, SETATTR, | 19718 | | VERIFY | 19719 +-----------------------------------+-------------------------------+ 19720 | NFS4ERR_BACK_CHAN_BUSY | DESTROY_SESSION | 19721 +-----------------------------------+-------------------------------+ 19722 | NFS4ERR_BADCHAR | CREATE, EXCHANGE_ID, LINK, | 19723 | | LOOKUP, NVERIFY, OPEN, | 19724 | | REMOVE, RENAME, SECINFO, | 19725 | | SETATTR, VERIFY | 19726 +-----------------------------------+-------------------------------+ 19727 | NFS4ERR_BADHANDLE | CB_GETATTR, CB_LAYOUTRECALL, | 19728 | | CB_NOTIFY, CB_NOTIFY_LOCK, | 19729 | | CB_PUSH_DELEG, CB_RECALL, | 19730 | | PUTFH | 19731 +-----------------------------------+-------------------------------+ 19732 | NFS4ERR_BADIOMODE | CB_LAYOUTRECALL, | 19733 | | LAYOUTCOMMIT, LAYOUTGET | 19734 +-----------------------------------+-------------------------------+ 19735 | NFS4ERR_BADLAYOUT | LAYOUTCOMMIT, LAYOUTGET | 19736 +-----------------------------------+-------------------------------+ 19737 | NFS4ERR_BADNAME | CREATE, LINK, LOOKUP, OPEN, | 19738 | | REMOVE, RENAME, SECINFO | 19739 +-----------------------------------+-------------------------------+ 19740 | NFS4ERR_BADOWNER | CREATE, OPEN, SETATTR | 19741 +-----------------------------------+-------------------------------+ 19742 | NFS4ERR_BADSESSION | BIND_CONN_TO_SESSION, | 19743 | | CB_SEQUENCE, | 19744 | | DESTROY_SESSION, SEQUENCE | 19745 +-----------------------------------+-------------------------------+ 19746 | NFS4ERR_BADSLOT | CB_SEQUENCE, SEQUENCE | 19747 +-----------------------------------+-------------------------------+ 19748 | NFS4ERR_BADTYPE | CREATE | 19749 +-----------------------------------+-------------------------------+ 19750 | NFS4ERR_BADXDR | ACCESS, BACKCHANNEL_CTL, | 19751 | | BIND_CONN_TO_SESSION, | 19752 | | CB_GETATTR, CB_ILLEGAL, | 19753 | | CB_LAYOUTRECALL, CB_NOTIFY, | 19754 | | CB_NOTIFY_DEVICEID, | 19755 | | CB_NOTIFY_LOCK, | 19756 | | CB_PUSH_DELEG, CB_RECALL, | 19757 | | CB_RECALLABLE_OBJ_AVAIL, | 19758 | | CB_RECALL_ANY, | 19759 | | CB_RECALL_SLOT, CB_SEQUENCE, | 19760 | | CB_WANTS_CANCELLED, CLOSE, | 19761 | | COMMIT, CREATE, | 19762 | | CREATE_SESSION, DELEGPURGE, | 19763 | | DELEGRETURN, | 19764 | | DESTROY_CLIENTID, | 19765 | | DESTROY_SESSION, | 19766 | | EXCHANGE_ID, FREE_STATEID, | 19767 | | GETATTR, GETDEVICEINFO, | 19768 | | GETDEVICELIST, | 19769 | | GET_DIR_DELEGATION, ILLEGAL, | 19770 | | LAYOUTCOMMIT, LAYOUTGET, | 19771 | | LAYOUTRETURN, LINK, LOCK, | 19772 | | LOCKT, LOCKU, LOOKUP, | 19773 | | NVERIFY, 
OPEN, OPENATTR, | 19774 | | OPEN_DOWNGRADE, PUTFH, READ, | 19775 | | READDIR, RECLAIM_COMPLETE, | 19776 | | REMOVE, RENAME, SECINFO, | 19777 | | SECINFO_NO_NAME, SEQUENCE, | 19778 | | SETATTR, SET_SSV, | 19779 | | TEST_STATEID, VERIFY, | 19780 | | WANT_DELEGATION, WRITE | 19781 +-----------------------------------+-------------------------------+ 19782 | NFS4ERR_BAD_COOKIE | GETDEVICELIST, READDIR | 19783 +-----------------------------------+-------------------------------+ 19784 | NFS4ERR_BAD_HIGH_SLOT | CB_RECALL_SLOT, CB_SEQUENCE, | 19785 | | SEQUENCE | 19786 +-----------------------------------+-------------------------------+ 19787 | NFS4ERR_BAD_RANGE | LOCK, LOCKT, LOCKU | 19788 +-----------------------------------+-------------------------------+ 19789 | NFS4ERR_BAD_SESSION_DIGEST | BIND_CONN_TO_SESSION, | 19790 | | SET_SSV | 19791 +-----------------------------------+-------------------------------+ 19792 | NFS4ERR_BAD_STATEID | CB_LAYOUTRECALL, CB_NOTIFY, | 19793 | | CB_NOTIFY_LOCK, CB_RECALL, | 19794 | | CLOSE, DELEGRETURN, | 19795 | | FREE_STATEID, LAYOUTGET, | 19796 | | LAYOUTRETURN, LOCK, LOCKU, | 19797 | | OPEN, OPEN_DOWNGRADE, READ, | 19798 | | SETATTR, WRITE | 19799 +-----------------------------------+-------------------------------+ 19800 | NFS4ERR_CB_PATH_DOWN | DESTROY_SESSION | 19801 +-----------------------------------+-------------------------------+ 19802 | NFS4ERR_CLID_INUSE | CREATE_SESSION, EXCHANGE_ID | 19803 +-----------------------------------+-------------------------------+ 19804 | NFS4ERR_CLIENTID_BUSY | DESTROY_CLIENTID | 19805 +-----------------------------------+-------------------------------+ 19806 | NFS4ERR_COMPLETE_ALREADY | RECLAIM_COMPLETE | 19807 +-----------------------------------+-------------------------------+ 19808 | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | CB_SEQUENCE, | 19809 | | DESTROY_SESSION, SEQUENCE | 19810 +-----------------------------------+-------------------------------+ 19811 | NFS4ERR_DEADLOCK | LOCK | 19812 +-----------------------------------+-------------------------------+ 19813 | NFS4ERR_DEADSESSION | ACCESS, BACKCHANNEL_CTL, | 19814 | | BIND_CONN_TO_SESSION, CLOSE, | 19815 | | COMMIT, CREATE, | 19816 | | CREATE_SESSION, DELEGPURGE, | 19817 | | DELEGRETURN, | 19818 | | DESTROY_CLIENTID, | 19819 | | DESTROY_SESSION, | 19820 | | EXCHANGE_ID, FREE_STATEID, | 19821 | | GETATTR, GETDEVICEINFO, | 19822 | | GETDEVICELIST, | 19823 | | GET_DIR_DELEGATION, | 19824 | | LAYOUTCOMMIT, LAYOUTGET, | 19825 | | LAYOUTRETURN, LINK, LOCK, | 19826 | | LOCKT, LOCKU, LOOKUP, | 19827 | | LOOKUPP, NVERIFY, OPEN, | 19828 | | OPENATTR, OPEN_DOWNGRADE, | 19829 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19830 | | READ, READDIR, READLINK, | 19831 | | RECLAIM_COMPLETE, REMOVE, | 19832 | | RENAME, RESTOREFH, SAVEFH, | 19833 | | SECINFO, SECINFO_NO_NAME, | 19834 | | SEQUENCE, SETATTR, SET_SSV, | 19835 | | TEST_STATEID, VERIFY, | 19836 | | WANT_DELEGATION, WRITE | 19837 +-----------------------------------+-------------------------------+ 19838 | NFS4ERR_DELAY | ACCESS, BACKCHANNEL_CTL, | 19839 | | BIND_CONN_TO_SESSION, | 19840 | | CB_GETATTR, CB_LAYOUTRECALL, | 19841 | | CB_NOTIFY, | 19842 | | CB_NOTIFY_DEVICEID, | 19843 | | CB_NOTIFY_LOCK, | 19844 | | CB_PUSH_DELEG, CB_RECALL, | 19845 | | CB_RECALLABLE_OBJ_AVAIL, | 19846 | | CB_RECALL_ANY, | 19847 | | CB_RECALL_SLOT, CB_SEQUENCE, | 19848 | | CB_WANTS_CANCELLED, CLOSE, | 19849 | | COMMIT, CREATE, | 19850 | | CREATE_SESSION, DELEGPURGE, | 19851 | | DELEGRETURN, | 19852 | | DESTROY_CLIENTID, | 19853 | | 
DESTROY_SESSION, | 19854 | | EXCHANGE_ID, FREE_STATEID, | 19855 | | GETATTR, GETDEVICEINFO, | 19856 | | GETDEVICELIST, | 19857 | | GET_DIR_DELEGATION, | 19858 | | LAYOUTCOMMIT, LAYOUTGET, | 19859 | | LAYOUTRETURN, LINK, LOCK, | 19860 | | LOCKT, LOCKU, LOOKUP, | 19861 | | LOOKUPP, NVERIFY, OPEN, | 19862 | | OPENATTR, OPEN_DOWNGRADE, | 19863 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19864 | | READ, READDIR, READLINK, | 19865 | | RECLAIM_COMPLETE, REMOVE, | 19866 | | RENAME, SECINFO, | 19867 | | SECINFO_NO_NAME, SEQUENCE, | 19868 | | SETATTR, SET_SSV, | 19869 | | TEST_STATEID, VERIFY, | 19870 | | WANT_DELEGATION, WRITE | 19871 +-----------------------------------+-------------------------------+ 19872 | NFS4ERR_DELEG_ALREADY_WANTED | OPEN, WANT_DELEGATION | 19873 +-----------------------------------+-------------------------------+ 19874 | NFS4ERR_DELEG_REVOKED | DELEGRETURN, LAYOUTCOMMIT, | 19875 | | LAYOUTGET, LAYOUTRETURN, | 19876 | | OPEN, READ, SETATTR, WRITE | 19877 +-----------------------------------+-------------------------------+ 19878 | NFS4ERR_DENIED | LOCK, LOCKT | 19879 +-----------------------------------+-------------------------------+ 19880 | NFS4ERR_DIRDELEG_UNAVAIL | GET_DIR_DELEGATION | 19881 +-----------------------------------+-------------------------------+ 19882 | NFS4ERR_DQUOT | CREATE, LAYOUTGET, LINK, | 19883 | | OPEN, OPENATTR, RENAME, | 19884 | | SETATTR, WRITE | 19885 +-----------------------------------+-------------------------------+ 19886 | NFS4ERR_ENCR_ALG_UNSUPP | EXCHANGE_ID | 19887 +-----------------------------------+-------------------------------+ 19888 | NFS4ERR_EXIST | CREATE, LINK, OPEN, RENAME | 19889 +-----------------------------------+-------------------------------+ 19890 | NFS4ERR_EXPIRED | CLOSE, DELEGRETURN, | 19891 | | LAYOUTCOMMIT, LAYOUTRETURN, | 19892 | | LOCK, LOCKU, OPEN, | 19893 | | OPEN_DOWNGRADE, READ, | 19894 | | SETATTR, WRITE | 19895 +-----------------------------------+-------------------------------+ 19896 | NFS4ERR_FBIG | LAYOUTCOMMIT, OPEN, SETATTR, | 19897 | | WRITE | 19898 +-----------------------------------+-------------------------------+ 19899 | NFS4ERR_FHEXPIRED | ACCESS, CLOSE, COMMIT, | 19900 | | CREATE, DELEGRETURN, | 19901 | | GETATTR, GETDEVICELIST, | 19902 | | GETFH, GET_DIR_DELEGATION, | 19903 | | LAYOUTCOMMIT, LAYOUTGET, | 19904 | | LAYOUTRETURN, LINK, LOCK, | 19905 | | LOCKT, LOCKU, LOOKUP, | 19906 | | LOOKUPP, NVERIFY, OPEN, | 19907 | | OPENATTR, OPEN_DOWNGRADE, | 19908 | | READ, READDIR, READLINK, | 19909 | | RECLAIM_COMPLETE, REMOVE, | 19910 | | RENAME, RESTOREFH, SAVEFH, | 19911 | | SECINFO, SECINFO_NO_NAME, | 19912 | | SETATTR, VERIFY, | 19913 | | WANT_DELEGATION, WRITE | 19914 +-----------------------------------+-------------------------------+ 19915 | NFS4ERR_FILE_OPEN | LINK, REMOVE, RENAME | 19916 +-----------------------------------+-------------------------------+ 19917 | NFS4ERR_GRACE | GETATTR, GET_DIR_DELEGATION, | 19918 | | LAYOUTCOMMIT, LAYOUTGET, | 19919 | | LAYOUTRETURN, LINK, LOCK, | 19920 | | LOCKT, NVERIFY, OPEN, READ, | 19921 | | REMOVE, RENAME, SETATTR, | 19922 | | VERIFY, WANT_DELEGATION, | 19923 | | WRITE | 19924 +-----------------------------------+-------------------------------+ 19925 | NFS4ERR_HASH_ALG_UNSUPP | EXCHANGE_ID | 19926 +-----------------------------------+-------------------------------+ 19927 | NFS4ERR_INVAL | ACCESS, BACKCHANNEL_CTL, | 19928 | | BIND_CONN_TO_SESSION, | 19929 | | CB_GETATTR, CB_LAYOUTRECALL, | 19930 | | CB_NOTIFY, | 19931 | | CB_NOTIFY_DEVICEID, | 19932 | 
| CB_PUSH_DELEG, | 19933 | | CB_RECALLABLE_OBJ_AVAIL, | 19934 | | CB_RECALL_ANY, CREATE, | 19935 | | CREATE_SESSION, DELEGRETURN, | 19936 | | EXCHANGE_ID, GETATTR, | 19937 | | GETDEVICEINFO, | 19938 | | GETDEVICELIST, | 19939 | | GET_DIR_DELEGATION, | 19940 | | LAYOUTCOMMIT, LAYOUTGET, | 19941 | | LAYOUTRETURN, LINK, LOCK, | 19942 | | LOCKT, LOCKU, LOOKUP, | 19943 | | NVERIFY, OPEN, | 19944 | | OPEN_DOWNGRADE, READ, | 19945 | | READDIR, READLINK, | 19946 | | RECLAIM_COMPLETE, REMOVE, | 19947 | | RENAME, SECINFO, | 19948 | | SECINFO_NO_NAME, SETATTR, | 19949 | | SET_SSV, VERIFY, | 19950 | | WANT_DELEGATION, WRITE | 19951 +-----------------------------------+-------------------------------+ 19952 | NFS4ERR_IO | ACCESS, COMMIT, CREATE, | 19953 | | GETATTR, GETDEVICELIST, | 19954 | | GET_DIR_DELEGATION, | 19955 | | LAYOUTCOMMIT, LAYOUTGET, | 19956 | | LINK, LOOKUP, LOOKUPP, | 19957 | | NVERIFY, OPEN, OPENATTR, | 19958 | | READ, READDIR, READLINK, | 19959 | | REMOVE, RENAME, SETATTR, | 19960 | | VERIFY, WANT_DELEGATION, | 19961 | | WRITE | 19962 +-----------------------------------+-------------------------------+ 19963 | NFS4ERR_ISDIR | COMMIT, LAYOUTCOMMIT, | 19964 | | LAYOUTRETURN, LINK, LOCK, | 19965 | | LOCKT, OPEN, READ, WRITE | 19966 +-----------------------------------+-------------------------------+ 19967 | NFS4ERR_LAYOUTTRYLATER | LAYOUTGET | 19968 +-----------------------------------+-------------------------------+ 19969 | NFS4ERR_LAYOUTUNAVAILABLE | LAYOUTGET | 19970 +-----------------------------------+-------------------------------+ 19971 | NFS4ERR_LOCKED | LAYOUTGET, READ, SETATTR, | 19972 | | WRITE | 19973 +-----------------------------------+-------------------------------+ 19974 | NFS4ERR_LOCKS_HELD | CLOSE, FREE_STATEID | 19975 +-----------------------------------+-------------------------------+ 19976 | NFS4ERR_LOCK_NOTSUPP | LOCK | 19977 +-----------------------------------+-------------------------------+ 19978 | NFS4ERR_LOCK_RANGE | LOCK, LOCKT, LOCKU | 19979 +-----------------------------------+-------------------------------+ 19980 | NFS4ERR_MLINK | CREATE, LINK, RENAME | 19981 +-----------------------------------+-------------------------------+ 19982 | NFS4ERR_MOVED | ACCESS, CLOSE, COMMIT, | 19983 | | CREATE, DELEGRETURN, | 19984 | | GETATTR, GETFH, | 19985 | | GET_DIR_DELEGATION, | 19986 | | LAYOUTCOMMIT, LAYOUTGET, | 19987 | | LAYOUTRETURN, LINK, LOCK, | 19988 | | LOCKT, LOCKU, LOOKUP, | 19989 | | LOOKUPP, NVERIFY, OPEN, | 19990 | | OPENATTR, OPEN_DOWNGRADE, | 19991 | | PUTFH, READ, READDIR, | 19992 | | READLINK, RECLAIM_COMPLETE, | 19993 | | REMOVE, RENAME, RESTOREFH, | 19994 | | SAVEFH, SECINFO, | 19995 | | SECINFO_NO_NAME, SETATTR, | 19996 | | VERIFY, WANT_DELEGATION, | 19997 | | WRITE | 19998 +-----------------------------------+-------------------------------+ 19999 | NFS4ERR_NAMETOOLONG | CREATE, LINK, LOOKUP, OPEN, | 20000 | | REMOVE, RENAME, SECINFO | 20001 +-----------------------------------+-------------------------------+ 20002 | NFS4ERR_NOENT | BACKCHANNEL_CTL, | 20003 | | CREATE_SESSION, EXCHANGE_ID, | 20004 | | GETDEVICEINFO, LOOKUP, | 20005 | | LOOKUPP, OPEN, OPENATTR, | 20006 | | REMOVE, RENAME, SECINFO, | 20007 | | SECINFO_NO_NAME | 20008 +-----------------------------------+-------------------------------+ 20009 | NFS4ERR_NOFILEHANDLE | ACCESS, CLOSE, COMMIT, | 20010 | | CREATE, DELEGRETURN, | 20011 | | GETATTR, GETDEVICELIST, | 20012 | | GETFH, GET_DIR_DELEGATION, | 20013 | | LAYOUTCOMMIT, LAYOUTGET, | 20014 | | LAYOUTRETURN, LINK, LOCK, | 
20015 | | LOCKT, LOCKU, LOOKUP, | 20016 | | LOOKUPP, NVERIFY, OPEN, | 20017 | | OPENATTR, OPEN_DOWNGRADE, | 20018 | | READ, READDIR, READLINK, | 20019 | | RECLAIM_COMPLETE, REMOVE, | 20020 | | RENAME, RESTOREFH, SAVEFH, | 20021 | | SECINFO, SECINFO_NO_NAME, | 20022 | | SETATTR, VERIFY, | 20023 | | WANT_DELEGATION, WRITE | 20024 +-----------------------------------+-------------------------------+ 20025 | NFS4ERR_NOMATCHING_LAYOUT | CB_LAYOUTRECALL | 20026 +-----------------------------------+-------------------------------+ 20027 | NFS4ERR_NOSPC | CREATE, CREATE_SESSION, | 20028 | | LAYOUTGET, LINK, OPEN, | 20029 | | OPENATTR, RENAME, SETATTR, | 20030 | | WRITE | 20031 +-----------------------------------+-------------------------------+ 20032 | NFS4ERR_NOTDIR | CREATE, GET_DIR_DELEGATION, | 20033 | | LINK, LOOKUP, LOOKUPP, OPEN, | 20034 | | READDIR, REMOVE, RENAME, | 20035 | | SECINFO, SECINFO_NO_NAME | 20036 +-----------------------------------+-------------------------------+ 20037 | NFS4ERR_NOTEMPTY | REMOVE, RENAME | 20038 +-----------------------------------+-------------------------------+ 20039 | NFS4ERR_NOTSUPP | CB_LAYOUTRECALL, CB_NOTIFY, | 20040 | | CB_NOTIFY_DEVICEID, | 20041 | | CB_NOTIFY_LOCK, | 20042 | | CB_PUSH_DELEG, | 20043 | | CB_RECALLABLE_OBJ_AVAIL, | 20044 | | CB_WANTS_CANCELLED, | 20045 | | DELEGPURGE, DELEGRETURN, | 20046 | | GETDEVICEINFO, | 20047 | | GETDEVICELIST, | 20048 | | GET_DIR_DELEGATION, | 20049 | | LAYOUTCOMMIT, LAYOUTGET, | 20050 | | LAYOUTRETURN, LINK, | 20051 | | OPENATTR, OPEN_CONFIRM, | 20052 | | RELEASE_LOCKOWNER, RENEW, | 20053 | | SECINFO_NO_NAME, | 20054 | | SETCLIENTID, | 20055 | | SETCLIENTID_CONFIRM, | 20056 | | WANT_DELEGATION | 20057 +-----------------------------------+-------------------------------+ 20058 | NFS4ERR_NOT_ONLY_OP | BIND_CONN_TO_SESSION, | 20059 | | CREATE_SESSION, | 20060 | | DESTROY_CLIENTID, | 20061 | | DESTROY_SESSION, EXCHANGE_ID | 20062 +-----------------------------------+-------------------------------+ 20063 | NFS4ERR_NOT_SAME | EXCHANGE_ID, GETDEVICELIST, | 20064 | | READDIR, VERIFY | 20065 +-----------------------------------+-------------------------------+ 20066 | NFS4ERR_NO_GRACE | LAYOUTCOMMIT, LAYOUTRETURN, | 20067 | | LOCK, OPEN, WANT_DELEGATION | 20068 +-----------------------------------+-------------------------------+ 20069 | NFS4ERR_OLD_STATEID | CLOSE, DELEGRETURN, | 20070 | | FREE_STATEID, LAYOUTGET, | 20071 | | LAYOUTRETURN, LOCK, LOCKU, | 20072 | | OPEN, OPEN_DOWNGRADE, READ, | 20073 | | SETATTR, WRITE | 20074 +-----------------------------------+-------------------------------+ 20075 | NFS4ERR_OPENMODE | LAYOUTGET, LOCK, READ, | 20076 | | SETATTR, WRITE | 20077 +-----------------------------------+-------------------------------+ 20078 | NFS4ERR_OP_ILLEGAL | CB_ILLEGAL, ILLEGAL | 20079 +-----------------------------------+-------------------------------+ 20080 | NFS4ERR_OP_NOT_IN_SESSION | ACCESS, BACKCHANNEL_CTL, | 20081 | | CB_GETATTR, CB_LAYOUTRECALL, | 20082 | | CB_NOTIFY, | 20083 | | CB_NOTIFY_DEVICEID, | 20084 | | CB_NOTIFY_LOCK, | 20085 | | CB_PUSH_DELEG, CB_RECALL, | 20086 | | CB_RECALLABLE_OBJ_AVAIL, | 20087 | | CB_RECALL_ANY, | 20088 | | CB_RECALL_SLOT, | 20089 | | CB_WANTS_CANCELLED, CLOSE, | 20090 | | COMMIT, CREATE, DELEGPURGE, | 20091 | | DELEGRETURN, FREE_STATEID, | 20092 | | GETATTR, GETDEVICEINFO, | 20093 | | GETDEVICELIST, GETFH, | 20094 | | GET_DIR_DELEGATION, | 20095 | | LAYOUTCOMMIT, LAYOUTGET, | 20096 | | LAYOUTRETURN, LINK, LOCK, | 20097 | | LOCKT, LOCKU, LOOKUP, | 20098 | | 
LOOKUPP, NVERIFY, OPEN, | 20099 | | OPENATTR, OPEN_DOWNGRADE, | 20100 | | PUTFH, PUTPUBFH, PUTROOTFH, | 20101 | | READ, READDIR, READLINK, | 20102 | | RECLAIM_COMPLETE, REMOVE, | 20103 | | RENAME, RESTOREFH, SAVEFH, | 20104 | | SECINFO, SECINFO_NO_NAME, | 20105 | | SETATTR, SET_SSV, | 20106 | | TEST_STATEID, VERIFY, | 20107 | | WANT_DELEGATION, WRITE | 20108 +-----------------------------------+-------------------------------+ 20109 | NFS4ERR_PERM | CREATE, OPEN, SETATTR | 20110 +-----------------------------------+-------------------------------+ 20111 | NFS4ERR_PNFS_IO_HOLE | READ, WRITE | 20112 +-----------------------------------+-------------------------------+ 20113 | NFS4ERR_PNFS_NO_LAYOUT | READ, WRITE | 20114 +-----------------------------------+-------------------------------+ 20115 | NFS4ERR_RECALLCONFLICT | LAYOUTGET, WANT_DELEGATION | 20116 +-----------------------------------+-------------------------------+ 20117 | NFS4ERR_RECLAIM_BAD | LAYOUTCOMMIT, LOCK, OPEN, | 20118 | | WANT_DELEGATION | 20119 +-----------------------------------+-------------------------------+ 20120 | NFS4ERR_RECLAIM_CONFLICT | LAYOUTCOMMIT, LOCK, OPEN, | 20121 | | WANT_DELEGATION | 20122 +-----------------------------------+-------------------------------+ 20123 | NFS4ERR_REJECT_DELEG | CB_PUSH_DELEG | 20124 +-----------------------------------+-------------------------------+ 20125 | NFS4ERR_REP_TOO_BIG | ACCESS, BACKCHANNEL_CTL, | 20126 | | BIND_CONN_TO_SESSION, | 20127 | | CB_GETATTR, CB_LAYOUTRECALL, | 20128 | | CB_NOTIFY, | 20129 | | CB_NOTIFY_DEVICEID, | 20130 | | CB_NOTIFY_LOCK, | 20131 | | CB_PUSH_DELEG, CB_RECALL, | 20132 | | CB_RECALLABLE_OBJ_AVAIL, | 20133 | | CB_RECALL_ANY, | 20134 | | CB_RECALL_SLOT, CB_SEQUENCE, | 20135 | | CB_WANTS_CANCELLED, CLOSE, | 20136 | | COMMIT, CREATE, | 20137 | | CREATE_SESSION, DELEGPURGE, | 20138 | | DELEGRETURN, | 20139 | | DESTROY_CLIENTID, | 20140 | | DESTROY_SESSION, | 20141 | | EXCHANGE_ID, FREE_STATEID, | 20142 | | GETATTR, GETDEVICEINFO, | 20143 | | GETDEVICELIST, | 20144 | | GET_DIR_DELEGATION, | 20145 | | LAYOUTCOMMIT, LAYOUTGET, | 20146 | | LAYOUTRETURN, LINK, LOCK, | 20147 | | LOCKT, LOCKU, LOOKUP, | 20148 | | LOOKUPP, NVERIFY, OPEN, | 20149 | | OPENATTR, OPEN_DOWNGRADE, | 20150 | | PUTFH, PUTPUBFH, PUTROOTFH, | 20151 | | READ, READDIR, READLINK, | 20152 | | RECLAIM_COMPLETE, REMOVE, | 20153 | | RENAME, RESTOREFH, SAVEFH, | 20154 | | SECINFO, SECINFO_NO_NAME, | 20155 | | SEQUENCE, SETATTR, SET_SSV, | 20156 | | TEST_STATEID, VERIFY, | 20157 | | WANT_DELEGATION, WRITE | 20158 +-----------------------------------+-------------------------------+ 20159 | NFS4ERR_REP_TOO_BIG_TO_CACHE | ACCESS, BACKCHANNEL_CTL, | 20160 | | BIND_CONN_TO_SESSION, | 20161 | | CB_GETATTR, CB_LAYOUTRECALL, | 20162 | | CB_NOTIFY, | 20163 | | CB_NOTIFY_DEVICEID, | 20164 | | CB_NOTIFY_LOCK, | 20165 | | CB_PUSH_DELEG, CB_RECALL, | 20166 | | CB_RECALLABLE_OBJ_AVAIL, | 20167 | | CB_RECALL_ANY, | 20168 | | CB_RECALL_SLOT, CB_SEQUENCE, | 20169 | | CB_WANTS_CANCELLED, CLOSE, | 20170 | | COMMIT, CREATE, | 20171 | | CREATE_SESSION, DELEGPURGE, | 20172 | | DELEGRETURN, | 20173 | | DESTROY_CLIENTID, | 20174 | | DESTROY_SESSION, | 20175 | | EXCHANGE_ID, FREE_STATEID, | 20176 | | GETATTR, GETDEVICEINFO, | 20177 | | GETDEVICELIST, | 20178 | | GET_DIR_DELEGATION, | 20179 | | LAYOUTCOMMIT, LAYOUTGET, | 20180 | | LAYOUTRETURN, LINK, LOCK, | 20181 | | LOCKT, LOCKU, LOOKUP, | 20182 | | LOOKUPP, NVERIFY, OPEN, | 20183 | | OPENATTR, OPEN_DOWNGRADE, | 20184 | | PUTFH, PUTPUBFH, PUTROOTFH, | 20185 | 
| READ, READDIR, READLINK, | 20186 | | RECLAIM_COMPLETE, REMOVE, | 20187 | | RENAME, RESTOREFH, SAVEFH, | 20188 | | SECINFO, SECINFO_NO_NAME, | 20189 | | SEQUENCE, SETATTR, SET_SSV, | 20190 | | TEST_STATEID, VERIFY, | 20191 | | WANT_DELEGATION, WRITE | 20192 +-----------------------------------+-------------------------------+ 20193 | NFS4ERR_REQ_TOO_BIG | ACCESS, BACKCHANNEL_CTL, | 20194 | | BIND_CONN_TO_SESSION, | 20195 | | CB_GETATTR, CB_LAYOUTRECALL, | 20196 | | CB_NOTIFY, | 20197 | | CB_NOTIFY_DEVICEID, | 20198 | | CB_NOTIFY_LOCK, | 20199 | | CB_PUSH_DELEG, CB_RECALL, | 20200 | | CB_RECALLABLE_OBJ_AVAIL, | 20201 | | CB_RECALL_ANY, | 20202 | | CB_RECALL_SLOT, CB_SEQUENCE, | 20203 | | CB_WANTS_CANCELLED, CLOSE, | 20204 | | COMMIT, CREATE, | 20205 | | CREATE_SESSION, DELEGPURGE, | 20206 | | DELEGRETURN, | 20207 | | DESTROY_CLIENTID, | 20208 | | DESTROY_SESSION, | 20209 | | EXCHANGE_ID, FREE_STATEID, | 20210 | | GETATTR, GETDEVICEINFO, | 20211 | | GETDEVICELIST, | 20212 | | GET_DIR_DELEGATION, | 20213 | | LAYOUTCOMMIT, LAYOUTGET, | 20214 | | LAYOUTRETURN, LINK, LOCK, | 20215 | | LOCKT, LOCKU, LOOKUP, | 20216 | | LOOKUPP, NVERIFY, OPEN, | 20217 | | OPENATTR, OPEN_DOWNGRADE, | 20218 | | PUTFH, PUTPUBFH, PUTROOTFH, | 20219 | | READ, READDIR, READLINK, | 20220 | | RECLAIM_COMPLETE, REMOVE, | 20221 | | RENAME, RESTOREFH, SAVEFH, | 20222 | | SECINFO, SECINFO_NO_NAME, | 20223 | | SEQUENCE, SETATTR, SET_SSV, | 20224 | | TEST_STATEID, VERIFY, | 20225 | | WANT_DELEGATION, WRITE | 20226 +-----------------------------------+-------------------------------+ 20227 | NFS4ERR_RETRY_UNCACHED_REP | ACCESS, BACKCHANNEL_CTL, | 20228 | | BIND_CONN_TO_SESSION, | 20229 | | CB_GETATTR, CB_LAYOUTRECALL, | 20230 | | CB_NOTIFY, | 20231 | | CB_NOTIFY_DEVICEID, | 20232 | | CB_NOTIFY_LOCK, | 20233 | | CB_PUSH_DELEG, CB_RECALL, | 20234 | | CB_RECALLABLE_OBJ_AVAIL, | 20235 | | CB_RECALL_ANY, | 20236 | | CB_RECALL_SLOT, CB_SEQUENCE, | 20237 | | CB_WANTS_CANCELLED, CLOSE, | 20238 | | COMMIT, CREATE, | 20239 | | CREATE_SESSION, DELEGPURGE, | 20240 | | DELEGRETURN, | 20241 | | DESTROY_CLIENTID, | 20242 | | DESTROY_SESSION, | 20243 | | EXCHANGE_ID, FREE_STATEID, | 20244 | | GETATTR, GETDEVICEINFO, | 20245 | | GETDEVICELIST, | 20246 | | GET_DIR_DELEGATION, | 20247 | | LAYOUTCOMMIT, LAYOUTGET, | 20248 | | LAYOUTRETURN, LINK, LOCK, | 20249 | | LOCKT, LOCKU, LOOKUP, | 20250 | | LOOKUPP, NVERIFY, OPEN, | 20251 | | OPENATTR, OPEN_DOWNGRADE, | 20252 | | PUTFH, PUTPUBFH, PUTROOTFH, | 20253 | | READ, READDIR, READLINK, | 20254 | | RECLAIM_COMPLETE, REMOVE, | 20255 | | RENAME, RESTOREFH, SAVEFH, | 20256 | | SECINFO, SECINFO_NO_NAME, | 20257 | | SEQUENCE, SETATTR, SET_SSV, | 20258 | | TEST_STATEID, VERIFY, | 20259 | | WANT_DELEGATION, WRITE | 20260 +-----------------------------------+-------------------------------+ 20261 | NFS4ERR_ROFS | CREATE, LINK, LOCK, LOCKT, | 20262 | | OPEN, OPENATTR, | 20263 | | OPEN_DOWNGRADE, REMOVE, | 20264 | | RENAME, SETATTR, WRITE | 20265 +-----------------------------------+-------------------------------+ 20266 | NFS4ERR_SAME | NVERIFY | 20267 +-----------------------------------+-------------------------------+ 20268 | NFS4ERR_SEQUENCE_POS | CB_SEQUENCE, SEQUENCE | 20269 +-----------------------------------+-------------------------------+ 20270 | NFS4ERR_SEQ_FALSE_RETRY | CB_SEQUENCE, SEQUENCE | 20271 +-----------------------------------+-------------------------------+ 20272 | NFS4ERR_SEQ_MISORDERED | CB_SEQUENCE, CREATE_SESSION, | 20273 | | SEQUENCE | 20274 
+-----------------------------------+-------------------------------+ 20275 | NFS4ERR_SERVERFAULT | ACCESS, | 20276 | | BIND_CONN_TO_SESSION, | 20277 | | CB_GETATTR, CB_NOTIFY, | 20278 | | CB_NOTIFY_DEVICEID, | 20279 | | CB_NOTIFY_LOCK, | 20280 | | CB_PUSH_DELEG, CB_RECALL, | 20281 | | CB_RECALLABLE_OBJ_AVAIL, | 20282 | | CB_WANTS_CANCELLED, CLOSE, | 20283 | | COMMIT, CREATE, | 20284 | | CREATE_SESSION, DELEGPURGE, | 20285 | | DELEGRETURN, | 20286 | | DESTROY_CLIENTID, | 20287 | | DESTROY_SESSION, | 20288 | | EXCHANGE_ID, FREE_STATEID, | 20289 | | GETATTR, GETDEVICEINFO, | 20290 | | GETDEVICELIST, | 20291 | | GET_DIR_DELEGATION, | 20292 | | LAYOUTCOMMIT, LAYOUTGET, | 20293 | | LAYOUTRETURN, LINK, LOCK, | 20294 | | LOCKU, LOOKUP, LOOKUPP, | 20295 | | NVERIFY, OPEN, OPENATTR, | 20296 | | OPEN_DOWNGRADE, PUTFH, | 20297 | | PUTPUBFH, PUTROOTFH, READ, | 20298 | | READDIR, READLINK, | 20299 | | RECLAIM_COMPLETE, REMOVE, | 20300 | | RENAME, RESTOREFH, SAVEFH, | 20301 | | SECINFO, SECINFO_NO_NAME, | 20302 | | SETATTR, TEST_STATEID, | 20303 | | VERIFY, WANT_DELEGATION, | 20304 | | WRITE | 20305 +-----------------------------------+-------------------------------+ 20306 | NFS4ERR_SHARE_DENIED | OPEN | 20307 +-----------------------------------+-------------------------------+ 20308 | NFS4ERR_STALE | ACCESS, CLOSE, COMMIT, | 20309 | | CREATE, DELEGRETURN, | 20310 | | GETATTR, GETFH, | 20311 | | GET_DIR_DELEGATION, | 20312 | | LAYOUTCOMMIT, LAYOUTGET, | 20313 | | LAYOUTRETURN, LINK, LOCK, | 20314 | | LOCKT, LOCKU, LOOKUP, | 20315 | | LOOKUPP, NVERIFY, OPEN, | 20316 | | OPENATTR, OPEN_DOWNGRADE, | 20317 | | PUTFH, READ, READDIR, | 20318 | | READLINK, RECLAIM_COMPLETE, | 20319 | | REMOVE, RENAME, RESTOREFH, | 20320 | | SAVEFH, SECINFO, | 20321 | | SECINFO_NO_NAME, SETATTR, | 20322 | | VERIFY, WANT_DELEGATION, | 20323 | | WRITE | 20324 +-----------------------------------+-------------------------------+ 20325 | NFS4ERR_STALE_CLIENTID | CREATE_SESSION, | 20326 | | DESTROY_CLIENTID, | 20327 | | DESTROY_SESSION | 20328 +-----------------------------------+-------------------------------+ 20329 | NFS4ERR_SYMLINK | COMMIT, LAYOUTCOMMIT, LINK, | 20330 | | LOCK, LOCKT, LOOKUP, | 20331 | | LOOKUPP, OPEN, READ, WRITE | 20332 +-----------------------------------+-------------------------------+ 20333 | NFS4ERR_TOOSMALL | CREATE_SESSION, | 20334 | | GETDEVICEINFO, LAYOUTGET, | 20335 | | READDIR | 20336 +-----------------------------------+-------------------------------+ 20337 | NFS4ERR_TOO_MANY_OPS | ACCESS, BACKCHANNEL_CTL, | 20338 | | BIND_CONN_TO_SESSION, | 20339 | | CB_GETATTR, CB_LAYOUTRECALL, | 20340 | | CB_NOTIFY, | 20341 | | CB_NOTIFY_DEVICEID, | 20342 | | CB_NOTIFY_LOCK, | 20343 | | CB_PUSH_DELEG, CB_RECALL, | 20344 | | CB_RECALLABLE_OBJ_AVAIL, | 20345 | | CB_RECALL_ANY, | 20346 | | CB_RECALL_SLOT, CB_SEQUENCE, | 20347 | | CB_WANTS_CANCELLED, CLOSE, | 20348 | | COMMIT, CREATE, | 20349 | | CREATE_SESSION, DELEGPURGE, | 20350 | | DELEGRETURN, | 20351 | | DESTROY_CLIENTID, | 20352 | | DESTROY_SESSION, | 20353 | | EXCHANGE_ID, FREE_STATEID, | 20354 | | GETATTR, GETDEVICEINFO, | 20355 | | GETDEVICELIST, | 20356 | | GET_DIR_DELEGATION, | 20357 | | LAYOUTCOMMIT, LAYOUTGET, | 20358 | | LAYOUTRETURN, LINK, LOCK, | 20359 | | LOCKT, LOCKU, LOOKUP, | 20360 | | LOOKUPP, NVERIFY, OPEN, | 20361 | | OPENATTR, OPEN_DOWNGRADE, | 20362 | | PUTFH, PUTPUBFH, PUTROOTFH, | 20363 | | READ, READDIR, READLINK, | 20364 | | RECLAIM_COMPLETE, REMOVE, | 20365 | | RENAME, RESTOREFH, SAVEFH, | 20366 | | SECINFO, SECINFO_NO_NAME, | 
20367 | | SEQUENCE, SETATTR, SET_SSV, | 20368 | | TEST_STATEID, VERIFY, | 20369 | | WANT_DELEGATION, WRITE | 20370 +-----------------------------------+-------------------------------+ 20371 | NFS4ERR_UNKNOWN_LAYOUTTYPE | CB_LAYOUTRECALL, | 20372 | | GETDEVICEINFO, | 20373 | | GETDEVICELIST, LAYOUTCOMMIT, | 20374 | | LAYOUTGET, LAYOUTRETURN, | 20375 | | NVERIFY, SETATTR, VERIFY | 20376 +-----------------------------------+-------------------------------+ 20377 | NFS4ERR_UNSAFE_COMPOUND | CREATE, OPEN, OPENATTR | 20378 +-----------------------------------+-------------------------------+ 20379 | NFS4ERR_WRONGSEC | LINK, LOOKUP, LOOKUPP, OPEN, | 20380 | | PUTFH, PUTPUBFH, PUTROOTFH, | 20381 | | RENAME, RESTOREFH | 20382 +-----------------------------------+-------------------------------+ 20383 | NFS4ERR_WRONG_CRED | CLOSE, CREATE_SESSION, | 20384 | | DELEGPURGE, DELEGRETURN, | 20385 | | DESTROY_CLIENTID, | 20386 | | DESTROY_SESSION, | 20387 | | FREE_STATEID, LAYOUTCOMMIT, | 20388 | | LAYOUTRETURN, LOCK, LOCKT, | 20389 | | LOCKU, OPEN_DOWNGRADE, | 20390 | | RECLAIM_COMPLETE | 20391 +-----------------------------------+-------------------------------+ 20392 | NFS4ERR_WRONG_TYPE | CB_LAYOUTRECALL, | 20393 | | CB_PUSH_DELEG, COMMIT, | 20394 | | GETATTR, LAYOUTGET, | 20395 | | LAYOUTRETURN, LINK, LOCK, | 20396 | | LOCKT, NVERIFY, OPEN, | 20397 | | OPENATTR, READ, READLINK, | 20398 | | RECLAIM_COMPLETE, SETATTR, | 20399 | | VERIFY, WANT_DELEGATION, | 20400 | | WRITE | 20401 +-----------------------------------+-------------------------------+ 20402 | NFS4ERR_XDEV | LINK, RENAME | 20403 +-----------------------------------+-------------------------------+ 20405 Table 14: Errors and the Operations That Use Them 20407 16. NFSv4.1 Procedures 20409 Both procedures, NULL and COMPOUND, MUST be implemented. 20411 16.1. Procedure 0: NULL - No Operation 20413 16.1.1. ARGUMENTS 20415 void; 20417 16.1.2. RESULTS 20419 void; 20421 16.1.3. DESCRIPTION 20423 This is the standard NULL procedure with the standard void argument 20424 and void response. This procedure has no functionality associated 20425 with it. Because of this, it is sometimes used to measure the 20426 overhead of processing a service request. Therefore, the server 20427 SHOULD ensure that no unnecessary work is done in servicing this 20428 procedure. 20430 16.1.4. ERRORS 20432 None. 20434 16.2. Procedure 1: COMPOUND - Compound Operations 20436 16.2.1. 
ARGUMENTS 20437 enum nfs_opnum4 { 20438 OP_ACCESS = 3, 20439 OP_CLOSE = 4, 20440 OP_COMMIT = 5, 20441 OP_CREATE = 6, 20442 OP_DELEGPURGE = 7, 20443 OP_DELEGRETURN = 8, 20444 OP_GETATTR = 9, 20445 OP_GETFH = 10, 20446 OP_LINK = 11, 20447 OP_LOCK = 12, 20448 OP_LOCKT = 13, 20449 OP_LOCKU = 14, 20450 OP_LOOKUP = 15, 20451 OP_LOOKUPP = 16, 20452 OP_NVERIFY = 17, 20453 OP_OPEN = 18, 20454 OP_OPENATTR = 19, 20455 OP_OPEN_CONFIRM = 20, /* Mandatory not-to-implement */ 20456 OP_OPEN_DOWNGRADE = 21, 20457 OP_PUTFH = 22, 20458 OP_PUTPUBFH = 23, 20459 OP_PUTROOTFH = 24, 20460 OP_READ = 25, 20461 OP_READDIR = 26, 20462 OP_READLINK = 27, 20463 OP_REMOVE = 28, 20464 OP_RENAME = 29, 20465 OP_RENEW = 30, /* Mandatory not-to-implement */ 20466 OP_RESTOREFH = 31, 20467 OP_SAVEFH = 32, 20468 OP_SECINFO = 33, 20469 OP_SETATTR = 34, 20470 OP_SETCLIENTID = 35, /* Mandatory not-to-implement */ 20471 OP_SETCLIENTID_CONFIRM = 36, /* Mandatory not-to-implement */ 20472 OP_VERIFY = 37, 20473 OP_WRITE = 38, 20474 OP_RELEASE_LOCKOWNER = 39, /* Mandatory not-to-implement */ 20476 /* new operations for NFSv4.1 */ 20478 OP_BACKCHANNEL_CTL = 40, 20479 OP_BIND_CONN_TO_SESSION = 41, 20480 OP_EXCHANGE_ID = 42, 20481 OP_CREATE_SESSION = 43, 20482 OP_DESTROY_SESSION = 44, 20483 OP_FREE_STATEID = 45, 20484 OP_GET_DIR_DELEGATION = 46, 20485 OP_GETDEVICEINFO = 47, 20486 OP_GETDEVICELIST = 48, 20487 OP_LAYOUTCOMMIT = 49, 20488 OP_LAYOUTGET = 50, 20489 OP_LAYOUTRETURN = 51, 20490 OP_SECINFO_NO_NAME = 52, 20491 OP_SEQUENCE = 53, 20492 OP_SET_SSV = 54, 20493 OP_TEST_STATEID = 55, 20494 OP_WANT_DELEGATION = 56, 20495 OP_DESTROY_CLIENTID = 57, 20496 OP_RECLAIM_COMPLETE = 58, 20497 OP_ILLEGAL = 10044 20498 }; 20500 union nfs_argop4 switch (nfs_opnum4 argop) { 20501 case OP_ACCESS: ACCESS4args opaccess; 20502 case OP_CLOSE: CLOSE4args opclose; 20503 case OP_COMMIT: COMMIT4args opcommit; 20504 case OP_CREATE: CREATE4args opcreate; 20505 case OP_DELEGPURGE: DELEGPURGE4args opdelegpurge; 20506 case OP_DELEGRETURN: DELEGRETURN4args opdelegreturn; 20507 case OP_GETATTR: GETATTR4args opgetattr; 20508 case OP_GETFH: void; 20509 case OP_LINK: LINK4args oplink; 20510 case OP_LOCK: LOCK4args oplock; 20511 case OP_LOCKT: LOCKT4args oplockt; 20512 case OP_LOCKU: LOCKU4args oplocku; 20513 case OP_LOOKUP: LOOKUP4args oplookup; 20514 case OP_LOOKUPP: void; 20515 case OP_NVERIFY: NVERIFY4args opnverify; 20516 case OP_OPEN: OPEN4args opopen; 20517 case OP_OPENATTR: OPENATTR4args opopenattr; 20519 /* Not for NFSv4.1 */ 20520 case OP_OPEN_CONFIRM: OPEN_CONFIRM4args opopen_confirm; 20522 case OP_OPEN_DOWNGRADE: 20523 OPEN_DOWNGRADE4args opopen_downgrade; 20525 case OP_PUTFH: PUTFH4args opputfh; 20526 case OP_PUTPUBFH: void; 20527 case OP_PUTROOTFH: void; 20528 case OP_READ: READ4args opread; 20529 case OP_READDIR: READDIR4args opreaddir; 20530 case OP_READLINK: void; 20531 case OP_REMOVE: REMOVE4args opremove; 20532 case OP_RENAME: RENAME4args oprename; 20533 /* Not for NFSv4.1 */ 20534 case OP_RENEW: RENEW4args oprenew; 20536 case OP_RESTOREFH: void; 20537 case OP_SAVEFH: void; 20538 case OP_SECINFO: SECINFO4args opsecinfo; 20539 case OP_SETATTR: SETATTR4args opsetattr; 20541 /* Not for NFSv4.1 */ 20542 case OP_SETCLIENTID: SETCLIENTID4args opsetclientid; 20544 /* Not for NFSv4.1 */ 20545 case OP_SETCLIENTID_CONFIRM: SETCLIENTID_CONFIRM4args 20546 opsetclientid_confirm; 20547 case OP_VERIFY: VERIFY4args opverify; 20548 case OP_WRITE: WRITE4args opwrite; 20550 /* Not for NFSv4.1 */ 20551 case OP_RELEASE_LOCKOWNER: 20552 RELEASE_LOCKOWNER4args 20553 
oprelease_lockowner; 20555 /* Operations new to NFSv4.1 */ 20556 case OP_BACKCHANNEL_CTL: 20557 BACKCHANNEL_CTL4args opbackchannel_ctl; 20559 case OP_BIND_CONN_TO_SESSION: 20560 BIND_CONN_TO_SESSION4args 20561 opbind_conn_to_session; 20563 case OP_EXCHANGE_ID: EXCHANGE_ID4args opexchange_id; 20565 case OP_CREATE_SESSION: 20566 CREATE_SESSION4args opcreate_session; 20568 case OP_DESTROY_SESSION: 20569 DESTROY_SESSION4args opdestroy_session; 20571 case OP_FREE_STATEID: FREE_STATEID4args opfree_stateid; 20573 case OP_GET_DIR_DELEGATION: 20574 GET_DIR_DELEGATION4args 20575 opget_dir_delegation; 20577 case OP_GETDEVICEINFO: GETDEVICEINFO4args opgetdeviceinfo; 20578 case OP_GETDEVICELIST: GETDEVICELIST4args opgetdevicelist; 20579 case OP_LAYOUTCOMMIT: LAYOUTCOMMIT4args oplayoutcommit; 20580 case OP_LAYOUTGET: LAYOUTGET4args oplayoutget; 20581 case OP_LAYOUTRETURN: LAYOUTRETURN4args oplayoutreturn; 20583 case OP_SECINFO_NO_NAME: 20584 SECINFO_NO_NAME4args opsecinfo_no_name; 20586 case OP_SEQUENCE: SEQUENCE4args opsequence; 20587 case OP_SET_SSV: SET_SSV4args opset_ssv; 20588 case OP_TEST_STATEID: TEST_STATEID4args optest_stateid; 20590 case OP_WANT_DELEGATION: 20591 WANT_DELEGATION4args opwant_delegation; 20593 case OP_DESTROY_CLIENTID: 20594 DESTROY_CLIENTID4args 20595 opdestroy_clientid; 20597 case OP_RECLAIM_COMPLETE: 20598 RECLAIM_COMPLETE4args 20599 opreclaim_complete; 20601 /* Operations not new to NFSv4.1 */ 20602 case OP_ILLEGAL: void; 20603 }; 20605 struct COMPOUND4args { 20606 utf8str_cs tag; 20607 uint32_t minorversion; 20608 nfs_argop4 argarray<>; 20609 }; 20611 16.2.2. RESULTS 20613 union nfs_resop4 switch (nfs_opnum4 resop) { 20614 case OP_ACCESS: ACCESS4res opaccess; 20615 case OP_CLOSE: CLOSE4res opclose; 20616 case OP_COMMIT: COMMIT4res opcommit; 20617 case OP_CREATE: CREATE4res opcreate; 20618 case OP_DELEGPURGE: DELEGPURGE4res opdelegpurge; 20619 case OP_DELEGRETURN: DELEGRETURN4res opdelegreturn; 20620 case OP_GETATTR: GETATTR4res opgetattr; 20621 case OP_GETFH: GETFH4res opgetfh; 20622 case OP_LINK: LINK4res oplink; 20623 case OP_LOCK: LOCK4res oplock; 20624 case OP_LOCKT: LOCKT4res oplockt; 20625 case OP_LOCKU: LOCKU4res oplocku; 20626 case OP_LOOKUP: LOOKUP4res oplookup; 20627 case OP_LOOKUPP: LOOKUPP4res oplookupp; 20628 case OP_NVERIFY: NVERIFY4res opnverify; 20629 case OP_OPEN: OPEN4res opopen; 20630 case OP_OPENATTR: OPENATTR4res opopenattr; 20631 /* Not for NFSv4.1 */ 20632 case OP_OPEN_CONFIRM: OPEN_CONFIRM4res opopen_confirm; 20634 case OP_OPEN_DOWNGRADE: 20635 OPEN_DOWNGRADE4res 20636 opopen_downgrade; 20638 case OP_PUTFH: PUTFH4res opputfh; 20639 case OP_PUTPUBFH: PUTPUBFH4res opputpubfh; 20640 case OP_PUTROOTFH: PUTROOTFH4res opputrootfh; 20641 case OP_READ: READ4res opread; 20642 case OP_READDIR: READDIR4res opreaddir; 20643 case OP_READLINK: READLINK4res opreadlink; 20644 case OP_REMOVE: REMOVE4res opremove; 20645 case OP_RENAME: RENAME4res oprename; 20646 /* Not for NFSv4.1 */ 20647 case OP_RENEW: RENEW4res oprenew; 20648 case OP_RESTOREFH: RESTOREFH4res oprestorefh; 20649 case OP_SAVEFH: SAVEFH4res opsavefh; 20650 case OP_SECINFO: SECINFO4res opsecinfo; 20651 case OP_SETATTR: SETATTR4res opsetattr; 20652 /* Not for NFSv4.1 */ 20653 case OP_SETCLIENTID: SETCLIENTID4res opsetclientid; 20655 /* Not for NFSv4.1 */ 20656 case OP_SETCLIENTID_CONFIRM: 20657 SETCLIENTID_CONFIRM4res 20658 opsetclientid_confirm; 20659 case OP_VERIFY: VERIFY4res opverify; 20660 case OP_WRITE: WRITE4res opwrite; 20662 /* Not for NFSv4.1 */ 20663 case OP_RELEASE_LOCKOWNER: 20664 
RELEASE_LOCKOWNER4res 20665 oprelease_lockowner; 20667 /* Operations new to NFSv4.1 */ 20668 case OP_BACKCHANNEL_CTL: 20669 BACKCHANNEL_CTL4res 20670 opbackchannel_ctl; 20672 case OP_BIND_CONN_TO_SESSION: 20673 BIND_CONN_TO_SESSION4res 20674 opbind_conn_to_session; 20676 case OP_EXCHANGE_ID: EXCHANGE_ID4res opexchange_id; 20677 case OP_CREATE_SESSION: 20678 CREATE_SESSION4res 20679 opcreate_session; 20681 case OP_DESTROY_SESSION: 20682 DESTROY_SESSION4res 20683 opdestroy_session; 20685 case OP_FREE_STATEID: FREE_STATEID4res 20686 opfree_stateid; 20688 case OP_GET_DIR_DELEGATION: 20689 GET_DIR_DELEGATION4res 20690 opget_dir_delegation; 20692 case OP_GETDEVICEINFO: GETDEVICEINFO4res 20693 opgetdeviceinfo; 20695 case OP_GETDEVICELIST: GETDEVICELIST4res 20696 opgetdevicelist; 20698 case OP_LAYOUTCOMMIT: LAYOUTCOMMIT4res oplayoutcommit; 20699 case OP_LAYOUTGET: LAYOUTGET4res oplayoutget; 20700 case OP_LAYOUTRETURN: LAYOUTRETURN4res oplayoutreturn; 20702 case OP_SECINFO_NO_NAME: 20703 SECINFO_NO_NAME4res 20704 opsecinfo_no_name; 20706 case OP_SEQUENCE: SEQUENCE4res opsequence; 20707 case OP_SET_SSV: SET_SSV4res opset_ssv; 20708 case OP_TEST_STATEID: TEST_STATEID4res optest_stateid; 20710 case OP_WANT_DELEGATION: 20711 WANT_DELEGATION4res 20712 opwant_delegation; 20714 case OP_DESTROY_CLIENTID: 20715 DESTROY_CLIENTID4res 20716 opdestroy_clientid; 20718 case OP_RECLAIM_COMPLETE: 20719 RECLAIM_COMPLETE4res 20720 opreclaim_complete; 20722 /* Operations not new to NFSv4.1 */ 20723 case OP_ILLEGAL: ILLEGAL4res opillegal; 20724 }; 20725 struct COMPOUND4res { 20726 nfsstat4 status; 20727 utf8str_cs tag; 20728 nfs_resop4 resarray<>; 20729 }; 20731 16.2.3. DESCRIPTION 20733 The COMPOUND procedure is used to combine one or more NFSv4 20734 operations into a single RPC request. The server interprets each of 20735 the operations in turn. If an operation is executed by the server 20736 and the status of that operation is NFS4_OK, then the next operation 20737 in the COMPOUND procedure is executed. The server continues this 20738 process until there are no more operations to be executed or until 20739 one of the operations has a status value other than NFS4_OK. 20741 In the processing of the COMPOUND procedure, the server may find that 20742 it does not have the available resources to execute any or all of the 20743 operations within the COMPOUND sequence. See Section 2.10.6.4 for a 20744 more detailed discussion. 20746 The server will generally choose between two methods of decoding the 20747 client's request. The first would be the traditional one-pass XDR 20748 decode. If there is an XDR decoding error in this case, the RPC XDR 20749 decode error would be returned. The second method would be to make 20750 an initial pass to decode the basic COMPOUND request and then to XDR 20751 decode the individual operations; the most interesting is the decode 20752 of attributes. In this case, the server may encounter an XDR decode 20753 error during the second pass. If it does, the server would return 20754 the error NFS4ERR_BADXDR to signify the decode error. 20756 The COMPOUND arguments contain a "minorversion" field. For NFSv4.1, 20757 the value for this field is 1. If the server receives a COMPOUND 20758 procedure with a minorversion field value that it does not support, 20759 the server MUST return an error of NFS4ERR_MINOR_VERS_MISMATCH and a 20760 zero-length resultdata array. 20762 Contained within the COMPOUND results is a "status" field. 
If the 20763 results array length is non-zero, this status must be equivalent to 20764 the status of the last operation that was executed within the 20765 COMPOUND procedure. Therefore, if an operation incurred an error, 20766 then the "status" value will be the same error value as is being 20767 returned for the operation that failed. 20769 Note that operations zero and one are not defined for the COMPOUND 20770 procedure. Operation 2 is not defined and is reserved for future 20771 definition and use with minor versioning. If the server receives an 20772 operation array that contains operation 2 and the minorversion field 20773 has a value of zero, an error of NFS4ERR_OP_ILLEGAL, as described in 20774 the next paragraph, is returned to the client. If an operation array 20775 contains an operation 2 and the minorversion field is non-zero and 20776 the server does not support the minor version, the server returns an 20777 error of NFS4ERR_MINOR_VERS_MISMATCH. Therefore, the 20778 NFS4ERR_MINOR_VERS_MISMATCH error takes precedence over all other 20779 errors. 20781 It is possible that the server receives a request that contains an 20782 operation that is less than the first legal operation (OP_ACCESS) or 20783 greater than the last legal operation (OP_RELEASE_LOCKOWNER). In 20784 this case, the server's response will encode the opcode OP_ILLEGAL 20785 rather than the illegal opcode of the request. The status field in 20786 the ILLEGAL return results will be set to NFS4ERR_OP_ILLEGAL. The 20787 COMPOUND procedure's return results will also be NFS4ERR_OP_ILLEGAL. 20789 The definition of the "tag" in the request is left to the 20790 implementor. It may be used to summarize the content of the Compound 20791 request for the benefit of packet-sniffers and engineers debugging 20792 implementations. However, the value of "tag" in the response SHOULD 20793 be the same value as provided in the request. This applies to the 20794 tag field of the CB_COMPOUND procedure as well. 20796 16.2.3.1. Current Filehandle and Stateid 20798 The COMPOUND procedure offers a simple environment for the execution 20799 of the operations specified by the client. This environment consists of 20800 the current and saved filehandles together with the current and saved stateids; the first two relate to the filehandle, while the second two relate to the current stateid. 20802 16.2.3.1.1. Current Filehandle 20804 The current and saved filehandles are used throughout the protocol. 20805 Most operations implicitly use the current filehandle as an argument, 20806 and many set the current filehandle as part of the results. The 20807 combination of client-specified sequences of operations and current 20808 and saved filehandle arguments and results allows for greater 20809 protocol flexibility. The simplest example of current 20810 filehandle usage is a sequence like the following: 20812 PUTFH fh1 {fh1} 20813 LOOKUP "compA" {fh2} 20814 GETATTR {fh2} 20815 LOOKUP "compB" {fh3} 20816 GETATTR {fh3} 20817 LOOKUP "compC" {fh4} 20818 GETATTR {fh4} 20819 GETFH 20820 Figure 2 20822 In this example, the PUTFH (Section 18.19) operation explicitly sets 20823 the current filehandle value while the result of each LOOKUP 20824 operation sets the current filehandle value to the resultant file 20825 system object. Also, the client is able to insert GETATTR operations 20826 using the current filehandle as an argument. 20828 The PUTROOTFH (Section 18.21) and PUTPUBFH (Section 18.20) operations 20829 also set the current filehandle.
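As a purely illustrative aside, the following C fragment sketches how a client might marshal a sequence like the one in Figure 2 (abbreviated here to a single LOOKUP) using the COMPOUND4args and nfs_argop4 structures of Section 16.2.1. It assumes rpcgen-style bindings generated from that XDR; the header name "nfs4_prot.h" and generated member names such as "nfs_argop4_u" and "argarray_val" are toolchain conventions rather than part of this specification.

   #include <string.h>
   #include "nfs4_prot.h"   /* hypothetical header generated from the XDR */

   /* Build the arguments for: PUTFH fh1; LOOKUP "compA"; GETATTR; GETFH */
   static void
   build_lookup_compound(COMPOUND4args *args, nfs_argop4 *ops,
                         nfs_fh4 fh1, component4 compA, bitmap4 attrs)
   {
       int n = 0;

       ops[n].argop = OP_PUTFH;          /* explicitly sets the current FH */
       ops[n++].nfs_argop4_u.opputfh.object = fh1;

       ops[n].argop = OP_LOOKUP;         /* replaces the current FH */
       ops[n++].nfs_argop4_u.oplookup.objname = compA;

       ops[n].argop = OP_GETATTR;        /* implicitly uses the current FH */
       ops[n++].nfs_argop4_u.opgetattr.attr_request = attrs;

       ops[n++].argop = OP_GETFH;        /* returns the final current FH */

       memset(args, 0, sizeof(*args));   /* zero-length "tag" */
       args->minorversion = 1;           /* NFSv4.1 */
       args->argarray.argarray_len = n;
       args->argarray.argarray_val = ops;
   }

The full sequence of Figure 2 follows the same pattern, with additional LOOKUP/GETATTR pairs appended before the final GETFH.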
The example in Figure 2 would achieve the same effect if 20830 "PUTFH fh1" were replaced with PUTROOTFH or PUTPUBFH, neither of which takes a filehandle argument 20831 (on the assumption that "compA" is 20832 directly below the root of the namespace). 20834 Along with the current filehandle, there is a saved filehandle. 20835 While the current filehandle is set as the result of operations like 20836 LOOKUP, the saved filehandle must be set directly with the use of the 20837 SAVEFH operation. The SAVEFH operation copies the current filehandle 20838 value to the saved value. The saved filehandle value is used in 20839 combination with the current filehandle value for the LINK and RENAME 20840 operations. The RESTOREFH operation will copy the saved filehandle 20841 value to the current filehandle value; as a result, the saved 20842 filehandle value may be used as a sort of "scratch" area for the 20843 client's series of operations. 20845 16.2.3.1.2. Current Stateid 20847 With NFSv4.1, additions of a current stateid and a saved stateid have 20848 been made to the COMPOUND processing environment; this allows for the 20849 passing of stateids between operations. There are no changes to the 20850 syntax of the protocol, only changes to the semantics of a few 20851 operations. 20853 A "current stateid" is the stateid that is associated with the 20854 current filehandle. The current stateid may only be changed by an 20855 operation that modifies the current filehandle or returns a stateid. 20856 If an operation returns a stateid, it MUST set the current stateid to 20857 the returned value. If an operation sets the current filehandle but 20858 does not return a stateid, the current stateid MUST be set to the 20859 all-zeros special stateid, i.e., (seqid, other) = (0, 0). If an 20860 operation uses a stateid as an argument but does not return a 20861 stateid, the current stateid MUST NOT be changed. For example, 20862 PUTFH, PUTROOTFH, and PUTPUBFH will change the current server state 20863 from {ocfh, (osid)} to {cfh, (0, 0)}, while LOCK will change the 20864 current state from {cfh, (osid)} to {cfh, (nsid)}. Operations like 20865 LOOKUP that transform a current filehandle and component name into a 20866 new current filehandle will also change the current state to {0, 0}. 20867 The SAVEFH and RESTOREFH operations will save and restore both the 20868 current filehandle and the current stateid as a set. 20870 The following example is the common case of a simple READ operation 20871 with a normal stateid showing that the PUTFH initializes the current 20872 stateid to (0, 0). The subsequent READ with stateid (sid1) leaves 20873 the current stateid unchanged. 20875 PUTFH fh1 - -> {fh1, (0, 0)} 20876 READ (sid1), 0, 1024 {fh1, (0, 0)} -> {fh1, (0, 0)} 20878 Figure 3 20880 This next example performs an OPEN with the root filehandle and, as a 20881 result, generates stateid (sid1). The next operation specifies the 20882 READ with the argument stateid set such that (seqid, other) are equal 20883 to (1, 0), but the current stateid set by the previous operation is 20884 actually used when the operation is evaluated. This allows correct 20885 interaction with any existing, potentially conflicting, locks.
20887 PUTROOTFH - -> {fh1, (0, 0)} 20888 OPEN "compA" {fh1, (0, 0)} -> {fh2, (sid1)} 20889 READ (1, 0), 0, 1024 {fh2, (sid1)} -> {fh2, (sid1)} 20890 CLOSE (1, 0) {fh2, (sid1)} -> {fh2, (sid2)} 20892 Figure 4 20894 This next example is similar to the second in how it passes the 20895 stateid sid2 generated by the LOCK operation to the next READ 20896 operation. This allows the client to explicitly surround a single I/ 20897 O operation with a lock and its appropriate stateid to guarantee 20898 correctness with other client locks. The example also shows how 20899 SAVEFH and RESTOREFH can save and later reuse a filehandle and 20900 stateid, passing them as the current filehandle and stateid to a READ 20901 operation. 20903 PUTFH fh1 - -> {fh1, (0, 0)} 20904 LOCK 0, 1024, (sid1) {fh1, (sid1)} -> {fh1, (sid2)} 20905 READ (1, 0), 0, 1024 {fh1, (sid2)} -> {fh1, (sid2)} 20906 LOCKU 0, 1024, (1, 0) {fh1, (sid2)} -> {fh1, (sid3)} 20907 SAVEFH {fh1, (sid3)} -> {fh1, (sid3)} 20909 PUTFH fh2 {fh1, (sid3)} -> {fh2, (0, 0)} 20910 WRITE (1, 0), 0, 1024 {fh2, (0, 0)} -> {fh2, (0, 0)} 20912 RESTOREFH {fh2, (0, 0)} -> {fh1, (sid3)} 20913 READ (1, 0), 1024, 1024 {fh1, (sid3)} -> {fh1, (sid3)} 20915 Figure 5 20917 The final example shows a disallowed use of the current stateid. The 20918 client is attempting to implicitly pass an anonymous special stateid, 20919 (0,0), to the READ operation. The server MUST return 20920 NFS4ERR_BAD_STATEID in the reply to the READ operation. 20922 PUTFH fh1 - -> {fh1, (0, 0)} 20923 READ (1, 0), 0, 1024 {fh1, (0, 0)} -> NFS4ERR_BAD_STATEID 20925 Figure 6 20927 16.2.4. ERRORS 20929 COMPOUND will of course return every error that each operation on the 20930 fore channel can return (see Table 12). However, if COMPOUND returns 20931 zero operations, obviously the error returned by COMPOUND has nothing 20932 to do with an error returned by an operation. The list of errors 20933 COMPOUND will return if it processes zero operations include: 20935 +==============================+==================================+ 20936 | Error | Notes | 20937 +==============================+==================================+ 20938 | NFS4ERR_BADCHAR | The tag argument has a character | 20939 | | the replier does not support. | 20940 +------------------------------+----------------------------------+ 20941 | NFS4ERR_BADXDR | | 20942 +------------------------------+----------------------------------+ 20943 | NFS4ERR_DELAY | | 20944 +------------------------------+----------------------------------+ 20945 | NFS4ERR_INVAL | The tag argument is not in UTF-8 | 20946 | | encoding. | 20947 +------------------------------+----------------------------------+ 20948 | NFS4ERR_MINOR_VERS_MISMATCH | | 20949 +------------------------------+----------------------------------+ 20950 | NFS4ERR_SERVERFAULT | | 20951 +------------------------------+----------------------------------+ 20952 | NFS4ERR_TOO_MANY_OPS | | 20953 +------------------------------+----------------------------------+ 20954 | NFS4ERR_REP_TOO_BIG | | 20955 +------------------------------+----------------------------------+ 20956 | NFS4ERR_REP_TOO_BIG_TO_CACHE | | 20957 +------------------------------+----------------------------------+ 20958 | NFS4ERR_REQ_TOO_BIG | | 20959 +------------------------------+----------------------------------+ 20961 Table 15: COMPOUND Error Returns 20963 17. 
Operations: REQUIRED, RECOMMENDED, or OPTIONAL 20965 The following tables summarize the operations of the NFSv4.1 protocol 20966 and the corresponding designation of REQUIRED, RECOMMENDED, and 20967 OPTIONAL to implement or MUST NOT implement. The designation of MUST 20968 NOT implement is reserved for those operations that were defined in 20969 NFSv4.0 and MUST NOT be implemented in NFSv4.1. 20971 For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation 20972 for operations sent by the client is for the server implementation. 20973 The client is generally required to implement the operations needed 20974 for the operating environment for which it serves. For example, a 20975 read-only NFSv4.1 client would have no need to implement the WRITE 20976 operation and is not required to do so. 20978 The REQUIRED or OPTIONAL designation for callback operations sent by 20979 the server is for both the client and server. Generally, the client 20980 has the option of creating the backchannel and sending the operations 20981 on the fore channel that will be a catalyst for the server sending 20982 callback operations. A partial exception is CB_RECALL_SLOT; the only 20983 way the client can avoid supporting this operation is by not creating 20984 a backchannel. 20986 Since this is a summary of the operations and their designation, 20987 there are subtleties that are not presented here. Therefore, if 20988 there is a question of the requirements of implementation, the 20989 operation descriptions themselves must be consulted along with other 20990 relevant explanatory text within this specification. 20992 The abbreviations used in the second and third columns of the table 20993 are defined as follows. 20995 REQ REQUIRED to implement 20997 REC RECOMMENDED to implement 20999 OPT OPTIONAL to implement 21001 MNI MUST NOT implement 21003 For the NFSv4.1 features that are OPTIONAL, the operations that 21004 support those features are OPTIONAL, and the server would return 21005 NFS4ERR_NOTSUPP in response to the client's use of those operations. 21006 If an OPTIONAL feature is supported, it is possible that a set of 21007 operations related to the feature become REQUIRED to implement. The 21008 third column of the table designates the feature(s) and whether the 21009 operation is REQUIRED or OPTIONAL in the presence of support for the 21010 feature.
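
   The way these designations combine can be illustrated with a short,
   non-normative sketch.  The Python fragment below is purely editorial:
   it encodes a few rows of Table 16, using the feature abbreviations
   defined below, and shows how an OPTIONAL operation becomes REQUIRED
   once the corresponding OPTIONAL feature is supported.  The helper
   name and data layout are hypothetical and are not part of the
   protocol.

      # Non-normative sketch: a few rows of Table 16, expressed as
      # (base designation, {feature: designation when feature supported}).
      OPERATIONS = {
          "ACCESS":        ("REQ", {}),
          "DELEGPURGE":    ("OPT", {"FDELG": "REQ"}),
          "GETDEVICEINFO": ("OPT", {"pNFS": "REQ"}),
          "GETDEVICELIST": ("OPT", {"pNFS": "OPT"}),
          "OPEN_CONFIRM":  ("MNI", {}),
      }

      def designation(op, supported_features):
          """Designation that applies to a server supporting the given
          set of OPTIONAL features, e.g. {"pNFS"} or set()."""
          base, per_feature = OPERATIONS[op]
          if any(per_feature.get(f) == "REQ" for f in supported_features):
              return "REQ"
          return base

      # A pNFS server must implement GETDEVICEINFO; a server without
      # pNFS support may answer it with NFS4ERR_NOTSUPP.
      assert designation("GETDEVICEINFO", {"pNFS"}) == "REQ"
      assert designation("GETDEVICEINFO", set()) == "OPT"
      assert designation("OPEN_CONFIRM", set()) == "MNI"
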
21012 The OPTIONAL features identified and their abbreviations are as 21013 follows: 21015 pNFS Parallel NFS 21017 FDELG File Delegations 21019 DDELG Directory Delegations 21020 +======================+=============+============+===============+ 21021 | Operation | REQ, REC, | Feature | Definition | 21022 | | OPT, or MNI | (REQ, REC, | | 21023 | | | or OPT) | | 21024 +======================+=============+============+===============+ 21025 | ACCESS | REQ | | Section 18.1 | 21026 +----------------------+-------------+------------+---------------+ 21027 | BACKCHANNEL_CTL | REQ | | Section 18.33 | 21028 +----------------------+-------------+------------+---------------+ 21029 | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | 21030 +----------------------+-------------+------------+---------------+ 21031 | CLOSE | REQ | | Section 18.2 | 21032 +----------------------+-------------+------------+---------------+ 21033 | COMMIT | REQ | | Section 18.3 | 21034 +----------------------+-------------+------------+---------------+ 21035 | CREATE | REQ | | Section 18.4 | 21036 +----------------------+-------------+------------+---------------+ 21037 | CREATE_SESSION | REQ | | Section 18.36 | 21038 +----------------------+-------------+------------+---------------+ 21039 | DELEGPURGE | OPT | FDELG | Section 18.5 | 21040 | | | (REQ) | | 21041 +----------------------+-------------+------------+---------------+ 21042 | DELEGRETURN | OPT | FDELG, | Section 18.6 | 21043 | | | DDELG, | | 21044 | | | pNFS (REQ) | | 21045 +----------------------+-------------+------------+---------------+ 21046 | DESTROY_CLIENTID | REQ | | Section 18.50 | 21047 +----------------------+-------------+------------+---------------+ 21048 | DESTROY_SESSION | REQ | | Section 18.37 | 21049 +----------------------+-------------+------------+---------------+ 21050 | EXCHANGE_ID | REQ | | Section 18.35 | 21051 +----------------------+-------------+------------+---------------+ 21052 | FREE_STATEID | REQ | | Section 18.38 | 21053 +----------------------+-------------+------------+---------------+ 21054 | GETATTR | REQ | | Section 18.7 | 21055 +----------------------+-------------+------------+---------------+ 21056 | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | 21057 +----------------------+-------------+------------+---------------+ 21058 | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | 21059 +----------------------+-------------+------------+---------------+ 21060 | GETFH | REQ | | Section 18.8 | 21061 +----------------------+-------------+------------+---------------+ 21062 | GET_DIR_DELEGATION | OPT | DDELG | Section 18.39 | 21063 | | | (REQ) | | 21064 +----------------------+-------------+------------+---------------+ 21065 | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | 21066 +----------------------+-------------+------------+---------------+ 21067 | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | 21068 +----------------------+-------------+------------+---------------+ 21069 | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | 21070 +----------------------+-------------+------------+---------------+ 21071 | LINK | OPT | | Section 18.9 | 21072 +----------------------+-------------+------------+---------------+ 21073 | LOCK | REQ | | Section 18.10 | 21074 +----------------------+-------------+------------+---------------+ 21075 | LOCKT | REQ | | Section 18.11 | 21076 +----------------------+-------------+------------+---------------+ 21077 | LOCKU | REQ | | Section 18.12 | 21078 
+----------------------+-------------+------------+---------------+ 21079 | LOOKUP | REQ | | Section 18.13 | 21080 +----------------------+-------------+------------+---------------+ 21081 | LOOKUPP | REQ | | Section 18.14 | 21082 +----------------------+-------------+------------+---------------+ 21083 | NVERIFY | REQ | | Section 18.15 | 21084 +----------------------+-------------+------------+---------------+ 21085 | OPEN | REQ | | Section 18.16 | 21086 +----------------------+-------------+------------+---------------+ 21087 | OPENATTR | OPT | | Section 18.17 | 21088 +----------------------+-------------+------------+---------------+ 21089 | OPEN_CONFIRM | MNI | | N/A | 21090 +----------------------+-------------+------------+---------------+ 21091 | OPEN_DOWNGRADE | REQ | | Section 18.18 | 21092 +----------------------+-------------+------------+---------------+ 21093 | PUTFH | REQ | | Section 18.19 | 21094 +----------------------+-------------+------------+---------------+ 21095 | PUTPUBFH | REQ | | Section 18.20 | 21096 +----------------------+-------------+------------+---------------+ 21097 | PUTROOTFH | REQ | | Section 18.21 | 21098 +----------------------+-------------+------------+---------------+ 21099 | READ | REQ | | Section 18.22 | 21100 +----------------------+-------------+------------+---------------+ 21101 | READDIR | REQ | | Section 18.23 | 21102 +----------------------+-------------+------------+---------------+ 21103 | READLINK | OPT | | Section 18.24 | 21104 +----------------------+-------------+------------+---------------+ 21105 | RECLAIM_COMPLETE | REQ | | Section 18.51 | 21106 +----------------------+-------------+------------+---------------+ 21107 | RELEASE_LOCKOWNER | MNI | | N/A | 21108 +----------------------+-------------+------------+---------------+ 21109 | REMOVE | REQ | | Section 18.25 | 21110 +----------------------+-------------+------------+---------------+ 21111 | RENAME | REQ | | Section 18.26 | 21112 +----------------------+-------------+------------+---------------+ 21113 | RENEW | MNI | | N/A | 21114 +----------------------+-------------+------------+---------------+ 21115 | RESTOREFH | REQ | | Section 18.27 | 21116 +----------------------+-------------+------------+---------------+ 21117 | SAVEFH | REQ | | Section 18.28 | 21118 +----------------------+-------------+------------+---------------+ 21119 | SECINFO | REQ | | Section 18.29 | 21120 +----------------------+-------------+------------+---------------+ 21121 | SECINFO_NO_NAME | REC | pNFS file | Section | 21122 | | | layout | 18.45, | 21123 | | | (REQ) | Section 13.12 | 21124 +----------------------+-------------+------------+---------------+ 21125 | SEQUENCE | REQ | | Section 18.46 | 21126 +----------------------+-------------+------------+---------------+ 21127 | SETATTR | REQ | | Section 18.30 | 21128 +----------------------+-------------+------------+---------------+ 21129 | SETCLIENTID | MNI | | N/A | 21130 +----------------------+-------------+------------+---------------+ 21131 | SETCLIENTID_CONFIRM | MNI | | N/A | 21132 +----------------------+-------------+------------+---------------+ 21133 | SET_SSV | REQ | | Section 18.47 | 21134 +----------------------+-------------+------------+---------------+ 21135 | TEST_STATEID | REQ | | Section 18.48 | 21136 +----------------------+-------------+------------+---------------+ 21137 | VERIFY | REQ | | Section 18.31 | 21138 +----------------------+-------------+------------+---------------+ 21139 | WANT_DELEGATION | OPT | FDELG | Section 
18.49 | 21140 | | | (OPT) | | 21141 +----------------------+-------------+------------+---------------+ 21142 | WRITE | REQ | | Section 18.32 | 21143 +----------------------+-------------+------------+---------------+ 21145 Table 16: Operations 21147 +=========================+=============+============+============+ 21148 | Operation | REQ, REC, | Feature | Definition | 21149 | | OPT, or MNI | (REQ, REC, | | 21150 | | | or OPT) | | 21151 +=========================+=============+============+============+ 21152 | CB_GETATTR | OPT | FDELG | Section | 21153 | | | (REQ) | 20.1 | 21154 +-------------------------+-------------+------------+------------+ 21155 | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section | 21156 | | | | 20.3 | 21157 +-------------------------+-------------+------------+------------+ 21158 | CB_NOTIFY | OPT | DDELG | Section | 21159 | | | (REQ) | 20.4 | 21160 +-------------------------+-------------+------------+------------+ 21161 | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section | 21162 | | | | 20.12 | 21163 +-------------------------+-------------+------------+------------+ 21164 | CB_NOTIFY_LOCK | OPT | | Section | 21165 | | | | 20.11 | 21166 +-------------------------+-------------+------------+------------+ 21167 | CB_PUSH_DELEG | OPT | FDELG | Section | 21168 | | | (OPT) | 20.5 | 21169 +-------------------------+-------------+------------+------------+ 21170 | CB_RECALL | OPT | FDELG, | Section | 21171 | | | DDELG, | 20.2 | 21172 | | | pNFS (REQ) | | 21173 +-------------------------+-------------+------------+------------+ 21174 | CB_RECALL_ANY | OPT | FDELG, | Section | 21175 | | | DDELG, | 20.6 | 21176 | | | pNFS (REQ) | | 21177 +-------------------------+-------------+------------+------------+ 21178 | CB_RECALL_SLOT | REQ | | Section | 21179 | | | | 20.8 | 21180 +-------------------------+-------------+------------+------------+ 21181 | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, | Section | 21182 | | | pNFS (REQ) | 20.7 | 21183 +-------------------------+-------------+------------+------------+ 21184 | CB_SEQUENCE | OPT | FDELG, | Section | 21185 | | | DDELG, | 20.9 | 21186 | | | pNFS (REQ) | | 21187 +-------------------------+-------------+------------+------------+ 21188 | CB_WANTS_CANCELLED | OPT | FDELG, | Section | 21189 | | | DDELG, | 20.10 | 21190 | | | pNFS (REQ) | | 21191 +-------------------------+-------------+------------+------------+ 21193 Table 17: Callback Operations 21195 18. NFSv4.1 Operations 21197 18.1. Operation 3: ACCESS - Check Access Rights 21199 18.1.1. ARGUMENTS 21201 const ACCESS4_READ = 0x00000001; 21202 const ACCESS4_LOOKUP = 0x00000002; 21203 const ACCESS4_MODIFY = 0x00000004; 21204 const ACCESS4_EXTEND = 0x00000008; 21205 const ACCESS4_DELETE = 0x00000010; 21206 const ACCESS4_EXECUTE = 0x00000020; 21208 struct ACCESS4args { 21209 /* CURRENT_FH: object */ 21210 uint32_t access; 21211 }; 21213 18.1.2. RESULTS 21215 struct ACCESS4resok { 21216 uint32_t supported; 21217 uint32_t access; 21218 }; 21220 union ACCESS4res switch (nfsstat4 status) { 21221 case NFS4_OK: 21222 ACCESS4resok resok4; 21223 default: 21224 void; 21225 }; 21227 18.1.3. DESCRIPTION 21229 ACCESS determines the access rights that a user, as identified by the 21230 credentials in the RPC request, has with respect to the file system 21231 object specified by the current filehandle. The client encodes the 21232 set of access rights that are to be checked in the bit mask "access". 21233 The server checks the permissions encoded in the bit mask. 
If a 21234 status of NFS4_OK is returned, two bit masks are included in the 21235 response. The first, "supported", represents the access rights for 21236 which the server can verify reliably. The second, "access", 21237 represents the access rights available to the user for the filehandle 21238 provided. On success, the current filehandle retains its value. 21240 Note that the reply's supported and access fields MUST NOT contain 21241 more values than originally set in the request's access field. For 21242 example, if the client sends an ACCESS operation with just the 21243 ACCESS4_READ value set and the server supports this value, the server 21244 MUST NOT set more than ACCESS4_READ in the supported field even if it 21245 could have reliably checked other values. 21247 The reply's access field MUST NOT contain more values than the 21248 supported field. 21250 The results of this operation are necessarily advisory in nature. A 21251 return status of NFS4_OK and the appropriate bit set in the bit mask 21252 do not imply that such access will be allowed to the file system 21253 object in the future. This is because access rights can be revoked 21254 by the server at any time. 21256 The following access permissions may be requested: 21258 ACCESS4_READ Read data from file or read a directory. 21260 ACCESS4_LOOKUP Look up a name in a directory (no meaning for non- 21261 directory objects). 21263 ACCESS4_MODIFY Rewrite existing file data or modify existing 21264 directory entries. 21266 ACCESS4_EXTEND Write new data or add directory entries. 21268 ACCESS4_DELETE Delete an existing directory entry. 21270 ACCESS4_EXECUTE Execute a regular file (no meaning for a directory). 21272 On success, the current filehandle retains its value. 21274 ACCESS4_EXECUTE is a challenging semantic to implement because NFS 21275 provides remote file access, not remote execution. This leads to the 21276 following: 21278 * Whether or not a regular file is executable ought to be the 21279 responsibility of the NFS client and not the server. And yet the 21280 ACCESS operation is specified to seemingly require a server to own 21281 that responsibility. 21283 * When a client executes a regular file, it has to read the file 21284 from the server. Strictly speaking, the server should not allow 21285 the client to read a file being executed unless the user has read 21286 permissions on the file. Requiring explicit read permissions on 21287 executable files in order to access them over NFS is not going to 21288 be acceptable to some users and storage administrators. 21289 Historically, NFS servers have allowed a user to READ a file if 21290 the user has execute access to the file. 21292 As a practical example, the UNIX specification [60] states that an 21293 implementation claiming conformance to UNIX may indicate in the 21294 access() programming interface's result that a privileged user has 21295 execute rights, even if no execute permission bits are set on the 21296 regular file's attributes. It is possible to claim conformance to 21297 the UNIX specification and instead not indicate execute rights in 21298 that situation, which is true for some operating environments. 21299 Suppose the operating environments of the client and server are 21300 implementing the access() semantics for privileged users differently, 21301 and the ACCESS operation implementations of the client and server 21302 follow their respective access() semantics. 
This can cause undesired 21303 behavior: 21305 * Suppose the client's access() interface returns X_OK if the user 21306 is privileged and no execute permission bits are set on the 21307 regular file's attribute, and the server's access() interface does 21308 not return X_OK in that situation. Then the client will be unable 21309 to execute files stored on the NFS server that could be executed 21310 if stored on a non-NFS file system. 21312 * Suppose the client's access() interface does not return X_OK if 21313 the user is privileged, and no execute permission bits are set on 21314 the regular file's attribute, and the server's access() interface 21315 does return X_OK in that situation. Then: 21317 - The client will be able to execute files stored on the NFS 21318 server that could be executed if stored on a non-NFS file 21319 system, unless the client's execution subsystem also checks for 21320 execute permission bits. 21322 - Even if the execution subsystem is checking for execute 21323 permission bits, there are more potential issues. For example, 21324 suppose the client is invoking access() to build a "path search 21325 table" of all executable files in the user's "search path", 21326 where the path is a list of directories each containing 21327 executable files. Suppose there are two files each in separate 21328 directories of the search path, such that files have the same 21329 component name. In the first directory the file has no execute 21330 permission bits set, and in the second directory the file has 21331 execute bits set. The path search table will indicate that the 21332 first directory has the executable file, but the execute 21333 subsystem will fail to execute it. The command shell might 21334 fail to try the second file in the second directory. And even 21335 if it did, this is a potential performance issue. Clearly, the 21336 desired outcome for the client is for the path search table to 21337 not contain the first file. 21339 To deal with the problems described above, the "smart client, stupid 21340 server" principle is used. The client owns overall responsibility 21341 for determining execute access and relies on the server to parse the 21342 execution permissions within the file's mode, acl, and dacl 21343 attributes. The rules for the client and server follow: 21345 * If the client is sending ACCESS in order to determine if the user 21346 can read the file, the client SHOULD set ACCESS4_READ in the 21347 request's access field. 21349 * If the client's operating environment only grants execution to the 21350 user if the user has execute access according to the execute 21351 permissions in the mode, acl, and dacl attributes, then if the 21352 client wants to determine execute access, the client SHOULD send 21353 an ACCESS request with ACCESS4_EXECUTE bit set in the request's 21354 access field. 21356 * If the client's operating environment grants execution to the user 21357 even if the user does not have execute access according to the 21358 execute permissions in the mode, acl, and dacl attributes, then if 21359 the client wants to determine execute access, it SHOULD send an 21360 ACCESS request with both the ACCESS4_EXECUTE and ACCESS4_READ bits 21361 set in the request's access field. 
This way, if any read or 21362 execute permission grants the user read or execute access (or if 21363 the server interprets the user as privileged), as indicated by the 21364 presence of ACCESS4_EXECUTE and/or ACCESS4_READ in the reply's 21365 access field, the client will be able to grant the user execute 21366 access to the file. 21368 * If the server supports execute permission bits, or some other 21369 method for denoting executability (e.g., the suffix of the name of 21370 the file might indicate execute), it MUST check only execute 21371 permissions, not read permissions, when determining whether or not 21372 the reply will have ACCESS4_EXECUTE set in the access field. The 21373 server MUST NOT also examine read permission bits when determining 21374 whether or not the reply will have ACCESS4_EXECUTE set in the 21375 access field. Even if the server's operating environment would 21376 grant execute access to the user (e.g., the user is privileged), 21377 the server MUST NOT reply with ACCESS4_EXECUTE set in the reply's 21378 access field unless there is at least one execute permission bit 21379 set in the mode, acl, or dacl attributes. In the case of acl and 21380 dacl, the "one execute permission bit" MUST be an ACE4_EXECUTE bit 21381 set in an ALLOW ACE. 21383 * If the server does not support execute permission bits or some 21384 other method for denoting executability, it MUST NOT set 21385 ACCESS4_EXECUTE in the reply's supported and access fields. If 21386 the client set ACCESS4_EXECUTE in the ACCESS request's access 21387 field, and ACCESS4_EXECUTE is not set in the reply's supported 21388 field, then the client will have to send an ACCESS request with 21389 the ACCESS4_READ bit set in the request's access field. 21391 * If the server supports read permission bits, it MUST only check 21392 for read permissions in the mode, acl, and dacl attributes when it 21393 receives an ACCESS request with ACCESS4_READ set in the access 21394 field. The server MUST NOT also examine execute permission bits 21395 when determining whether the reply will have ACCESS4_READ set in 21396 the access field or not. 21398 Note that if the ACCESS reply has ACCESS4_READ or ACCESS4_EXECUTE set, 21399 then the user also has permissions to OPEN (Section 18.16) or READ 21400 (Section 18.22) the file. In other words, if the client sends an 21401 ACCESS request with the ACCESS4_READ and ACCESS4_EXECUTE set in the 21402 access field (or two separate requests, one with ACCESS4_READ set and 21403 the other with ACCESS4_EXECUTE set), and the reply has just 21404 ACCESS4_EXECUTE set in the access field (or just one reply has 21405 ACCESS4_EXECUTE set), then the user has authorization to OPEN or READ 21406 the file. 21408 18.1.4. IMPLEMENTATION 21410 In general, it is not sufficient for the client to attempt to deduce 21411 access permissions by inspecting the uid, gid, and mode fields in the 21412 file attributes or by attempting to interpret the contents of the ACL 21413 attribute. This is because the server may perform uid or gid mapping 21414 or enforce additional access-control restrictions. It is also 21415 possible that the server may not be in the same ID space as the 21416 client. In these cases (and perhaps others), the client cannot 21417 reliably perform an access check with only current file attributes. 21419 In the NFSv2 protocol, the only reliable way to determine whether an 21420 operation was allowed was to try it and see if it succeeded or 21421 failed.
Using the ACCESS operation in the NFSv4.1 protocol, the 21422 client can ask the server to indicate whether or not one or more 21423 classes of operations are permitted. The ACCESS operation is 21424 provided to allow clients to check before doing a series of 21425 operations that will result in an access failure. The OPEN operation 21426 provides a point where the server can verify access to the file 21427 object and a method to return that information to the client. The 21428 ACCESS operation is still useful for directory operations or for use 21429 in the case that the UNIX interface access() is used on the client. 21431 The information returned by the server in response to an ACCESS call 21432 is not permanent. It was correct at the exact time that the server 21433 performed the checks, but not necessarily afterwards. The server can 21434 revoke access permission at any time. 21436 The client should use the effective credentials of the user to build 21437 the authentication information in the ACCESS request used to 21438 determine access rights. It is the effective user and group 21439 credentials that are used in subsequent READ and WRITE operations. 21441 Many implementations do not directly support the ACCESS4_DELETE 21442 permission. Operating systems like UNIX will ignore the 21443 ACCESS4_DELETE bit if set on an access request on a non-directory 21444 object. In these systems, delete permission on a file is determined 21445 by the access permissions on the directory in which the file resides, 21446 instead of being determined by the permissions of the file itself. 21447 Therefore, the mask returned enumerating which access rights can be 21448 determined will have the ACCESS4_DELETE value set to 0. This 21449 indicates to the client that the server was unable to check that 21450 particular access right. The ACCESS4_DELETE bit in the access mask 21451 returned will then be ignored by the client. 21453 18.2. Operation 4: CLOSE - Close File 21455 18.2.1. ARGUMENTS 21457 struct CLOSE4args { 21458 /* CURRENT_FH: object */ 21459 seqid4 seqid; 21460 stateid4 open_stateid; 21461 }; 21463 18.2.2. RESULTS 21465 union CLOSE4res switch (nfsstat4 status) { 21466 case NFS4_OK: 21467 stateid4 open_stateid; 21468 default: 21469 void; 21470 }; 21472 18.2.3. DESCRIPTION 21474 The CLOSE operation releases share reservations for the regular or 21475 named attribute file as specified by the current filehandle. The 21476 share reservations and other state information released at the server 21477 as a result of this CLOSE are only those associated with the supplied 21478 stateid. State associated with other OPENs is not affected. 21480 If byte-range locks are held, the client SHOULD release all locks 21481 before sending a CLOSE. The server MAY free all outstanding locks on 21482 CLOSE, but some servers may not support the CLOSE of a file that 21483 still has byte-range locks held. The server MUST return failure if 21484 any locks would exist after the CLOSE. 21486 The argument seqid MAY have any value, and the server MUST ignore 21487 seqid. 21489 On success, the current filehandle retains its value. 21491 The server MAY require that the combination of principal, security 21492 flavor, and, if applicable, GSS mechanism that sent the OPEN request 21493 also be the one to CLOSE the file. This might not be possible if 21494 credentials for the principal are no longer available. The server 21495 MAY allow the machine credential or SSV credential (see 21496 Section 18.35) to send CLOSE. 21498 18.2.4. 
IMPLEMENTATION 21500 Even though CLOSE returns a stateid, this stateid is not useful to 21501 the client and should be treated as deprecated. CLOSE "shuts down" 21502 the state associated with all OPENs for the file by a single open- 21503 owner. As noted above, CLOSE will either release all file-locking 21504 state or return an error. Therefore, the stateid returned by CLOSE 21505 is not useful for operations that follow. To help find any uses of 21506 this stateid by clients, the server SHOULD return the invalid special 21507 stateid (the "other" value is zero and the "seqid" field is 21508 NFS4_UINT32_MAX, see Section 8.2.3). 21510 A CLOSE operation may make delegations grantable where they were not 21511 previously. Servers may choose to respond immediately if there are 21512 pending delegation want requests or may respond to the situation at a 21513 later time. 21515 18.3. Operation 5: COMMIT - Commit Cached Data 21517 18.3.1. ARGUMENTS 21519 struct COMMIT4args { 21520 /* CURRENT_FH: file */ 21521 offset4 offset; 21522 count4 count; 21523 }; 21525 18.3.2. RESULTS 21526 struct COMMIT4resok { 21527 verifier4 writeverf; 21528 }; 21530 union COMMIT4res switch (nfsstat4 status) { 21531 case NFS4_OK: 21532 COMMIT4resok resok4; 21533 default: 21534 void; 21535 }; 21537 18.3.3. DESCRIPTION 21539 The COMMIT operation forces or flushes uncommitted, modified data to 21540 stable storage for the file specified by the current filehandle. The 21541 flushed data is that which was previously written with one or more 21542 WRITE operations that had the "committed" field of their results 21543 field set to UNSTABLE4. 21545 The offset specifies the position within the file where the flush is 21546 to begin. An offset value of zero means to flush data starting at 21547 the beginning of the file. The count specifies the number of bytes 21548 of data to flush. If the count is zero, a flush from the offset to 21549 the end of the file is done. 21551 The server returns a write verifier upon successful completion of the 21552 COMMIT. The write verifier is used by the client to determine if the 21553 server has restarted between the initial WRITE operations and the 21554 COMMIT. The client does this by comparing the write verifier 21555 returned from the initial WRITE operations and the verifier returned 21556 by the COMMIT operation. The server must vary the value of the write 21557 verifier at each server event or instantiation that may lead to a 21558 loss of uncommitted data. Most commonly this occurs when the server 21559 is restarted; however, other events at the server may result in 21560 uncommitted data loss as well. 21562 On success, the current filehandle retains its value. 21564 18.3.4. IMPLEMENTATION 21566 The COMMIT operation is similar in operation and semantics to the 21567 POSIX fsync() [22] system interface that synchronizes a file's state 21568 with the disk (file data and metadata is flushed to disk or stable 21569 storage). COMMIT performs the same operation for a client, flushing 21570 any unsynchronized data and metadata on the server to the server's 21571 disk or stable storage for the specified file. Like fsync(), it may 21572 be that there is some modified data or no modified data to 21573 synchronize. The data may have been synchronized by the server's 21574 normal periodic buffer synchronization activity. COMMIT should 21575 return NFS4_OK, unless there has been an unexpected error. 
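
   The mapping from the offset and count arguments to the byte range
   that must reach stable storage, as described above, can be
   summarized in a short non-normative sketch.  The function name and
   the eof parameter are illustrative only and do not correspond to
   any particular implementation:

      def commit_byte_range(offset, count, eof):
          """Resolve COMMIT's (offset, count) to the half-open byte
          range [start, end) that must be flushed.  A count of zero
          means "from offset to the end of the file"; eof is the
          current file size as known to the server."""
          if count == 0:
              return (offset, eof)
          return (offset, offset + count)

      # A full-file COMMIT (offset 0, count 0) covers the whole file,
      # while committing one 8192-byte buffer at offset 65536 covers
      # only that range.
      assert commit_byte_range(0, 0, eof=1048576) == (0, 1048576)
      assert commit_byte_range(65536, 8192, eof=1048576) == (65536, 73728)
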
21577 COMMIT differs from fsync() in that it is possible for the client to 21578 flush a range of the file (most likely triggered by a buffer- 21579 reclamation scheme on the client before the file has been completely 21580 written). 21582 The server implementation of COMMIT is reasonably simple. If the 21583 server receives a full file COMMIT request, that is, starting at 21584 offset zero and count zero, it should do the equivalent of applying 21585 fsync() to the entire file. Otherwise, it should arrange to have the 21586 modified data in the range specified by offset and count to be 21587 flushed to stable storage. In both cases, any metadata associated 21588 with the file must be flushed to stable storage before returning. It 21589 is not an error for there to be nothing to flush on the server. This 21590 means that the data and metadata that needed to be flushed have 21591 already been flushed or lost during the last server failure. 21593 The client implementation of COMMIT is a little more complex. There 21594 are two reasons for wanting to commit a client buffer to stable 21595 storage. The first is that the client wants to reuse a buffer. In 21596 this case, the offset and count of the buffer are sent to the server 21597 in the COMMIT request. The server then flushes any modified data 21598 based on the offset and count, and flushes any modified metadata 21599 associated with the file. It then returns the status of the flush 21600 and the write verifier. The second reason for the client to generate 21601 a COMMIT is for a full file flush, such as may be done at close. In 21602 this case, the client would gather all of the buffers for this file 21603 that contain uncommitted data, do the COMMIT operation with an offset 21604 of zero and count of zero, and then free all of those buffers. Any 21605 other dirty buffers would be sent to the server in the normal 21606 fashion. 21608 After a buffer is written (via the WRITE operation) by the client 21609 with the "committed" field in the result of WRITE set to UNSTABLE4, 21610 the buffer must be considered as modified by the client until the 21611 buffer has either been flushed via a COMMIT operation or written via 21612 a WRITE operation with the "committed" field in the result set to 21613 FILE_SYNC4 or DATA_SYNC4. This is done to prevent the buffer from 21614 being freed and reused before the data can be flushed to stable 21615 storage on the server. 21617 When a response is returned from either a WRITE or a COMMIT operation 21618 and it contains a write verifier that differs from that previously 21619 returned by the server, the client will need to retransmit all of the 21620 buffers containing uncommitted data to the server. How this is to be 21621 done is up to the implementor. If there is only one buffer of 21622 interest, then it should be sent in a WRITE request with the 21623 FILE_SYNC4 stable parameter. If there is more than one buffer, it 21624 might be worthwhile retransmitting all of the buffers in WRITE 21625 operations with the stable parameter set to UNSTABLE4 and then 21626 retransmitting the COMMIT operation to flush all of the data on the 21627 server to stable storage. However, if the server repeatably returns 21628 from COMMIT a verifier that differs from that returned by WRITE, the 21629 only way to ensure progress is to retransmit all of the buffers with 21630 WRITE requests with the FILE_SYNC4 stable parameter. 
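
   A minimal, non-normative sketch of the client-side bookkeeping
   described in the preceding paragraphs follows; the class and method
   names are invented for illustration and do not correspond to any
   particular implementation:

      class UncommittedWrites:
          """Track buffers written with committed == UNSTABLE4 until a
          COMMIT (or a WRITE returning FILE_SYNC4/DATA_SYNC4) confirms
          that they have reached stable storage."""

          def __init__(self):
              self.verifier = None  # writeverf seen on the UNSTABLE4 WRITEs
              self.dirty = {}       # offset -> data not yet known stable

          def note_unstable_write(self, offset, data, writeverf):
              self.dirty[offset] = data
              self.verifier = writeverf

          def note_commit_reply(self, writeverf):
              """Return the buffers that must be retransmitted; an empty
              dict means all tracked data reached stable storage."""
              if self.verifier is not None and writeverf != self.verifier:
                  # Verifier changed: the server may have lost uncommitted
                  # data (e.g., it restarted), so every dirty buffer must
                  # be sent again, possibly with stable == FILE_SYNC4.
                  return dict(self.dirty)
              self.dirty.clear()
              return {}
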
21632 The above description applies to page-cache-based systems as well as 21633 buffer-cache-based systems. In the former systems, the virtual 21634 memory system will need to be modified instead of the buffer cache. 21636 18.4. Operation 6: CREATE - Create a Non-Regular File Object 21638 18.4.1. ARGUMENTS 21640 union createtype4 switch (nfs_ftype4 type) { 21641 case NF4LNK: 21642 linktext4 linkdata; 21643 case NF4BLK: 21644 case NF4CHR: 21645 specdata4 devdata; 21646 case NF4SOCK: 21647 case NF4FIFO: 21648 case NF4DIR: 21649 void; 21650 default: 21651 void; /* server should return NFS4ERR_BADTYPE */ 21652 }; 21654 struct CREATE4args { 21655 /* CURRENT_FH: directory for creation */ 21656 createtype4 objtype; 21657 component4 objname; 21658 fattr4 createattrs; 21659 }; 21661 18.4.2. RESULTS 21662 struct CREATE4resok { 21663 change_info4 cinfo; 21664 bitmap4 attrset; /* attributes set */ 21665 }; 21667 union CREATE4res switch (nfsstat4 status) { 21668 case NFS4_OK: 21669 /* new CURRENTFH: created object */ 21670 CREATE4resok resok4; 21671 default: 21672 void; 21673 }; 21675 18.4.3. DESCRIPTION 21677 The CREATE operation creates a file object other than an ordinary 21678 file in a directory with a given name. The OPEN operation MUST be 21679 used to create a regular file or a named attribute. 21681 The current filehandle must be a directory: an object of type NF4DIR. 21682 If the current filehandle is an attribute directory (type 21683 NF4ATTRDIR), the error NFS4ERR_WRONG_TYPE is returned. If the 21684 current filehandle designates any other type of object, the error 21685 NFS4ERR_NOTDIR results. 21687 The objname specifies the name for the new object. The objtype 21688 determines the type of object to be created: directory, symlink, etc. 21689 If the object type specified is that of an ordinary file, a named 21690 attribute, or a named attribute directory, the error NFS4ERR_BADTYPE 21691 results. 21693 If an object of the same name already exists in the directory, the 21694 server will return the error NFS4ERR_EXIST. 21696 For the directory where the new file object was created, the server 21697 returns change_info4 information in cinfo. With the atomic field of 21698 the change_info4 data type, the server will indicate if the before 21699 and after change attributes were obtained atomically with respect to 21700 the file object creation. 21702 If the objname has a length of zero, or if objname does not obey the 21703 UTF-8 definition, the error NFS4ERR_INVAL will be returned. 21705 The current filehandle is replaced by that of the new object. 21707 The createattrs specifies the initial set of attributes for the 21708 object. The set of attributes may include any writable attribute 21709 valid for the object type. When the operation is successful, the 21710 server will return to the client an attribute mask signifying which 21711 attributes were successfully set for the object. 21713 If createattrs includes neither the owner attribute nor an ACL with 21714 an ACE for the owner, and if the server's file system both supports 21715 and requires an owner attribute (or an owner ACE), then the server 21716 MUST derive the owner (or the owner ACE). This would typically be 21717 from the principal indicated in the RPC credentials of the call, but 21718 the server's operating environment or file system semantics may 21719 dictate other methods of derivation. 
Similarly, if createattrs 21720 includes neither the group attribute nor a group ACE, and if the 21721 server's file system both supports and requires the notion of a group 21722 attribute (or group ACE), the server MUST derive the group attribute 21723 (or the corresponding group ACE) for the file. This could be from 21724 the RPC call's credentials, such as the group principal if the 21725 credentials include it (such as with AUTH_SYS), from the group 21726 identifier associated with the principal in the credentials (e.g., 21727 POSIX systems have a user database [23] that has a group identifier 21728 for every user identifier), inherited from the directory in which the 21729 object is created, or whatever else the server's operating 21730 environment or file system semantics dictate. This applies to the 21731 OPEN operation too. 21733 Conversely, it is possible that the client will specify in 21734 createattrs an owner attribute, group attribute, or ACL that the 21735 principal indicated in the RPC call's credentials does not have 21736 permissions to create files for. The error to be returned in this 21737 instance is NFS4ERR_PERM. This applies to the OPEN operation too. 21739 If the current filehandle designates a directory for which another 21740 client holds a directory delegation, then, unless the delegation is 21741 such that the situation can be resolved by sending a notification, 21742 the delegation MUST be recalled, and the CREATE operation MUST NOT 21743 proceed until the delegation is returned or revoked. Except where 21744 this happens very quickly, one or more NFS4ERR_DELAY errors will be 21745 returned to requests made while delegation remains outstanding. 21747 When the current filehandle designates a directory for which one or 21748 more directory delegations exist, then, when those delegations 21749 request such notifications, NOTIFY4_ADD_ENTRY will be generated as a 21750 result of this operation. 21752 If the capability FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set 21753 (Section 14.4), and a symbolic link is being created, then the 21754 content of the symbolic link MUST be in UTF-8 encoding. 21756 18.4.4. IMPLEMENTATION 21758 If the client desires to set attribute values after the create, a 21759 SETATTR operation can be added to the COMPOUND request so that the 21760 appropriate attributes will be set. 21762 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery 21764 18.5.1. ARGUMENTS 21766 struct DELEGPURGE4args { 21767 clientid4 clientid; 21768 }; 21770 18.5.2. RESULTS 21772 struct DELEGPURGE4res { 21773 nfsstat4 status; 21774 }; 21776 18.5.3. DESCRIPTION 21778 This operation purges all of the delegations awaiting recovery for a 21779 given client. This is useful for clients that do not commit 21780 delegation information to stable storage to indicate that conflicting 21781 requests need not be delayed by the server awaiting recovery of 21782 delegation information. 21784 The client is NOT specified by the clientid field of the request. 21785 The client SHOULD set the clientid field to zero, and the server MUST 21786 ignore the clientid field. Instead, the server MUST derive the 21787 client ID from the value of the session ID in the arguments of the 21788 SEQUENCE operation that precedes DELEGPURGE in the COMPOUND request. 21790 The DELEGPURGE operation should be used by clients that record 21791 delegation information on stable storage on the client.
In this 21792 case, after the client recovers all delegations it knows of, it 21793 should immediately send a DELEGPURGE operation. Doing so will notify 21794 the server that no additional delegations for the client will be 21795 recovered allowing it to free resources, and avoid delaying other 21796 clients which make requests that conflict with the unrecovered 21797 delegations. The set of delegations known to the server and the 21798 client might be different. The reason for this is that after sending 21799 a request that resulted in a delegation, the client might experience 21800 a failure before it both received the delegation and committed the 21801 delegation to the client's stable storage. 21803 The server MAY support DELEGPURGE, but if it does not, it MUST NOT 21804 support CLAIM_DELEGATE_PREV and MUST NOT support CLAIM_DELEG_PREV_FH. 21806 18.6. Operation 8: DELEGRETURN - Return Delegation 21808 18.6.1. ARGUMENTS 21810 struct DELEGRETURN4args { 21811 /* CURRENT_FH: delegated object */ 21812 stateid4 deleg_stateid; 21813 }; 21815 18.6.2. RESULTS 21817 struct DELEGRETURN4res { 21818 nfsstat4 status; 21819 }; 21821 18.6.3. DESCRIPTION 21823 The DELEGRETURN operation returns the delegation represented by the 21824 current filehandle and stateid. 21826 Delegations may be returned voluntarily (i.e., before the server has 21827 recalled them) or when recalled. In either case, the client must 21828 properly propagate state changed under the context of the delegation 21829 to the server before returning the delegation. 21831 The server MAY require that the principal, security flavor, and if 21832 applicable, the GSS mechanism, combination that acquired the 21833 delegation also be the one to send DELEGRETURN on the file. This 21834 might not be possible if credentials for the principal are no longer 21835 available. The server MAY allow the machine credential or SSV 21836 credential (see Section 18.35) to send DELEGRETURN. 21838 18.7. Operation 9: GETATTR - Get Attributes 21840 18.7.1. ARGUMENTS 21842 struct GETATTR4args { 21843 /* CURRENT_FH: object */ 21844 bitmap4 attr_request; 21845 }; 21847 18.7.2. RESULTS 21848 struct GETATTR4resok { 21849 fattr4 obj_attributes; 21850 }; 21852 union GETATTR4res switch (nfsstat4 status) { 21853 case NFS4_OK: 21854 GETATTR4resok resok4; 21855 default: 21856 void; 21857 }; 21859 18.7.3. DESCRIPTION 21861 The GETATTR operation will obtain attributes for the file system 21862 object specified by the current filehandle. The client sets a bit in 21863 the bitmap argument for each attribute value that it would like the 21864 server to return. The server returns an attribute bitmap that 21865 indicates the attribute values that it was able to return, which will 21866 include all attributes requested by the client that are attributes 21867 supported by the server for the target file system. This bitmap is 21868 followed by the attribute values ordered lowest attribute number 21869 first. 21871 The server MUST return a value for each attribute that the client 21872 requests if the attribute is supported by the server for the target 21873 file system. If the server does not support a particular attribute 21874 on the target file system, then it MUST NOT return the attribute 21875 value and MUST NOT set the attribute bit in the result bitmap. The 21876 server MUST return an error if it supports an attribute on the target 21877 but cannot obtain its value. In that case, no attribute values will 21878 be returned. 
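
   Because both attr_request and the returned attribute bitmap use the
   bitmap4 encoding defined earlier in this specification, in which
   attribute number n corresponds to bit (n mod 32) of 32-bit word
   (n div 32), the bookkeeping on either side can be sketched as
   follows.  This fragment is illustrative only and is not part of the
   protocol XDR:

      def attrs_to_bitmap(attr_numbers):
          """Build the bitmap4 word list for a set of attribute numbers."""
          words = []
          for n in attr_numbers:
              word, bit = divmod(n, 32)
              while len(words) <= word:
                  words.append(0)
              words[word] |= 1 << bit
          return words

      def bitmap_to_attrs(words):
          """List the attribute numbers encoded in a bitmap4 word list."""
          return [w * 32 + b
                  for w, word in enumerate(words)
                  for b in range(32)
                  if word & (1 << b)]

      # Requesting type (1), change (3), and size (4) needs only one word.
      assert attrs_to_bitmap([1, 3, 4]) == [0b11010]
      assert bitmap_to_attrs([0b11010]) == [1, 3, 4]

   A server that does not support one of the requested attributes
   simply leaves the corresponding bit clear in the returned bitmap,
   as required above.
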
21880 File systems that are absent should be treated as having support for 21881 a very small set of attributes as described in Section 11.4.1, even 21882 if previously, when the file system was present, more attributes were 21883 supported. 21885 All servers MUST support the REQUIRED attributes as specified in 21886 Section 5.6, for all file systems, with the exception of absent file 21887 systems. 21889 On success, the current filehandle retains its value. 21891 18.7.4. IMPLEMENTATION 21893 Suppose there is an OPEN_DELEGATE_WRITE delegation held by another 21894 client for the file in question and size and/or change are among the 21895 set of attributes being interrogated. The server has two choices. 21896 First, the server can obtain the actual current value of these 21897 attributes from the client holding the delegation by using the 21898 CB_GETATTR callback. Second, the server, particularly when the 21899 delegated client is unresponsive, can recall the delegation in 21900 question. The GETATTR MUST NOT proceed until one of the following 21901 occurs: 21903 * The requested attribute values are returned in the response to 21904 CB_GETATTR. 21906 * The OPEN_DELEGATE_WRITE delegation is returned. 21908 * The OPEN_DELEGATE_WRITE delegation is revoked. 21910 Unless one of the above happens very quickly, one or more 21911 NFS4ERR_DELAY errors will be returned while a delegation is 21912 outstanding. 21914 18.8. Operation 10: GETFH - Get Current Filehandle 21916 18.8.1. ARGUMENTS 21918 /* CURRENT_FH: */ 21919 void; 21921 18.8.2. RESULTS 21923 struct GETFH4resok { 21924 nfs_fh4 object; 21925 }; 21927 union GETFH4res switch (nfsstat4 status) { 21928 case NFS4_OK: 21929 GETFH4resok resok4; 21930 default: 21931 void; 21932 }; 21934 18.8.3. DESCRIPTION 21936 This operation returns the current filehandle value. 21938 On success, the current filehandle retains its value. 21940 As described in Section 2.10.6.4, GETFH is REQUIRED or RECOMMENDED to 21941 immediately follow certain operations, and servers are free to reject 21942 such operations if the client fails to insert GETFH in the request as 21943 REQUIRED or RECOMMENDED. Section 18.16.4.1 provides additional 21944 justification for why GETFH MUST follow OPEN. 21946 18.8.4. IMPLEMENTATION 21948 Operations that change the current filehandle like LOOKUP or CREATE 21949 do not automatically return the new filehandle as a result. For 21950 instance, if a client needs to look up a directory entry and obtain 21951 its filehandle, then the following request is needed. 21953 PUTFH (directory filehandle) 21955 LOOKUP (entry name) 21957 GETFH 21959 18.9. Operation 11: LINK - Create Link to a File 21961 18.9.1. ARGUMENTS 21963 struct LINK4args { 21964 /* SAVED_FH: source object */ 21965 /* CURRENT_FH: target directory */ 21966 component4 newname; 21967 }; 21969 18.9.2. RESULTS 21971 struct LINK4resok { 21972 change_info4 cinfo; 21973 }; 21975 union LINK4res switch (nfsstat4 status) { 21976 case NFS4_OK: 21977 LINK4resok resok4; 21978 default: 21979 void; 21980 }; 21982 18.9.3. DESCRIPTION 21984 The LINK operation creates an additional newname for the file 21985 represented by the saved filehandle, as set by the SAVEFH operation, 21986 in the directory represented by the current filehandle. The existing 21987 file and the target directory must reside within the same file system 21988 on the server. On success, the current filehandle will continue to 21989 be the target directory. 
If an object exists in the target directory 21990 with the same name as newname, the server must return NFS4ERR_EXIST. 21992 For the target directory, the server returns change_info4 information 21993 in cinfo. With the atomic field of the change_info4 data type, the 21994 server will indicate if the before and after change attributes were 21995 obtained atomically with respect to the link creation. 21997 If the newname has a length of zero, or if newname does not obey the 21998 UTF-8 definition, the error NFS4ERR_INVAL will be returned. 22000 18.9.4. IMPLEMENTATION 22002 The server MAY impose restrictions on the LINK operation such that 22003 LINK may not be done when the file is open or when that open is done 22004 by particular protocols, or with particular options or access modes. 22005 When LINK is rejected because of such restrictions, the error 22006 NFS4ERR_FILE_OPEN is returned. 22008 If a server does implement such restrictions and those restrictions 22009 include cases of NFSv4 opens preventing successful execution of a 22010 link, the server needs to recall any delegations that could hide the 22011 existence of opens relevant to that decision. The reason is that 22012 when a client holds a delegation, the server might not have an 22013 accurate account of the opens for that client, since the client may 22014 execute OPENs and CLOSEs locally. The LINK operation must be delayed 22015 only until a definitive result can be obtained. For example, suppose 22016 there are multiple delegations and one of them establishes an open 22017 whose presence would prevent the link. Given the server's semantics, 22018 NFS4ERR_FILE_OPEN may be returned to the caller as soon as that 22019 delegation is returned without waiting for other delegations to be 22020 returned. Similarly, if such opens are not associated with 22021 delegations, NFS4ERR_FILE_OPEN can be returned immediately with no 22022 delegation recall being done. 22024 If the current filehandle designates a directory for which another 22025 client holds a directory delegation, then, unless the delegation is 22026 such that the situation can be resolved by sending a notification, 22027 the delegation MUST be recalled, and the operation cannot be 22028 performed successfully until the delegation is returned or revoked. 22029 Except where this happens very quickly, one or more NFS4ERR_DELAY 22030 errors will be returned to requests made while delegation remains 22031 outstanding. 22033 When the current filehandle designates a directory for which one or 22034 more directory delegations exist, then, when those delegations 22035 request such notifications, instead of a recall, NOTIFY4_ADD_ENTRY 22036 will be generated as a result of the LINK operation. 22038 If the current file system supports the numlinks attribute, and other 22039 clients have delegations to the file being linked, then those 22040 delegations MUST be recalled and the LINK operation MUST NOT proceed 22041 until all delegations are returned or revoked. Except where this 22042 happens very quickly, one or more NFS4ERR_DELAY errors will be 22043 returned to requests made while delegation remains outstanding. 22045 Changes to any property of the "hard" linked files are reflected in 22046 all of the linked files. When a link is made to a file, the 22047 attributes for the file should have a value for numlinks that is one 22048 greater than the value before the LINK operation. 
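
   The ordering constraints above, in which SAVED_FH names the existing
   file and CURRENT_FH the target directory, are illustrated by the
   following non-normative sketch; the tuple notation and function name
   are hypothetical and stand in for whatever COMPOUND-building
   interface a client actually uses:

      def build_link_compound(src_fh, dir_fh, newname):
          """Order of operations a client might use to create a link and
          then observe the numlinks increase described above."""
          return [
              ("PUTFH", src_fh),        # current FH: the existing file
              ("SAVEFH",),              # saved FH:   the file (LINK source)
              ("PUTFH", dir_fh),        # current FH: the target directory
              ("LINK", newname),        # newname now names the saved object
              ("RESTOREFH",),           # current FH: back to the file itself
              ("GETATTR", "numlinks"),  # one greater than before the LINK
          ]

   The SEQUENCE operation that an NFSv4.1 client would place at the
   start of such a COMPOUND is omitted here for brevity.
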
22050 The statement "file and the target directory must reside within the 22051 same file system on the server" means that the fsid fields in the 22052 attributes for the objects are the same. If they reside on different 22053 file systems, the error NFS4ERR_XDEV is returned. This error may be 22054 returned by some servers when there is an internal partitioning of a 22055 file system that the LINK operation would violate. 22057 On some servers, "." and ".." are illegal values for newname and the 22058 error NFS4ERR_BADNAME will be returned if they are specified. 22060 When the current filehandle designates a named attribute directory 22061 and the object to be linked (the saved filehandle) is not a named 22062 attribute for the same object, the error NFS4ERR_XDEV MUST be 22063 returned. When the saved filehandle designates a named attribute and 22064 the current filehandle is not the appropriate named attribute 22065 directory, the error NFS4ERR_XDEV MUST also be returned. 22067 When the current filehandle designates a named attribute directory 22068 and the object to be linked (the saved filehandle) is a named 22069 attribute within that directory, the server may return the error 22070 NFS4ERR_NOTSUPP. 22072 In the case that newname is already linked to the file represented by 22073 the saved filehandle, the server will return NFS4ERR_EXIST. 22075 Note that symbolic links are created with the CREATE operation. 22077 18.10. Operation 12: LOCK - Create Lock 22079 18.10.1. ARGUMENTS 22081 /* 22082 * For LOCK, transition from open_stateid and lock_owner 22083 * to a lock stateid. 22084 */ 22085 struct open_to_lock_owner4 { 22086 seqid4 open_seqid; 22087 stateid4 open_stateid; 22088 seqid4 lock_seqid; 22089 lock_owner4 lock_owner; 22090 }; 22092 /* 22093 * For LOCK, existing lock stateid continues to request new 22094 * file lock for the same lock_owner and open_stateid. 22095 */ 22096 struct exist_lock_owner4 { 22097 stateid4 lock_stateid; 22098 seqid4 lock_seqid; 22099 }; 22101 union locker4 switch (bool new_lock_owner) { 22102 case TRUE: 22103 open_to_lock_owner4 open_owner; 22104 case FALSE: 22105 exist_lock_owner4 lock_owner; 22106 }; 22108 /* 22109 * LOCK/LOCKT/LOCKU: Record lock management 22110 */ 22111 struct LOCK4args { 22112 /* CURRENT_FH: file */ 22113 nfs_lock_type4 locktype; 22114 bool reclaim; 22115 offset4 offset; 22116 length4 length; 22117 locker4 locker; 22118 }; 22120 18.10.2. RESULTS 22122 struct LOCK4denied { 22123 offset4 offset; 22124 length4 length; 22125 nfs_lock_type4 locktype; 22126 lock_owner4 owner; 22127 }; 22129 struct LOCK4resok { 22130 stateid4 lock_stateid; 22131 }; 22133 union LOCK4res switch (nfsstat4 status) { 22134 case NFS4_OK: 22135 LOCK4resok resok4; 22136 case NFS4ERR_DENIED: 22137 LOCK4denied denied; 22138 default: 22139 void; 22140 }; 22142 18.10.3. DESCRIPTION 22144 The LOCK operation requests a byte-range lock for the byte-range 22145 specified by the offset and length parameters, and lock type 22146 specified in the locktype parameter. If this is a reclaim request, 22147 the reclaim parameter will be TRUE. 22149 Bytes in a file may be locked even if those bytes are not currently 22150 allocated to the file. To lock the file from a specific offset 22151 through the end-of-file (no matter how long the file actually is) use 22152 a length field equal to NFS4_UINT64_MAX. The server MUST return 22153 NFS4ERR_INVAL under the following combinations of length and offset: 22155 * Length is equal to zero. 
22157 * Length is not equal to NFS4_UINT64_MAX, and the sum of length and 22158 offset exceeds NFS4_UINT64_MAX. 22160 32-bit servers are servers that support locking for byte offsets that 22161 fit within 32 bits (i.e., less than or equal to NFS4_UINT32_MAX). If 22162 the client specifies a range that overlaps one or more bytes beyond 22163 offset NFS4_UINT32_MAX but does not end at offset NFS4_UINT64_MAX, 22164 then such a 32-bit server MUST return the error NFS4ERR_BAD_RANGE. 22166 If the server returns NFS4ERR_DENIED, the owner, offset, and length 22167 of a conflicting lock are returned. 22169 The locker argument specifies the lock-owner that is associated with 22170 the LOCK operation. The locker4 structure is a switched union that 22171 indicates whether the client has already created byte-range locking 22172 state associated with the current open file and lock-owner. In the 22173 case in which it has, the argument is just a stateid representing the 22174 set of locks associated with that open file and lock-owner, together 22175 with a lock_seqid value that MAY be any value and MUST be ignored by 22176 the server. In the case where no byte-range locking state has been 22177 established, or the client does not have the stateid available, the 22178 argument contains the stateid of the open file with which this lock 22179 is to be associated, together with the lock-owner with which the lock 22180 is to be associated. The open_to_lock_owner case covers the very 22181 first lock done by a lock-owner for a given open file and offers a 22182 method to use the established state of the open_stateid to transition 22183 to the use of a lock stateid. 22185 The following fields of the locker parameter MAY be set to any value 22186 by the client and MUST be ignored by the server: 22188 * The clientid field of the lock_owner field of the open_owner field 22189 (locker.open_owner.lock_owner.clientid). The reason the server 22190 MUST ignore the clientid field is that the server MUST derive the 22191 client ID from the session ID from the SEQUENCE operation of the 22192 COMPOUND request. 22194 * The open_seqid and lock_seqid fields of the open_owner field 22195 (locker.open_owner.open_seqid and locker.open_owner.lock_seqid). 22197 * The lock_seqid field of the lock_owner field 22198 (locker.lock_owner.lock_seqid). 22200 Note that the client ID appearing in a LOCK4denied structure is the 22201 actual client associated with the conflicting lock, whether this is 22202 the client ID associated with the current session or a different one. 22203 Thus, if the server returns NFS4ERR_DENIED, it MUST set the clientid 22204 field of the owner field of the denied field. 22206 If the current filehandle is not an ordinary file, an error will be 22207 returned to the client. In the case that the current filehandle 22208 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 22209 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 22210 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 22212 On success, the current filehandle retains its value. 22214 18.10.4. IMPLEMENTATION 22216 If the server is unable to determine the exact offset and length of 22217 the conflicting byte-range lock, the same offset and length that were 22218 provided in the arguments should be returned in the denied results. 22220 LOCK operations are subject to permission checks and to checks 22221 against the access type of the associated file. 
However, the 22222 specific right and modes required for various types of locks reflect 22223 the semantics of the server-exported file system, and are not 22224 specified by the protocol. For example, Windows 2000 allows a write 22225 lock of a file open for read access, while a POSIX-compliant system 22226 does not. 22228 When the client sends a LOCK operation that corresponds to a range 22229 that the lock-owner has locked already (with the same or different 22230 lock type), or to a sub-range of such a range, or to a byte-range 22231 that includes multiple locks already granted to that lock-owner, in 22232 whole or in part, and the server does not support such locking 22233 operations (i.e., does not support POSIX locking semantics), the 22234 server will return the error NFS4ERR_LOCK_RANGE. In that case, the 22235 client may return an error, or it may emulate the required 22236 operations, using only LOCK for ranges that do not include any bytes 22237 already locked by that lock-owner and LOCKU of locks held by that 22238 lock-owner (specifying an exactly matching range and type). 22239 Similarly, when the client sends a LOCK operation that amounts to 22240 upgrading (changing from a READ_LT lock to a WRITE_LT lock) or 22241 downgrading (changing from WRITE_LT lock to a READ_LT lock) an 22242 existing byte-range lock, and the server does not support such a 22243 lock, the server will return NFS4ERR_LOCK_NOTSUPP. Such operations 22244 may not perfectly reflect the required semantics in the face of 22245 conflicting LOCK operations from other clients. 22247 When a client holds an OPEN_DELEGATE_WRITE delegation, the client 22248 holding that delegation is assured that there are no opens by other 22249 clients. Thus, there can be no conflicting LOCK operations from such 22250 clients. Therefore, the client may be handling locking requests 22251 locally, without doing LOCK operations on the server. If it does 22252 that, it must be prepared to update the lock status on the server, by 22253 sending appropriate LOCK and LOCKU operations before returning the 22254 delegation. 22256 When one or more clients hold OPEN_DELEGATE_READ delegations, any 22257 LOCK operation where the server is implementing mandatory locking 22258 semantics MUST result in the recall of all such delegations. The 22259 LOCK operation may not be granted until all such delegations are 22260 returned or revoked. Except where this happens very quickly, one or 22261 more NFS4ERR_DELAY errors will be returned to requests made while the 22262 delegation remains outstanding. 22264 18.11. Operation 13: LOCKT - Test for Lock 22266 18.11.1. ARGUMENTS 22268 struct LOCKT4args { 22269 /* CURRENT_FH: file */ 22270 nfs_lock_type4 locktype; 22271 offset4 offset; 22272 length4 length; 22273 lock_owner4 owner; 22274 }; 22276 18.11.2. RESULTS 22278 union LOCKT4res switch (nfsstat4 status) { 22279 case NFS4ERR_DENIED: 22280 LOCK4denied denied; 22281 case NFS4_OK: 22282 void; 22283 default: 22284 void; 22285 }; 22287 18.11.3. DESCRIPTION 22289 The LOCKT operation tests the lock as specified in the arguments. If 22290 a conflicting lock exists, the owner, offset, length, and type of the 22291 conflicting lock are returned. The owner field in the results 22292 includes the client ID of the owner of the conflicting lock, whether 22293 this is the client ID associated with the current session or a 22294 different client ID. If no lock is held, nothing other than NFS4_OK 22295 is returned. 
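The offset and length validity rules given in Section 18.10.3, which also govern the ranges used by LOCKT and LOCKU, can be summarized by the following non-normative C sketch. The function name is illustrative and the numeric status values are placeholders; a server would apply equivalent checks to its decoded arguments before any other range processing.

   #include <stdint.h>

   /* Placeholder status values for illustration; the authoritative
    * codes are those defined by the protocol's XDR description. */
   enum { NFS4_OK = 0, NFS4ERR_INVAL = 22, NFS4ERR_BAD_RANGE = 10042 };

   #define NFS4_UINT64_MAX 0xffffffffffffffffULL
   #define NFS4_UINT32_MAX 0xffffffffULL

   /*
    * Validate the offset/length pair of a LOCK, LOCKT, or LOCKU
    * request.  'server_is_32bit' is nonzero for servers that only
    * support byte offsets that fit within 32 bits.
    */
   static int
   check_lock_range(uint64_t offset, uint64_t length, int server_is_32bit)
   {
       uint64_t last;

       if (length == 0)
           return NFS4ERR_INVAL;              /* zero-length range */

       if (length != NFS4_UINT64_MAX &&
           offset > NFS4_UINT64_MAX - length)
           return NFS4ERR_INVAL;              /* offset + length overflows */

       if (server_is_32bit) {
           /* A length of NFS4_UINT64_MAX means "through the last
            * possible byte", which even a 32-bit server accepts. */
           last = (length == NFS4_UINT64_MAX) ?
               NFS4_UINT64_MAX : offset + length - 1;
           if (last > NFS4_UINT32_MAX && last != NFS4_UINT64_MAX)
               return NFS4ERR_BAD_RANGE;
       }
       return NFS4_OK;
   }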
Lock types READ_LT and READW_LT are processed in the 22296 same way in that a conflicting lock test is done without regard to 22297 blocking or non-blocking. The same is true for WRITE_LT and 22298 WRITEW_LT. 22300 The ranges are specified as for LOCK. The NFS4ERR_INVAL and 22301 NFS4ERR_BAD_RANGE errors are returned under the same circumstances as 22302 for LOCK. 22304 The clientid field of the owner MAY be set to any value by the client 22305 and MUST be ignored by the server. The reason the server MUST ignore 22306 the clientid field is that the server MUST derive the client ID from 22307 the session ID from the SEQUENCE operation of the COMPOUND request. 22309 If the current filehandle is not an ordinary file, an error will be 22310 returned to the client. In the case that the current filehandle 22311 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 22312 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 22313 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 22315 On success, the current filehandle retains its value. 22317 18.11.4. IMPLEMENTATION 22319 If the server is unable to determine the exact offset and length of 22320 the conflicting lock, the same offset and length that were provided 22321 in the arguments should be returned in the denied results. 22323 LOCKT uses a lock_owner4 rather a stateid4, as is used in LOCK to 22324 identify the owner. This is because the client does not have to open 22325 the file to test for the existence of a lock, so a stateid might not 22326 be available. 22328 As noted in Section 18.10.4, some servers may return 22329 NFS4ERR_LOCK_RANGE to certain (otherwise non-conflicting) LOCK 22330 operations that overlap ranges already granted to the current lock- 22331 owner. 22333 The LOCKT operation's test for conflicting locks SHOULD exclude locks 22334 for the current lock-owner, and thus should return NFS4_OK in such 22335 cases. Note that this means that a server might return NFS4_OK to a 22336 LOCKT request even though a LOCK operation for the same range and 22337 lock-owner would fail with NFS4ERR_LOCK_RANGE. 22339 When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose 22340 (see Section 18.10.4) to handle LOCK requests locally. In such a 22341 case, LOCKT requests will similarly be handled locally. 22343 18.12. Operation 14: LOCKU - Unlock File 22345 18.12.1. ARGUMENTS 22346 struct LOCKU4args { 22347 /* CURRENT_FH: file */ 22348 nfs_lock_type4 locktype; 22349 seqid4 seqid; 22350 stateid4 lock_stateid; 22351 offset4 offset; 22352 length4 length; 22353 }; 22355 18.12.2. RESULTS 22357 union LOCKU4res switch (nfsstat4 status) { 22358 case NFS4_OK: 22359 stateid4 lock_stateid; 22360 default: 22361 void; 22362 }; 22364 18.12.3. DESCRIPTION 22366 The LOCKU operation unlocks the byte-range lock specified by the 22367 parameters. The client may set the locktype field to any value that 22368 is legal for the nfs_lock_type4 enumerated type, and the server MUST 22369 accept any legal value for locktype. Any legal value for locktype 22370 has no effect on the success or failure of the LOCKU operation. 22372 The ranges are specified as for LOCK. The NFS4ERR_INVAL and 22373 NFS4ERR_BAD_RANGE errors are returned under the same circumstances as 22374 for LOCK. 22376 The seqid parameter MAY be any value and the server MUST ignore it. 22378 If the current filehandle is not an ordinary file, an error will be 22379 returned to the client. 
In the case that the current filehandle 22380 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 22381 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 22382 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 22384 On success, the current filehandle retains its value. 22386 The server MAY require that the principal, security flavor, and if 22387 applicable, the GSS mechanism, combination that sent a LOCK operation 22388 also be the one to send LOCKU on the file. This might not be 22389 possible if credentials for the principal are no longer available. 22390 The server MAY allow the machine credential or SSV credential (see 22391 Section 18.35) to send LOCKU. 22393 18.12.4. IMPLEMENTATION 22395 If the area to be unlocked does not correspond exactly to a lock 22396 actually held by the lock-owner, the server may return the error 22397 NFS4ERR_LOCK_RANGE. This includes the case in which the area is not 22398 locked, where the area is a sub-range of the area locked, where it 22399 overlaps the area locked without matching exactly, or the area 22400 specified includes multiple locks held by the lock-owner. In all of 22401 these cases, allowed by POSIX locking [21] semantics, a client 22402 receiving this error should, if it desires support for such 22403 operations, simulate the operation using LOCKU on ranges 22404 corresponding to locks it actually holds, possibly followed by LOCK 22405 operations for the sub-ranges not being unlocked. 22407 When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose 22408 (see Section 18.10.4) to handle LOCK requests locally. In such a 22409 case, LOCKU operations will similarly be handled locally. 22411 18.13. Operation 15: LOOKUP - Lookup Filename 22413 18.13.1. ARGUMENTS 22415 struct LOOKUP4args { 22416 /* CURRENT_FH: directory */ 22417 component4 objname; 22418 }; 22420 18.13.2. RESULTS 22422 struct LOOKUP4res { 22423 /* New CURRENT_FH: object */ 22424 nfsstat4 status; 22425 }; 22427 18.13.3. DESCRIPTION 22429 The LOOKUP operation looks up or finds a file system object using the 22430 directory specified by the current filehandle. LOOKUP evaluates the 22431 component and if the object exists, the current filehandle is 22432 replaced with the component's filehandle. 22434 If the component cannot be evaluated either because it does not exist 22435 or because the client does not have permission to evaluate the 22436 component, then an error will be returned and the current filehandle 22437 will be unchanged. 22439 If the component is a zero-length string or if any component does not 22440 obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned. 22442 18.13.4. IMPLEMENTATION 22444 If the client wants to achieve the effect of a multi-component look 22445 up, it may construct a COMPOUND request such as (and obtain each 22446 filehandle): 22448 PUTFH (directory filehandle) 22449 LOOKUP "pub" 22450 GETFH 22451 LOOKUP "foo" 22452 GETFH 22453 LOOKUP "bar" 22454 GETFH 22456 Unlike NFSv3, NFSv4.1 allows LOOKUP requests to cross mountpoints on 22457 the server. The client can detect a mountpoint crossing by comparing 22458 the fsid attribute of the directory with the fsid attribute of the 22459 directory looked up. If the fsids are different, then the new 22460 directory is a server mountpoint. UNIX clients that detect a 22461 mountpoint crossing will need to mount the server's file system. 
22462 This needs to be done to maintain the file object identity checking 22463 mechanisms common to UNIX clients. 22465 Servers that limit NFS access to "shared" or "exported" file systems 22466 should provide a pseudo file system into which the exported file 22467 systems can be integrated, so that clients can browse the server's 22468 namespace. The clients view of a pseudo file system will be limited 22469 to paths that lead to exported file systems. 22471 Note: previous versions of the protocol assigned special semantics to 22472 the names "." and "..". NFSv4.1 assigns no special semantics to 22473 these names. The LOOKUPP operator must be used to look up a parent 22474 directory. 22476 Note that this operation does not follow symbolic links. The client 22477 is responsible for all parsing of filenames including filenames that 22478 are modified by symbolic links encountered during the look up 22479 process. 22481 If the current filehandle supplied is not a directory but a symbolic 22482 link, the error NFS4ERR_SYMLINK is returned as the error. For all 22483 other non-directory file types, the error NFS4ERR_NOTDIR is returned. 22485 18.14. Operation 16: LOOKUPP - Lookup Parent Directory 22487 18.14.1. ARGUMENTS 22488 /* CURRENT_FH: object */ 22489 void; 22491 18.14.2. RESULTS 22493 struct LOOKUPP4res { 22494 /* new CURRENT_FH: parent directory */ 22495 nfsstat4 status; 22496 }; 22498 18.14.3. DESCRIPTION 22500 The current filehandle is assumed to refer to a regular directory or 22501 a named attribute directory. LOOKUPP assigns the filehandle for its 22502 parent directory to be the current filehandle. If there is no parent 22503 directory, an NFS4ERR_NOENT error must be returned. Therefore, 22504 NFS4ERR_NOENT will be returned by the server when the current 22505 filehandle is at the root or top of the server's file tree. 22507 As is the case with LOOKUP, LOOKUPP will also cross mountpoints. 22509 If the current filehandle is not a directory or named attribute 22510 directory, the error NFS4ERR_NOTDIR is returned. 22512 If the requester's security flavor does not match that configured for 22513 the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC 22514 (a future minor revision of NFSv4 may upgrade this to MUST) in the 22515 LOOKUPP response. However, if the server does so, it MUST support 22516 the SECINFO_NO_NAME operation (Section 18.45), so that the client can 22517 gracefully determine the correct security flavor. 22519 If the current filehandle is a named attribute directory that is 22520 associated with a file system object via OPENATTR (i.e., not a sub- 22521 directory of a named attribute directory), LOOKUPP SHOULD return the 22522 filehandle of the associated file system object. 22524 18.14.4. IMPLEMENTATION 22526 An issue to note is upward navigation from named attribute 22527 directories. The named attribute directories are essentially 22528 detached from the namespace, and this property should be safely 22529 represented in the client operating environment. LOOKUPP on a named 22530 attribute directory may return the filehandle of the associated file, 22531 and conveying this to applications might be unsafe as many 22532 applications expect the parent of an object to always be a directory. 22533 Therefore, the client may want to hide the parent of named attribute 22534 directories (represented as ".." in UNIX) or represent the named 22535 attribute directory as its own parent (as is typically done for the 22536 file system root directory in UNIX). 
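As a purely illustrative example of the approach just described, a client might choose the object that ".." resolves to as in the following non-normative C sketch. The structure and field names are hypothetical and are not part of the protocol.

   /* Hypothetical client-side node; the type and field names are
    * illustrative and not part of the protocol. */
   typedef struct client_node client_node;
   struct client_node {
       int          nfs_type;   /* NF4DIR, NF4ATTRDIR, ... */
       client_node *parent;     /* parent as recorded by the client */
   };

   enum { NF4ATTRDIR = 8 };     /* value from the protocol's XDR */

   /*
    * Decide what ".." should resolve to for a directory node.  A named
    * attribute directory is presented as its own parent, the same
    * convention UNIX uses for the file system root, so applications
    * never see a non-directory object as a parent.
    */
   static client_node *
   dotdot_target(client_node *dir)
   {
       if (dir->nfs_type == NF4ATTRDIR)
           return dir;                 /* hide the associated file object */
       return dir->parent ? dir->parent : dir;
   }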
22538 18.15. Operation 17: NVERIFY - Verify Difference in Attributes 22540 18.15.1. ARGUMENTS 22542 struct NVERIFY4args { 22543 /* CURRENT_FH: object */ 22544 fattr4 obj_attributes; 22545 }; 22547 18.15.2. RESULTS 22549 struct NVERIFY4res { 22550 nfsstat4 status; 22551 }; 22553 18.15.3. DESCRIPTION 22555 This operation is used to prefix a sequence of operations to be 22556 performed if one or more attributes have changed on some file system 22557 object. If all the attributes match, then the error NFS4ERR_SAME 22558 MUST be returned. 22560 On success, the current filehandle retains its value. 22562 18.15.4. IMPLEMENTATION 22564 This operation is useful as a cache validation operator. If the 22565 object to which the attributes belong has changed, then the following 22566 operations may obtain new data associated with that object, for 22567 instance, to check if a file has been changed and obtain new data if 22568 it has: 22570 SEQUENCE 22571 PUTFH fh 22572 NVERIFY attrbits attrs 22573 READ 0 32767 22575 Contrast this with NFSv3, which would first send a GETATTR in one 22576 request/reply round trip, and then if attributes indicated that the 22577 client's cache was stale, then send a READ in another request/reply 22578 round trip. 22580 In the case that a RECOMMENDED attribute is specified in the NVERIFY 22581 operation and the server does not support that attribute for the file 22582 system object, the error NFS4ERR_ATTRNOTSUPP is returned to the 22583 client. 22585 When the attribute rdattr_error or any set-only attribute (e.g., 22586 time_modify_set) is specified, the error NFS4ERR_INVAL is returned to 22587 the client. 22589 18.16. Operation 18: OPEN - Open a Regular File 22591 18.16.1. ARGUMENTS 22593 /* 22594 * Various definitions for OPEN 22595 */ 22596 enum createmode4 { 22597 UNCHECKED4 = 0, 22598 GUARDED4 = 1, 22599 /* Deprecated in NFSv4.1. */ 22600 EXCLUSIVE4 = 2, 22601 /* 22602 * New to NFSv4.1. If session is persistent, 22603 * GUARDED4 MUST be used. Otherwise, use 22604 * EXCLUSIVE4_1 instead of EXCLUSIVE4. 
22605 */ 22606 EXCLUSIVE4_1 = 3 22607 }; 22609 struct creatverfattr { 22610 verifier4 cva_verf; 22611 fattr4 cva_attrs; 22612 }; 22614 union createhow4 switch (createmode4 mode) { 22615 case UNCHECKED4: 22616 case GUARDED4: 22617 fattr4 createattrs; 22619 case EXCLUSIVE4: 22620 verifier4 createverf; 22621 case EXCLUSIVE4_1: 22622 creatverfattr ch_createboth; 22623 }; 22625 enum opentype4 { 22626 OPEN4_NOCREATE = 0, 22627 OPEN4_CREATE = 1 22628 }; 22630 union openflag4 switch (opentype4 opentype) { 22631 case OPEN4_CREATE: 22632 createhow4 how; 22633 default: 22634 void; 22635 }; 22637 /* Next definitions used for OPEN delegation */ 22638 enum limit_by4 { 22639 NFS_LIMIT_SIZE = 1, 22640 NFS_LIMIT_BLOCKS = 2 22641 /* others as needed */ 22642 }; 22644 struct nfs_modified_limit4 { 22645 uint32_t num_blocks; 22646 uint32_t bytes_per_block; 22647 }; 22649 union nfs_space_limit4 switch (limit_by4 limitby) { 22650 /* limit specified as file size */ 22651 case NFS_LIMIT_SIZE: 22652 uint64_t filesize; 22653 /* limit specified by number of blocks */ 22654 case NFS_LIMIT_BLOCKS: 22655 nfs_modified_limit4 mod_blocks; 22656 } ; 22658 /* 22659 * Share Access and Deny constants for open argument 22660 */ 22661 const OPEN4_SHARE_ACCESS_READ = 0x00000001; 22662 const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; 22663 const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; 22665 const OPEN4_SHARE_DENY_NONE = 0x00000000; 22666 const OPEN4_SHARE_DENY_READ = 0x00000001; 22667 const OPEN4_SHARE_DENY_WRITE = 0x00000002; 22668 const OPEN4_SHARE_DENY_BOTH = 0x00000003; 22670 /* new flags for share_access field of OPEN4args */ 22671 const OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00; 22672 const OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE = 0x0000; 22673 const OPEN4_SHARE_ACCESS_WANT_READ_DELEG = 0x0100; 22674 const OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200; 22675 const OPEN4_SHARE_ACCESS_WANT_ANY_DELEG = 0x0300; 22676 const OPEN4_SHARE_ACCESS_WANT_NO_DELEG = 0x0400; 22677 const OPEN4_SHARE_ACCESS_WANT_CANCEL = 0x0500; 22679 const 22680 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 22681 = 0x10000; 22683 const 22684 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 22685 = 0x20000; 22687 enum open_delegation_type4 { 22688 OPEN_DELEGATE_NONE = 0, 22689 OPEN_DELEGATE_READ = 1, 22690 OPEN_DELEGATE_WRITE = 2, 22691 OPEN_DELEGATE_NONE_EXT = 3 /* new to v4.1 */ 22692 }; 22694 enum open_claim_type4 { 22695 /* 22696 * Not a reclaim. 22697 */ 22698 CLAIM_NULL = 0, 22700 CLAIM_PREVIOUS = 1, 22701 CLAIM_DELEGATE_CUR = 2, 22702 CLAIM_DELEGATE_PREV = 3, 22704 /* 22705 * Not a reclaim. 22706 * 22707 * Like CLAIM_NULL, but object identified 22708 * by the current filehandle. 22709 */ 22710 CLAIM_FH = 4, /* new to v4.1 */ 22712 /* 22713 * Like CLAIM_DELEGATE_CUR, but object identified 22714 * by current filehandle. 22715 */ 22716 CLAIM_DELEG_CUR_FH = 5, /* new to v4.1 */ 22718 /* 22719 * Like CLAIM_DELEGATE_PREV, but object identified 22720 * by current filehandle. 22721 */ 22722 CLAIM_DELEG_PREV_FH = 6 /* new to v4.1 */ 22723 }; 22725 struct open_claim_delegate_cur4 { 22726 stateid4 delegate_stateid; 22727 component4 file; 22728 }; 22730 union open_claim4 switch (open_claim_type4 claim) { 22731 /* 22732 * No special rights to file. 22733 * Ordinary OPEN of the specified file. 22734 */ 22735 case CLAIM_NULL: 22736 /* CURRENT_FH: directory */ 22737 component4 file; 22738 /* 22739 * Right to the file established by an 22740 * open previous to server reboot. File 22741 * identified by filehandle obtained at 22742 * that time rather than by name. 
22743 */ 22744 case CLAIM_PREVIOUS: 22745 /* CURRENT_FH: file being reclaimed */ 22746 open_delegation_type4 delegate_type; 22748 /* 22749 * Right to file based on a delegation 22750 * granted by the server. File is 22751 * specified by name. 22752 */ 22753 case CLAIM_DELEGATE_CUR: 22754 /* CURRENT_FH: directory */ 22755 open_claim_delegate_cur4 delegate_cur_info; 22757 /* 22758 * Right to file based on a delegation 22759 * granted to a previous boot instance 22760 * of the client. File is specified by name. 22761 */ 22763 case CLAIM_DELEGATE_PREV: 22764 /* CURRENT_FH: directory */ 22765 component4 file_delegate_prev; 22767 /* 22768 * Like CLAIM_NULL. No special rights 22769 * to file. Ordinary OPEN of the 22770 * specified file by current filehandle. 22771 */ 22772 case CLAIM_FH: /* new to v4.1 */ 22773 /* CURRENT_FH: regular file to open */ 22774 void; 22776 /* 22777 * Like CLAIM_DELEGATE_PREV. Right to file based on a 22778 * delegation granted to a previous boot 22779 * instance of the client. File is identified 22780 * by filehandle. 22781 */ 22782 case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ 22783 /* CURRENT_FH: file being opened */ 22784 void; 22786 /* 22787 * Like CLAIM_DELEGATE_CUR. Right to file based on 22788 * a delegation granted by the server. 22789 * File is identified by filehandle. 22790 */ 22791 case CLAIM_DELEG_CUR_FH: /* new to v4.1 */ 22792 /* CURRENT_FH: file being opened */ 22793 stateid4 oc_delegate_stateid; 22795 }; 22797 /* 22798 * OPEN: Open a file, potentially receiving an OPEN delegation 22799 */ 22800 struct OPEN4args { 22801 seqid4 seqid; 22802 uint32_t share_access; 22803 uint32_t share_deny; 22804 open_owner4 owner; 22805 openflag4 openhow; 22806 open_claim4 claim; 22807 }; 22809 18.16.2. RESULTS 22810 struct open_read_delegation4 { 22811 stateid4 stateid; /* Stateid for delegation*/ 22812 bool recall; /* Pre-recalled flag for 22813 delegations obtained 22814 by reclaim (CLAIM_PREVIOUS) */ 22816 nfsace4 permissions; /* Defines users who don't 22817 need an ACCESS call to 22818 open for read */ 22819 }; 22821 struct open_write_delegation4 { 22822 stateid4 stateid; /* Stateid for delegation */ 22823 bool recall; /* Pre-recalled flag for 22824 delegations obtained 22825 by reclaim 22826 (CLAIM_PREVIOUS) */ 22828 nfs_space_limit4 22829 space_limit; /* Defines condition that 22830 the client must check to 22831 determine whether the 22832 file needs to be flushed 22833 to the server on close. */ 22835 nfsace4 permissions; /* Defines users who don't 22836 need an ACCESS call as 22837 part of a delegated 22838 open. 
*/ 22839 }; 22841 enum why_no_delegation4 { /* new to v4.1 */ 22842 WND4_NOT_WANTED = 0, 22843 WND4_CONTENTION = 1, 22844 WND4_RESOURCE = 2, 22845 WND4_NOT_SUPP_FTYPE = 3, 22846 WND4_WRITE_DELEG_NOT_SUPP_FTYPE = 4, 22847 WND4_NOT_SUPP_UPGRADE = 5, 22848 WND4_NOT_SUPP_DOWNGRADE = 6, 22849 WND4_CANCELLED = 7, 22850 WND4_IS_DIR = 8 22851 }; 22853 union open_none_delegation4 /* new to v4.1 */ 22854 switch (why_no_delegation4 ond_why) { 22855 case WND4_CONTENTION: 22856 bool ond_server_will_push_deleg; 22858 case WND4_RESOURCE: 22859 bool ond_server_will_signal_avail; 22860 default: 22861 void; 22862 }; 22864 union open_delegation4 22865 switch (open_delegation_type4 delegation_type) { 22866 case OPEN_DELEGATE_NONE: 22867 void; 22868 case OPEN_DELEGATE_READ: 22869 open_read_delegation4 read; 22870 case OPEN_DELEGATE_WRITE: 22871 open_write_delegation4 write; 22872 case OPEN_DELEGATE_NONE_EXT: /* new to v4.1 */ 22873 open_none_delegation4 od_whynone; 22874 }; 22876 /* 22877 * Result flags 22878 */ 22880 /* Client must confirm open */ 22881 const OPEN4_RESULT_CONFIRM = 0x00000002; 22882 /* Type of file locking behavior at the server */ 22883 const OPEN4_RESULT_LOCKTYPE_POSIX = 0x00000004; 22884 /* Server will preserve file if removed while open */ 22885 const OPEN4_RESULT_PRESERVE_UNLINKED = 0x00000008; 22887 /* 22888 * Server may use CB_NOTIFY_LOCK on locks 22889 * derived from this open 22890 */ 22891 const OPEN4_RESULT_MAY_NOTIFY_LOCK = 0x00000020; 22893 struct OPEN4resok { 22894 stateid4 stateid; /* Stateid for open */ 22895 change_info4 cinfo; /* Directory Change Info */ 22896 uint32_t rflags; /* Result flags */ 22897 bitmap4 attrset; /* attribute set for create*/ 22898 open_delegation4 delegation; /* Info on any open 22899 delegation */ 22900 }; 22902 union OPEN4res switch (nfsstat4 status) { 22903 case NFS4_OK: 22904 /* New CURRENT_FH: opened file */ 22905 OPEN4resok resok4; 22907 default: 22908 void; 22909 }; 22911 18.16.3. DESCRIPTION 22913 The OPEN operation opens a regular file in a directory with the 22914 provided name or filehandle. OPEN can also create a file if a name 22915 is provided, and the client specifies it wants to create a file. 22916 Whether or not a file is to be created, and the 22917 method of creation, are specified via the openhow parameter. The openhow 22918 parameter consists of a switched union (data type openflag4), which 22919 switches on the value of opentype (OPEN4_NOCREATE or OPEN4_CREATE). 22920 If OPEN4_CREATE is specified, this leads to another switched union 22921 (data type createhow4) that supports four cases of creation methods: 22922 UNCHECKED4, GUARDED4, EXCLUSIVE4, or EXCLUSIVE4_1. If opentype is 22923 OPEN4_CREATE, then the claim field of the OPEN argument MUST be one of 22924 CLAIM_NULL, CLAIM_DELEGATE_CUR, or CLAIM_DELEGATE_PREV, because these 22925 claim methods include a component of a file name. 22927 Upon success (which might entail creation of a new file), the current 22928 filehandle is replaced by that of the created or existing object. 22930 If the current filehandle is a named attribute directory, OPEN will 22931 then create or open a named attribute file. Note that exclusive 22932 create of a named attribute is not supported. If the createmode is 22933 EXCLUSIVE4 or EXCLUSIVE4_1 and the current filehandle is a named 22934 attribute directory, the server will return NFS4ERR_INVAL.
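The restriction that OPEN4_CREATE be combined only with name-bearing claim types, described above, can be rendered as the following non-normative C check. The enumerators mirror the ARGUMENTS XDR; the function name is illustrative only.

   #include <stdbool.h>

   /* Enumerators copied from the ARGUMENTS XDR above. */
   enum opentype4 { OPEN4_NOCREATE = 0, OPEN4_CREATE = 1 };
   enum open_claim_type4 {
       CLAIM_NULL = 0, CLAIM_PREVIOUS = 1, CLAIM_DELEGATE_CUR = 2,
       CLAIM_DELEGATE_PREV = 3, CLAIM_FH = 4, CLAIM_DELEG_CUR_FH = 5,
       CLAIM_DELEG_PREV_FH = 6
   };

   /*
    * OPEN4_CREATE may only be combined with claim types that carry a
    * component name, since the file to be created is identified by
    * name rather than by filehandle.
    */
   static bool
   claim_usable_with_create(enum opentype4 opentype,
                            enum open_claim_type4 claim)
   {
       if (opentype != OPEN4_CREATE)
           return true;
       return claim == CLAIM_NULL ||
              claim == CLAIM_DELEGATE_CUR ||
              claim == CLAIM_DELEGATE_PREV;
   }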
22936 UNCHECKED4 means that the file should be created if a file of that 22937 name does not exist and encountering an existing regular file of that 22938 name is not an error. For this type of create, createattrs specifies 22939 the initial set of attributes for the file. The set of attributes 22940 may include any writable attribute valid for regular files. When an 22941 UNCHECKED4 create encounters an existing file, the attributes 22942 specified by createattrs are not used, except that when createattrs 22943 specifies the size attribute with a size of zero, the existing file 22944 is truncated. 22946 If GUARDED4 is specified, the server checks for the presence of a 22947 duplicate object by name before performing the create. If a 22948 duplicate exists, NFS4ERR_EXIST is returned. If the object does not 22949 exist, the request is performed as described for UNCHECKED4. 22951 For the UNCHECKED4 and GUARDED4 cases, where the operation is 22952 successful, the server will return to the client an attribute mask 22953 signifying which attributes were successfully set for the object. 22955 EXCLUSIVE4_1 and EXCLUSIVE4 specify that the server is to follow 22956 exclusive creation semantics, using the verifier to ensure exclusive 22957 creation of the target. The server should check for the presence of 22958 a duplicate object by name. If the object does not exist, the server 22959 creates the object and stores the verifier with the object. If the 22960 object does exist and the stored verifier matches the client provided 22961 verifier, the server uses the existing object as the newly created 22962 object. If the stored verifier does not match, then an error of 22963 NFS4ERR_EXIST is returned. 22965 If using EXCLUSIVE4, and if the server uses attributes to store the 22966 exclusive create verifier, the server will signify which attributes 22967 it used by setting the appropriate bits in the attribute mask that is 22968 returned in the results. Unlike UNCHECKED4, GUARDED4, and 22969 EXCLUSIVE4_1, EXCLUSIVE4 does not support the setting of attributes 22970 at file creation, and after a successful OPEN via EXCLUSIVE4, the 22971 client MUST send a SETATTR to set attributes to a known state. 22973 In NFSv4.1, EXCLUSIVE4 has been deprecated in favor of EXCLUSIVE4_1. 22974 Unlike EXCLUSIVE4, attributes may be provided in the EXCLUSIVE4_1 22975 case, but because the server may use attributes of the target object 22976 to store the verifier, the set of allowable attributes may be fewer 22977 than the set of attributes SETATTR allows. The allowable attributes 22978 for EXCLUSIVE4_1 are indicated in the suppattr_exclcreat 22979 (Section 5.8.1.14) attribute. If the client attempts to set in 22980 cva_attrs an attribute that is not in suppattr_exclcreat, the server 22981 MUST return NFS4ERR_INVAL. The response field, attrset, indicates 22982 both which attributes the server set from cva_attrs and which 22983 attributes the server used to store the verifier. As described in 22984 Section 18.16.4, the client can compare cva_attrs.attrmask with 22985 attrset to determine which attributes were used to store the 22986 verifier. 22988 With the addition of persistent sessions and pNFS, under some 22989 conditions EXCLUSIVE4 MUST NOT be used by the client or supported by 22990 the server. 
The following table summarizes the appropriate and 22991 mandated exclusive create methods for implementations of NFSv4.1: 22993 +=============+==========+==============+=======================+ 22994 | Persistent | Server | Server | Client Allowed | 22995 | Reply Cache | Supports | REQUIRED | | 22996 | Enabled | pNFS | | | 22997 +=============+==========+==============+=======================+ 22998 | no | no | EXCLUSIVE4_1 | EXCLUSIVE4_1 (SHOULD) | 22999 | | | and | or EXCLUSIVE4 (SHOULD | 23000 | | | EXCLUSIVE4 | NOT) | 23001 +-------------+----------+--------------+-----------------------+ 23002 | no | yes | EXCLUSIVE4_1 | EXCLUSIVE4_1 | 23003 +-------------+----------+--------------+-----------------------+ 23004 | yes | no | GUARDED4 | GUARDED4 | 23005 +-------------+----------+--------------+-----------------------+ 23006 | yes | yes | GUARDED4 | GUARDED4 | 23007 +-------------+----------+--------------+-----------------------+ 23009 Table 18: Required Methods for Exclusive Create 23011 If CREATE_SESSION4_FLAG_PERSIST is set in the results of 23012 CREATE_SESSION, the reply cache is persistent (see Section 18.36). 23013 If the EXCHGID4_FLAG_USE_PNFS_MDS flag is set in the results from 23014 EXCHANGE_ID, the server is a pNFS server (see Section 18.35). If the 23015 client attempts to use EXCLUSIVE4 on a persistent session, or a 23016 session derived from an EXCHGID4_FLAG_USE_PNFS_MDS client ID, the 23017 server MUST return NFS4ERR_INVAL. 23019 With persistent sessions, exclusive create semantics are fully 23020 achievable via GUARDED4, and so EXCLUSIVE4 or EXCLUSIVE4_1 MUST NOT 23021 be used. When pNFS is being used, the layout_hint attribute might 23022 not be supported after the file is created. Only the EXCLUSIVE4_1 23023 and GUARDED methods of exclusive file creation allow the atomic 23024 setting of attributes. 23026 For the target directory, the server returns change_info4 information 23027 in cinfo. With the atomic field of the change_info4 data type, the 23028 server will indicate if the before and after change attributes were 23029 obtained atomically with respect to the link creation. 23031 The OPEN operation provides for Windows share reservation capability 23032 with the use of the share_access and share_deny fields of the OPEN 23033 arguments. The client specifies at OPEN the required share_access 23034 and share_deny modes. For clients that do not directly support 23035 SHAREs (i.e., UNIX), the expected deny value is 23036 OPEN4_SHARE_DENY_NONE. In the case that there is an existing SHARE 23037 reservation that conflicts with the OPEN request, the server returns 23038 the error NFS4ERR_SHARE_DENIED. For additional discussion of SHARE 23039 semantics, see Section 9.7. 23041 For each OPEN, the client provides a value for the owner field of the 23042 OPEN argument. The owner field is of data type open_owner4, and 23043 contains a field called clientid and a field called owner. The 23044 client can set the clientid field to any value and the server MUST 23045 ignore it. Instead, the server MUST derive the client ID from the 23046 session ID of the SEQUENCE operation of the COMPOUND request. 23048 The "seqid" field of the request is not used in NFSv4.1, but it MAY 23049 be any value and the server MUST ignore it. 23051 In the case that the client is recovering state from a server 23052 failure, the claim field of the OPEN argument is used to signify that 23053 the request is meant to reclaim state previously held. 
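Before turning to the claim types in detail, note that a client's choice among the exclusive create methods summarized in Table 18 can be sketched as in the following non-normative C fragment; the function name and parameters are illustrative only.

   #include <stdbool.h>

   enum createmode4 {
       UNCHECKED4 = 0, GUARDED4 = 1, EXCLUSIVE4 = 2, EXCLUSIVE4_1 = 3
   };

   /*
    * Pick an exclusive create method following Table 18.  'persistent'
    * reflects CREATE_SESSION4_FLAG_PERSIST in the CREATE_SESSION
    * results; 'pnfs_mds' reflects EXCHGID4_FLAG_USE_PNFS_MDS in the
    * EXCHANGE_ID results.
    */
   static enum createmode4
   pick_exclusive_create(bool persistent, bool pnfs_mds)
   {
       (void)pnfs_mds;     /* pNFS only narrows the choice further by
                              making EXCLUSIVE4 illegal; EXCLUSIVE4_1
                              is the preferred method either way */
       if (persistent)
           return GUARDED4;    /* persistent reply cache makes GUARDED4
                                  exactly-once, so EXCLUSIVE4* MUST NOT
                                  be used */
       return EXCLUSIVE4_1;    /* EXCLUSIVE4 is deprecated (SHOULD NOT) */
   }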
23055 The "claim" field of the OPEN argument is used to specify the file to 23056 be opened and the state information that the client claims to 23057 possess. There are seven claim types as follows: 23059 +======================+============================================+ 23060 | open type | description | 23061 +======================+============================================+ 23062 | CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN | 23063 | | request and there is no previous state | 23064 | | associated with the file for the | 23065 | | client. With CLAIM_NULL, the file is | 23066 | | identified by the current filehandle | 23067 | | and the specified component name. | 23068 | | With CLAIM_FH (new to NFSv4.1), the | 23069 | | file is identified by just the current | 23070 | | filehandle. | 23071 +----------------------+--------------------------------------------+ 23072 | CLAIM_PREVIOUS | The client is claiming basic OPEN | 23073 | | state for a file that was held | 23074 | | previous to a server restart. | 23075 | | Generally used when a server is | 23076 | | returning persistent filehandles; the | 23077 | | client may not have the file name to | 23078 | | reclaim the OPEN. | 23079 +----------------------+--------------------------------------------+ 23080 | CLAIM_DELEGATE_CUR, | The client is claiming a delegation | 23081 | CLAIM_DELEG_CUR_FH | for OPEN as granted by the server. | 23082 | | Generally, this is done as part of | 23083 | | recalling a delegation. With | 23084 | | CLAIM_DELEGATE_CUR, the file is | 23085 | | identified by the current filehandle | 23086 | | and the specified component name. | 23087 | | With CLAIM_DELEG_CUR_FH (new to | 23088 | | NFSv4.1), the file is identified by | 23089 | | just the current filehandle. | 23090 +----------------------+--------------------------------------------+ 23091 | CLAIM_DELEGATE_PREV, | The client is claiming a delegation | 23092 | CLAIM_DELEG_PREV_FH | granted to a previous client instance; | 23093 | | used after the client restarts. The | 23094 | | server MAY support CLAIM_DELEGATE_PREV | 23095 | | and/or CLAIM_DELEG_PREV_FH (new to | 23096 | | NFSv4.1). If it does support either | 23097 | | claim type, CREATE_SESSION MUST NOT | 23098 | | remove the client's delegation state, | 23099 | | and the server MUST support the | 23100 | | DELEGPURGE operation. | 23101 +----------------------+--------------------------------------------+ 23103 Table 19 23105 For OPEN requests that reach the server during the grace period, the 23106 server returns an error of NFS4ERR_GRACE. The following claim types 23107 are exceptions: 23109 * OPEN requests specifying the claim type CLAIM_PREVIOUS are devoted 23110 to reclaiming opens after a server restart and are typically only 23111 valid during the grace period. 23113 * OPEN requests specifying the claim types CLAIM_DELEGATE_CUR and 23114 CLAIM_DELEG_CUR_FH are valid both during and after the grace 23115 period. Since the granting of the delegation that they are 23116 subordinate to assures that there is no conflict with locks to be 23117 reclaimed by other clients, the server need not return 23118 NFS4ERR_GRACE when these are received during the grace period. 23120 For any OPEN request, the server may return an OPEN delegation, which 23121 allows further opens and closes to be handled locally on the client 23122 as described in Section 10.4. Note that delegation is up to the 23123 server to decide. 
The client should never assume that delegation 23124 will or will not be granted in a particular instance. It should 23125 always be prepared for either case. A partial exception is the 23126 reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed. 23127 In this case, delegation will always be granted, although the server 23128 may specify an immediate recall in the delegation structure. 23130 The rflags returned by a successful OPEN allow the server to return 23131 information governing how the open file is to be handled. 23133 * OPEN4_RESULT_CONFIRM is deprecated and MUST NOT be returned by an 23134 NFSv4.1 server. 23136 * OPEN4_RESULT_LOCKTYPE_POSIX indicates that the server's byte-range 23137 locking behavior supports the complete set of POSIX locking 23138 techniques [21]. From this, the client can choose to manage byte- 23139 range locking state in a way to handle a mismatch of byte-range 23140 locking management. 23142 * OPEN4_RESULT_PRESERVE_UNLINKED indicates that the server will 23143 preserve the open file if the client (or any other client) removes 23144 the file as long as it is open. Furthermore, the server promises 23145 to preserve the file through the grace period after server 23146 restart, thereby giving the client the opportunity to reclaim its 23147 open. 23149 * OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt 23150 CB_NOTIFY_LOCK callbacks for locks on this file. This flag is a 23151 hint only, and may be safely ignored by the client. 23153 If the component is of zero length, NFS4ERR_INVAL will be returned. 23154 The component is also subject to the normal UTF-8, character support, 23155 and name checks. See Section 14.5 for further discussion. 23157 When an OPEN is done and the specified open-owner already has the 23158 resulting filehandle open, the result is to "OR" together the new 23159 share and deny status together with the existing status. In this 23160 case, only a single CLOSE need be done, even though multiple OPENs 23161 were completed. When such an OPEN is done, checking of share 23162 reservations for the new OPEN proceeds normally, with no exception 23163 for the existing OPEN held by the same open-owner. In this case, the 23164 stateid returned as an "other" field that matches that of the 23165 previous open while the "seqid" field is incremented to reflect the 23166 change status due to the new open. 23168 If the underlying file system at the server is only accessible in a 23169 read-only mode and the OPEN request has specified ACCESS_WRITE or 23170 ACCESS_BOTH, the server will return NFS4ERR_ROFS to indicate a read- 23171 only file system. 23173 As with the CREATE operation, the server MUST derive the owner, owner 23174 ACE, group, or group ACE if any of the four attributes are required 23175 and supported by the server's file system. For an OPEN with the 23176 EXCLUSIVE4 createmode, the server has no choice, since such OPEN 23177 calls do not include the createattrs field. Conversely, if 23178 createattrs (UNCHECKED4 or GUARDED4) or cva_attrs (EXCLUSIVE4_1) is 23179 specified, and includes an owner, owner_group, or ACE that the 23180 principal in the RPC call's credentials does not have authorization 23181 to create files for, then the server may return NFS4ERR_PERM. 23183 In the case of an OPEN that specifies a size of zero (e.g., 23184 truncation) and the file has named attributes, the named attributes 23185 are left as is and are not removed. 
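The combining of share and deny bits for repeated OPENs by the same open-owner, described earlier in this section, is illustrated by the following non-normative server-side sketch. The structure is a simplified stand-in for whatever state a server actually keeps, and the names are illustrative.

   #include <stdint.h>

   /* Simplified stand-in for per-(open-owner, file) open state. */
   struct open_state {
       uint32_t share_access;    /* union of access bits granted so far */
       uint32_t share_deny;      /* union of deny bits granted so far */
       uint32_t stateid_seqid;   /* "seqid" portion of the open stateid */
   };

   /*
    * Merge a new OPEN by an open-owner that already has the file open:
    * the access and deny bits are OR'ed into the existing state, and
    * the stateid "seqid" is incremented while its "other" field (not
    * shown) is left unchanged.
    */
   static void
   merge_duplicate_open(struct open_state *os,
                        uint32_t new_access, uint32_t new_deny)
   {
       os->share_access |= new_access;
       os->share_deny   |= new_deny;
       os->stateid_seqid++;
   }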
23187 NFSv4.1 gives more precise control to clients over acquisition of 23188 delegations via the following new flags for the share_access field of 23189 OPEN4args: 23191 OPEN4_SHARE_ACCESS_WANT_READ_DELEG 23193 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 23195 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 23197 OPEN4_SHARE_ACCESS_WANT_NO_DELEG 23199 OPEN4_SHARE_ACCESS_WANT_CANCEL 23200 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 23202 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 23204 If (share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is not zero, 23205 then the client will have specified one and only one of: 23207 OPEN4_SHARE_ACCESS_WANT_READ_DELEG 23209 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 23211 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 23213 OPEN4_SHARE_ACCESS_WANT_NO_DELEG 23215 OPEN4_SHARE_ACCESS_WANT_CANCEL 23217 Otherwise, the client is neither indicating a desire nor a non-desire 23218 for a delegation, and the server MAY or MAY not return a delegation 23219 in the OPEN response. 23221 If the server supports the new _WANT_ flags and the client sends one 23222 or more of the new flags, then in the event the server does not 23223 return a delegation, it MUST return a delegation type of 23224 OPEN_DELEGATE_NONE_EXT. The field ond_why in the reply indicates why 23225 no delegation was returned and will be one of: 23227 WND4_NOT_WANTED 23228 The client specified OPEN4_SHARE_ACCESS_WANT_NO_DELEG. 23230 WND4_CONTENTION 23231 There is a conflicting delegation or open on the file. 23233 WND4_RESOURCE 23234 Resource limitations prevent the server from granting a 23235 delegation. 23237 WND4_NOT_SUPP_FTYPE 23238 The server does not support delegations on this file type. 23240 WND4_WRITE_DELEG_NOT_SUPP_FTYPE 23241 The server does not support OPEN_DELEGATE_WRITE delegations on 23242 this file type. 23244 WND4_NOT_SUPP_UPGRADE 23245 The server does not support atomic upgrade of an 23246 OPEN_DELEGATE_READ delegation to an OPEN_DELEGATE_WRITE 23247 delegation. 23249 WND4_NOT_SUPP_DOWNGRADE 23250 The server does not support atomic downgrade of an 23251 OPEN_DELEGATE_WRITE delegation to an OPEN_DELEGATE_READ 23252 delegation. 23254 WND4_CANCELLED 23255 The client specified OPEN4_SHARE_ACCESS_WANT_CANCEL and now any 23256 "want" for this file object is cancelled. 23258 WND4_IS_DIR 23259 The specified file object is a directory, and the operation is 23260 OPEN or WANT_DELEGATION, which do not support delegations on 23261 directories. 23263 OPEN4_SHARE_ACCESS_WANT_READ_DELEG, 23264 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG, or 23265 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG mean, respectively, that the client wants 23266 an OPEN_DELEGATE_READ, OPEN_DELEGATE_WRITE, or any delegation 23267 regardless of which of OPEN4_SHARE_ACCESS_READ, 23268 OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH is set. If the 23269 client has an OPEN_DELEGATE_READ delegation on a file and requests an 23270 OPEN_DELEGATE_WRITE delegation, then the client is requesting atomic 23271 upgrade of its OPEN_DELEGATE_READ delegation to an 23272 OPEN_DELEGATE_WRITE delegation. If the client has an 23273 OPEN_DELEGATE_WRITE delegation on a file and requests an 23274 OPEN_DELEGATE_READ delegation, then the client is requesting atomic 23275 downgrade to an OPEN_DELEGATE_READ delegation. A server MAY support 23276 atomic upgrade or downgrade. If it does, then a returned 23277 delegation_type of OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE that is 23278 different from the delegation type the client currently has 23279 indicates successful upgrade or downgrade. If the server does not 23280 support atomic delegation upgrade or downgrade, then ond_why will be 23281 set to WND4_NOT_SUPP_UPGRADE or WND4_NOT_SUPP_DOWNGRADE.
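As a non-normative illustration of how these flags are used, a client that opens a file for read and would like an OPEN_DELEGATE_READ delegation simply ORs the corresponding constants from the ARGUMENTS XDR into share_access. The helper names below are illustrative only.

   #include <stdint.h>

   /* Constants from the ARGUMENTS XDR above. */
   #define OPEN4_SHARE_ACCESS_READ            0x00000001u
   #define OPEN4_SHARE_ACCESS_WANT_DELEG_MASK 0x0000FF00u
   #define OPEN4_SHARE_ACCESS_WANT_READ_DELEG 0x00000100u

   /*
    * Compose a share_access value that opens the file for read and
    * expresses a preference for an OPEN_DELEGATE_READ delegation.
    * Only one of the WANT_* values covered by the mask may be encoded
    * at a time.
    */
   static uint32_t
   share_access_read_wanting_deleg(void)
   {
       return OPEN4_SHARE_ACCESS_READ | OPEN4_SHARE_ACCESS_WANT_READ_DELEG;
   }

   /* Nonzero if the request expresses any delegation preference. */
   static int
   has_deleg_want(uint32_t share_access)
   {
       return (share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) != 0;
   }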
23283 OPEN4_SHARE_ACCESS_WANT_NO_DELEG means that the client wants no 23284 delegation. 23286 OPEN4_SHARE_ACCESS_WANT_CANCEL means that the client wants no 23287 delegation and wants to cancel any previously registered "want" for a 23288 delegation. 23290 The client may set one or both of 23291 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and 23292 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED. However, they 23293 will have no effect unless one of the following is set: 23295 * OPEN4_SHARE_ACCESS_WANT_READ_DELEG 23296 * OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 23298 * OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 23300 If the client specifies 23301 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL, then it wishes 23302 to register a "want" for a delegation, in the event the OPEN results 23303 do not include a delegation. If so and the server denies the 23304 delegation due to insufficient resources, the server MAY later inform 23305 the client, via the CB_RECALLABLE_OBJ_AVAIL operation, that the 23306 resource limitation condition has eased. The server will tell the 23307 client that it intends to send a future CB_RECALLABLE_OBJ_AVAIL 23308 operation by setting delegation_type in the results to 23309 OPEN_DELEGATE_NONE_EXT, ond_why to WND4_RESOURCE, and 23310 ond_server_will_signal_avail set to TRUE. If 23311 ond_server_will_signal_avail is set to TRUE, the server MUST later 23312 send a CB_RECALLABLE_OBJ_AVAIL operation. 23314 If the client specifies 23315 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED, then it wishes 23316 to register a "want" for a delegation, in the event the OPEN results 23317 do not include a delegation. If so and the server denies the 23318 delegation due to contention, the server MAY later inform the client, 23319 via the CB_PUSH_DELEG operation, that the contention condition has 23320 eased. The server will tell the client that it intends to send a 23321 future CB_PUSH_DELEG operation by setting delegation_type in the 23322 results to OPEN_DELEGATE_NONE_EXT, ond_why to WND4_CONTENTION, and 23323 ond_server_will_push_deleg to TRUE. If ond_server_will_push_deleg is 23324 TRUE, the server MUST later send a CB_PUSH_DELEG operation. 23326 If the client has previously registered a want for a delegation on a 23327 file, and then sends a request to register a want for a delegation on 23328 the same file, the server MUST return the error 23329 NFS4ERR_DELEG_ALREADY_WANTED (new to NFSv4.1). If the client wishes to register a 23330 different type of delegation want for the same file, it MUST cancel 23331 the existing delegation WANT. 23333 18.16.4. IMPLEMENTATION 23335 In the absence of a persistent session, the client invokes exclusive 23336 create by setting the how parameter to EXCLUSIVE4 or EXCLUSIVE4_1. 23337 In these cases, the client provides a verifier that can reasonably be 23338 expected to be unique. A combination of a client identifier, perhaps 23339 the client network address, and a unique number generated by the 23340 client, perhaps the RPC transaction identifier, may be appropriate. 23342 If the object does not exist, the server creates the object and 23343 stores the verifier in stable storage. For file systems that do not 23344 provide a mechanism for the storage of arbitrary file attributes, the 23345 server may use one or more elements of the object's metadata to store 23346 the verifier.
The verifier MUST be stored in stable storage to 23347 prevent erroneous failure on retransmission of the request. It is 23348 assumed that an exclusive create is being performed because exclusive 23349 semantics are critical to the application. Because of the expected 23350 usage, exclusive CREATE does not rely solely on the server's reply 23351 cache for storage of the verifier. A nonpersistent reply cache does 23352 not survive a crash and the session and reply cache may be deleted 23353 after a network partition that exceeds the lease time, thus opening 23354 failure windows. 23356 An NFSv4.1 server SHOULD NOT store the verifier in any of the file's 23357 RECOMMENDED or REQUIRED attributes. If it does, the server SHOULD 23358 use time_modify_set or time_access_set to store the verifier. The 23359 server SHOULD NOT store the verifier in the following attributes: 23361 acl (it is desirable for access control to be established at 23362 creation), 23364 dacl (ditto), 23366 mode (ditto), 23368 owner (ditto), 23370 owner_group (ditto), 23372 retentevt_set (it may be desired to establish retention at 23373 creation) 23375 retention_hold (ditto), 23377 retention_set (ditto), 23379 sacl (it is desirable for auditing control to be established at 23380 creation), 23382 size (on some servers, size may have a limited range of values), 23384 mode_set_masked (as with mode), 23386 and 23388 time_creation (a meaningful file creation should be set when the 23389 file is created). 23391 Another alternative for the server is to use a named attribute to 23392 store the verifier. 23394 Because the EXCLUSIVE4 create method does not specify initial 23395 attributes when processing an EXCLUSIVE4 create, the server 23397 * SHOULD set the owner of the file to that corresponding to the 23398 credential of request's RPC header. 23400 * SHOULD NOT leave the file's access control to anyone but the owner 23401 of the file. 23403 If the server cannot support exclusive create semantics, possibly 23404 because of the requirement to commit the verifier to stable storage, 23405 it should fail the OPEN request with the error NFS4ERR_NOTSUPP. 23407 During an exclusive CREATE request, if the object already exists, the 23408 server reconstructs the object's verifier and compares it with the 23409 verifier in the request. If they match, the server treats the 23410 request as a success. The request is presumed to be a duplicate of 23411 an earlier, successful request for which the reply was lost and that 23412 the server duplicate request cache mechanism did not detect. If the 23413 verifiers do not match, the request is rejected with the status 23414 NFS4ERR_EXIST. 23416 After the client has performed a successful exclusive create, the 23417 attrset response indicates which attributes were used to store the 23418 verifier. If EXCLUSIVE4 was used, the attributes set in attrset were 23419 used for the verifier. If EXCLUSIVE4_1 was used, the client 23420 determines the attributes used for the verifier by comparing attrset 23421 with cva_attrs.attrmask; any bits set in the former but not the 23422 latter identify the attributes used to store the verifier. The 23423 client MUST immediately send a SETATTR to set attributes used to 23424 store the verifier. Until it does so, the attributes used to store 23425 the verifier cannot be relied upon. The subsequent SETATTR MUST NOT 23426 occur in the same COMPOUND request as the OPEN. 
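The comparison of attrset with cva_attrs.attrmask described above can be sketched as follows. A single 64-bit word stands in for the protocol's bitmap4 type, so this is an illustration rather than a complete bitmap implementation.

   #include <stdint.h>

   /*
    * With EXCLUSIVE4_1, any attribute bit set in the returned attrset
    * but not requested in cva_attrs.attrmask was used by the server to
    * store the verifier and must be reset by an immediate, separate
    * SETATTR.  A real client applies the same operation to each word
    * of the bitmap.
    */
   static uint64_t
   verifier_attrs(uint64_t attrset_word, uint64_t cva_attrmask_word)
   {
       return attrset_word & ~cva_attrmask_word;
   }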
23428 Unless a persistent session is used, use of the GUARDED4 attribute 23429 does not provide exactly once semantics. In particular, if a reply 23430 is lost and the server does not detect the retransmission of the 23431 request, the operation can fail with NFS4ERR_EXIST, even though the 23432 create was performed successfully. The client would use this 23433 behavior in the case that the application has not requested an 23434 exclusive create but has asked to have the file truncated when the 23435 file is opened. In the case of the client timing out and 23436 retransmitting the create request, the client can use GUARDED4 to 23437 prevent against a sequence like create, write, create (retransmitted) 23438 from occurring. 23440 For SHARE reservations, the value of the expression (share_access & 23441 ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) MUST be one of 23442 OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or 23443 OPEN4_SHARE_ACCESS_BOTH. If not, the server MUST return 23444 NFS4ERR_INVAL. The value of share_deny MUST be one of 23445 OPEN4_SHARE_DENY_NONE, OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, 23446 or OPEN4_SHARE_DENY_BOTH. If not, the server MUST return 23447 NFS4ERR_INVAL. 23449 Based on the share_access value (OPEN4_SHARE_ACCESS_READ, 23450 OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH), the client 23451 should check that the requester has the proper access rights to 23452 perform the specified operation. This would generally be the results 23453 of applying the ACL access rules to the file for the current 23454 requester. However, just as with the ACCESS operation, the client 23455 should not attempt to second-guess the server's decisions, as access 23456 rights may change and may be subject to server administrative 23457 controls outside the ACL framework. If the requester's READ or WRITE 23458 operation is not authorized (depending on the share_access value), 23459 the server MUST return NFS4ERR_ACCESS. 23461 Note that if the client ID was not created with the 23462 EXCHGID4_FLAG_BIND_PRINC_STATEID capability set in the reply to 23463 EXCHANGE_ID, then the server MUST NOT impose any requirement that 23464 READs and WRITEs sent for an open file have the same credentials as 23465 the OPEN itself, and the server is REQUIRED to perform access 23466 checking on the READs and WRITEs themselves. Otherwise, if the reply 23467 to EXCHANGE_ID did have EXCHGID4_FLAG_BIND_PRINC_STATEID set, then 23468 with one exception, the credentials used in the OPEN request MUST 23469 match those used in the READs and WRITEs, and the stateids in the 23470 READs and WRITEs MUST match, or be derived from the stateid from the 23471 reply to OPEN. The exception is if SP4_SSV or SP4_MACH_CRED state 23472 protection is used, and the spo_must_allow result of EXCHANGE_ID 23473 includes the READ and/or WRITE operations. In that case, the machine 23474 or SSV credential will be allowed to send READ and/or WRITE. See 23475 Section 18.35. 23477 If the component provided to OPEN is a symbolic link, the error 23478 NFS4ERR_SYMLINK will be returned to the client, while if it is a 23479 directory the error NFS4ERR_ISDIR will be returned. If the component 23480 is neither of those but not an ordinary file, the error 23481 NFS4ERR_WRONG_TYPE is returned. If the current filehandle is not a 23482 directory, the error NFS4ERR_NOTDIR will be returned. 
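The share reservation checks stated earlier in this subsection correspond to the following non-normative server-side sketch, in which a false result corresponds to returning NFS4ERR_INVAL.

   #include <stdbool.h>
   #include <stdint.h>

   /* Constants from the ARGUMENTS XDR above. */
   #define OPEN4_SHARE_ACCESS_READ            0x00000001u
   #define OPEN4_SHARE_ACCESS_WRITE           0x00000002u
   #define OPEN4_SHARE_ACCESS_BOTH            0x00000003u
   #define OPEN4_SHARE_ACCESS_WANT_DELEG_MASK 0x0000FF00u

   #define OPEN4_SHARE_DENY_NONE              0x00000000u
   #define OPEN4_SHARE_DENY_READ              0x00000001u
   #define OPEN4_SHARE_DENY_WRITE             0x00000002u
   #define OPEN4_SHARE_DENY_BOTH              0x00000003u

   /*
    * Validate the share reservation fields of an OPEN request as
    * described above; a false result corresponds to the server
    * returning NFS4ERR_INVAL.
    */
   static bool
   share_fields_valid(uint32_t share_access, uint32_t share_deny)
   {
       uint32_t access = share_access & ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK;

       if (access != OPEN4_SHARE_ACCESS_READ &&
           access != OPEN4_SHARE_ACCESS_WRITE &&
           access != OPEN4_SHARE_ACCESS_BOTH)
           return false;

       return share_deny == OPEN4_SHARE_DENY_NONE ||
              share_deny == OPEN4_SHARE_DENY_READ ||
              share_deny == OPEN4_SHARE_DENY_WRITE ||
              share_deny == OPEN4_SHARE_DENY_BOTH;
   }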
23484 The use of the OPEN4_RESULT_PRESERVE_UNLINKED result flag allows a 23485 client to avoid the common implementation practice of renaming an 23486 open file to ".nfs" after it removes the file. After 23487 the server returns OPEN4_RESULT_PRESERVE_UNLINKED, if a client sends 23488 a REMOVE operation that would reduce the file's link count to zero, 23489 the server SHOULD report a value of zero for the numlinks attribute 23490 on the file. 23492 If another client has a delegation of the file being opened that 23493 conflicts with open being done (sometimes depending on the 23494 share_access or share_deny value specified), the delegation(s) MUST 23495 be recalled, and the operation cannot proceed until each such 23496 delegation is returned or revoked. Except where this happens very 23497 quickly, one or more NFS4ERR_DELAY errors will be returned to 23498 requests made while delegation remains outstanding. In the case of 23499 an OPEN_DELEGATE_WRITE delegation, any open by a different client 23500 will conflict, while for an OPEN_DELEGATE_READ delegation, only opens 23501 with one of the following characteristics will be considered 23502 conflicting: 23504 * The value of share_access includes the bit 23505 OPEN4_SHARE_ACCESS_WRITE. 23507 * The value of share_deny specifies OPEN4_SHARE_DENY_READ or 23508 OPEN4_SHARE_DENY_BOTH. 23510 * OPEN4_CREATE is specified together with UNCHECKED4, the size 23511 attribute is specified as zero (for truncation), and an existing 23512 file is truncated. 23514 If OPEN4_CREATE is specified and the file does not exist and the 23515 current filehandle designates a directory for which another client 23516 holds a directory delegation, then, unless the delegation is such 23517 that the situation can be resolved by sending a notification, the 23518 delegation MUST be recalled, and the operation cannot proceed until 23519 the delegation is returned or revoked. Except where this happens 23520 very quickly, one or more NFS4ERR_DELAY errors will be returned to 23521 requests made while delegation remains outstanding. 23523 If OPEN4_CREATE is specified and the file does not exist and the 23524 current filehandle designates a directory for which one or more 23525 directory delegations exist, then, when those delegations request 23526 such notifications, NOTIFY4_ADD_ENTRY will be generated as a result 23527 of this operation. 23529 18.16.4.1. Warning to Client Implementors 23531 OPEN resembles LOOKUP in that it generates a filehandle for the 23532 client to use. Unlike LOOKUP though, OPEN creates server state on 23533 the filehandle. In normal circumstances, the client can only release 23534 this state with a CLOSE operation. CLOSE uses the current filehandle 23535 to determine which file to close. Therefore, the client MUST follow 23536 every OPEN operation with a GETFH operation in the same COMPOUND 23537 procedure. This will supply the client with the filehandle such that 23538 CLOSE can be used appropriately. 23540 Simply waiting for the lease on the file to expire is insufficient 23541 because the server may maintain the state indefinitely as long as 23542 another client does not attempt to make a conflicting access to the 23543 same file. 23545 See also Section 2.10.6.4. 23547 18.17. Operation 19: OPENATTR - Open Named Attribute Directory 23549 18.17.1. ARGUMENTS 23551 struct OPENATTR4args { 23552 /* CURRENT_FH: object */ 23553 bool createdir; 23554 }; 23556 18.17.2. 
RESULTS 23558 struct OPENATTR4res { 23559 /* 23560 * If status is NFS4_OK, 23561 * new CURRENT_FH: named attribute 23562 * directory 23563 */ 23564 nfsstat4 status; 23565 }; 23567 18.17.3. DESCRIPTION 23569 The OPENATTR operation is used to obtain the filehandle of the named 23570 attribute directory associated with the current filehandle. The 23571 result of the OPENATTR will be a filehandle to an object of type 23572 NF4ATTRDIR. From this filehandle, READDIR and LOOKUP operations can 23573 be used to obtain filehandles for the various named attributes 23574 associated with the original file system object. Filehandles 23575 returned within the named attribute directory will designate objects 23576 of type of NF4NAMEDATTR. 23578 The createdir argument allows the client to signify if a named 23579 attribute directory should be created as a result of the OPENATTR 23580 operation. Some clients may use the OPENATTR operation with a value 23581 of FALSE for createdir to determine if any named attributes exist for 23582 the object. If none exist, then NFS4ERR_NOENT will be returned. If 23583 createdir has a value of TRUE and no named attribute directory 23584 exists, one is created and its filehandle becomes the current 23585 filehandle. On the other hand, if createdir has a value of TRUE and 23586 the named attribute directory already exists, no error results and 23587 the filehandle of the existing directory becomes the current 23588 filehandle. The creation of a named attribute directory assumes that 23589 the server has implemented named attribute support in this fashion 23590 and is not required to do so by this definition. 23592 If the current filehandle designates an object of type NF4NAMEDATTR 23593 (a named attribute) or NF4ATTRDIR (a named attribute directory), an 23594 error of NFS4ERR_WRONG_TYPE is returned to the client. Named 23595 attributes or a named attribute directory MUST NOT have their own 23596 named attributes. 23598 18.17.4. IMPLEMENTATION 23600 If the server does not support named attributes for the current 23601 filehandle, an error of NFS4ERR_NOTSUPP will be returned to the 23602 client. 23604 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access 23606 18.18.1. ARGUMENTS 23608 struct OPEN_DOWNGRADE4args { 23609 /* CURRENT_FH: opened file */ 23610 stateid4 open_stateid; 23611 seqid4 seqid; 23612 uint32_t share_access; 23613 uint32_t share_deny; 23614 }; 23616 18.18.2. RESULTS 23617 struct OPEN_DOWNGRADE4resok { 23618 stateid4 open_stateid; 23619 }; 23621 union OPEN_DOWNGRADE4res switch(nfsstat4 status) { 23622 case NFS4_OK: 23623 OPEN_DOWNGRADE4resok resok4; 23624 default: 23625 void; 23626 }; 23628 18.18.3. DESCRIPTION 23630 This operation is used to adjust the access and deny states for a 23631 given open. This is necessary when a given open-owner opens the same 23632 file multiple times with different access and deny values. In this 23633 situation, a close of one of the opens may change the appropriate 23634 share_access and share_deny flags to remove bits associated with 23635 opens no longer in effect. 23637 Valid values for the expression (share_access & 23638 ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) are OPEN4_SHARE_ACCESS_READ, 23639 OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH. If the client 23640 specifies other values, the server MUST reply with NFS4ERR_INVAL. 23642 Valid values for the share_deny field are OPEN4_SHARE_DENY_NONE, 23643 OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or 23644 OPEN4_SHARE_DENY_BOTH. 
If the client specifies other values, the 23645 server MUST reply with NFS4ERR_INVAL. 23647 After checking for valid values of share_access and share_deny, the 23648 server replaces the current access and deny modes on the file with 23649 share_access and share_deny subject to the following constraints: 23651 * The bits in share_access SHOULD equal the union of the 23652 share_access bits (not including OPEN4_SHARE_WANT_* bits) 23653 specified for some subset of the OPENs in effect for the current 23654 open-owner on the current file. 23656 * The bits in share_deny SHOULD equal the union of the share_deny 23657 bits specified for some subset of the OPENs in effect for the 23658 current open-owner on the current file. 23660 If the above constraints are not respected, the server SHOULD return 23661 the error NFS4ERR_INVAL. Since share_access and share_deny bits 23662 should be subsets of those already granted, short of a defect in the 23663 client or server implementation, it is not possible for the 23664 OPEN_DOWNGRADE request to be denied because of conflicting share 23665 reservations. 23667 The seqid argument is not used in NFSv4.1, MAY be any value, and MUST 23668 be ignored by the server. 23670 On success, the current filehandle retains its value. 23672 18.18.4. IMPLEMENTATION 23674 An OPEN_DOWNGRADE operation may make OPEN_DELEGATE_READ delegations 23675 grantable where they were not previously. Servers may choose to 23676 respond immediately if there are pending delegation want requests or 23677 may respond to the situation at a later time. 23679 18.19. Operation 22: PUTFH - Set Current Filehandle 23681 18.19.1. ARGUMENTS 23683 struct PUTFH4args { 23684 nfs_fh4 object; 23685 }; 23687 18.19.2. RESULTS 23689 struct PUTFH4res { 23690 /* 23691 * If status is NFS4_OK, 23692 * new CURRENT_FH: argument to PUTFH 23693 */ 23694 nfsstat4 status; 23695 }; 23697 18.19.3. DESCRIPTION 23699 This operation replaces the current filehandle with the filehandle 23700 provided as an argument. It clears the current stateid. 23702 If the security mechanism used by the requester does not meet the 23703 requirements of the filehandle provided to this operation, the server 23704 MUST return NFS4ERR_WRONGSEC. 23706 See Section 16.2.3.1.1 for more details on the current filehandle. 23708 See Section 16.2.3.1.2 for more details on the current stateid. 23710 18.19.4. IMPLEMENTATION 23712 This operation is used in an NFS request to set the context for file 23713 accessing operations that follow in the same COMPOUND request. 23715 18.20. Operation 23: PUTPUBFH - Set Public Filehandle 23717 18.20.1. ARGUMENT 23719 void; 23721 18.20.2. RESULT 23723 struct PUTPUBFH4res { 23724 /* 23725 * If status is NFS4_OK, 23726 * new CURRENT_FH: public fh 23727 */ 23728 nfsstat4 status; 23729 }; 23731 18.20.3. DESCRIPTION 23733 This operation replaces the current filehandle with the filehandle 23734 that represents the public filehandle of the server's namespace. 23735 This filehandle may be different from the "root" filehandle that may 23736 be associated with some other directory on the server. 23738 PUTPUBFH also clears the current stateid. 23740 The public filehandle represents the concepts embodied in RFC 2054 23741 [49], RFC 2055 [50], and RFC 2224 [61]. The intent for NFSv4.1 is 23742 that the public filehandle (represented by the PUTPUBFH operation) be 23743 used as a method of providing WebNFS server compatibility with NFSv3. 
23745 The public filehandle and the root filehandle (represented by the 23746 PUTROOTFH operation) SHOULD be equivalent. If the public and root 23747 filehandles are not equivalent, then the directory corresponding to 23748 the public filehandle MUST be a descendant of the directory 23749 corresponding to the root filehandle. 23751 See Section 16.2.3.1.1 for more details on the current filehandle. 23753 See Section 16.2.3.1.2 for more details on the current stateid. 23755 18.20.4. IMPLEMENTATION 23757 This operation is used in an NFS request to set the context for file 23758 accessing operations that follow in the same COMPOUND request. 23760 With the NFSv3 public filehandle, the client is able to specify 23761 whether the pathname provided in the LOOKUP should be evaluated as 23762 either an absolute path relative to the server's root or relative to 23763 the public filehandle. RFC 2224 [61] contains further discussion of 23764 the functionality. With NFSv4.1, that type of specification is not 23765 directly available in the LOOKUP operation. The reason for this is 23766 because the component separators needed to specify absolute vs. 23767 relative are not allowed in NFSv4. Therefore, the client is 23768 responsible for constructing its request such that the use of either 23769 PUTROOTFH or PUTPUBFH signifies absolute or relative evaluation of an 23770 NFS URL, respectively. 23772 Note that there are warnings mentioned in RFC 2224 [61] with respect 23773 to the use of absolute evaluation and the restrictions the server may 23774 place on that evaluation with respect to how much of its namespace 23775 has been made available. These same warnings apply to NFSv4.1. It 23776 is likely, therefore, that because of server implementation details, 23777 an NFSv3 absolute public filehandle look up may behave differently 23778 than an NFSv4.1 absolute resolution. 23780 There is a form of security negotiation as described in RFC 2755 [62] 23781 that uses the public filehandle and an overloading of the pathname. 23782 This method is not available with NFSv4.1 as filehandles are not 23783 overloaded with special meaning and therefore do not provide the same 23784 framework as NFSv3. Clients should therefore use the security 23785 negotiation mechanisms described in Section 2.6. 23787 18.21. Operation 24: PUTROOTFH - Set Root Filehandle 23789 18.21.1. ARGUMENTS 23791 void; 23793 18.21.2. RESULTS 23795 struct PUTROOTFH4res { 23796 /* 23797 * If status is NFS4_OK, 23798 * new CURRENT_FH: root fh 23799 */ 23800 nfsstat4 status; 23801 }; 23803 18.21.3. DESCRIPTION 23805 This operation replaces the current filehandle with the filehandle 23806 that represents the root of the server's namespace. From this 23807 filehandle, a LOOKUP operation can locate any other filehandle on the 23808 server. This filehandle may be different from the "public" 23809 filehandle that may be associated with some other directory on the 23810 server. 23812 PUTROOTFH also clears the current stateid. 23814 See Section 16.2.3.1.1 for more details on the current filehandle. 23816 See Section 16.2.3.1.2 for more details on the current stateid. 23818 18.21.4. IMPLEMENTATION 23820 This operation is used in an NFS request to set the context for file 23821 accessing operations that follow in the same COMPOUND request. 23823 18.22. Operation 25: READ - Read from File 23825 18.22.1. ARGUMENTS 23827 struct READ4args { 23828 /* CURRENT_FH: file */ 23829 stateid4 stateid; 23830 offset4 offset; 23831 count4 count; 23832 }; 23834 18.22.2. 
RESULTS 23836 struct READ4resok { 23837 bool eof; 23838 opaque data<>; 23839 }; 23841 union READ4res switch (nfsstat4 status) { 23842 case NFS4_OK: 23843 READ4resok resok4; 23844 default: 23845 void; 23846 }; 23848 18.22.3. DESCRIPTION 23850 The READ operation reads data from the regular file identified by the 23851 current filehandle. 23853 The client provides an offset of where the READ is to start and a 23854 count of how many bytes are to be read. An offset of zero means to 23855 read data starting at the beginning of the file. If offset is 23856 greater than or equal to the size of the file, the status NFS4_OK is 23857 returned with a data length set to zero and eof is set to TRUE. The 23858 READ is subject to access permissions checking. 23860 If the client specifies a count value of zero, the READ succeeds and 23861 returns zero bytes of data, again subject to access permissions 23862 checking. The server may choose to return fewer bytes than specified 23863 by the client. The client needs to check for this condition and 23864 handle the condition appropriately. 23866 Except when special stateids are used, the stateid value for a READ 23867 request represents a value returned from a previous byte-range lock 23868 or share reservation request or the stateid associated with a 23869 delegation. The stateid identifies the associated owners if any and 23870 is used by the server to verify that the associated locks are still 23871 valid (e.g., have not been revoked). 23873 If the read ended at the end-of-file (formally, in a correctly formed 23874 READ operation, if offset + count is equal to the size of the file), 23875 or the READ operation extends beyond the size of the file (if offset 23876 + count is greater than the size of the file), eof is returned as 23877 TRUE; otherwise, it is FALSE. A successful READ of an empty file 23878 will always return eof as TRUE. 23880 If the current filehandle is not an ordinary file, an error will be 23881 returned to the client. In the case that the current filehandle 23882 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 23883 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 23884 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 23886 For a READ with a stateid value of all bits equal to zero, the server 23887 MAY allow the READ to be serviced subject to mandatory byte-range 23888 locks or the current share deny modes for the file. For a READ with 23889 a stateid value of all bits equal to one, the server MAY allow READ 23890 operations to bypass locking checks at the server. 23892 On success, the current filehandle retains its value. 23894 18.22.4. IMPLEMENTATION 23896 If the server returns a "short read" (i.e., less data than requested 23897 and eof is set to FALSE), the client should send another READ to get 23898 the remaining data. A server may return less data than requested 23899 under several circumstances. The file may have been truncated by 23900 another client or perhaps on the server itself, changing the file 23901 size from what the requesting client believes to be the case. This 23902 would reduce the actual amount of data available to the client. It 23903 is possible that the server may reduce the transfer size and so return a 23904 short read result. Server resource exhaustion may also result in a 23905 short read.
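The retry obligation described above is easy to get wrong in client code.
The following non-normative C sketch (an editorial illustration, not part
of the protocol definition) shows a read loop over a file on an
already-mounted NFSv4.1 file system using POSIX pread(); only the handling
of short reads and of end-of-file is intended to be meaningful here.

   /* Loop until 'count' bytes are read or EOF/error is reached,
    * mirroring the client obligation to retry after a short READ. */
   #include <errno.h>
   #include <sys/types.h>
   #include <unistd.h>

   ssize_t read_full(int fd, void *buf, size_t count, off_t offset)
   {
       size_t done = 0;

       while (done < count) {
           ssize_t n = pread(fd, (char *)buf + done, count - done,
                             offset + (off_t)done);
           if (n < 0) {
               if (errno == EINTR)
                   continue;       /* transient; retry the read */
               return -1;          /* hard error */
           }
           if (n == 0)
               break;              /* eof == TRUE: no more data */
           done += (size_t)n;      /* short read: issue another read */
       }
       return (ssize_t)done;
   }

A client operating at the protocol level would behave the same way,
issuing further READ operations with an advanced offset until eof is
returned as TRUE or the requested count has been satisfied.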
23907 If mandatory byte-range locking is in effect for the file, and if the 23908 byte-range corresponding to the data to be read from the file is 23909 WRITE_LT locked by an owner not associated with the stateid, the 23910 server will return the NFS4ERR_LOCKED error. The client should try 23911 to get the appropriate READ_LT via the LOCK operation before re- 23912 attempting the READ. When the READ completes, the client should 23913 release the byte-range lock via LOCKU. 23915 If another client has an OPEN_DELEGATE_WRITE delegation for the file 23916 being read, the delegation must be recalled, and the operation cannot 23917 proceed until that delegation is returned or revoked. Except where 23918 this happens very quickly, one or more NFS4ERR_DELAY errors will be 23919 returned to requests made while the delegation remains outstanding. 23920 Normally, delegations will not be recalled as a result of a READ 23921 operation since the recall will occur as a result of an earlier OPEN. 23922 However, since it is possible for a READ to be done with a special 23923 stateid, the server needs to check for this case even though the 23924 client should have done an OPEN previously. 23926 18.23. Operation 26: READDIR - Read Directory 23928 18.23.1. ARGUMENTS 23930 struct READDIR4args { 23931 /* CURRENT_FH: directory */ 23932 nfs_cookie4 cookie; 23933 verifier4 cookieverf; 23934 count4 dircount; 23935 count4 maxcount; 23936 bitmap4 attr_request; 23937 }; 23939 18.23.2. RESULTS 23940 struct entry4 { 23941 nfs_cookie4 cookie; 23942 component4 name; 23943 fattr4 attrs; 23944 entry4 *nextentry; 23945 }; 23947 struct dirlist4 { 23948 entry4 *entries; 23949 bool eof; 23950 }; 23952 struct READDIR4resok { 23953 verifier4 cookieverf; 23954 dirlist4 reply; 23955 }; 23957 union READDIR4res switch (nfsstat4 status) { 23958 case NFS4_OK: 23959 READDIR4resok resok4; 23960 default: 23961 void; 23962 }; 23964 18.23.3. DESCRIPTION 23966 The READDIR operation retrieves a variable number of entries from a 23967 file system directory and returns client-requested attributes for 23968 each entry along with information to allow the client to request 23969 additional directory entries in a subsequent READDIR. 23971 The arguments contain a cookie value that represents where the 23972 READDIR should start within the directory. A value of zero for the 23973 cookie is used to start reading at the beginning of the directory. 23974 For subsequent READDIR requests, the client specifies a cookie value 23975 that is provided by the server on a previous READDIR request. 23977 The request's cookieverf field should be set to 0 (zero) when the 23978 request's cookie field is zero (first read of the directory). On 23979 subsequent requests, the cookieverf field must match the cookieverf 23980 returned by the READDIR in which the cookie was acquired. If the 23981 server determines that the cookieverf is no longer valid for the 23982 directory, the error NFS4ERR_NOT_SAME must be returned. 23984 The dircount field of the request is a hint of the maximum number of 23985 bytes of directory information that should be returned. This value 23986 represents the total length of the names of the directory entries and 23987 the cookie value for these entries. This length represents the XDR 23988 encoding of the data (names and cookies) and not the length in the 23989 native format of the server.
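As a non-normative illustration of the dircount accounting just
described, the following C sketch estimates the XDR-level contribution
of a single entry's name and cookie: an 8-byte cookie, a 4-byte string
length, and the name itself padded to a 4-byte boundary. The helper
names are invented for this example and do not appear in the protocol.

   /* Approximate XDR size that one directory entry's name and cookie
    * contribute toward the dircount hint. */
   #include <stddef.h>
   #include <string.h>

   static size_t xdr_pad4(size_t len)
   {
       return (len + 3) & ~(size_t)3;   /* round up to a multiple of 4 */
   }

   size_t readdir_dircount_entry(const char *name)
   {
       return 8                          /* nfs_cookie4 */
            + 4                          /* XDR string length word */
            + xdr_pad4(strlen(name));    /* name, padded */
   }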
23991 The maxcount field of the request represents the maximum total size 23992 of all of the data being returned within the READDIR4resok structure 23993 and includes the XDR overhead. The server MAY return less data. If 23994 the server is unable to return a single directory entry within the 23995 maxcount limit, the error NFS4ERR_TOOSMALL MUST be returned to the 23996 client. 23998 Finally, the request's attr_request field represents the list of 23999 attributes to be returned for each directory entry supplied by the 24000 server. 24002 A successful reply consists of a list of directory entries. Each of 24003 these entries contains the name of the directory entry, a cookie 24004 value for that entry, and the associated attributes as requested. 24005 The "eof" flag has a value of TRUE if there are no more entries in 24006 the directory. 24008 The cookie value is only meaningful to the server and is used as a 24009 cursor for the directory entry. As mentioned, this cookie is used by 24010 the client for subsequent READDIR operations so that it may continue 24011 reading a directory. The cookie is similar in concept to a READ 24012 offset but MUST NOT be interpreted as such by the client. Ideally, 24013 the cookie value SHOULD NOT change if the directory is modified since 24014 the client may be caching these values. 24016 In some cases, the server may encounter an error while obtaining the 24017 attributes for a directory entry. Instead of returning an error for 24018 the entire READDIR operation, the server can instead return the 24019 attribute rdattr_error (Section 5.8.1.12). With this, the server is 24020 able to communicate the failure to the client and not fail the entire 24021 operation in the instance of what might be a transient failure. 24022 Obviously, the client must request the fattr4_rdattr_error attribute 24023 for this method to work properly. If the client does not request the 24024 attribute, the server has no choice but to return failure for the 24025 entire READDIR operation. 24027 For some file system environments, the directory entries "." and ".." 24028 have special meaning, and in other environments, they do not. If the 24029 server supports these special entries within a directory, they SHOULD 24030 NOT be returned to the client as part of the READDIR response. To 24031 enable some client environments, the cookie values of zero, 1, and 2 24032 are to be considered reserved. Note that the UNIX client will use 24033 these values when combining the server's response and local 24034 representations to enable a fully formed UNIX directory presentation 24035 to the application. 24037 For READDIR arguments, cookie values of one and two SHOULD NOT be 24038 used, and for READDIR results, cookie values of zero, one, and two 24039 SHOULD NOT be returned. 24041 On success, the current filehandle retains its value. 24043 18.23.4. IMPLEMENTATION 24045 The server's file system directory representations can differ 24046 greatly. A client's programming interfaces may also be bound to the 24047 local operating environment in a way that does not translate well 24048 into the NFS protocol. Therefore, the use of the dircount and 24049 maxcount fields are provided to enable the client to provide hints to 24050 the server. If the client is aggressive about attribute collection 24051 during a READDIR, the server has an idea of how to limit the encoded 24052 response. 24054 If dircount is zero, the server bounds the reply's size based on the 24055 request's maxcount field. 
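A client drives READDIR as a paging loop keyed on the cookie,
cookieverf, and eof values described above. The following C sketch is a
non-normative illustration; nfs41_readdir() and struct readdir_page are
hypothetical stand-ins for whatever COMPOUND machinery a client actually
uses, the verifier is modeled as a 64-bit integer for brevity, and only
the cookie/cookieverf/eof handling pattern is intended to carry over.

   #include <stdbool.h>
   #include <stdint.h>

   struct readdir_page {
       uint64_t last_cookie;   /* cookie of the final entry returned */
       uint64_t cookieverf;    /* verifier to echo on the next request */
       bool     eof;           /* TRUE when the directory is exhausted */
   };

   /* Hypothetical: issues PUTFH + READDIR and decodes one page. */
   int nfs41_readdir(uint64_t cookie, uint64_t cookieverf,
                     uint32_t dircount, uint32_t maxcount,
                     struct readdir_page *out);

   int list_directory(uint32_t dircount, uint32_t maxcount)
   {
       struct readdir_page page = { 0 };
       uint64_t cookie = 0;        /* start at the beginning */
       uint64_t cookieverf = 0;    /* zero verifier on the first request */

       do {
           int err = nfs41_readdir(cookie, cookieverf,
                                   dircount, maxcount, &page);
           if (err)
               return err;     /* on NFS4ERR_NOT_SAME a client would
                                  typically restart from cookie zero */
           cookie = page.last_cookie;
           cookieverf = page.cookieverf;
       } while (!page.eof);

       return 0;
   }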
24057 The cookieverf may be used by the server to help manage cookie values 24058 that may become stale. It should be a rare occurrence that a server 24059 is unable to continue properly reading a directory with the provided 24060 cookie/cookieverf pair. The server SHOULD make every effort to avoid 24061 this condition since the application at the client might be unable to 24062 properly handle this type of failure. 24064 The use of the cookieverf will also protect the client from using 24065 READDIR cookie values that might be stale. For example, if the file 24066 system has been migrated, the server might or might not be able to 24067 use the same cookie values to service READDIR as the previous server 24068 used. With the client providing the cookieverf, the server is able 24069 to provide the appropriate response to the client. This prevents the 24070 case where the server accepts a cookie value but the underlying 24071 directory has changed and the response is invalid from the client's 24072 context of its previous READDIR. 24074 Since some servers will not be returning "." and ".." entries as has 24075 been done with previous versions of the NFS protocol, the client that 24076 requires these entries be present in READDIR responses must fabricate 24077 them. 24079 18.24. Operation 27: READLINK - Read Symbolic Link 24081 18.24.1. ARGUMENTS 24083 /* CURRENT_FH: symlink */ 24084 void; 24086 18.24.2. RESULTS 24088 struct READLINK4resok { 24089 linktext4 link; 24090 }; 24092 union READLINK4res switch (nfsstat4 status) { 24093 case NFS4_OK: 24094 READLINK4resok resok4; 24095 default: 24096 void; 24097 }; 24099 18.24.3. DESCRIPTION 24101 READLINK reads the data associated with a symbolic link. Depending 24102 on the value of the UTF-8 capability attribute (Section 14.4), the 24103 data is encoded in UTF-8. Whether created by an NFS client or 24104 created locally on the server, the data in a symbolic link is not 24105 interpreted (except possibly to check for proper UTF-8 encoding) when 24106 created, but is simply stored. 24108 On success, the current filehandle retains its value. 24110 18.24.4. IMPLEMENTATION 24112 A symbolic link is nominally a pointer to another file. The data is 24113 not necessarily interpreted by the server, just stored in the file. 24114 It is possible for a client implementation to store a pathname that 24115 is not meaningful to the server operating system in a symbolic link. 24116 A READLINK operation returns the data to the client for 24117 interpretation. If different implementations want to share access to 24118 symbolic links, then they must agree on the interpretation of the 24119 data in the symbolic link. 24121 The READLINK operation is only allowed on objects of type NF4LNK. 24122 The server should return the error NFS4ERR_WRONG_TYPE if the object 24123 is not of type NF4LNK. 24125 18.25. Operation 28: REMOVE - Remove File System Object 24127 18.25.1. ARGUMENTS 24129 struct REMOVE4args { 24130 /* CURRENT_FH: directory */ 24131 component4 target; 24132 }; 24134 18.25.2. RESULTS 24136 struct REMOVE4resok { 24137 change_info4 cinfo; 24138 }; 24140 union REMOVE4res switch (nfsstat4 status) { 24141 case NFS4_OK: 24142 REMOVE4resok resok4; 24143 default: 24144 void; 24145 }; 24147 18.25.3. DESCRIPTION 24149 The REMOVE operation removes (deletes) a directory entry named by 24150 filename from the directory corresponding to the current filehandle. 
24151 If the entry in the directory was the last reference to the 24152 corresponding file system object, the object may be destroyed. The 24153 directory may be either of type NF4DIR or NF4ATTRDIR. 24155 For the directory where the filename was removed, the server returns 24156 change_info4 information in cinfo. With the atomic field of the 24157 change_info4 data type, the server will indicate if the before and 24158 after change attributes were obtained atomically with respect to the 24159 removal. 24161 If the target has a length of zero, or if the target does not obey 24162 the UTF-8 definition (and the server is enforcing UTF-8 encoding; see 24163 Section 14.4), the error NFS4ERR_INVAL will be returned. 24165 On success, the current filehandle retains its value. 24167 18.25.4. IMPLEMENTATION 24169 NFSv3 required a different operator RMDIR for directory removal and 24170 REMOVE for non-directory removal. This allowed clients to skip 24171 checking the file type when being passed a non-directory delete 24172 system call (e.g., unlink() [24] in POSIX) to remove a directory, as 24173 well as the converse (e.g., a rmdir() on a non-directory) because 24174 they knew the server would check the file type. NFSv4.1 REMOVE can 24175 be used to delete any directory entry independent of its file type. 24176 The implementor of an NFSv4.1 client's entry points from the unlink() 24177 and rmdir() system calls should first check the file type against the 24178 types the system call is allowed to remove before sending a REMOVE 24179 operation. Alternatively, the implementor can produce a COMPOUND 24180 call that includes a LOOKUP/VERIFY sequence of operations to verify 24181 the file type before a REMOVE operation in the same COMPOUND call. 24183 The concept of last reference is server specific. However, if the 24184 numlinks field in the previous attributes of the object had the value 24185 1, the client should not rely on referring to the object via a 24186 filehandle. Likewise, the client should not rely on the resources 24187 (disk space, directory entry, and so on) formerly associated with the 24188 object becoming immediately available. Thus, if a client needs to be 24189 able to continue to access a file after using REMOVE to remove it, 24190 the client should take steps to make sure that the file will still be 24191 accessible. While the traditional mechanism used is to RENAME the 24192 file from its old name to a new hidden name, the NFSv4.1 OPEN 24193 operation MAY return a result flag, OPEN4_RESULT_PRESERVE_UNLINKED, 24194 which indicates to the client that the file will be preserved if the 24195 file has an outstanding open (see Section 18.16). 24197 If the server finds that the file is still open when the REMOVE 24198 arrives: 24200 * The server SHOULD NOT delete the file's directory entry if the 24201 file was opened with OPEN4_SHARE_DENY_WRITE or 24202 OPEN4_SHARE_DENY_BOTH. 24204 * If the file was not opened with OPEN4_SHARE_DENY_WRITE or 24205 OPEN4_SHARE_DENY_BOTH, the server SHOULD delete the file's 24206 directory entry. However, until last CLOSE of the file, the 24207 server MAY continue to allow access to the file via its 24208 filehandle. 24210 * The server MUST NOT delete the directory entry if the reply from 24211 OPEN had the flag OPEN4_RESULT_PRESERVE_UNLINKED set. 24213 The server MAY implement its own restrictions on removal of a file 24214 while it is open. The server might disallow such a REMOVE (or a 24215 removal that occurs as part of RENAME). 
The conditions that 24216 influence the restrictions on removal of a file while it is still 24217 open include: 24219 * Whether certain access protocols (i.e., not just NFS) are holding 24220 the file open. 24222 * Whether particular options, access modes, or policies on the 24223 server are enabled. 24225 If a file has an outstanding OPEN and this prevents the removal of 24226 the file's directory entry, the error NFS4ERR_FILE_OPEN is returned. 24228 Where the determination above cannot be made definitively because 24229 delegations are being held, they MUST be recalled to allow processing 24230 of the REMOVE to continue. When a delegation is held, the server has 24231 no reliable knowledge of the status of OPENs for that client, so 24232 unless there are files opened with the particular deny modes by 24233 clients without delegations, the determination cannot be made until 24234 delegations are recalled, and the operation cannot proceed until each 24235 sufficient delegation has been returned or revoked to allow the 24236 server to make a correct determination. 24238 In all cases in which delegations are recalled, the server is likely 24239 to return one or more NFS4ERR_DELAY errors while delegations remain 24240 outstanding. 24242 If the current filehandle designates a directory for which another 24243 client holds a directory delegation, then, unless the situation can 24244 be resolved by sending a notification, the directory delegation MUST 24245 be recalled, and the operation MUST NOT proceed until the delegation 24246 is returned or revoked. Except where this happens very quickly, one 24247 or more NFS4ERR_DELAY errors will be returned to requests made while 24248 delegation remains outstanding. 24250 When the current filehandle designates a directory for which one or 24251 more directory delegations exist, then, when those delegations 24252 request such notifications, NOTIFY4_REMOVE_ENTRY will be generated as 24253 a result of this operation. 24255 Note that when a remove occurs as a result of a RENAME, 24256 NOTIFY4_REMOVE_ENTRY will only be generated if the removal happens as 24257 a separate operation. In the case in which the removal is integrated 24258 and atomic with RENAME, the notification of the removal is integrated 24259 with notification for the RENAME. See the discussion of the 24260 NOTIFY4_RENAME_ENTRY notification in Section 20.4. 24262 18.26. Operation 29: RENAME - Rename Directory Entry 24264 18.26.1. ARGUMENTS 24266 struct RENAME4args { 24267 /* SAVED_FH: source directory */ 24268 component4 oldname; 24269 /* CURRENT_FH: target directory */ 24270 component4 newname; 24271 }; 24273 18.26.2. RESULTS 24275 struct RENAME4resok { 24276 change_info4 source_cinfo; 24277 change_info4 target_cinfo; 24278 }; 24280 union RENAME4res switch (nfsstat4 status) { 24281 case NFS4_OK: 24282 RENAME4resok resok4; 24283 default: 24284 void; 24285 }; 24287 18.26.3. DESCRIPTION 24289 The RENAME operation renames the object identified by oldname in the 24290 source directory corresponding to the saved filehandle, as set by the 24291 SAVEFH operation, to newname in the target directory corresponding to 24292 the current filehandle. The operation is required to be atomic to 24293 the client. Source and target directories MUST reside on the same 24294 file system on the server. On success, the current filehandle will 24295 continue to be the target directory. 
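As a non-normative illustration of the filehandle roles just described
(the saved filehandle names the source directory and the current
filehandle names the target directory), the following C sketch shows one
plausible ordering of operations when a client builds the COMPOUND for a
RENAME. The compound_* builder functions are hypothetical and exist only
for this example.

   #include <stdint.h>

   struct compound;                        /* hypothetical builder type */
   struct compound *compound_begin(void);
   void compound_add_putfh(struct compound *, const void *fh, uint32_t len);
   void compound_add_savefh(struct compound *);
   void compound_add_rename(struct compound *, const char *oldname,
                            const char *newname);
   int  compound_send(struct compound *);

   int rename_entry(const void *src_dir_fh, uint32_t src_len,
                    const void *dst_dir_fh, uint32_t dst_len,
                    const char *oldname, const char *newname)
   {
       struct compound *c = compound_begin();

       compound_add_putfh(c, src_dir_fh, src_len); /* current = source dir */
       compound_add_savefh(c);                     /* saved   = source dir */
       compound_add_putfh(c, dst_dir_fh, dst_len); /* current = target dir */
       compound_add_rename(c, oldname, newname);   /* oldname in SAVED_FH,
                                                      newname in CURRENT_FH */

       return compound_send(c);    /* NFS4ERR_XDEV if the fsids differ */
   }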
24297 If the target directory already contains an entry with the name 24298 newname, the source object MUST be compatible with the target: either 24299 both are non-directories or both are directories and the target MUST 24300 be empty. If compatible, the existing target is removed before the 24301 rename occurs or, preferably, the target is removed atomically as 24302 part of the rename. See Section 18.25.4 for client and server 24303 actions whenever a target is removed. Note, however, that when the 24304 removal is performed atomically with the rename, certain parts of the 24305 removal described there are integrated with the rename. For example, 24306 notification of the removal will not be via a NOTIFY4_REMOVE_ENTRY 24307 but will be indicated as part of the NOTIFY4_ADD_ENTRY or 24308 NOTIFY4_RENAME_ENTRY generated by the rename. 24310 If the source object and the target are not compatible or if the 24311 target is a directory but not empty, the server will return the error 24312 NFS4ERR_EXIST. 24314 If oldname and newname both refer to the same file (e.g., they might 24315 be hard links of each other), then unless the file is open (see 24316 Section 18.26.4), RENAME MUST perform no action and return NFS4_OK. 24318 For both directories involved in the RENAME, the server returns 24319 change_info4 information. With the atomic field of the change_info4 24320 data type, the server will indicate if the before and after change 24321 attributes were obtained atomically with respect to the rename. 24323 If oldname refers to a named attribute and the saved and current 24324 filehandles refer to different file system objects, the server will 24325 return NFS4ERR_XDEV just as if the saved and current filehandles 24326 represented directories on different file systems. 24328 If oldname or newname has a length of zero, or if oldname or newname 24329 does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be 24330 returned. 24332 18.26.4. IMPLEMENTATION 24334 The server MAY impose restrictions on the RENAME operation such that 24335 RENAME may not be done when the file being renamed is open or when 24336 that open is done by particular protocols, or with particular options 24337 or access modes. Similar restrictions may be applied when a file 24338 exists with the target name and is open. When RENAME is rejected 24339 because of such restrictions, the error NFS4ERR_FILE_OPEN is 24340 returned. 24342 When oldname and newname refer to the same file and that file is open 24343 in a fashion such that RENAME would normally be rejected with 24344 NFS4ERR_FILE_OPEN if oldname and newname were different files, then 24345 RENAME SHOULD be rejected with NFS4ERR_FILE_OPEN. 24347 If a server does implement such restrictions and those restrictions 24348 include cases of NFSv4 opens preventing successful execution of a 24349 rename, the server needs to recall any delegations that could hide 24350 the existence of opens relevant to that decision. This is because 24351 when a client holds a delegation, the server might not have an 24352 accurate account of the opens for that client, since the client may 24353 execute OPENs and CLOSEs locally. The RENAME operation need only be 24354 delayed until a definitive result can be obtained.
For example, if 24355 there are multiple delegations and one of them establishes an open 24356 whose presence would prevent the rename, given the server's 24357 semantics, NFS4ERR_FILE_OPEN may be returned to the caller as soon as 24358 that delegation is returned without waiting for other delegations to 24359 be returned. Similarly, if such opens are not associated with 24360 delegations, NFS4ERR_FILE_OPEN can be returned immediately with no 24361 delegation recall being done. 24363 If the current filehandle or the saved filehandle designates a 24364 directory for which another client holds a directory delegation, 24365 then, unless the situation can be resolved by sending a notification, 24366 the delegation MUST be recalled, and the operation cannot proceed 24367 until the delegation is returned or revoked. Except where this 24368 happens very quickly, one or more NFS4ERR_DELAY errors will be 24369 returned to requests made while delegation remains outstanding. 24371 When the current and saved filehandles are the same and they 24372 designate a directory for which one or more directory delegations 24373 exist, then, when those delegations request such notifications, a 24374 notification of type NOTIFY4_RENAME_ENTRY will be generated as a 24375 result of this operation. When oldname and newname refer to the same 24376 file, no notification is generated (because, as Section 18.26.3 24377 states, the server MUST take no action). When a file is removed 24378 because it has the same name as the target, if that removal is done 24379 atomically with the rename, a NOTIFY4_REMOVE_ENTRY notification will 24380 not be generated. Instead, the deletion of the file will be reported 24381 as part of the NOTIFY4_RENAME_ENTRY notification. 24383 When the current and saved filehandles are not the same: 24385 * If the current filehandle designates a directory for which one or 24386 more directory delegations exist, then, when those delegations 24387 request such notifications, NOTIFY4_ADD_ENTRY will be generated as 24388 a result of this operation. When a file is removed because it has 24389 the same name as the target, if that removal is done atomically 24390 with the rename, a NOTIFY4_REMOVE_ENTRY notification will not be 24391 generated. Instead, the deletion of the file will be reported as 24392 part of the NOTIFY4_ADD_ENTRY notification. 24394 * If the saved filehandle designates a directory for which one or 24395 more directory delegations exist, then, when those delegations 24396 request such notifications, NOTIFY4_REMOVE_ENTRY will be generated 24397 as a result of this operation. 24399 If the object being renamed has file delegations held by clients 24400 other than the one doing the RENAME, the delegations MUST be 24401 recalled, and the operation cannot proceed until each such delegation 24402 is returned or revoked. Note that in the case of multiply linked 24403 files, the delegation recall requirement applies even if the 24404 delegation was obtained through a different name than the one being 24405 renamed. In all cases in which delegations are recalled, the server 24406 is likely to return one or more NFS4ERR_DELAY errors while the 24407 delegation(s) remains outstanding, although it might not do that if 24408 the delegations are returned quickly. 24410 The RENAME operation must be atomic to the client.
The statement 24411 "source and target directories MUST reside on the same file system on 24412 the server" means that the fsid fields in the attributes for the 24413 directories are the same. If they reside on different file systems, 24414 the error NFS4ERR_XDEV is returned. 24416 Based on the value of the fh_expire_type attribute for the object, 24417 the filehandle may or may not expire on a RENAME. However, server 24418 implementors are strongly encouraged to attempt to keep filehandles 24419 from expiring in this fashion. 24421 On some servers, the file names "." and ".." are illegal as either 24422 oldname or newname, and will result in the error NFS4ERR_BADNAME. In 24423 addition, on many servers the case of oldname or newname being an 24424 alias for the source directory will be checked for. Such servers 24425 will return the error NFS4ERR_INVAL in these cases. 24427 If either of the source or target filehandles are not directories, 24428 the server will return NFS4ERR_NOTDIR. 24430 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle 24432 18.27.1. ARGUMENTS 24434 /* SAVED_FH: */ 24435 void; 24437 18.27.2. RESULTS 24439 struct RESTOREFH4res { 24440 /* 24441 * If status is NFS4_OK, 24442 * new CURRENT_FH: value of saved fh 24443 */ 24444 nfsstat4 status; 24445 }; 24447 18.27.3. DESCRIPTION 24449 The RESTOREFH operation sets the current filehandle and stateid to 24450 the values in the saved filehandle and stateid. If there is no saved 24451 filehandle, then the server will return the error 24452 NFS4ERR_NOFILEHANDLE. 24454 See Section 16.2.3.1.1 for more details on the current filehandle. 24456 See Section 16.2.3.1.2 for more details on the current stateid. 24458 18.27.4. IMPLEMENTATION 24460 Operations like OPEN and LOOKUP use the current filehandle to 24461 represent a directory and replace it with a new filehandle. Assuming 24462 that the previous filehandle was saved with a SAVEFH operator, the 24463 previous filehandle can be restored as the current filehandle. This 24464 is commonly used to obtain post-operation attributes for the 24465 directory, e.g., 24467 PUTFH (directory filehandle) 24468 SAVEFH 24469 GETATTR attrbits (pre-op dir attrs) 24470 CREATE optbits "foo" attrs 24471 GETATTR attrbits (file attributes) 24472 RESTOREFH 24473 GETATTR attrbits (post-op dir attrs) 24475 18.28. Operation 32: SAVEFH - Save Current Filehandle 24477 18.28.1. ARGUMENTS 24479 /* CURRENT_FH: */ 24480 void; 24482 18.28.2. RESULTS 24484 struct SAVEFH4res { 24485 /* 24486 * If status is NFS4_OK, 24487 * new SAVED_FH: value of current fh 24488 */ 24489 nfsstat4 status; 24490 }; 24492 18.28.3. DESCRIPTION 24494 The SAVEFH operation saves the current filehandle and stateid. If a 24495 previous filehandle was saved, then it is no longer accessible. The 24496 saved filehandle can be restored as the current filehandle with the 24497 RESTOREFH operator. 24499 On success, the current filehandle retains its value. 24501 See Section 16.2.3.1.1 for more details on the current filehandle. 24503 See Section 16.2.3.1.2 for more details on the current stateid. 24505 18.28.4. IMPLEMENTATION 24507 18.29. Operation 33: SECINFO - Obtain Available Security 24509 18.29.1. ARGUMENTS 24511 struct SECINFO4args { 24512 /* CURRENT_FH: directory */ 24513 component4 name; 24514 }; 24516 18.29.2. 
RESULTS 24518 /* 24519 * From RFC 2203 24520 */ 24521 enum rpc_gss_svc_t { 24522 RPC_GSS_SVC_NONE = 1, 24523 RPC_GSS_SVC_INTEGRITY = 2, 24524 RPC_GSS_SVC_PRIVACY = 3 24525 }; 24527 struct rpcsec_gss_info { 24528 sec_oid4 oid; 24529 qop4 qop; 24530 rpc_gss_svc_t service; 24531 }; 24533 /* RPCSEC_GSS has a value of '6' - See RFC 2203 */ 24534 union secinfo4 switch (uint32_t flavor) { 24535 case RPCSEC_GSS: 24536 rpcsec_gss_info flavor_info; 24537 default: 24538 void; 24539 }; 24541 typedef secinfo4 SECINFO4resok<>; 24543 union SECINFO4res switch (nfsstat4 status) { 24544 case NFS4_OK: 24545 /* CURRENTFH: consumed */ 24546 SECINFO4resok resok4; 24547 default: 24548 void; 24549 }; 24551 18.29.3. DESCRIPTION 24553 The SECINFO operation is used by the client to obtain a list of valid 24554 RPC authentication flavors for a specific directory filehandle, file 24555 name pair. SECINFO should apply the same access methodology used for 24556 LOOKUP when evaluating the name. Therefore, if the requester does 24557 not have the appropriate access to LOOKUP the name, then SECINFO MUST 24558 behave the same way and return NFS4ERR_ACCESS. 24560 The result will contain an array that represents the security 24561 mechanisms available, with an order corresponding to the server's 24562 preferences, the most preferred being first in the array. The client 24563 is free to pick whatever security mechanism it both desires and 24564 supports, or to pick in the server's preference order the first one 24565 it supports. The array entries are represented by the secinfo4 24566 structure. The field 'flavor' will contain a value of AUTH_NONE, 24567 AUTH_SYS (as defined in RFC 5531 [3]), or RPCSEC_GSS (as defined in 24568 RFC 2203 [4]). The field flavor can also be any other security 24569 flavor registered with IANA. 24571 For the flavors AUTH_NONE and AUTH_SYS, no additional security 24572 information is returned. The same is true of many (if not most) 24573 other security flavors, including AUTH_DH. For a return value of 24574 RPCSEC_GSS, a security triple is returned that contains the mechanism 24575 object identifier (OID, as defined in RFC 2743 [7]), the quality of 24576 protection (as defined in RFC 2743 [7]), and the service type (as 24577 defined in RFC 2203 [4]). It is possible for SECINFO to return 24578 multiple entries with flavor equal to RPCSEC_GSS with different 24579 security triple values. 24581 On success, the current filehandle is consumed (see 24582 Section 2.6.3.1.1.8), and if the next operation after SECINFO tries 24583 to use the current filehandle, that operation will fail with the 24584 status NFS4ERR_NOFILEHANDLE. 24586 If the name has a length of zero, or if the name does not obey the 24587 UTF-8 definition (assuming UTF-8 capabilities are enabled; see 24588 Section 14.4), the error NFS4ERR_INVAL will be returned. 24590 See Section 2.6 for additional information on the use of SECINFO. 24592 18.29.4. IMPLEMENTATION 24594 The SECINFO operation is expected to be used by the NFS client when 24595 the error value of NFS4ERR_WRONGSEC is returned from another NFS 24596 operation. This signifies to the client that the server's security 24597 policy is different from what the client is currently using. At this 24598 point, the client is expected to obtain a list of possible security 24599 flavors and choose what best suits its policies. 24601 As mentioned, the server's security policies will determine when a 24602 client request receives NFS4ERR_WRONGSEC. 
See Table 14 for a list of 24603 operations that can return NFS4ERR_WRONGSEC. In addition, when 24604 READDIR returns attributes, the rdattr_error (Section 5.8.1.12) can 24605 contain NFS4ERR_WRONGSEC. Note that CREATE and REMOVE MUST NOT 24606 return NFS4ERR_WRONGSEC. The rationale for CREATE is that unless the 24607 target name exists, it cannot have a separate security policy from 24608 the parent directory, and the security policy of the parent was 24609 checked when its filehandle was injected into the COMPOUND request's 24610 operations stream (for similar reasons, an OPEN operation that 24611 creates the target MUST NOT return NFS4ERR_WRONGSEC). If the target 24612 name exists, while it might have a separate security policy, that is 24613 irrelevant because CREATE MUST return NFS4ERR_EXIST. The rationale 24614 for REMOVE is that while that target might have a separate security 24615 policy, the target is going to be removed, and so the security policy 24616 of the parent trumps that of the object being removed. RENAME and 24617 LINK MAY return NFS4ERR_WRONGSEC, but the NFS4ERR_WRONGSEC error 24618 applies only to the saved filehandle (see Section 2.6.3.1.2). Any 24619 NFS4ERR_WRONGSEC error on the current filehandle used by LINK and 24620 RENAME MUST be returned by the PUTFH, PUTPUBFH, PUTROOTFH, or 24621 RESTOREFH operation that injected the current filehandle. 24623 With the exception of LINK and RENAME, the set of operations that can 24624 return NFS4ERR_WRONGSEC represents the point at which the client can 24625 inject a filehandle into the "current filehandle" at the server. The 24626 filehandle is either provided by the client (PUTFH, PUTPUBFH, 24627 PUTROOTFH), generated as a result of a name-to-filehandle translation 24628 (LOOKUP and OPEN), or generated from the saved filehandle via 24629 RESTOREFH. As Section 2.6.3.1.1.1 states, a put filehandle operation 24630 followed by SAVEFH MUST NOT return NFS4ERR_WRONGSEC. Thus, the 24631 RESTOREFH operation, under certain conditions (see 24632 Section 2.6.3.1.1), is permitted to return NFS4ERR_WRONGSEC so that 24633 security policies can be honored. 24635 The READDIR operation will not directly return the NFS4ERR_WRONGSEC 24636 error. However, if the READDIR request included a request for 24637 attributes, it is possible that the READDIR request's security triple 24638 did not match that of a directory entry. If this is the case and the 24639 client has requested the rdattr_error attribute, the server will 24640 return the NFS4ERR_WRONGSEC error in rdattr_error for the entry. 24642 To resolve an error return of NFS4ERR_WRONGSEC, the client does the 24643 following: 24645 * For LOOKUP and OPEN, the client will use SECINFO with the same 24646 current filehandle and name as provided in the original LOOKUP or 24647 OPEN to enumerate the available security triples. 24649 * For the rdattr_error, the client will use SECINFO with the same 24650 current filehandle as provided in the original READDIR. The name 24651 passed to SECINFO will be that of the directory entry (as returned 24652 from READDIR) that had the NFS4ERR_WRONGSEC error in the 24653 rdattr_error attribute. 24655 * For PUTFH, PUTROOTFH, PUTPUBFH, RESTOREFH, LINK, and RENAME, the 24656 client will use SECINFO_NO_NAME { style = 24657 SECINFO_STYLE4_CURRENT_FH }. 
The client will prefix the 24658 SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or 24659 PUTROOTFH operation that provides the filehandle originally 24660 provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH 24661 operation. 24663 NOTE: In NFSv4.0, the client was required to use SECINFO, and had 24664 to reconstruct the parent of the original filehandle and the 24665 component name of the original filehandle. The introduction in 24666 NFSv4.1 of SECINFO_NO_NAME obviates the need for reconstruction. 24668 * For LOOKUPP, the client will use SECINFO_NO_NAME { style = 24669 SECINFO_STYLE4_PARENT } and provide the filehandle that equals the 24670 filehandle originally provided to LOOKUPP. 24672 See Section 21 for a discussion on the recommendations for the 24673 security flavor used by SECINFO and SECINFO_NO_NAME. 24675 18.30. Operation 34: SETATTR - Set Attributes 24677 18.30.1. ARGUMENTS 24678 struct SETATTR4args { 24679 /* CURRENT_FH: target object */ 24680 stateid4 stateid; 24681 fattr4 obj_attributes; 24682 }; 24684 18.30.2. RESULTS 24686 struct SETATTR4res { 24687 nfsstat4 status; 24688 bitmap4 attrsset; 24689 }; 24691 18.30.3. DESCRIPTION 24693 The SETATTR operation changes one or more of the attributes of a file 24694 system object. The new attributes are specified with a bitmap and 24695 the attributes that follow the bitmap in bit order. 24697 The stateid argument for SETATTR is used to provide byte-range 24698 locking context that is necessary for SETATTR requests that set the 24699 size attribute. Since setting the size attribute modifies the file's 24700 data, it has the same locking requirements as a corresponding WRITE. 24701 Any SETATTR that sets the size attribute is incompatible with a share 24702 reservation that specifies OPEN4_SHARE_DENY_WRITE. The area between 24703 the old end-of-file and the new end-of-file is considered to be 24704 modified just as would have been the case had the area in question 24705 been specified as the target of WRITE, for the purpose of checking 24706 conflicts with byte-range locks, for those cases in which a server is 24707 implementing mandatory byte-range locking behavior. A valid stateid 24708 SHOULD always be specified. When the file size attribute is not set, 24709 the special stateid consisting of all bits equal to zero MAY be 24710 passed. 24712 On either success or failure of the operation, the server will return 24713 the attrsset bitmask to represent what (if any) attributes were 24714 successfully set. The attrsset in the response is a subset of the 24715 attrmask field of the obj_attributes field in the argument. 24717 On success, the current filehandle retains its value. 24719 18.30.4. IMPLEMENTATION 24721 If the request specifies the owner attribute to be set, the server 24722 SHOULD allow the operation to succeed if the current owner of the 24723 object matches the value specified in the request. Some servers may 24724 be implemented in a way as to prohibit the setting of the owner 24725 attribute unless the requester has privilege to do so. If the server 24726 is lenient in this one case of matching owner values, the client 24727 implementation may be simplified in cases of creation of an object 24728 (e.g., an exclusive create via OPEN) followed by a SETATTR. 24730 The file size attribute is used to request changes to the size of a 24731 file. 
A value of zero causes the file to be truncated, a value less 24732 than the current size of the file causes data from new size to the 24733 end of the file to be discarded, and a size greater than the current 24734 size of the file causes logically zeroed data bytes to be added to 24735 the end of the file. Servers are free to implement this using 24736 unallocated bytes (holes) or allocated data bytes set to zero. 24737 Clients should not make any assumptions regarding a server's 24738 implementation of this feature, beyond that the bytes in the affected 24739 byte-range returned by READ will be zeroed. Servers MUST support 24740 extending the file size via SETATTR. 24742 SETATTR is not guaranteed to be atomic. A failed SETATTR may 24743 partially change a file's attributes, hence the reason why the reply 24744 always includes the status and the list of attributes that were set. 24746 If the object whose attributes are being changed has a file 24747 delegation that is held by a client other than the one doing the 24748 SETATTR, the delegation(s) must be recalled, and the operation cannot 24749 proceed to actually change an attribute until each such delegation is 24750 returned or revoked. In all cases in which delegations are recalled, 24751 the server is likely to return one or more NFS4ERR_DELAY errors while 24752 the delegation(s) remains outstanding, although it might not do that 24753 if the delegations are returned quickly. 24755 If the object whose attributes are being set is a directory and 24756 another client holds a directory delegation for that directory, then 24757 if enabled, asynchronous notifications will be generated when the set 24758 of attributes changed has a non-null intersection with the set of 24759 attributes for which notification is requested. Notifications of 24760 type NOTIFY4_CHANGE_DIR_ATTRS will be sent to the appropriate 24761 client(s), but the SETATTR is not delayed by waiting for these 24762 notifications to be sent. 24764 If the object whose attributes are being set is a member of the 24765 directory for which another client holds a directory delegation, then 24766 asynchronous notifications will be generated when the set of 24767 attributes changed has a non-null intersection with the set of 24768 attributes for which notification is requested. Notifications of 24769 type NOTIFY4_CHANGE_CHILD_ATTRS will be sent to the appropriate 24770 clients, but the SETATTR is not delayed by waiting for these 24771 notifications to be sent. 24773 Changing the size of a file with SETATTR indirectly changes the 24774 time_modify and change attributes. A client must account for this as 24775 size changes can result in data deletion. 24777 The attributes time_access_set and time_modify_set are write-only 24778 attributes constructed as a switched union so the client can direct 24779 the server in setting the time values. If the switched union 24780 specifies SET_TO_CLIENT_TIME4, the client has provided an nfstime4 to 24781 be used for the operation. If the switch union does not specify 24782 SET_TO_CLIENT_TIME4, the server is to use its current time for the 24783 SETATTR operation. 24785 If server and client times differ, programs that compare client time 24786 to file times can break. A time synchronization protocol should be 24787 used to limit client/server time skew. 
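The switched-union behavior of time_access_set and time_modify_set
described above can be illustrated with the following non-normative C
sketch, which loosely mirrors the settime4 and time_how4 XDR types used
for these attributes; the C type definitions here are illustrative only
and are not the normative XDR.

   #include <stdint.h>
   #include <time.h>

   typedef struct { int64_t seconds; uint32_t nseconds; } nfstime4;
   typedef enum { SET_TO_SERVER_TIME4 = 0,
                  SET_TO_CLIENT_TIME4 = 1 } time_how4;
   typedef struct {
       time_how4 set_it;
       nfstime4  time;     /* meaningful only for SET_TO_CLIENT_TIME4 */
   } settime4;

   /* Client supplies its own clock value with the SETATTR. */
   settime4 settime_from_client_clock(void)
   {
       struct timespec ts;
       settime4 s;

       clock_gettime(CLOCK_REALTIME, &ts);
       s.set_it = SET_TO_CLIENT_TIME4;
       s.time.seconds = (int64_t)ts.tv_sec;
       s.time.nseconds = (uint32_t)ts.tv_nsec;
       return s;
   }

   /* Client asks the server to stamp its own current time. */
   settime4 settime_use_server_clock(void)
   {
       settime4 s = { .set_it = SET_TO_SERVER_TIME4 };
       return s;
   }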
24789 Use of a COMPOUND containing a VERIFY operation specifying only the 24790 change attribute, immediately followed by a SETATTR, provides a means 24791 whereby a client may specify a request that emulates the 24792 functionality of the SETATTR guard mechanism of NFSv3. Since the 24793 function of the guard mechanism is to avoid changes to the file 24794 attributes based on stale information, delays between checking of the 24795 guard condition and the setting of the attributes have the potential 24796 to compromise this function, as would the corresponding delay in the 24797 NFSv4 emulation. Therefore, NFSv4.1 servers SHOULD take care to 24798 avoid such delays, to the degree possible, when executing such a 24799 request. 24801 If the server does not support an attribute as requested by the 24802 client, the server SHOULD return NFS4ERR_ATTRNOTSUPP. 24804 A mask of the attributes actually set is returned by SETATTR in all 24805 cases. That mask MUST NOT include attribute bits not requested to be 24806 set by the client. If the attribute masks in the request and reply 24807 are equal, the status field in the reply MUST be NFS4_OK. 24809 18.31. Operation 37: VERIFY - Verify Same Attributes 24811 18.31.1. ARGUMENTS 24812 struct VERIFY4args { 24813 /* CURRENT_FH: object */ 24814 fattr4 obj_attributes; 24815 }; 24817 18.31.2. RESULTS 24819 struct VERIFY4res { 24820 nfsstat4 status; 24821 }; 24823 18.31.3. DESCRIPTION 24825 The VERIFY operation is used to verify that attributes have the value 24826 assumed by the client before proceeding with the following operations 24827 in the COMPOUND request. If any of the attributes do not match, then 24828 the error NFS4ERR_NOT_SAME must be returned. The current filehandle 24829 retains its value after successful completion of the operation. 24831 18.31.4. IMPLEMENTATION 24833 One possible use of the VERIFY operation is the following series of 24834 operations. With this, the client is attempting to verify that the 24835 file being removed will match what the client expects to be removed. 24836 This series can help prevent the unintended deletion of a file. 24838 PUTFH (directory filehandle) 24839 LOOKUP (file name) 24840 VERIFY (filehandle == fh) 24841 PUTFH (directory filehandle) 24842 REMOVE (file name) 24844 This series does not prevent a second client from removing and 24845 creating a new file in the middle of this sequence, but it does help 24846 avoid the unintended result. 24848 In the case that a RECOMMENDED attribute is specified in the VERIFY 24849 operation and the server does not support that attribute for the file 24850 system object, the error NFS4ERR_ATTRNOTSUPP is returned to the 24851 client. 24853 When the attribute rdattr_error or any set-only attribute (e.g., 24854 time_modify_set) is specified, the error NFS4ERR_INVAL is returned to 24855 the client. 24857 18.32. Operation 38: WRITE - Write to File 24859 18.32.1. ARGUMENTS 24861 enum stable_how4 { 24862 UNSTABLE4 = 0, 24863 DATA_SYNC4 = 1, 24864 FILE_SYNC4 = 2 24865 }; 24867 struct WRITE4args { 24868 /* CURRENT_FH: file */ 24869 stateid4 stateid; 24870 offset4 offset; 24871 stable_how4 stable; 24872 opaque data<>; 24873 }; 24875 18.32.2. RESULTS 24877 struct WRITE4resok { 24878 count4 count; 24879 stable_how4 committed; 24880 verifier4 writeverf; 24881 }; 24883 union WRITE4res switch (nfsstat4 status) { 24884 case NFS4_OK: 24885 WRITE4resok resok4; 24886 default: 24887 void; 24888 }; 24890 18.32.3. DESCRIPTION 24892 The WRITE operation is used to write data to a regular file. 
The 24893 target file is specified by the current filehandle. The offset 24894 specifies the offset where the data should be written. An offset of 24895 zero specifies that the write should start at the beginning of the 24896 file. The count, as encoded as part of the opaque data parameter, 24897 represents the number of bytes of data that are to be written. If 24898 the count is zero, the WRITE will succeed and return a count of zero 24899 subject to permissions checking. The server MAY write fewer bytes 24900 than requested by the client. 24902 The client specifies with the stable parameter the method of how the 24903 data is to be processed by the server. If stable is FILE_SYNC4, the 24904 server MUST commit the data written plus all file system metadata to 24905 stable storage before returning results. This corresponds to the 24906 NFSv2 protocol semantics. Any other behavior constitutes a protocol 24907 violation. If stable is DATA_SYNC4, then the server MUST commit all 24908 of the data to stable storage and enough of the metadata to retrieve 24909 the data before returning. The server implementor is free to 24910 implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a 24911 possible performance drop. If stable is UNSTABLE4, the server is 24912 free to commit any part of the data and the metadata to stable 24913 storage, including all or none, before returning a reply to the 24914 client. There is no guarantee whether or when any uncommitted data 24915 will subsequently be committed to stable storage. The only 24916 guarantees made by the server are that it will not destroy any data 24917 without changing the value of writeverf and that it will not commit 24918 the data and metadata at a level less than that requested by the 24919 client. 24921 Except when special stateids are used, the stateid value for a WRITE 24922 request represents a value returned from a previous byte-range LOCK 24923 or OPEN request or the stateid associated with a delegation. The 24924 stateid identifies the associated owners if any and is used by the 24925 server to verify that the associated locks are still valid (e.g., 24926 have not been revoked). 24928 Upon successful completion, the following results are returned. The 24929 count result is the number of bytes of data written to the file. The 24930 server may write fewer bytes than requested. If so, the actual 24931 number of bytes written starting at location, offset, is returned. 24933 The server also returns an indication of the level of commitment of 24934 the data and metadata via committed. Per Table 20, 24936 * The server MAY commit the data at a stronger level than requested. 24938 * The server MUST commit the data at a level at least as high as 24939 that committed. 24941 +============+===================================+ 24942 | stable | committed | 24943 +============+===================================+ 24944 | UNSTABLE4 | FILE_SYNC4, DATA_SYNC4, UNSTABLE4 | 24945 +------------+-----------------------------------+ 24946 | DATA_SYNC4 | FILE_SYNC4, DATA_SYNC4 | 24947 +------------+-----------------------------------+ 24948 | FILE_SYNC4 | FILE_SYNC4 | 24949 +------------+-----------------------------------+ 24951 Table 20: Valid Combinations of the Fields 24952 Stable in the Request and Committed in the 24953 Reply 24955 The final portion of the result is the field writeverf. 
This field 24956 is the write verifier and is a cookie that the client can use to 24957 determine whether a server has changed instance state (e.g., server 24958 restart) between a call to WRITE and a subsequent call to either 24959 WRITE or COMMIT. This cookie MUST be unchanged during a single 24960 instance of the NFSv4.1 server and MUST be unique between instances 24961 of the NFSv4.1 server. If the cookie changes, then the client MUST 24962 assume that any data written with an UNSTABLE4 value for committed 24963 and an old writeverf in the reply has been lost and will need to be 24964 recovered. 24966 If a client writes data to the server with the stable argument set to 24967 UNSTABLE4 and the reply yields a committed response of DATA_SYNC4 or 24968 UNSTABLE4, the client will follow up some time in the future with a 24969 COMMIT operation to synchronize outstanding asynchronous data and 24970 metadata with the server's stable storage, barring client error. It 24971 is possible that due to client crash or other error that a subsequent 24972 COMMIT will not be received by the server. 24974 For a WRITE with a stateid value of all bits equal to zero, the 24975 server MAY allow the WRITE to be serviced subject to mandatory byte- 24976 range locks or the current share deny modes for the file. For a 24977 WRITE with a stateid value of all bits equal to 1, the server MUST 24978 NOT allow the WRITE operation to bypass locking checks at the server 24979 and otherwise is treated as if a stateid of all bits equal to zero 24980 were used. 24982 On success, the current filehandle retains its value. 24984 18.32.4. IMPLEMENTATION 24986 It is possible for the server to write fewer bytes of data than 24987 requested by the client. In this case, the server SHOULD NOT return 24988 an error unless no data was written at all. If the server writes 24989 less than the number of bytes specified, the client will need to send 24990 another WRITE to write the remaining data. 24992 It is assumed that the act of writing data to a file will cause the 24993 time_modified and change attributes of the file to be updated. 24994 However, these attributes SHOULD NOT be changed unless the contents 24995 of the file are changed. Thus, a WRITE request with count set to 24996 zero SHOULD NOT cause the time_modified and change attributes of the 24997 file to be updated. 24999 Stable storage is persistent storage that survives: 25001 1. Repeated power failures. 25003 2. Hardware failures (of any board, power supply, etc.). 25005 3. Repeated software crashes and restarts. 25007 This definition does not address failure of the stable storage module 25008 itself. 25010 The verifier is defined to allow a client to detect different 25011 instances of an NFSv4.1 protocol server over which cached, 25012 uncommitted data may be lost. In the most likely case, the verifier 25013 allows the client to detect server restarts. This information is 25014 required so that the client can safely determine whether the server 25015 could have lost cached data. If the server fails unexpectedly and 25016 the client has uncommitted data from previous WRITE requests (done 25017 with the stable argument set to UNSTABLE4 and in which the result 25018 committed was returned as UNSTABLE4 as well), the server might not 25019 have flushed cached data to stable storage. The burden of recovery 25020 is on the client, and the client will need to retransmit the data to 25021 the server. 
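   The following non-normative sketch, in C, illustrates the verifier
   comparison described above.  The structure and function names
   (uncommitted_write, must_resend_uncommitted) are invented for this
   example and are not part of the protocol; only the eight-octet size
   of the verifier and the rule that a changed verifier invalidates
   uncommitted UNSTABLE4 data come from this specification.

      /*
       * Non-normative example: client-side tracking of the write
       * verifier for data written with UNSTABLE4.  The names used
       * here are illustrative only.
       */
      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>

      #define NFS4_VERIFIER_SIZE 8
      typedef uint8_t verifier4[NFS4_VERIFIER_SIZE];

      struct uncommitted_write {
          uint64_t  offset;    /* where the UNSTABLE4 data was written */
          uint32_t  count;     /* number of bytes not yet committed    */
          verifier4 writeverf; /* verifier returned with that WRITE    */
      };

      /*
       * Called with the writeverf from a later WRITE or COMMIT reply.
       * A mismatch means the server instance may have changed and the
       * cached data must be re-sent before it can be considered
       * stable.
       */
      static bool
      must_resend_uncommitted(const struct uncommitted_write *w,
                              const verifier4 current_verf)
      {
          return memcmp(w->writeverf, current_verf,
                        NFS4_VERIFIER_SIZE) != 0;
      }

   Entries whose data are later reported as FILE_SYNC4, or that are
   covered by a successful COMMIT returning an unchanged verifier, can
   simply be dropped from such a cache, leaving only truly asynchronous
   data subject to this check.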
25023 A suggested verifier would be to use the time that the server was 25024 last started (if restarting the server results in lost buffers). 25026 The reply's committed field allows the client to do more effective 25027 caching. If the server is committing all WRITE requests to stable 25028 storage, then it SHOULD return with committed set to FILE_SYNC4, 25029 regardless of the value of the stable field in the arguments. A 25030 server that uses an NVRAM accelerator may choose to implement this 25031 policy. The client can use this to increase the effectiveness of the 25032 cache by discarding cached data that has already been committed on 25033 the server. 25035 Some implementations may return NFS4ERR_NOSPC instead of 25036 NFS4ERR_DQUOT when a user's quota is exceeded. 25038 In the case that the current filehandle is of type NF4DIR, the server 25039 will return NFS4ERR_ISDIR. If the current file is a symbolic link, 25040 the error NFS4ERR_SYMLINK will be returned. Otherwise, if the 25041 current filehandle does not designate an ordinary file, the server 25042 will return NFS4ERR_WRONG_TYPE. 25044 If mandatory byte-range locking is in effect for the file, and the 25045 corresponding byte-range of the data to be written to the file is 25046 READ_LT or WRITE_LT locked by an owner that is not associated with 25047 the stateid, the server MUST return NFS4ERR_LOCKED. If so, the 25048 client MUST check if the owner corresponding to the stateid used with 25049 the WRITE operation has a conflicting READ_LT lock that overlaps with 25050 the byte-range that was to be written. If the stateid's owner has no 25051 conflicting READ_LT lock, then the client SHOULD try to get the 25052 appropriate write byte-range lock via the LOCK operation before re- 25053 attempting the WRITE. When the WRITE completes, the client SHOULD 25054 release the byte-range lock via LOCKU. 25056 If the stateid's owner had a conflicting READ_LT lock, then the 25057 client has no choice but to return an error to the application that 25058 attempted the WRITE. The reason is that since the stateid's owner 25059 had a READ_LT lock, either the server attempted to temporarily 25060 effectively upgrade this READ_LT lock to a WRITE_LT lock or the 25061 server has no upgrade capability. If the server attempted to upgrade 25062 the READ_LT lock and failed, it is pointless for the client to re- 25063 attempt the upgrade via the LOCK operation, because there might be 25064 another client also trying to upgrade. If two clients are blocked 25065 trying to upgrade the same lock, the clients deadlock. If the server 25066 has no upgrade capability, then it is pointless to try a LOCK 25067 operation to upgrade. 25069 If one or more other clients have delegations for the file being 25070 written, those delegations MUST be recalled, and the operation cannot 25071 proceed until those delegations are returned or revoked. Except 25072 where this happens very quickly, one or more NFS4ERR_DELAY errors 25073 will be returned to requests made while the delegation remains 25074 outstanding. Normally, delegations will not be recalled as a result 25075 of a WRITE operation since the recall will occur as a result of an 25076 earlier OPEN. However, since it is possible for a WRITE to be done 25077 with a special stateid, the server needs to check for this case even 25078 though the client should have done an OPEN previously. 25080 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control 25082 18.33.1. 
ARGUMENT 25084 typedef opaque gsshandle4_t<>; 25086 struct gss_cb_handles4 { 25087 rpc_gss_svc_t gcbp_service; /* RFC 2203 */ 25088 gsshandle4_t gcbp_handle_from_server; 25089 gsshandle4_t gcbp_handle_from_client; 25090 }; 25092 union callback_sec_parms4 switch (uint32_t cb_secflavor) { 25093 case AUTH_NONE: 25094 void; 25095 case AUTH_SYS: 25096 authsys_parms cbsp_sys_cred; /* RFC 5531 */ 25097 case RPCSEC_GSS: 25098 gss_cb_handles4 cbsp_gss_handles; 25099 }; 25101 struct BACKCHANNEL_CTL4args { 25102 uint32_t bca_cb_program; 25103 callback_sec_parms4 bca_sec_parms<>; 25104 }; 25106 18.33.2. RESULT 25108 struct BACKCHANNEL_CTL4res { 25109 nfsstat4 bcr_status; 25110 }; 25112 18.33.3. DESCRIPTION 25114 The BACKCHANNEL_CTL operation replaces the backchannel's callback 25115 program number and adds (not replaces) RPCSEC_GSS handles for use by 25116 the backchannel. 25118 The arguments of the BACKCHANNEL_CTL call are a subset of the 25119 CREATE_SESSION parameters. In the arguments of BACKCHANNEL_CTL, the 25120 bca_cb_program field and bca_sec_parms fields correspond respectively 25121 to the csa_cb_program and csa_sec_parms fields of the arguments of 25122 CREATE_SESSION (Section 18.36). 25124 BACKCHANNEL_CTL MUST appear in a COMPOUND that starts with SEQUENCE. 25126 If the RPCSEC_GSS handle identified by gcbp_handle_from_server does 25127 not exist on the server, the server MUST return NFS4ERR_NOENT. 25129 If an RPCSEC_GSS handle is using the SSV context (see 25130 Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a 25131 common SSV GSS context, there are security considerations specific to 25132 this situation discussed in Section 2.10.10. 25134 18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection with 25135 Session 25137 18.34.1. ARGUMENT 25139 enum channel_dir_from_client4 { 25140 CDFC4_FORE = 0x1, 25141 CDFC4_BACK = 0x2, 25142 CDFC4_FORE_OR_BOTH = 0x3, 25143 CDFC4_BACK_OR_BOTH = 0x7 25144 }; 25146 struct BIND_CONN_TO_SESSION4args { 25147 sessionid4 bctsa_sessid; 25149 channel_dir_from_client4 25150 bctsa_dir; 25152 bool bctsa_use_conn_in_rdma_mode; 25153 }; 25155 18.34.2. RESULT 25156 enum channel_dir_from_server4 { 25157 CDFS4_FORE = 0x1, 25158 CDFS4_BACK = 0x2, 25159 CDFS4_BOTH = 0x3 25160 }; 25162 struct BIND_CONN_TO_SESSION4resok { 25163 sessionid4 bctsr_sessid; 25165 channel_dir_from_server4 25166 bctsr_dir; 25168 bool bctsr_use_conn_in_rdma_mode; 25169 }; 25171 union BIND_CONN_TO_SESSION4res 25172 switch (nfsstat4 bctsr_status) { 25174 case NFS4_OK: 25175 BIND_CONN_TO_SESSION4resok 25176 bctsr_resok4; 25178 default: void; 25179 }; 25181 18.34.3. DESCRIPTION 25183 BIND_CONN_TO_SESSION is used to associate additional connections with 25184 a session. It MUST be used on the connection being associated with 25185 the session. It MUST be the only operation in the COMPOUND 25186 procedure. If SP4_NONE (Section 18.35) state protection is used, any 25187 principal, security flavor, or RPCSEC_GSS context MAY be used to 25188 invoke the operation. If SP4_MACH_CRED is used, RPCSEC_GSS MUST be 25189 used with the integrity or privacy services, using the principal that 25190 created the client ID. If SP4_SSV is used, RPCSEC_GSS with the SSV 25191 GSS mechanism (Section 2.10.9) and integrity or privacy MUST be used. 
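   The following non-normative C sketch restates the credential
   requirements just described as a single check.  The structure and
   function names are invented for this example; only the SP4_* state
   protection values and the requirement for RPCSEC_GSS with integrity
   or privacy (and, for SP4_SSV, the SSV GSS mechanism) come from this
   specification.

      /*
       * Non-normative example: which credentials are acceptable on
       * BIND_CONN_TO_SESSION for each state protection choice.  All
       * identifiers here are illustrative.
       */
      #include <stdbool.h>

      enum sp_how { SP4_NONE = 0, SP4_MACH_CRED = 1, SP4_SSV = 2 };

      struct cred_summary {
          bool uses_rpcsec_gss;      /* request arrived over RPCSEC_GSS   */
          bool integrity_or_privacy; /* RPC_GSS_SVC_INTEGRITY or _PRIVACY */
          bool ssv_mechanism;        /* GSS context uses the SSV mechanism */
          bool clientid_principal;   /* principal that created client ID  */
      };

      static bool
      bind_conn_cred_acceptable(enum sp_how how,
                                const struct cred_summary *c)
      {
          switch (how) {
          case SP4_NONE:
              return true;  /* any principal, flavor, or context */
          case SP4_MACH_CRED:
              return c->uses_rpcsec_gss &&
                     c->integrity_or_privacy &&
                     c->clientid_principal;
          case SP4_SSV:
              return c->uses_rpcsec_gss &&
                     c->integrity_or_privacy &&
                     c->ssv_mechanism;
          }
          return false;
      }

   A server enforcing SP4_MACH_CRED or SP4_SSV protection would apply a
   check of this kind before associating the connection with any
   channel.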
25193 If, when the client ID was created, the client opted for SP4_NONE 25194 state protection, the client is not required to use 25195 BIND_CONN_TO_SESSION to associate the connection with the session, 25196 unless the client wishes to associate the connection with the 25197 backchannel. When SP4_NONE protection is used, simply sending a 25198 COMPOUND request with a SEQUENCE operation is sufficient to associate 25199 the connection with the session specified in SEQUENCE. 25201 The field bctsa_dir indicates whether the client wants to associate 25202 the connection with the fore channel or the backchannel or both 25203 channels. The value CDFC4_FORE_OR_BOTH indicates that the client 25204 wants to associate the connection with both the fore channel and 25205 backchannel, but will accept the connection being associated to just 25206 the fore channel. The value CDFC4_BACK_OR_BOTH indicates that the 25207 client wants to associate with both the fore channel and backchannel, 25208 but will accept the connection being associated with just the 25209 backchannel. The server replies in bctsr_dir which channel(s) the 25210 connection is associated with. If the client specified CDFC4_FORE, 25211 the server MUST return CDFS4_FORE. If the client specified 25212 CDFC4_BACK, the server MUST return CDFS4_BACK. If the client 25213 specified CDFC4_FORE_OR_BOTH, the server MUST return CDFS4_FORE or 25214 CDFS4_BOTH. If the client specified CDFC4_BACK_OR_BOTH, the server 25215 MUST return CDFS4_BACK or CDFS4_BOTH. 25217 See the CREATE_SESSION operation (Section 18.36), and the description 25218 of the argument csa_use_conn_in_rdma_mode to understand 25219 bctsa_use_conn_in_rdma_mode, and the description of 25220 csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode. 25222 Invoking BIND_CONN_TO_SESSION on a connection already associated with 25223 the specified session has no effect, and the server MUST respond with 25224 NFS4_OK, unless the client is demanding changes to the set of 25225 channels the connection is associated with. If so, the server MUST 25226 return NFS4ERR_INVAL. 25228 18.34.4. IMPLEMENTATION 25230 If a session's channel loses all connections, depending on the client 25231 ID's state protection and type of channel, the client might need to 25232 use BIND_CONN_TO_SESSION to associate a new connection. If the 25233 server restarted and does not keep the reply cache in stable storage, 25234 the server will not recognize the session ID. The client will 25235 ultimately have to invoke EXCHANGE_ID to create a new client ID and 25236 session. 25238 Suppose SP4_SSV state protection is being used, and 25239 BIND_CONN_TO_SESSION is among the operations included in the 25240 spo_must_enforce set when the client ID was created (Section 18.35). 25241 If so, there is an issue if SET_SSV is sent, no response is returned, 25242 and the last connection associated with the client ID drops. The 25243 client, per the sessions model, MUST retry the SET_SSV. But it needs 25244 a new connection to do so, and MUST associate that connection with 25245 the session via a BIND_CONN_TO_SESSION authenticated with the SSV GSS 25246 mechanism. The problem is that the RPCSEC_GSS message integrity 25247 codes use a subkey derived from the SSV as the key and the SSV may 25248 have changed. While there are multiple recovery strategies, a 25249 single, general strategy is described here. 25251 * The client reconnects. 
25253 * The client assumes that the SET_SSV was executed, and so sends 25254 BIND_CONN_TO_SESSION with the subkey (derived from the new SSV, 25255 i.e., what SET_SSV would have set the SSV to) used as the key for 25256 the RPCSEC_GSS credential message integrity codes. 25258 * If the request succeeds, this means that the original attempted 25259 SET_SSV did execute successfully. The client re-sends the 25260 original SET_SSV, which the server will reply to via the reply 25261 cache. 25263 * If the server returns an RPC authentication error, this means that 25264 the server's current SSV was not changed (and the SET_SSV was 25265 likely not executed). The client then tries BIND_CONN_TO_SESSION 25266 with the subkey derived from the old SSV as the key for the 25267 RPCSEC_GSS message integrity codes. 25269 * The attempted BIND_CONN_TO_SESSION with the old SSV should 25270 succeed. If so, the client re-sends the original SET_SSV. If the 25271 original SET_SSV was not executed, then the server executes it. 25272 If the original SET_SSV was executed but failed, the server will 25273 return the SET_SSV from the reply cache. 25275 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID 25277 The EXCHANGE_ID operation exchanges long-hand client and server 25278 identifiers (owners) and provides access to a client ID, creating one 25279 if necessary. This client ID becomes associated with the connection 25280 on which the operation is done, so that it is available when a 25281 CREATE_SESSION is done or when the connection is used to issue a 25282 request on an existing session associated with the current client. 25284 18.35.1. ARGUMENT 25286 const EXCHGID4_FLAG_SUPP_MOVED_REFER = 0x00000001; 25287 const EXCHGID4_FLAG_SUPP_MOVED_MIGR = 0x00000002; 25289 const EXCHGID4_FLAG_BIND_PRINC_STATEID = 0x00000100; 25291 const EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000; 25292 const EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000; 25293 const EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000; 25295 const EXCHGID4_FLAG_MASK_PNFS = 0x00070000; 25297 const EXCHGID4_FLAG_UPD_CONFIRMED_REC_A = 0x40000000; 25298 const EXCHGID4_FLAG_CONFIRMED_R = 0x80000000; 25300 struct state_protect_ops4 { 25301 bitmap4 spo_must_enforce; 25302 bitmap4 spo_must_allow; 25303 }; 25305 struct ssv_sp_parms4 { 25306 state_protect_ops4 ssp_ops; 25307 sec_oid4 ssp_hash_algs<>; 25308 sec_oid4 ssp_encr_algs<>; 25309 uint32_t ssp_window; 25310 uint32_t ssp_num_gss_handles; 25311 }; 25313 enum state_protect_how4 { 25314 SP4_NONE = 0, 25315 SP4_MACH_CRED = 1, 25316 SP4_SSV = 2 25317 }; 25319 union state_protect4_a switch(state_protect_how4 spa_how) { 25320 case SP4_NONE: 25321 void; 25322 case SP4_MACH_CRED: 25323 state_protect_ops4 spa_mach_ops; 25324 case SP4_SSV: 25325 ssv_sp_parms4 spa_ssv_parms; 25326 }; 25328 struct EXCHANGE_ID4args { 25329 client_owner4 eia_clientowner; 25330 uint32_t eia_flags; 25331 state_protect4_a eia_state_protect; 25332 nfs_impl_id4 eia_client_impl_id<1>; 25333 }; 25335 18.35.2. 
RESULT 25336 struct ssv_prot_info4 { 25337 state_protect_ops4 spi_ops; 25338 uint32_t spi_hash_alg; 25339 uint32_t spi_encr_alg; 25340 uint32_t spi_ssv_len; 25341 uint32_t spi_window; 25342 gsshandle4_t spi_handles<>; 25343 }; 25345 union state_protect4_r switch(state_protect_how4 spr_how) { 25346 case SP4_NONE: 25347 void; 25348 case SP4_MACH_CRED: 25349 state_protect_ops4 spr_mach_ops; 25350 case SP4_SSV: 25351 ssv_prot_info4 spr_ssv_info; 25352 }; 25354 struct EXCHANGE_ID4resok { 25355 clientid4 eir_clientid; 25356 sequenceid4 eir_sequenceid; 25357 uint32_t eir_flags; 25358 state_protect4_r eir_state_protect; 25359 server_owner4 eir_server_owner; 25360 opaque eir_server_scope; 25361 nfs_impl_id4 eir_server_impl_id<1>; 25362 }; 25364 union EXCHANGE_ID4res switch (nfsstat4 eir_status) { 25365 case NFS4_OK: 25366 EXCHANGE_ID4resok eir_resok4; 25368 default: 25369 void; 25370 }; 25372 18.35.3. DESCRIPTION 25374 The client uses the EXCHANGE_ID operation to register a particular 25375 instance of that client with the server, as represented by a 25376 client_owner4. However, when the client_owner4 has already been 25377 registered by other means (e.g., Transparent State Migration), the 25378 client may still use EXCHANGE_ID to obtain the client ID assigned 25379 previously. 25381 The client ID returned from this operation will be associated with 25382 the connection on which the EXCHANGE_ID is received and will serve as 25383 a parent object for sessions created by the client on this connection 25384 or to which the connection is bound. As a result of using those 25385 sessions to make requests involving the creation of state, that state 25386 will become associated with the client ID returned. 25388 In situations in which the registration of the client_owner has not 25389 occurred previously, the client ID must first be used, along with the 25390 returned eir_sequenceid, in creating an associated session using 25391 CREATE_SESSION. 25393 If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the result, 25394 eir_flags, then it is an indication that the registration of the 25395 client_owner has already occurred and that a further CREATE_SESSION 25396 is not needed to confirm it. Of course, subsequent CREATE_SESSION 25397 operations may be needed for other reasons. 25399 The value eir_sequenceid is used to establish an initial sequence 25400 value associated with the client ID returned. In cases in which a 25401 CREATE_SESSION has already been done, there is no need for this 25402 value, since sequencing of such request has already been established, 25403 and the client has no need for this value and will ignore it. 25405 EXCHANGE_ID MAY be sent in a COMPOUND procedure that starts with 25406 SEQUENCE. However, when a client communicates with a server for the 25407 first time, it will not have a session, so using SEQUENCE will not be 25408 possible. If EXCHANGE_ID is sent without a preceding SEQUENCE, then 25409 it MUST be the only operation in the COMPOUND procedure's request. 25410 If it is not, the server MUST return NFS4ERR_NOT_ONLY_OP. 25412 The eia_clientowner field is composed of a co_verifier field and a 25413 co_ownerid string. As noted in Section 2.4, the co_ownerid 25414 identifies the client, and the co_verifier specifies a particular 25415 incarnation of that client. An EXCHANGE_ID sent with a new 25416 incarnation of the client will lead to the server removing lock state 25417 of the old incarnation. 
On the other hand, when an EXCHANGE_ID sent 25418 with the current incarnation and co_ownerid does not result in an 25419 unrelated error, it will potentially update an existing client ID's 25420 properties or simply return information about the existing client_id. 25421 The latter would happen when this operation is done to the same 25422 server using different network addresses as part of creating trunked 25423 connections. 25425 A server MUST NOT provide the same client ID to two different 25426 incarnations of an eia_clientowner. 25428 In addition to the client ID and sequence ID, the server returns a 25429 server owner (eir_server_owner) and server scope (eir_server_scope). 25430 The former field is used in connection with network trunking as 25431 described in Section 2.10.5. The latter field is used to allow 25432 clients to determine when client IDs sent by one server may be 25433 recognized by another in the event of file system migration (see 25434 Section 11.11.9 of the current document). 25436 The client ID returned by EXCHANGE_ID is only unique relative to the 25437 combination of eir_server_owner.so_major_id and eir_server_scope. 25438 Thus, if two servers return the same client ID, the onus is on the 25439 client to distinguish the client IDs on the basis of 25440 eir_server_owner.so_major_id and eir_server_scope. In the event two 25441 different servers claim matching server_owner.so_major_id and 25442 eir_server_scope, the client can use the verification techniques 25443 discussed in Section 2.10.5.1 to determine if the servers are 25444 distinct. If they are distinct, then the client will need to note 25445 the destination network addresses of the connections used with each 25446 server and use the network address as the final discriminator. 25448 The server, as defined by the unique identity expressed in the 25449 so_major_id of the server owner and the server scope, needs to track 25450 several properties of each client ID it hands out. The properties 25451 apply to the client ID and all sessions associated with the client 25452 ID. The properties are derived from the arguments and results of 25453 EXCHANGE_ID. The client ID properties include: 25455 * The capabilities expressed by the following bits, which come from 25456 the results of EXCHANGE_ID: 25458 - EXCHGID4_FLAG_SUPP_MOVED_REFER 25460 - EXCHGID4_FLAG_SUPP_MOVED_MIGR 25462 - EXCHGID4_FLAG_BIND_PRINC_STATEID 25464 - EXCHGID4_FLAG_USE_NON_PNFS 25466 - EXCHGID4_FLAG_USE_PNFS_MDS 25468 - EXCHGID4_FLAG_USE_PNFS_DS 25470 These properties may be updated by subsequent EXCHANGE_ID 25471 operations on confirmed client IDs though the server MAY refuse to 25472 change them. 25474 * The state protection method used, one of SP4_NONE, SP4_MACH_CRED, 25475 or SP4_SSV, as set by the spa_how field of the arguments to 25476 EXCHANGE_ID. Once the client ID is confirmed, this property 25477 cannot be updated by subsequent EXCHANGE_ID operations. 25479 * For SP4_MACH_CRED or SP4_SSV state protection: 25481 - The list of operations (spo_must_enforce) that MUST use the 25482 specified state protection. This list comes from the results 25483 of EXCHANGE_ID. 25485 - The list of operations (spo_must_allow) that MAY use the 25486 specified state protection. This list comes from the results 25487 of EXCHANGE_ID. 25489 Once the client ID is confirmed, these properties cannot be 25490 updated by subsequent EXCHANGE_ID requests. 25492 * For SP4_SSV protection: 25494 - The OID of the hash algorithm. 
This property is represented by 25495 one of the algorithms in the ssp_hash_algs field of the 25496 EXCHANGE_ID arguments. Once the client ID is confirmed, this 25497 property cannot be updated by subsequent EXCHANGE_ID requests. 25499 - The OID of the encryption algorithm. This property is 25500 represented by one of the algorithms in the ssp_encr_algs field 25501 of the EXCHANGE_ID arguments. Once the client ID is confirmed, 25502 this property cannot be updated by subsequent EXCHANGE_ID 25503 requests. 25505 - The length of the SSV. This property is represented by the 25506 spi_ssv_len field in the EXCHANGE_ID results. Once the client 25507 ID is confirmed, this property cannot be updated by subsequent 25508 EXCHANGE_ID operations. 25510 There are REQUIRED and RECOMMENDED relationships among the 25511 length of the key of the encryption algorithm ("key length"), 25512 the length of the output of hash algorithm ("hash length"), and 25513 the length of the SSV ("SSV length"). 25515 o key length MUST be <= hash length. This is because the keys 25516 used for the encryption algorithm are actually subkeys 25517 derived from the SSV, and the derivation is via the hash 25518 algorithm. The selection of an encryption algorithm with a 25519 key length that exceeded the length of the output of the 25520 hash algorithm would require padding, and thus weaken the 25521 use of the encryption algorithm. 25523 o hash length SHOULD be <= SSV length. This is because the 25524 SSV is a key used to derive subkeys via an HMAC, and it is 25525 recommended that the key used as input to an HMAC be at 25526 least as long as the length of the HMAC's hash algorithm's 25527 output (see Section 3 of [52]). 25529 o key length SHOULD be <= SSV length. This is a transitive 25530 result of the above two invariants. 25532 o key length SHOULD be >= hash length / 2. This is because 25533 the subkey derivation is via an HMAC and it is recommended 25534 that if the HMAC has to be truncated, it should not be 25535 truncated to less than half the hash length (see Section 4 25536 of RFC 2104 [52]). 25538 - Number of concurrent versions of the SSV the client and server 25539 will support (see Section 2.10.9). This property is 25540 represented by spi_window in the EXCHANGE_ID results. The 25541 property may be updated by subsequent EXCHANGE_ID operations. 25543 * The client's implementation ID as represented by the 25544 eia_client_impl_id field of the arguments. The property may be 25545 updated by subsequent EXCHANGE_ID requests. 25547 * The server's implementation ID as represented by the 25548 eir_server_impl_id field of the reply. The property may be 25549 updated by replies to subsequent EXCHANGE_ID requests. 25551 The eia_flags passed as part of the arguments and the eir_flags 25552 results allow the client and server to inform each other of their 25553 capabilities as well as indicate how the client ID will be used. 25554 Whether a bit is set or cleared on the arguments' flags does not 25555 force the server to set or clear the same bit on the results' side. 25556 Bits not defined above cannot be set in the eia_flags field. If they 25557 are, the server MUST reject the operation with NFS4ERR_INVAL. 25559 The EXCHGID4_FLAG_UPD_CONFIRMED_REC_A bit can only be set in 25560 eia_flags; it is always off in eir_flags. The 25561 EXCHGID4_FLAG_CONFIRMED_R bit can only be set in eir_flags; it is 25562 always off in eia_flags. 
If the server recognizes the co_ownerid and 25563 co_verifier as mapping to a confirmed client ID, it sets 25564 EXCHGID4_FLAG_CONFIRMED_R in eir_flags. The 25565 EXCHGID4_FLAG_CONFIRMED_R flag allows a client to tell if the client 25566 ID it is trying to create already exists and is confirmed. 25568 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set in eia_flags, this means 25569 that the client is attempting to update properties of an existing 25570 confirmed client ID (if the client wants to update properties of an 25571 unconfirmed client ID, it MUST NOT set 25572 EXCHGID4_FLAG_UPD_CONFIRMED_REC_A). If so, it is RECOMMENDED that 25573 the client send the update EXCHANGE_ID operation in the same COMPOUND 25574 as a SEQUENCE so that the EXCHANGE_ID is executed exactly once. 25575 Whether the client can update the properties of client ID depends on 25576 the state protection it selected when the client ID was created, and 25577 the principal and security flavor it used when sending the 25578 EXCHANGE_ID operation. The situations described in items 6, 7, 8, or 25579 9 of the second numbered list of Section 18.35.4 below will apply. 25580 Note that if the operation succeeds and returns a client ID that is 25581 already confirmed, the server MUST set the EXCHGID4_FLAG_CONFIRMED_R 25582 bit in eir_flags. 25584 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in eia_flags, this 25585 means that the client is trying to establish a new client ID; it is 25586 attempting to trunk data communication to the server (See 25587 Section 2.10.5); or it is attempting to update properties of an 25588 unconfirmed client ID. The situations described in items 1, 2, 3, 4, 25589 or 5 of the second numbered list of Section 18.35.4 below will apply. 25590 Note that if the operation succeeds and returns a client ID that was 25591 previously confirmed, the server MUST set the 25592 EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags. 25594 When the EXCHGID4_FLAG_SUPP_MOVED_REFER flag bit is set, the client 25595 indicates that it is capable of dealing with an NFS4ERR_MOVED error 25596 as part of a referral sequence. When this bit is not set, it is 25597 still legal for the server to perform a referral sequence. However, 25598 a server may use the fact that the client is incapable of correctly 25599 responding to a referral, by avoiding it for that particular client. 25600 It may, for instance, act as a proxy for that particular file system, 25601 at some cost in performance, although it is not obligated to do so. 25602 If the server will potentially perform a referral, it MUST set 25603 EXCHGID4_FLAG_SUPP_MOVED_REFER in eir_flags. 25605 When the EXCHGID4_FLAG_SUPP_MOVED_MIGR is set, the client indicates 25606 that it is capable of dealing with an NFS4ERR_MOVED error as part of 25607 a file system migration sequence. When this bit is not set, it is 25608 still legal for the server to indicate that a file system has moved, 25609 when this in fact happens. However, a server may use the fact that 25610 the client is incapable of correctly responding to a migration in its 25611 scheduling of file systems to migrate so as to avoid migration of 25612 file systems being actively used. It may also hide actual migrations 25613 from clients unable to deal with them by acting as a proxy for a 25614 migrated file system for particular clients, at some cost in 25615 performance, although it is not obligated to do so. If the server 25616 will potentially perform a migration, it MUST set 25617 EXCHGID4_FLAG_SUPP_MOVED_MIGR in eir_flags. 
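   The screening of eia_flags described above can be expressed as a
   simple mask test.  The following non-normative C fragment uses the
   EXCHGID4_FLAG_* values defined earlier in this section; the macro
   and function names, and the choice to treat EXCHGID4_FLAG_CONFIRMED_R
   like any other bit that must not appear in the arguments, are
   illustrative assumptions.

      /*
       * Non-normative example: server-side screening of eia_flags.
       * The constants are those defined for EXCHANGE_ID; everything
       * else is illustrative.
       */
      #include <stdbool.h>
      #include <stdint.h>

      #define EXCHGID4_FLAG_SUPP_MOVED_REFER    0x00000001u
      #define EXCHGID4_FLAG_SUPP_MOVED_MIGR     0x00000002u
      #define EXCHGID4_FLAG_BIND_PRINC_STATEID  0x00000100u
      #define EXCHGID4_FLAG_USE_NON_PNFS        0x00010000u
      #define EXCHGID4_FLAG_USE_PNFS_MDS        0x00020000u
      #define EXCHGID4_FLAG_USE_PNFS_DS         0x00040000u
      #define EXCHGID4_FLAG_UPD_CONFIRMED_REC_A 0x40000000u

      /* Bits a client may set in eia_flags; EXCHGID4_FLAG_CONFIRMED_R
       * (0x80000000) is result-only and is therefore excluded. */
      #define EIA_FLAGS_SETTABLE                                 \
              (EXCHGID4_FLAG_SUPP_MOVED_REFER    |               \
               EXCHGID4_FLAG_SUPP_MOVED_MIGR     |               \
               EXCHGID4_FLAG_BIND_PRINC_STATEID  |               \
               EXCHGID4_FLAG_USE_NON_PNFS        |               \
               EXCHGID4_FLAG_USE_PNFS_MDS        |               \
               EXCHGID4_FLAG_USE_PNFS_DS         |               \
               EXCHGID4_FLAG_UPD_CONFIRMED_REC_A)

      /* If this returns false, the server rejects the EXCHANGE_ID
       * with NFS4ERR_INVAL. */
      static bool
      eia_flags_acceptable(uint32_t eia_flags)
      {
          return (eia_flags & ~EIA_FLAGS_SETTABLE) == 0;
      }

   This check covers only undefined or result-only bits; whether the
   pNFS role bits form an acceptable combination is governed separately
   by Section 13.1.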
25619 When EXCHGID4_FLAG_BIND_PRINC_STATEID is set, the client indicates 25620 that it wants the server to bind the stateid to the principal. This 25621 means that when a principal creates a stateid, it has to be the one 25622 to use the stateid. If the server will perform binding, it will 25623 return EXCHGID4_FLAG_BIND_PRINC_STATEID. The server MAY return 25624 EXCHGID4_FLAG_BIND_PRINC_STATEID even if the client does not request 25625 it. If an update to the client ID changes the value of 25626 EXCHGID4_FLAG_BIND_PRINC_STATEID's client ID property, the effect 25627 applies only to new stateids. Existing stateids (and all stateids 25628 with the same "other" field) that were created with stateid to 25629 principal binding in force will continue to have binding in force. 25630 Existing stateids (and all stateids with the same "other" field) that 25631 were created with stateid to principal not in force will continue to 25632 have binding not in force. 25634 The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and 25635 EXCHGID4_FLAG_USE_PNFS_DS bits are described in Section 13.1 and 25636 convey roles the client ID is to be used for in a pNFS environment. 25637 The server MUST set one of the acceptable combinations of these bits 25638 (roles) in eir_flags, as specified in that section. Note that the 25639 same client owner/server owner pair can have multiple roles. 25640 Multiple roles can be associated with the same client ID or with 25641 different client IDs. Thus, if a client sends EXCHANGE_ID from the 25642 same client owner to the same server owner multiple times, but 25643 specifies different pNFS roles each time, the server might return 25644 different client IDs. Given that different pNFS roles might have 25645 different client IDs, the client may ask for different properties for 25646 each role/client ID. 25648 The spa_how field of the eia_state_protect field specifies how the 25649 client wants to protect its client, locking, and session states from 25650 unauthorized changes (Section 2.10.8.3): 25652 * SP4_NONE. The client does not request the NFSv4.1 server to 25653 enforce state protection. The NFSv4.1 server MUST NOT enforce 25654 state protection for the returned client ID. 25656 * SP4_MACH_CRED. If spa_how is SP4_MACH_CRED, then the client MUST 25657 send the EXCHANGE_ID operation with RPCSEC_GSS as the security 25658 flavor, and with a service of RPC_GSS_SVC_INTEGRITY or 25659 RPC_GSS_SVC_PRIVACY. If SP4_MACH_CRED is specified, then the 25660 client wants to use an RPCSEC_GSS-based machine credential to 25661 protect its state. The server MUST note the principal the 25662 EXCHANGE_ID operation was sent with, and the GSS mechanism used. 25663 These notes collectively comprise the machine credential. 25665 After the client ID is confirmed, as long as the lease associated 25666 with the client ID is unexpired, a subsequent EXCHANGE_ID 25667 operation that uses the same eia_clientowner.co_owner as the first 25668 EXCHANGE_ID MUST also use the same machine credential as the first 25669 EXCHANGE_ID. The server returns the same client ID for the 25670 subsequent EXCHANGE_ID as that returned from the first 25671 EXCHANGE_ID. 25673 * SP4_SSV. If spa_how is SP4_SSV, then the client MUST send the 25674 EXCHANGE_ID operation with RPCSEC_GSS as the security flavor, and 25675 with a service of RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY. 25676 If SP4_SSV is specified, then the client wants to use the SSV to 25677 protect its state. 
The server records the credential used in the 25678 request as the machine credential (as defined above) for the 25679 eia_clientowner.co_owner. The CREATE_SESSION operation that 25680 confirms the client ID MUST use the same machine credential. 25682 When a client specifies SP4_MACH_CRED or SP4_SSV, it also provides 25683 two lists of operations (each expressed as a bitmap). The first list 25684 is spo_must_enforce and consists of those operations the client MUST 25685 send (subject to the server confirming the list of operations in the 25686 result of EXCHANGE_ID) with the machine credential (if SP4_MACH_CRED 25687 protection is specified) or the SSV-based credential (if SP4_SSV 25688 protection is used). The client MUST send the operations with 25689 RPCSEC_GSS credentials that specify the RPC_GSS_SVC_INTEGRITY or 25690 RPC_GSS_SVC_PRIVACY security service. Typically, the first list of 25691 operations includes EXCHANGE_ID, CREATE_SESSION, DELEGPURGE, 25692 DESTROY_SESSION, BIND_CONN_TO_SESSION, and DESTROY_CLIENTID. The 25693 client SHOULD NOT specify in this list any operations that require a 25694 filehandle because the server's access policies MAY conflict with the 25695 client's choice, and thus the client would then be unable to access a 25696 subset of the server's namespace. 25698 Note that if SP4_SSV protection is specified, and the client 25699 indicates that CREATE_SESSION must be protected with SP4_SSV, because 25700 the SSV cannot exist without a confirmed client ID, the first 25701 CREATE_SESSION MUST instead be sent using the machine credential, and 25702 the server MUST accept the machine credential. 25704 There is a corresponding result, also called spo_must_enforce, of the 25705 operations for which the server will require SP4_MACH_CRED or SP4_SSV 25706 protection. Normally, the server's result equals the client's 25707 argument, but the result MAY be different. If the client requests 25708 one or more operations in the set { EXCHANGE_ID, CREATE_SESSION, 25709 DELEGPURGE, DESTROY_SESSION, BIND_CONN_TO_SESSION, DESTROY_CLIENTID 25710 }, then the result spo_must_enforce MUST include the operations the 25711 client requested from that set. 25713 If spo_must_enforce in the results has BIND_CONN_TO_SESSION set, then 25714 connection binding enforcement is enabled, and the client MUST use 25715 the machine (if SP4_MACH_CRED protection is used) or SSV (if SP4_SSV 25716 protection is used) credential on calls to BIND_CONN_TO_SESSION. 25718 The second list is spo_must_allow and consists of those operations 25719 the client wants to have the option of sending with the machine 25720 credential or the SSV-based credential, even if the object the 25721 operations are performed on is not owned by the machine or SSV 25722 credential. 25724 The corresponding result, also called spo_must_allow, consists of the 25725 operations the server will allow the client to use SP4_SSV or 25726 SP4_MACH_CRED credentials with. Normally, the server's result equals 25727 the client's argument, but the result MAY be different. 25729 The purpose of spo_must_allow is to allow clients to solve the 25730 following conundrum. Suppose the client ID is confirmed with 25731 EXCHGID4_FLAG_BIND_PRINC_STATEID, and it calls OPEN with the 25732 RPCSEC_GSS credentials of a normal user. Now suppose the user's 25733 credentials expire, and cannot be renewed (e.g., a Kerberos ticket 25734 granting ticket expires, and the user has logged off and will not be 25735 acquiring a new ticket granting ticket). 
The client will be unable 25736 to send CLOSE without the user's credentials, which is to say the 25737 client has to either leave the state on the server or re-send 25738 EXCHANGE_ID with a new verifier to clear all state, that is, unless 25739 the client includes CLOSE on the list of operations in spo_must_allow 25740 and the server agrees. 25742 The SP4_SSV protection parameters also have: 25744 ssp_hash_algs: 25745 This is the set of algorithms the client supports for the purpose 25746 of computing the digests needed for the internal SSV GSS mechanism 25747 and for the SET_SSV operation. Each algorithm is specified as an 25748 object identifier (OID). The REQUIRED algorithms for a server are 25749 id-sha1, id-sha224, id-sha256, id-sha384, and id-sha512 [25]. 25751 Due to known weaknesses in id-sha1, it is RECOMMENDED that the 25752 client specify at least one algorithm within ssp_hash_algs other 25753 than id-sha1. 25755 The algorithm the server selects among the set is indicated in 25756 spi_hash_alg, a field of spr_ssv_prot_info. The field 25757 spi_hash_alg is an index into the array ssp_hash_algs. Because of 25758 known the weaknesses in id-sha1, it is RECOMMENDED that it not be 25759 selected by the server as long as ssp_hash_algs contains any other 25760 supported algorithm. 25762 If the server does not support any of the offered algorithms, it 25763 returns NFS4ERR_HASH_ALG_UNSUPP. If ssp_hash_algs is empty, the 25764 server MUST return NFS4ERR_INVAL. 25766 ssp_encr_algs: 25767 This is the set of algorithms the client supports for the purpose 25768 of providing privacy protection for the internal SSV GSS 25769 mechanism. Each algorithm is specified as an OID. The REQUIRED 25770 algorithm for a server is id-aes256-CBC. The RECOMMENDED 25771 algorithms are id-aes192-CBC and id-aes128-CBC [26]. The selected 25772 algorithm is returned in spi_encr_alg, an index into 25773 ssp_encr_algs. If the server does not support any of the offered 25774 algorithms, it returns NFS4ERR_ENCR_ALG_UNSUPP. If ssp_encr_algs 25775 is empty, the server MUST return NFS4ERR_INVAL. Note that due to 25776 previously stated requirements and recommendations on the 25777 relationships between key length and hash length, some 25778 combinations of RECOMMENDED and REQUIRED encryption algorithm and 25779 hash algorithm either SHOULD NOT or MUST NOT be used. Table 21 25780 summarizes the illegal and discouraged combinations. 25782 ssp_window: 25783 This is the number of SSV versions the client wants the server to 25784 maintain (i.e., each successful call to SET_SSV produces a new 25785 version of the SSV). If ssp_window is zero, the server MUST 25786 return NFS4ERR_INVAL. The server responds with spi_window, which 25787 MUST NOT exceed ssp_window and MUST be at least one. Any requests 25788 on the backchannel or fore channel that are using a version of the 25789 SSV that is outside the window will fail with an ONC RPC 25790 authentication error, and the requester will have to retry them 25791 with the same slot ID and sequence ID. 25793 ssp_num_gss_handles: 25794 This is the number of RPCSEC_GSS handles the server should create 25795 that are based on the GSS SSV mechanism (see Section 2.10.9). It 25796 is not the total number of RPCSEC_GSS handles for the client ID. 25797 Indeed, subsequent calls to EXCHANGE_ID will add RPCSEC_GSS 25798 handles. The server responds with a list of handles in 25799 spi_handles. 
If the client asks for at least one handle and the 25800 server cannot create it, the server MUST return an error. The 25801 handles in spi_handles are not available for use until the client 25802 ID is confirmed, which could be immediately if EXCHANGE_ID returns 25803 EXCHGID4_FLAG_CONFIRMED_R, or upon successful confirmation from 25804 CREATE_SESSION. 25806 While a client ID can span all the connections that are connected 25807 to a server sharing the same eir_server_owner.so_major_id, the 25808 RPCSEC_GSS handles returned in spi_handles can only be used on 25809 connections connected to a server that returns the same the 25810 eir_server_owner.so_major_id and eir_server_owner.so_minor_id on 25811 each connection. It is permissible for the client to set 25812 ssp_num_gss_handles to zero; the client can create more handles 25813 with another EXCHANGE_ID call. 25815 Because each SSV RPCSEC_GSS handle shares a common SSV GSS 25816 context, there are security considerations specific to this 25817 situation discussed in Section 2.10.10. 25819 The seq_window (see Section 5.2.3.1 of RFC 2203 [4]) of each 25820 RPCSEC_GSS handle in spi_handle MUST be the same as the seq_window 25821 of the RPCSEC_GSS handle used for the credential of the RPC 25822 request of which the EXCHANGE_ID operation was sent as a part. 25824 +======================+===========================+===============+ 25825 | Encryption Algorithm | MUST NOT be combined with | SHOULD NOT be | 25826 | | | combined with | 25827 +======================+===========================+===============+ 25828 | id-aes128-CBC | | id-sha384, | 25829 | | | id-sha512 | 25830 +----------------------+---------------------------+---------------+ 25831 | id-aes192-CBC | id-sha1 | id-sha512 | 25832 +----------------------+---------------------------+---------------+ 25833 | id-aes256-CBC | id-sha1, id-sha224 | | 25834 +----------------------+---------------------------+---------------+ 25836 Table 21 25838 The arguments include an array of up to one element in length called 25839 eia_client_impl_id. If eia_client_impl_id is present, it contains 25840 the information identifying the implementation of the client. 25841 Similarly, the results include an array of up to one element in 25842 length called eir_server_impl_id that identifies the implementation 25843 of the server. Servers MUST accept a zero-length eia_client_impl_id 25844 array, and clients MUST accept a zero-length eir_server_impl_id 25845 array. 25847 A possible use for implementation identifiers would be in diagnostic 25848 software that extracts this information in an attempt to identify 25849 interoperability problems, performance workload behaviors, or general 25850 usage statistics. Since the intent of having access to this 25851 information is for planning or general diagnosis only, the client and 25852 server MUST NOT interpret this implementation identity information in 25853 a way that affects how the implementation interacts with its peer. 25854 The client and server are not allowed to depend on the peer's 25855 manifesting a particular allowed behavior based on an implementation 25856 identifier but are required to interoperate as specified elsewhere in 25857 the protocol specification. 
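   Referring back to Table 21: its entries follow mechanically from the
   two length constraints stated earlier, namely that the encryption
   key length MUST be no greater than the hash length and SHOULD be at
   least half of it.  The following non-normative C fragment derives
   the same classifications; the enum and function names are
   illustrative, and lengths are in octets.

      /*
       * Non-normative example: deriving Table 21 from the key-length
       * and hash-length rules.  Lengths are in octets (e.g., 16 for
       * id-aes128-CBC, 20 for id-sha1, 64 for id-sha512).
       */
      enum ssv_combo { COMBO_OK, COMBO_SHOULD_NOT, COMBO_MUST_NOT };

      static enum ssv_combo
      ssv_alg_combination(unsigned key_len, unsigned hash_len)
      {
          if (key_len > hash_len)
              return COMBO_MUST_NOT;   /* key length MUST be <= hash length */
          if (key_len < hash_len / 2)
              return COMBO_SHOULD_NOT; /* key SHOULD be >= hash length / 2  */
          return COMBO_OK;
      }

      /*
       * Examples matching Table 21:
       *   ssv_alg_combination(24, 20) == COMBO_MUST_NOT
       *                                  (id-aes192-CBC with id-sha1)
       *   ssv_alg_combination(32, 28) == COMBO_MUST_NOT
       *                                  (id-aes256-CBC with id-sha224)
       *   ssv_alg_combination(16, 48) == COMBO_SHOULD_NOT
       *                                  (id-aes128-CBC with id-sha384)
       *   ssv_alg_combination(32, 64) == COMBO_OK
       *                                  (id-aes256-CBC with id-sha512)
       */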
25859 Because it is possible that some implementations might violate the 25860 protocol specification and interpret the identity information, 25861 implementations MUST provide facilities to allow the NFSv4 client and 25862 server to be configured to set the contents of the nfs_impl_id 25863 structures sent to any specified value. 25865 18.35.4. IMPLEMENTATION 25867 A server's client record is a 5-tuple: 25869 1. co_ownerid: 25871 The client identifier string, from the eia_clientowner structure 25872 of the EXCHANGE_ID4args structure. 25874 2. co_verifier: 25876 A client-specific value used to indicate incarnations (where a 25877 client restart represents a new incarnation), from the 25878 eia_clientowner structure of the EXCHANGE_ID4args structure. 25880 3. principal: 25882 The principal that was defined in the RPC header's credential 25883 and/or verifier at the time the client record was established. 25885 4. client ID: 25887 The shorthand client identifier, generated by the server and 25888 returned via the eir_clientid field in the EXCHANGE_ID4resok 25889 structure. 25891 5. confirmed: 25893 A private field on the server indicating whether or not a client 25894 record has been confirmed. A client record is confirmed if there 25895 has been a successful CREATE_SESSION operation to confirm it. 25896 Otherwise, it is unconfirmed. An unconfirmed record is 25897 established by an EXCHANGE_ID call. Any unconfirmed record that 25898 is not confirmed within a lease period SHOULD be removed. 25900 The following identifiers represent special values for the fields in 25901 the records. 25903 ownerid_arg: 25904 The value of the eia_clientowner.co_ownerid subfield of the 25905 EXCHANGE_ID4args structure of the current request. 25907 verifier_arg: 25908 The value of the eia_clientowner.co_verifier subfield of the 25909 EXCHANGE_ID4args structure of the current request. 25911 old_verifier_arg: 25912 A value of the eia_clientowner.co_verifier field of a client 25913 record received in a previous request; this is distinct from 25914 verifier_arg. 25916 principal_arg: 25917 The value of the RPCSEC_GSS principal for the current request. 25919 old_principal_arg: 25920 A value of the principal of a client record as defined by the RPC 25921 header's credential or verifier of a previous request. This is 25922 distinct from principal_arg. 25924 clientid_ret: 25925 The value of the eir_clientid field the server will return in the 25926 EXCHANGE_ID4resok structure for the current request. 25928 old_clientid_ret: 25929 The value of the eir_clientid field the server returned in the 25930 EXCHANGE_ID4resok structure for a previous request. This is 25931 distinct from clientid_ret. 25933 confirmed: 25934 The client ID has been confirmed. 25936 unconfirmed: 25937 The client ID has not been confirmed. 25939 Since EXCHANGE_ID is a non-idempotent operation, we must consider the 25940 possibility that retries occur as a result of a client restart, 25941 network partition, malfunctioning router, etc. Retries are 25942 identified by the value of the eia_clientowner field of 25943 EXCHANGE_ID4args, and the method for dealing with them is outlined in 25944 the scenarios below. 25946 The scenarios are described in terms of the client record(s) a server 25947 has for a given co_ownerid. 
Note that if the client ID was created 25948 specifying SP4_SSV state protection and EXCHANGE_ID as the one of the 25949 operations in spo_must_allow, then the server MUST authorize 25950 EXCHANGE_IDs with the SSV principal in addition to the principal that 25951 created the client ID. 25953 1. New Owner ID 25954 If the server has no client records with 25955 eia_clientowner.co_ownerid matching ownerid_arg, and 25956 EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in the EXCHANGE_ID, 25957 then a new shorthand client ID (let us call it clientid_ret) is 25958 generated, and the following unconfirmed record is added to the 25959 server's state. 25961 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 25962 unconfirmed } 25964 Subsequently, the server returns clientid_ret. 25966 2. Non-Update on Existing Client ID 25968 If the server has the following confirmed record, and the request 25969 does not have EXCHGID4_FLAG_UPD_CONFIRMED_REC_A set, then the 25970 request is the result of a retried request due to a faulty router 25971 or lost connection, or the client is trying to determine if it 25972 can perform trunking. 25974 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 25975 confirmed } 25977 Since the record has been confirmed, the client must have 25978 received the server's reply from the initial EXCHANGE_ID request. 25979 Since the server has a confirmed record, and since 25980 EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, with the possible 25981 exception of eir_server_owner.so_minor_id, the server returns the 25982 same result it did when the client ID's properties were last 25983 updated (or if never updated, the result when the client ID was 25984 created). The confirmed record is unchanged. 25986 3. Client Collision 25988 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the 25989 server has the following confirmed record, then this request is 25990 likely the result of a chance collision between the values of the 25991 eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args for two 25992 different clients. 25994 { ownerid_arg, *, old_principal_arg, old_clientid_ret, confirmed 25995 } 25996 If there is currently no state associated with old_clientid_ret, 25997 or if there is state but the lease has expired, then this case is 25998 effectively equivalent to the New Owner ID case of 25999 Section 18.35.4, Paragraph 7, Item 1. The confirmed record is 26000 deleted, the old_clientid_ret and its lock state are deleted, a 26001 new shorthand client ID is generated, and the following 26002 unconfirmed record is added to the server's state. 26004 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 26005 unconfirmed } 26007 Subsequently, the server returns clientid_ret. 26009 If old_clientid_ret has an unexpired lease with state, then no 26010 state of old_clientid_ret is changed or deleted. The server 26011 returns NFS4ERR_CLID_INUSE to indicate that the client should 26012 retry with a different value for the eia_clientowner.co_ownerid 26013 subfield of EXCHANGE_ID4args. The client record is not changed. 26015 4. Replacement of Unconfirmed Record 26017 If the EXCHGID4_FLAG_UPD_CONFIRMED_REC_A flag is not set, and the 26018 server has the following unconfirmed record, then the client is 26019 attempting EXCHANGE_ID again on an unconfirmed client ID, perhaps 26020 due to a retry, a client restart before client ID confirmation 26021 (i.e., before CREATE_SESSION was called), or some other reason. 
26023 { ownerid_arg, *, *, old_clientid_ret, unconfirmed } 26025 It is possible that the properties of old_clientid_ret are 26026 different than those specified in the current EXCHANGE_ID. 26027 Whether or not the properties are being updated, to eliminate 26028 ambiguity, the server deletes the unconfirmed record, generates a 26029 new client ID (clientid_ret), and establishes the following 26030 unconfirmed record: 26032 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 26033 unconfirmed } 26035 5. Client Restart 26037 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the 26038 server has the following confirmed client record, then this 26039 request is likely from a previously confirmed client that has 26040 restarted. 26042 { ownerid_arg, old_verifier_arg, principal_arg, old_clientid_ret, 26043 confirmed } 26044 Since the previous incarnation of the same client will no longer 26045 be making requests, once the new client ID is confirmed by 26046 CREATE_SESSION, byte-range locks and share reservations should be 26047 released immediately rather than forcing the new incarnation to 26048 wait for the lease time on the previous incarnation to expire. 26049 Furthermore, session state should be removed since if the client 26050 had maintained that information across restart, this request 26051 would not have been sent. If the server supports neither the 26052 CLAIM_DELEGATE_PREV nor CLAIM_DELEG_PREV_FH claim types, 26053 associated delegations should be purged as well; otherwise, 26054 delegations are retained and recovery proceeds according to 26055 Section 10.2.1. 26057 After processing, clientid_ret is returned to the client and this 26058 client record is added: 26060 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 26061 unconfirmed } 26063 The previously described confirmed record continues to exist, and 26064 thus the same ownerid_arg exists in both a confirmed and 26065 unconfirmed state at the same time. The number of states can 26066 collapse to one once the server receives an applicable 26067 CREATE_SESSION or EXCHANGE_ID. 26069 * If the server subsequently receives a successful 26070 CREATE_SESSION that confirms clientid_ret, then the server 26071 atomically destroys the confirmed record and makes the 26072 unconfirmed record confirmed as described in Section 18.36.3. 26074 * If the server instead subsequently receives an EXCHANGE_ID 26075 with the client owner equal to ownerid_arg, one strategy is to 26076 simply delete the unconfirmed record, and process the 26077 EXCHANGE_ID as described in the entirety of Section 18.35.4. 26079 6. Update 26081 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has 26082 the following confirmed record, then this request is an attempt 26083 at an update. 26085 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 26086 confirmed } 26088 Since the record has been confirmed, the client must have 26089 received the server's reply from the initial EXCHANGE_ID request. 26090 The server allows the update, and the client record is left 26091 intact. 26093 7. Update but No Confirmed Record 26095 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has 26096 no confirmed record corresponding ownerid_arg, then the server 26097 returns NFS4ERR_NOENT and leaves any unconfirmed record intact. 26099 8. 
Update but Wrong Verifier 26101 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has 26102 the following confirmed record, then this request is an illegal 26103 attempt at an update, perhaps because of a retry from a previous 26104 client incarnation. 26106 { ownerid_arg, old_verifier_arg, *, clientid_ret, confirmed } 26108 The server returns NFS4ERR_NOT_SAME and leaves the client record 26109 intact. 26111 9. Update but Wrong Principal 26113 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has 26114 the following confirmed record, then this request is an illegal 26115 attempt at an update by an unauthorized principal. 26117 { ownerid_arg, verifier_arg, old_principal_arg, clientid_ret, 26118 confirmed } 26120 The server returns NFS4ERR_PERM and leaves the client record 26121 intact. 26123 18.36. Operation 43: CREATE_SESSION - Create New Session and Confirm 26124 Client ID 26126 18.36.1. ARGUMENT 26127 struct channel_attrs4 { 26128 count4 ca_headerpadsize; 26129 count4 ca_maxrequestsize; 26130 count4 ca_maxresponsesize; 26131 count4 ca_maxresponsesize_cached; 26132 count4 ca_maxoperations; 26133 count4 ca_maxrequests; 26134 uint32_t ca_rdma_ird<1>; 26135 }; 26137 const CREATE_SESSION4_FLAG_PERSIST = 0x00000001; 26138 const CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002; 26139 const CREATE_SESSION4_FLAG_CONN_RDMA = 0x00000004; 26141 struct CREATE_SESSION4args { 26142 clientid4 csa_clientid; 26143 sequenceid4 csa_sequence; 26145 uint32_t csa_flags; 26147 channel_attrs4 csa_fore_chan_attrs; 26148 channel_attrs4 csa_back_chan_attrs; 26150 uint32_t csa_cb_program; 26151 callback_sec_parms4 csa_sec_parms<>; 26152 }; 26154 18.36.2. RESULT 26156 struct CREATE_SESSION4resok { 26157 sessionid4 csr_sessionid; 26158 sequenceid4 csr_sequence; 26160 uint32_t csr_flags; 26162 channel_attrs4 csr_fore_chan_attrs; 26163 channel_attrs4 csr_back_chan_attrs; 26164 }; 26166 union CREATE_SESSION4res switch (nfsstat4 csr_status) { 26167 case NFS4_OK: 26168 CREATE_SESSION4resok csr_resok4; 26169 default: 26170 void; 26171 }; 26173 18.36.3. DESCRIPTION 26175 This operation is used by the client to create new session objects on 26176 the server. 26178 CREATE_SESSION can be sent with or without a preceding SEQUENCE 26179 operation in the same COMPOUND procedure. If CREATE_SESSION is sent 26180 with a preceding SEQUENCE operation, any session created by 26181 CREATE_SESSION has no direct relation to the session specified in the 26182 SEQUENCE operation, although the two sessions might be associated 26183 with the same client ID. If CREATE_SESSION is sent without a 26184 preceding SEQUENCE, then it MUST be the only operation in the 26185 COMPOUND procedure's request. If it is not, the server MUST return 26186 NFS4ERR_NOT_ONLY_OP. 26188 In addition to creating a session, CREATE_SESSION has the following 26189 effects: 26191 * The first session created with a new client ID serves to confirm 26192 the creation of that client's state on the server. The server 26193 returns the parameter values for the new session. 26195 * The connection CREATE_SESSION that is sent over is associated with 26196 the session's fore channel. 26198 The arguments and results of CREATE_SESSION are described as follows: 26200 csa_clientid: This is the client ID with which the new session will 26201 be associated. The corresponding result is csr_sessionid, the 26202 session ID of the new session. 26204 csa_sequence: Each client ID serializes CREATE_SESSION via a per- 26205 client ID sequence number (see Section 18.36.4). 
The 26206 corresponding result is csr_sequence, which MUST be equal to 26207 csa_sequence. 26209 In the next three arguments, the client offers a value that is to be 26210 a property of the session. Except where stated otherwise, it is 26211 RECOMMENDED that the server accept the value. If it is not 26212 acceptable, the server MAY use a different value. Regardless, the 26213 server MUST return the value the session will use (which will be 26214 either what the client offered, or what the server is insisting on) 26215 to the client. 26217 csa_flags: The csa_flags field contains a list of the following flag 26218 bits: 26220 CREATE_SESSION4_FLAG_PERSIST: 26222 If CREATE_SESSION4_FLAG_PERSIST is set, the client wants the 26223 server to provide a persistent reply cache. For sessions in 26224 which only idempotent operations will be used (e.g., a read- 26225 only session), clients SHOULD NOT set 26226 CREATE_SESSION4_FLAG_PERSIST. If the server does not or cannot 26227 provide a persistent reply cache, the server MUST NOT set 26228 CREATE_SESSION4_FLAG_PERSIST in the field csr_flags. 26230 If the server is a pNFS metadata server, for reasons described 26231 in Section 12.5.2 it SHOULD support 26232 CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint 26233 (Section 5.12.4) attribute. 26235 CREATE_SESSION4_FLAG_CONN_BACK_CHAN: 26236 If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is set in csa_flags, the 26237 client is requesting that the connection over which the 26238 CREATE_SESSION operation arrived be associated with the 26239 session's backchannel in addition to its fore channel. If the 26240 server agrees, it sets CREATE_SESSION4_FLAG_CONN_BACK_CHAN in 26241 the result field csr_flags. If 26242 CREATE_SESSION4_FLAG_CONN_BACK_CHAN is not set in csa_flags, 26243 then CREATE_SESSION4_FLAG_CONN_BACK_CHAN MUST NOT be set in 26244 csr_flags. 26246 CREATE_SESSION4_FLAG_CONN_RDMA: 26247 If CREATE_SESSION4_FLAG_CONN_RDMA is set in csa_flags, and if 26248 the connection over which the CREATE_SESSION operation arrived 26249 is currently in non-RDMA mode but has the capability to operate 26250 in RDMA mode, then the client is requesting that the server 26251 "step up" to RDMA mode on the connection. If the server 26252 agrees, it sets CREATE_SESSION4_FLAG_CONN_RDMA in the result 26253 field csr_flags. If CREATE_SESSION4_FLAG_CONN_RDMA is not set 26254 in csa_flags, then CREATE_SESSION4_FLAG_CONN_RDMA MUST NOT be 26255 set in csr_flags. Note that once the server agrees to step up, 26256 it and the client MUST exchange all future traffic on the 26257 connection with RPC RDMA framing and not Record Marking ([32]). 26259 csa_fore_chan_attrs, csa_back_chan_attrs: The csa_fore_chan_attrs 26260 and csa_back_chan_attrs fields apply to attributes of the fore 26261 channel (which conveys requests originating from the client to the 26262 server), and the backchannel (the channel that conveys callback 26263 requests originating from the server to the client), respectively. 26264 The results are in corresponding structures called 26265 csr_fore_chan_attrs and csr_back_chan_attrs. The results 26266 establish attributes for each channel, and on all subsequent use 26267 of each channel of the session. Each structure has the following 26268 fields: 26270 ca_headerpadsize: 26271 The maximum amount of padding the requester is willing to apply 26272 to ensure that write payloads are aligned on some boundary at 26273 the replier. 
For each channel, the server 26275 * will reply in ca_headerpadsize with its preferred value, or 26276 zero if padding is not in use, and 26278 * MAY decrease this value but MUST NOT increase it. 26280 ca_maxrequestsize: 26281 The maximum size of a COMPOUND or CB_COMPOUND request that will 26282 be sent. This size represents the XDR encoded size of the 26283 request, including the RPC headers (including security flavor 26284 credentials and verifiers) but excludes any RPC transport 26285 framing headers. Imagine a request coming over a non-RDMA TCP/ 26286 IP connection, and that it has a single Record Marking header 26287 preceding it. The maximum allowable count encoded in the 26288 header will be ca_maxrequestsize. If a requester sends a 26289 request that exceeds ca_maxrequestsize, the error 26290 NFS4ERR_REQ_TOO_BIG will be returned per the description in 26291 Section 2.10.6.4. For each channel, the server MAY decrease 26292 this value but MUST NOT increase it. 26294 ca_maxresponsesize: 26295 The maximum size of a COMPOUND or CB_COMPOUND reply that the 26296 requester will accept from the replier including RPC headers 26297 (see the ca_maxrequestsize definition). For each channel, the 26298 server MAY decrease this value, but MUST NOT increase it. 26299 However, if the client selects a value for ca_maxresponsesize 26300 such that a replier on a channel could never send a response, 26301 the server SHOULD return NFS4ERR_TOOSMALL in the CREATE_SESSION 26302 reply. After the session is created, if a requester sends a 26303 request for which the size of the reply would exceed this 26304 value, the replier will return NFS4ERR_REP_TOO_BIG, per the 26305 description in Section 2.10.6.4. 26307 ca_maxresponsesize_cached: 26308 Like ca_maxresponsesize, but the maximum size of a reply that 26309 will be stored in the reply cache (Section 2.10.6.1). For each 26310 channel, the server MAY decrease this value, but MUST NOT 26311 increase it. If, in the reply to CREATE_SESSION, the value of 26312 ca_maxresponsesize_cached of a channel is less than the value 26313 of ca_maxresponsesize of the same channel, then this is an 26314 indication to the requester that it needs to be selective about 26315 which replies it directs the replier to cache; for example, 26316 large replies from non-idempotent operations (e.g., COMPOUND 26317 requests with a READ operation) should not be cached. The 26318 requester decides which replies to cache via an argument to the 26319 SEQUENCE (the sa_cachethis field, see Section 18.46) or 26320 CB_SEQUENCE (the csa_cachethis field, see Section 20.9) 26321 operations. After the session is created, if a requester sends 26322 a request for which the size of the reply would exceed 26323 ca_maxresponsesize_cached, the replier will return 26324 NFS4ERR_REP_TOO_BIG_TO_CACHE, per the description in 26325 Section 2.10.6.4. 26327 ca_maxoperations: 26328 The maximum number of operations the replier will accept in a 26329 COMPOUND or CB_COMPOUND. For the backchannel, the server MUST 26330 NOT change the value the client offers. For the fore channel, 26331 the server MAY change the requested value. After the session 26332 is created, if a requester sends a COMPOUND or CB_COMPOUND with 26333 more operations than ca_maxoperations, the replier MUST return 26334 NFS4ERR_TOO_MANY_OPS. 26336 ca_maxrequests: 26337 The maximum number of concurrent COMPOUND or CB_COMPOUND 26338 requests the requester will send on the session. 
Subsequent 26339 requests will each be assigned a slot identifier by the 26340 requester within the range zero to ca_maxrequests - 1 26341 inclusive. For the backchannel, the server MUST NOT change the 26342 value the client offers. For the fore channel, the server MAY 26343 change the requested value. 26345 ca_rdma_ird: 26346 This array has a maximum of one element. If this array has one 26347 element, then the element contains the inbound RDMA read queue 26348 depth (IRD). For each channel, the server MAY decrease this 26349 value, but MUST NOT increase it. 26351 csa_cb_program This is the ONC RPC program number the server MUST 26352 use in any callbacks sent through the backchannel to the client. 26353 The server MUST specify an ONC RPC program number equal to 26354 csa_cb_program and an ONC RPC version number equal to 4 in 26355 callbacks sent to the client. If a CB_COMPOUND is sent to the 26356 client, the server MUST use a minor version number of 1. There is 26357 no corresponding result. 26359 csa_sec_parms The field csa_sec_parms is an array of acceptable 26360 security credentials the server can use on the session's 26361 backchannel. Three security flavors are supported: AUTH_NONE, 26362 AUTH_SYS, and RPCSEC_GSS. If AUTH_NONE is specified for a 26363 credential, then this says the client is authorizing the server to 26364 use AUTH_NONE on all callbacks for the session. If AUTH_SYS is 26365 specified, then the client is authorizing the server to use 26366 AUTH_SYS on all callbacks, using the credential specified 26367 cbsp_sys_cred. If RPCSEC_GSS is specified, then the server is 26368 allowed to use the RPCSEC_GSS context specified in cbsp_gss_parms 26369 as the RPCSEC_GSS context in the credential of the RPC header of 26370 callbacks to the client. There is no corresponding result. 26372 The RPCSEC_GSS context for the backchannel is specified via a pair 26373 of values of data type gsshandle4_t. The data type gsshandle4_t 26374 represents an RPCSEC_GSS handle, and is precisely the same as the 26375 data type of the "handle" field of the rpc_gss_init_res data type 26376 defined in "Context Creation Response - Successful Acceptance", 26377 Section 5.2.3.1 of [4]. 26379 The first RPCSEC_GSS handle, gcbp_handle_from_server, is the fore 26380 handle the server returned to the client (either in the handle 26381 field of data type rpc_gss_init_res or as one of the elements of 26382 the spi_handles field returned in the reply to EXCHANGE_ID) when 26383 the RPCSEC_GSS context was created on the server. The second 26384 handle, gcbp_handle_from_client, is the back handle to which the 26385 client will map the RPCSEC_GSS context. The server can 26386 immediately use the value of gcbp_handle_from_client in the 26387 RPCSEC_GSS credential in callback RPCs. That is, the value in 26388 gcbp_handle_from_client can be used as the value of the field 26389 "handle" in data type rpc_gss_cred_t (see "Elements of the 26390 RPCSEC_GSS Security Protocol", Section 5 of [4]) in callback RPCs. 26391 The server MUST use the RPCSEC_GSS security service specified in 26392 gcbp_service, i.e., it MUST set the "service" field of the 26393 rpc_gss_cred_t data type in RPCSEC_GSS credential to the value of 26394 gcbp_service (see "RPC Request Header", Section 5.3.1 of [4]). 26396 If the RPCSEC_GSS handle identified by gcbp_handle_from_server 26397 does not exist on the server, the server will return 26398 NFS4ERR_NOENT. 
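The C fragment below is a non-normative sketch of how a server might choose the credential it will place in the RPC header of callbacks, based on the csa_sec_parms array described above. The structure definitions are simplified stand-ins for the XDR-generated types, and the policy of using the first usable element is an implementation choice rather than a protocol requirement.

   /* Non-normative sketch; simplified stand-ins for the XDR types. */
   #include <stdint.h>
   #include <stddef.h>

   enum cb_flavor { AUTH_NONE = 0, AUTH_SYS = 1, RPCSEC_GSS = 6 };

   struct gss_handle {                /* opaque RPCSEC_GSS handle */
       uint32_t len;
       const unsigned char *data;
   };

   struct cb_sec_parms {              /* simplified callback_sec_parms4 */
       enum cb_flavor flavor;
       struct gss_handle from_client; /* gcbp_handle_from_client */
       uint32_t service;              /* gcbp_service */
   };

   /*
    * Return the element of csa_sec_parms the server will use for
    * callbacks.  For RPCSEC_GSS, the callback credential's "handle"
    * field is set to gcbp_handle_from_client and its "service" field
    * to gcbp_service, as described above.
    */
   static const struct cb_sec_parms *
   pick_backchannel_cred(const struct cb_sec_parms *parms, size_t n)
   {
       for (size_t i = 0; i < n; i++) {
           switch (parms[i].flavor) {
           case AUTH_NONE:
           case AUTH_SYS:
               return &parms[i];
           case RPCSEC_GSS:
               if (parms[i].from_client.len != 0)
                   return &parms[i];
               break;
           }
       }
       return NULL;                   /* no usable element offered */
   }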
26400 Within each element of csa_sec_parms, the fore and back RPCSEC_GSS 26401 contexts MUST share the same GSS context and MUST have the same 26402 seq_window (see Section 5.2.3.1 of RFC 2203 [4]). The fore and 26403 back RPCSEC_GSS context state are independent of each other as far 26404 as the RPCSEC_GSS sequence number (see the seq_num field in the 26405 rpc_gss_cred_t data type of Sections 5 and 5.3.1 of [4]). 26407 If an RPCSEC_GSS handle is using the SSV context (see 26408 Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a 26409 common SSV GSS context, there are security considerations specific 26410 to this situation discussed in Section 2.10.10. 26412 Once the session is created, the first SEQUENCE or CB_SEQUENCE 26413 received on a slot MUST have a sequence ID equal to 1; if not, the 26414 replier MUST return NFS4ERR_SEQ_MISORDERED. 26416 18.36.4. IMPLEMENTATION 26418 To describe a possible implementation, the same notation for client 26419 records introduced in the description of EXCHANGE_ID is used with the 26420 following addition: 26422 clientid_arg: The value of the csa_clientid field of the 26423 CREATE_SESSION4args structure of the current request. 26425 Since CREATE_SESSION is a non-idempotent operation, we need to 26426 consider the possibility that retries may occur as a result of a 26427 client restart, network partition, malfunctioning router, etc. For 26428 each client ID created by EXCHANGE_ID, the server maintains a 26429 separate reply cache (called the CREATE_SESSION reply cache) similar 26430 to the session reply cache used for SEQUENCE operations, with two 26431 distinctions. 26433 * First, this is a reply cache just for detecting and processing 26434 CREATE_SESSION requests for a given client ID. 26436 * Second, the size of the client ID reply cache is of one slot (and 26437 as a result, the CREATE_SESSION request does not carry a slot 26438 number). This means that at most one CREATE_SESSION request for a 26439 given client ID can be outstanding. 26441 As previously stated, CREATE_SESSION can be sent with or without a 26442 preceding SEQUENCE operation. Even if a SEQUENCE precedes 26443 CREATE_SESSION, the server MUST maintain the CREATE_SESSION reply 26444 cache, which is separate from the reply cache for the session 26445 associated with a SEQUENCE. If CREATE_SESSION was originally sent by 26446 itself, the client MAY send a retry of the CREATE_SESSION operation 26447 within a COMPOUND preceded by a SEQUENCE. If CREATE_SESSION was 26448 originally sent in a COMPOUND that started with a SEQUENCE, then the 26449 client SHOULD send a retry in a COMPOUND that starts with a SEQUENCE 26450 that has the same session ID as the SEQUENCE of the original request. 26451 However, the client MAY send a retry in a COMPOUND that either has no 26452 preceding SEQUENCE, or has a preceding SEQUENCE that refers to a 26453 different session than the original CREATE_SESSION. This might be 26454 necessary if the client sends a CREATE_SESSION in a COMPOUND preceded 26455 by a SEQUENCE with session ID X, and session X no longer exists. 26456 Regardless, any retry of CREATE_SESSION, with or without a preceding 26457 SEQUENCE, MUST use the same value of csa_sequence as the original. 26459 After the client received a reply to an EXCHANGE_ID operation that 26460 contains a new, unconfirmed client ID, the server expects the client 26461 to follow with a CREATE_SESSION operation to confirm the client ID. 
26462 The server expects the value of csa_sequenceid in the arguments to that 26463 CREATE_SESSION to be equal to the value of the field eir_sequenceid 26464 that was returned in the results of the EXCHANGE_ID that returned the 26465 unconfirmed client ID. Before the server replies to that EXCHANGE_ID 26466 operation, it initializes the client ID slot to be equal to 26467 eir_sequenceid - 1 (accounting for underflow), and records a 26468 contrived CREATE_SESSION result with a "cached" result of 26469 NFS4ERR_SEQ_MISORDERED. With the client ID slot thus initialized, 26470 the processing of the CREATE_SESSION operation is divided into four 26471 phases: 26473 1. Client record look up. The server looks up the client ID in its 26474 client record table. If the server contains no records with 26475 client ID equal to clientid_arg, then most likely the client's 26476 state has been purged during a period of inactivity, possibly due 26477 to a loss of connectivity. NFS4ERR_STALE_CLIENTID is returned, 26478 and no changes are made to any client records on the server. 26479 Otherwise, the server goes to phase 2. 26481 2. Sequence ID processing. If csa_sequenceid is equal to the 26482 sequence ID in the client ID's slot, then this is a replay of the 26483 previous CREATE_SESSION request, and the server returns the 26484 cached result. If csa_sequenceid is not equal to the sequence ID 26485 in the slot, and is more than one greater (accounting for 26486 wraparound), then the server returns the error 26487 NFS4ERR_SEQ_MISORDERED, and does not change the slot. If 26488 csa_sequenceid is equal to the slot's sequence ID + 1 (accounting 26489 for wraparound), then the slot's sequence ID is set to 26490 csa_sequenceid, and the CREATE_SESSION processing goes to the 26491 next phase. A subsequent new CREATE_SESSION call over the same 26492 client ID MUST use a csa_sequenceid that is one greater than the 26493 sequence ID in the slot. 26495 3. Client ID confirmation. If this would be the first session for 26496 the client ID, the CREATE_SESSION operation serves to confirm the 26497 client ID. Otherwise, the client ID confirmation phase is 26498 skipped and only the session creation phase occurs. Any case in 26499 which there is more than one record with identical values for 26500 client ID represents a server implementation error. Operation in 26501 the potentially valid cases is summarized as follows. 26503 * Successful Confirmation 26505 If the server has the following unconfirmed record, then 26506 this is the expected confirmation of an unconfirmed record. 26508 { ownerid, verifier, principal_arg, clientid_arg, 26509 unconfirmed } 26511 As noted in Section 18.35.4, the server might also have the 26512 following confirmed record. 26514 { ownerid, old_verifier, principal_arg, old_clientid, 26515 confirmed } 26517 The server schedules the replacement of both records with: 26519 { ownerid, verifier, principal_arg, clientid_arg, confirmed 26520 } 26522 The processing of CREATE_SESSION continues on to session 26523 creation. Once the session is successfully created, the 26524 scheduled client record replacement is committed. If the 26525 session is not successfully created, then no changes are 26526 made to any client records on the server. 26528 * Unsuccessful Confirmation 26530 If the server has the following record, then the client has 26531 changed principals after the previous EXCHANGE_ID request, 26532 or there has been a chance collision between shorthand 26533 client identifiers.
26535 { *, *, old_principal_arg, clientid_arg, * } 26537 Neither of these cases is permissible. Processing stops 26538 and NFS4ERR_CLID_INUSE is returned to the client. No 26539 changes are made to any client records on the server. 26541 4. Session creation. The server confirmed the client ID, either in 26542 this CREATE_SESSION operation, or a previous CREATE_SESSION 26543 operation. The server examines the remaining fields of the 26544 arguments. 26546 The server creates the session by recording the parameter values 26547 used (including whether the CREATE_SESSION4_FLAG_PERSIST flag is 26548 set and has been accepted by the server) and allocating space for 26549 the session reply cache (if there is not enough space, the server 26550 returns NFS4ERR_NOSPC). For each slot in the reply cache, the 26551 server sets the sequence ID to zero, and records an entry 26552 containing a COMPOUND reply with zero operations and the error 26553 NFS4ERR_SEQ_MISORDERED. This way, if the first SEQUENCE request 26554 sent has a sequence ID equal to zero, the server can simply 26555 return what is in the reply cache: NFS4ERR_SEQ_MISORDERED. The 26556 client initializes its reply cache for receiving callbacks in the 26557 same way, and similarly, the first CB_SEQUENCE operation on a 26558 slot after session creation MUST have a sequence ID of one. 26560 If the session state is created successfully, the server 26561 associates the session with the client ID provided by the client. 26563 When a request that had CREATE_SESSION4_FLAG_CONN_RDMA set needs 26564 to be retried, the retry MUST be done on a new connection that is 26565 in non-RDMA mode. If properties of the new connection are 26566 different enough that the arguments to CREATE_SESSION need to 26567 change, then a non-retry MUST be sent. The server will 26568 eventually dispose of any session that was created on the 26569 original connection. 26571 On the backchannel, the client and server might wish to have many 26572 slots, in some cases perhaps more that the fore channel, in order to 26573 deal with the situations where the network link has high latency and 26574 is the primary bottleneck for response to recalls. If so, and if the 26575 client provides too few slots to the backchannel, the server might 26576 limit the number of recallable objects it gives to the client. 26578 Implementing RPCSEC_GSS callback support requires changes to both the 26579 client and server implementations of RPCSEC_GSS. One possible set of 26580 changes includes: 26582 * Adding a data structure that wraps the GSS-API context with a 26583 reference count. 26585 * New functions to increment and decrement the reference count. If 26586 the reference count is decremented to zero, the wrapper data 26587 structure and the GSS-API context it refers to would be freed. 26589 * Change RPCSEC_GSS to create the wrapper data structure upon 26590 receiving GSS-API context from gss_accept_sec_context() and 26591 gss_init_sec_context(). The reference count would be initialized 26592 to 1. 26594 * Adding a function to map an existing RPCSEC_GSS handle to a 26595 pointer to the wrapper data structure. The reference count would 26596 be incremented. 26598 * Adding a function to create a new RPCSEC_GSS handle from a pointer 26599 to the wrapper data structure. The reference count would be 26600 incremented. 26602 * Replacing calls from RPCSEC_GSS that free GSS-API contexts, with 26603 calls to decrement the reference count on the wrapper data 26604 structure. 26606 18.37. 
Operation 44: DESTROY_SESSION - Destroy a Session 26608 18.37.1. ARGUMENT 26610 struct DESTROY_SESSION4args { 26611 sessionid4 dsa_sessionid; 26612 }; 26614 18.37.2. RESULT 26616 struct DESTROY_SESSION4res { 26617 nfsstat4 dsr_status; 26618 }; 26620 18.37.3. DESCRIPTION 26622 The DESTROY_SESSION operation closes the session and discards the 26623 session's reply cache, if any. Any remaining connections associated 26624 with the session are immediately disassociated. If the connection 26625 has no remaining associated sessions, the connection MAY be closed by 26626 the server. Locks, delegations, layouts, wants, and the lease, which 26627 are all tied to the client ID, are not affected by DESTROY_SESSION. 26629 DESTROY_SESSION MUST be invoked on a connection that is associated 26630 with the session being destroyed. In addition, if SP4_MACH_CRED 26631 state protection was specified when the client ID was created, the 26632 RPCSEC_GSS principal that created the session MUST be the one that 26633 destroys the session, using RPCSEC_GSS privacy or integrity. If 26634 SP4_SSV state protection was specified when the client ID was 26635 created, RPCSEC_GSS using the SSV mechanism (Section 2.10.9) MUST be 26636 used, with integrity or privacy. 26638 If the COMPOUND request starts with SEQUENCE, and if the sessionids 26639 specified in SEQUENCE and DESTROY_SESSION are the same, then 26641 * DESTROY_SESSION MUST be the final operation in the COMPOUND 26642 request. 26644 * It is advisable to avoid placing DESTROY_SESSION in a COMPOUND 26645 request with other state-modifying operations, because the 26646 DESTROY_SESSION will destroy the reply cache. 26648 * Because the session and its reply cache are destroyed, a client 26649 that retries the request may receive an error in reply to the 26650 retry, even though the original request was successful. 26652 If the COMPOUND request starts with SEQUENCE, and if the sessionids 26653 specified in SEQUENCE and DESTROY_SESSION are different, then 26654 DESTROY_SESSION can appear in any position of the COMPOUND request 26655 (except for the first position). The two sessionids can belong to 26656 different client IDs. 26658 If the COMPOUND request does not start with SEQUENCE, and if 26659 DESTROY_SESSION is not the sole operation, then server MUST return 26660 NFS4ERR_NOT_ONLY_OP. 26662 If there is a backchannel on the session and the server has 26663 outstanding CB_COMPOUND operations for the session which have not 26664 been replied to, then the server MAY refuse to destroy the session 26665 and return an error. If so, then in the event the backchannel is 26666 down, the server SHOULD return NFS4ERR_CB_PATH_DOWN to inform the 26667 client that the backchannel needs to be repaired before the server 26668 will allow the session to be destroyed. Otherwise, the error 26669 CB_BACK_CHAN_BUSY SHOULD be returned to indicate that there are 26670 CB_COMPOUNDs that need to be replied to. The client SHOULD reply to 26671 all outstanding CB_COMPOUNDs before re-sending DESTROY_SESSION. 26673 18.38. Operation 45: FREE_STATEID - Free Stateid with No Locks 26675 18.38.1. ARGUMENT 26677 struct FREE_STATEID4args { 26678 stateid4 fsa_stateid; 26679 }; 26681 18.38.2. RESULT 26683 struct FREE_STATEID4res { 26684 nfsstat4 fsr_status; 26685 }; 26687 18.38.3. DESCRIPTION 26689 The FREE_STATEID operation is used to free a stateid that no longer 26690 has any associated locks (including opens, byte-range locks, 26691 delegations, and layouts). 
This may be because of client LOCKU 26692 operations or because of server revocation. If there are valid locks 26693 (of any kind) associated with the stateid in question, the error 26694 NFS4ERR_LOCKS_HELD will be returned, and the associated stateid will 26695 not be freed. 26697 When a stateid is freed that had been associated with revoked locks, 26698 by sending the FREE_STATEID operation, the client acknowledges the 26699 loss of those locks. This allows the server, once all such revoked 26700 state is acknowledged, to allow that client again to reclaim locks, 26701 without encountering the edge conditions discussed in Section 8.4.2. 26703 Once a successful FREE_STATEID is done for a given stateid, any 26704 subsequent use of that stateid will result in an NFS4ERR_BAD_STATEID 26705 error. 26707 18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory Delegation 26709 18.39.1. ARGUMENT 26711 typedef nfstime4 attr_notice4; 26713 struct GET_DIR_DELEGATION4args { 26714 /* CURRENT_FH: delegated directory */ 26715 bool gdda_signal_deleg_avail; 26716 bitmap4 gdda_notification_types; 26717 attr_notice4 gdda_child_attr_delay; 26718 attr_notice4 gdda_dir_attr_delay; 26719 bitmap4 gdda_child_attributes; 26720 bitmap4 gdda_dir_attributes; 26721 }; 26723 18.39.2. RESULT 26724 struct GET_DIR_DELEGATION4resok { 26725 verifier4 gddr_cookieverf; 26726 /* Stateid for get_dir_delegation */ 26727 stateid4 gddr_stateid; 26728 /* Which notifications can the server support */ 26729 bitmap4 gddr_notification; 26730 bitmap4 gddr_child_attributes; 26731 bitmap4 gddr_dir_attributes; 26732 }; 26734 enum gddrnf4_status { 26735 GDD4_OK = 0, 26736 GDD4_UNAVAIL = 1 26737 }; 26739 union GET_DIR_DELEGATION4res_non_fatal 26740 switch (gddrnf4_status gddrnf_status) { 26741 case GDD4_OK: 26742 GET_DIR_DELEGATION4resok gddrnf_resok4; 26743 case GDD4_UNAVAIL: 26744 bool gddrnf_will_signal_deleg_avail; 26745 }; 26747 union GET_DIR_DELEGATION4res 26748 switch (nfsstat4 gddr_status) { 26749 case NFS4_OK: 26750 GET_DIR_DELEGATION4res_non_fatal gddr_res_non_fatal4; 26751 default: 26752 void; 26753 }; 26755 18.39.3. DESCRIPTION 26757 The GET_DIR_DELEGATION operation is used by a client to request a 26758 directory delegation. The directory is represented by the current 26759 filehandle. The client also specifies whether it wants the server to 26760 notify it when the directory changes in certain ways by setting one 26761 or more bits in a bitmap. The server may refuse to grant the 26762 delegation. In that case, the server will return 26763 NFS4ERR_DIRDELEG_UNAVAIL. If the server decides to hand out the 26764 delegation, it will return a cookie verifier for that directory. If 26765 the cookie verifier changes when the client is holding the 26766 delegation, the delegation will be recalled unless the client has 26767 asked for notification for this event. 26769 The server will also return a directory delegation stateid, 26770 gddr_stateid, as a result of the GET_DIR_DELEGATION operation. This 26771 stateid will appear in callback messages related to the delegation, 26772 such as notifications and delegation recalls. The client will use 26773 this stateid to return the delegation voluntarily or upon recall. A 26774 delegation is returned by calling the DELEGRETURN operation. 26776 The server might not be able to support notifications of certain 26777 events. 
If the client asks for such notifications, the server MUST 26778 inform the client of its inability to do so as part of the 26779 GET_DIR_DELEGATION reply by not setting the appropriate bits in the 26780 supported notifications bitmask, gddr_notification, contained in the 26781 reply. The server MUST NOT add bits to gddr_notification that the 26782 client did not request. 26784 The GET_DIR_DELEGATION operation can be used for both normal and 26785 named attribute directories. 26787 If the client sets gdda_signal_deleg_avail to TRUE, then it is 26788 registering with the server a "want" for a directory delegation. If 26789 the delegation is not available, and the server supports and will 26790 honor the "want", the results will have 26791 gddrnf_will_signal_deleg_avail set to TRUE and no error will be 26792 indicated on return. If so, the client should expect a future 26793 CB_RECALLABLE_OBJ_AVAIL operation to indicate that a directory 26794 delegation is available. If the server does not wish to honor the 26795 "want" or is not able to do so, it returns the error 26796 NFS4ERR_DIRDELEG_UNAVAIL. If the delegation is immediately 26797 available, the server SHOULD return it with the response to the 26798 operation, rather than via a callback. 26800 When a client makes a request for a directory delegation while it 26801 already holds a directory delegation for that directory (including 26802 the case where it has been recalled but not yet returned by the 26803 client or revoked by the server), the server MUST reply with the 26804 value of gddr_status set to NFS4_OK, the value of gddrnf_status set 26805 to GDD4_UNAVAIL, and the value of gddrnf_will_signal_deleg_avail set 26806 to FALSE. The delegation the client held before the request remains 26807 intact, and its state is unchanged. The current stateid is not 26808 changed (see Section 16.2.3.1.2 for a description of the current 26809 stateid). 26811 18.39.4. IMPLEMENTATION 26813 Directory delegations provide the benefit of improving cache 26814 consistency of namespace information. This is done through 26815 synchronous callbacks. A server must support synchronous callbacks 26816 in order to support directory delegations. In addition to that, 26817 asynchronous notifications provide a way to reduce network traffic as 26818 well as improve client performance in certain conditions. 26820 Notifications are specified in terms of potential changes to the 26821 directory. A client can ask to be notified of events by setting one 26822 or more bits in gdda_notification_types. The client can ask for 26823 notifications on addition of entries to a directory (by setting the 26824 NOTIFY4_ADD_ENTRY bit in gdda_notification_types), notifications on entry 26825 removal (NOTIFY4_REMOVE_ENTRY), renames (NOTIFY4_RENAME_ENTRY), 26826 directory attribute changes (NOTIFY4_CHANGE_DIR_ATTRIBUTES), and 26827 cookie verifier changes (NOTIFY4_CHANGE_COOKIE_VERIFIER) by setting 26828 one or more corresponding bits in the gdda_notification_types field. 26830 The client can also ask for notifications of changes to attributes of 26831 directory entries (NOTIFY4_CHANGE_CHILD_ATTRIBUTES) in order to keep 26832 its attribute cache up to date. However, any changes made to child 26833 attributes do not cause the delegation to be recalled.
If a client 26834 is interested in directory entry caching or negative name caching, it 26835 can set the gdda_notification_types appropriately to its particular 26836 need and the server will notify it of all changes that would 26837 otherwise invalidate its name cache. The kind of notification a 26838 client asks for may depend on the directory size, its rate of change, 26839 and the applications being used to access that directory. The 26840 enumeration of the conditions under which a client might ask for a 26841 notification is out of the scope of this specification. 26843 For attribute notifications, the client will set bits in the 26844 gdda_dir_attributes bitmap to indicate which attributes it wants to 26845 be notified of. If the server does not support notifications for 26846 changes to a certain attribute, it SHOULD NOT set that attribute in 26847 the supported attribute bitmap specified in the reply 26848 (gddr_dir_attributes). The client will also set in the 26849 gdda_child_attributes bitmap the attributes of directory entries it 26850 wants to be notified of, and the server will indicate in 26851 gddr_child_attributes which attributes of directory entries it will 26852 notify the client of. 26854 The client will also let the server know if it wants to get the 26855 notification as soon as the attribute change occurs or after a 26856 certain delay by setting a delay factor; gdda_child_attr_delay is for 26857 attribute changes to directory entries and gdda_dir_attr_delay is for 26858 attribute changes to the directory. If this delay factor is set to 26859 zero, that indicates to the server that the client wants to be 26860 notified of any attribute changes as soon as they occur. If the 26861 delay factor is set to N seconds, the server will make a best-effort 26862 guarantee that attribute updates are synchronized within N seconds. 26863 If the client asks for a delay factor that the server does not 26864 support or that may cause significant resource consumption on the 26865 server by causing the server to send a lot of notifications, the 26866 server should not commit to sending out notifications for attributes 26867 and therefore must not set the appropriate bit in the 26868 gddr_child_attributes and gddr_dir_attributes bitmaps in the 26869 response. 26871 The client MUST use a security tuple (Section 2.6.1) that the 26872 directory or its applicable ancestor (Section 2.6) is exported with. 26873 If not, the server MUST return NFS4ERR_WRONGSEC to the operation that 26874 both precedes GET_DIR_DELEGATION and sets the current filehandle (see 26875 Section 2.6.3.1). 26877 The directory delegation covers all the entries in the directory 26878 except the parent entry. That means if a directory and its parent 26879 both hold directory delegations, any changes to the parent will not 26880 cause a notification to be sent for the child even though the child's 26881 parent entry points to the parent directory. 26883 18.40. Operation 47: GETDEVICEINFO - Get Device Information 26885 18.40.1. ARGUMENT 26887 struct GETDEVICEINFO4args { 26888 deviceid4 gdia_device_id; 26889 layouttype4 gdia_layout_type; 26890 count4 gdia_maxcount; 26891 bitmap4 gdia_notify_types; 26892 }; 26894 18.40.2. 
RESULT 26895 struct GETDEVICEINFO4resok { 26896 device_addr4 gdir_device_addr; 26897 bitmap4 gdir_notification; 26898 }; 26900 union GETDEVICEINFO4res switch (nfsstat4 gdir_status) { 26901 case NFS4_OK: 26902 GETDEVICEINFO4resok gdir_resok4; 26903 case NFS4ERR_TOOSMALL: 26904 count4 gdir_mincount; 26905 default: 26906 void; 26907 }; 26909 18.40.3. DESCRIPTION 26911 The GETDEVICEINFO operation returns pNFS storage device address 26912 information for the specified device ID. The client identifies the 26913 device information to be returned by providing the gdia_device_id and 26914 gdia_layout_type that uniquely identify the device. The client 26915 provides gdia_maxcount to limit the number of bytes for the result. 26916 This maximum size represents all of the data being returned within 26917 the GETDEVICEINFO4resok structure and includes the XDR overhead. The 26918 server may return less data. If the server is unable to return any 26919 information within the gdia_maxcount limit, the error 26920 NFS4ERR_TOOSMALL will be returned. However, if gdia_maxcount is 26921 zero, NFS4ERR_TOOSMALL MUST NOT be returned. 26923 The da_layout_type field of the gdir_device_addr returned by the 26924 server MUST be equal to the gdia_layout_type specified by the client. 26925 If it is not equal, the client SHOULD ignore the response as invalid 26926 and behave as if the server returned an error, even if the client 26927 does have support for the layout type returned. 26929 The client also provides a notification bitmap, gdia_notify_types, 26930 for the device ID mapping notifications that it is interested in 26931 receiving; the server must support device ID notifications for the 26932 notification request to have effect. The notification mask is 26933 composed in the same manner as the bitmap for file attributes 26934 (Section 3.3.7). The numbers of bit positions are listed in the 26935 notify_device_type4 enumeration type (Section 20.12). Only two 26936 enumerated values of notify_device_type4 currently apply to 26937 GETDEVICEINFO: NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE 26938 (see Section 20.12). 26940 The notification bitmap applies only to the specified device ID. If 26941 a client sends a GETDEVICEINFO operation on a deviceID multiple 26942 times, the last notification bitmap is used by the server for 26943 subsequent notifications. If the bitmap is zero or empty, then the 26944 device ID's notifications are turned off. 26946 If the client wants to just update or turn off notifications, it MAY 26947 send a GETDEVICEINFO operation with gdia_maxcount set to zero. In 26948 that event, if the device ID is valid, the reply's da_addr_body field 26949 of the gdir_device_addr field will be of zero length. 26951 If an unknown device ID is given in gdia_device_id, the server 26952 returns NFS4ERR_NOENT. Otherwise, the device address information is 26953 returned in gdir_device_addr. Finally, if the server supports 26954 notifications for device ID mappings, the gdir_notification result 26955 will contain a bitmap of which notifications it will actually send to 26956 the client (via CB_NOTIFY_DEVICEID, see Section 20.12). 26958 If NFS4ERR_TOOSMALL is returned, the results also contain 26959 gdir_mincount. The value of gdir_mincount represents the minimum 26960 size necessary to obtain the device information. 26962 18.40.4. IMPLEMENTATION 26964 Aside from updating or turning off notifications, another use case 26965 for gdia_maxcount being set to zero is to validate a device ID.
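As an illustration of composing the gdia_notify_types argument described in Section 18.40.3, the non-normative C sketch below builds a notification bitmap in the same manner as the file attribute bitmaps (Section 3.3.7): bit N is represented by bit (N mod 32) of word (N / 32). The bitmap representation is simplified to a fixed-size array, and the bit positions are taken from the notify_device_type4 enumeration (Section 20.12) and supplied by the caller.

   /* Non-normative sketch; bitmap4 is simplified to a fixed array. */
   #include <stdint.h>

   #define BITMAP4_WORDS 2

   struct bitmap4_fixed {
       uint32_t word[BITMAP4_WORDS];
   };

   /* Set bit "bit" using the attribute-bitmap encoding; assumes
    * bit < 32 * BITMAP4_WORDS. */
   static void bitmap_set(struct bitmap4_fixed *bm, unsigned int bit)
   {
       bm->word[bit / 32] |= (uint32_t)1 << (bit % 32);
   }

   /*
    * Build gdia_notify_types requesting both the change and delete
    * notifications; change_bit and delete_bit are the bit positions
    * of NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE from the
    * notify_device_type4 enumeration.
    */
   static struct bitmap4_fixed
   make_gdia_notify_types(unsigned int change_bit, unsigned int delete_bit)
   {
       struct bitmap4_fixed bm = { { 0, 0 } };

       bitmap_set(&bm, change_bit);
       bitmap_set(&bm, delete_bit);
       return bm;
   }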
26967 The client SHOULD request a notification for changes or deletion of a 26968 device ID to device address mapping so that the server can allow the 26969 client gracefully use a new mapping, without having pending I/O fail 26970 abruptly, or force layouts using the device ID to be recalled or 26971 revoked. 26973 It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with 26974 CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the 26975 client gets and processes the response to GETDEVICEINFO or 26976 GETDEVICELIST. The analysis of the race leverages the fact that the 26977 server MUST NOT delete a device ID that is referred to by a layout 26978 the client has. 26980 * CB_NOTIFY_DEVICEID deletes a device ID. If the client believes it 26981 has layouts that refer to the device ID, then it is possible that 26982 layouts referring to the deleted device ID have been revoked. The 26983 client should send a TEST_STATEID request using the stateid for 26984 each layout that might have been revoked. If TEST_STATEID 26985 indicates that any layouts have been revoked, the client must 26986 recover from layout revocation as described in Section 12.5.6. If 26987 TEST_STATEID indicates that at least one layout has not been 26988 revoked, the client should send a GETDEVICEINFO operation on the 26989 supposedly deleted device ID to verify that the device ID has been 26990 deleted. 26992 If GETDEVICEINFO indicates that the device ID does not exist, then 26993 the client assumes the server is faulty and recovers by sending an 26994 EXCHANGE_ID operation. If GETDEVICEINFO indicates that the device 26995 ID does exist, then while the server is faulty for sending an 26996 erroneous device ID deletion notification, the degree to which it 26997 is faulty does not require the client to create a new client ID. 26999 If the client does not have layouts that refer to the device ID, 27000 no harm is done. The client should mark the device ID as deleted, 27001 and when GETDEVICEINFO or GETDEVICELIST results are received that 27002 indicate that the device ID has been in fact deleted, the device 27003 ID should be removed from the client's cache. 27005 * CB_NOTIFY_DEVICEID indicates that a device ID's device addressing 27006 mappings have changed. The client should assume that the results 27007 from the in-progress GETDEVICEINFO will be stale for the device ID 27008 once received, and so it should send another GETDEVICEINFO on the 27009 device ID. 27011 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for a File 27012 System 27014 18.41.1. ARGUMENT 27016 struct GETDEVICELIST4args { 27017 /* CURRENT_FH: object belonging to the file system */ 27018 layouttype4 gdla_layout_type; 27020 /* number of deviceIDs to return */ 27021 count4 gdla_maxdevices; 27023 nfs_cookie4 gdla_cookie; 27024 verifier4 gdla_cookieverf; 27025 }; 27027 18.41.2. RESULT 27028 struct GETDEVICELIST4resok { 27029 nfs_cookie4 gdlr_cookie; 27030 verifier4 gdlr_cookieverf; 27031 deviceid4 gdlr_deviceid_list<>; 27032 bool gdlr_eof; 27033 }; 27035 union GETDEVICELIST4res switch (nfsstat4 gdlr_status) { 27036 case NFS4_OK: 27037 GETDEVICELIST4resok gdlr_resok4; 27038 default: 27039 void; 27040 }; 27042 18.41.3. DESCRIPTION 27044 This operation is used by the client to enumerate all of the device 27045 IDs that a server's file system uses. 
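The non-normative C sketch below shows the READDIR-like enumeration loop implied by the cookie and cookie verifier fields in the ARGUMENT and RESULT structures above; the cookie handling rules are given in the remainder of this section. The send_getdevicelist() helper and the simplified argument and result structures are hypothetical stand-ins for an implementation's RPC machinery and XDR-generated types.

   /* Non-normative sketch; types and helper are illustrative only. */
   #include <stdint.h>
   #include <stdbool.h>

   struct gdl_args {
       uint32_t layout_type;     /* gdla_layout_type */
       uint32_t maxdevices;      /* gdla_maxdevices; MUST NOT be zero */
       uint64_t cookie;          /* gdla_cookie */
       uint64_t cookieverf;      /* gdla_cookieverf (simplified) */
   };

   struct gdl_res {
       uint64_t cookie;          /* gdlr_cookie */
       uint64_t cookieverf;      /* gdlr_cookieverf (simplified) */
       bool     eof;             /* gdlr_eof */
       /* gdlr_deviceid_list omitted for brevity */
   };

   /* Hypothetical helper: sends the COMPOUND and decodes the reply. */
   extern int send_getdevicelist(const struct gdl_args *args,
                                 struct gdl_res *res);

   static int enumerate_device_ids(uint32_t layout_type)
   {
       struct gdl_args args = { layout_type, 64, 0, 0 };
       struct gdl_res res;

       do {
           int status = send_getdevicelist(&args, &res);
           if (status != 0)
               return status;
           /* ... record the device IDs in this batch ... */
           args.cookie = res.cookie;        /* resume after this batch */
           args.cookieverf = res.cookieverf;
       } while (!res.eof);

       return 0;
   }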
27047 The client provides a current filehandle of a file object that 27048 belongs to the file system (i.e., all file objects sharing the same 27049 fsid as that of the current filehandle) and the layout type in 27050 gdla_layout_type. Since this operation might require multiple calls 27051 to enumerate all the device IDs (and is thus similar to the READDIR 27052 (Section 18.23) operation), the client also provides gdla_cookie and 27053 gdla_cookieverf to specify the current cursor position in the list. 27054 When the client wants to read from the beginning of the file system's 27055 device mappings, it sets gdla_cookie to zero. The field 27056 gdla_cookieverf MUST be ignored by the server when gdla_cookie is 27057 zero. The client provides gdla_maxdevices to limit the number of 27058 device IDs in the result. If gdla_maxdevices is zero, the server 27059 MUST return NFS4ERR_INVAL. The server MAY return fewer device IDs. 27061 The successful response to the operation will contain the cookie, 27062 gdlr_cookie, and the cookie verifier, gdlr_cookieverf, to be used on 27063 the subsequent GETDEVICELIST. A gdlr_eof value of TRUE signifies 27064 that there are no remaining entries in the server's device list. 27065 Each element of gdlr_deviceid_list contains a device ID. 27067 18.41.4. IMPLEMENTATION 27069 An example of the use of this operation is for pNFS clients and 27070 servers that use LAYOUT4_BLOCK_VOLUME layouts. In these environments 27071 it may be helpful for a client to determine device accessibility upon 27072 first file system access. 27074 18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a Layout 27075 18.42.1. ARGUMENT 27077 union newtime4 switch (bool nt_timechanged) { 27078 case TRUE: 27079 nfstime4 nt_time; 27080 case FALSE: 27081 void; 27082 }; 27084 union newoffset4 switch (bool no_newoffset) { 27085 case TRUE: 27086 offset4 no_offset; 27087 case FALSE: 27088 void; 27089 }; 27091 struct LAYOUTCOMMIT4args { 27092 /* CURRENT_FH: file */ 27093 offset4 loca_offset; 27094 length4 loca_length; 27095 bool loca_reclaim; 27096 stateid4 loca_stateid; 27097 newoffset4 loca_last_write_offset; 27098 newtime4 loca_time_modify; 27099 layoutupdate4 loca_layoutupdate; 27100 }; 27102 18.42.2. RESULT 27104 union newsize4 switch (bool ns_sizechanged) { 27105 case TRUE: 27106 length4 ns_size; 27107 case FALSE: 27108 void; 27109 }; 27111 struct LAYOUTCOMMIT4resok { 27112 newsize4 locr_newsize; 27113 }; 27115 union LAYOUTCOMMIT4res switch (nfsstat4 locr_status) { 27116 case NFS4_OK: 27117 LAYOUTCOMMIT4resok locr_resok4; 27118 default: 27119 void; 27120 }; 27122 18.42.3. DESCRIPTION 27124 The LAYOUTCOMMIT operation commits changes in the layout represented 27125 by the current filehandle, client ID (derived from the session ID in 27126 the preceding SEQUENCE operation), byte-range, and stateid. Since 27127 layouts are sub-dividable, a smaller portion of a layout, retrieved 27128 via LAYOUTGET, can be committed. The byte-range being committed is 27129 specified through loca_offset and loca_length. This 27130 byte-range MUST overlap with one or more existing layouts previously 27131 granted via LAYOUTGET (Section 18.43), each with an iomode of 27132 LAYOUTIOMODE4_RW. In the case where the iomode of any held layout 27133 segment is not LAYOUTIOMODE4_RW, the server should return the error 27134 NFS4ERR_BAD_IOMODE.
For the case where the client does not hold 27135 matching layout segment(s) for the defined byte-range, the server 27136 should return the error NFS4ERR_BAD_LAYOUT. 27138 The LAYOUTCOMMIT operation indicates that the client has completed 27139 writes using a layout obtained by a previous LAYOUTGET. The client 27140 may have only written a subset of the data range it previously 27141 requested. LAYOUTCOMMIT allows it to commit or discard provisionally 27142 allocated space and to update the server with a new end-of-file. The 27143 layout referenced by LAYOUTCOMMIT is still valid after the operation 27144 completes and can be continued to be referenced by the client ID, 27145 filehandle, byte-range, layout type, and stateid. 27147 If the loca_reclaim field is set to TRUE, this indicates that the 27148 client is attempting to commit changes to a layout after the restart 27149 of the metadata server during the metadata server's recovery grace 27150 period (see Section 12.7.4). This type of request may be necessary 27151 when the client has uncommitted writes to provisionally allocated 27152 byte-ranges of a file that were sent to the storage devices before 27153 the restart of the metadata server. In this case, the layout 27154 provided by the client MUST be a subset of a writable layout that the 27155 client held immediately before the restart of the metadata server. 27156 The value of the field loca_stateid MUST be a value that the metadata 27157 server returned before it restarted. The metadata server is free to 27158 accept or reject this request based on its own internal metadata 27159 consistency checks. If the metadata server finds that the layout 27160 provided by the client does not pass its consistency checks, it MUST 27161 reject the request with the status NFS4ERR_RECLAIM_BAD. The 27162 successful completion of the LAYOUTCOMMIT request with loca_reclaim 27163 set to TRUE does NOT provide the client with a layout for the file. 27164 It simply commits the changes to the layout specified in the 27165 loca_layoutupdate field. To obtain a layout for the file, the client 27166 must send a LAYOUTGET request to the server after the server's grace 27167 period has expired. If the metadata server receives a LAYOUTCOMMIT 27168 request with loca_reclaim set to TRUE when the metadata server is not 27169 in its recovery grace period, it MUST reject the request with the 27170 status NFS4ERR_NO_GRACE. 27172 Setting the loca_reclaim field to TRUE is required if and only if the 27173 committed layout was acquired before the metadata server restart. If 27174 the client is committing a layout that was acquired during the 27175 metadata server's grace period, it MUST set the "reclaim" field to 27176 FALSE. 27178 The loca_stateid is a layout stateid value as returned by previously 27179 successful layout operations (see Section 12.5.3). 27181 The loca_last_write_offset field specifies the offset of the last 27182 byte written by the client previous to the LAYOUTCOMMIT. Note that 27183 this value is never equal to the file's size (at most it is one byte 27184 less than the file's size) and MUST be less than or equal to 27185 NFS4_MAXFILEOFF. Also, loca_last_write_offset MUST overlap the range 27186 described by loca_offset and loca_length. The metadata server may 27187 use this information to determine whether the file's size needs to be 27188 updated. 
If the metadata server updates the file's size as the 27189 result of the LAYOUTCOMMIT operation, it must return the new size 27190 (locr_newsize.ns_size) as part of the results. 27192 The loca_time_modify field allows the client to suggest a 27193 modification time it would like the metadata server to set. The 27194 metadata server may use the suggestion or it may use the time of the 27195 LAYOUTCOMMIT operation to set the modification time. If the metadata 27196 server uses the client-provided modification time, it should ensure 27197 that time does not flow backwards. If the client wants to force the 27198 metadata server to set an exact time, the client should use a SETATTR 27199 operation in a COMPOUND right after LAYOUTCOMMIT. See Section 12.5.4 27200 for more details. If the client desires the resultant modification 27201 time, it should construct the COMPOUND so that a GETATTR follows the 27202 LAYOUTCOMMIT. 27204 The loca_layoutupdate argument to LAYOUTCOMMIT provides a mechanism 27205 for a client to provide layout-specific updates to the metadata 27206 server. For example, the layout update can describe what byte-ranges 27207 of the original layout have been used and what byte-ranges can be 27208 deallocated. There is no NFSv4.1 file layout-specific layoutupdate4 27209 structure. 27211 The layout information is more verbose for block devices than for 27212 objects and files because the latter two hide the details of block 27213 allocation behind their storage protocols. At the minimum, the 27214 client needs to communicate changes to the end-of-file location back 27215 to the server, and, if desired, its view of the file's modification 27216 time. For block/volume layouts, it needs to specify precisely which 27217 blocks have been used. 27219 If the layout identified in the arguments does not exist, the error 27220 NFS4ERR_BADLAYOUT is returned. The layout being committed may also 27221 be rejected if it does not correspond to an existing layout with an 27222 iomode of LAYOUTIOMODE4_RW. 27224 On success, the current filehandle retains its value and the current 27225 stateid retains its value. 27227 18.42.4. IMPLEMENTATION 27229 The client MAY also use LAYOUTCOMMIT with the loca_reclaim field set 27230 to TRUE to convey hints to modified file attributes or to report 27231 layout-type specific information such as I/O errors for object-based 27232 storage layouts, as normally done during normal operation. Doing so 27233 may help the metadata server to recover files more efficiently after 27234 restart. For example, some file system implementations may require 27235 expansive recovery of file system objects if the metadata server does 27236 not get a positive indication from all clients holding a 27237 LAYOUTIOMODE4_RW layout that they have successfully completed all 27238 their writes. Sending a LAYOUTCOMMIT (if required) and then 27239 following with LAYOUTRETURN can provide such an indication and allow 27240 for graceful and efficient recovery. 27242 If loca_reclaim is TRUE, the metadata server is free to either 27243 examine or ignore the value in the field loca_stateid. The metadata 27244 server implementation might or might not encode in its layout stateid 27245 information that allows the metadata server to perform a consistency 27246 check on the LAYOUTCOMMIT request. 27248 18.43. Operation 50: LAYOUTGET - Get Layout Information 27250 18.43.1. 
ARGUMENT 27252 struct LAYOUTGET4args { 27253 /* CURRENT_FH: file */ 27254 bool loga_signal_layout_avail; 27255 layouttype4 loga_layout_type; 27256 layoutiomode4 loga_iomode; 27257 offset4 loga_offset; 27258 length4 loga_length; 27259 length4 loga_minlength; 27260 stateid4 loga_stateid; 27261 count4 loga_maxcount; 27262 }; 27264 18.43.2. RESULT 27266 struct LAYOUTGET4resok { 27267 bool logr_return_on_close; 27268 stateid4 logr_stateid; 27269 layout4 logr_layout<>; 27270 }; 27272 union LAYOUTGET4res switch (nfsstat4 logr_status) { 27273 case NFS4_OK: 27274 LAYOUTGET4resok logr_resok4; 27275 case NFS4ERR_LAYOUTTRYLATER: 27276 bool logr_will_signal_layout_avail; 27277 default: 27278 void; 27279 }; 27281 18.43.3. DESCRIPTION 27283 The LAYOUTGET operation requests a layout from the metadata server 27284 for reading or writing the file given by the filehandle at the byte- 27285 range specified by offset and length. Layouts are identified by the 27286 client ID (derived from the session ID in the preceding SEQUENCE 27287 operation), current filehandle, layout type (loga_layout_type), and 27288 the layout stateid (loga_stateid). The use of the loga_iomode field 27289 depends upon the layout type, but should reflect the client's data 27290 access intent. 27292 If the metadata server is in a grace period, and does not persist 27293 layouts and device ID to device address mappings, then it MUST return 27294 NFS4ERR_GRACE (see Section 8.4.2.1). 27296 The LAYOUTGET operation returns layout information for the specified 27297 byte-range: a layout. The client actually specifies two ranges, both 27298 starting at the offset in the loga_offset field. The first range is 27299 between loga_offset and loga_offset + loga_length - 1 inclusive. 27300 This range indicates the desired range the client wants the layout to 27301 cover. The second range is between loga_offset and loga_offset + 27302 loga_minlength - 1 inclusive. This range indicates the required 27303 range the client needs the layout to cover. Thus, loga_minlength 27304 MUST be less than or equal to loga_length. 27306 When a length field is set to NFS4_UINT64_MAX, this indicates a 27307 desire (when loga_length is NFS4_UINT64_MAX) or requirement (when 27308 loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset 27309 through the end-of-file, regardless of the file's length. 27311 The following rules govern the relationships among, and the minima 27312 of, loga_length, loga_minlength, and loga_offset. 27314 * If loga_length is less than loga_minlength, the metadata server 27315 MUST return NFS4ERR_INVAL. 27317 * If loga_minlength is zero, this is an indication to the metadata 27318 server that the client desires any layout at offset loga_offset or 27319 less that the metadata server has "readily available". Readily is 27320 subjective, and depends on the layout type and the pNFS server 27321 implementation. For example, some metadata servers might have to 27322 pre-allocate stable storage when they receive a request for a 27323 range of a file that goes beyond the file's current length. If 27324 loga_minlength is zero and loga_length is greater than zero, this 27325 tells the metadata server what range of the layout the client 27326 would prefer to have. If loga_length and loga_minlength are both 27327 zero, then the client is indicating that it desires a layout of 27328 any length with the ending offset of the range no less than the 27329 value specified loga_offset, and the starting offset at or below 27330 loga_offset. 
If the metadata server does not have a layout that 27331 is readily available, then it MUST return NFS4ERR_LAYOUTTRYLATER. 27333 * If the sum of loga_offset and loga_minlength exceeds 27334 NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the 27335 error NFS4ERR_INVAL MUST result. 27337 * If the sum of loga_offset and loga_length exceeds NFS4_UINT64_MAX, 27338 and loga_length is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL 27339 MUST result. 27341 After the metadata server has performed the above checks on 27342 loga_offset, loga_minlength, and loga_offset, the metadata server 27343 MUST return a layout according to the rules in Table 22. 27345 Acceptable layouts based on loga_minlength. Note: u64m = 27346 NFS4_UINT64_MAX; a_off = loga_offset; a_minlen = loga_minlength. 27348 +===========+============+==========+==========+===================+ 27349 | Layout | Layout | Layout | Layout | Layout length of | 27350 | iomode of | a_minlen | iomode | offset | reply | 27351 | request | of request | of reply | of reply | | 27352 +===========+============+==========+==========+===================+ 27353 | _READ | u64m | MAY be | MUST be | MUST be >= file | 27354 | | | _READ | <= a_off | length - layout | 27355 | | | | | offset | 27356 +-----------+------------+----------+----------+-------------------+ 27357 | _READ | u64m | MAY be | MUST be | MUST be u64m | 27358 | | | _RW | <= a_off | | 27359 +-----------+------------+----------+----------+-------------------+ 27360 | _READ | > 0 and < | MAY be | MUST be | MUST be >= | 27361 | | u64m | _READ | <= a_off | MIN(file length, | 27362 | | | | | a_minlen + a_off) | 27363 | | | | | - layout offset | 27364 +-----------+------------+----------+----------+-------------------+ 27365 | _READ | > 0 and < | MAY be | MUST be | MUST be >= a_off | 27366 | | u64m | _RW | <= a_off | - layout offset + | 27367 | | | | | a_minlen | 27368 +-----------+------------+----------+----------+-------------------+ 27369 | _READ | 0 | MAY be | MUST be | MUST be > 0 | 27370 | | | _READ | <= a_off | | 27371 +-----------+------------+----------+----------+-------------------+ 27372 | _READ | 0 | MAY be | MUST be | MUST be > 0 | 27373 | | | _RW | <= a_off | | 27374 +-----------+------------+----------+----------+-------------------+ 27375 | _RW | u64m | MUST be | MUST be | MUST be u64m | 27376 | | | _RW | <= a_off | | 27377 +-----------+------------+----------+----------+-------------------+ 27378 | _RW | > 0 and < | MUST be | MUST be | MUST be >= a_off | 27379 | | u64m | _RW | <= a_off | - layout offset + | 27380 | | | | | a_minlen | 27381 +-----------+------------+----------+----------+-------------------+ 27382 | _RW | 0 | MUST be | MUST be | MUST be > 0 | 27383 | | | _RW | <= a_off | | 27384 +-----------+------------+----------+----------+-------------------+ 27386 Table 22 27388 If loga_minlength is not zero and the metadata server cannot return a 27389 layout according to the rules in Table 22, then the metadata server 27390 MUST return the error NFS4ERR_BADLAYOUT. If loga_minlength is zero 27391 and the metadata server cannot or will not return a layout according 27392 to the rules in Table 22, then the metadata server MUST return the 27393 error NFS4ERR_LAYOUTTRYLATER. Assuming that loga_length is greater 27394 than loga_minlength or equal to zero, the metadata server SHOULD 27395 return a layout according to the rules in Table 23. 27397 Desired layouts based on loga_length. The rules of Table 22 MUST be 27398 applied first. 
Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; 27399 a_len = loga_length. 27401 +===============+==========+==========+==========+================+ 27402 | Layout iomode | Layout | Layout | Layout | Layout length | 27403 | of request | a_len of | iomode | offset | of reply | 27404 | | request | of reply | of reply | | 27405 +===============+==========+==========+==========+================+ 27406 | _READ | u64m | MAY be | MUST be | SHOULD be u64m | 27407 | | | _READ | <= a_off | | 27408 +---------------+----------+----------+----------+----------------+ 27409 | _READ | u64m | MAY be | MUST be | SHOULD be u64m | 27410 | | | _RW | <= a_off | | 27411 +---------------+----------+----------+----------+----------------+ 27412 | _READ | > 0 and | MAY be | MUST be | SHOULD be >= | 27413 | | < u64m | _READ | <= a_off | a_off - layout | 27414 | | | | | offset + a_len | 27415 +---------------+----------+----------+----------+----------------+ 27416 | _READ | > 0 and | MAY be | MUST be | SHOULD be >= | 27417 | | < u64m | _RW | <= a_off | a_off - layout | 27418 | | | | | offset + a_len | 27419 +---------------+----------+----------+----------+----------------+ 27420 | _READ | 0 | MAY be | MUST be | SHOULD be > | 27421 | | | _READ | <= a_off | a_off - layout | 27422 | | | | | offset | 27423 +---------------+----------+----------+----------+----------------+ 27424 | _READ | 0 | MAY be | MUST be | SHOULD be > | 27425 | | | _READ | <= a_off | a_off - layout | 27426 | | | | | offset | 27427 +---------------+----------+----------+----------+----------------+ 27428 | _RW | u64m | MUST be | MUST be | SHOULD be u64m | 27429 | | | _RW | <= a_off | | 27430 +---------------+----------+----------+----------+----------------+ 27431 | _RW | > 0 and | MUST be | MUST be | SHOULD be >= | 27432 | | < u64m | _RW | <= a_off | a_off - layout | 27433 | | | | | offset + a_len | 27434 +---------------+----------+----------+----------+----------------+ 27435 | _RW | 0 | MUST be | MUST be | SHOULD be > | 27436 | | | _RW | <= a_off | a_off - layout | 27437 | | | | | offset | 27438 +---------------+----------+----------+----------+----------------+ 27440 Table 23 27442 The loga_stateid field specifies a valid stateid. If a layout is not 27443 currently held by the client, the loga_stateid field represents a 27444 stateid reflecting the correspondingly valid open, byte-range lock, 27445 or delegation stateid. Once a layout is held on the file by the 27446 client, the loga_stateid field MUST be a stateid as returned from a 27447 previous LAYOUTGET or LAYOUTRETURN operation or provided by a 27448 CB_LAYOUTRECALL operation (see Section 12.5.3). 27450 The loga_maxcount field specifies the maximum layout size (in bytes) 27451 that the client can handle. If the size of the layout structure 27452 exceeds the size specified by maxcount, the metadata server will 27453 return the NFS4ERR_TOOSMALL error. 27455 The returned layout is expressed as an array, logr_layout, with each 27456 element of type layout4. If a file has a single striping pattern, 27457 then logr_layout SHOULD contain just one entry. Otherwise, if the 27458 requested range overlaps more than one striping pattern, logr_layout 27459 will contain the required number of entries. The elements of 27460 logr_layout MUST be sorted in ascending order of the value of the 27461 lo_offset field of each element. There MUST be no gaps or overlaps 27462 in the range between two successive elements of logr_layout. 
The 27463 lo_iomode field in each element of logr_layout MUST be the same. 27465 Table 22 and Table 23 both refer to a returned layout iomode, offset, 27466 and length. Because the returned layout is encoded in the 27467 logr_layout array, more description is required. 27469 iomode The value of the returned layout iomode listed in Table 22 27470 and Table 23 is equal to the value of the lo_iomode field in each 27471 element of logr_layout. As shown in Table 22 and Table 23, the 27472 metadata server MAY return a layout with an lo_iomode different 27473 from the requested iomode (field loga_iomode of the request). If 27474 it does so, it MUST ensure that the lo_iomode is more permissive 27475 than the loga_iomode requested. For example, this behavior allows 27476 an implementation to upgrade LAYOUTIOMODE4_READ requests to 27477 LAYOUTIOMODE4_RW requests at its discretion, within the limits of 27478 the layout type specific protocol. A lo_iomode of either 27479 LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned. 27481 offset The value of the returned layout offset listed in Table 22 27482 and Table 23 is always equal to the lo_offset field of the first 27483 element logr_layout. 27485 length When setting the value of the returned layout length, the 27486 situation is complicated by the possibility that the special 27487 layout length value NFS4_UINT64_MAX is involved. For a 27488 logr_layout array of N elements, the lo_length field in the first 27489 N-1 elements MUST NOT be NFS4_UINT64_MAX. The lo_length field of 27490 the last element of logr_layout can be NFS4_UINT64_MAX under some 27491 conditions as described in the following list. 27493 * If an applicable rule of Table 22 states that the metadata 27494 server MUST return a layout of length NFS4_UINT64_MAX, then the 27495 lo_length field of the last element of logr_layout MUST be 27496 NFS4_UINT64_MAX. 27498 * If an applicable rule of Table 22 states that the metadata 27499 server MUST NOT return a layout of length NFS4_UINT64_MAX, then 27500 the lo_length field of the last element of logr_layout MUST NOT 27501 be NFS4_UINT64_MAX. 27503 * If an applicable rule of Table 23 states that the metadata 27504 server SHOULD return a layout of length NFS4_UINT64_MAX, then 27505 the lo_length field of the last element of logr_layout SHOULD 27506 be NFS4_UINT64_MAX. 27508 * When the value of the returned layout length of Table 22 and 27509 Table 23 is not NFS4_UINT64_MAX, then the returned layout 27510 length is equal to the sum of the lo_length fields of each 27511 element of logr_layout. 27513 The logr_return_on_close result field is a directive to return the 27514 layout before closing the file. When the metadata server sets this 27515 return value to TRUE, it MUST be prepared to recall the layout in the 27516 case in which the client fails to return the layout before close. 27517 For the metadata server that knows a layout must be returned before a 27518 close of the file, this return value can be used to communicate the 27519 desired behavior to the client and thus remove one extra step from 27520 the client's and metadata server's interaction. 27522 The logr_stateid stateid is returned to the client for use in 27523 subsequent layout related operations. See Sections 8.2, 12.5.3, and 27524 12.5.5.2 for a further discussion and requirements. 27526 The format of the returned layout (lo_content) is specific to the 27527 layout type. 
The value of the layout type (lo_content.loc_type) for 27528 each of the elements of the array of layouts returned by the metadata 27529 server (logr_layout) MUST be equal to the loga_layout_type specified 27530 by the client. If it is not equal, the client SHOULD ignore the 27531 response as invalid and behave as if the metadata server returned an 27532 error, even if the client does have support for the layout type 27533 returned. 27535 If neither the requested file nor its containing file system support 27536 layouts, the metadata server MUST return NFS4ERR_LAYOUTUNAVAILABLE. 27537 If the layout type is not supported, the metadata server MUST return 27538 NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout 27539 matches the client provided layout identification, the metadata 27540 server MUST return NFS4ERR_BADLAYOUT. If an invalid loga_iomode is 27541 specified, or a loga_iomode of LAYOUTIOMODE4_ANY is specified, the 27542 metadata server MUST return NFS4ERR_BADIOMODE. 27544 If the layout for the file is unavailable due to transient 27545 conditions, e.g., file sharing prohibits layouts, the metadata server 27546 MUST return NFS4ERR_LAYOUTTRYLATER. 27548 If the layout request is rejected due to an overlapping layout 27549 recall, the metadata server MUST return NFS4ERR_RECALLCONFLICT. See 27550 Section 12.5.5.2 for details. 27552 If the layout conflicts with a mandatory byte-range lock held on the 27553 file, and if the storage devices have no method of enforcing 27554 mandatory locks, other than through the restriction of layouts, the 27555 metadata server SHOULD return NFS4ERR_LOCKED. 27557 If client sets loga_signal_layout_avail to TRUE, then it is 27558 registering with the client a "want" for a layout in the event the 27559 layout cannot be obtained due to resource exhaustion. If the 27560 metadata server supports and will honor the "want", the results will 27561 have logr_will_signal_layout_avail set to TRUE. If so, the client 27562 should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a 27563 layout is available. 27565 On success, the current filehandle retains its value and the current 27566 stateid is updated to match the value as returned in the results. 27568 18.43.4. IMPLEMENTATION 27570 Typically, LAYOUTGET will be called as part of a COMPOUND request 27571 after an OPEN operation and results in the client having location 27572 information for the file. This requires that loga_stateid be set to 27573 the special stateid that tells the metadata server to use the current 27574 stateid, which is set by OPEN (see Section 16.2.3.1.2). A client may 27575 also hold a layout across multiple OPENs. The client specifies a 27576 layout type that limits what kind of layout the metadata server will 27577 return. This prevents metadata servers from granting layouts that 27578 are unusable by the client. 27580 As indicated by Table 22 and Table 23, the specification of LAYOUTGET 27581 allows a pNFS client and server considerable flexibility. A pNFS 27582 client can take several strategies for sending LAYOUTGET. Some 27583 examples are as follows. 27585 * If LAYOUTGET is preceded by OPEN in the same COMPOUND request and 27586 the OPEN requests OPEN4_SHARE_ACCESS_READ access, the client might 27587 opt to request a _READ layout with loga_offset set to zero, 27588 loga_minlength set to zero, and loga_length set to 27589 NFS4_UINT64_MAX. 
If the file has space allocated to it, that 27590 space is striped over one or more storage devices, and there is 27591 either no conflicting layout or the concept of a conflicting 27592 layout does not apply to the pNFS server's layout type or 27593 implementation, then the metadata server might return a layout 27594 with a starting offset of zero, and a length equal to the length 27595 of the file, if not NFS4_UINT64_MAX. If the length of the file is 27596 not a multiple of the pNFS server's stripe width (see Section 13.2 27597 for a formal definition), the metadata server might round up the 27598 returned layout's length. 27600 * If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and 27601 the OPEN requests OPEN4_SHARE_ACCESS_WRITE access and does not 27602 truncate the file, the client might opt to request a _RW layout 27603 with loga_offset set to zero, loga_minlength set to zero, and 27604 loga_length set to the file's current length (if known), or 27605 NFS4_UINT64_MAX. As with the previous case, under some conditions 27606 the metadata server might return a layout that covers the entire 27607 length of the file or beyond. 27609 * This strategy is as above, but the OPEN truncates the file. In 27610 this case, the client might anticipate it will be writing to the 27611 file from offset zero, and so loga_offset and loga_minlength are 27612 set to zero, and loga_length is set to the value of 27613 threshold4_write_iosize. The metadata server might return a 27614 layout from offset zero with a length at least as long as 27615 threshold4_write_iosize. 27617 * A process on the client invokes a request to read from offset 27618 10000 for length 50000. The client is using buffered I/O, and has 27619 buffer sizes of 4096 bytes. The client intends to map the request 27620 of the process into a series of READ requests starting at offset 27621 8192. The end offset needs to be higher than 10000 + 50000 = 27622 60000, and the next offset that is a multiple of 4096 is 61440. 27623 The difference between 61440 and that starting offset of the 27624 layout is 53248 (which is the product of 4096 and 15). The value 27625 of threshold4_read_iosize is less than 53248, so the client sends 27626 a LAYOUTGET request with loga_offset set to 8192, loga_minlength 27627 set to 53248, and loga_length set to the file's length (if known) 27628 minus 8192 or NFS4_UINT64_MAX (if the file's length is not known). 27629 Since this LAYOUTGET request exceeds the metadata server's 27630 threshold, it grants the layout, possibly with an initial offset 27631 of zero, with an end offset of at least 8192 + 53248 - 1 = 61439, 27632 but preferably a layout with an offset aligned on the stripe width 27633 and a length that is a multiple of the stripe width. 27635 * This strategy is as above, but the client is not using buffered I/ 27636 O, and instead all internal I/O requests are sent directly to the 27637 server. The LAYOUTGET request has loga_offset equal to 10000 and 27638 loga_minlength set to 50000. The value of loga_length is set to 27639 the length of the file. The metadata server is free to return a 27640 layout that fully overlaps the requested range, with a starting 27641 offset and length aligned on the stripe width. 27643 * Again, a process on the client invokes a request to read from 27644 offset 10000 for length 50000 (i.e. a range with a starting offset 27645 of 10000 and an ending offset of 69999), and buffered I/O is in 27646 use. 
The client is expecting that the server might not be able to 27647 return the layout for the full I/O range. The client intends to 27648 map the request of the process into a series of thirteen READ 27649 requests starting at offset 8192, each with length 4096, with a 27650 total length of 53248 (which equals 13 * 4096), which fully 27651 contains the range that client's process wants to read. Because 27652 the value of threshold4_read_iosize is equal to 4096, it is 27653 practical and reasonable for the client to use several LAYOUTGET 27654 operations to complete the series of READs. The client sends a 27655 LAYOUTGET request with loga_offset set to 8192, loga_minlength set 27656 to 4096, and loga_length set to 53248 or higher. The server will 27657 grant a layout possibly with an initial offset of zero, with an 27658 end offset of at least 8192 + 4096 - 1 = 12287, but preferably a 27659 layout with an offset aligned on the stripe width and a length 27660 that is a multiple of the stripe width. This will allow the 27661 client to make forward progress, possibly sending more LAYOUTGET 27662 operations for the remainder of the range. 27664 * An NFS client detects a sequential read pattern, and so sends a 27665 LAYOUTGET operation that goes well beyond any current or pending 27666 read requests to the server. The server might likewise detect 27667 this pattern, and grant the LAYOUTGET request. Once the client 27668 reads from an offset of the file that represents 50% of the way 27669 through the range of the last layout it received, in order to 27670 avoid stalling I/O that would wait for a layout, the client sends 27671 more operations from an offset of the file that represents 50% of 27672 the way through the last layout it received. The client continues 27673 to request layouts with byte-ranges that are well in advance of 27674 the byte-ranges of recent and/or read requests of processes 27675 running on the client. 27677 * This strategy is as above, but the client fails to detect the 27678 pattern, but the server does. The next time the metadata server 27679 gets a LAYOUTGET, it returns a layout with a length that is well 27680 beyond loga_minlength. 27682 * A client is using buffered I/O, and has a long queue of write- 27683 behinds to process and also detects a sequential write pattern. 27684 It sends a LAYOUTGET for a layout that spans the range of the 27685 queued write-behinds and well beyond, including ranges beyond the 27686 filer's current length. The client continues to send LAYOUTGET 27687 operations once the write-behind queue reaches 50% of the maximum 27688 queue length. 27690 Once the client has obtained a layout referring to a particular 27691 device ID, the metadata server MUST NOT delete the device ID until 27692 the layout is returned or revoked. 27694 CB_NOTIFY_DEVICEID can race with LAYOUTGET. One race scenario is 27695 that LAYOUTGET returns a device ID for which the client does not have 27696 device address mappings, and the metadata server sends a 27697 CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and 27698 meanwhile the client sends GETDEVICEINFO on the device ID. This 27699 scenario is discussed in Section 18.40.4. Another scenario is that 27700 the CB_NOTIFY_DEVICEID is processed by the client before it processes 27701 the results from LAYOUTGET. The client will send a GETDEVICEINFO on 27702 the device ID. If the results from GETDEVICEINFO are received before 27703 the client gets results from LAYOUTGET, then there is no longer a 27704 race. 
If the results from LAYOUTGET are received before the results 27705 from GETDEVICEINFO, the client can either wait for results of 27706 GETDEVICEINFO or send another one to get possibly more up-to-date 27707 device address mappings for the device ID. 27709 18.44. Operation 51: LAYOUTRETURN - Release Layout Information 27711 18.44.1. ARGUMENT 27712 /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ 27713 const LAYOUT4_RET_REC_FILE = 1; 27714 const LAYOUT4_RET_REC_FSID = 2; 27715 const LAYOUT4_RET_REC_ALL = 3; 27717 enum layoutreturn_type4 { 27718 LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, 27719 LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, 27720 LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL 27721 }; 27723 struct layoutreturn_file4 { 27724 offset4 lrf_offset; 27725 length4 lrf_length; 27726 stateid4 lrf_stateid; 27727 /* layouttype4 specific data */ 27728 opaque lrf_body<>; 27729 }; 27731 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 27732 case LAYOUTRETURN4_FILE: 27733 layoutreturn_file4 lr_layout; 27734 default: 27735 void; 27736 }; 27738 struct LAYOUTRETURN4args { 27739 /* CURRENT_FH: file */ 27740 bool lora_reclaim; 27741 layouttype4 lora_layout_type; 27742 layoutiomode4 lora_iomode; 27743 layoutreturn4 lora_layoutreturn; 27744 }; 27746 18.44.2. RESULT 27747 union layoutreturn_stateid switch (bool lrs_present) { 27748 case TRUE: 27749 stateid4 lrs_stateid; 27750 case FALSE: 27751 void; 27752 }; 27754 union LAYOUTRETURN4res switch (nfsstat4 lorr_status) { 27755 case NFS4_OK: 27756 layoutreturn_stateid lorr_stateid; 27757 default: 27758 void; 27759 }; 27761 18.44.3. DESCRIPTION 27763 This operation returns from the client to the server one or more 27764 layouts represented by the client ID (derived from the session ID in 27765 the preceding SEQUENCE operation), lora_layout_type, and lora_iomode. 27766 When lr_returntype is LAYOUTRETURN4_FILE, the returned layout is 27767 further identified by the current filehandle, lrf_offset, lrf_length, 27768 and lrf_stateid. If the lrf_length field is NFS4_UINT64_MAX, all 27769 bytes of the layout, starting at lrf_offset, are returned. When 27770 lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used 27771 to identify the file system and all layouts matching the client ID, 27772 the fsid of the file system, lora_layout_type, and lora_iomode are 27773 returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts 27774 matching the client ID, lora_layout_type, and lora_iomode are 27775 returned and the current filehandle is not used. After this call, 27776 the client MUST NOT use the returned layout(s) and the associated 27777 storage protocol to access the file data. 27779 If the set of layouts designated in the case of LAYOUTRETURN4_FSID or 27780 LAYOUTRETURN4_ALL is empty, then no error results. In the case of 27781 LAYOUTRETURN4_FILE, the byte-range specified is returned even if it 27782 is a subdivision of a layout previously obtained with LAYOUTGET, a 27783 combination of multiple layouts previously obtained with LAYOUTGET, 27784 or a combination including some layouts previously obtained with 27785 LAYOUTGET, and one or more subdivisions of such layouts. When the 27786 byte-range does not designate any bytes for which a layout is held 27787 for the specified file, client ID, layout type and mode, no error 27788 results. See Section 12.5.5.2.1.5 for considerations with "bulk" 27789 return of layouts. 27791 The layout being returned may be a subset or superset of a layout 27792 specified by CB_LAYOUTRECALL. 
However, if it is a subset, the recall 27793 is not complete until the full recalled scope has been returned. 27794 Recalled scope refers to the byte-range in the case of 27795 LAYOUTRETURN4_FILE, the use of LAYOUTRETURN4_FSID, or the use of 27796 LAYOUTRETURN4_ALL. There must be a LAYOUTRETURN with a matching 27797 scope to complete the return even if all current layout ranges have 27798 been previously individually returned.
27800 For all lr_returntype values, an iomode of LAYOUTIOMODE4_ANY 27801 specifies that all layouts that match the other arguments to 27802 LAYOUTRETURN (i.e., client ID, lora_layout_type, and one of current 27803 filehandle and range; fsid derived from current filehandle; or 27804 LAYOUTRETURN4_ALL) are being returned.
27806 In the case that lr_returntype is LAYOUTRETURN4_FILE, the lrf_stateid 27807 provided by the client is a layout stateid as returned from previous 27808 layout operations. Note that the "seqid" field of lrf_stateid MUST 27809 NOT be zero. See Sections 8.2, 12.5.3, and 12.5.5.2 for a further 27810 discussion and requirements.
27812 Return of a layout or all layouts does not invalidate the mapping of 27813 storage device ID to a storage device address. The mapping remains 27814 in effect until specifically changed or deleted via device ID 27815 notification callbacks. Of course, if there are no remaining layouts 27816 that refer to a previously used device ID, the server is free to 27817 delete a device ID without a notification callback, which will be the 27818 case when notifications are not in effect.
27820 If the lora_reclaim field is set to TRUE, the client is attempting to 27821 return a layout that was acquired before the restart of the metadata 27822 server during the metadata server's grace period. When returning 27823 layouts that were acquired during the metadata server's grace period, 27824 the client MUST set the lora_reclaim field to FALSE. The 27825 lora_reclaim field MUST be set to FALSE also when lr_returntype is 27826 LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See LAYOUTCOMMIT 27827 (Section 18.42) for more details.
27829 Layouts may be returned when recalled or voluntarily (i.e., before 27830 the server has recalled them). In either case, the client must 27831 properly propagate state changed under the context of the layout to 27832 the storage device(s) or to the metadata server before returning the 27833 layout.
27835 If the client returns the layout in response to a CB_LAYOUTRECALL 27836 where the lor_recalltype field of the clora_recall field was 27837 LAYOUTRECALL4_FILE, the client should use the lor_stateid value from 27838 CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should 27839 use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid 27840 (from a previous LAYOUTRETURN result). This is done to indicate the 27841 point in time (in terms of layout stateid transitions) when the 27842 recall was sent. The client uses the precise lrf_stateid 27843 value and MUST NOT set the stateid's seqid to zero; otherwise, 27844 NFS4ERR_BAD_STATEID MUST be returned. NFS4ERR_OLD_STATEID can be 27845 returned if the client is using an old seqid, and the server knows 27846 the client should not be using the old seqid. For example, the 27847 client uses the seqid on slot 1 of the session, receives the response 27848 with the new seqid, and uses the slot to send another request with 27849 the old seqid.
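   As a purely illustrative sketch of the stateid selection just described (the layout_state structure and the helper function are hypothetical client-side bookkeeping, not part of the protocol), a client might choose the value placed in lrf_stateid as follows:

      #include <stddef.h>
      #include <stdint.h>

      /* Simplified stand-in for the XDR stateid4 (a seqid plus a
       * 12-byte "other" field). */
      typedef struct {
          uint32_t      seqid;
          unsigned char other[12];
      } stateid4;

      struct layout_state {          /* hypothetical per-file record      */
          stateid4 most_recent;      /* from the most recent LAYOUTGET or
                                      * LAYOUTRETURN for this file        */
      };

      /* Select lrf_stateid for a LAYOUTRETURN4_FILE return.  When the
       * return answers a CB_LAYOUTRECALL of type LAYOUTRECALL4_FILE,
       * echo the recall's lor_stateid; otherwise use the most recent
       * layout stateid.  In neither case may the seqid field be zero. */
      static stateid4
      choose_lrf_stateid(const struct layout_state *ls,
                         const stateid4 *lor_stateid /* NULL if voluntary */)
      {
          return (lor_stateid != NULL) ? *lor_stateid : ls->most_recent;
      }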
27851 If a client fails to return a layout in a timely manner, then the 27852 metadata server SHOULD use its control protocol with the storage 27853 devices to fence the client from accessing the data referenced by the 27854 layout. See Section 12.5.5 for more details. 27856 If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after 27857 the metadata server's grace period, NFS4ERR_NO_GRACE is returned. 27859 If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and 27860 lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, 27861 NFS4ERR_INVAL is returned. 27863 If the client sets the lr_returntype field to LAYOUTRETURN4_FILE, 27864 then the lrs_stateid field will represent the layout stateid as 27865 updated for this operation's processing; the current stateid will 27866 also be updated to match the returned value. If the last byte of any 27867 layout for the current file, client ID, and layout type is being 27868 returned and there are no remaining pending CB_LAYOUTRECALL 27869 operations for which a LAYOUTRETURN operation must be done, 27870 lrs_present MUST be FALSE, and no stateid will be returned. In 27871 addition, the COMPOUND request's current stateid will be set to the 27872 all-zeroes special stateid (see Section 16.2.3.1.2). The server MUST 27873 reject with NFS4ERR_BAD_STATEID any further use of the current 27874 stateid in that COMPOUND until the current stateid is re-established 27875 by a later stateid-returning operation. 27877 On success, the current filehandle retains its value. 27879 If the EXCHGID4_FLAG_BIND_PRINC_STATEID capability is set on the 27880 client ID (see Section 18.35), the server will require that the 27881 principal, security flavor, and if applicable, the GSS mechanism, 27882 combination that acquired the layout also be the one to send 27883 LAYOUTRETURN. This might not be possible if credentials for the 27884 principal are no longer available. The server will allow the machine 27885 credential or SSV credential (see Section 18.35) to send LAYOUTRETURN 27886 if LAYOUTRETURN's operation code was set in the spo_must_allow result 27887 of EXCHANGE_ID. 27889 18.44.4. IMPLEMENTATION 27891 The final LAYOUTRETURN operation in response to a CB_LAYOUTRECALL 27892 callback MUST be serialized with any outstanding, intersecting 27893 LAYOUTRETURN operations. Note that it is possible that while a 27894 client is returning the layout for some recalled range, the server 27895 may recall a superset of that range (e.g., LAYOUTRECALL4_ALL); the 27896 final return operation for the latter must block until the former 27897 layout recall is done. 27899 Returning all layouts in a file system using LAYOUTRETURN4_FSID is 27900 typically done in response to a CB_LAYOUTRECALL for that file system 27901 as the final return operation. Similarly, LAYOUTRETURN4_ALL is used 27902 in response to a recall callback for all layouts. It is possible 27903 that the client already returned some outstanding layouts via 27904 individual LAYOUTRETURN calls and the call for LAYOUTRETURN4_FSID or 27905 LAYOUTRETURN4_ALL marks the end of the LAYOUTRETURN sequence. See 27906 Section 12.5.5.1 for more details. 27908 Once the client has returned all layouts referring to a particular 27909 device ID, the server MAY delete the device ID. 27911 18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed Object 27913 18.45.1. 
ARGUMENT 27915 enum secinfo_style4 { 27916 SECINFO_STYLE4_CURRENT_FH = 0, 27917 SECINFO_STYLE4_PARENT = 1 27918 }; 27920 /* CURRENT_FH: object or child directory */ 27921 typedef secinfo_style4 SECINFO_NO_NAME4args; 27923 18.45.2. RESULT 27925 /* CURRENTFH: consumed if status is NFS4_OK */ 27926 typedef SECINFO4res SECINFO_NO_NAME4res; 27928 18.45.3. DESCRIPTION 27930 Like the SECINFO operation, SECINFO_NO_NAME is used by the client to 27931 obtain a list of valid RPC authentication flavors for a specific file 27932 object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that 27933 are accessed by filehandle. 27935 There are two styles of SECINFO_NO_NAME, as determined by the value 27936 of the secinfo_style4 enumeration. If SECINFO_STYLE4_CURRENT_FH is 27937 passed, then SECINFO_NO_NAME is querying for the required security 27938 for the current filehandle. If SECINFO_STYLE4_PARENT is passed, then 27939 SECINFO_NO_NAME is querying for the required security of the current 27940 filehandle's parent. If the style selected is SECINFO_STYLE4_PARENT, 27941 then SECINFO should apply the same access methodology used for 27942 LOOKUPP when evaluating the traversal to the parent directory. 27943 Therefore, if the requester does not have the appropriate access to 27944 LOOKUPP the parent, then SECINFO_NO_NAME must behave the same way and 27945 return NFS4ERR_ACCESS. 27947 If PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH returns NFS4ERR_WRONGSEC, 27948 then the client resolves the situation by sending a COMPOUND request 27949 that consists of PUTFH, PUTPUBFH, or PUTROOTFH immediately followed 27950 by SECINFO_NO_NAME, style SECINFO_STYLE4_CURRENT_FH. See Section 2.6 27951 for instructions on dealing with NFS4ERR_WRONGSEC error returns from 27952 PUTFH, PUTROOTFH, PUTPUBFH, or RESTOREFH. 27954 If SECINFO_STYLE4_PARENT is specified and there is no parent 27955 directory, SECINFO_NO_NAME MUST return NFS4ERR_NOENT. 27957 On success, the current filehandle is consumed (see 27958 Section 2.6.3.1.1.8), and if the next operation after SECINFO_NO_NAME 27959 tries to use the current filehandle, that operation will fail with 27960 the status NFS4ERR_NOFILEHANDLE. 27962 Everything else about SECINFO_NO_NAME is the same as SECINFO. See 27963 the discussion on SECINFO (Section 18.29.3). 27965 18.45.4. IMPLEMENTATION 27967 See the discussion on SECINFO (Section 18.29.4). 27969 18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing and 27970 Control 27972 18.46.1. ARGUMENT 27974 struct SEQUENCE4args { 27975 sessionid4 sa_sessionid; 27976 sequenceid4 sa_sequenceid; 27977 slotid4 sa_slotid; 27978 slotid4 sa_highest_slotid; 27979 bool sa_cachethis; 27980 }; 27982 18.46.2. 
RESULT 27984 const SEQ4_STATUS_CB_PATH_DOWN = 0x00000001; 27985 const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING = 0x00000002; 27986 const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED = 0x00000004; 27987 const SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED = 0x00000008; 27988 const SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010; 27989 const SEQ4_STATUS_ADMIN_STATE_REVOKED = 0x00000020; 27990 const SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040; 27991 const SEQ4_STATUS_LEASE_MOVED = 0x00000080; 27992 const SEQ4_STATUS_RESTART_RECLAIM_NEEDED = 0x00000100; 27993 const SEQ4_STATUS_CB_PATH_DOWN_SESSION = 0x00000200; 27994 const SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400; 27995 const SEQ4_STATUS_DEVID_CHANGED = 0x00000800; 27996 const SEQ4_STATUS_DEVID_DELETED = 0x00001000; 27998 struct SEQUENCE4resok { 27999 sessionid4 sr_sessionid; 28000 sequenceid4 sr_sequenceid; 28001 slotid4 sr_slotid; 28002 slotid4 sr_highest_slotid; 28003 slotid4 sr_target_highest_slotid; 28004 uint32_t sr_status_flags; 28005 }; 28007 union SEQUENCE4res switch (nfsstat4 sr_status) { 28008 case NFS4_OK: 28009 SEQUENCE4resok sr_resok4; 28010 default: 28011 void; 28012 }; 28014 18.46.3. DESCRIPTION 28016 The SEQUENCE operation is used by the server to implement session 28017 request control and the reply cache semantics. 28019 SEQUENCE MUST appear as the first operation of any COMPOUND in which 28020 it appears. The error NFS4ERR_SEQUENCE_POS will be returned when it 28021 is found in any position in a COMPOUND beyond the first. Operations 28022 other than SEQUENCE, BIND_CONN_TO_SESSION, EXCHANGE_ID, 28023 CREATE_SESSION, and DESTROY_SESSION, MUST NOT appear as the first 28024 operation in a COMPOUND. Such operations MUST yield the error 28025 NFS4ERR_OP_NOT_IN_SESSION if they do appear at the start of a 28026 COMPOUND. 28028 If SEQUENCE is received on a connection not associated with the 28029 session via CREATE_SESSION or BIND_CONN_TO_SESSION, and connection 28030 association enforcement is enabled (see Section 18.35), then the 28031 server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION. 28033 The sa_sessionid argument identifies the session to which this 28034 request applies. The sr_sessionid result MUST equal sa_sessionid. 28036 The sa_slotid argument is the index in the reply cache for the 28037 request. The sa_sequenceid field is the sequence number of the 28038 request for the reply cache entry (slot). The sr_slotid result MUST 28039 equal sa_slotid. The sr_sequenceid result MUST equal sa_sequenceid. 28041 The sa_highest_slotid argument is the highest slot ID for which the 28042 client has a request outstanding; it could be equal to sa_slotid. 28043 The server returns two "highest_slotid" values: sr_highest_slotid and 28044 sr_target_highest_slotid. The former is the highest slot ID the 28045 server will accept in future SEQUENCE operation, and SHOULD NOT be 28046 less than the value of sa_highest_slotid (but see Section 2.10.6.1 28047 for an exception). The latter is the highest slot ID the server 28048 would prefer the client use on a future SEQUENCE operation. 28050 If sa_cachethis is TRUE, then the client is requesting that the 28051 server cache the entire reply in the server's reply cache; therefore, 28052 the server MUST cache the reply (see Section 2.10.6.1.3). The server 28053 MAY cache the reply if sa_cachethis is FALSE. If the server does not 28054 cache the entire reply, it MUST still record that it executed the 28055 request at the specified slot and sequence ID. 
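   The handling of sa_cachethis can be shown with a small sketch.  The slot structure and helper below are hypothetical server-side bookkeeping, not protocol elements; they only illustrate the rule that the reply MUST be cached when sa_cachethis is TRUE and that execution MUST still be recorded at the slot and sequence ID when it is FALSE:

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      struct slot {                 /* one reply-cache entry of a session */
          uint32_t seqid;           /* sa_sequenceid last executed here   */
          bool     reply_cached;    /* entire reply retained?             */
          void    *cached_reply;    /* encoded reply when reply_cached    */
      };

      /* Record that the request at (slot, sa_sequenceid) was executed.
       * The full reply is retained when the client asked for it; a
       * server is free to retain it in other cases as well. */
      static void
      slot_record_execution(struct slot *s, uint32_t sa_sequenceid,
                            bool sa_cachethis, void *encoded_reply)
      {
          s->seqid        = sa_sequenceid;
          s->reply_cached = sa_cachethis;
          s->cached_reply = sa_cachethis ? encoded_reply : NULL;
      }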
28057 The response to the SEQUENCE operation contains a word of status 28058 flags (sr_status_flags) that can provide to the client information 28059 related to the status of the client's lock state and communications 28060 paths. Note that any status bits relating to lock state MAY be reset 28061 when lock state is lost due to a server restart (even if the session 28062 is persistent across restarts; session persistence does not imply 28063 lock state persistence) or the establishment of a new client 28064 instance. 28066 SEQ4_STATUS_CB_PATH_DOWN 28067 When set, indicates that the client has no operational backchannel 28068 path for any session associated with the client ID, making it 28069 necessary for the client to re-establish one. This bit remains 28070 set on all SEQUENCE responses on all sessions associated with the 28071 client ID until at least one backchannel is available on any 28072 session associated with the client ID. If the client fails to re- 28073 establish a backchannel for the client ID, it is subject to having 28074 recallable state revoked. 28076 SEQ4_STATUS_CB_PATH_DOWN_SESSION 28077 When set, indicates that the session has no operational 28078 backchannel. There are two reasons why 28079 SEQ4_STATUS_CB_PATH_DOWN_SESSION may be set and not 28080 SEQ4_STATUS_CB_PATH_DOWN. First is that a callback operation that 28081 applies specifically to the session (e.g., CB_RECALL_SLOT, see 28082 Section 20.8) needs to be sent. Second is that the server did 28083 send a callback operation, but the connection was lost before the 28084 reply. The server cannot be sure whether or not the client 28085 received the callback operation, and so, per rules on request 28086 retry, the server MUST retry the callback operation over the same 28087 session. The SEQ4_STATUS_CB_PATH_DOWN_SESSION bit is the 28088 indication to the client that it needs to associate a connection 28089 to the session's backchannel. This bit remains set on all 28090 SEQUENCE responses of the session until a connection is associated 28091 with the session's a backchannel. If the client fails to re- 28092 establish a backchannel for the session, it is subject to having 28093 recallable state revoked. 28095 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING 28096 When set, indicates that all GSS contexts or RPCSEC_GSS handles 28097 assigned to the session's backchannel will expire within a period 28098 equal to the lease time. This bit remains set on all SEQUENCE 28099 replies until at least one of the following are true: 28101 * All SSV RPCSEC_GSS handles on the session's backchannel have 28102 been destroyed and all non-SSV GSS contexts have expired. 28104 * At least one more SSV RPCSEC_GSS handle has been added to the 28105 backchannel. 28107 * The expiration time of at least one non-SSV GSS context of an 28108 RPCSEC_GSS handle is beyond the lease period from the current 28109 time (relative to the time of when a SEQUENCE response was 28110 sent) 28112 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED 28113 When set, indicates all non-SSV GSS contexts and all SSV 28114 RPCSEC_GSS handles assigned to the session's backchannel have 28115 expired or have been destroyed. This bit remains set on all 28116 SEQUENCE replies until at least one non-expired non-SSV GSS 28117 context for the session's backchannel has been established or at 28118 least one SSV RPCSEC_GSS handle has been assigned to the 28119 backchannel. 
28121 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED 28122 When set, indicates that the lease has expired and as a result the 28123 server released all of the client's locking state. This status 28124 bit remains set on all SEQUENCE replies until the loss of all such 28125 locks has been acknowledged by use of FREE_STATEID (see 28126 Section 18.38), or by establishing a new client instance by 28127 destroying all sessions (via DESTROY_SESSION), the client ID (via 28128 DESTROY_CLIENTID), and then invoking EXCHANGE_ID and 28129 CREATE_SESSION to establish a new client ID. 28131 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED 28132 When set, indicates that some subset of the client's locks have 28133 been revoked due to expiration of the lease period followed by 28134 another client's conflicting LOCK operation. This status bit 28135 remains set on all SEQUENCE replies until the loss of all such 28136 locks has been acknowledged by use of FREE_STATEID. 28138 SEQ4_STATUS_ADMIN_STATE_REVOKED 28139 When set, indicates that one or more locks have been revoked 28140 without expiration of the lease period, due to administrative 28141 action. This status bit remains set on all SEQUENCE replies until 28142 the loss of all such locks has been acknowledged by use of 28143 FREE_STATEID. 28145 SEQ4_STATUS_RECALLABLE_STATE_REVOKED 28146 When set, indicates that one or more recallable objects have been 28147 revoked without expiration of the lease period, due to the 28148 client's failure to return them when recalled, which may be a 28149 consequence of there being no working backchannel and the client 28150 failing to re-establish a backchannel per the 28151 SEQ4_STATUS_CB_PATH_DOWN, SEQ4_STATUS_CB_PATH_DOWN_SESSION, or 28152 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED status flags. This status bit 28153 remains set on all SEQUENCE replies until the loss of all such 28154 locks has been acknowledged by use of FREE_STATEID. 28156 SEQ4_STATUS_LEASE_MOVED 28157 When set, indicates that responsibility for lease renewal has been 28158 transferred to one or more new servers. This condition will 28159 continue until the client receives an NFS4ERR_MOVED error and the 28160 server receives the subsequent GETATTR for the fs_locations or 28161 fs_locations_info attribute for an access to each file system for 28162 which a lease has been moved to a new server. See 28163 Section 11.11.9.2. 28165 SEQ4_STATUS_RESTART_RECLAIM_NEEDED 28166 When set, indicates that due to server restart, the client must 28167 reclaim locking state. Until the client sends a global 28168 RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation will 28169 return SEQ4_STATUS_RESTART_RECLAIM_NEEDED. 28171 SEQ4_STATUS_BACKCHANNEL_FAULT 28172 The server has encountered an unrecoverable fault with the 28173 backchannel (e.g., it has lost track of the sequence ID for a slot 28174 in the backchannel). The client MUST stop sending more requests 28175 on the session's fore channel, wait for all outstanding requests 28176 to complete on the fore and back channel, and then destroy the 28177 session. 28179 SEQ4_STATUS_DEVID_CHANGED 28180 The client is using device ID notifications and the server has 28181 changed a device ID mapping held by the client. This flag will 28182 stay present until the client has obtained the new mapping with 28183 GETDEVICEINFO. 28185 SEQ4_STATUS_DEVID_DELETED 28186 The client is using device ID notifications and the server has 28187 deleted a device ID mapping held by the client. 
This flag will 28188 stay in effect until the client sends a GETDEVICEINFO on the 28189 device ID with a null value in the argument gdia_notify_types. 28191 The value of the sa_sequenceid argument relative to the cached 28192 sequence ID on the slot falls into one of three cases. 28194 * If the difference between sa_sequenceid and the server's cached 28195 sequence ID at the slot ID is two (2) or more, or if sa_sequenceid 28196 is less than the cached sequence ID (accounting for wraparound of 28197 the unsigned sequence ID value), then the server MUST return 28198 NFS4ERR_SEQ_MISORDERED. 28200 * If sa_sequenceid and the cached sequence ID are the same, this is 28201 a retry, and the server replies with what is recorded in the reply 28202 cache. The lease is possibly renewed as described below. 28204 * If sa_sequenceid is one greater (accounting for wraparound) than 28205 the cached sequence ID, then this is a new request, and the slot's 28206 sequence ID is incremented. The operations subsequent to 28207 SEQUENCE, if any, are processed. If there are no other 28208 operations, the only other effects are to cache the SEQUENCE reply 28209 in the slot, maintain the session's activity, and possibly renew 28210 the lease. 28212 If the client reuses a slot ID and sequence ID for a completely 28213 different request, the server MAY treat the request as if it is a 28214 retry of what it has already executed. The server MAY however detect 28215 the client's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY. 28217 If SEQUENCE returns an error, then the state of the slot (sequence 28218 ID, cached reply) MUST NOT change, and the associated lease MUST NOT 28219 be renewed. 28221 If SEQUENCE returns NFS4_OK, then the associated lease MUST be 28222 renewed (see Section 8.3), except if 28223 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags. 28225 18.46.4. IMPLEMENTATION 28227 The server MUST maintain a mapping of session ID to client ID in 28228 order to validate any operations that follow SEQUENCE that take a 28229 stateid as an argument and/or result. 28231 If the client establishes a persistent session, then a SEQUENCE 28232 received after a server restart might encounter requests performed 28233 and recorded in a persistent reply cache before the server restart. 28234 In this case, SEQUENCE will be processed successfully, while requests 28235 that were not previously performed and recorded are rejected with 28236 NFS4ERR_DEADSESSION. 28238 Depending on which of the operations within the COMPOUND were 28239 successfully performed before the server restart, these operations 28240 will also have replies sent from the server reply cache. Note that 28241 when these operations establish locking state, it is locking state 28242 that applies to the previous server instance and to the previous 28243 client ID, even though the server restart, which logically happened 28244 after these operations, eliminated that state. In the case of a 28245 partially executed COMPOUND, processing may reach an operation not 28246 processed during the earlier server instance, making this operation a 28247 new one and not performable on the existing session. In this case, 28248 NFS4ERR_DEADSESSION will be returned from that operation. 28250 18.47. Operation 54: SET_SSV - Update SSV for a Client ID 28252 18.47.1. ARGUMENT 28254 struct ssa_digest_input4 { 28255 SEQUENCE4args sdi_seqargs; 28256 }; 28258 struct SET_SSV4args { 28259 opaque ssa_ssv<>; 28260 opaque ssa_digest<>; 28261 }; 28263 18.47.2. 
RESULT 28264 struct ssr_digest_input4 { 28265 SEQUENCE4res sdi_seqres; 28266 }; 28268 struct SET_SSV4resok { 28269 opaque ssr_digest<>; 28270 }; 28272 union SET_SSV4res switch (nfsstat4 ssr_status) { 28273 case NFS4_OK: 28274 SET_SSV4resok ssr_resok4; 28275 default: 28276 void; 28277 }; 28279 18.47.3. DESCRIPTION 28281 This operation is used to update the SSV for a client ID. Before 28282 SET_SSV is called the first time on a client ID, the SSV is zero. 28283 The SSV is the key used for the SSV GSS mechanism (Section 2.10.9) 28285 SET_SSV MUST be preceded by a SEQUENCE operation in the same 28286 COMPOUND. It MUST NOT be used if the client did not opt for SP4_SSV 28287 state protection when the client ID was created (see Section 18.35); 28288 the server returns NFS4ERR_INVAL in that case. 28290 The field ssa_digest is computed as the output of the HMAC (RFC 2104 28291 [52]) using the subkey derived from the SSV4_SUBKEY_MIC_I2T and 28292 current SSV as the key (see Section 2.10.9 for a description of 28293 subkeys), and an XDR encoded value of data type ssa_digest_input4. 28294 The field sdi_seqargs is equal to the arguments of the SEQUENCE 28295 operation for the COMPOUND procedure that SET_SSV is within. 28297 The argument ssa_ssv is XORed with the current SSV to produce the new 28298 SSV. The argument ssa_ssv SHOULD be generated randomly. 28300 In the response, ssr_digest is the output of the HMAC using the 28301 subkey derived from SSV4_SUBKEY_MIC_T2I and new SSV as the key, and 28302 an XDR encoded value of data type ssr_digest_input4. The field 28303 sdi_seqres is equal to the results of the SEQUENCE operation for the 28304 COMPOUND procedure that SET_SSV is within. 28306 As noted in Section 18.35, the client and server can maintain 28307 multiple concurrent versions of the SSV. The client and server each 28308 MUST maintain an internal SSV version number, which is set to one the 28309 first time SET_SSV executes on the server and the client receives the 28310 first SET_SSV reply. Each subsequent SET_SSV increases the internal 28311 SSV version number by one. The value of this version number 28312 corresponds to the smpt_ssv_seq, smt_ssv_seq, sspt_ssv_seq, and 28313 ssct_ssv_seq fields of the SSV GSS mechanism tokens (see 28314 Section 2.10.9). 28316 18.47.4. IMPLEMENTATION 28318 When the server receives ssa_digest, it MUST verify the digest by 28319 computing the digest the same way the client did and comparing it 28320 with ssa_digest. If the server gets a different result, this is an 28321 error, NFS4ERR_BAD_SESSION_DIGEST. This error might be the result of 28322 another SET_SSV from the same client ID changing the SSV. If so, the 28323 client recovers by sending a SET_SSV operation again with a 28324 recomputed digest based on the subkey of the new SSV. If the 28325 transport connection is dropped after the SET_SSV request is sent, 28326 but before the SET_SSV reply is received, then there are special 28327 considerations for recovery if the client has no more connections 28328 associated with sessions associated with the client ID of the SSV. 28329 See Section 18.34.4. 28331 Clients SHOULD NOT send an ssa_ssv that is equal to a previous 28332 ssa_ssv, nor equal to a previous or current SSV (including an ssa_ssv 28333 equal to zero since the SSV is initialized to zero when the client ID 28334 is created). 28336 Clients SHOULD send SET_SSV with RPCSEC_GSS privacy. Servers MUST 28337 support RPCSEC_GSS with privacy for any COMPOUND that has { SEQUENCE, 28338 SET_SSV }. 
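   The SSV update itself can be sketched in a few lines.  The ssv_state structure below is hypothetical local bookkeeping, and digest computation and verification are omitted; the fragment only illustrates the XOR step and the internal SSV version number described above:

      #include <stddef.h>
      #include <stdint.h>

      struct ssv_state {
          uint8_t  *ssv;       /* current SSV; all zero before the first
                                * successful SET_SSV                      */
          size_t    ssv_len;   /* length of the SSV for this client ID    */
          uint32_t  version;   /* internal SSV version number             */
      };

      /* Apply a successful SET_SSV: the new SSV is the byte-wise XOR of
       * the current SSV with ssa_ssv, and the internal SSV version
       * number advances by one (it becomes 1 on the first SET_SSV). */
      static void
      apply_set_ssv(struct ssv_state *st, const uint8_t *ssa_ssv)
      {
          for (size_t i = 0; i < st->ssv_len; i++)
              st->ssv[i] ^= ssa_ssv[i];
          st->version++;
      }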
28340 A client SHOULD NOT send SET_SSV with the SSV GSS mechanism's 28341 credential because the purpose of SET_SSV is to seed the SSV from 28342 non-SSV credentials. Instead, SET_SSV SHOULD be sent with the 28343 credential of a user that is accessing the client ID for the first 28344 time (Section 2.10.8.3). However, if the client does send SET_SSV 28345 with SSV credentials, the digest protecting the arguments uses the 28346 value of the SSV before ssa_ssv is XORed in, and the digest 28347 protecting the results uses the value of the SSV after the ssa_ssv is 28348 XORed in.
28350 18.48. Operation 55: TEST_STATEID - Test Stateids for Validity
28352 18.48.1. ARGUMENT
28354 struct TEST_STATEID4args { 28355 stateid4 ts_stateids<>; 28356 };
28358 18.48.2. RESULT
28359 struct TEST_STATEID4resok { 28360 nfsstat4 tsr_status_codes<>; 28361 };
28363 union TEST_STATEID4res switch (nfsstat4 tsr_status) { 28364 case NFS4_OK: 28365 TEST_STATEID4resok tsr_resok4; 28366 default: 28367 void; 28368 };
28370 18.48.3. DESCRIPTION
28372 The TEST_STATEID operation is used to check the validity of a set of 28373 stateids. It can be used at any time, but the client should 28374 definitely use it when it receives an indication that one or more of 28375 its stateids have been invalidated due to lock revocation. This 28376 occurs when the SEQUENCE operation returns with one of the following 28377 sr_status_flags set:
28379 * SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
28381 * SEQ4_STATUS_ADMIN_STATE_REVOKED
28383 * SEQ4_STATUS_RECALLABLE_STATE_REVOKED
28385 The client can use TEST_STATEID one or more times to test the 28386 validity of its stateids. Each use of TEST_STATEID allows a large 28387 set of such stateids to be tested and avoids having errors for earlier 28388 stateids in a COMPOUND request interfere with the checking of 28389 subsequent stateids, as would happen if individual stateids were 28390 tested by a series of corresponding operations in a COMPOUND 28391 request.
28393 For each stateid, the server returns the status code that would be 28394 returned if that stateid were to be used in normal operation. 28395 Returning such a status indication is not an error and does not cause 28396 COMPOUND processing to terminate. Checks for the validity of the 28397 stateid proceed as they would for normal operations with a number of 28398 exceptions:
28400 * There is no check for the type of stateid object, as would be the 28401 case for normal use of a stateid.
28403 * There is no reference to the current filehandle.
28405 * Special stateids are always considered invalid (they result in the 28406 error code NFS4ERR_BAD_STATEID).
28408 All stateids are interpreted as being associated with the client for 28409 the current session. Any possible association with a previous 28410 instance of the client (as stale stateids) is not considered.
28412 The valid status values in the returned status_code array are 28413 NFS4_OK, NFS4ERR_BAD_STATEID, NFS4ERR_OLD_STATEID, 28414 NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED.
28416 18.48.4. IMPLEMENTATION
28418 See Sections 8.2.2 and 8.2.4 for a discussion of stateid structure, 28419 lifetime, and validation.
28421 18.49. Operation 56: WANT_DELEGATION - Request Delegation
28423 18.49.1. ARGUMENT
28424 union deleg_claim4 switch (open_claim_type4 dc_claim) { 28425 /* 28426 * No special rights to object. Ordinary delegation 28427 * request of the specified object. Object identified 28428 * by filehandle.
28429 */ 28430 case CLAIM_FH: /* new to v4.1 */ 28431 /* CURRENT_FH: object being delegated */ 28432 void; 28434 /* 28435 * Right to file based on a delegation granted 28436 * to a previous boot instance of the client. 28437 * File is specified by filehandle. 28438 */ 28439 case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ 28440 /* CURRENT_FH: object being delegated */ 28441 void; 28443 /* 28444 * Right to the file established by an open previous 28445 * to server reboot. File identified by filehandle. 28446 * Used during server reclaim grace period. 28447 */ 28448 case CLAIM_PREVIOUS: 28449 /* CURRENT_FH: object being reclaimed */ 28450 open_delegation_type4 dc_delegate_type; 28451 }; 28453 struct WANT_DELEGATION4args { 28454 uint32_t wda_want; 28455 deleg_claim4 wda_claim; 28456 }; 28458 18.49.2. RESULT 28460 union WANT_DELEGATION4res switch (nfsstat4 wdr_status) { 28461 case NFS4_OK: 28462 open_delegation4 wdr_resok4; 28463 default: 28464 void; 28465 }; 28467 18.49.3. DESCRIPTION 28469 Where this description mandates the return of a specific error code 28470 for a specific condition, and where multiple conditions apply, the 28471 server MAY return any of the mandated error codes. 28473 This operation allows a client to: 28475 * Get a delegation on all types of files except directories. 28477 * Register a "want" for a delegation for the specified file object, 28478 and be notified via a callback when the delegation is available. 28479 The server MAY support notifications of availability via 28480 callbacks. If the server does not support registration of wants, 28481 it MUST NOT return an error to indicate that, and instead MUST 28482 return with ond_why set to WND4_CONTENTION or WND4_RESOURCE and 28483 ond_server_will_push_deleg or ond_server_will_signal_avail set to 28484 FALSE. When the server indicates that it will notify the client 28485 by means of a callback, it will either provide the delegation 28486 using a CB_PUSH_DELEG operation or cancel its promise by sending a 28487 CB_WANTS_CANCELLED operation. 28489 * Cancel a want for a delegation. 28491 The client SHOULD NOT set OPEN4_SHARE_ACCESS_READ and SHOULD NOT set 28492 OPEN4_SHARE_ACCESS_WRITE in wda_want. If it does, the server MUST 28493 ignore them. 28495 The meanings of the following flags in wda_want are the same as they 28496 are in OPEN, except as noted below. 28498 * OPEN4_SHARE_ACCESS_WANT_READ_DELEG 28500 * OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 28502 * OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 28504 * OPEN4_SHARE_ACCESS_WANT_NO_DELEG. Unlike the OPEN operation, this 28505 flag SHOULD NOT be set by the client in the arguments to 28506 WANT_DELEGATION, and MUST be ignored by the server. 28508 * OPEN4_SHARE_ACCESS_WANT_CANCEL 28510 * OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 28512 * OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 28514 The handling of the above flags in WANT_DELEGATION is the same as in 28515 OPEN. Information about the delegation and/or the promises the 28516 server is making regarding future callbacks are the same as those 28517 described in the open_delegation4 structure. 28519 The successful results of WANT_DELEGATION are of data type 28520 open_delegation4, which is the same data type as the "delegation" 28521 field in the results of the OPEN operation (see Section 18.16.3). 28522 The server constructs wdr_resok4 the same way it constructs OPEN's 28523 "delegation" with one difference: WANT_DELEGATION MUST NOT return a 28524 delegation type of OPEN_DELEGATE_NONE. 
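   As an illustrative aid only, the following fragment sketches this screening of wda_want; the helper name is hypothetical, and the constant values are those defined for OPEN (Section 18.16), repeated here so that the fragment stands alone:

      #include <stdbool.h>
      #include <stdint.h>

      #define OPEN4_SHARE_ACCESS_READ             0x00000001
      #define OPEN4_SHARE_ACCESS_WRITE            0x00000002
      #define OPEN4_SHARE_ACCESS_WANT_DELEG_MASK  0x0000FF00
      #define OPEN4_SHARE_ACCESS_WANT_NO_DELEG    0x00000400

      /* Screening of wda_want as described above: the two SHARE_ACCESS
       * bits are ignored if set, and a request that expresses no
       * delegation want (other than WANT_NO_DELEG) is rejected with
       * NFS4ERR_INVAL, as the next paragraph specifies. */
      static bool
      wda_want_acceptable(uint32_t wda_want)
      {
          wda_want &= ~(uint32_t)(OPEN4_SHARE_ACCESS_READ |
                                  OPEN4_SHARE_ACCESS_WRITE);
          return ((wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) &
                  ~(uint32_t)OPEN4_SHARE_ACCESS_WANT_NO_DELEG) != 0;
      }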
28526 If ((wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) & 28527 ~OPEN4_SHARE_ACCESS_WANT_NO_DELEG) is zero, then the client is 28528 indicating no explicit desire or non-desire for a delegation and the 28529 server MUST return NFS4ERR_INVAL. 28531 The client uses the OPEN4_SHARE_ACCESS_WANT_CANCEL flag in the 28532 WANT_DELEGATION operation to cancel a previously requested want for a 28533 delegation. Note that if the server is in the process of sending the 28534 delegation (via CB_PUSH_DELEG) at the time the client sends a 28535 cancellation of the want, the delegation might still be pushed to the 28536 client. 28538 If WANT_DELEGATION fails to return a delegation, and the server 28539 returns NFS4_OK, the server MUST set the delegation type to 28540 OPEN4_DELEGATE_NONE_EXT, and set od_whynone, as described in 28541 Section 18.16. Write delegations are not available for file types 28542 that are not writable. This includes file objects of types NF4BLK, 28543 NF4CHR, NF4LNK, NF4SOCK, and NF4FIFO. If the client requests 28544 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG without 28545 OPEN4_SHARE_ACCESS_WANT_READ_DELEG on an object with one of the 28546 aforementioned file types, the server must set 28547 wdr_resok4.od_whynone.ond_why to WND4_WRITE_DELEG_NOT_SUPP_FTYPE. 28549 18.49.4. IMPLEMENTATION 28551 A request for a conflicting delegation is not normally intended to 28552 trigger the recall of the existing delegation. Servers may choose to 28553 treat some clients as having higher priority such that their wants 28554 will trigger recall of an existing delegation, although that is 28555 expected to be an unusual situation. 28557 Servers will generally recall delegations assigned by WANT_DELEGATION 28558 on the same basis as those assigned by OPEN. CB_RECALL will 28559 generally be done only when other clients perform operations 28560 inconsistent with the delegation. The normal response to aging of 28561 delegations is to use CB_RECALL_ANY, in order to give the client the 28562 opportunity to keep the delegations most useful from its point of 28563 view. 28565 18.50. Operation 57: DESTROY_CLIENTID - Destroy a Client ID 28566 18.50.1. ARGUMENT 28568 struct DESTROY_CLIENTID4args { 28569 clientid4 dca_clientid; 28570 }; 28572 18.50.2. RESULT 28574 struct DESTROY_CLIENTID4res { 28575 nfsstat4 dcr_status; 28576 }; 28578 18.50.3. DESCRIPTION 28580 The DESTROY_CLIENTID operation destroys the client ID. If there are 28581 sessions (both idle and non-idle), opens, locks, delegations, 28582 layouts, and/or wants (Section 18.49) associated with the unexpired 28583 lease of the client ID, the server MUST return NFS4ERR_CLIENTID_BUSY. 28584 DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as 28585 the client ID derived from the session ID of SEQUENCE is not the same 28586 as the client ID to be destroyed. If the client IDs are the same, 28587 then the server MUST return NFS4ERR_CLIENTID_BUSY. 28589 If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only 28590 operation in the COMPOUND request (otherwise, the server MUST return 28591 NFS4ERR_NOT_ONLY_OP). If the operation is sent without a SEQUENCE 28592 preceding it, a client that retransmits the request may receive an 28593 error in response, because the original request might have been 28594 successfully executed. 28596 18.50.4. IMPLEMENTATION 28598 DESTROY_CLIENTID allows a server to immediately reclaim the resources 28599 consumed by an unused client ID, and also to forget that it ever 28600 generated the client ID. 
By forgetting that it ever generated the 28601 client ID, the server can safely reuse the client ID on a future 28602 EXCHANGE_ID operation. 28604 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims Finished 28606 18.51.1. ARGUMENT 28607 struct RECLAIM_COMPLETE4args { 28608 /* 28609 * If rca_one_fs TRUE, 28610 * 28611 * CURRENT_FH: object in 28612 * file system reclaim is 28613 * complete for. 28614 */ 28615 bool rca_one_fs; 28616 }; 28618 18.51.2. RESULTS 28620 struct RECLAIM_COMPLETE4res { 28621 nfsstat4 rcr_status; 28622 }; 28624 18.51.3. DESCRIPTION 28626 A RECLAIM_COMPLETE operation is used to indicate that the client has 28627 reclaimed all of the locking state that it will recover using 28628 reclaim, when it is recovering state due to either a server restart 28629 or the migration of a file system to another server. There are two 28630 types of RECLAIM_COMPLETE operations: 28632 * When rca_one_fs is FALSE, a global RECLAIM_COMPLETE is being done. 28633 This indicates that recovery of all locks that the client held on 28634 the previous server instance has been completed. The current 28635 filehandle need not be set in this case. 28637 * When rca_one_fs is TRUE, a file system-specific RECLAIM_COMPLETE 28638 is being done. This indicates that recovery of locks for a single 28639 fs (the one designated by the current filehandle) due to the 28640 migration of the file system has been completed. Presence of a 28641 current filehandle is required when rca_one_fs is set to TRUE. 28642 When the current filehandle designates a filehandle in a file 28643 system not in the process of migration, the operation returns 28644 NFS4_OK and is otherwise ignored. 28646 Once a RECLAIM_COMPLETE is done, there can be no further reclaim 28647 operations for locks whose scope is defined as having completed 28648 recovery. Once the client sends RECLAIM_COMPLETE, the server will 28649 not allow the client to do subsequent reclaims of locking state for 28650 that scope and, if these are attempted, will return NFS4ERR_NO_GRACE. 28652 Whenever a client establishes a new client ID and before it does the 28653 first non-reclaim operation that obtains a lock, it MUST send a 28654 RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no 28655 locks to reclaim. If non-reclaim locking operations are done before 28656 the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned. 28658 Similarly, when the client accesses a migrated file system on a new 28659 server, before it sends the first non-reclaim operation that obtains 28660 a lock on this new server, it MUST send a RECLAIM_COMPLETE with 28661 rca_one_fs set to TRUE and current filehandle within that file 28662 system, even if there are no locks to reclaim. If non-reclaim 28663 locking operations are done on that file system before the 28664 RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned. 28666 It should be noted that there are situations in which a client needs 28667 to issue both forms of RECLAIM_COMPLETE. An example is an instance 28668 of file system migration in which the file system is migrated to a 28669 server for which the client has no clientid. As a result, the client 28670 needs to obtain a clientid from the server (incurring the 28671 responsibility to do RECLAIM_COMPLETE with rca_one_fs set to FALSE) 28672 as well as RECLAIM_COMPLETE with rca_one_fs set to TRUE to complete 28673 the per-fs grace period associated with the file system migration. 
28674 These two may be done in any order as long as all necessary lock 28675 reclaims have been done before issuing either of them. 28677 Any locks not reclaimed at the point at which RECLAIM_COMPLETE is 28678 done become non-reclaimable. The client MUST NOT attempt to reclaim 28679 them, either during the current server instance or in any subsequent 28680 server instance, or on another server to which responsibility for 28681 that file system is transferred. If the client were to do so, it 28682 would be violating the protocol by representing itself as owning 28683 locks that it does not own, and so has no right to reclaim. See 28684 Section 8.4.3 of [66] for a discussion of edge conditions related to 28685 lock reclaim. 28687 By sending a RECLAIM_COMPLETE, the client indicates readiness to 28688 proceed to do normal non-reclaim locking operations. The client 28689 should be aware that such operations may temporarily result in 28690 NFS4ERR_GRACE errors until the server is ready to terminate its grace 28691 period. 28693 18.51.4. IMPLEMENTATION 28695 Servers will typically use the information as to when reclaim 28696 activity is complete to reduce the length of the grace period. When 28697 the server maintains in persistent storage a list of clients that 28698 might have had locks, it is able to use the fact that all such 28699 clients have done a RECLAIM_COMPLETE to terminate the grace period 28700 and begin normal operations (i.e., grant requests for new locks) 28701 sooner than it might otherwise. 28703 Latency can be minimized by doing a RECLAIM_COMPLETE as part of the 28704 COMPOUND request in which the last lock-reclaiming operation is done. 28705 When there are no reclaims to be done, RECLAIM_COMPLETE should be 28706 done immediately in order to allow the grace period to end as soon as 28707 possible. 28709 RECLAIM_COMPLETE should only be done once for each server instance or 28710 occasion of the transition of a file system. If it is done a second 28711 time, the error NFS4ERR_COMPLETE_ALREADY will result. Note that 28712 because of the session feature's retry protection, retries of 28713 COMPOUND requests containing RECLAIM_COMPLETE operation will not 28714 result in this error. 28716 When a RECLAIM_COMPLETE is sent, the client effectively acknowledges 28717 any locks not yet reclaimed as lost. This allows the server to re- 28718 enable the client to recover locks if the occurrence of edge 28719 conditions, as described in Section 8.4.3, had caused the server to 28720 disable the client's ability to recover locks. 28722 Because previous descriptions of RECLAIM_COMPLETE were not 28723 sufficiently explicit about the circumstances in which use of 28724 RECLAIM_COMPLETE with rca_one_fs set to TRUE was appropriate, there 28725 have been cases in which it has been misused by clients who have 28726 issued RECLAIM_COMPLETE with rca_one_fs set to TRUE when it should 28727 have not been. There have also been cases in which servers have, in 28728 various ways, not responded to such misuse as described above, either 28729 ignoring the rca_one_fs setting (treating the operation as a global 28730 RECLAIM_COMPLETE) or ignoring the entire operation. 
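As a non-normative aid, the C-style sketch below condenses the sequencing requirements described above for recovery after a server restart: all intended reclaims are performed first, a global RECLAIM_COMPLETE (rca_one_fs set to FALSE) is then sent even if nothing was reclaimed, and only afterwards are non-reclaim lock-obtaining operations attempted. The helper names (first_reclaim, reclaim_lock, send_reclaim_complete, and so on) and the bookkeeping structures are assumptions of this sketch, not protocol elements.

      #include <stdbool.h>

      #define NFS4_OK 0

      struct client;       /* client-side bookkeeping (assumed) */
      struct lock_state;   /* one lock held before the restart (assumed) */

      /* Illustrative helpers, not protocol elements. */
      struct lock_state *first_reclaim(struct client *clp);
      struct lock_state *next_reclaim(struct lock_state *ls);
      int  reclaim_lock(struct client *clp, struct lock_state *ls);
      void mark_lock_lost(struct client *clp, struct lock_state *ls);
      int  send_reclaim_complete(struct client *clp, bool rca_one_fs);

      static int recover_after_server_restart(struct client *clp)
      {
          struct lock_state *ls;

          /* 1. Reclaim every lock the client intends to recover, using
           *    reclaim-type requests (e.g., OPEN with CLAIM_PREVIOUS). */
          for (ls = first_reclaim(clp); ls != NULL; ls = next_reclaim(ls)) {
              if (reclaim_lock(clp, ls) != NFS4_OK)
                  mark_lock_lost(clp, ls); /* cannot be reclaimed later */
          }

          /* 2. Signal that reclaiming is finished for this client ID.
           *    The global form (rca_one_fs = FALSE) is required even if
           *    there was nothing to reclaim. */
          int status = send_reclaim_complete(clp, false);

          /* 3. Only now may non-reclaim lock-obtaining operations be
           *    sent; they may still receive NFS4ERR_GRACE until the
           *    server ends its grace period. */
          return status;
      }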
28732 While clients SHOULD NOT misuse this feature, and servers SHOULD 28733 respond to such misuse as described above, implementors need to be 28734 aware of the following considerations as they make necessary trade- 28735 offs between interoperability with existing implementations and 28736 proper support for facilities to allow lock recovery in the event of 28737 file system migration. 28739 * When servers have no support for becoming the destination server 28740 of a file system subject to migration, there is no possibility of 28741 a per-fs RECLAIM_COMPLETE being done legitimately, and occurrences 28742 of it SHOULD be ignored. However, the negative consequences of 28743 accepting such mistaken use are quite limited as long as the 28744 client does not issue it before all necessary reclaims are done. 28746 * When a server might become the destination for a file system being 28747 migrated, inappropriate use of per-fs RECLAIM_COMPLETE is more 28748 concerning. In the case in which the file system designated is 28749 not within a per-fs grace period, the per-fs RECLAIM_COMPLETE 28750 SHOULD be ignored, with the negative consequences of accepting it 28751 being limited, as in the case in which migration is not supported. 28752 However, if the server encounters a file system undergoing 28753 migration, the operation cannot be accepted as if it were a global 28754 RECLAIM_COMPLETE without invalidating its intended use. 28756 18.52. Operation 10044: ILLEGAL - Illegal Operation 28758 18.52.1. ARGUMENTS 28760 void; 28762 18.52.2. RESULTS 28764 struct ILLEGAL4res { 28765 nfsstat4 status; 28766 }; 28768 18.52.3. DESCRIPTION 28770 This operation is a placeholder for encoding a result to handle the 28771 case of the client sending an operation code within COMPOUND that is 28772 not supported. See the COMPOUND procedure description for more 28773 details. 28775 The status field of ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL. 28777 18.52.4. IMPLEMENTATION 28779 A client will probably not send an operation with code OP_ILLEGAL but 28780 if it does, the response will be ILLEGAL4res just as it would be with 28781 any other invalid operation code. Note that if the server gets an 28782 illegal operation code that is not OP_ILLEGAL, and if the server 28783 checks for legal operation codes during the XDR decode phase, then 28784 the ILLEGAL4res would not be returned. 28786 19. NFSv4.1 Callback Procedures 28788 The procedures used for callbacks are defined in the following 28789 sections. In the interest of clarity, the terms "client" and 28790 "server" refer to NFS clients and servers, despite the fact that for 28791 an individual callback RPC, the sense of these terms would be 28792 precisely the opposite. 28794 Both procedures, CB_NULL and CB_COMPOUND, MUST be implemented. 28796 19.1. Procedure 0: CB_NULL - No Operation 28798 19.1.1. ARGUMENTS 28800 void; 28802 19.1.2. RESULTS 28804 void; 28806 19.1.3. DESCRIPTION 28808 CB_NULL is the standard ONC RPC NULL procedure, with the standard 28809 void argument and void response. Even though there is no direct 28810 functionality associated with this procedure, the server will use 28811 CB_NULL to confirm the existence of a path for RPCs from the server 28812 to client. 28814 19.1.4. ERRORS 28816 None. 28818 19.2. Procedure 1: CB_COMPOUND - Compound Operations 28820 19.2.1. 
ARGUMENTS 28822 enum nfs_cb_opnum4 { 28823 OP_CB_GETATTR = 3, 28824 OP_CB_RECALL = 4, 28825 /* Callback operations new to NFSv4.1 */ 28826 OP_CB_LAYOUTRECALL = 5, 28827 OP_CB_NOTIFY = 6, 28828 OP_CB_PUSH_DELEG = 7, 28829 OP_CB_RECALL_ANY = 8, 28830 OP_CB_RECALLABLE_OBJ_AVAIL = 9, 28831 OP_CB_RECALL_SLOT = 10, 28832 OP_CB_SEQUENCE = 11, 28833 OP_CB_WANTS_CANCELLED = 12, 28834 OP_CB_NOTIFY_LOCK = 13, 28835 OP_CB_NOTIFY_DEVICEID = 14, 28837 OP_CB_ILLEGAL = 10044 28838 }; 28840 union nfs_cb_argop4 switch (unsigned argop) { 28841 case OP_CB_GETATTR: 28842 CB_GETATTR4args opcbgetattr; 28843 case OP_CB_RECALL: 28844 CB_RECALL4args opcbrecall; 28845 case OP_CB_LAYOUTRECALL: 28846 CB_LAYOUTRECALL4args opcblayoutrecall; 28847 case OP_CB_NOTIFY: 28848 CB_NOTIFY4args opcbnotify; 28849 case OP_CB_PUSH_DELEG: 28850 CB_PUSH_DELEG4args opcbpush_deleg; 28851 case OP_CB_RECALL_ANY: 28852 CB_RECALL_ANY4args opcbrecall_any; 28853 case OP_CB_RECALLABLE_OBJ_AVAIL: 28854 CB_RECALLABLE_OBJ_AVAIL4args opcbrecallable_obj_avail; 28855 case OP_CB_RECALL_SLOT: 28856 CB_RECALL_SLOT4args opcbrecall_slot; 28857 case OP_CB_SEQUENCE: 28858 CB_SEQUENCE4args opcbsequence; 28859 case OP_CB_WANTS_CANCELLED: 28860 CB_WANTS_CANCELLED4args opcbwants_cancelled; 28861 case OP_CB_NOTIFY_LOCK: 28862 CB_NOTIFY_LOCK4args opcbnotify_lock; 28863 case OP_CB_NOTIFY_DEVICEID: 28864 CB_NOTIFY_DEVICEID4args opcbnotify_deviceid; 28865 case OP_CB_ILLEGAL: void; 28866 }; 28868 struct CB_COMPOUND4args { 28869 utf8str_cs tag; 28870 uint32_t minorversion; 28871 uint32_t callback_ident; 28872 nfs_cb_argop4 argarray<>; 28873 }; 28875 19.2.2. RESULTS 28876 union nfs_cb_resop4 switch (unsigned resop) { 28877 case OP_CB_GETATTR: CB_GETATTR4res opcbgetattr; 28878 case OP_CB_RECALL: CB_RECALL4res opcbrecall; 28880 /* new NFSv4.1 operations */ 28881 case OP_CB_LAYOUTRECALL: 28882 CB_LAYOUTRECALL4res 28883 opcblayoutrecall; 28885 case OP_CB_NOTIFY: CB_NOTIFY4res opcbnotify; 28887 case OP_CB_PUSH_DELEG: CB_PUSH_DELEG4res 28888 opcbpush_deleg; 28890 case OP_CB_RECALL_ANY: CB_RECALL_ANY4res 28891 opcbrecall_any; 28893 case OP_CB_RECALLABLE_OBJ_AVAIL: 28894 CB_RECALLABLE_OBJ_AVAIL4res 28895 opcbrecallable_obj_avail; 28897 case OP_CB_RECALL_SLOT: 28898 CB_RECALL_SLOT4res 28899 opcbrecall_slot; 28901 case OP_CB_SEQUENCE: CB_SEQUENCE4res opcbsequence; 28903 case OP_CB_WANTS_CANCELLED: 28904 CB_WANTS_CANCELLED4res 28905 opcbwants_cancelled; 28907 case OP_CB_NOTIFY_LOCK: 28908 CB_NOTIFY_LOCK4res 28909 opcbnotify_lock; 28911 case OP_CB_NOTIFY_DEVICEID: 28912 CB_NOTIFY_DEVICEID4res 28913 opcbnotify_deviceid; 28915 /* Not new operation */ 28916 case OP_CB_ILLEGAL: CB_ILLEGAL4res opcbillegal; 28917 }; 28919 struct CB_COMPOUND4res { 28920 nfsstat4 status; 28921 utf8str_cs tag; 28922 nfs_cb_resop4 resarray<>; 28923 }; 28925 19.2.3. DESCRIPTION 28927 The CB_COMPOUND procedure is used to combine one or more of the 28928 callback procedures into a single RPC request. The main callback RPC 28929 program has two main procedures: CB_NULL and CB_COMPOUND. All other 28930 operations use the CB_COMPOUND procedure as a wrapper. 28932 During the processing of the CB_COMPOUND procedure, the client may 28933 find that it does not have the available resources to execute any or 28934 all of the operations within the CB_COMPOUND sequence. Refer to 28935 Section 2.10.6.4 for details. 28937 The minorversion field of the arguments MUST be the same as the 28938 minorversion of the COMPOUND procedure used to create the client ID 28939 and session. For NFSv4.1, minorversion MUST be set to 1. 
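The non-normative C sketch below assumes rpcgen-style bindings for the XDR above (CB_COMPOUND4args, CB_COMPOUND4res, nfs_cb_argop4, nfs_cb_resop4) together with an illustrative per-operation handler, cb_dispatch_one(), supplied elsewhere. It shows the minorversion check just described and the in-order processing called for in the IMPLEMENTATION subsection below, with the reply's status taken from the last operation executed, as described in the paragraphs that follow; allocation of the result array and echoing of the tag are omitted for brevity.

      #include <stdint.h>
      /* plus the rpcgen-generated header for the NFSv4.1 callback
       * protocol, which defines the types and error codes used here */

      /* Illustrative per-operation handler, supplied elsewhere. */
      nfsstat4 cb_dispatch_one(nfs_cb_argop4 *argop, nfs_cb_resop4 *resop);

      void cb_compound_process(CB_COMPOUND4args *args, CB_COMPOUND4res *res)
      {
          uint32_t i = 0;
          nfsstat4 status = NFS4_OK;

          /* The backchannel minor version must match the one used to
           * create the client ID and session; for NFSv4.1 it is 1. */
          if (args->minorversion != 1) {
              res->status = NFS4ERR_MINOR_VERS_MISMATCH;
              res->resarray.resarray_len = 0;
              return;
          }

          /* Execute the operations in order, stopping at the first one
           * whose status is not NFS4_OK; the reply's status is that of
           * the last operation executed. */
          while (i < args->argarray.argarray_len) {
              status = cb_dispatch_one(&args->argarray.argarray_val[i],
                                       &res->resarray.resarray_val[i]);
              i++;
              if (status != NFS4_OK)
                  break;
          }
          res->resarray.resarray_len = i;
          res->status = status;
      }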
28941 Contained within the CB_COMPOUND results is a "status" field. This 28942 status MUST be equal to the status of the last operation that was 28943 executed within the CB_COMPOUND procedure. Therefore, if an 28944 operation incurred an error, then the "status" value will be the same 28945 error value as is being returned for the operation that failed. 28947 The "tag" field is handled the same way as that of the COMPOUND 28948 procedure (see Section 16.2.3). 28950 Illegal operation codes are handled in the same way as they are 28951 handled for the COMPOUND procedure. 28953 19.2.4. IMPLEMENTATION 28955 The CB_COMPOUND procedure is used to combine individual operations 28956 into a single RPC request. The client interprets each of the 28957 operations in turn. If an operation is executed by the client and 28958 the status of that operation is NFS4_OK, then the next operation in 28959 the CB_COMPOUND procedure is executed. The client continues this 28960 process until there are no more operations to be executed or one of 28961 the operations has a status value other than NFS4_OK. 28963 19.2.5. ERRORS 28965 CB_COMPOUND will of course return every error that each operation on 28966 the backchannel can return (see Table 13). However, if CB_COMPOUND 28967 returns zero operations, obviously the error returned by COMPOUND has 28968 nothing to do with an error returned by an operation. The list of 28969 errors CB_COMPOUND will return if it processes zero operations 28970 includes: 28972 +==============================+==================================+ 28973 | Error | Notes | 28974 +==============================+==================================+ 28975 | NFS4ERR_BADCHAR | The tag argument has a character | 28976 | | the replier does not support. | 28977 +------------------------------+----------------------------------+ 28978 | NFS4ERR_BADXDR | | 28979 +------------------------------+----------------------------------+ 28980 | NFS4ERR_DELAY | | 28981 +------------------------------+----------------------------------+ 28982 | NFS4ERR_INVAL | The tag argument is not in UTF-8 | 28983 | | encoding. | 28984 +------------------------------+----------------------------------+ 28985 | NFS4ERR_MINOR_VERS_MISMATCH | | 28986 +------------------------------+----------------------------------+ 28987 | NFS4ERR_SERVERFAULT | | 28988 +------------------------------+----------------------------------+ 28989 | NFS4ERR_TOO_MANY_OPS | | 28990 +------------------------------+----------------------------------+ 28991 | NFS4ERR_REP_TOO_BIG | | 28992 +------------------------------+----------------------------------+ 28993 | NFS4ERR_REP_TOO_BIG_TO_CACHE | | 28994 +------------------------------+----------------------------------+ 28995 | NFS4ERR_REQ_TOO_BIG | | 28996 +------------------------------+----------------------------------+ 28998 Table 24: CB_COMPOUND Error Returns 29000 20. NFSv4.1 Callback Operations 29002 20.1. Operation 3: CB_GETATTR - Get Attributes 29004 20.1.1. ARGUMENT 29006 struct CB_GETATTR4args { 29007 nfs_fh4 fh; 29008 bitmap4 attr_request; 29009 }; 29011 20.1.2. RESULT 29012 struct CB_GETATTR4resok { 29013 fattr4 obj_attributes; 29014 }; 29016 union CB_GETATTR4res switch (nfsstat4 status) { 29017 case NFS4_OK: 29018 CB_GETATTR4resok resok4; 29019 default: 29020 void; 29021 }; 29023 20.1.3. DESCRIPTION 29025 The CB_GETATTR operation is used by the server to obtain the current 29026 modified state of a file that has been OPEN_DELEGATE_WRITE delegated. 
29027 The size and change attributes are the only ones guaranteed to be 29028 serviced by the client. See Section 10.4.3 for a full description of 29029 how the client and server are to interact with the use of CB_GETATTR. 29031 If the filehandle specified is not one for which the client holds an 29032 OPEN_DELEGATE_WRITE delegation, an NFS4ERR_BADHANDLE error is 29033 returned. 29035 20.1.4. IMPLEMENTATION 29037 The client returns attrmask bits and the associated attribute values 29038 only for the change attribute, and attributes that it may change 29039 (time_modify, and size). 29041 20.2. Operation 4: CB_RECALL - Recall a Delegation 29043 20.2.1. ARGUMENT 29045 struct CB_RECALL4args { 29046 stateid4 stateid; 29047 bool truncate; 29048 nfs_fh4 fh; 29049 }; 29051 20.2.2. RESULT 29053 struct CB_RECALL4res { 29054 nfsstat4 status; 29055 }; 29057 20.2.3. DESCRIPTION 29059 The CB_RECALL operation is used to begin the process of recalling a 29060 delegation and returning it to the server. 29062 The truncate flag is used to optimize recall for a file object that 29063 is a regular file and is about to be truncated to zero. When it is 29064 TRUE, the client is freed of the obligation to propagate modified 29065 data for the file to the server, since this data is irrelevant. 29067 If the handle specified is not one for which the client holds a 29068 delegation, an NFS4ERR_BADHANDLE error is returned. 29070 If the stateid specified is not one corresponding to an OPEN 29071 delegation for the file specified by the filehandle, an 29072 NFS4ERR_BAD_STATEID is returned. 29074 20.2.4. IMPLEMENTATION 29076 The client SHOULD reply to the callback immediately. Replying does 29077 not complete the recall except when the value of the reply's status 29078 field is neither NFS4ERR_DELAY nor NFS4_OK. The recall is not 29079 complete until the delegation is returned using a DELEGRETURN 29080 operation. 29082 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from Client 29084 20.3.1. ARGUMENT 29085 /* 29086 * NFSv4.1 callback arguments and results 29087 */ 29089 enum layoutrecall_type4 { 29090 LAYOUTRECALL4_FILE = LAYOUT4_RET_REC_FILE, 29091 LAYOUTRECALL4_FSID = LAYOUT4_RET_REC_FSID, 29092 LAYOUTRECALL4_ALL = LAYOUT4_RET_REC_ALL 29093 }; 29095 struct layoutrecall_file4 { 29096 nfs_fh4 lor_fh; 29097 offset4 lor_offset; 29098 length4 lor_length; 29099 stateid4 lor_stateid; 29100 }; 29102 union layoutrecall4 switch(layoutrecall_type4 lor_recalltype) { 29103 case LAYOUTRECALL4_FILE: 29104 layoutrecall_file4 lor_layout; 29105 case LAYOUTRECALL4_FSID: 29106 fsid4 lor_fsid; 29107 case LAYOUTRECALL4_ALL: 29108 void; 29109 }; 29111 struct CB_LAYOUTRECALL4args { 29112 layouttype4 clora_type; 29113 layoutiomode4 clora_iomode; 29114 bool clora_changed; 29115 layoutrecall4 clora_recall; 29116 }; 29118 20.3.2. RESULT 29120 struct CB_LAYOUTRECALL4res { 29121 nfsstat4 clorr_status; 29122 }; 29124 20.3.3. DESCRIPTION 29126 The CB_LAYOUTRECALL operation is used by the server to recall layouts 29127 from the client; as a result, the client will begin the process of 29128 returning layouts via LAYOUTRETURN. The CB_LAYOUTRECALL operation 29129 specifies one of three forms of recall processing with the value of 29130 layoutrecall_type4. The recall is for one of the following: a 29131 specific layout of a specific file (LAYOUTRECALL4_FILE), an entire 29132 file system ID (LAYOUTRECALL4_FSID), or all file systems 29133 (LAYOUTRECALL4_ALL). 29135 The behavior of the operation varies based on the value of the 29136 layoutrecall_type4. 
The value and behaviors are: 29138 LAYOUTRECALL4_FILE 29139 For a layout to match the recall request, the values of the 29140 following fields must match those of the layout: clora_type, 29141 clora_iomode, lor_fh, and the byte-range specified by lor_offset 29142 and lor_length. The clora_iomode field may have a special value 29143 of LAYOUTIOMODE4_ANY. The special value LAYOUTIOMODE4_ANY will 29144 match any iomode originally returned in a layout; therefore, it 29145 acts as a wild card. The other special value used is for 29146 lor_length. If lor_length has a value of NFS4_UINT64_MAX, the 29147 lor_length field means the maximum possible file size. If a 29148 matching layout is found, it MUST be returned using the 29149 LAYOUTRETURN operation (see Section 18.44). An example of the 29150 field's special value use is if clora_iomode is LAYOUTIOMODE4_ANY, 29151 lor_offset is zero, and lor_length is NFS4_UINT64_MAX, then the 29152 entire layout is to be returned. 29154 The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the 29155 client does not hold layouts for the file or if the client does 29156 not have any overlapping layouts for the specification in the 29157 layout recall. 29159 LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL 29160 If LAYOUTRECALL4_FSID is specified, the fsid specifies the file 29161 system for which any outstanding layouts MUST be returned. If 29162 LAYOUTRECALL4_ALL is specified, all outstanding layouts MUST be 29163 returned. In addition, LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL 29164 specify that all the storage device ID to storage device address 29165 mappings in the affected file system(s) are also recalled. The 29166 respective LAYOUTRETURN with either LAYOUTRETURN4_FSID or 29167 LAYOUTRETURN4_ALL acknowledges to the server that the client 29168 invalidated the said device mappings. See Section 12.5.5.2.1.5 29169 for considerations with "bulk" recall of layouts. 29171 The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the 29172 client does not hold layouts and does not have valid deviceid 29173 mappings. 29175 In processing the layout recall request, the client also varies its 29176 behavior based on the value of the clora_changed field. This field 29177 is used by the server to provide additional context for the reason 29178 why the layout is being recalled. A FALSE value for clora_changed 29179 indicates that no change in the layout is expected and the client may 29180 write modified data to the storage devices involved; this must be 29181 done prior to returning the layout via LAYOUTRETURN. A TRUE value 29182 for clora_changed indicates that the server is changing the layout. 29183 Examples of layout changes and reasons for a TRUE indication are the 29184 following: the metadata server is restriping the file or a permanent 29185 error has occurred on a storage device and the metadata server would 29186 like to provide a new layout for the file. Therefore, a 29187 clora_changed value of TRUE indicates some level of change for the 29188 layout and the client SHOULD NOT write and commit modified data to 29189 the storage devices. In this case, the client writes and commits 29190 data through the metadata server. 29192 See Section 12.5.3 for a description of how the lor_stateid field in 29193 the arguments is to be constructed. Note that the "seqid" field of 29194 lor_stateid MUST NOT be zero. See Sections 8.2, 12.5.3, and 12.5.5.2 29195 for a further discussion and requirements. 29197 20.3.4. 
IMPLEMENTATION 29199 The client's processing for CB_LAYOUTRECALL is similar to CB_RECALL 29200 (recall of file delegations) in that the client responds to the 29201 request before actually returning layouts via the LAYOUTRETURN 29202 operation. While the client responds to the CB_LAYOUTRECALL 29203 immediately, the operation is not considered complete (i.e., 29204 considered pending) until all affected layouts are returned to the 29205 server via the LAYOUTRETURN operation. 29207 Before returning the layout to the server via LAYOUTRETURN, the 29208 client should wait for the response from in-process or in-flight 29209 READ, WRITE, or COMMIT operations that use the recalled layout. 29211 If the client is holding modified data that is affected by a recalled 29212 layout, the client has various options for writing the data to the 29213 server. As always, the client may write the data through the 29214 metadata server. In fact, the client may not have a choice other 29215 than writing to the metadata server when the clora_changed argument 29216 is TRUE and a new layout is unavailable from the server. However, 29217 the client may be able to write the modified data to the storage 29218 device if the clora_changed argument is FALSE; this needs to be done 29219 before returning the layout via LAYOUTRETURN. If the client were to 29220 obtain a new layout covering the modified data's byte-range, then 29221 writing to the storage devices is an available alternative. Note 29222 that before obtaining a new layout, the client must first return the 29223 original layout. 29225 In the case of modified data being written while the layout is held, 29226 the client must use LAYOUTCOMMIT operations at the appropriate time; 29227 as required LAYOUTCOMMIT must be done before the LAYOUTRETURN. If a 29228 large amount of modified data is outstanding, the client may send 29229 LAYOUTRETURNs for portions of the recalled layout; this allows the 29230 server to monitor the client's progress and adherence to the original 29231 recall request. However, the last LAYOUTRETURN in a sequence of 29232 returns MUST specify the full range being recalled (see 29233 Section 12.5.5.1 for details). 29235 If a server needs to delete a device ID and there are layouts 29236 referring to the device ID, CB_LAYOUTRECALL MUST be invoked to cause 29237 the client to return all layouts referring to the device ID before 29238 the server can delete the device ID. If the client does not return 29239 the affected layouts, the server MAY revoke the layouts. 29241 20.4. Operation 6: CB_NOTIFY - Notify Client of Directory Changes 29243 20.4.1. ARGUMENT 29245 /* 29246 * Directory notification types. 29247 */ 29248 enum notify_type4 { 29249 NOTIFY4_CHANGE_CHILD_ATTRS = 0, 29250 NOTIFY4_CHANGE_DIR_ATTRS = 1, 29251 NOTIFY4_REMOVE_ENTRY = 2, 29252 NOTIFY4_ADD_ENTRY = 3, 29253 NOTIFY4_RENAME_ENTRY = 4, 29254 NOTIFY4_CHANGE_COOKIE_VERIFIER = 5 29255 }; 29257 /* Changed entry information. */ 29258 struct notify_entry4 { 29259 component4 ne_file; 29260 fattr4 ne_attrs; 29261 }; 29263 /* Previous entry information */ 29264 struct prev_entry4 { 29265 notify_entry4 pe_prev_entry; 29266 /* what READDIR returned for this entry */ 29267 nfs_cookie4 pe_prev_entry_cookie; 29268 }; 29270 struct notify_remove4 { 29271 notify_entry4 nrm_old_entry; 29272 nfs_cookie4 nrm_old_entry_cookie; 29273 }; 29275 struct notify_add4 { 29276 /* 29277 * Information on object 29278 * possibly renamed over. 
29279 */ 29280 notify_remove4 nad_old_entry<1>; 29281 notify_entry4 nad_new_entry; 29282 /* what READDIR would have returned for this entry */ 29283 nfs_cookie4 nad_new_entry_cookie<1>; 29284 prev_entry4 nad_prev_entry<1>; 29285 bool nad_last_entry; 29286 }; 29288 struct notify_attr4 { 29289 notify_entry4 na_changed_entry; 29290 }; 29292 struct notify_rename4 { 29293 notify_remove4 nrn_old_entry; 29294 notify_add4 nrn_new_entry; 29295 }; 29297 struct notify_verifier4 { 29298 verifier4 nv_old_cookieverf; 29299 verifier4 nv_new_cookieverf; 29300 }; 29302 /* 29303 * Objects of type notify_<>4 and 29304 * notify_device_<>4 are encoded in this. 29305 */ 29306 typedef opaque notifylist4<>; 29308 struct notify4 { 29309 /* composed from notify_type4 or notify_deviceid_type4 */ 29310 bitmap4 notify_mask; 29311 notifylist4 notify_vals; 29312 }; 29314 struct CB_NOTIFY4args { 29315 stateid4 cna_stateid; 29316 nfs_fh4 cna_fh; 29317 notify4 cna_changes<>; 29318 }; 29320 20.4.2. RESULT 29322 struct CB_NOTIFY4res { 29323 nfsstat4 cnr_status; 29324 }; 29326 20.4.3. DESCRIPTION 29328 The CB_NOTIFY operation is used by the server to send notifications 29329 to clients about changes to delegated directories. The registration 29330 of notifications for the directories occurs when the delegation is 29331 established using GET_DIR_DELEGATION. These notifications are sent 29332 over the backchannel. The notification is sent once the original 29333 request has been processed on the server. The server will send an 29334 array of notifications for changes that might have occurred in the 29335 directory. The notifications are sent as list of pairs of bitmaps 29336 and values. See Section 3.3.7 for a description of how NFSv4.1 29337 bitmaps work. 29339 If the server has more notifications than can fit in the CB_COMPOUND 29340 request, it SHOULD send a sequence of serial CB_COMPOUND requests so 29341 that the client's view of the directory does not become confused. 29342 For example, if the server indicates that a file named "foo" is added 29343 and that the file "foo" is removed, the order in which the client 29344 receives these notifications needs to be the same as the order in 29345 which the corresponding operations occurred on the server. 29347 If the client holding the delegation makes any changes in the 29348 directory that cause files or sub-directories to be added or removed, 29349 the server will notify that client of the resulting change(s). If 29350 the client holding the delegation is making attribute or cookie 29351 verifier changes only, the server does not need to send notifications 29352 to that client. The server will send the following information for 29353 each operation: 29355 NOTIFY4_ADD_ENTRY 29356 The server will send information about the new directory entry 29357 being created along with the cookie for that entry. The entry 29358 information (data type notify_add4) includes the component name of 29359 the entry and attributes. The server will send this type of entry 29360 when a file is actually being created, when an entry is being 29361 added to a directory as a result of a rename across directories 29362 (see below), and when a hard link is being created to an existing 29363 file. If this entry is added to the end of the directory, the 29364 server will set the nad_last_entry flag to TRUE. 
If the file is 29365 added such that there is at least one entry before it, the server 29366 will also return the previous entry information (nad_prev_entry, a 29367 variable-length array of up to one element. If the array is of 29368 zero length, there is no previous entry), along with its cookie. 29369 This is to help clients find the right location in their file name 29370 caches and directory caches where this entry should be cached. If 29371 the new entry's cookie is available, it will be in the 29372 nad_new_entry_cookie (another variable-length array of up to one 29373 element) field. If the addition of the entry causes another entry 29374 to be deleted (which can only happen in the rename case) 29375 atomically with the addition, then information on this entry is 29376 reported in nad_old_entry. 29378 NOTIFY4_REMOVE_ENTRY 29379 The server will send information about the directory entry being 29380 deleted. The server will also send the cookie value for the 29381 deleted entry so that clients can get to the cached information 29382 for this entry. 29384 NOTIFY4_RENAME_ENTRY 29385 The server will send information about both the old entry and the 29386 new entry. This includes the name and attributes for each entry. 29387 In addition, if the rename causes the deletion of an entry (i.e., 29388 the case of a file renamed over), then this is reported in 29389 nrn_new_entry.nad_old_entry. This notification is only sent 29390 if both entries are in the same directory. If the rename is 29391 across directories, the server will send a remove notification to 29392 one directory and an add notification to the other directory, 29393 assuming both have a directory delegation. 29395 NOTIFY4_CHANGE_CHILD_ATTRS/NOTIFY4_CHANGE_DIR_ATTRS 29396 The client will use the attribute mask to inform the server of 29397 attributes for which it wants to receive notifications. This 29398 change notification can be requested for changes to the attributes 29399 of the directory as well as changes to any file's attributes in 29400 the directory by using two separate attribute masks. The client 29401 cannot ask for change attribute notification for a specific file. 29402 One attribute mask covers all the files in the directory. Upon 29403 any attribute change, the server will send back the values of 29404 changed attributes. Notifications might not make sense for some 29405 file system-wide attributes, and it is up to the server to decide 29406 which subset it wants to support. The client can negotiate the 29407 frequency of attribute notifications by letting the server know 29408 how often it wants to be notified of an attribute change. The 29409 server will return supported notification frequencies or an 29410 indication that no notification is permitted for directory or 29411 child attributes by setting the dir_notif_delay and 29412 dir_entry_notif_delay attributes, respectively. 29414 NOTIFY4_CHANGE_COOKIE_VERIFIER 29415 If the cookie verifier changes while a client is holding a 29416 delegation, the server will notify the client so that it can 29417 invalidate its cookies and re-send a READDIR to get the new set of 29418 cookies. 29420 20.5. Operation 7: CB_PUSH_DELEG - Offer Previously Requested 29421 Delegation to Client 29423 20.5.1. ARGUMENT 29425 struct CB_PUSH_DELEG4args { 29426 nfs_fh4 cpda_fh; 29427 open_delegation4 cpda_delegation; 29429 }; 29431 20.5.2. RESULT 29433 struct CB_PUSH_DELEG4res { 29434 nfsstat4 cpdr_status; 29435 }; 29437 20.5.3.
DESCRIPTION 29439 CB_PUSH_DELEG is used by the server both to signal to the client that 29440 the delegation it wants (previously indicated via a want established 29441 from an OPEN or WANT_DELEGATION operation) is available and to 29442 simultaneously offer the delegation to the client. The client has 29443 the choice of accepting the delegation by returning NFS4_OK to the 29444 server, delaying the decision to accept the offered delegation by 29445 returning NFS4ERR_DELAY, or permanently rejecting the offer of the 29446 delegation by returning NFS4ERR_REJECT_DELEG. When a delegation is 29447 rejected in this fashion, the want previously established is 29448 permanently deleted and the delegation is subject to acquisition by 29449 another client. 29451 20.5.4. IMPLEMENTATION 29453 If the client does return NFS4ERR_DELAY and there is a conflicting 29454 delegation request, the server MAY process it at the expense of the 29455 client that returned NFS4ERR_DELAY. The client's want will not be 29456 cancelled, but MAY be processed behind other delegation requests or 29457 registered wants. 29459 When a client returns a status other than NFS4_OK, NFS4ERR_DELAY, or 29460 NFS4ERR_REJECT_DELEG, the want remains pending, although servers may 29461 decide to cancel the want by sending a CB_WANTS_CANCELLED. 29463 20.6. Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects 29465 20.6.1. ARGUMENT 29467 const RCA4_TYPE_MASK_RDATA_DLG = 0; 29468 const RCA4_TYPE_MASK_WDATA_DLG = 1; 29469 const RCA4_TYPE_MASK_DIR_DLG = 2; 29470 const RCA4_TYPE_MASK_FILE_LAYOUT = 3; 29471 const RCA4_TYPE_MASK_BLK_LAYOUT = 4; 29472 const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; 29473 const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; 29474 const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN = 12; 29475 const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX = 15; 29477 struct CB_RECALL_ANY4args { 29478 uint32_t craa_objects_to_keep; 29479 bitmap4 craa_type_mask; 29480 }; 29482 20.6.2. RESULT 29483 struct CB_RECALL_ANY4res { 29484 nfsstat4 crar_status; 29485 }; 29487 20.6.3. DESCRIPTION 29489 The server may decide that it cannot hold all of the state for 29490 recallable objects, such as delegations and layouts, without running 29491 out of resources. In such a case, while not optimal, the server is 29492 free to recall individual objects to reduce the load. 29494 Because the general purpose of such recallable objects as delegations 29495 is to eliminate client interaction with the server, the server cannot 29496 interpret lack of recent use as indicating that the object is no 29497 longer useful. The absence of visible use is consistent with a 29498 delegation keeping potential operations from being sent to the 29499 server. In the case of layouts, while it is true that the usefulness 29500 of a layout is indicated by the use of the layout when storage 29501 devices receive I/O requests, because there is no mandate that a 29502 storage device indicate to the metadata server any past or present 29503 use of a layout, the metadata server is not likely to know which 29504 layouts are good candidates to recall in response to low resources. 29506 In order to implement an effective reclaim scheme for such objects, 29507 the server's knowledge of available resources must be used to 29508 determine when objects must be recalled with the clients selecting 29509 the actual objects to be returned. 29511 Server implementations may differ in their resource allocation 29512 requirements.
For example, one server may share resources among all 29513 classes of recallable objects, whereas another may use separate 29514 resource pools for layouts and for delegations, or further separate 29515 resources by types of delegations. 29517 When a given resource pool is over-utilized, the server can send a 29518 CB_RECALL_ANY to clients holding recallable objects of the types 29519 involved, allowing it to keep a certain number of such objects and 29520 return any excess. A mask specifies which types of objects are to be 29521 limited. The client chooses, based on its own knowledge of current 29522 usefulness, which of the objects in that class should be returned. 29524 A number of bits are defined. For some of these, ranges are defined 29525 and it is up to the definition of the storage protocol to specify how 29526 these are to be used. There are ranges reserved for object-based 29527 storage protocols and for other experimental storage protocols. An 29528 RFC defining such a storage protocol needs to specify how particular 29529 bits within its range are to be used. For example, it may specify a 29530 mapping between attributes of the layout (read vs. write, size of 29531 area) and the bit to be used, or it may define a field in the layout 29532 where the associated bit position is made available by the server to 29533 the client. 29535 RCA4_TYPE_MASK_RDATA_DLG 29536 The client is to return OPEN_DELEGATE_READ delegations on non- 29537 directory file objects. 29539 RCA4_TYPE_MASK_WDATA_DLG 29540 The client is to return OPEN_DELEGATE_WRITE delegations on regular 29541 file objects. 29543 RCA4_TYPE_MASK_DIR_DLG 29544 The client is to return directory delegations. 29546 RCA4_TYPE_MASK_FILE_LAYOUT 29547 The client is to return layouts of type LAYOUT4_NFSV4_1_FILES. 29549 RCA4_TYPE_MASK_BLK_LAYOUT 29550 See [48] for a description. 29552 RCA4_TYPE_MASK_OBJ_LAYOUT_MIN to RCA4_TYPE_MASK_OBJ_LAYOUT_MAX 29553 See [47] for a description. 29555 RCA4_TYPE_MASK_OTHER_LAYOUT_MIN to RCA4_TYPE_MASK_OTHER_LAYOUT_MAX 29556 This range is reserved for telling the client to recall layouts of 29557 experimental or site-specific layout types (see Section 3.3.13). 29559 When a bit is set in the type mask that corresponds to an undefined 29560 type of recallable object, NFS4ERR_INVAL MUST be returned. When a 29561 bit is set that corresponds to a defined type of object but the 29562 client does not support an object of the type, NFS4ERR_INVAL MUST NOT 29563 be returned. Future minor versions of NFSv4 may expand the set of 29564 valid type mask bits. 29566 CB_RECALL_ANY specifies a count of objects that the client may keep 29567 as opposed to a count that the client must return. This is to avoid 29568 a potential race between a CB_RECALL_ANY that had a count of objects 29569 to free with a set of client-originated operations to return layouts 29570 or delegations. As a result of the race, the client and server would 29571 have differing ideas as to how many objects to return. Hence, the 29572 client could mistakenly free too many. 29574 If resource demands prompt it, the server may send another 29575 CB_RECALL_ANY with a lower count, even if it has not yet received an 29576 acknowledgment from the client for a previous CB_RECALL_ANY with the 29577 same type mask. Although the possibility exists that these will be 29578 received by the client in an order different from the order in which 29579 they were sent, any such permutation of the callback stream is 29580 harmless. 
It is the job of the client to bring down the size of the 29581 recallable object set in line with each CB_RECALL_ANY received, and 29582 until that obligation is met, it cannot be cancelled or modified by 29583 any subsequent CB_RECALL_ANY for the same type mask. Thus, if the 29584 server sends two CB_RECALL_ANYs, the effect will be the same as if 29585 the lower count was sent, whatever the order of recall receipt. Note 29586 that this means that a server may not cancel the effect of a 29587 CB_RECALL_ANY by sending another recall with a higher count. When a 29588 CB_RECALL_ANY is received and the count is already within the limit 29589 set or is above a limit that the client is working to get down to, 29590 that callback has no effect. 29592 Servers are generally free to deny recallable objects when 29593 insufficient resources are available. Note that the effect of such a 29594 policy is implicitly to give precedence to existing objects relative 29595 to requested ones, with the result that resources might not be 29596 optimally used. To prevent this, servers are well advised to make 29597 the point at which they start sending CB_RECALL_ANY callbacks 29598 somewhat below that at which they cease to give out new delegations 29599 and layouts. This allows the client to purge its less-used objects 29600 whenever appropriate and so continue to have its subsequent requests 29601 given new resources freed up by object returns. 29603 20.6.4. IMPLEMENTATION 29605 The client can choose to return any type of object specified by the 29606 mask. If a server wishes to limit the use of objects of a specific 29607 type, it should only specify that type in the mask it sends. Should 29608 the client fail to return requested objects, it is up to the server 29609 to handle this situation, typically by sending specific recalls 29610 (i.e., sending CB_RECALL operations) to properly limit resource 29611 usage. The server should give the client enough time to return 29612 objects before proceeding to specific recalls. This time should not 29613 be less than the lease period. 29615 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources for 29616 Recallable Objects 29618 20.7.1. ARGUMENT 29620 typedef CB_RECALL_ANY4args CB_RECALLABLE_OBJ_AVAIL4args; 29622 20.7.2. RESULT 29624 struct CB_RECALLABLE_OBJ_AVAIL4res { 29625 nfsstat4 croa_status; 29626 }; 29628 20.7.3. DESCRIPTION 29630 CB_RECALLABLE_OBJ_AVAIL is used by the server to signal the client 29631 that the server has resources to grant recallable objects that might 29632 previously have been denied by OPEN, WANT_DELEGATION, GET_DIR_DELEG, 29633 or LAYOUTGET. 29635 The argument craa_objects_to_keep means the total number of 29636 recallable objects of the types indicated in the argument type_mask 29637 that the server believes it can allow the client to have, including 29638 the number of such objects the client already has. A client that 29639 tries to acquire more recallable objects than the server informs it 29640 can have runs the risk of having objects recalled. 29642 The server is not obligated to reserve the difference between the 29643 number of the objects the client currently has and the value of 29644 craa_objects_to_keep, nor does delaying the reply to 29645 CB_RECALLABLE_OBJ_AVAIL prevent the server from using the resources 29646 of the recallable objects for another purpose. 
Indeed, if a client 29647 responds slowly to CB_RECALLABLE_OBJ_AVAIL, the server might 29648 interpret the client as having reduced capability to manage 29649 recallable objects, and so cancel or reduce any reservation it is 29650 maintaining on behalf of the client. Thus, if the client desires to 29651 acquire more recallable objects, it needs to reply quickly to 29652 CB_RECALLABLE_OBJ_AVAIL, and then send the appropriate operations to 29653 acquire recallable objects. 29655 20.8. Operation 10: CB_RECALL_SLOT - Change Flow Control Limits 29657 20.8.1. ARGUMENT 29659 struct CB_RECALL_SLOT4args { 29660 slotid4 rsa_target_highest_slotid; 29661 }; 29663 20.8.2. RESULT 29665 struct CB_RECALL_SLOT4res { 29666 nfsstat4 rsr_status; 29667 }; 29669 20.8.3. DESCRIPTION 29671 The CB_RECALL_SLOT operation requests the client to return session 29672 slots, and if applicable, transport credits (e.g., RDMA credits for 29673 connections associated with the operations channel) of the session's 29674 fore channel. CB_RECALL_SLOT specifies rsa_target_highest_slotid, 29675 the value of the target highest slot ID the server wants for the 29676 session. The client MUST then progress toward reducing the session's 29677 highest slot ID to the target value. 29679 If the session has only non-RDMA connections associated with its 29680 operations channel, then the client need only wait for all 29681 outstanding requests with a slot ID > rsa_target_highest_slotid to 29682 complete, then send a single COMPOUND consisting of a single SEQUENCE 29683 operation, with the sa_highest_slotid field set to 29684 rsa_target_highest_slotid. If there are RDMA-based connections 29685 associated with the operations channel, then the client needs to also send 29686 enough zero-length "RDMA Send" messages to take the total RDMA credit 29687 count to rsa_target_highest_slotid + 1 or below. 29689 20.8.4. IMPLEMENTATION 29691 If the client fails to reduce the highest slot ID it has on the fore channel 29692 to what the server requests, the server can force the issue by 29693 asserting flow control on the receive side of all connections bound 29694 to the fore channel, and then finish servicing all outstanding 29695 requests that are in slots greater than rsa_target_highest_slotid. 29696 Once that is done, the server can then open the flow control, and any 29697 time the client sends a new request on a slot greater than 29698 rsa_target_highest_slotid, the server can return NFS4ERR_BADSLOT. 29700 20.9. Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing and 29701 Control 29703 20.9.1. ARGUMENT 29704 struct referring_call4 { 29705 sequenceid4 rc_sequenceid; 29706 slotid4 rc_slotid; 29707 }; 29709 struct referring_call_list4 { 29710 sessionid4 rcl_sessionid; 29711 referring_call4 rcl_referring_calls<>; 29712 }; 29714 struct CB_SEQUENCE4args { 29715 sessionid4 csa_sessionid; 29716 sequenceid4 csa_sequenceid; 29717 slotid4 csa_slotid; 29718 slotid4 csa_highest_slotid; 29719 bool csa_cachethis; 29720 referring_call_list4 csa_referring_call_lists<>; 29721 }; 29723 20.9.2. RESULT 29725 struct CB_SEQUENCE4resok { 29726 sessionid4 csr_sessionid; 29727 sequenceid4 csr_sequenceid; 29728 slotid4 csr_slotid; 29729 slotid4 csr_highest_slotid; 29730 slotid4 csr_target_highest_slotid; 29731 }; 29733 union CB_SEQUENCE4res switch (nfsstat4 csr_status) { 29734 case NFS4_OK: 29735 CB_SEQUENCE4resok csr_resok4; 29736 default: 29737 void; 29738 }; 29740 20.9.3.
DESCRIPTION 29742 The CB_SEQUENCE operation is used to manage operational accounting 29743 for the backchannel of the session on which a request is sent. The 29744 contents include the session ID to which this request belongs, the 29745 slot ID and sequence ID used by the server to implement session 29746 request control and exactly once semantics, and exchanged slot ID 29747 maxima that are used to adjust the size of the reply cache. In each 29748 CB_COMPOUND request, CB_SEQUENCE MUST appear once and MUST be the 29749 first operation. The error NFS4ERR_SEQUENCE_POS MUST be returned 29750 when CB_SEQUENCE is found in any position in a CB_COMPOUND beyond the 29751 first. If any other operation is in the first position of 29752 CB_COMPOUND, NFS4ERR_OP_NOT_IN_SESSION MUST be returned. 29754 See Section 18.46.3 for a description of how slots are processed. 29756 If csa_cachethis is TRUE, then the server is requesting that the 29757 client cache the reply in the callback reply cache. The client MUST 29758 cache the reply (see Section 2.10.6.1.3). 29760 The csa_referring_call_lists array is the list of COMPOUND requests, 29761 identified by session ID, slot ID, and sequence ID. These are 29762 requests that the client previously sent to the server. These 29763 previous requests created state that some operation(s) in the same 29764 CB_COMPOUND as the csa_referring_call_lists are identifying. A 29765 session ID is included because leased state is tied to a client ID, 29766 and a client ID can have multiple sessions. See Section 2.10.6.3. 29768 The value of the csa_sequenceid argument relative to the cached 29769 sequence ID on the slot falls into one of three cases. 29771 * If the difference between csa_sequenceid and the client's cached 29772 sequence ID at the slot ID is two (2) or more, or if 29773 csa_sequenceid is less than the cached sequence ID (accounting for 29774 wraparound of the unsigned sequence ID value), then the client 29775 MUST return NFS4ERR_SEQ_MISORDERED. 29777 * If csa_sequenceid and the cached sequence ID are the same, this is 29778 a retry, and the client returns the CB_COMPOUND request's cached 29779 reply. 29781 * If csa_sequenceid is one greater (accounting for wraparound) than 29782 the cached sequence ID, then this is a new request, and the slot's 29783 sequence ID is incremented. The operations subsequent to 29784 CB_SEQUENCE, if any, are processed. If there are no other 29785 operations, the only other effects are to cache the CB_SEQUENCE 29786 reply in the slot, maintain the session's activity, and when the 29787 server receives the CB_SEQUENCE reply, renew the lease of state 29788 related to the client ID. 29790 If the server reuses a slot ID and sequence ID for a completely 29791 different request, the client MAY treat the request as if it is a 29792 retry of what it has already executed. The client MAY however detect 29793 the server's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY. 29795 If CB_SEQUENCE returns an error, then the state of the slot (sequence 29796 ID, cached reply) MUST NOT change. See Section 2.10.6.1.3 for the 29797 conditions when the error NFS4ERR_RETRY_UNCACHED_REP might be 29798 returned. 29800 The client returns two "highest_slotid" values: csr_highest_slotid 29801 and csr_target_highest_slotid. The former is the highest slot ID the 29802 client will accept in a future CB_SEQUENCE operation, and SHOULD NOT 29803 be less than the value of csa_highest_slotid (but see 29804 Section 2.10.6.1 for an exception). 
The latter is the highest slot 29805 ID the client would prefer the server use on a future CB_SEQUENCE 29806 operation. 29808 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending Delegation 29809 Wants 29811 20.10.1. ARGUMENT 29813 struct CB_WANTS_CANCELLED4args { 29814 bool cwca_contended_wants_cancelled; 29815 bool cwca_resourced_wants_cancelled; 29816 }; 29818 20.10.2. RESULT 29820 struct CB_WANTS_CANCELLED4res { 29821 nfsstat4 cwcr_status; 29822 }; 29824 20.10.3. DESCRIPTION 29826 The CB_WANTS_CANCELLED operation is used to notify the client that 29827 some or all of the wants it registered for recallable delegations and 29828 layouts have been cancelled. 29830 If cwca_contended_wants_cancelled is TRUE, this indicates that the 29831 server will not be pushing to the client any delegations that become 29832 available after contention passes. 29834 If cwca_resourced_wants_cancelled is TRUE, this indicates that the 29835 server will not notify the client when there are resources on the 29836 server to grant delegations or layouts. 29838 After receiving a CB_WANTS_CANCELLED operation, the client is free to 29839 attempt to acquire the delegations or layouts it was waiting for, and 29840 possibly re-register wants. 29842 20.10.4. IMPLEMENTATION 29844 When a client has an OPEN, WANT_DELEGATION, or GET_DIR_DELEGATION 29845 request outstanding, when a CB_WANTS_CANCELLED is sent, the server 29846 may need to make clear to the client whether a promise to signal 29847 delegation availability happened before the CB_WANTS_CANCELLED and is 29848 thus covered by it, or after the CB_WANTS_CANCELLED in which case it 29849 was not covered by it. The server can make this distinction by 29850 putting the appropriate requests into the list of referring calls in 29851 the associated CB_SEQUENCE. 29853 20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible Lock 29854 Availability 29856 20.11.1. ARGUMENT 29858 struct CB_NOTIFY_LOCK4args { 29859 nfs_fh4 cnla_fh; 29860 lock_owner4 cnla_lock_owner; 29861 }; 29863 20.11.2. RESULT 29865 struct CB_NOTIFY_LOCK4res { 29866 nfsstat4 cnlr_status; 29867 }; 29869 20.11.3. DESCRIPTION 29871 The server can use this operation to indicate that a byte-range lock 29872 for the given file and lock-owner, previously requested by the client 29873 via an unsuccessful LOCK operation, might be available. 29875 This callback is meant to be used by servers to help reduce the 29876 latency of blocking locks in the case where they recognize that a 29877 client that has been polling for a blocking byte-range lock may now 29878 be able to acquire the lock. If the server supports this callback 29879 for a given file, it MUST set the OPEN4_RESULT_MAY_NOTIFY_LOCK flag 29880 when responding to successful opens for that file. This does not 29881 commit the server to the use of CB_NOTIFY_LOCK, but the client may 29882 use this as a hint to decide how frequently to poll for locks derived 29883 from that open. 29885 If an OPEN operation results in an upgrade, in which the stateid 29886 returned has an "other" value matching that of a stateid already 29887 allocated, with a new "seqid" indicating a change in the lock being 29888 represented, then the value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag 29889 when responding to that new OPEN controls handling from that point 29890 going forward. 
When parallel OPENs are done on the same file and 29891 open-owner, the ordering of the "seqid" fields of the returned 29892 stateids (subject to wraparound) is to be used to select the 29893 controlling value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag. 29895 20.11.4. IMPLEMENTATION 29897 The server MUST NOT grant the byte-range lock to the client unless 29898 and until it receives a LOCK operation from the client. Similarly, 29899 the client receiving this callback cannot assume that it now has the 29900 lock or that a subsequent LOCK operation for the lock will be 29901 successful. 29903 The server is not required to implement this callback, and even if it 29904 does, it is not required to use it in any particular case. 29905 Therefore, the client must still rely on polling for blocking locks, 29906 as described in Section 9.6. 29908 Similarly, the client is not required to implement this callback, and 29909 even if it does, is still free to ignore it. Therefore, the server MUST 29910 NOT assume that the client will act based on the callback. 29912 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device ID 29913 Changes 29915 20.12.1. ARGUMENT 29916 /* 29917 * Device notification types. 29918 */ 29919 enum notify_deviceid_type4 { 29920 NOTIFY_DEVICEID4_CHANGE = 1, 29921 NOTIFY_DEVICEID4_DELETE = 2 29922 }; 29924 /* For NOTIFY4_DEVICEID4_DELETE */ 29925 struct notify_deviceid_delete4 { 29926 layouttype4 ndd_layouttype; 29927 deviceid4 ndd_deviceid; 29928 }; 29930 /* For NOTIFY4_DEVICEID4_CHANGE */ 29931 struct notify_deviceid_change4 { 29932 layouttype4 ndc_layouttype; 29933 deviceid4 ndc_deviceid; 29934 bool ndc_immediate; 29935 }; 29937 struct CB_NOTIFY_DEVICEID4args { 29938 notify4 cnda_changes<>; 29939 }; 29941 20.12.2. RESULT 29943 struct CB_NOTIFY_DEVICEID4res { 29944 nfsstat4 cndr_status; 29945 }; 29947 20.12.3. DESCRIPTION 29949 The CB_NOTIFY_DEVICEID operation is used by the server to send 29950 notifications to clients about changes to pNFS device IDs. The 29951 registration of device ID notifications is optional and is done via 29952 GETDEVICEINFO. These notifications are sent over the backchannel 29953 once the original request has been processed on the server. The 29954 server will send an array of notifications, cnda_changes, as a list 29955 of pairs of bitmaps and values. See Section 3.3.7 for a description 29956 of how NFSv4.1 bitmaps work. 29958 As with CB_NOTIFY (Section 20.4.3), it is possible the server has 29959 more notifications than can fit in a CB_COMPOUND, thus requiring 29960 multiple CB_COMPOUNDs. Unlike CB_NOTIFY, serialization is not an 29961 issue because unlike directory entries, device IDs cannot be re-used 29962 after being deleted (Section 12.2.10). 29964 All device ID notifications contain a device ID and a layout type. 29965 The layout type is necessary because two different layout types can 29966 share the same device ID, and the common device ID can have 29967 completely different mappings for each layout type. 29969 The server will send the following notifications: 29971 NOTIFY_DEVICEID4_CHANGE 29972 A previously provided device-ID-to-device-address mapping has 29973 changed and the client uses GETDEVICEINFO to obtain the updated 29974 mapping. The notification is encoded in a value of data type 29975 notify_deviceid_change4.
   This data type also contains a boolean field, ndc_immediate,
   which if TRUE indicates that the change will be enforced
   immediately, and so the client might not be able to complete any
   pending I/O to the device ID.  If ndc_immediate is FALSE, then for
   an indefinite time, the client can complete pending I/O.  After
   pending I/O is complete, the client SHOULD get the new
   device-ID-to-device-address mappings before sending new I/O
   requests to the storage devices addressed by the device ID.

NOTIFY_DEVICEID4_DELETE
   Deletes a device ID from the mappings.  This notification MUST
   NOT be sent if the client has a layout that refers to the device
   ID.  In other words, if the server is sending a delete device ID
   notification, one of the following is true for layouts associated
   with the layout type:

   *  The client never had a layout referring to that device ID.

   *  The client has returned all layouts referring to that device
      ID.

   *  The server has revoked all layouts referring to that device
      ID.

   The notification is encoded in a value of data type
   notify_deviceid_delete4.  After a server deletes a device ID, it
   MUST NOT reuse that device ID for the same layout type until the
   client ID is deleted.

20.13. Operation 10044: CB_ILLEGAL - Illegal Callback Operation

20.13.1. ARGUMENT

   void;

20.13.2. RESULT

   /*
    * CB_ILLEGAL: Response for illegal operation numbers
    */
   struct CB_ILLEGAL4res {
           nfsstat4 status;
   };

20.13.3. DESCRIPTION

This operation is a placeholder for encoding a result to handle the
case of the server sending an operation code within CB_COMPOUND that
is not defined in the NFSv4.1 specification.  See Section 19.2.3 for
more details.

The status field of CB_ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL.

20.13.4. IMPLEMENTATION

A server will probably not send an operation with code
OP_CB_ILLEGAL, but if it does, the response will be CB_ILLEGAL4res
just as it would be with any other invalid operation code.  Note
that if the client gets an illegal operation code that is not
OP_CB_ILLEGAL, and if the client checks for legal operation codes
during the XDR decode phase, then an instance of data type
CB_ILLEGAL4res will not be returned.

21. Security Considerations

Historically, the authentication model of NFS was based on the
entire machine being the NFS client, with the NFS server trusting
the NFS client to authenticate the end-user.  The NFS server in turn
shared its files only to specific clients, as identified by the
client's source network address.  Given this model, the AUTH_SYS RPC
security flavor simply identified the end-user using the client to
the NFS server.  When processing NFS responses, the client ensured
that the responses came from the same network address and port
number to which the request was sent.  While such a model is easy to
implement and simple to deploy and use, it is unsafe.  Thus, NFSv4.1
implementations are REQUIRED to support a security model that uses
end-to-end authentication, where an end-user on a client mutually
authenticates (via cryptographic schemes that do not expose
passwords or keys in the clear on the network) to a principal on an
NFS server.  Consideration is also given to the integrity and
privacy of NFS requests and responses.
The issues of end-to-end mutual 30053 authentication, integrity, and privacy are discussed in 30054 Section 2.2.1.1.1. There are specific considerations when using 30055 Kerberos V5 as described in Section 2.2.1.1.1.2.1.1. 30057 Note that being REQUIRED to implement does not mean REQUIRED to use; 30058 AUTH_SYS can be used by NFSv4.1 clients and servers. However, 30059 AUTH_SYS is merely an OPTIONAL security flavor in NFSv4.1, and so 30060 interoperability via AUTH_SYS is not assured. 30062 For reasons of reduced administration overhead, better performance, 30063 and/or reduction of CPU utilization, users of NFSv4.1 implementations 30064 might decline to use security mechanisms that enable integrity 30065 protection on each remote procedure call and response. The use of 30066 mechanisms without integrity leaves the user vulnerable to a man-in- 30067 the-middle of the NFS client and server that modifies the RPC request 30068 and/or the response. While implementations are free to provide the 30069 option to use weaker security mechanisms, there are three operations 30070 in particular that warrant the implementation overriding user 30071 choices. 30073 * The first two such operations are SECINFO and SECINFO_NO_NAME. It 30074 is RECOMMENDED that the client send both operations such that they 30075 are protected with a security flavor that has integrity 30076 protection, such as RPCSEC_GSS with either the 30077 rpc_gss_svc_integrity or rpc_gss_svc_privacy service. Without 30078 integrity protection encapsulating SECINFO and SECINFO_NO_NAME and 30079 their results, a man-in-the-middle could modify results such that 30080 the client might select a weaker algorithm in the set allowed by 30081 the server, making the client and/or server vulnerable to further 30082 attacks. 30084 * The third operation that SHOULD use integrity protection is any 30085 GETATTR for the fs_locations and fs_locations_info attributes, in 30086 order to mitigate the severity of a man-in-the-middle attack. The 30087 attack has two steps. First the attacker modifies the unprotected 30088 results of some operation to return NFS4ERR_MOVED. Second, when 30089 the client follows up with a GETATTR for the fs_locations or 30090 fs_locations_info attributes, the attacker modifies the results to 30091 cause the client to migrate its traffic to a server controlled by 30092 the attacker. With integrity protection, this attack is 30093 mitigated. 30095 Relative to previous NFS versions, NFSv4.1 has additional security 30096 considerations for pNFS (see Sections 12.9 and 13.12), locking and 30097 session state (see Section 2.10.8.3), and state recovery during grace 30098 period (see Section 8.4.2.1.1). With respect to locking and session 30099 state, if SP4_SSV state protection is being used, Section 2.10.10 has 30100 specific security considerations for the NFSv4.1 client and server. 30102 Security considerations for lock reclaim differ between the two 30103 different situations in which state reclaim is to be done. The 30104 server failure situation is discussed in Section 8.4.2.1.1, while the 30105 per-fs state reclaim done in support of migration/replication is 30106 discussed in Section 11.11.9.1. 30108 The use of the multi-server namespace features described in 30109 Section 11 raises the possibility that requests to determine the set 30110 of network addresses corresponding to a given server might be 30111 interfered with or have their responses modified in flight. 
In light of this possibility, the following considerations should be
noted:

*  When DNS is used to convert server names to addresses and DNSSEC
   [29] is not available, the validity of the network addresses
   returned generally cannot be relied upon.  However, when combined
   with a trusted resolver, DNS over TLS [30] and DNS over HTTPS
   [34] can be relied upon to provide valid address resolutions.

   In situations in which the validity of the provided addresses
   cannot be relied upon and the client uses RPCSEC_GSS to access
   the designated server, it is possible for mutual authentication
   to discover invalid server addresses as long as the RPCSEC_GSS
   implementation used does not use insecure DNS queries to
   canonicalize the hostname components of the service principal
   names, as explained in [28].

*  The fetching of attributes containing file system location
   information SHOULD be performed using integrity protection.  It
   is important to note here that a client making a request of this
   sort without using integrity protection needs to be aware of the
   negative consequences of doing so, which can lead to invalid
   hostnames or network addresses being returned.  These include
   cases in which the client is directed to a server under the
   control of an attacker, who might get access to data written or
   provide incorrect values for data read.  In light of this, the
   client needs to recognize that using such returned location
   information to access an NFSv4 server without use of RPCSEC_GSS
   (i.e., by using AUTH_SYS) poses dangers as it can result in the
   client interacting with such an attacker-controlled server
   without any authentication facilities to verify the server's
   identity.

*  Despite the fact that it is a requirement that implementations
   provide "support" for use of RPCSEC_GSS, it cannot be assumed
   that use of RPCSEC_GSS is always available between any particular
   client-server pair.

*  When a client has the network addresses of a server but not the
   associated hostnames, the lack of hostnames interferes with its
   ability to use RPCSEC_GSS.

In light of the above, a server SHOULD present file system location
entries that correspond to file systems on other servers using a
hostname.  This would allow the client to interrogate the
fs_locations on the destination server to obtain trunking
information (as well as replica information) using integrity
protection, validating the name provided while assuring that the
response has not been modified in flight.

When RPCSEC_GSS is not available on a server, the client needs to be
aware of the fact that the location entries are subject to
modification in flight and so cannot be relied upon.  In the case of
a client being directed to another server after NFS4ERR_MOVED, this
could vitiate the authentication provided by the use of RPCSEC_GSS
on the designated destination server.  Even when RPCSEC_GSS
authentication is available on the destination, the server might
still properly authenticate as the server to which the client was
erroneously directed.  Without a way to decide whether the server is
a valid one, the client can only determine, using RPCSEC_GSS, that
the server corresponds to the name provided, with no basis for
trusting that server.
As a result, the client SHOULD NOT use such unverified location
entries as a basis for migration, even though RPCSEC_GSS might be
available on the destination.

When a file system location attribute is fetched upon connecting
with an NFS server, it SHOULD, as stated above, be done with
integrity protection.  When this is not possible, it is generally
best for the client to ignore trunking and replica information or
simply not fetch the location information for these purposes.

When location information cannot be verified, it can be subjected to
additional filtering to prevent the client from being
inappropriately directed.  For example, if a range of network
addresses can be determined that assures that the servers and
clients using AUTH_SYS are subject to the appropriate set of
constraints (e.g., physical network isolation, administrative
controls on the operating systems used), then network addresses in
the appropriate range can be used with others discarded or
restricted in their use of AUTH_SYS (see the illustrative sketch
below).

To summarize considerations regarding the use of RPCSEC_GSS in
fetching location information, we need to consider the following
possibilities for requests to interrogate location information, with
interrogation approaches on the referring and destination servers
arrived at separately:

*  The use of integrity protection is RECOMMENDED in all cases,
   since the absence of integrity protection exposes the client to
   the possibility of the results being modified in transit.

*  The use of requests issued without RPCSEC_GSS (i.e., using
   AUTH_SYS, which has no provision to avoid modification of data in
   flight), while undesirable and a potential security exposure, may
   not be avoidable in all cases.  Where the use of the returned
   information cannot be avoided, it is made subject to filtering as
   described above to eliminate the possibility that the client
   would treat an invalid address as if it were an NFSv4 server.
   The specifics will vary depending on the degree of network
   isolation and whether the request is to the referring or
   destination servers.

Even if such requests are not interfered with in flight, it is
possible for a compromised server to direct the client to use
inappropriate servers, such as those under the control of the
attacker.  It is not clear that being directed to such servers
represents a greater threat to the client than the damage that could
be done by the compromised server itself.  However, it is possible
that some sorts of transient server compromises might be exploited
to direct a client to a server capable of doing greater damage over
a longer time.  One useful step to guard against this possibility is
to issue requests to fetch location data using RPCSEC_GSS, even if
no mapping to an RPCSEC_GSS principal is available.  In this case,
RPCSEC_GSS would not be used, as it typically is, to identify the
client principal to the server, but rather to make sure (via
RPCSEC_GSS mutual authentication) that the server being contacted is
the one intended.

Similar considerations apply if the threat to be avoided is the
redirection of client traffic to inappropriate (i.e., poorly
performing) servers.
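The following fragment is purely illustrative and not part of the
protocol.  The trusted prefix, the helper names, and the candidate
address are hypothetical; a real client would take the trusted range
from local configuration.  It sketches the kind of address filtering
described above for location information that cannot be verified:

   #include <arpa/inet.h>    /* inet_pton(), htonl() */
   #include <netinet/in.h>
   #include <sys/socket.h>
   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Hypothetical trusted range: 192.0.2.0/24, a documentation prefix. */
   static const char    *trusted_net = "192.0.2.0";
   static const unsigned trusted_len = 24;

   /* Return true if a dotted-quad IPv4 address lies in the trusted range. */
   static bool addr_is_trusted(const char *dotted_quad)
   {
           struct in_addr addr, net;
           uint32_t mask;

           if (inet_pton(AF_INET, dotted_quad, &addr) != 1 ||
               inet_pton(AF_INET, trusted_net, &net) != 1)
                   return false;

           mask = trusted_len ? htonl(~0U << (32 - trusted_len)) : 0;
           return (addr.s_addr & mask) == (net.s_addr & mask);
   }

   int main(void)
   {
           const char *candidate = "192.0.2.17";  /* from an unverified
                                                     location entry */

           printf("%s is %s\n", candidate,
                  addr_is_trusted(candidate) ? "usable with AUTH_SYS"
                                             : "to be discarded");
           return 0;
   }

Comparable checks could be applied for any other administratively
determined constraint before an unverified address is used with
AUTH_SYS.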
In both cases, there is no reason for the 30230 information returned to depend on the identity of the client 30231 principal requesting it, while the validity of the server 30232 information, which has the capability to affect all client 30233 principals, is of considerable importance. 30235 22. IANA Considerations 30237 This section uses terms that are defined in [63]. 30239 22.1. IANA Actions 30241 This update does not require any modification of, or additions to, 30242 registry entries or registry rules associated with NFSv4.1. However, 30243 since this document obsoletes RFC 5661, IANA has updated all registry 30244 entries and registry rules references that point to RFC 5661 to point 30245 to this document instead. 30247 Previous actions by IANA related to NFSv4.1 are listed in the 30248 remaining subsections of Section 22. 30250 22.2. Named Attribute Definitions 30252 IANA created a registry called the "NFSv4 Named Attribute Definitions 30253 Registry". 30255 The NFSv4.1 protocol supports the association of a file with zero or 30256 more named attributes. The namespace identifiers for these 30257 attributes are defined as string names. The protocol does not define 30258 the specific assignment of the namespace for these file attributes. 30259 The IANA registry promotes interoperability where common interests 30260 exist. While application developers are allowed to define and use 30261 attributes as needed, they are encouraged to register the attributes 30262 with IANA. 30264 Such registered named attributes are presumed to apply to all minor 30265 versions of NFSv4, including those defined subsequently to the 30266 registration. If the named attribute is intended to be limited to 30267 specific minor versions, this will be clearly stated in the 30268 registry's assignment. 30270 All assignments to the registry are made on a First Come First Served 30271 basis, per Section 4.4 of [63]. The policy for each assignment is 30272 Specification Required, per Section 4.6 of [63]. 30274 Under the NFSv4.1 specification, the name of a named attribute can in 30275 theory be up to 2^32 - 1 bytes in length, but in practice NFSv4.1 30276 clients and servers will be unable to handle a string that long. 30277 IANA should reject any assignment request with a named attribute that 30278 exceeds 128 UTF-8 characters. To give the IESG the flexibility to 30279 set up bases of assignment of Experimental Use and Standards Action, 30280 the prefixes of "EXPE" and "STDS" are Reserved. The named attribute 30281 with a zero-length name is Reserved. 30283 The prefix "PRIV" is designated for Private Use. A site that wants 30284 to make use of unregistered named attributes without risk of 30285 conflicting with an assignment in IANA's registry should use the 30286 prefix "PRIV" in all of its named attributes. 30288 Because some NFSv4.1 clients and servers have case-insensitive 30289 semantics, the fifteen additional lower case and mixed case 30290 permutations of each of "EXPE", "PRIV", and "STDS" are Reserved 30291 (e.g., "expe", "expE", "exPe", etc. are Reserved). Similarly, IANA 30292 must not allow two assignments that would conflict if both named 30293 attributes were converted to a common case. 30295 The registry of named attributes is a list of assignments, each 30296 containing three fields for each assignment. 30298 1. A US-ASCII string name that is the actual name of the attribute. 30299 This name must be unique. This string name can be 1 to 128 UTF-8 30300 characters long. 30302 2. 
A reference to the specification of the named attribute.  The
reference can consume up to 256 bytes (or more if IANA permits).

3.  The point of contact of the registrant.  The point of contact
    can consume up to 256 bytes (or more if IANA permits).

22.2.1. Initial Registry

There is no initial registry.

22.2.2. Updating Registrations

The registrant is always permitted to update the point of contact
field.  Any other change will require Expert Review or IESG
Approval.

22.3. Device ID Notifications

IANA created a registry called the "NFSv4 Device ID Notifications
Registry".

The potential exists for new notification types to be added to the
CB_NOTIFY_DEVICEID operation (see Section 20.12).  This can be done
via changes to the operations that register notifications, or by
adding new operations to NFSv4.  This requires a new minor version
of NFSv4, and requires a Standards Track document from the IETF.
Another way to add a notification is to specify a new layout type
(see Section 22.5).

Hence, all assignments to the registry are made on a Standards
Action basis per Section 4.9 of [63], with Expert Review required.

The registry is a list of assignments, each containing five fields
per assignment.

1.  The name of the notification type.  This name must have the
    prefix "NOTIFY_DEVICEID4_".  This name must be unique.

2.  The value of the notification.  IANA will assign this number,
    and the request from the registrant will use TBD1 instead of an
    actual value.  IANA MUST use a whole number that can be no
    higher than 2^32-1, and should be the next available value.  The
    value assigned must be unique.  A Designated Expert must be used
    to ensure that when the name of the notification type and its
    value are added to the NFSv4.1 notify_deviceid_type4 enumerated
    data type in the NFSv4.1 XDR description [10], the result
    continues to be a valid XDR description.

3.  The Standards Track RFC(s) that describe the notification.  If
    the RFC(s) have not yet been published, the registrant will use
    RFCTBD2, RFCTBD3, etc. instead of an actual RFC number.

4.  How the RFC introduces the notification.  This is indicated by a
    single US-ASCII value.  If the value is N, it means a minor
    revision to the NFSv4 protocol.  If the value is L, it means a
    new pNFS layout type.  Other values can be used with IESG
    Approval.

5.  The minor versions of NFSv4 that are allowed to use the
    notification.  While these are numeric values, IANA will not
    allocate and assign them; the author of the relevant RFCs with
    IESG Approval assigns these numbers.  Each time there is a new
    minor version of NFSv4 approved, a Designated Expert should
    review the registry to make recommended updates as needed.

22.3.1. Initial Registry

The initial registry is in Table 25.  Note that the next available
value is zero.
30371 +=========================+=======+==========+=====+================+ 30372 | Notification Name | Value | RFC | How | Minor Versions | 30373 +=========================+=======+==========+=====+================+ 30374 | NOTIFY_DEVICEID4_CHANGE | 1 | RFC | N | 1 | 30375 | | | 8881 | | | 30376 +-------------------------+-------+----------+-----+----------------+ 30377 | NOTIFY_DEVICEID4_DELETE | 2 | RFC | N | 1 | 30378 | | | 8881 | | | 30379 +-------------------------+-------+----------+-----+----------------+ 30381 Table 25: Initial Device ID Notification Assignments 30383 22.3.2. Updating Registrations 30385 The update of a registration will require IESG Approval on the advice 30386 of a Designated Expert. 30388 22.4. Object Recall Types 30390 IANA created a registry called the "NFSv4 Recallable Object Types 30391 Registry". 30393 The potential exists for new object types to be added to the 30394 CB_RECALL_ANY operation (see Section 20.6). This can be done via 30395 changes to the operations that add recallable types, or by adding new 30396 operations to NFSv4. This requires a new minor version of NFSv4, and 30397 requires a Standards Track document from IETF. Another way to add a 30398 new recallable object is to specify a new layout type (see 30399 Section 22.5). 30401 All assignments to the registry are made on a Standards Action basis 30402 per Section 4.9 of [63], with Expert Review required. 30404 Recallable object types are 32-bit unsigned numbers. There are no 30405 Reserved values. Values in the range 12 through 15, inclusive, are 30406 designated for Private Use. 30408 The registry is a list of assignments, each containing five fields 30409 per assignment. 30411 1. The name of the recallable object type. This name must have the 30412 prefix "RCA4_TYPE_MASK_". The name must be unique. 30414 2. The value of the recallable object type. IANA will assign this 30415 number, and the request from the registrant will use TBD1 instead 30416 of an actual value. IANA MUST use a whole number that can be no 30417 higher than 2^32-1, and should be the next available value. The 30418 value must be unique. A Designated Expert must be used to ensure 30419 that when the name of the recallable type and its value are added 30420 to the NFSv4 XDR description [10], the result continues to be a 30421 valid XDR description. 30423 3. The Standards Track RFC(s) that describe the recallable object 30424 type. If the RFC(s) have not yet been published, the registrant 30425 will use RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. 30427 4. How the RFC introduces the recallable object type. This is 30428 indicated by a single US-ASCII value. If the value is N, it 30429 means a minor revision to the NFSv4 protocol. If the value is L, 30430 it means a new pNFS layout type. Other values can be used with 30431 IESG Approval. 30433 5. The minor versions of NFSv4 that are allowed to use the 30434 recallable object type. While these are numeric values, IANA 30435 will not allocate and assign them; the author of the relevant 30436 RFCs with IESG Approval assigns these numbers. Each time there 30437 is a new minor version of NFSv4 approved, a Designated Expert 30438 should review the registry to make recommended updates as needed. 30440 22.4.1. Initial Registry 30442 The initial registry is in Table 26. Note that the next available 30443 value is five. 
30445 +===============================+=======+======+=====+==========+ 30446 | Recallable Object Type Name | Value | RFC | How | Minor | 30447 | | | | | Versions | 30448 +===============================+=======+======+=====+==========+ 30449 | RCA4_TYPE_MASK_RDATA_DLG | 0 | RFC | N | 1 | 30450 | | | 8881 | | | 30451 +-------------------------------+-------+------+-----+----------+ 30452 | RCA4_TYPE_MASK_WDATA_DLG | 1 | RFC | N | 1 | 30453 | | | 8881 | | | 30454 +-------------------------------+-------+------+-----+----------+ 30455 | RCA4_TYPE_MASK_DIR_DLG | 2 | RFC | N | 1 | 30456 | | | 8881 | | | 30457 +-------------------------------+-------+------+-----+----------+ 30458 | RCA4_TYPE_MASK_FILE_LAYOUT | 3 | RFC | N | 1 | 30459 | | | 8881 | | | 30460 +-------------------------------+-------+------+-----+----------+ 30461 | RCA4_TYPE_MASK_BLK_LAYOUT | 4 | RFC | L | 1 | 30462 | | | 8881 | | | 30463 +-------------------------------+-------+------+-----+----------+ 30464 | RCA4_TYPE_MASK_OBJ_LAYOUT_MIN | 8 | RFC | L | 1 | 30465 | | | 8881 | | | 30466 +-------------------------------+-------+------+-----+----------+ 30467 | RCA4_TYPE_MASK_OBJ_LAYOUT_MAX | 9 | RFC | L | 1 | 30468 | | | 8881 | | | 30469 +-------------------------------+-------+------+-----+----------+ 30471 Table 26: Initial Recallable Object Type Assignments 30473 22.4.2. Updating Registrations 30475 The update of a registration will require IESG Approval on the advice 30476 of a Designated Expert. 30478 22.5. Layout Types 30480 IANA created a registry called the "pNFS Layout Types Registry". 30482 All assignments to the registry are made on a Standards Action basis, 30483 with Expert Review required. 30485 Layout types are 32-bit numbers. The value zero is Reserved. Values 30486 in the range 0x80000000 to 0xFFFFFFFF inclusive are designated for 30487 Private Use. IANA will assign numbers from the range 0x00000001 to 30488 0x7FFFFFFF inclusive. 30490 The registry is a list of assignments, each containing five fields. 30492 1. The name of the layout type. This name must have the prefix 30493 "LAYOUT4_". The name must be unique. 30495 2. The value of the layout type. IANA will assign this number, and 30496 the request from the registrant will use TBD1 instead of an 30497 actual value. The value assigned must be unique. A Designated 30498 Expert must be used to ensure that when the name of the layout 30499 type and its value are added to the NFSv4.1 layouttype4 30500 enumerated data type in the NFSv4.1 XDR description [10], the 30501 result continues to be a valid XDR description. 30503 3. The Standards Track RFC(s) that describe the notification. If 30504 the RFC(s) have not yet been published, the registrant will use 30505 RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. 30506 Collectively, the RFC(s) must adhere to the guidelines listed in 30507 Section 22.5.3. 30509 4. How the RFC introduces the layout type. This is indicated by a 30510 single US-ASCII value. If the value is N, it means a minor 30511 revision to the NFSv4 protocol. If the value is L, it means a 30512 new pNFS layout type. Other values can be used with IESG 30513 Approval. 30515 5. The minor versions of NFSv4 that are allowed to use the 30516 notification. While these are numeric values, IANA will not 30517 allocate and assign them; the author of the relevant RFCs with 30518 IESG Approval assigns these numbers. 
Each time there is a new 30519 minor version of NFSv4 approved, a Designated Expert should 30520 review the registry to make recommended updates as needed. 30522 22.5.1. Initial Registry 30524 The initial registry is in Table 27. 30526 +=======================+=======+==========+=====+================+ 30527 | Layout Type Name | Value | RFC | How | Minor Versions | 30528 +=======================+=======+==========+=====+================+ 30529 | LAYOUT4_NFSV4_1_FILES | 0x1 | RFC 8881 | N | 1 | 30530 +-----------------------+-------+----------+-----+----------------+ 30531 | LAYOUT4_OSD2_OBJECTS | 0x2 | RFC 5664 | L | 1 | 30532 +-----------------------+-------+----------+-----+----------------+ 30533 | LAYOUT4_BLOCK_VOLUME | 0x3 | RFC 5663 | L | 1 | 30534 +-----------------------+-------+----------+-----+----------------+ 30536 Table 27: Initial Layout Type Assignments 30538 22.5.2. Updating Registrations 30540 The update of a registration will require IESG Approval on the advice 30541 of a Designated Expert. 30543 22.5.3. Guidelines for Writing Layout Type Specifications 30545 The author of a new pNFS layout specification must follow these steps 30546 to obtain acceptance of the layout type as a Standards Track RFC: 30548 1. The author devises the new layout specification. 30550 2. The new layout type specification MUST, at a minimum: 30552 * Define the contents of the layout-type-specific fields of the 30553 following data types: 30555 - the da_addr_body field of the device_addr4 data type; 30557 - the loh_body field of the layouthint4 data type; 30559 - the loc_body field of layout_content4 data type (which in 30560 turn is the lo_content field of the layout4 data type); 30562 - the lou_body field of the layoutupdate4 data type; 30564 * Describe or define the storage access protocol used to access 30565 the storage devices. 30567 * Describe whether revocation of layouts is supported. 30569 * At a minimum, describe the methods of recovery from: 30571 1. Failure and restart for client, server, storage device. 30573 2. Lease expiration from perspective of the active client, 30574 server, storage device. 30576 3. Loss of layout state resulting in fencing of client access 30577 to storage devices (for an example, see Section 12.7.3). 30579 * Include an IANA considerations section, which will in turn 30580 include: 30582 - A request to IANA for a new layout type per Section 22.5. 30584 - A list of requests to IANA for any new recallable object 30585 types for CB_RECALL_ANY; each entry is to be presented in 30586 the form described in Section 22.4. 30588 - A list of requests to IANA for any new notification values 30589 for CB_NOTIFY_DEVICEID; each entry is to be presented in 30590 the form described in Section 22.3. 30592 * Include a security considerations section. This section MUST 30593 explain how the NFSv4.1 authentication, authorization, and 30594 access-control models are preserved. That is, if a metadata 30595 server would restrict a READ or WRITE operation, how would 30596 pNFS via the layout similarly restrict a corresponding input 30597 or output operation? 30599 3. The author documents the new layout specification as an Internet- 30600 Draft. 30602 4. The author submits the Internet-Draft for review through the IETF 30603 standards process as defined in "The Internet Standards Process-- 30604 Revision 3" (BCP 9 [35]). The new layout specification will be 30605 submitted for eventual publication as a Standards Track RFC. 30607 5. 
The layout specification progresses through the IETF standards
process.

22.6. Path Variable Definitions

This section deals with the IANA considerations associated with the
variable substitution feature for location names as described in
Section 11.17.3.  As described there, variables subject to
substitution consist of a domain name and a specific name within
that domain, with the two separated by a colon.  There are two sets
of IANA considerations here:

1.  The list of variable names.

2.  For each variable name, the list of possible values.

Thus, there will be one registry for the list of variable names, and
possibly one registry for listing the values of each variable name.

22.6.1. Path Variables Registry

IANA created a registry called the "NFSv4 Path Variables Registry".

22.6.1.1. Path Variable Values

Variable names are of the form "${", followed by a domain name,
followed by a colon (":"), followed by a domain-specific portion of
the variable name, followed by "}".  When the domain name is
"ietf.org", all variable names must be registered with IANA on a
Standards Action basis, with Expert Review required.  Path variables
with registered domain names neither part of nor equal to ietf.org
are assigned on a Hierarchical Allocation basis (delegating to the
domain owner) and are thus of no concern to IANA, unless the domain
owner chooses to register a variable name from its domain.  If the
domain owner chooses to do so, IANA will register it on a First Come
First Served basis.  To accommodate registrants who do not have
their own domain, IANA will accept requests to register variables
with the prefix "${FCFS.ietf.org:" on a First Come First Served
basis.  Assignments on a First Come First Served basis do not
require Expert Review, unless the registrant also wants IANA to
establish a registry for the values of the registered variable.

The registry is a list of assignments, each containing three fields.

1.  The name of the variable.  The name of this variable must start
    with a "${" followed by a registered domain name, followed by
    ":", or it must start with "${FCFS.ietf.org".  The name must be
    no more than 64 UTF-8 characters long.  The name must be unique.

2.  For assignments made on a Standards Action basis, the Standards
    Track RFC(s) that describe the variable.  If the RFC(s) have not
    yet been published, the registrant will use RFCTBD1, RFCTBD2,
    etc. instead of an actual RFC number.  Note that the RFCs do not
    have to be a part of an NFS minor version.  For assignments made
    on a First Come First Served basis, an explanation (consuming no
    more than 1024 bytes, or more if IANA permits) of the purpose of
    the variable.  A reference to the explanation can be
    substituted.

3.  The point of contact, including an email address.  The point of
    contact can consume up to 256 bytes (or more if IANA permits).
    For assignments made on a Standards Action basis, the point of
    contact is always IESG.

22.6.1.1.1. Initial Registry

The initial registry is in Table 28.
   +========================+==========+==================+
   | Variable Name          | RFC      | Point of Contact |
   +========================+==========+==================+
   | ${ietf.org:CPU_ARCH}   | RFC 8881 | IESG             |
   +------------------------+----------+------------------+
   | ${ietf.org:OS_TYPE}    | RFC 8881 | IESG             |
   +------------------------+----------+------------------+
   | ${ietf.org:OS_VERSION} | RFC 8881 | IESG             |
   +------------------------+----------+------------------+

          Table 28: Initial List of Path Variables

IANA has created registries for the values of the variable names
${ietf.org:CPU_ARCH} and ${ietf.org:OS_TYPE}.  See Sections 22.6.2
and 22.6.3.

For the values of the variable ${ietf.org:OS_VERSION}, no registry
is needed as the specifics of the values of the variable will vary
with the value of ${ietf.org:OS_TYPE}.  Thus, values for
${ietf.org:OS_VERSION} are on a Hierarchical Allocation basis and
are of no concern to IANA.

22.6.1.1.2. Updating Registrations

The update of an assignment made on a Standards Action basis will
require IESG Approval on the advice of a Designated Expert.

The registrant can always update the point of contact of an
assignment made on a First Come First Served basis.  Any other
update will require Expert Review.

22.6.2. Values for the ${ietf.org:CPU_ARCH} Variable

IANA created a registry called the "NFSv4 ${ietf.org:CPU_ARCH} Value
Registry".

Assignments to the registry are made on a First Come First Served
basis.  The zero-length value of ${ietf.org:CPU_ARCH} is Reserved.
Values with a prefix of "PRIV" are designated for Private Use.

The registry is a list of assignments, each containing three fields.

1.  A value of the ${ietf.org:CPU_ARCH} variable.  The value must be
    1 to 32 UTF-8 characters long.  The value must be unique.

2.  An explanation (consuming no more than 1024 bytes, or more if
    IANA permits) of what CPU architecture the value denotes.  A
    reference to the explanation can be substituted.

3.  The point of contact, including an email address.  The point of
    contact can consume up to 256 bytes (or more if IANA permits).

22.6.2.1. Initial Registry

There is no initial registry.

22.6.2.2. Updating Registrations

The registrant is free to update the assignment, i.e., change the
explanation and/or point-of-contact fields.

22.6.3. Values for the ${ietf.org:OS_TYPE} Variable

IANA created a registry called the "NFSv4 ${ietf.org:OS_TYPE} Value
Registry".

Assignments to the registry are made on a First Come First Served
basis.  The zero-length value of ${ietf.org:OS_TYPE} is Reserved.
Values with a prefix of "PRIV" are designated for Private Use.

The registry is a list of assignments, each containing three fields.

1.  A value of the ${ietf.org:OS_TYPE} variable.  The value must be
    1 to 32 UTF-8 characters long.  The value must be unique.

2.  An explanation (consuming no more than 1024 bytes, or more if
    IANA permits) of what operating system the value denotes.  A
    reference to the explanation can be substituted.

3.  The point of contact, including an email address.  The point of
    contact can consume up to 256 bytes (or more if IANA permits).

22.6.3.1. Initial Registry

There is no initial registry.

22.6.3.2.
Updating Registrations 30762 The registrant is free to update the assignment, i.e., change the 30763 explanation and/or point of contact fields. 30765 23. References 30767 23.1. Normative References 30769 [1] Bradner, S., "Key words for use in RFCs to Indicate 30770 Requirement Levels", BCP 14, RFC 2119, 30771 DOI 10.17487/RFC2119, March 1997, 30772 . 30774 [2] Eisler, M., Ed., "XDR: External Data Representation 30775 Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 30776 2006, . 30778 [3] Thurlow, R., "RPC: Remote Procedure Call Protocol 30779 Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, 30780 May 2009, . 30782 [4] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 30783 Specification", RFC 2203, DOI 10.17487/RFC2203, September 30784 1997, . 30786 [5] Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos 30787 Version 5 Generic Security Service Application Program 30788 Interface (GSS-API) Mechanism: Version 2", RFC 4121, 30789 DOI 10.17487/RFC4121, July 2005, 30790 . 30792 [6] The Open Group, "Section 3.191 of Chapter 3 of Base 30793 Definitions of The Open Group Base Specifications Issue 6 30794 IEEE Std 1003.1, 2004 Edition, HTML Version", 30795 ISBN 1931624232, 2004, . 30797 [7] Linn, J., "Generic Security Service Application Program 30798 Interface Version 2, Update 1", RFC 2743, 30799 DOI 10.17487/RFC2743, January 2000, 30800 . 30802 [8] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. 30803 Garcia, "A Remote Direct Memory Access Protocol 30804 Specification", RFC 5040, DOI 10.17487/RFC5040, October 30805 2007, . 30807 [9] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, 30808 DOI 10.17487/RFC5403, February 2009, 30809 . 30811 [10] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 30812 "Network File System (NFS) Version 4 Minor Version 1 30813 External Data Representation Standard (XDR) Description", 30814 RFC 5662, DOI 10.17487/RFC5662, January 2010, 30815 . 30817 [11] The Open Group, "Section 3.372 of Chapter 3 of Base 30818 Definitions of The Open Group Base Specifications Issue 6 30819 IEEE Std 1003.1, 2004 Edition, HTML Version", 30820 ISBN 1931624232, 2004, . 30822 [12] Eisler, M., "IANA Considerations for Remote Procedure Call 30823 (RPC) Network Identifiers and Universal Address Formats", 30824 RFC 5665, DOI 10.17487/RFC5665, January 2010, 30825 . 30827 [13] The Open Group, "Section 'read()' of System Interfaces of 30828 The Open Group Base Specifications Issue 6 IEEE Std 30829 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, 30830 2004, . 30832 [14] The Open Group, "Section 'readdir()' of System Interfaces 30833 of The Open Group Base Specifications Issue 6 IEEE Std 30834 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, 30835 2004, . 30837 [15] The Open Group, "Section 'write()' of System Interfaces of 30838 The Open Group Base Specifications Issue 6 IEEE Std 30839 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, 30840 2004, . 30842 [16] Hoffman, P. and M. Blanchet, "Preparation of 30843 Internationalized Strings ("stringprep")", RFC 3454, 30844 DOI 10.17487/RFC3454, December 2002, 30845 . 30847 [17] The Open Group, "Section 'chmod()' of System Interfaces of 30848 The Open Group Base Specifications Issue 6 IEEE Std 30849 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, 30850 2004, . 30852 [18] International Organization for Standardization, 30853 "Information Technology - Universal Multiple-octet coded 30854 Character Set (UCS) - Part 1: Architecture and Basic 30855 Multilingual Plane", ISO Standard 10646-1, May 1993. 
30857 [19] Alvestrand, H., "IETF Policy on Character Sets and 30858 Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277, 30859 January 1998, . 30861 [20] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 30862 Profile for Internationalized Domain Names (IDN)", 30863 RFC 3491, DOI 10.17487/RFC3491, March 2003, 30864 . 30866 [21] The Open Group, "Section 'fcntl()' of System Interfaces of 30867 The Open Group Base Specifications Issue 6 IEEE Std 30868 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, 30869 2004, . 30871 [22] The Open Group, "Section 'fsync()' of System Interfaces of 30872 The Open Group Base Specifications Issue 6 IEEE Std 30873 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, 30874 2004, . 30876 [23] The Open Group, "Section 'getpwnam()' of System Interfaces 30877 of The Open Group Base Specifications Issue 6 IEEE Std 30878 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, 30879 2004, . 30881 [24] The Open Group, "Section 'unlink()' of System Interfaces 30882 of The Open Group Base Specifications Issue 6 IEEE Std 30883 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, 30884 2004, . 30886 [25] Schaad, J., Kaliski, B., and R. Housley, "Additional 30887 Algorithms and Identifiers for RSA Cryptography for use in 30888 the Internet X.509 Public Key Infrastructure Certificate 30889 and Certificate Revocation List (CRL) Profile", RFC 4055, 30890 DOI 10.17487/RFC4055, June 2005, 30891 . 30893 [26] National Institute of Standards and Technology, "Computer 30894 Security Objects Register", May 2016, 30895 . 30898 [27] Adamson, A. and N. Williams, "Remote Procedure Call (RPC) 30899 Security Version 3", RFC 7861, DOI 10.17487/RFC7861, 30900 November 2016, . 30902 [28] Neuman, C., Yu, T., Hartman, S., and K. Raeburn, "The 30903 Kerberos Network Authentication Service (V5)", RFC 4120, 30904 DOI 10.17487/RFC4120, July 2005, 30905 . 30907 [29] Arends, R., Austein, R., Larson, M., Massey, D., and S. 30908 Rose, "DNS Security Introduction and Requirements", 30909 RFC 4033, DOI 10.17487/RFC4033, March 2005, 30910 . 30912 [30] Hu, Z., Zhu, L., Heidemann, J., Mankin, A., Wessels, D., 30913 and P. Hoffman, "Specification for DNS over Transport 30914 Layer Security (TLS)", RFC 7858, DOI 10.17487/RFC7858, May 30915 2016, . 30917 [31] Adamson, A. and N. Williams, "Requirements for NFSv4 30918 Multi-Domain Namespace Deployment", RFC 8000, 30919 DOI 10.17487/RFC8000, November 2016, 30920 . 30922 [32] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct 30923 Memory Access Transport for Remote Procedure Call Version 30924 1", RFC 8166, DOI 10.17487/RFC8166, June 2017, 30925 . 30927 [33] Lever, C., "Network File System (NFS) Upper-Layer Binding 30928 to RPC-over-RDMA Version 1", RFC 8267, 30929 DOI 10.17487/RFC8267, October 2017, 30930 . 30932 [34] Hoffman, P. and P. McManus, "DNS Queries over HTTPS 30933 (DoH)", RFC 8484, DOI 10.17487/RFC8484, October 2018, 30934 . 30936 [35] Bradner, S., "The Internet Standards Process -- Revision 30937 3", BCP 9, RFC 2026, October 1996. 30939 Kolkman, O., Bradner, S., and S. Turner, "Characterization 30940 of Proposed Standards", BCP 9, RFC 7127, January 2014. 30942 Dusseault, L. and R. Sparks, "Guidance on Interoperation 30943 and Implementation Reports for Advancement to Draft 30944 Standard", BCP 9, RFC 5657, September 2009. 30946 Housley, R., Crocker, D., and E. Burger, "Reducing the 30947 Standards Track to Two Maturity Levels", BCP 9, RFC 6410, 30948 October 2011. 
30950 Resnick, P., "Retirement of the "Internet Official 30951 Protocol Standards" Summary Document", BCP 9, RFC 7100, 30952 December 2013. 30954 Dawkins, S., "Increasing the Number of Area Directors in 30955 an IETF Area", BCP 9, RFC 7475, March 2015. 30957 30959 23.2. Informative References 30961 [36] Roach, A., "Process for Handling Non-Major Revisions to 30962 Existing RFCs", Work in Progress, Internet-Draft, draft- 30963 roach-bis-documents-00, 7 May 2019, 30964 . 30967 [37] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., 30968 Beame, C., Eisler, M., and D. Noveck, "Network File System 30969 (NFS) version 4 Protocol", RFC 3530, DOI 10.17487/RFC3530, 30970 April 2003, . 30972 [38] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS 30973 Version 3 Protocol Specification", RFC 1813, 30974 DOI 10.17487/RFC1813, June 1995, 30975 . 30977 [39] Eisler, M., "LIPKEY - A Low Infrastructure Public Key 30978 Mechanism Using SPKM", RFC 2847, DOI 10.17487/RFC2847, 30979 June 2000, . 30981 [40] Eisler, M., "NFS Version 2 and Version 3 Security Issues 30982 and the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", 30983 RFC 2623, DOI 10.17487/RFC2623, June 1999, 30984 . 30986 [41] Juszczak, C., "Improving the Performance and Correctness 30987 of an NFS Server", USENIX Conference Proceedings, June 30988 1990. 30990 [42] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced 30991 by an On-line Database", RFC 3232, DOI 10.17487/RFC3232, 30992 January 2002, . 30994 [43] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 30995 RFC 1833, DOI 10.17487/RFC1833, August 1995, 30996 . 30998 [44] Werme, R., "RPC XID Issues", USENIX Conference 30999 Proceedings, February 1996. 31001 [45] Nowicki, B., "NFS: Network File System Protocol 31002 specification", RFC 1094, DOI 10.17487/RFC1094, March 31003 1989, . 31005 [46] Bhide, A., Elnozahy, E. N., and S. P. Morgan, "A Highly 31006 Available Network Server", USENIX Conference Proceedings, 31007 January 1991. 31009 [47] Halevy, B., Welch, B., and J. Zelenka, "Object-Based 31010 Parallel NFS (pNFS) Operations", RFC 5664, 31011 DOI 10.17487/RFC5664, January 2010, 31012 . 31014 [48] Black, D., Fridella, S., and J. Glasgow, "Parallel NFS 31015 (pNFS) Block/Volume Layout", RFC 5663, 31016 DOI 10.17487/RFC5663, January 2010, 31017 . 31019 [49] Callaghan, B., "WebNFS Client Specification", RFC 2054, 31020 DOI 10.17487/RFC2054, October 1996, 31021 . 31023 [50] Callaghan, B., "WebNFS Server Specification", RFC 2055, 31024 DOI 10.17487/RFC2055, October 1996, 31025 . 31027 [51] IESG, "IESG Processing of RFC Errata for the IETF Stream", 31028 July 2008, 31029 . 31032 [52] Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed- 31033 Hashing for Message Authentication", RFC 2104, 31034 DOI 10.17487/RFC2104, February 1997, 31035 . 31037 [53] Shepler, S., "NFS Version 4 Design Considerations", 31038 RFC 2624, DOI 10.17487/RFC2624, June 1999, 31039 . 31041 [54] The Open Group, "Protocols for Interworking: XNFS, Version 31042 3W", ISBN 1-85912-184-5, February 1998. 31044 [55] Floyd, S. and V. Jacobson, "The Synchronization of 31045 Periodic Routing Messages", IEEE/ACM Transactions on 31046 Networking, 2(2), pp. 122-136, April 1994. 31048 [56] Chadalapaka, M., Satran, J., Meth, K., and D. Black, 31049 "Internet Small Computer System Interface (iSCSI) Protocol 31050 (Consolidated)", RFC 7143, DOI 10.17487/RFC7143, April 31051 2014, . 31053 [57] Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version 31054 (FCP-2)", ANSI/INCITS, 350-2003, October 2003. 
31056 [58] Weber, R.O., "Object-Based Storage Device Commands (OSD)", 31057 ANSI/INCITS, 400-2004, July 2004, 31058 . 31060 [59] Carns, P. H., Ligon III, W. B., Ross, R. B., and R. 31061 Thakur, "PVFS: A Parallel File System for Linux 31062 Clusters.", Proceedings of the 4th Annual Linux Showcase 31063 and Conference, 2000. 31065 [60] The Open Group, "The Open Group Base Specifications Issue 31066 6, IEEE Std 1003.1, 2004 Edition", 2004, 31067 . 31069 [61] Callaghan, B., "NFS URL Scheme", RFC 2224, 31070 DOI 10.17487/RFC2224, October 1997, 31071 . 31073 [62] Chiu, A., Eisler, M., and B. Callaghan, "Security 31074 Negotiation for WebNFS", RFC 2755, DOI 10.17487/RFC2755, 31075 January 2000, . 31077 [63] Cotton, M., Leiba, B., and T. Narten, "Guidelines for 31078 Writing an IANA Considerations Section in RFCs", BCP 26, 31079 RFC 8126, DOI 10.17487/RFC8126, June 2017, 31080 . 31082 [64] RFC Errata, Erratum ID 2006, RFC 5661, 31083 . 31085 [65] Spasojevic, M. and M. Satayanarayanan, "An Empirical Study 31086 of a Wide-Area Distributed File System", ACM Transactions 31087 on Computer Systems, Vol. 14, No. 2, pp. 200-222, 31088 DOI 10.1145/227695.227698, May 1996, 31089 . 31091 [66] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 31092 "Network File System (NFS) Version 4 Minor Version 1 31093 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, 31094 . 31096 [67] Noveck, D., "Rules for NFSv4 Extensions and Minor 31097 Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017, 31098 . 31100 [68] Haynes, T., Ed. and D. Noveck, Ed., "Network File System 31101 (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, 31102 March 2015, . 31104 [69] Noveck, D., Ed., Shivam, P., Lever, C., and B. Baker, 31105 "NFSv4.0 Migration: Specification Update", RFC 7931, 31106 DOI 10.17487/RFC7931, July 2016, 31107 . 31109 [70] Haynes, T., "Requirements for Parallel NFS (pNFS) Layout 31110 Types", RFC 8434, DOI 10.17487/RFC8434, August 2018, 31111 . 31113 [71] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an 31114 Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May 31115 2014, . 31117 [72] Rescorla, E. and B. Korver, "Guidelines for Writing RFC 31118 Text on Security Considerations", BCP 72, RFC 3552, 31119 DOI 10.17487/RFC3552, July 2003, 31120 . 31122 Appendix A. The Need for This Update 31124 This document includes an explanation of how clients and servers are 31125 to determine the particular network access paths to be used to access 31126 a file system. This includes descriptions of how to handle changes 31127 to the specific replica to be used or to the set of addresses to be 31128 used to access it, and how to deal transparently with transfers of 31129 responsibility that need to be made. This includes cases in which 31130 there is a shift between one replica and another and those in which 31131 different network access paths are used to access the same replica. 31133 As a result of the following problems in RFC 5661 [66], it was 31134 necessary to provide the specific updates that are made by this 31135 document. These updates are described in Appendix B. 31137 * RFC 5661 [66], while it dealt with situations in which various 31138 forms of clustering allowed coordination of the state assigned by 31139 cooperating servers to be used, made no provisions for Transparent 31140 State Migration. Within NFSv4.0, Transparent State Migration was 31141 first explained clearly in RFC 7530 [68] and corrected and 31142 clarified by RFC 7931 [69]. 
No corresponding explanation for 31143 NFSv4.1 had been provided. 31145 * Although NFSv4.1 provided a clear definition of how trunking 31146 detection was to be done, there was no clear specification of how 31147 trunking discovery was to be done, despite the fact that the 31148 specification clearly indicated that this information could be 31149 made available via the file system location attributes. 31151 * Because the existence of multiple network access paths to the same 31152 file system was dealt with as if there were multiple replicas, 31153 issues relating to transitions between replicas could never be 31154 clearly distinguished from trunking-related transitions between 31155 the addresses used to access a particular file system instance. 31156 As a result, in situations in which both migration and trunking 31157 configuration changes were involved, neither of these could be 31158 clearly dealt with, and the relationship between these two 31159 features was not seriously addressed. 31161 * Because use of two network access paths to the same file system 31162 instance (i.e., trunking) was often treated as if two replicas 31163 were involved, it was considered that two replicas were being used 31164 simultaneously. As a result, the treatment of replicas being used 31165 simultaneously in RFC 5661 [66] was not clear, as it covered the 31166 two distinct cases of a single file system instance being accessed 31167 by two different network access paths and two replicas being 31168 accessed simultaneously, with the limitations of the latter case 31169 not being clearly laid out. 31171 The majority of the consequences of these issues are dealt with by 31172 presenting in Section 11 a replacement for Section 11 of RFC 5661 31173 [66]. This replacement modifies existing subsections within that 31174 section and adds new ones as described in Appendix B.1. Also, some 31175 existing sections were deleted. These changes were made in order to 31176 do the following: 31178 * Reorganize the description so that the case of two network access 31179 paths to the same file system instance is distinguished clearly 31180 from the case of two different replicas since, in the former case, 31181 locking state is shared and there also can be sharing of session 31182 state. 31184 * Provide a clear statement regarding the desirability of 31185 transparent transfer of state between replicas together with a 31186 recommendation that either transparent transfer or a single-fs 31187 grace period be provided. 31189 * Specifically delineate how a client is to handle such transfers, 31190 taking into account the differences from the treatment in [69] 31191 made necessary by the major protocol changes to NFSv4.1. 31193 * Discuss the relationship between transparent state transfer and 31194 Parallel NFS (pNFS). 31196 * Clarify the fs_locations_info attribute in order to specify which 31197 portions of the provided information apply to a specific network 31198 access path and which apply to the replica that the path is used 31199 to access. 31201 In addition, other sections of RFC 5661 [66] were updated to correct 31202 the consequences of the incorrect assumptions underlying the 31203 treatment of multi-server namespace issues. These are described in 31204 Appendices B.2 through B.4. 31206 * A revised introductory section regarding multi-server namespace 31207 facilities is provided. 31209 * A more realistic treatment of server scope is provided. 
This 31210 treatment reflects the more limited coordination of locking state 31211 adopted by servers actually sharing a common server scope. 31213 * Some confusing text regarding changes in server_owner has been 31214 clarified. 31216 * The description of some existing errors has been modified to more 31217 clearly explain certain error situations to reflect the existence 31218 of trunking and the possible use of fs-specific grace periods. 31219 For details, see Appendix B.3. 31221 * New descriptions of certain existing operations are provided, 31222 either because the existing treatment did not account for 31223 situations that would arise in dealing with Transparent State 31224 Migration, or because some types of reclaim issues were not 31225 adequately dealt with in the context of fs-specific grace periods. 31226 For details, see Appendix B.2. 31228 Appendix B. Changes in This Update 31230 B.1. Revisions Made to Section 11 of RFC 5661 31232 A number of areas have been revised or extended, in many cases 31233 replacing subsections within Section 11 of RFC 5661 [66]: 31235 * New introductory material, including a terminology section, 31236 replaces the material in RFC 5661 [66], ranging from the start of 31237 the original Section 11 up to and including Section 11.1. The new 31238 material starts at the beginning of Section 11 and continues 31239 through 11.2. 31241 * A significant reorganization of the material in Sections 11.4 and 31242 11.5 of RFC 5661 [66] was necessary. The reasons for the 31243 reorganization of these sections into a single section with 31244 multiple subsections are discussed in Appendix B.1.1 below. This 31245 replacement appears as Section 11.5. 31247 New material relating to the handling of the file system location 31248 attributes is contained in Sections 11.5.1 and 11.5.7. 31250 * A new section describing requirements for user and group handling 31251 within a multi-server namespace has been added as Section 11.7. 31253 * A major replacement for Section 11.7 of RFC 5661 [66], entitled 31254 "Effecting File System Transitions", appears as Sections 11.9 31255 through 11.14. The reasons for the reorganization of this section 31256 into multiple sections are discussed in Appendix B.1.2. 31258 * A replacement for Section 11.10 of RFC 5661 [66], entitled "The 31259 Attribute fs_locations_info", appears as Section 11.17, with 31260 Appendix B.1.3 describing the differences between the new section 31261 and the treatment within [66]. A revised treatment was necessary 31262 because the original treatment did not make clear how the added 31263 attribute information relates to the case of trunked paths to the 31264 same replica. These issues were not addressed in RFC 5661 [66] 31265 where the concepts of a replica and a network path used to access 31266 a replica were not clearly distinguished. 31268 B.1.1. Reorganization of Sections 11.4 and 11.5 of RFC 5661 31270 Previously, issues related to the fact that multiple location entries 31271 directed the client to the same file system instance were dealt with 31272 in Section 11.5 of RFC 5661 [66]. Because of the new treatment of 31273 trunking, these issues now belong within Section 11.5. 31275 In this new section, trunking is covered in Section 11.5.2 together 31276 with the other uses of file system location information described in 31277 Sections 11.5.3 through 11.5.6. 
31279 As a result, Section 11.5, which replaces Section 11.4 of RFC 5661 31280 [66], is substantially different from the section it replaces in that 31281 some original sections have been replaced by corresponding sections 31282 as described below, while new sections have been added:
31284 * The material in Section 11.5, exclusive of subsections, replaces 31285 the material in Section 11.4 of RFC 5661 [66] exclusive of 31286 subsections.
31288 * Section 11.5.1 is the new first subsection of the overall section.
31290 * Section 11.5.2 is the new second subsection of the overall 31291 section.
31293 * Each of Sections 11.5.4, 11.5.5, and 11.5.6 replaces (in 31294 order) one of the corresponding Sections 11.4.1, 11.4.2, and 31295 11.4.3 of RFC 5661 [66].
31297 * Section 11.5.7 is the new final subsection of the overall section.
31299 B.1.2.  Reorganization of Material Dealing with File System Transitions
31301 The material relating to file system transitions, previously contained 31302 in Section 11.7 of RFC 5661 [66], has been reorganized and augmented 31303 as described below:
31305 * Because there can be a shift of the network access paths used to 31306 access a file system instance without any shift between replicas, 31307 a new Section 11.9 distinguishes between those cases in which 31308 there is a shift between distinct replicas and those involving a 31309 shift in network access paths with no shift between replicas.
31311 As a result, the new Section 11.10 deals with network address 31312 transitions, while the bulk of the original Section 11.7 of RFC 31313 5661 [66] has been extensively modified as reflected in 31314 Section 11.11, which is now limited to cases in which there is a 31315 shift between two different sets of replicas.
31317 * The additional Section 11.12 discusses the case in which a shift 31318 to a different replica is made and state is transferred to allow 31319 the client continued access to its accumulated 31320 locking state on the new server.
31322 * The additional Section 11.13 discusses the client's response to 31323 access transitions, how it determines whether migration has 31324 occurred, and how it gets access to any transferred locking and 31325 session state.
31327 * The additional Section 11.14 discusses the responsibilities of the 31328 source and destination servers when transferring locking and 31329 session state.
31331 This reorganization has caused a renumbering of the sections within 31332 Section 11 of [66] as described below:
31334 * The new Sections 11.9 and 11.10 have resulted in the renumbering 31335 of the existing sections that previously bore these numbers.
31337 * Section 11.7 of [66] has been substantially modified and appears 31338 as Section 11.11.  The necessary modifications reflect the fact 31339 that this section only deals with transitions between replicas, 31340 while transitions between network addresses are dealt with in 31341 other sections.  Details of the reorganization are described later 31342 in this section.
31344 * Sections 11.12, 11.13, and 11.14 have been added.
31346 * Consequently, Sections 11.8, 11.9, 11.10, and 11.11 in [66] now 31347 appear as Sections 11.15, 11.16, 11.17, and 11.18, respectively.
31349 As part of this general reorganization, Section 11.7 of RFC 5661 [66] 31350 has been modified as described below:
31352 * Sections 11.7 and 11.7.1 of RFC 5661 [66] have been replaced by 31353 Sections 11.11 and 11.11.1, respectively.
31355 * Section 11.7.2 of RFC 5661 (and included subsections) has been 31356 deleted.
31358 * Sections 11.7.3, 11.7.4, 11.7.5, 11.7.5.1, and 11.7.6 of RFC 5661 31359 [66] have been replaced by Sections 11.11.2, 11.11.3, 11.11.4, 31360 11.11.4.1, and 11.11.5, respectively, in this document.
31362 * Section 11.7.7 of RFC 5661 [66] has been replaced by 31363 Section 11.11.9.  This subsection has been moved to the end of the 31364 section dealing with file system transitions.
31366 * Sections 11.7.8, 11.7.9, and 11.7.10 of RFC 5661 [66] have been 31367 replaced by Sections 11.11.6, 11.11.7, and 11.11.8, respectively, in 31368 this document.
31370 B.1.3.  Updates to the Treatment of fs_locations_info
31372 Various elements of the fs_locations_info attribute contain 31373 information that applies to either a specific file system replica or 31374 to a network path or set of network paths used to access such a 31375 replica.  The original treatment of fs_locations_info (Section 11.10 31376 of RFC 5661 [66]) did not clearly distinguish these cases, in part 31377 because the document did not clearly distinguish replicas from the 31378 paths used to access them.
31380 In addition, special clarification has been provided with regard to 31381 the following fields:
31383 * With regard to the handling of FSLI4GF_GOING, it was clarified 31384 that this only applies to the unavailability of a replica rather 31385 than to a path to access a replica.
31387 * In describing the appropriate value for a server to use for 31388 fli_valid_for, it was clarified that there is no need for the 31389 client to frequently fetch the fs_locations_info value to be 31390 prepared for shifts in trunking patterns.
31392 * Clarification of the rules for extensions to the fls_info has been 31393 provided.  The original treatment reflected the extension model 31394 that was in effect at the time RFC 5661 [66] was written, but has 31395 been updated in accordance with the extension model described in 31396 RFC 8178 [67].
31398 B.2.  Revisions Made to Operations in RFC 5661
31400 Descriptions have been revised to address issues that arose in 31401 effecting necessary changes to multi-server namespace features.
31403 * The treatment of EXCHANGE_ID (Section 18.35 of RFC 5661 [66]) 31404 assumed that client IDs cannot be created/confirmed other than by 31405 the EXCHANGE_ID and CREATE_SESSION operations.  Also, the 31406 necessary use of EXCHANGE_ID in recovery from migration and 31407 related situations was not clearly addressed.  A revised treatment 31408 of EXCHANGE_ID was necessary, and it appears in Section 18.35, 31409 while the specific differences between it and the treatment within 31410 [66] are explained in Appendix B.2.1 below.
31412 * The treatment of RECLAIM_COMPLETE in Section 18.51 of RFC 5661 31413 [66] was not sufficiently clear about the purpose and use of 31414 rca_one_fs and how the server was to deal with inappropriate 31415 values of this argument.  Because the resulting confusion raised 31416 interoperability issues, a new treatment of RECLAIM_COMPLETE was 31417 necessary, and it appears in Section 18.51, while the specific 31418 differences between it and the treatment within RFC 5661 [66] are 31419 discussed in Appendix B.2.2 below.  In addition, the definitions 31420 of the reclaim-related errors have received an updated treatment 31421 in Section 15.1.9 to reflect the fact that there are multiple 31422 contexts for lock reclaim operations.
31424 B.2.1.  Revision of Treatment of EXCHANGE_ID
31426 There were a number of issues in the original treatment of EXCHANGE_ID 31427 in RFC 5661 [66] that caused problems for Transparent State Migration 31428 and for the transfer of access between different network access paths 31429 to the same file system instance.
31431 These issues arose from the fact that this treatment was written:
31433 * Assuming that a client ID can only become known to a server by 31434 having been created by executing an EXCHANGE_ID, with confirmation 31435 of the ID only possible by execution of a CREATE_SESSION.
31437 * Considering the interactions between a client and a server as 31438 occurring only on a single network address.
31440 As these assumptions have become invalid in the context of 31441 Transparent State Migration and active use of trunking, the treatment 31442 has been modified in several respects:
31444 * It had been assumed that an EXCHANGE_ID executed when the server 31445 was already aware of a given client instance must be either updating 31446 associated parameters (e.g., with respect to callbacks) or dealing 31447 with a previously lost reply by retransmitting.  As a result, any 31448 slot sequence returned by that operation would be of no use.  The 31449 original treatment went so far as to say that it "MUST NOT" be 31450 used, although this usage was not in accord with [1].  This 31451 created a difficulty when an EXCHANGE_ID is done after Transparent 31452 State Migration since that slot sequence would need to be used in 31453 a subsequent CREATE_SESSION.
31455 In the updated treatment, CREATE_SESSION is one way in which client IDs 31456 are confirmed, but it is understood that other ways are possible. 31457 The slot sequence can be used as needed, and cases in which it 31458 would be of no use are appropriately noted.
31460 * It had been assumed that the only functions of EXCHANGE_ID were to 31461 inform the server of the client, to create the client ID, and to 31462 communicate it to the client.  When multiple simultaneous 31463 connections are involved, as often happens when trunking is used, that 31464 treatment was inadequate in that it ignored the role of 31465 EXCHANGE_ID in associating the client ID with the connection on 31466 which it was done, so that it could be used by a subsequent 31467 CREATE_SESSION whose parameters do not include an explicit client 31468 ID.
31470 The new treatment explicitly discusses the role of EXCHANGE_ID in 31471 associating the client ID with the connection so it can be used by 31472 CREATE_SESSION and in associating a connection with an existing 31473 session.
31475 The new treatment can be found in Section 18.35 above.  It supersedes 31476 the treatment in Section 18.35 of RFC 5661 [66].
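To illustrate the practical consequence of this change, the following sketch is illustrative only: the classes are simplified stand-ins for subsets of the EXCHANGE_ID4resok and CREATE_SESSION4args structures, and no RPC or XDR machinery is shown.  It shows a client using the eir_clientid and eir_sequenceid returned by an EXCHANGE_ID, including one done after Transparent State Migration, to build the subsequent confirming CREATE_SESSION.

   # Illustrative sketch only.  These dataclasses are simplified
   # stand-ins for subsets of EXCHANGE_ID4resok and
   # CREATE_SESSION4args; a real implementation would use its own
   # XDR encoding and transport machinery.
   from dataclasses import dataclass

   @dataclass
   class ExchangeIdResult:          # subset of EXCHANGE_ID4resok
       eir_clientid: int
       eir_sequenceid: int

   @dataclass
   class CreateSessionArgs:         # subset of CREATE_SESSION4args
       csa_clientid: int
       csa_sequence: int

   def confirming_create_session(eir: ExchangeIdResult) -> CreateSessionArgs:
       # Under the updated treatment, the sequence id returned by
       # EXCHANGE_ID remains usable even when the server already knew
       # the client instance (e.g., after Transparent State
       # Migration); it seeds the csa_sequence of the CREATE_SESSION
       # that confirms the client ID on this connection.
       return CreateSessionArgs(
           csa_clientid=eir.eir_clientid,
           csa_sequence=eir.eir_sequenceid,
       )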
31478 B.2.2.  Revision of Treatment of RECLAIM_COMPLETE
31480 The following changes were made to the treatment of RECLAIM_COMPLETE 31481 in RFC 5661 [66] to arrive at the treatment in Section 18.51:
31483 * In a number of places, the text was made more explicit about the 31484 purpose of rca_one_fs and its connection to file system migration (see the sketch following this list).
31486 * There is a discussion of situations in which particular forms of 31487 RECLAIM_COMPLETE would need to be used.
31489 * There is a discussion of interoperability issues that may have 31490 arisen between implementations due to the lack of clarity of 31491 the previous treatment of RECLAIM_COMPLETE.
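As an informal illustration of the distinction involved, the sketch below is illustrative only; the dataclass is a simplified stand-in for the RECLAIM_COMPLETE4args structure and no RPC machinery is shown.  The value of rca_one_fs depends on whether reclaim has been completed for the client ID as a whole or only for a single file system, such as one whose state has just been migrated under an fs-specific grace period.

   # Illustrative sketch only; ReclaimCompleteArgs is a simplified
   # stand-in for RECLAIM_COMPLETE4args.
   from dataclasses import dataclass

   @dataclass
   class ReclaimCompleteArgs:
       rca_one_fs: bool  # True: applies only to the file system
                         # designated by the current filehandle;
                         # False: applies to the client ID as a whole

   def reclaim_complete_args(single_fs: bool) -> ReclaimCompleteArgs:
       # After a server restart, the client uses the global form
       # (rca_one_fs=False) once all of its reclaims are done.  After
       # migration of one file system handled with an fs-specific
       # grace period, it uses the fs-specific form (rca_one_fs=True),
       # with the COMPOUND's current filehandle identifying that
       # file system.
       return ReclaimCompleteArgs(rca_one_fs=single_fs)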
31493 B.3.  Revisions Made to Error Definitions in RFC 5661
31495 The new handling of various situations required revisions to some 31496 existing error definitions:
31498 * Because of the need to appropriately address trunking-related 31499 issues, some uses of the term "replica" in RFC 5661 [66] became 31500 problematic because a shift in network access paths was considered 31501 to be a shift to a different replica.  As a result, the original 31502 definition of NFS4ERR_MOVED (in Section 15.1.2.4 of RFC 5661 [66]) 31503 was updated to reflect the different handling of unavailability of 31504 a particular fs via a specific network address.
31506 Since such a situation is no longer considered to constitute 31507 unavailability of a file system instance, the description has been 31508 changed, even though the set of circumstances in which it is to be 31509 returned remains the same.  The new paragraph explicitly 31510 recognizes that a different network address might be used, while 31511 the previous description, misleadingly, treated this as a shift 31512 between two replicas even though only a single file system instance 31513 might be involved.  The updated description appears in 31514 Section 15.1.2.4.
31516 * Because of the need to accommodate the use of fs-specific grace 31517 periods, it was necessary to clarify some of the definitions of 31518 reclaim-related errors in Section 15 of RFC 5661 [66] so that the 31519 text applies properly to reclaims for all types of grace periods. 31520 The updated descriptions appear within Section 15.1.9.
31522 * Because of the need to provide the clarifications in errata report 31523 2006 [64] and to adapt these to properly explain the interaction 31524 of NFS4ERR_DELAY with the reply cache, a revised description of 31525 NFS4ERR_DELAY appears in Section 15.1.1.3.  This errata report, 31526 unlike many other RFC 5661 errata reports, is addressed in this 31527 document because of the extensive use of NFS4ERR_DELAY in 31528 connection with state migration and session migration.
31530 B.4.  Other Revisions Made to RFC 5661
31532 Besides the major reworking of Section 11 of RFC 5661 [66] and the 31533 associated revisions to existing operations and errors, there were a 31534 number of related changes that were necessary:
31536 * The summary in Section 1.7.3.3 of RFC 5661 [66] was revised to 31537 reflect the changes made to Section 11 above.  The updated summary 31538 appears as Section 1.8.3.3 above.
31540 * The discussion of server scope in Section 2.10.4 of RFC 5661 [66] 31541 was replaced since it appeared to require a level of inter-server 31542 coordination incompatible with its basic function of avoiding the 31543 need for a globally uniform means of assigning server_owner 31544 values.  A revised treatment appears in Section 2.10.4.
31546 * The discussion of trunking in Section 2.10.5 of RFC 5661 [66] was 31547 revised to more clearly explain the multiple types of trunking 31548 support and how the client can be made aware of the existing 31549 trunking configuration.  In addition, while the last paragraph 31550 (exclusive of subsections) of that section, dealing with 31551 server_owner changes, was literally true, it had been a source of 31552 confusion.  Since the original paragraph could be read as 31553 suggesting that such changes be handled nondisruptively, the issue 31554 was clarified in the revised Section 2.10.5.  A sketch of the trunking determination that section describes appears below.
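The sketch is illustrative only: the dataclass is a simplified stand-in for the trunking-relevant fields of an EXCHANGE_ID result, and the function captures the comparisons a client might make when classifying two connections, with any resulting claim still subject to the verification that Section 2.10.5 calls for.

   # Illustrative sketch only; ExchangeIdInfo is a simplified
   # stand-in for the trunking-relevant parts of EXCHANGE_ID4resok
   # (eir_server_scope and the eir_server_owner fields).
   from dataclasses import dataclass

   @dataclass
   class ExchangeIdInfo:
       eir_server_scope: bytes
       so_major_id: bytes      # from eir_server_owner
       so_minor_id: int        # from eir_server_owner

   def trunking_relationship(a: ExchangeIdInfo, b: ExchangeIdInfo) -> str:
       # Two addresses belong to the same server (making client ID
       # trunking possible) when both the server scope and the server
       # owner's major id match; session trunking additionally
       # requires a matching minor id.
       if (a.eir_server_scope != b.eir_server_scope
               or a.so_major_id != b.so_major_id):
           return "distinct servers"
       if a.so_minor_id == b.so_minor_id:
           return "session trunking possible"
       return "client ID trunking only"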
31556 Appendix C.  Security Issues That Need to Be Addressed
31558 The following issues in the treatment of security within the NFSv4.1 31559 specification need to be addressed:
31561 * The Security Considerations Section of RFC 5661 [66] was not 31562 written in accordance with RFC 3552 (BCP 72) [72].  Of particular 31563 concern was the fact that the section did not contain a threat 31564 analysis.
31566 * Initial analysis of the existing security issues with NFSv4.1 31567 suggests that a revised Security Considerations section for 31568 the existing protocol (one containing a threat analysis) would be 31569 likely to conclude that NFSv4.1 does not meet the goal of secure 31570 use on the Internet.
31572 The Security Considerations section of this document (Section 21) has 31573 not been thoroughly revised to correct the difficulties mentioned 31574 above.  Instead, it has been modified to take proper account of 31575 issues related to the multi-server namespace features discussed in 31576 Section 11, leaving the incomplete discussion and security weaknesses 31577 largely as they were.
31579 The following major security issues need to be addressed in a 31580 satisfactory fashion before an updated Security Considerations 31581 section can be published as part of a bis document for NFSv4.1:
31583 * The continued use of AUTH_SYS and the security exposures it 31584 creates need to be addressed.  Addressing this issue must not be 31585 limited to the questions of whether the designation of this mechanism as 31586 OPTIONAL was justified and whether it should be changed.
31588 In any event, it may not be possible at this point to correct the 31589 security problems created by continued use of AUTH_SYS simply by 31590 revising this designation.
31592 * The lack of attention within the protocol to the possibility of 31593 pervasive monitoring attacks, such as those described in RFC 7258 31594 [71] (also BCP 188), needs to be addressed.
31596 In that connection, the use of CREATE_SESSION without privacy 31597 protection needs to be addressed as it exposes the session ID to 31598 view by an attacker.  This is worrisome because the session ID is precisely the 31599 type of protocol artifact alluded to in RFC 7258, since its exposure 31600 enables further mischief on the part of the attacker: denial- 31601 of-service attacks that can be executed effectively with only a 31602 single, normally low-value, credential, even when RPCSEC_GSS 31603 authentication is in use.
31605 * The lack of effective use of privacy and integrity, even where the 31606 infrastructure to support use of RPCSEC_GSS is present, needs to 31607 be addressed.
31609 In light of the security exposures that this situation creates, it 31610 is not enough to define a protocol that could address this problem 31611 given the provision of sufficient resources.  Instead, what is 31612 needed is a way to provide the necessary security with very 31613 limited performance costs and without requiring security 31614 infrastructure, which experience has shown is difficult for many 31615 clients and servers to provide.
31617 In trying to provide a major security upgrade for a deployed protocol 31618 such as NFSv4.1, the working group and the Internet community are 31619 likely to find themselves dealing with a number of considerations 31620 such as the following:
31622 * The need to accommodate existing deployments of protocols 31623 specified previously in existing Proposed Standards.
31625 * The difficulty of effecting changes to existing, interoperating 31626 implementations.
31628 * The difficulty of making changes to NFSv4 protocols other than 31629 those in the form of OPTIONAL extensions.
31631 * The tendency of those responsible for existing NFSv4 deployments 31632 to ignore security flaws in the context of local area networks 31633 under the mistaken impression that network isolation provides, in 31634 and of itself, isolation from all potential attackers.
31636 Given that the above-mentioned difficulties apply to minor version 31637 zero as well, it may make sense to deal with these security issues in 31638 a common document that applies to all NFSv4 minor versions.  If that 31639 approach is taken, the Security Considerations section of an eventual 31640 NFSv4.1 bis document would reference that common document, and the 31641 defining RFCs for other minor versions might do so as well.
31643 Acknowledgments
31645 Acknowledgments for This Update
31647 The authors wish to acknowledge the important role of Andy Adamson of 31648 NetApp in clarifying the need for trunking discovery functionality 31649 and in exploring the role of the file system location attributes in 31650 providing the necessary support.
31652 The authors wish to thank Tom Haynes of Hammerspace for drawing our 31653 attention to the fact that internationalization and security might 31654 best be handled in documents dealing with such protocol issues as 31655 they apply to all NFSv4 minor versions.
31657 The authors also wish to acknowledge the work of Xuan Qi of Oracle 31658 with NFSv4.1 client and server prototypes of Transparent State 31659 Migration functionality.
31661 The authors wish to thank others who brought attention to important 31662 issues.  The comments of Trond Myklebust of Primary Data related to 31663 trunking helped to clarify the role of DNS in trunking discovery. 31664 Rick Macklem's comments brought attention to problems in the handling 31665 of the per-fs version of RECLAIM_COMPLETE.
31667 The authors wish to thank Olga Kornievskaia of NetApp for her helpful 31668 review comments.
31670 Acknowledgments for RFC 5661
31672 The initial text for the SECINFO extensions was edited by Mike 31673 Eisler with contributions from Peng Dai, Sergey Klyushin, and Carl 31674 Burnett.
31676 The initial text for the SESSIONS extensions was edited by Tom 31677 Talpey, Spencer Shepler, and Jon Bauman, with contributions from Charles 31678 Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak, 31679 Trond Myklebust, Dave Noveck, John Scott, Mike Stolarchuk, and Mark 31680 Wittle.
31682 Initial text relating to multi-server namespace features, including 31683 the concept of referrals, was contributed by Dave Noveck, Carl 31684 Burnett, and Charles Fan, with contributions from Ted Anderson, Neil 31685 Brown, and Jon Haswell.
31687 The initial text for the Directory Delegations support was 31688 contributed by Saadia Khan with input from Dave Noveck, Mike Eisler, 31689 Carl Burnett, Ted Anderson, and Tom Talpey.
31691 The initial text for the ACL explanations was contributed by Sam 31692 Falkner and Lisa Week.
31694 The pNFS work was inspired by the NASD and OSD work done by Garth 31695 Gibson.  Gary Grider has also been a champion of high-performance 31696 parallel I/O.  Garth Gibson and Peter Corbett started the pNFS effort 31697 with a problem statement document for the IETF that formed the basis 31698 for the pNFS work in NFSv4.1.
31700 The initial text for the parallel NFS support was edited by Brent 31701 Welch and Garth Goodson.
Additional authors for those documents were 31702 Benny Halevy, David Black, and Andy Adamson. Additional input came 31703 from the informal group that contributed to the construction of the 31704 initial pNFS drafts; specific acknowledgment goes to Gary Grider, 31705 Peter Corbett, Dave Noveck, Peter Honeyman, and Stephen Fridella. 31707 Fredric Isaman found several errors in draft versions of the ONC RPC 31708 XDR description of the NFSv4.1 protocol. 31710 Audrey Van Belleghem provided, in numerous ways, essential 31711 coordination and management of the process of editing the 31712 specification documents. 31714 Richard Jernigan gave feedback on the file layout's striping pattern 31715 design. 31717 Several formal inspection teams were formed to review various areas 31718 of the protocol. All the inspections found significant errors and 31719 room for improvement. NFSv4.1's inspection teams were: 31721 * ACLs, with the following inspectors: Sam Falkner, Bruce Fields, 31722 Rahul Iyer, Saadia Khan, Dave Noveck, Lisa Week, Mario Wurzl, and 31723 Alan Yoder. 31725 * Sessions, with the following inspectors: William Brown, Tom 31726 Doeppner, Robert Gordon, Benny Halevy, Fredric Isaman, Rick 31727 Macklem, Trond Myklebust, Dave Noveck, Karen Rochford, John Scott, 31728 and Peter Shah. 31730 * Initial pNFS inspection, with the following inspectors: Andy 31731 Adamson, David Black, Mike Eisler, Marc Eshel, Sam Falkner, Garth 31732 Goodson, Benny Halevy, Rahul Iyer, Trond Myklebust, Spencer 31733 Shepler, and Lisa Week. 31735 * Global namespace, with the following inspectors: Mike Eisler, Dan 31736 Ellard, Craig Everhart, Fredric Isaman, Trond Myklebust, Dave 31737 Noveck, Theresa Raj, Spencer Shepler, Renu Tewari, and Robert 31738 Thurlow. 31740 * NFSv4.1 file layout type, with the following inspectors: Andy 31741 Adamson, Marc Eshel, Sam Falkner, Garth Goodson, Rahul Iyer, Trond 31742 Myklebust, and Lisa Week. 31744 * NFSv4.1 locking and directory delegations, with the following 31745 inspectors: Mike Eisler, Pranoop Erasani, Robert Gordon, Saadia 31746 Khan, Eric Kustarz, Dave Noveck, Spencer Shepler, and Amy Weaver. 31748 * EXCHANGE_ID and DESTROY_CLIENTID, with the following inspectors: 31749 Mike Eisler, Pranoop Erasani, Robert Gordon, Benny Halevy, Fredric 31750 Isaman, Saadia Khan, Ricardo Labiaga, Rick Macklem, Trond 31751 Myklebust, Spencer Shepler, and Brent Welch. 31753 * Final pNFS inspection, with the following inspectors: Andy 31754 Adamson, Mike Eisler, Mark Eshel, Sam Falkner, Jason Glasgow, 31755 Garth Goodson, Robert Gordon, Benny Halevy, Dean Hildebrand, Rahul 31756 Iyer, Suchit Kaura, Trond Myklebust, Anatoly Pinchuk, Spencer 31757 Shepler, Renu Tewari, Lisa Week, and Brent Welch. 31759 A review team worked together to generate the tables of assignments 31760 of error sets to operations and make sure that each such assignment 31761 had two or more people validating it. Participating in the process 31762 were Andy Adamson, Mike Eisler, Sam Falkner, Garth Goodson, Robert 31763 Gordon, Trond Myklebust, Dave Noveck, Spencer Shepler, Tom Talpey, 31764 Amy Weaver, and Lisa Week. 31766 Jari Arkko, David Black, Scott Bradner, Lisa Dusseault, Lars Eggert, 31767 Chris Newman, and Tim Polk provided valuable review and guidance. 31769 Olga Kornievskaia found several errors in the SSV specification. 31771 Ricardo Labiaga found several places where the use of RPCSEC_GSS was 31772 underspecified. 
31774 Those who provided miscellaneous comments include: Andy Adamson, 31775 Sunil Bhargo, Alex Burlyga, Pranoop Erasani, Bruce Fields, Vadim 31776 Finkelstein, Jason Goldschmidt, Vijay K. Gurbani, Sergey Klyushin, 31777 Ricardo Labiaga, James Lentini, Anshul Madan, Daniel Muntz, Daniel 31778 Picken, Archana Ramani, Jim Rees, Mahesh Siddheshwar, Tom Talpey, and 31779 Peter Varga. 31781 Authors' Addresses 31783 David Noveck (editor) 31784 NetApp 31785 1601 Trapelo Road, Suite 16 31786 Waltham, MA 02451 31787 United States of America 31789 Phone: +1-781-768-5347 31790 Email: dnoveck@netapp.com 31792 Charles Lever 31793 Oracle Corporation 31794 1015 Granger Avenue 31795 Ann Arbor, MI 48104 31796 United States of America 31798 Phone: +1-248-614-5091 31799 Email: chuck.lever@oracle.com